Abstract
The demand for effective, efficient and safe methods for animal identification has increased significantly, due to the need for traceability, management, and control of an animal population that grows at higher rates than the human population, particularly pets. Motivated by the efficacy of modern human identification methods based on facial biometric features, in this paper we propose a dog face recognition method based on vision transformers, a deep learning approach that decomposes the input image into a sequence of patches and applies self-attention to capture the spatial relationships between them. Results obtained on DogFaceNet, a public database of dog face images, show that the proposed method, which uses the EfficientFormer-L1 architecture, outperforms the state-of-the-art method previously proposed in the literature, based on ResNet, a deep convolutional neural network.
1 Introduction
According to Pet Brazil Institute [9], in 2018, there were 54.2 million dogs, 39.8 million birds, 23.9 million cats, 19.1 million fish, and 2.3 million reptiles and small mammals kept as pets in the homes of Brazilian families. Compared to the 2013 census data, the cat and dog populations in households have grown at 8.1% and 5%, respectively, with an overall average growth of 5.2%. In addition, data from the World Health Organization (WHO), published by São Paulo University (USP) [14], indicate that in Brazil there are approximately 30 million lost or abandoned animals, with 10 million cats and 20 million dogs.
The uncontrolled proliferation and inappropriate management of animals in society can have a very negative impact on public health, as animals can transmit many diseases and parasites. Therefore, it is of paramount importance to carry out monitoring, control and traceability of animals, through complete medical records and effective identification mechanisms, to enable the monitoring and evaluation of the animal population in the country.
Some countries have already made pet identification registration mandatory; in South Korea, for example, all dogs must be monitored through RFID (Radio Frequency Identification) devices, tags on collars, or other devices [10].
As occurs in human identification, animal identification is also susceptible to fraud. In the UK, for example, pet insurance fraud cases have been increasing by 400% per year, reaching 2 million annually in claims, according to Youtalk-Insurance [18]. Therefore, in addition to issues of effectiveness and efficiency, animal identification methods also need to be robust enough to discourage, stop or even prevent fraud.
According to Kumar et al. [11], animal identification approaches can be classified into invasive or non-invasive. Examples of invasive approaches are ear tags, ear tattoos and microchip implantation in the animal’s body. Examples of non-invasive approaches are RFID (Radio Frequency Identification) devices, collars with GPS (Global Positioning System), Internet of Things (IoT) devices, Bluetooth devices, and biometric features (e.g. face, iris, retina and muzzles). Among these approaches, animal identification based on biometric features has proved to be the best choice, since it is effective, efficient and does not require special devices, which would introduce more costs to the identification process, being also important to discourage or prevent fraud.
Among several human biometric traits, the face is one of the most popular due to its high coverage, high acceptability, easiness of capture at a distance, convenience of use and lower costs since digital cameras are cheap and popular sensors. Motivated by the success of modern human face recognition methods, in this paper, we propose a dog face recognition method based on vision transformers, a deep learning approach that decomposes the input image into a sequence of patches and applies self-attention to these patches to capture spatial relationships between them.
2 Biometric Dog Identification
Biometric dog identification approaches have been reported in the literature by using snout and facial features, among others [10, 12, 17, 21]. This section presents some approaches proposed so far, their identification rates, as well as their advantages and disadvantages.
2.1 Identification Using Snout Features
Snout is the region of an animal’s face consisting of its nose, mouth, and jaw. In many animals, this structure is called muzzle. The muzzle information can be used as a biometric feature for dog identification, as reported by Jang et al. [10].
Figure 1 shows, on the left, a device used to capture images of dog snouts and, on the right, a captured image, with the region of interest for biometric identification highlighted.
Device used to collect images of dog snouts (on the left) and the region of interest of a collected image, used for feature extraction (on the right) [10].
Jang et al. [10] proposed the use of several feature extraction methods, such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features), BRISK (Binary Robust Invariant Scalable Keypoints) and ORB (Oriented FAST and Rotated BRIEF). The method that presented the best performance was ORB, with an EER (Equal Error Rate) of 0.35%, when evaluated with 55 dog muzzle pattern images acquired from 11 dogs and 990 images augmented by image deformation (i.e., angle, illumination, noise, affine transform). However, despite this low error rate, the technique proposed by these authors has some disadvantages:
- Even though it is not an invasive technique, it is necessary to immobilize the animal to capture the image and extract features from the snout;
- It is important to avoid light reflection and blur in muzzle images;
- It is necessary to have good hygiene of the muzzle.
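ORB produces binary descriptors that are compared via Hamming distance. The sketch below is a toy NumPy illustration of that matching step with Lowe's ratio test, not the authors' actual pipeline; the descriptors in the example are synthetic.

```python
import numpy as np

def hamming_match(desc_a, desc_b, ratio=0.75):
    """Match two sets of ORB-style binary descriptors (uint8 rows)
    using Hamming distance and Lowe's ratio test."""
    # XOR then popcount gives the Hamming distance between descriptor rows.
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(xor, axis=-1).sum(axis=-1)
    matches = []
    for i, row in enumerate(dist):
        order = np.argsort(row)
        best, second = row[order[0]], row[order[1]]
        # Keep a match only if it is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, int(order[0]), int(best)))
    return matches
```

In a muzzle-identification setting, a high number of low-distance matches between a probe image and an enrolled image would indicate the same animal.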
Figure 2 shows examples of dog muzzle images that were discarded in the work by Jang et al. [10] due to light reflection and blurring caused by dog movements.
Dog muzzle images with light reflection (1–2) and blurring (3–4) [10].
2.2 Identification Using Facial Features
Recent works on dog biometric identification based on facial features have used the DogFaceNet database, proposed by Mougeot, Li and Jia [17]. This database originally contained 3,148 images of 485 dogs, with at least two images per individual. However, the publicly available version, developed after the published work, contains 8,363 images of 1,393 dogs.
Figure 3 shows four images of four dogs captured from different viewpoints that compose the DogFaceNet dataset.
Samples of four dog images from the DogFaceNet dataset [17].
The dog face images were all aligned using the animals' eyes as a reference. Figure 4 shows the process of detecting the fiducial points (eyes and snout) on the face of a dog (on the left), used for the alignment of the images, and the result of the alignment and cropping processes applied to six images of one dog from the DogFaceNet dataset (on the right). The dog face images of the DogFaceNet database were divided into training and test sets, with 90% and 10% of the images, respectively.
Detection of the animal’s eyes and snout (left) and the results obtained after the aligning and cropping processes, applied to six images of one dog from the DogFaceNet dataset [17].
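The eye-based alignment described above can be sketched as a similarity transform estimated from the two detected eye centres. The canonical eye positions below (`eye_y`, `eye_gap`) are illustrative assumptions, not the values used by DogFaceNet.

```python
import numpy as np

def eye_align_transform(left_eye, right_eye, out_size=224, eye_y=0.35, eye_gap=0.3):
    """Similarity transform (2x3 matrix) mapping the detected eye centres
    onto fixed canonical positions in an out_size x out_size crop."""
    (x1, y1), (x2, y2) = left_eye, right_eye
    u1, v1 = out_size * (0.5 - eye_gap / 2), out_size * eye_y   # target left eye
    u2, v2 = out_size * (0.5 + eye_gap / 2), out_size * eye_y   # target right eye
    # Solve [[a, -b, tx], [b, a, ty]] from the two point correspondences:
    # a*x - b*y + tx = u  and  b*x + a*y + ty = v  for each eye.
    A = np.array([[x1, -y1, 1, 0],
                  [y1,  x1, 0, 1],
                  [x2, -y2, 1, 0],
                  [y2,  x2, 0, 1]], dtype=np.float64)
    a, b, tx, ty = np.linalg.solve(A, np.array([u1, v1, u2, v2], dtype=np.float64))
    return np.array([[a, -b, tx], [b, a, ty]])
```

Applying the resulting matrix to the image (e.g. with a warp routine) rotates, scales and translates the face so that both eyes land on the same canonical row, which is what makes the cropped faces comparable.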
Dog identification through facial features is a challenging task, given the high interclass similarity and the high intraclass variability that can occur. Figure 5 shows, on the left, six pairs of images captured from distinct pairs of individuals that present faces with high similarity (false positive pairs of canine image faces according to the method proposed by Mougeot, Li and Jia [17]), and, on the right, six pairs of images, each pair from the same individual, that present high variability (false negative pairs of canine image faces according to the method proposed by Mougeot, Li and Jia [17]).
Examples of six false positive pairs of canine faces (left) and six false negative pairs of canine faces (right), according to the method proposed by Mougeot, Li and Jia [17].
Mougeot, Li and Jia [17] proposed a deep VGG-like and a ResNet-like Convolutional Neural Network (CNN) for the dog face verification task, on an open set of DogFaceNet test data containing 48 dogs, and concluded that the ResNet-like CNN performed better, with 92% accuracy. For the identification task, the best performance was also obtained by the ResNet-like CNN, with 60.44% rank-1 and 92% rank-5 accuracy.
Yoon, So and Rhee [21], building on the studies of Mougeot, Li and Jia [17], proposed a methodology that uses vector spaces to improve the performance of the dog identification process. In the dog facial identification scenario, using the triplet loss, the authors removed the L2 norm so that the feature vectors extracted from the dog images are compared in a new space, within a sphere of radius 1. With this, they obtained a mean accuracy of 88.8% with ArcFace+VL in the verification task.
3 Background
In this section, we describe the fundamental concepts used in this work, such as Convolutional Neural Network (CNN), ResNet architecture, Vision Transformers (ViT), EfficientFormer-L1 architecture and ArcFace function.
3.1 Convolutional Neural Networks and the ResNet Architecture
Convolutional Neural Networks (CNN) have been widely used in solutions for Computer Vision applications, in problems such as object detection, image recognition, and Biometrics [1, 2]. Their architectures are inspired by the structure of the human visual cortex. According to Chollet [1], there are some high-performance CNN architectures, such as AlexNet, VGG, GoogleNet and ResNet.
Convolutional Neural Networks are organized into layers, such as the convolutional layers, the pooling layers and the fully connected layer. In Fig. 6 there is an example of the architecture of a CNN.
Simplified architecture of a Convolutional Neural Network [13].
In CNNs, activation functions are applied to the outputs of the convolutional layers to introduce nonlinearity; examples include the ReLU (Rectified Linear Unit) function, the sigmoid function and the hyperbolic tangent function.
Thus, according to LeCun et al. [6], the convolutional layers have the objective of extracting features through the convolution operation. Furthermore, the weights of these layers are adjusted by the backpropagation algorithm during training.
Also according to LeCun et al. [6], the pooling layer aims to reduce dimensionality by combining neighbouring pixels into a single value. Consequently, training time and the use of computational resources are reduced.
The fully connected layer aims to perform image classification. For example, in a neural network for classifying dogs, the result of the fully connected layer expresses which dog is associated with the input image.
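The dimensionality reduction performed by a pooling layer can be sketched as follows: max pooling over non-overlapping windows keeps only the strongest response in each window. This is a minimal NumPy sketch for a single-channel input.

```python
import numpy as np

def max_pool2d(x, k=2):
    """k x k max pooling: keep the largest value in each non-overlapping
    k x k window, shrinking both spatial dimensions by a factor of k."""
    H, W = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))
```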
In addition to conventional CNN architectures, He et al. [8] proposed ResNet, whose main characteristic, as its name (Residual Networks) indicates, is the use of shortcut connections: residual information is passed from earlier layers to later ones, which improves the training process, for instance by mitigating vanishing gradients and saving computational resources. Fig. 7 shows an example of shortcut connections routed toward the output layer of the neural network in order to optimize its training.
ResNet features [7].
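The shortcut connection can be sketched with a minimal fully-connected residual block (not the convolutional bottleneck blocks of ResNet-50): the input is added back to the transformed signal, so the layers only need to learn a residual on top of the identity mapping.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Minimal fully-connected residual block: the identity shortcut adds
    the input x back to the transformed signal before the final ReLU."""
    out = relu(x @ w1)        # first weight layer + activation
    out = out @ w2            # second weight layer
    return relu(out + x)      # identity shortcut connection, then activation
```

Note that with zero weights the block degrades gracefully to (the ReLU of) the identity, which is precisely why stacking many such blocks does not hurt trainability.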
According to Li et al. [15], several improvements were made to the ResNet architecture to increase its performance. Among them, Zhang et al. [22] proposed multi-level shortcut connections and Targ, Almeida and Lyman [19] proposed using more convolution layers and data flow between layers, among others. Also, different benchmark architectures of ResNet neural networks were proposed, however, one of the most used and effective for many Computer Vision tasks is the ResNet-50 (with 50 layers).
3.2 Vision Transformers and the EfficientFormer Architecture
Transformers were proposed by Vaswani et al. [20] to solve a problem involving natural language processing (NLP) and have become state-of-the-art for many types of applications.
For image processing applications, Dosovitskiy et al. [5] proposed the Vision Transformer (ViT) models, in which the input image is divided into patches of fixed size, transforming it into an input sequence of image patches. The patches are then flattened into feature vectors, projected by an initial linear layer, and position embeddings are added before the sequence is fed to a standard transformer encoder. In Fig. 8 there is an example of the architecture of a Vision Transformer and its operations.
According to [5], in the benchmarks performed, the Vision Transformer architecture outperformed state-of-the-art image recognition architectures based on Convolutional Neural Networks. For example, on the ImageNet database, the Vision Transformer achieved an accuracy of 88.5%, while the CNN-based competitors stayed below 88%.
Vision Transformer architecture proposed by Dosovitskiy [5].
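The patch-embedding step described above can be sketched in NumPy. The 16 x 16 patch size follows [5]; the 192-dimensional projection is a toy assumption for illustration (ViT-Base uses 768), and the random matrices stand in for learnable parameters.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    patches = img[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * C)
    return patches

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
seq = patchify(img)                        # 196 patches, each flattened to dim 768
W_embed = rng.random((seq.shape[1], 192))  # stand-in for the learnable projection
pos = rng.random((seq.shape[0], 192))      # stand-in for the position embeddings
tokens = seq @ W_embed + pos               # input sequence for the transformer encoder
```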
Even with the great progress brought by ViTs to the Computer Vision area, transformer-based models are still very heavy and present high latency, which hinders both training and the use of this type of model in less capable environments. EfficientFormer networks arise to mitigate this problem through a pure transformer with consistent data dimensionality, capable of running in more diverse environments. Figure 9 shows an overview of the EfficientFormer proposed by Li et al. [16]. The network starts with a convolution stem as patch embedding, followed by MetaBlocks (MB). The \(MB^{4D}\) and \(MB^{3D}\) blocks contain different token mixer configurations, that is, local pooling or global multi-head self-attention, arranged in a dimension-consistent manner.
Overview of the EfficientFormer [16].
3.3 ArcFace
In the ArcFace [4] error function, the learning objective becomes jointly minimizing the intra-class angular distance while maximizing the inter-class angular distance, through Eq. 1.
In this case, the class samples are positioned on a hypersphere of radius s, and the objective of the error function becomes minimizing the angular distance between elements of the same class on this hypersphere, in such a way that the features extracted in the last layer before the softmax represent a sample in this space, and the elements similar to this sample end up positioned within an angular margin m of it.
The error function can be extended to perform an analysis on subspaces by adding a sub-center parameter during feature extraction. This parameter consists of a weight matrix of dimensions \(C \times k\), where C is the number of classes and k is the number of subspaces; the features are mapped to the k subspaces, and the ArcFace error function is then applied to each of them [3].
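The additive-margin logic of Eq. 1 can be sketched as follows. The defaults \(s = 30\) and \(m = 0.5\), and the toy dimensions, are illustrative assumptions, and the sub-center extension is omitted.

```python
import numpy as np

def arcface_logits(embeddings, class_weights, labels, s=30.0, m=0.5):
    """ArcFace additive angular margin: add margin m to the angle between a
    sample and its true class centre, then scale all cosines by s before the
    softmax, forcing same-class features to cluster tightly in angle."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = np.clip(e @ w, -1.0, 1.0)          # cosine to every class centre
    rows = np.arange(len(labels))
    theta = np.arccos(cos[rows, labels])     # angle to the true class centre
    logits = cos.copy()
    logits[rows, labels] = np.cos(np.minimum(theta + m, np.pi))  # penalised target
    return s * logits
```

Because the margin only shrinks the target-class logit, the softmax must push same-class embeddings closer in angle to recover the lost score, which is what produces the compact class clusters used at comparison time.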
4 Proposed Method
In order to improve the state-of-the-art results in the task of dog facial recognition, in this work we propose a method that uses EfficientFormer-L1, a Vision Transformer architecture proposed by Li et al. [16].
In the proposed method, the EfficientFormer-L1 is coupled with the ArcFace error function [4], a popular loss function in biometrics which aims to minimize the intra-class angular distance while maximizing the inter-class angular distance.
In addition, an SVM classifier is used on the outputs of both architectures, since biometric comparisons are computed using distances, such as the cosine distance.
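As a sketch of the distance-based comparison that motivates this choice, the snippet below computes the cosine distance between two embeddings and makes a verification decision; the threshold value 0.4 is hypothetical, not a tuned parameter from this work.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors (0 = same direction)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

def verify(emb_a, emb_b, threshold=0.4):
    """Decide 'same dog' if the embeddings are close enough in angle
    (the threshold here is an illustrative assumption)."""
    return cosine_distance(emb_a, emb_b) < threshold
```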
Figure 10 shows the architecture of the proposed method, built upon the EfficientFormer-L1 architecture.
5 Results and Discussion
In order to assess the proposed method for dog face recognition, using vision transformers, more specifically the EfficientFormer-L1 architecture, and ArcFace, two experiments were carried out on the DogFaceNet dataset. As the subset of data used by Mougeot, Li and Jia [17] is different from the one publicly shared, a protocol inspired by theirs, following standard practices for open-set biometric authentication, was used. The results obtained were compared with a baseline method based on the ResNet-50 architecture [8], proposed by Mougeot, Li and Jia [17].
It is important to point out that the experiments are not identical to those carried out by Mougeot, Li and Jia [17], as the publicly disclosed database differs from the original one.
In addition, Table 1 details the servers that were used to run the experiments: the ResNet architecture was run on server 1 and the EfficientFormer-L1 architecture on server 2.
5.1 Experiment - Verification Task
In the verification task, the two architectures, EfficientFormer-L1 and ResNet-50, were employed in a standard facial biometrics evaluation protocol. This protocol uses all the comparisons within the dataset to obtain the genuine and impostor score distributions and then construct the Receiver Operating Characteristic (ROC) curve. Furthermore, to ensure a balanced evaluation, we selected all positive comparisons (a total of 2561 pairs of dogs) and an equal number of negative comparisons. From these pairs, we computed additional evaluation metrics, including Accuracy, Precision, Recall, and F1-Score.
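The EER reported from such score distributions can be computed by sweeping a decision threshold until the false acceptance and false rejection rates meet; a minimal NumPy sketch (not the authors' evaluation code):

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate from genuine and impostor similarity scores:
    sweep the threshold over all observed scores and return the point
    where FAR (impostors accepted) meets FRR (genuines rejected)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

Perfectly separated distributions, as approached by the EfficientFormer-L1 features, yield an EER of zero; overlap between the distributions raises it.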
Initially, an experiment was carried out for the verification task: the EfficientFormer-L1 architecture was used for feature extraction, combined with the ArcFace error function with 3 sub-centers to perform the learning. The hyperparameters of the ArcFace error function were set following the Fixed AdaCos method proposed by [23], in which the margin parameter is fixed at \(m = 0.5\) and the scale parameter is obtained by \(s = \sqrt{2}\cdot \log (C-1)\), where C is the number of classes.
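The fixed scale formula above is simple to compute; the 1001-class figure in the comment is only an example, not the number of training identities used in this work.

```python
import math

def fixed_adacos_scale(num_classes):
    """Fixed AdaCos scale parameter [23]: s = sqrt(2) * ln(C - 1),
    so the scale grows slowly with the number of classes C."""
    return math.sqrt(2) * math.log(num_classes - 1)

# e.g., with 1001 classes the scale is roughly 9.77
```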
Figure 11 shows the genuine and impostor comparison score distributions obtained using ResNet-50 and EfficientFormer-L1 features. One can observe that the EfficientFormer-L1 features generated more separated distributions than the ResNet-50 features, leading to a lower intersection between the distributions and, consequently, lower error rates. EfficientFormer-L1 obtained a better AUC value, equal to 0.989319, while ResNet-50 obtained an AUC value equal to 0.952105, as shown in Fig. 12.
Table 2 presents the values of Area Under the ROC Curve (AUC), Equal Error Rate (EER), Accuracy, F1-Score, Precision and Recall, obtained in the verification task, on the DogFaceNet dataset, using face features extracted from the EfficientFormer-L1 and ResNet-50. One can observe that EfficientFormer-L1 features performed better than the ResNet-50 features in all metrics.
5.2 Experiment - Identification Task
In the identification task experiment, the two architectures, EfficientFormer-L1 and ResNet-50, were used again. To evaluate their performance, we employed two distinct settings with two different types of data split.
In the first setting, we computed the Cumulative Matching Characteristic (CMC) curve by comparing the cosine distance between every element in the probe set and every element in the gallery set. This analysis provides insights into the model’s ability to rank the correct matches in the gallery.
The data was split using a k-fold cross-validation strategy with k set to 10, where 90% of the open-set data was used for gallery creation and the remaining 10% was reserved for evaluation. To further evaluate the models, we also trained an SVM under the same 10-fold cross-validation. Table 3 shows the results obtained.
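A CMC curve of this kind can be sketched as follows, assuming cosine similarity between L2-normalised embeddings; this is an illustrative implementation, not the paper's exact evaluation code.

```python
import numpy as np

def cmc_curve(probe_emb, probe_ids, gallery_emb, gallery_ids, max_rank=10):
    """Cumulative Matching Characteristic: for each rank k, the fraction of
    probes whose true identity appears among the k closest gallery entries."""
    p = probe_emb / np.linalg.norm(probe_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = p @ g.T                                       # cosine similarities
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for i, row in enumerate(sim):
        ranked = gallery_ids[np.argsort(-row)]          # best match first
        where = np.flatnonzero(ranked == probe_ids[i])  # rank of correct identity
        if where.size and where[0] < max_rank:
            hits[where[0]:] += 1                        # counts for rank k and above
    return hits / len(probe_emb)
```

By construction the curve is non-decreasing in the rank, and its first entry is the rank-1 accuracy reported in the tables.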
For the second split, a random sampling approach was employed, as proposed by Mougeot, Li and Jia [17], in which m samples from each class are randomly selected from the dataset and included in the gallery set, while the remaining samples are used for evaluation. For this data split approach, a k-NN classifier was used, with the k value chosen as also proposed by Mougeot, Li and Jia [17]. Figure 13 presents the rank-10 Cumulative Matching Characteristic (CMC) curve for each of the data separation protocols of the identification task. One can observe that the EfficientFormer-L1 features achieved clearly superior performance compared with the ResNet-50 features in the identification task. The best rank-1 accuracy value, \(91.4\%\), was obtained by EfficientFormer with \(m=1\), while the best rank-1 accuracy value obtained by ResNet-50 was only \(63.3\%\), also with \(m=1\).
Regarding the related works presented in Sect. 2.2, using the smaller initial version of DogFaceNet, Mougeot, Li and Jia [17] obtained \(92\%\) accuracy in the verification task and \(60.44\%\) rank-1 accuracy (best case) in the identification task, while Yoon, So and Rhee [21], using the same dataset that we have used, obtained a mean accuracy of \(88.8\%\) with ArcFace+VL in the verification task.
6 Conclusion
With the advancement of technology, especially in the scenario of smart cities and Internet of Things devices, biometric applications have become increasingly sophisticated. This makes it possible to explore the field of animal biometrics, in which there are still few applications and studies applied to dogs, according to Mougeot, Li and Jia [17]. For dogs specifically, biometric identification through faces has shown to be very promising, as it does not require specific hardware resources and can be performed with smartphone cameras. As for the recognition techniques, the application of transformer-based architectures, in particular the EfficientFormer-L1 used in this work, brings excellent results compared to the state of the art in image pattern recognition, represented by Convolutional Neural Networks, in this case ResNet-50. The application of biometric techniques to dogs remains a vast area to be explored, and its results benefit the population: it provides ways of monitoring these animals in cities, searching for lost animals, reducing abandonment, controlling diseases more effectively, and helping to prevent or detect fraud.
References
Chollet, F.: How convolutional neural networks see the world. The Keras Blog 30 (2016)
De Souza, G.B., da Silva Santos, D.F., Pires, R.G., Marana, A.N., Papa, J.P.: Deep texture features for robust face spoofing detection. IEEE Trans. Circuits Syst. II Express Briefs 64(12), 1397–1401 (2017)
Deng, J., Guo, J., Liu, T., Gong, M., Zafeiriou, S.: Sub-center ArcFace: boosting face recognition by large-scale noisy web faces. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 741–757. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_43
Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)
Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Forsyth, D.A., et al.: Object recognition with gradient-based learning. Shape, contour and grouping in computer vision, pp. 319–345 (1999)
GeeksforGeeks: Residual networks (resnet) - deep learning. https://www.geeksforgeeks.org/residual-networks-resnet-deep-learning/. Accessed 18 June 2022
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
Institute, P.B.: Pet census (2019). https://institutopetbrasil.com/imprensa/censo-pet-1393-milhoes-de-animais-de-estimacao-no-brasil. Accessed 18 June 2022
Jang, D.H., Kwon, K.S., Kim, J.K., Yang, K.Y., Kim, J.B.: Dog identification method based on muzzle pattern image. Appl. Sci. 10(24), 8994 (2020)
Kumar, S., Singh, S.K.: Visual animal biometrics: survey. IET. Biometrics 6(3), 139–156 (2017)
Lai, K., Tu, X., Yanushkevich, S.: Dog identification using soft biometrics and neural networks. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2019)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Lemos, S.: The number of adoptions and abandonment of animals in the pandemic (2021). https://jornal.usp.br/atualidades/cresce-o-numero-de-adocoes-e-de-abandono-de-animais-na-pandemia. Accessed 18 June 2022
Li, S., Jiao, J., Han, Y., Weissman, T.: Demystifying resnet. arXiv preprint arXiv:1611.01186 (2016)
Li, Y., Yuan, G., Wen, Y., Hu, J., Evangelidis, G., Tulyakov, S., Wang, Y., Ren, J.: Efficientformer: vision transformers at mobilenet speed. Adv. Neural. Inf. Process. Syst. 35, 12934–12949 (2022)
Mougeot, G., Li, D., Jia, S.: A deep learning approach for dog face verification and recognition. In: Nayak, A.C., Sharma, A. (eds.) PRICAI 2019. LNCS (LNAI), vol. 11672, pp. 418–430. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29894-4_34
Software, A.: Pet insurance fraud increases (2018). https://youtalk-insurance.com/broker-news/400-rise-in-pet-insurance-fraud-highlights-need-for-new-approach. Accessed 18 June 2022
Targ, S., Almeida, D., Lyman, K.: Resnet in resnet: Generalizing residual architectures (2016). arXiv preprint arXiv:1603.08029
Vaswani, A., et al.: Attention is all you need. Advances in neural information processing systems 30 (2017)
Yoon, B., So, H., Rhee, J.: A methodology for utilizing vector space to improve the performance of a dog face identification model. Appl. Sci. 11(5), 2074 (2021)
Zhang, K., Sun, M., Han, T.X., Yuan, X., Guo, L., Liu, T.: Residual networks of residual networks: multilevel residual networks. IEEE Trans. Circuits Syst. Video Technol. 28(6), 1303–1314 (2017)
Zhang, X., Zhao, R., Qiao, Y., Wang, X., Li, H.: Adacos: adaptively scaling cosine logits for effectively learning deep face representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10823–10832 (2019)
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Canto, V.H.B., Manesco, J.R.R., de Souza, G.B., Marana, A.N. (2023). Dog Face Recognition Using Vision Transformer. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14196. Springer, Cham. https://doi.org/10.1007/978-3-031-45389-2_3