title: Compact CNN Models for On-device Ocular-based User Recognition in Mobile Devices
authors: Almadan, Ali; Rattani, Ajita
date: 2021-10-11

Abstract: A number of studies have demonstrated the efficacy of deep learning convolutional neural network (CNN) models for ocular-based user recognition in mobile devices. However, these high-performing networks have enormous space and computational complexity due to the millions of parameters and computations involved. These requirements make the deployment of deep learning models to resource-constrained mobile devices challenging. To this end, only a handful of studies based on knowledge distillation and patch-based models have been proposed to obtain compact-size CNN models for ocular recognition in the mobile environment. In order to further advance the state of the art, this study for the first time evaluates five neural network pruning methods and compares them with the knowledge distillation method for on-device CNN inference and mobile user verification using ocular images. Subject-independent analysis on the VISOB and UFPR-Periocular datasets suggests the efficacy of layerwise magnitude-based pruning at a compression rate of 8 for mobile ocular-based authentication using ResNet-50 as the base model. Further, the comparison suggests the efficacy of knowledge distillation over pruning methods in terms of verification accuracy and real-time inference, measured as deep feature extraction time on five mobile devices, namely, iPhone 6, iPhone X, iPhone XR, iPad Air 2, and iPad 7th Generation.

Ocular biometrics consists of the regions in and around the eye, i.e., the iris, conjunctival vasculature, and periocular region. Ocular biometrics has obtained significant attention from the research community and industry alike due to its accuracy, security, better robustness against facial expressions, and ease of use in mobile devices [5], [11], [20]. The ocular region can also be scanned in the presence of facial masks, worn extensively during pandemics such as COVID-19. Further, the visible-light ocular biometric modality can be acquired using almost any imaging device, ranging from high-end DSLRs to ubiquitous mobile-device RGB cameras. Ocular biometrics in the RGB spectrum has been utilized for smartphone-based login authentication by both academia and industry, such as EyeVerify Inc. In fact, mobile ocular-based user authentication can operate in darker environments by using the device screen as a light source [21]. A number of methods based on hand-crafted textural descriptors such as histograms of oriented gradients (HOG), local binary patterns (LBP), local phase quantization (LPQ), and binarized statistical image features (BSIF) have been used for person authentication using ocular images in mobile devices [18]. With advances in deep learning, deeply coupled autoencoders and different convolutional neural network (CNN) architectures such as ResNet-50 and MobileNet have been fine-tuned for ocular-based user recognition (authentication) in mobile devices [14]-[17].
The challenges of unconstrained mobile environments may lead to substantial variations in the ocular samples due to factors such as lighting conditions, distance, motion blur, glasses, occlusion due to hair, front-facing imaging sensor and optic aberrations (including smudged lenses), and other imaging issues such as inaccurate white balance and exposure metering. These factors result in performance degradation in the mobile environment. Datasets such as VISOB [14], [19] and UFPR-Periocular [23] have been assembled for research and development in ocular-based user authentication in the mobile environment. Most of the earlier ocular recognition studies in the mobile environment used CNNs under closed-set subject-dependent evaluation protocols, where the subjects overlap between the training and test sets, which may overestimate performance and generalizability [17], [18]. Recently, however, researchers have been focusing on CNN-based methods for ocular recognition under the subject-independent protocol [14], [21], in which the subjects do not overlap between the training and test sets. Therefore, subject-independent analysis, although more challenging, is much better suited for developing and testing generalizable models than closed-set subject-dependent analysis. Further, the need to re-train the CNN model every time a new subject is enrolled (the scalability issue) is mitigated by the subject-independent protocol.

These high-performing deep CNN models have enormous space and computational complexity due to the millions of parameters and computations involved [3]. The number of parameters depends on the learnable layers, i.e., the convolution and fully connected layers, and the size of the model increases with the number of parameters in these layers. Operations such as convolution and matrix multiplication are built upon multiply-add (MAdd) operations, which reflect the parameters and computational cost of the model. In a convolution layer, let W × H be the input spatial resolution, ch_in the number of input feature channels, and K × K the convolutional kernel size used to generate ch_out feature channels; then the number of MAdd operations is given as:

$$\mathrm{MAdd}_{conv} = W \times H \times K \times K \times ch_{in} \times ch_{out} \qquad (1)$$

Further, in a fully connected layer with F_in input features and F_out output features, the number of MAdd operations is given as:

$$\mathrm{MAdd}_{fc} = F_{in} \times F_{out} \qquad (2)$$

These MAdd operations represent the parameters and computational cost of the model.
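The MAdd counts in Eqs. (1) and (2) can be tallied directly from a layer's definition. The snippet below is a minimal sketch, assuming standard PyTorch layer attributes; the function name and the example layer sizes are illustrative and not taken from the paper.

```python
import torch.nn as nn

def madd_count(layer, out_h, out_w):
    """Approximate multiply-add count for a single layer, following Eqs. (1) and (2)."""
    if isinstance(layer, nn.Conv2d):
        k_h, k_w = layer.kernel_size
        # W x H x K x K x ch_in x ch_out (groups handle depthwise convolutions)
        return out_h * out_w * k_h * k_w * (layer.in_channels // layer.groups) * layer.out_channels
    if isinstance(layer, nn.Linear):
        # F_in x F_out
        return layer.in_features * layer.out_features
    return 0

# Example: a 3x3 convolution producing a 112x112 map, and a 2048 -> 512 embedding layer
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
fc = nn.Linear(in_features=2048, out_features=512)
print(madd_count(conv, 112, 112), madd_count(fc, 1, 1))
```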
The high computational cost of large deep CNNs makes their deployment to the resource-constrained mobile environment challenging [3]. The size and computational cost of deep learning models should be low for real-time and frequent on-device biometric authentication on mobile devices, for instance, when unlocking the phone. This would provide an enhanced user experience thanks to a faster authentication mechanism, as high-cost models are slower to execute, and it would also make biometric authentication more battery-friendly. Methods based on network pruning [1], lower-bit quantization [4], knowledge distillation [6], squeezed convolutional networks, and separable convolutions [13], [22] have been introduced to reduce the size and computational cost of models. A handful of studies have proposed lightweight (compact-size) CNN models for ocular recognition in the mobile environment. These studies implemented knowledge distillation [6], an ensemble of patch-based ocular CNNs [20], and a customized version of MobileNet-V2 [21] to obtain lightweight models for ocular-based user authentication in smartphones. However, the state of the art in compact models for ocular-based mobile user authentication is still in its initial stages. In order to further advance the state of the art, the contributions of this work are as follows:
• Evaluation of five state-of-the-art neural network pruning methods [1] for obtaining compact CNN models for ocular-based user verification in mobile devices.
• Comparison with the knowledge distillation method in terms of verification accuracy.
• Evaluation of on-device real-time inference, measured as deep feature extraction time, on five mobile devices.
This paper is organized as follows: Section 2 discusses the prior work on compact models for ocular recognition in mobile devices. Section 3 discusses the pruning and knowledge distillation techniques used in this study. Section 4 elaborates on the CNN architectures considered. Datasets and the experimental protocol are discussed in Section 5. Results are discussed in Section 6. Conclusions are drawn in Section 7.

In this section, we discuss the existing studies on developing lightweight CNN models for ocular-based user recognition in smartphones. Boutros et al. [2] proposed a lightweight DenseNet-20 deep learning model with only 1.1M trainable parameters obtained via knowledge distillation. Experiments performed on the VISPI dataset showed that DenseNet-20 trained using knowledge distillation outperforms the same model trained without knowledge distillation, with an EER reduction from 8.36% to 4.56%. Reddy et al. [20] proposed OcularNet, a patch-based convolutional neural network (CNN) model that uses patches from the eye images for user recognition. For the OcularNet model, six registered overlapping patches were extracted from the ocular region, and a small CNN was trained for each patch to extract feature descriptors. The verification performance of OcularNet, which has 1.5M parameters, was compared to the popular ResNet-50 model, which has 23.4M parameters. When trained on the VISOB dataset, the OcularNet model obtained performance equivalent to ResNet-50 in the subject-independent verification setting. In another study [21], the authors proposed a customized version of the MobileNet-V2 architecture obtained by removing the last convolutional layers from the original implementation without affecting the accuracy, reducing the model size by 3.4× compared to the original MobileNet-V2 and 36× compared to the popular ResNet-50, for mobile ocular recognition in subject-independent analysis. Jung et al. [10] proposed transferring information from the face to the periocular modality by means of knowledge distillation (KD) for feature learning. However, the authors reported that directly applying typical KD techniques to heterogeneous modalities is sub-optimal.

Several approaches have been proposed for the compression of deep learning networks. In this study, we evaluate pruning and knowledge distillation techniques for obtaining lightweight (compact) models. Neural network pruning is the task of reducing the size of a network by systematically removing parameters from an existing network [1]. Typically, the initial network is large and accurate, and the goal is to produce a smaller network with similar accuracy. Most methods for obtaining a pruned model derive from Algorithm 1. The network is first trained to convergence. Afterward, each parameter or structural element in the network is issued a score, and the network is pruned based on these scores. Pruning reduces the accuracy of the network, so it is trained further (known as fine-tuning) to recover. The process of pruning and fine-tuning is often iterated several times, gradually reducing the network's size, as shown in Figure 1 and Algorithm 1. In Algorithm 1, f(X; W) represents the neural network model to be pruned, and the procedure requires N, the number of pruning iterations, and X, the dataset on which to train and fine-tune. Pruning methods differ based on whether they prune individual parameters (unstructured pruning) or consider parameters in groups (structured pruning), removing entire layers or channels to exploit hardware and software optimized for dense computation. Some pruning methods compare scores locally, pruning a fraction of the parameters with the lowest scores within each structural sub-component of the network (e.g., layers). Others consider scores globally, comparing scores to one another irrespective of the part of the network in which the parameter resides [1]. In this study, we evaluated five existing pruning methods [1], described as follows (a minimal sketch of the prune-and-fine-tune loop follows this list):
• Global Magnitude Pruning - prunes the weights with the lowest absolute value anywhere in the network.
• Layerwise Magnitude Pruning - for each layer, prunes the weights with the lowest absolute value.
• Global Gradient Magnitude Pruning - prunes the weights with the lowest absolute value of (weight × gradient), evaluated on a batch of inputs.
• Layerwise Gradient Magnitude Pruning - for each layer, prunes the weights with the lowest absolute value of (weight × gradient), evaluated on a batch of inputs.
• Random Pruning - prunes each weight independently with probability equal to the fraction of the network to be pruned.
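As an illustration of the loop in Algorithm 1 combined with the magnitude criteria above, the following is a minimal sketch using PyTorch's torch.nn.utils.prune module for unstructured magnitude pruning. The fine-tuning routine is a placeholder, and the torchvision ResNet-50 stands in for the fine-tuned base model; this is not the authors' exact implementation.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet50

def magnitude_prune(model, fraction, scope="layerwise"):
    """One pruning step: zero out the smallest-magnitude weights (unstructured)."""
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (nn.Conv2d, nn.Linear))]
    if scope == "global":
        # Compare weight magnitudes across the whole network (Global Magnitude).
        prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=fraction)
    else:
        # Compare weight magnitudes within each layer (Layerwise Magnitude).
        for module, name in params:
            prune.l1_unstructured(module, name=name, amount=fraction)

model = resnet50(pretrained=True)   # placeholder for the fine-tuned base model
for iteration in range(3):          # N pruning iterations (Algorithm 1)
    magnitude_prune(model, fraction=0.5, scope="layerwise")
    # fine_tune(model, train_loader)  # placeholder: retrain to recover accuracy
# Roughly (0.5)^3 of the weights remain, i.e., a compression rate of about 8.
```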
Knowledge Distillation (KD) [6] is a technique to improve the performance and generalization ability of smaller models by transferring the knowledge learned by a cumbersome model (the teacher) to a single small model (the student). The key idea is to guide the student model to learn the relationships between different classes discovered by the teacher model, which carry more information than the ground-truth labels alone. Specifically, given a labeled dataset {x_i, y_i}, i = 1, ..., n, where x_i is the data and y_i is the label, the knowledge from the teacher's prediction p^T_τ is distilled to the student's prediction p^S_τ by minimizing the loss function:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{ce} + \lambda\,\mathcal{L}_{KD} \qquad (3)$$

where $\mathcal{L}_{ce} = -\sum_i y_i \log p(y_i|x_i)$ is the cross-entropy loss with the student network prediction $p(y_i|x_i) = \mathrm{softmax}(f(x_i))$, and f(·) is the logit of the network. $\mathcal{L}_{KD}$ is the KD loss defined as:

$$\mathcal{L}_{KD} = \mathrm{KL}\big(p^{T}_{\tau}(y|x)\,\|\,p^{S}_{\tau}(y|x)\big) \qquad (4)$$

Here, $p^{S}_{\tau}(y|x) = \mathrm{softmax}(f_S(x)/\tau)$ and $p^{T}_{\tau}(y|x) = \mathrm{softmax}(f_T(x)/\tau)$ are the smoothed student's and teacher's predictions, respectively. The smoothness of the predictions is regulated by the temperature term τ, and KL denotes the Kullback-Leibler divergence. The hyperparameter λ decides the contribution of the KD loss. Figure 2 shows the generic architecture of knowledge distillation using the teacher-student model. We used the knowledge distillation method described above to train compact versions of the ResNet-50 model and other lightweight, mobile-friendly models for mobile ocular recognition.
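The objective in Eqs. (3) and (4) translates almost directly into PyTorch. The following is a minimal sketch assuming logits from a frozen teacher and a trainable student; the temperature and λ values are illustrative defaults rather than the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, lam=0.5):
    """Distillation loss of Eqs. (3)-(4): (1 - lam) * cross-entropy + lam * KL term."""
    # Hard-label cross-entropy on the student's predictions (L_ce)
    ce = F.cross_entropy(student_logits, labels)
    # Temperature-smoothed teacher and student distributions (p^T_tau, p^S_tau)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    # KL divergence between the teacher's and student's soft predictions (L_KD)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return (1.0 - lam) * ce + lam * kd

# Usage: teacher logits are detached so only the student receives gradients
# loss = kd_loss(student(images), teacher(images).detach(), labels)
```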
In this section, we describe the CNN architectures used as the base model for compression or as student models for knowledge distillation.
• ResNet-50: ResNet [8] is short for residual network and is based on the idea of the "identity shortcut connection", where input features may skip certain layers (Figure 3(a)). The residual or shortcut connections introduced in ResNet allow identity mappings to propagate around multiple nonlinear layers, preconditioning the optimization and alleviating the vanishing gradient problem. ResNet-50 is used as the heavyweight model pruned using the methods discussed in Section 3.1 and as the teacher for the knowledge distillation described in Section 3.2. Other variants of ResNet consisting of eight and twenty residual layers, ResNet-8 and ResNet-20, respectively, were used as student models in this study.
• MobileNet: MobileNet-V2 [22] is one of the most popular mobile-centric deep learning architectures, being small in size as well as computationally efficient. The main idea of MobileNet is that instead of using regular 3 × 3 convolution filters, the operation is split into depthwise separable 3 × 3 convolution filters followed by pointwise 1 × 1 convolutions (see the sketch following this list). While achieving the same filtering and combination process as a regular convolution, this architecture requires fewer operations and parameters (Figure 3(b)). MobileNet-V3-Small [9] is a variant of MobileNets designed to run on mobile phone CPUs, targeting resource-constrained devices. MobileNet-V2 and MobileNet-V3 are both used as student models in this study.
• ShuffleNet: ShuffleNet-V2-50 [13] is a variant of ShuffleNet, a mobile-centric deep learning model. It builds upon ShuffleNet-V1, which utilized pointwise group convolutions, bottleneck-like structures, and a channel shuffle operation. ShuffleNet-V2 is additionally designed around the observation that the direct speed metric depends not only on FLOPs but also on other factors such as memory access cost and platform characteristics (Figure 3(c)). ShuffleNet-V2-50 is used as a student model in this study.
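To make the MobileNet factorization concrete, the block below contrasts a regular 3 × 3 convolution with a depthwise separable counterpart. This is an illustrative sketch; the channel counts are arbitrary and the module is not taken from the MobileNet-V2 implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, ch_in, ch_out):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=ch_in)
        self.depthwise = nn.Conv2d(ch_in, ch_in, kernel_size=3, padding=1, groups=ch_in, bias=False)
        # Pointwise: 1x1 convolution combines the channels into ch_out outputs
        self.pointwise = nn.Conv2d(ch_in, ch_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 112, 112)
regular = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)
# Parameter counts: 3*3*64*128 = 73,728 (regular) vs 3*3*64 + 64*128 = 8,768 (separable)
print(sum(p.numel() for p in regular.parameters()),
      sum(p.numel() for p in separable.parameters()))
```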
In this section, we discuss the datasets used for training and evaluating the compressed models.
• VISOB 2.0 [14]: VISOB 2.0 is the second version of the VISOB 1.0 dataset, used in the IEEE WCCI 2020 competition, which facilitates subject-independent analysis. This publicly available dataset consists of stacks of eye images captured in burst mode using two mobile devices: a Samsung Note 4 and an OPPO N1. During data collection, the volunteers were asked to take selfie images in two visits, 2 to 4 weeks apart. At each visit, the selfie-like images were captured using the front-facing camera of the mobile devices under three lighting conditions (daylight, office light, and dark light) and in two sessions (about 10 to 15 minutes apart). Each stack of five consecutive eye images was extracted from the stack of full-face frames, selected such that the correlation coefficient between the center frame and the remaining four images is greater than 90%. This dataset was used for fine-tuning the ResNet-50 models and the pruned networks. We used 40K images from 200 subjects of VISOB 2.0, captured from all devices and lighting conditions, for CNN model training/fine-tuning. Images of the left eye and flipped versions of the right eye were each treated as an individual subject. The remaining 350 subjects were used for the subject-independent verification performance evaluation of the compressed models for mobile ocular recognition.
• UFPR-Periocular [23]: This is the latest periocular biometric dataset, containing samples from 1,122 subjects with a total of 33,660 images acquired in 3 sessions by 196 different mobile devices. The data are captured across race, age, and gender. The gender distribution of the subjects is 53.65% male and 46.35% female, and approximately 66% of the subjects are under 31 years old. The eye corners of the ocular images are annotated with 4 points per image (inside and outside eye corners) to normalize the periocular region across scale and rotation. This dataset is used for the subject-independent evaluation of the compressed models.

The ResNet-50 model fine-tuned on 200 subjects from the VISOB 2.0 dataset [14] was used as the baseline model to be pruned and as the teacher for the knowledge distillation experiments. For the experiments, all images were resized to 224 × 224. After the last convolutional layer, we added batch normalization (BN), dropout, a fully connected layer of size 512, and another batch normalization layer followed by the final output layer. This results in a 512-D feature embedding. The ResNet-50 model was trained with an early stopping mechanism using the Adam optimizer and a cross-entropy loss function. The open-set evaluation is performed on the pruned versions of the ResNet-50 model and the trained student networks (ResNet-8, ResNet-20, MobileNet-V2, MobileNet-V3, and ShuffleNet-V2). The template set consists of a stack of 5 ocular images per subject for VISOB and UFPR across two different sessions. For VISOB, the template and test sets consist of multiple stacks of 5 images per individual. The UFPR dataset has two stacks of 5 images: one used as the template and the other as the test set. The matching score between deep features extracted from a pair of template and test images is calculated using the cosine similarity metric given in Eq. (5):

$$\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert} \qquad (5)$$

where u and v are the two deep feature vectors. The cosine similarity metric is widely adopted for measuring the angle between deep features extracted from a pair of images [12]. We trained our student models using the Stochastic Gradient Descent (SGD) optimizer with a batch size of 32. MobileNet-V2 was trained for 50 epochs, ShuffleNet for 100 epochs, ResNet-20 and ResNet-8 for 200 epochs, and MobileNet-V3 for 35 epochs, using an early stopping mechanism based on the validation set accuracy (indicated in Table IV). All trained models were initialized with ImageNet pre-trained weights, and softmax (cross-entropy) was used as the loss function during the training process. All pruning methods were fine-tuned iteratively for 40 epochs. The PyTorch framework was used for CNN training and feature extraction. The average of the scores from the multiple gallery images per test image is used for the final identity verification. PyTorch Mobile was utilized to deploy the models on-device.
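The verification step described above, cosine similarity between 512-D embeddings averaged over the gallery stack, can be sketched as follows. The code assumes the deep features have already been extracted; the threshold and random embeddings are placeholders for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Eq. (5): cosine of the angle between two deep feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def verify(template_stack, probe_embedding, threshold=0.5):
    """Average the match scores over a subject's gallery (template) stack."""
    scores = [cosine_similarity(t, probe_embedding) for t in template_stack]
    score = float(np.mean(scores))
    return score, score >= threshold

# Example with random 512-D vectors standing in for deep features
template_stack = [np.random.randn(512) for _ in range(5)]  # 5-image template stack
probe = np.random.randn(512)
print(verify(template_stack, probe))
```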
Table I illustrates the performance of ResNet-50 fine-tuned on 200 subjects from the VISOB dataset and tested on VISOB (same dataset) and UFPR-Periocular (cross-dataset) under subject-independent analysis. This model is used as the base (teacher) model for the pruning (knowledge distillation) methods. The EERs of the model are 8.22% and 6.84% on the VISOB and UFPR datasets, respectively (see Table I). The Genuine Match Rate (GMR) of the model is highest for VISOB, at 45.30% at a False Match Rate (FMR) of 0.01. The Area Under the Curve (AUC) of the model when evaluated on the VISOB and UFPR datasets is almost equivalent. The equivalent performance on VISOB and UFPR-Periocular could be due to the lower number of images per subject in UFPR-Periocular compared to VISOB.

Table II shows the subject-independent verification performance of the pruning strategies, evaluated using GMRs and EER on the VISOB dataset. The Layerwise Magnitude (LM) method obtained the lowest EER of 14.34% at a compression ratio (CR) of 8 (a 4% increase over the baseline and a 0.04 decrease in AUC) and an average EER of 20.42% over the 6 compression ratios. The change in AUC of the pruned models relative to the ResNet-50 baseline is shown as well. In comparison to LM, the other strategies, namely Global Magnitude (GM), Global Gradient Magnitude (GGM), Layerwise Gradient Magnitude (LGM), and Random Pruning (RP), obtained 34.26%, 10.99%, 8.89%, and 6.53% higher EERs, respectively. The standard deviation of EERs across compression rates is 23.0% for GM (the highest), with the rest falling in the range [3.38, 5.68]. Surprisingly, GM outperformed the others in terms of GMR scores, with a 14-point increase at FMR = 0.01 and 9.37 at FMR = 0.1 across all CRs. This increase in performance of GM-based pruning was not observed for UFPR, where it obtained an average EER of 86.58%, a gap of 63.24% to the RP method (shown in Table III). Overall, GM obtained the highest error rates, even above RP, while LM obtained the lowest. Similarly, LM obtained the lowest error rates on UFPR as well, with an EER of 10.85% at a CR of 8, which is a 4% increase over the baseline (6.84%). The decrease in AUC over the baseline is 0.02 for LM at a CR of 8.

Table IV shows the subject-independent analysis of the compact models (MobileNet-V2, MobileNet-V3, ResNet-8, ResNet-20, and ShuffleNet-V2) obtained using the knowledge distillation (KD) approach on the VISOB and UFPR datasets. As can be seen, MobileNet-V2 outperformed the other students (namely, ResNet-8, ResNet-20, and MobileNet-V3), obtaining 5.21% EER, 49.09% GMR@FMR=0.01, and 61.67% GMR@FMR=0.1 (as shown in Table IV). ResNet-20 had a slightly higher EER of 8.80% compared to MobileNet-V2 and lower GMRs, with drops of 8.41% and 10.95% at FMR = 0.01 and 0.1, respectively. However, MobileNet-V3 obtained the highest EER, with an increase of 21.23% over the average of the other four students on both datasets. The EER of MobileNet-V2 trained on VISOB without knowledge distillation is 8.0%. The increase in AUC is 0.02 over the baseline ResNet-50 model. Similarly, for the UFPR dataset, MobileNet-V2 outperformed the other students, obtaining 5.38% EER, 50.19% GMR@FMR=0.01, and 68.16% GMR@FMR=0.1; the increase in AUC is 0.01 over the baseline ResNet-50 model. MobileNet-V3 obtained the lowest performance on both datasets. The reason could be that MobileNet-V3's reliance on an auto-search mechanism to find the best mobile architecture may not be effective for KD-based training. The second-lowest performance was obtained by ShuffleNet-V2, with an average EER of 23.0% and a GMR@FMR=0.001 of 13.80%. ResNet-8 obtained better overall results than ShuffleNet-V2 while being 18.7× smaller. The student MobileNet-V2 performed better than the teacher model after knowledge distillation, with about 2.2% lower EER. This could be due to the sequential self-teaching of the student, which better captures the inter-class similarity [6]. Overall, the experiments suggest the superiority of KD over the pruning methods in subject-independent evaluation. This can be seen using the Area Under the Curve (AUC) metric: the student networks obtained an average AUC of 0.82 (0.89 for VISOB and UFPR), compared to 0.93 obtained by the baseline model (ResNet-50), whereas the pruning methods obtained an average AUC of 0.72 (0.75 and 0.69 on VISOB and UFPR, respectively).

Comparison of the pruning methods across compression ratios suggests that the EER averaged across all CR points remained constant for the UFPR dataset, i.e., 32.96%. The worst performance was obtained at a CR of 64, with an average EER of 46.14% on VISOB and 43.54% on UFPR; at this point, the average AUC on both datasets is 0.60. The best performance was obtained at a CR of 2, with an average EER of 21.35% and an average AUC of 0.87.
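For reference, the EER and GMR figures reported in Tables I-IV can be computed from genuine and impostor score distributions with a simple threshold sweep. The sketch below is illustrative and assumes cosine similarity scores collected in NumPy arrays; it is not the authors' evaluation code.

```python
import numpy as np

def eer_and_gmr(genuine, impostor, target_fmr=0.01):
    """Estimate EER and GMR@FMR from genuine/impostor similarity scores."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fmr = np.array([(impostor >= t).mean() for t in thresholds])   # false match rate
    fnmr = np.array([(genuine < t).mean() for t in thresholds])    # false non-match rate
    idx = np.argmin(np.abs(fmr - fnmr))
    eer = (fmr[idx] + fnmr[idx]) / 2.0
    gmr = 1.0 - fnmr[np.argmin(np.abs(fmr - target_fmr))]          # GMR = 1 - FNMR at target FMR
    return eer, gmr

genuine = np.random.normal(0.7, 0.1, 1000)     # placeholder genuine scores
impostor = np.random.normal(0.3, 0.1, 10000)   # placeholder impostor scores
print(eer_and_gmr(genuine, impostor))
```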
To validate the latency of the models and to study the trade-off between performance and feature extraction speed, we measured the Extraction Time (ET) on real handheld devices, as indicated in Table V. The aim of this experiment is not to compare the performance of different mobile devices but to obtain real-time inference measurements of the different compression techniques on different mobile platforms. Available Apple mobile devices were used, as they are among the most widely adopted and popular. MobileNet-V3 was tuned to mobile phone CPUs through Network Architecture Search (NAS) complemented by the NetAdapt algorithm. This is evident in MobileNet-V3 being 1.17× faster than MobileNet-V2 across all five devices. The fastest ET was obtained by ResNet-8, with an average of 218 ms across all devices, which makes it 6.3× faster and 341× smaller than ResNet-50, whose average ET is 1381 ms. Average ETs of 554 ms and 520.2 ms were obtained for the MobileNet-V2 and ResNet-20 models, respectively. We further calculated the ET of the best pruned model, obtained using Layerwise Magnitude pruning at a CR of 2, on iPhone 6 and iPhone X. The ET for the pruned model averaged 1433 ms (shown in Table V); thus, the pruned model ran 1.6× slower than the baseline model. One drawback of unstructured pruning methods is that they result in sparse weight matrices, leading to inefficiency in speedup and compression on CPUs and GPUs and requiring dedicated hardware [7]. Therefore, the pruned model did not obtain high efficiency at run-time. The benefits of pruning methods include reducing the total number of energy-intensive memory accesses and improving inference time due to the effectively higher memory bandwidth for fetching compressed model parameters. The MobileNet-V2 and ResNet-20 student models offered the best trade-off between performance and speed when trained using KD with ResNet-50 as the teacher model.
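On-device measurement depends on exporting the models through PyTorch Mobile, as noted above. The following is a minimal export-and-timing sketch; it assumes TorchScript tracing with the mobile optimizer, uses MobileNet-V2 as a placeholder student, and times the forward pass on the host CPU only as a rough stand-in for the per-device ETs reported in Table V.

```python
import time
import torch
from torchvision.models import mobilenet_v2
from torch.utils.mobile_optimizer import optimize_for_mobile

model = mobilenet_v2(pretrained=True).eval()   # placeholder student model
example = torch.randn(1, 3, 224, 224)          # ocular image resized to 224 x 224

# Export for PyTorch Mobile; the .ptl file is loaded by the iOS/Android runtime
scripted = torch.jit.trace(model, example)
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("student_model.ptl")

# Rough host-side estimate of feature extraction time (device ETs will differ)
with torch.no_grad():
    start = time.time()
    for _ in range(20):
        model(example)
print("average forward time: %.1f ms" % ((time.time() - start) / 20 * 1000))
```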
This study evaluates, for the first time, five neural network pruning algorithms for CNN-based on-device smartphone user authentication using ocular images. A comparative analysis is performed with knowledge distillation methods on the publicly available VISOB and UFPR-Periocular datasets. Experimental results suggest the efficacy of knowledge distillation-based methods over network pruning on average. Specifically, the MobileNet-V2 and ResNet-20 student models trained using knowledge distillation with ResNet-50 as the teacher obtained the best trade-off between accuracy and speed when evaluated on five mobile devices, demonstrating the efficacy of knowledge distillation over pruning techniques for real-time smartphone user authentication using ocular images. As part of future work, patch-based CNNs and models based on structured pruning techniques, such as layer removal, will be evaluated and compared on the same test bed.

References
[1] What is the state of neural network pruning?
[2] Compact models for periocular verification through knowledge distillation
[3] Deep learning with edge computing: A review
[4] Low-bit quantization of neural networks for efficient inference
[5] Mobile iris challenge evaluation (MICHE)-I, biometric iris dataset and protocols
[6] Knowledge distillation: A survey
[7] EIE: Efficient inference engine on compressed deep neural network
[8] Deep residual learning for image recognition
[9] Searching for MobileNetV3
[10] Periocular in the wild embedding learning with cross-modal consistent knowledge distillation
[11] Periocular biometrics: A survey
[12] SphereFace: Deep hypersphere embedding for face recognition
[13] ShuffleNet V2: Practical guidelines for efficient CNN architecture design
[14] VISOB 2.0 - second international competition on mobile ocular biometric recognition
[15] Deep-PRWIS: Periocular recognition without the iris and sclera using deep learning frameworks
[16] Collaborative representation of blur invariant deep sparse features for periocular recognition from smartphones
[17] On fine-tuning convolutional neural networks for smartphone based ocular recognition
[18] ICIP 2016 competition on mobile ocular biometric recognition
[19] ICIP 2016 competition on mobile ocular biometric recognition
[20] OcularNet: Deep patch-based ocular biometric recognition
[21] Generalizable deep features for ocular biometrics
[22] MobileNetV2: Inverted residuals and linear bottlenecks
[23] UFPR-Periocular: A periocular dataset collected by mobile devices in unconstrained scenarios