FaceX-Zoo: A PyTorch Toolbox for Face Recognition

Jun Wang, Yinglu Liu, Yibo Hu, Hailin Shi, Tao Mei

January 12, 2021

Abstract

Deep learning based face recognition has achieved significant progress in recent years. Yet, the practical model production and further research of deep face recognition are in great need of corresponding public support. For example, the production of face representation networks calls for a modular training scheme that allows choosing among various state-of-the-art backbones and training supervisions according to real-world face recognition demands; for performance analysis and comparison, a standard and automatic evaluation of multiple models on multiple benchmarks is a desired tool as well; besides, a public groundwork is welcome for deploying face recognition as a holistic pipeline. Furthermore, there are some newly emerged challenges, such as masked face recognition caused by the recent worldwide COVID-19 pandemic, which draw increasing attention in practical applications. A feasible and elegant solution is to build an easy-to-use unified framework to meet the above demands. To this end, we introduce a novel open-source framework, named FaceX-Zoo, which is oriented to the research and development community of face recognition. Resorting to its highly modular and scalable design, FaceX-Zoo provides a training module with various supervisory heads and backbones towards state-of-the-art face recognition, as well as a standardized evaluation module which enables evaluating models on most of the popular benchmarks just by editing a simple configuration. Also, a simple yet fully functional face SDK is provided for the validation and primary application of the trained models. Rather than including as many prior techniques as possible, we design FaceX-Zoo so that it can easily upgrade and extend along with the development of face-related domains. The source code and models are available at https://github.com/JDAI-CV/FaceX-Zoo.

1. Introduction

Deep learning based face recognition has witnessed great progress in the research field. Correspondingly, a number of excellent open-source projects have emerged to facilitate the experiments and production of deep face recognition networks. For example, Facenet [1] is a TensorFlow [5] implementation of the model proposed by Schroff et al. [28], which is a classic project for deep face recognition. OpenFace [6] is a general face recognition library, especially for the support of mobile device applications. InsightFace [2] is a toolbox for 2D&3D deep face analysis, mainly written in MXNet [8]; it includes commonly used training data, network settings and loss functions. face.evoLVe [3] provides a comprehensive face recognition library for face-related analytics and applications. Although these projects have been widely used and have brought a great deal of convenience, the rapid development of deep face recognition techniques creates a significant need for a more comprehensive framework and standard evaluation to facilitate research and development. To this end, we develop a new framework, named FaceX-Zoo, in the form of a PyTorch [27] library, which is highly modular, flexible and scalable.
It is composed of a state-of-the-art training pipeline for discriminative face feature learning, a standard evaluation towards fair comparisons, and a deployment SDK for efficient proof of concept and further applications. We release all the source code and trained models to facilitate the community in developing their own solutions against various real-world challenges from the perspectives of training, evaluation, and deployment. We hope that FaceX-Zoo can provide helpful support to the community and promote the development of face recognition.

The remainder of this paper is organized as follows. In Section 2, we describe the structure and the highlights of FaceX-Zoo. In Section 3, we introduce the detailed design of the project. Section 4 provides experiments on the various supervisory heads and backbones integrated in the training module, and reports test accuracies on the commonly used benchmarks provided by the evaluation module. Section 5 presents our solutions for two practical situations, i.e., shallow face learning and masked face recognition. Finally, we discuss future work and conclude in Section 6 and Section 7, respectively.

2. Overview

2.1 Architecture

The overall architecture of FaceX-Zoo is presented in Figure 1. The project mainly consists of four parts: the training module, the evaluation module, the additional module and the face SDK, where the former two modules are the core of the project. Several components are contained in the training and evaluation modules, including Pre-Processing, Training Mode, Backbone, Supervisory Head and Test Protocol. We elaborate on them below.

Pre-Processing. This component performs the basic transformations on images before sending them to the network. For training, we implement the commonly used operations, such as resizing, normalization, random cropping, random flipping and random rotation. One can flexibly add customized operations according to various demands. For evaluation, only resizing and normalization are employed. Likewise, testing augmentations, such as five-crop and horizontal flipping, can be easily added to our framework by customization.
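As a concrete illustration, here is a minimal sketch of such a pipeline built with torchvision; the 112x112 input size and the normalization constants are common choices in face recognition rather than the exact values used by FaceX-Zoo.

```python
import torchvision.transforms as T

# Training pipeline: resize, random augmentations, normalization.
train_transform = T.Compose([
    T.Resize((112, 112)),             # resize to the network input size
    T.RandomHorizontalFlip(p=0.5),    # random flipping
    T.RandomRotation(degrees=10),     # random rotation
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Evaluation uses only resizing and normalization.
test_transform = T.Compose([
    T.Resize((112, 112)),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```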
Training Mode. The conventional training mode of face recognition is treated as the baseline routine. Concretely, it schedules the training inputs via a DataLoader, sends the inputs to the backbone network for the forward pass, and finally computes a criterion as the training loss for the backward update. In addition, we consider a practical situation in face recognition, namely training the network with shallowly distributed data [12]. Accordingly, we integrate a recent training strategy to facilitate training on shallow face data.

Backbone. The backbone network extracts the features of face images. We provide a series of state-of-the-art backbone architectures in FaceX-Zoo, listed below. Besides, other architectures can be easily added with the support of PyTorch, simply by modifying the configuration file and adding an architecture definition file.

• MobileFaceNet [7]: An efficient network for applications on mobile devices.
• ResNet [17]: A series of classic architectures for general vision tasks.
• SE-ResNet [18]: ResNet equipped with SE blocks that recalibrate the channel-wise feature responses.
• HRNet [33]: A network for deep high-resolution representation learning.
• EfficientNet [30]: A family of architectures that scale over depth, width and resolution.
• GhostNet [16]: A model aiming at generating more feature maps from cheap operations.
• AttentionNet [32]: A network built by stacking attention modules to learn attention-aware features.
• TF-NAS [19]: A series of architectures searched by NAS under a latency constraint.
• ResNeSt [41]: A series of ResNet-style networks with split-attention blocks.
• ReXNet [15]: A series of models with effective channel configuration and parameterization.
• RepVGG [11]: A VGG-like architecture realized by structural re-parameterization.
• LightCNN [38, 37]: A light model with max-feature-map activations for fast face recognition.

Supervisory Head. The supervisory head is defined as the supervision signal and its corresponding computation module towards accurate face recognition. In order to learn discriminative features for face recognition, the predicted logits are usually processed by some specific operations, such as normalization, scaling and adding a margin, before being sent to the softmax layer. We implement a series of softmax-style losses in FaceX-Zoo, listed below; a minimal sketch of this shared pattern follows the list.

• AM-Softmax [31]: An additive margin loss that adds a cosine margin penalty to the target logit.
• ArcFace [9]: An additive angular margin loss that adds a margin penalty to the target angle.
• AdaCos [42]: A cosine-based softmax loss that is hyperparameter-free and adaptively scaled.
• AdaM-Softmax [23]: An adaptive margin loss that adjusts the margins of different classes adaptively.
• CircleLoss [29]: A unified formula that learns with class-level labels and pair-wise labels.
• CurricularFace [21]: A loss function that adaptively adjusts the importance of easy and hard samples during different training stages.
• MV-Softmax [36]: A loss function that adaptively emphasizes the mis-classified feature vectors to guide discriminative feature learning.
• NPCFace [40]: A loss function that emphasizes training on both negative and positive hard cases.
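The sketch below illustrates the normalize / add-margin / scale pattern with an AM-Softmax-style [31] head: features and class weights are L2-normalized, a cosine margin is subtracted from the target logit, and the result is scaled before the softmax cross-entropy. The hyper-parameter values and class interface are illustrative, not the exact FaceX-Zoo implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxHead(nn.Module):
    def __init__(self, feat_dim, num_classes, margin=0.35, scale=32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, feats, labels):
        # Cosine similarity between L2-normalized features and class weights.
        cos_theta = F.linear(F.normalize(feats), F.normalize(self.weight))
        # Subtract the additive cosine margin from the target logits only.
        one_hot = F.one_hot(labels, cos_theta.size(1)).float()
        logits = self.scale * (cos_theta - self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```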
Test Protocol. There are various benchmarks for measuring the accuracy of face recognition models. Many of them focus on specific challenges, such as cross-age, cross-pose, and cross-race recognition. Among them, the commonly used test protocols are mainly based on the benchmarks of LFW [20] and MegaFace [22]. We integrate these protocols into FaceX-Zoo with simple usage and clear instructions, so that one can easily test models on single or multiple benchmarks via simple configurations. Besides, it is convenient to extend additional test protocols by adding the test data and parsing the test pairs. It is worth noting that a masked face recognition benchmark based on MegaFace is provided as well.

• LFW [20]: 13,233 web-collected images of 5,749 identities with pose, expression and illumination variations. We report the mean accuracy of 10-fold cross validation on this classic benchmark.
• CPLFW [43]: 11,652 images of 3,930 identities, focusing on cross-pose face verification. Following the official protocol, the mean accuracy of 10-fold cross validation is adopted.
• CALFW [44]: 12,174 images of 4,025 identities, aiming at cross-age face verification. The mean accuracy of 10-fold cross validation is adopted.
• AgeDB30 [26]: 12,240 images of 440 identities, where each test pair has an age gap of 30 years. We report the mean accuracy of 10-fold cross validation.
• RFW [34]: 40,607 images of 11,430 identities, proposed to measure potential racial bias in face recognition. There are four test subsets in RFW, named African, Asian, Caucasian and Indian, and we report the mean accuracy of each subset, respectively.
• MegaFace [22]: 80 probe identities with 1 million gallery distractors, aiming at evaluating large-scale face recognition performance. We report the rank-K identification accuracy on MegaFace.
• MegaFace-Mask: The same probe identities and gallery distractors as MegaFace [22], while a virtual mask is added to each probe image. This protocol is designed to evaluate large-scale masked face recognition performance; more details can be found in Section 5.2. We report the rank-K identification accuracy on MegaFace-Mask.

2.2 Highlights

Modular and extensible design. As described above, FaceX-Zoo is designed to be modular and extensible. It consists of a set of modules with respective functions. Most of the modules follow the principles of object-oriented design, which greatly promotes scalability. One can easily add new training modes, backbones, supervisory heads and data samplers to the training module, as well as more test protocols to the evaluation module. Last but not least, we provide the face SDK and the additional module for efficient deployment and flexible extension according to various demands.

State-of-the-art training. We provide several state-of-the-art practices for face recognition model training, such as the complete pre-processing operations for data augmentation, various backbone networks for model ensembling, softmax-style loss functions for discriminative feature learning, and the Semi-Siamese Training mode for practical shallow face learning.

Easy to use and deploy. We release all the code, models and training logs for reproducing the state-of-the-art results. Besides, we provide a simple yet fully functional face SDK written in Python, which acts as a demo to help users learn the usage of each module and develop further versions.

Standardized evaluation module. The commonly used evaluation benchmarks for face recognition are in need of a unified implementation for efficient and fair evaluation. For example, the official test protocol of MegaFace is implemented as a bin file, which is inconvenient in many evaluation conditions. FaceX-Zoo provides a standard and open-source implementation for evaluating on LFW-based and MegaFace-based benchmarks. Users can efficiently evaluate models on various benchmarks by editing the configuration file. We will also release the 106-point facial landmarks defined in [24] so that users can utilize them for face alignment on these benchmarks.

Support for masked face recognition. Recently, due to the COVID-19 pandemic, masked face recognition has attracted increasing attention. In order to develop such a model, three components are indispensable: masked face training data, a masked face training algorithm and a masked face evaluation benchmark. FaceX-Zoo provides all three components via its 3D face mask adding technique.

3. Detailed Design

In this section, we describe the design of the training module (Figure 2), the evaluation module (Figure 3), and the face SDK (Figure 4) in detail; all of them are modular and extensible.

3.1 Training Module

As shown in Figure 2, TrainingMode is the core class that aggregates all the other classes in the training module. There are mainly three classes aggregated in TrainingMode: (1) BackboneFactory, a factory class that provides the backbone network; (2) HeadFactory, a factory class that produces the supervisory head according to the configuration; (3) DataLoader, which is in charge of loading the training data.
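The following conceptual sketch shows how a training loop might tie the products of these three classes together: batches from the DataLoader flow through the backbone, and the supervisory head turns features and labels into a loss. The function name and signature are illustrative assumptions, not the exact FaceX-Zoo API.

```python
def train_one_epoch(backbone, head, data_loader, optimizer, device="cuda"):
    backbone.train()
    head.train()
    for images, labels in data_loader:    # DataLoader schedules the inputs
        images, labels = images.to(device), labels.to(device)
        features = backbone(images)       # forward pass through the backbone
        loss = head(features, labels)     # supervisory head computes the loss
        optimizer.zero_grad()
        loss.backward()                   # backward update
        optimizer.step()
```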
3.2 Evaluation Module

As depicted in Figure 3, LFWEvaluator and MegaFaceEvaluator are the core classes in the evaluation module. Both of them contain the CommonExtractor class for face feature extraction. CommonExtractor depends on the ModelLoader class and the DataLoader class, where the former loads the models and the latter loads the test data. Besides, the LFWEvaluator class also aggregates the PairsParserFactory class for parsing the test pairs of each test set. For MegaFace-based evaluations, we split the functionality into two classes, named CommonMegaFaceEvaluator and MaskedMegaFaceEvaluator, for the MegaFace evaluation and the MegaFace-Mask evaluation, respectively; both inherit from the MegaFaceEvaluator class.

3.3 Face SDK

In order to validate and demonstrate the effectiveness of the trained face recognition models in a convenient way, we provide a simple yet fully functional Face SDK module. As shown in Figure 4, Face SDK includes three core classes, named ModelLoader, ImageCropper and ModelHandler. The ModelLoader class loads the models for face detection, face landmark localization and face feature extraction. The ImageCropper class crops the facial area from the input image according to the detected landmarks, and the ModelHandler class handles model inference on the input images.

In Face SDK, we provide a series of models, i.e., face detection, facial landmark localization, and face recognition, for both the non-masked and masked face recognition scenarios. Specifically, for the non-masked scenario, we train the face detection model by RetinaFace [10] on the WiderFace dataset [39], and the facial landmark localization model by PFLD [14] on the JD-landmark dataset [24]. We train the face recognition model with MobileFaceNet [7] and MV-Softmax [35] on MS-Celeb-1M-v1c [4]. For the masked face recognition scenario, we train the models with the same algorithms as in the non-masked scenario, while the training data are expanded by our FMA-3D method described in Section 5.2. We will continuously update the models with more methods in the future.

4. Experiments

To help readers reproduce our results and fulfil their own work with this framework, we conduct extensive experiments on the backbones and supervisory heads listed in Section 2.1. We use the well-cleaned MS-Celeb-1M-v1c [4] as the training data. For clear presentation, the backbone experiments all adopt the same supervisory head, i.e., MV-Softmax [36], and the supervisory head experiments all adopt the same backbone, i.e., MobileFaceNet [7]. The remaining settings are kept the same for each trial. Four NVIDIA Tesla P40 GPUs are employed for training. We set the total number of epochs to 18 and the batch size to 512. The learning rate is initialized to 0.1 and divided by ten at epochs 10, 13 and 16. The test results of the backbone and supervisory head experiments are shown in Table 1 and Table 2, respectively. One can refer to these results for guiding and verifying the usage of our framework.
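A minimal sketch of this schedule (18 epochs, learning rate 0.1 divided by ten at epochs 10, 13 and 16) is given below, reusing the names from the earlier training sketch. The use of SGD with momentum and weight decay is an assumption; the paper specifies only the schedule itself.

```python
import torch

# Optimize backbone and supervisory head parameters jointly.
params = list(backbone.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 13, 16], gamma=0.1)

for epoch in range(18):
    train_one_epoch(backbone, head, train_loader, optimizer)
    scheduler.step()  # divide the learning rate by ten at each milestone
```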
5. Solutions for Practical Situations

In this section, we present specific solutions for handling two challenging face recognition tasks within the framework of FaceX-Zoo: Semi-Siamese Training [12] for shallow face learning, and masked face recognition for the recent demand caused by the COVID-19 pandemic.

5.1 Shallow Face Learning

Background. In many real-world scenarios of face recognition, the training dataset is limited in depth, e.g., only two face images are available for each ID. This task, called Shallow Face Learning in [12], is problematic for the conventional training methods of face recognition. Shallow face data severely lacks intra-class diversity for each ID, which leads to collapsed feature dimensions that hinder effective training. Consequently, the trained network suffers from either model degeneration or over-fitting. As suggested in [12], we adopt Semi-Siamese Training (SST) to tackle this issue. Furthermore, we implement it within the framework of FaceX-Zoo, in which the upstream and downstream stages (i.e., efficient data reading and unified automatic evaluation) complete the pipeline and facilitate the use of SST for model production.

Experiments and results. For a quick verification of the effectiveness of FaceX-Zoo for shallow face learning, we employ an off-the-shelf architecture, i.e., MobileFaceNet, as the backbone, and compare conventional training with SST. Following the settings of [12], the training dataset, called MS-Celeb-1M-v1c-Shallow, is constructed by randomly selecting two facial images for each ID of MS-Celeb-1M-v1c. The number of training epochs is set to 250 and the batch size to 512. The learning rate is initialized to 0.1 and divided by ten at epochs 150, 200 and 230. The test results on LFW, CPLFW, CALFW, AgeDB and RFW are presented in Table 3, which verify the effectiveness of SST on shallow data.

5.2 Masked Face Recognition

Background. Due to the recent worldwide COVID-19 pandemic, masked face recognition has become a crucial application demand in many scenarios. However, few masked face datasets are available for training and evaluation. To address this issue, we empower the framework of FaceX-Zoo to add a virtual mask to existing face images with a specialized module, named FMA-3D (3D-based Face Mask Adding).

FMA-3D. Given a real masked face image A (Fig. 5(a)) and a non-masked face image B (Fig. 5(d)), we synthesize a photo-realistic masked face image with the mask from A and the facial area from B. First, we utilize a mask segmentation model [25] to extract the mask area from image A (Fig. 5(b)), and map its texture into UV space with the 3D face reconstruction method PRNet [13] (Fig. 5(c)). For image B, we compute the texture map in UV space in the same way (Fig. 5(e)). Next, we blend the mask texture map and the face texture map in UV space, as shown in Fig. 5(f). Finally, the masked face image is synthesized (Fig. 5(g)) by rendering the blended texture map according to the UV position map of image B. Fig. 6 shows more examples of masked face images synthesized by FMA-3D. Compared with 2D-based and GAN-based methods, our method shows superior robustness and fidelity, especially for large head poses.
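The sketch below summarizes this pipeline at a high level. The helpers segment_mask, to_uv_space and render_from_uv are hypothetical placeholders standing in for the mask segmentation model [25] and the PRNet-based [13] reconstruction and rendering steps; they are not actual FaceX-Zoo or PRNet APIs.

```python
import numpy as np

def synthesize_masked_face(img_a, img_b):
    """Transfer the mask from a real masked face A onto a non-masked face B."""
    mask_a = segment_mask(img_a)                     # Fig. 5(b): binary mask region of A
    uv_tex_a, uv_mask = to_uv_space(img_a, mask_a)   # Fig. 5(c): A's texture and mask in UV space
    uv_tex_b, _ = to_uv_space(img_b, None)           # Fig. 5(e): B's texture in UV space
    # Fig. 5(f): keep A's texture inside the mask region, B's elsewhere.
    uv_blend = np.where(uv_mask[..., None], uv_tex_a, uv_tex_b)
    return render_from_uv(uv_blend, img_b)           # Fig. 5(g): render with B's UV position map
```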
Training masked face recognition models. Resorting to FMA-3D, it is convenient to synthesize a large number of masked face images from existing non-masked datasets, such as MS-Celeb-1M-v1c. Since these datasets already carry ID annotations, we can directly employ the synthesized images to train the face recognition network without additional labeling. The training method can be either the conventional routine or SST, and the supervisory head and backbone can be instantiated with any of the choices integrated in FaceX-Zoo. Note that the testing benchmarks can be augmented from non-masked to masked versions in the same manner.

Experiments and results. Using FMA-3D, we expand the training data of MS-Celeb-1M-v1c to a masked version, named MS-Celeb-1M-v1c-Mask. It includes the original face images of each identity in MS-Celeb-1M-v1c, as well as the masked face images corresponding to the original ones. We choose MobileFaceNet as the backbone and MV-Softmax as the supervisory head. The model is trained for 18 epochs with a batch size of 512. The learning rate is initialized to 0.1 and divided by ten at epochs 10, 13 and 16. To evaluate the models on the masked face recognition task, we synthesize a masked facial dataset based on MegaFace by FMA-3D, named MegaFace-Mask, in which the probe images are masked while the gallery images remain non-masked. As shown in Figure 7, we conduct comparison experiments among four settings. Specifically, model1 is the baseline trained on MS-Celeb-1M-v1c; model2 is also trained on MS-Celeb-1M-v1c, but only the upper half of each face is cropped for training, which can be regarded as a naive way to eliminate the adverse effect of masks; model3 is trained on MS-Celeb-1M-v1c-Mask; model4 is the ensemble of model2 and model3. The rank-1 accuracy of the baseline model is 27.03%. By utilizing only the upper half of the face, model2 improves the accuracy to 71.44%. Model3 achieves the best single-model performance of 78.39% with the help of the synthesized masked face images. By combining model2 and model3, the rank-1 accuracy is further improved to 79.26%.
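One plausible realization of the model2 + model3 ensemble is feature-level fusion of L2-normalized embeddings, sketched below; the paper does not state the exact fusion scheme, so this is an illustrative assumption (it also glosses over model2 operating on upper-half-face crops).

```python
import torch
import torch.nn.functional as F

def ensemble_embedding(model2, model3, image):
    with torch.no_grad():
        f2 = F.normalize(model2(image), dim=1)  # upper-half-face model
        f3 = F.normalize(model3(image), dim=1)  # masked-data model
    # After renormalization, the cosine similarity of two concatenated unit
    # vectors equals the average of the two models' cosine similarities.
    return F.normalize(torch.cat([f2, f3], dim=1), dim=1)
```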
6. Future Work

In the future, we will improve FaceX-Zoo in three aspects: breadth, depth, and efficiency. First, more additional modules will be included, such as face parsing and face lighting, to enrich the functionality "X" in FaceX-Zoo. Second, the backbone architectures and supervisory heads will be continually supplemented along with the development of deep learning techniques. Third, we will improve the training efficiency via distributed data parallelism and mixed precision training.

7. Conclusion

In this work, we introduce FaceX-Zoo, a highly modular and scalable open-source framework for face recognition that is easy to install and utilize. The training module enables users to train face recognition networks with various choices of backbone and supervisory head, and its training modes include both the conventional routine and a specific solution for shallow face learning. The evaluation module provides automatic benchmarks for standard and convenient testing. Face SDK provides the whole face recognition pipeline, i.e., face detection, face landmark localization and face feature extraction; it can serve as a baseline as well as a starting point for further development towards deployment. Besides, the additional module supports training and testing for masked face recognition via the 3D virtual mask adding technique. All the source code is released along with the training logs and trained models. One can easily play with this framework as a prototype and develop their own work from this baseline.

Acknowledgements

Thanks to Jixuan Xu (jixuanxu@shu.edu.cn), Hang Du (duhang@shu.edu.cn), Haoran Jiang (jianghaoran@shu.edu.cn) and Hanbin Dai (daihanbin@std.uestc.edu.cn); part of this work was done when they interned at JD AI Research.

References

[5] TensorFlow: Large-scale machine learning on heterogeneous systems.
[6] OpenFace: A general-purpose face recognition library with mobile applications.
[7] MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices.
[8] MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems.
[9] ArcFace: Additive angular margin loss for deep face recognition.
[10] RetinaFace: Single-stage dense face localisation in the wild.
[11] RepVGG: Making VGG-style ConvNets great again.
[12] Semi-Siamese training for shallow face learning.
[13] Joint 3D face reconstruction and dense alignment with position map regression network.
[14] PFLD: A practical facial landmark detector.
[15] Rethinking channel dimensions for efficient model design.
[16] GhostNet: More features from cheap operations.
[17] Deep residual learning for image recognition.
[18] Squeeze-and-excitation networks.
[19] TF-NAS: Rethinking three search freedoms of latency-constrained differentiable neural architecture search.
[20] Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments.
[21] CurricularFace: Adaptive curriculum learning loss for deep face recognition.
[22] The MegaFace benchmark: 1 million faces for recognition at scale.
[23] AdaptiveFace: Adaptive margin and sampling for face recognition.
[24] Grand challenge of 106-point facial landmark localization.
[25] A new dataset and boundary-attention semantic segmentation for face parsing.
[26] AgeDB: The first manually collected, in-the-wild age database.
[27] PyTorch: An imperative style, high-performance deep learning library.
[28] FaceNet: A unified embedding for face recognition and clustering.
[29] Circle loss: A unified perspective of pair similarity optimization.
[30] Rethinking model scaling for convolutional neural networks.
[31] Additive margin softmax for face verification.
[32] Residual attention network for image classification.
[33] Deep high-resolution representation learning for visual recognition.
[34] Racial Faces in the Wild: Reducing racial bias by information maximization adaptation network.
[35] Mis-classified vector guided softmax loss for face recognition.
[36] Mis-classified vector guided softmax loss for face recognition.
[37] Learning an evolutionary embedding via massive knowledge distillation.
[38] A light CNN for deep face representation with noisy labels.
[39] WIDER FACE: A face detection benchmark.
[40] NPCFace: A negative-positive cooperation supervision for training large-scale face recognition.
[41] ResNeSt: Split-attention networks.
[42] AdaCos: Adaptively scaling cosine logits for effectively learning deep face representations.
[43] Cross-Pose LFW: A database for studying cross-pose face recognition in unconstrained environments.
[44] Cross-Age LFW: A database for studying cross-age face recognition in unconstrained environments.