key: cord-0330366-n1abn1gu authors: Seneviratne, Sachith; Kasthuriaarachchi, Nuran; Rasnayaka, Sanka title: Multi-Dataset Benchmarks for Masked Identification using Contrastive Representation Learning date: 2021-06-10 journal: nan DOI: 10.1109/dicta52665.2021.9647194 sha: 5c2f0e39eb83defe85260cf0c1f7c441ee76cef6 doc_id: 330366 cord_uid: n1abn1gu

The COVID-19 pandemic has drastically changed accepted norms globally. Within the past year, masks have been used as a public health response to limit the spread of the virus. This sudden change has rendered many face recognition based access control, authentication and surveillance systems ineffective. Official documents such as passports, driving licenses and national identity cards are enrolled with fully uncovered face images. However, in the current global situation, face matching systems should be able to match these reference images with masked face images. For example, at an airport or security checkpoint it is safer to match the unmasked image on the identifying document to the masked person rather than asking them to remove the mask. We find that current facial recognition techniques are not robust to this form of occlusion. To address this unique requirement presented by the current circumstances, we propose a set of re-purposed datasets and a benchmark for researchers to use. We also propose a contrastive visual representation learning based pre-training workflow which is specialized to masked vs unmasked face matching. We ensure that our method learns robust features to differentiate people across varying data collection scenarios. We achieve this by training over many different datasets and validating our result by testing on various holdout datasets. The specialized weights trained by our method outperform standard face recognition features for masked to unmasked face matching. We believe the provided synthetic mask generating code, our novel training approach and the trained weights from the masked face models will help in adapting existing face recognition systems to operate in the current global environment. We open-source all contributions for broader use by the research community.

Facial recognition technology was generating impressive results prior to the COVID-19 pandemic. However, due to mask-based occlusions these methods now need to be investigated and adjusted to be robust to partial facial occlusion. A common scenario in this space is unmasked vs masked identity matching. Organizations often retain unmasked images of an individual appearing on various identity documents (passport, driver's license, staff identity) that need to be verified against a masked image. Traditional facial recognition methods contain feature representations that are reliant on seeing the whole face. In particular, the absence of some distinctive facial appendages in the masked image (lips, chin, moustache) is likely to lead to a false negative, where an authentic user may be incorrectly categorized as an imposter. It is imperative that computer vision techniques are able to adapt to such scenarios. We find that resources for performing research in this domain are quite lacking and propose a set of benchmarks to remedy this situation. Masked face recognition focuses on identifying people using their facial features while they are wearing masks. It can be tackled across two use cases. The first is to assume each user will enroll their face image while wearing a mask.
This means matching is performed between two masked faces. The second use case is masked person recognition from a database of unmasked images. This use case is more receptive to using existing face databases such as passports or drivers' licenses. It has the broader advantage of not requiring an entire cohort of individuals to be re-registered within a facial database while wearing masks. Our work focuses on this scenario. We approach this problem of masked to unmasked matching with the objective of creating a replicable workflow that can be applied in the wild. To this end, we focus our analysis on evaluating on datasets unseen by the model during training. We re-purpose some existing and easily accessible facial databases with a synthetic masking technique in order to generate new datasets for this problem. We make our evaluation more robust by using several such databases and by performing additional evaluation on a new dataset collected explicitly for this problem. Our evaluation shows that our method outperforms existing facial recognition techniques even when they are finetuned on the same datasets. We use a workflow we believe is generalizable to benchmarking other problems in the occluded imagery domain, and therefore avoid task-specific optimizations such as specialized loss functions from facial recognition, focusing instead on improving performance by incorporating more datasets. By following a task-general workflow and using data as the instrument for task-specific optimization, we position this work for use as a general benchmarking technique in one-shot learning.

2 Literature Review

With the onset of COVID-19, the task of face mask detection has received considerable attention. Several studies have focused on classifying masked and unmasked faces, achieving near-perfect results of over 99% [27, 38, 25, 36]. These works focus only on the presence of a mask but do not ensure the mask is worn properly. Batagelj et al. [2] introduce the Face-Mask Label Dataset (FMLD) to train models that determine whether a person is wearing a mask properly, with over 97% accuracy. While the previously mentioned studies tackle the face mask recognition problem as a classification task, object detection based approaches utilizing You Only Look Once (YOLO) report 94% and 81% average precision [3, 28]. Our work is not focused on face mask detection; we focus on masked face recognition. Due to the sudden widespread usage of face masks, existing face recognition systems have become less reliable. The effect of face masks on existing face recognition tasks was studied by Damer et al. [7, 6]. Both studies provide quantitative evidence that current face recognition models drop in accuracy when the probe images contain masked faces. This highlights the need for specialized models which can handle masked faces without a drop in authentication accuracy. Initial work on Occlusion robust Face Recognition (OFR) [35, 30, 37, 39] and Partial Face Recognition (PFR) [18, 42, 24] has overlap with masked face recognition tasks. However, with the renewed importance of masked faces in the current global environment, there are several studies dedicated to Masked Face Recognition (MFR). The main focus of existing MFR studies has been on developing new models which can perform facial recognition on masked face datasets. Table 1 gives a summary of current research in this area and the reported accuracy values.
All the work summarized in Table 1 focuses on recognition where the probe face is masked.

Table 1: Summary of existing MFR studies.
Paper | Approach | Evaluation dataset | Result
Hariri [16] | Occlusion removal approach and training on a VGG16 architecture | RWMFD | 91.3% (acc)
Wang et al. [41] | Face-eye based multi-granularity model | MFRD | 95% (acc)
Ejaz et al. [12] | PCA | - | -

Geng et al. [13] proposed a Generative Data Augmentation method to create synthesized data which is used to fine-tune a VGGFace2 model. In that work the authors evaluate the model on a scenario where the similarity between masked and unmasked faces is measured. They report an F1 score of 86.5 on the MFSR dataset. We will be focusing on a similar evaluation setup over many datasets. Other research in this area includes masked face recognition using near-IR images by Du et al. [11].

Unsupervised representation learning has been explored extensively in computer vision due to the ability to learn from unlabelled images. This allows for a task-independent approach to representation building, since unlabelled images are commonly available for most problems. Self-supervised representation learning is a type of unsupervised representation learning which performs unsupervised learning by creating a pretext task for the representation to be built in a supervised manner. Most self-supervised techniques vary in terms of the task used, including distortion [10], relative position prediction [9], jigsaw puzzle solving [33], feature counting [34] and colorization [44]. Current state of the art approaches in this area [15, 4, 17] use contrastive learning tasks to generate representations. We use MoCoV2 [5], which operates on the pretext task of instance discrimination, as the basis for building our representations. In particular, we draw upon the idea, used by both SimCLR [4] and MoCoV2, of attaching a projection head during representation learning and discarding it during evaluation, and we extend this idea by replicating the workflow at inference time.

(Table 2: number of unmasked and masked identities/images per dataset; the CelebA [26] row lists 10177 identities.)

We use two approaches to create masked faces for training and testing: (1) use existing large scale face datasets and add a digital mask synthetically; (2) collect a small scale dataset of masked and unmasked images from volunteers for validation. We follow the process proposed by Ngan et al. [32] to draw a digital mask on top of a facial image. First, we detect the frontal face bounding box using the face detector from [20]. After cropping the face, we use the facial key point predictor from [20] to locate 68 facial key points. A synthetic mask shape is created by taking the convex hull of a selected subset of these key points. The intermediate steps of this process are depicted in Fig. 1, and a code sketch of the procedure is given at the end of this subsection. All steps are reproducible for any data using scripts we open-source. Masking is verified by performing landmark detection on the resultant masked images. Masked images for which a facial bounding box is not detected are discarded, so that face detection workflows can still operate on the retained images. Since the training data was created by adding a digital mask over an unmasked face, we also collect a real dataset with masked and unmasked images for each identity. This collection is done on a voluntary basis: participants are shown an example pair of images and asked to capture themselves using the front camera of their mobile device. This creates a challenging dataset with varying lighting conditions, indoor/outdoor environments, different mask types and different camera qualities. Therefore, this validation dataset gives a good indication of how robust and generalizable our models are. An example image pair is shown in Fig. 2.
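Below is a minimal sketch of the synthetic masking step, assuming dlib's frontal face detector and 68-point shape predictor [20] together with OpenCV for drawing; the exact landmark subset, mask colour and predictor file used in our released scripts may differ.

# Sketch of the synthetic masking step (assumes dlib's 68-point shape
# predictor; the landmark subset and mask colour in the released scripts
# may differ).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def add_synthetic_mask(image_bgr):
    """Return a copy of the image with a filled convex-hull mask, or None."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None  # no frontal face detected; such images are discarded
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)],
                   dtype=np.int32)
    # Jawline points plus a point near the nose bridge approximate the
    # region a surgical mask would cover.
    mask_pts = np.vstack([pts[2:15], pts[28:29]])
    hull = cv2.convexHull(mask_pts)
    masked = image_bgr.copy()
    cv2.fillConvexPoly(masked, hull, color=(255, 255, 255))
    return masked

The same landmark step doubles as the verification pass: if the masked output yields no detectable face bounding box, the image is dropped, as described above.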
Table 2 gives a summary of the datasets which were created and collected in this study. We use CelebA, LFW, YouTube Faces and SoF, each with a train/test split, for training and testing the model. We keep the FEI Face, Georgia Tech and in-house datasets as holdout sets to validate our models' generalizability.

We use a Siamese network with shared weights as the basis of all training workflows. The embedding outputs from standard model architectures (ResNet, VGG, MobileNet, etc.) are used to compute the distance between an image pair (masked vs unmasked). This is then fed into an intermediate fully connected layer with sigmoid activation, which is connected to a final output with linear activation. Training is done using binary cross-entropy, and similarity is measured at inference time using one of three methods. Where the distance between two vectors is used as a measure of dissimilarity, we convert it to a similarity score:
• Similarity at the output level - the output is passed through a sigmoid at inference time only, to scale it to [0, 1].
• Similarity as a function of L2 distance at the intermediate fully connected level (generally 512 nodes with sigmoid activation).
• Similarity as a function of L2 distance at the bottleneck/embedding layer of the backbone architecture (for example, a 2048-dimensional vector in ResNet50).
Figure 4 in Section 4.2 has an example characteristic response curve for these options. As we use a Siamese network based approach for training our feature extractor, we create pairs of images for training. Each pair corresponds to an unmasked reference and a masked probe image. The network outputs a similarity in [0, 1], with 0 indicating an imposter and 1 indicating an authentic match. Since the absolute difference is taken between embeddings from a shared-weight Siamese network, the ordering of masked/unmasked images as reference and probe has no effect on the final similarity scores. Figure 3 contains a high level overview of the architecture, and a minimal code sketch of this verification head is given at the end of this subsection.

The training workflow primarily proceeded as follows:
• Use a pretrained representation to build a model on a single dataset.
• Finetune the built model on multiple datasets to generalize the feature embedding.
• Further finetune based on identifying hard negative pairs during training.
Training was carried out on image pairs drawn at random from the training set of identities. The shared-weight Siamese formulation mentioned above was used, with an additional linear layer connected to a sigmoid activation function operating on the L2 distance between the embeddings of each image. Binary cross-entropy was used as the training loss. Pretrained representations were obtained through several means: many were taken from facial recognition/detection tasks in the existing literature, and custom representations were also generated and tested using unsupervised learning. Finetuning was carried out by freezing parameter updates for part of the network and using validation results as an indication of embedding improvement. A custom representation was generated using MoCoV2 with the following training parameters (recommended in [5]): learning rate 0.015, batch size 128 and MoCo softmax temperature 0.2, while using the projection head and augmentation workflow introduced in SimCLR. Pre-training was carried out on a 4-GPU node on Spartan [21] for 860 epochs on the CelebA (masked and unmasked) dataset.
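The following is a minimal PyTorch sketch of the shared-weight Siamese verification head described above. The class and variable names are illustrative, and in our setup the ResNet50 backbone would be initialised from the MoCoV2-pretrained weights rather than trained from scratch.

# Minimal sketch of the shared-weight Siamese verification head (illustrative
# names; in practice the backbone is initialised from the MoCoV2-pretrained
# ResNet50 rather than from scratch).
import torch
import torch.nn as nn
import torchvision.models as models

class SiameseVerifier(nn.Module):
    def __init__(self, embed_dim=2048, hidden_dim=512):
        super().__init__()
        backbone = models.resnet50(pretrained=False)
        backbone.fc = nn.Identity()              # expose the 2048-d bottleneck
        self.backbone = backbone                 # shared weights for both images
        self.hidden = nn.Sequential(nn.Linear(embed_dim, hidden_dim),
                                    nn.Sigmoid())
        self.out = nn.Linear(hidden_dim, 1)      # linear output layer

    def forward(self, reference, probe):
        e_ref = self.backbone(reference)         # unmasked reference embedding
        e_probe = self.backbone(probe)           # masked probe embedding
        diff = torch.abs(e_ref - e_probe)        # order-invariant difference
        return self.out(self.hidden(diff))       # similarity logit

model = SiameseVerifier()
criterion = nn.BCEWithLogitsLoss()               # binary cross-entropy; labels are
                                                 # 1 for authentic, 0 for imposter pairs

At inference time, the sigmoid of the output logit gives a similarity in [0, 1]; alternatively, the distance between the 512-dimensional or 2048-dimensional activations can be converted to a similarity, as listed in the three options above.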
The resultant representation was finetuned end-to-end on 4 datasets (CelebA, LFW, SoF and YouTube Faces, masked and unmasked) combined, for another 25 epochs, by continuing the pretraining process. This exposes the representation to easier negatives (cross-dataset) and provides more data variety for the pretraining process. Several approaches were considered in this regard: for example, given that most of the facial features being matched lie outside the masked region, it could be argued that using only masked images would be sufficient for representation learning. However, since the reference image is unmasked and the level of occlusion of facial features depends on many factors, such as the type of mask and whether it is worn properly, we decided to include both masked and unmasked images when learning features. The primary impact this has on the pretraining workflow is that the model gains contextual knowledge regarding both masked and unmasked images and learns features that distinguish between the two types. This stems from the fact that the pretraining workflow is focused on instance discrimination. Additionally, this creates a representation which should theoretically be useful for extending to other tasks. We share all representations (initial and final) used in this paper for further research in this area.

Validation was done using a precision metric. From the validation set of identities, a single identity (unmasked) is chosen as the reference, and a masked image is drawn from the same identity, forming an "authentic pair". 19 identities are drawn uniformly at random with replacement from the set of available identities excluding the reference identity. From these "imposter" identities, one image is drawn per identity uniformly at random, forming 1 authentic pair and 19 imposter pairs, following the workflow in [22]. Evaluation on 20 such pairs counts as one validation step, and 400 such steps are conducted at the end of each training iteration. Precision over the iteration is counted as the percentage of steps where the authentic image pair has the highest similarity (out of the 20 possible pairs); a code sketch of one validation step is given at the end of this subsection. Training iterations which produce a checkpoint with at least 90% validation precision were chosen for further evaluation on holdout datasets. Note that the expected precision from a random prediction would be 5% in this case (1/20). The similarity in this experiment is always inferred from the final linear layer. Due to the quadratic scaling of imposter pairs (on the order of n^2, where n is the number of identities), this setup allows us to use more identities in training with fewer used for validation.

Our models output a similarity score for a given masked and unmasked image pair. Therefore, the decision outcome of the system depends on a threshold value:

if similarity(reference, probe) >= threshold: accept as legitimate user. (1)

With this setup, there is a trade-off between false accepts and false rejects as we alter the threshold value. Therefore, the evaluations are done by measuring the following metrics. We reserve several datasets for the purpose of holdout testing. We generate an equal number of authentic and imposter pairs randomly and fix them for evaluation. These lists are released as part of our benchmark for one-to-one comparisons in future studies.
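The following is a minimal sketch of the validation precision computation described above, assuming a hypothetical identities structure mapping each identity to its lists of masked and unmasked images, and a similarity() function returning the model's final-layer score for an image pair.

# Sketch of the rank-1 validation precision described above (hypothetical
# data layout: identities[id] holds lists of "masked" and "unmasked" images;
# similarity() is the model's final-layer score for an image pair).
import random

def validation_precision(identities, similarity, n_imposters=19, n_steps=400):
    """Fraction of steps in which the authentic pair scores highest of 20 pairs."""
    hits = 0
    ids = list(identities)
    for _ in range(n_steps):
        ref_id = random.choice(ids)
        reference = random.choice(identities[ref_id]["unmasked"])
        authentic_probe = random.choice(identities[ref_id]["masked"])
        others = [i for i in ids if i != ref_id]
        imposter_ids = random.choices(others, k=n_imposters)   # with replacement
        imposter_probes = [random.choice(identities[i]["masked"])
                           for i in imposter_ids]
        best_imposter = max(similarity(reference, p) for p in imposter_probes)
        if similarity(reference, authentic_probe) > best_imposter:
            hits += 1
    return hits / n_steps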
We performed an experiment in order to select a model backbone architecture as well as a training workflow. We evaluated several different models using the training approaches mentioned previously, built with different pretrained representations as starting points for training. Similarity was measured directly at the final sigmoid layer of the output. Our objective was to identify the smallest model capable of generalizing good results over multiple datasets, while also exploring the merits of training a new baseline representation useful for masked classification tasks. We select the best model checkpoint based on validation performance from each training workflow and compare them by performing inference on a comprehensive set of holdout datasets. We propose this as a benchmark for learning effective masked facial representations from a single dataset using masked/unmasked images, while evaluating on multiple holdout datasets. This benchmark captures the capacity of a particular model training process to generate dataset-independent representations suitable for use in the wild. CelebA is uniquely suited for training purposes as it contains more within-identity variation in age, hair style, pose and emotion. This is important in unmasked-masked identification, as methods need to be robust to changes in all these factors (reference unmasked images are often used for a while before being recaptured).

In this benchmark we explore the capacity of a training workflow to utilize multiple datasets in order to learn a feature representation, with a focus on following a general workflow that can be useful in other tasks involving one-shot learning. The selected workflow from the previous experiment was used to finetune the pretrained model using 4 datasets: CelebA, LFW, sof_original and youtube. First, we freeze 50% of the ResNet50 and use Stochastic Gradient Descent with a learning rate of 1.0 to learn an overall strong representation. Due to training on 4 datasets at once and freezing a large part of the network, overfitting is avoided. The high learning rate allows the training process to explore the hypothesis space quickly and possibly causes gradient descent to bounce out of local minima, while our validation precision metric serves to identify checkpoints which have learnt a strong feature embedding. At this stage, we filter checkpoints with precision 90% or higher for further evaluation. Imposter pairs are drawn from within the same dataset (to minimize the model's focus on image background features). The dataset to draw a particular pair from is selected as a categorical variable with selection probability directly proportional to the size of each dataset. All datasets not used for training and validation were used as holdout datasets. We isolate the two most promising checkpoints, i.e. the two with the highest validation precision, for further finetuning (CP1 and CP2). These models are further finetuned with a low learning rate across a grid of parameter settings to make further improvements to prediction performance. The three best performing versions are used to create a simple averaging ensemble, which averages the similarity of each individual model at inference time.

We additionally incorporate training on harder imposter pairs. For this we first draw an identity to serve as the reference, and then draw a number of imposter images from the same dataset. Inference is carried out on these images to identify the pair that is hardest for the model to classify as an imposter pair (i.e. the pair with the highest similarity), and training is then carried out on this pair.
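The following is a minimal sketch of this hard-imposter mining step, assuming it runs inside the pair-sampling logic of the data loader; the dataset accessors, model_similarity() function and candidate pool size are illustrative assumptions rather than names from our released code.

# Sketch of hard-imposter mining inside pair sampling (illustrative names;
# model_similarity() is the current model's similarity score in [0, 1], and
# the dataset accessors and candidate pool size are assumptions).
import random

def sample_hard_imposter_pair(dataset, model_similarity, n_candidates=16):
    """Return the imposter pair the current model is most likely to accept."""
    ref_id = random.choice(dataset.identities)
    reference = random.choice(dataset.unmasked_images(ref_id))
    candidates = []
    while len(candidates) < n_candidates:
        other_id = random.choice(dataset.identities)
        if other_id != ref_id:                       # imposters only
            candidates.append(random.choice(dataset.masked_images(other_id)))
    # The hardest imposter is the candidate with the highest similarity.
    hardest = max(candidates, key=lambda img: model_similarity(reference, img))
    return reference, hardest, 0.0                   # label 0.0 = imposter pair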
As this happens during the data loading within the training pipeline, it considerably slows down the training process, but it provides a way for the model to learn on the imposter pairs it is currently most likely to misclassify as authentic. For combining datasets during training, we use two main strategies for drawing training pairs. Uniform sampling refers to drawing a dataset with probability proportional to the dataset size (thus representing one large dataset as opposed to 4 individual datasets, but without drawing imposter pairs from different datasets). Stratified sampling refers to drawing the dataset to sample from uniformly at random, thereby ensuring that each dataset is represented equally. We primarily use uniform sampling for model exploration and stratified sampling for model finetuning. We combine FT1, FT2 and FT3 as a simple ensemble to derive the final benchmark. Inference is performed on each individual image pair by each FT model, and the similarities are averaged to obtain the ensemble score. The results are summarized in Table 4.

The results of using the different inference strategies from Section 3.2 can be seen in Figure 4. We find that deriving similarity at the 2048 or 512 levels results in a better spread of the values compared to applying a sigmoid function to the final output. Comparing the 512 and 2048 levels, we see that while the equal error rate is similar, the FRR of the 2048-dimensional features is lower than that of the 512-dimensional features. Therefore, we use the 2048 bottleneck features for future experiments. Table 3 has the parameters for each model. The relative difficulty of different datasets can be visualized using Figure 4. The overall benchmark consisting of single-dataset and multi-dataset based training is presented in Table 6. We include the FRR100 metric, which is more important for intruder detection in practical situations.

In this work we have presented techniques for synthetic data generation and analysis for masked facial recognition. We use a general framework for benchmarking that builds upon contrastive representation learning without specializing any methodology for facial analysis (beyond using facial data). We present 2 benchmarks on masked recognition across multiple synthetic datasets in an easily reproducible manner to facilitate further research in this area. Our experiments show that using custom masking is an efficient way of creating datasets for training mask-related models. We find that existing pretrained facial models appear to not be disentangled to the level where retraining them for use with masked images is straightforward. We show that it is better to use contrastive representation learning to build an initial representation and then adapt it to learn the required facial task. We hypothesize that this is because existing identity/facial features use combinations of facial appendages; thus, when several of them (nose, mouth, cheeks, lips, chin) are removed, the representations are unable to recover easily using simple fine-tuning. By training a fresh representation, it is possible to circumvent this issue of disentangling an existing representation, by instead training the neural network to focus on what is present in the images rather than learning to ignore what is absent. While previous work focuses solely on masked recognition, unmasked-masked recognition has several important use cases, as standard identification protocols involve using unmasked base images.
For example, all passports are taken with the face uncovered and all facial features visible, and the additional information that can be derived from a fully visible face can be useful for recognition with partially, incorrectly or fully masked faces. The datasets we synthesize and release provide the first publicly available and usable source of data conducive to this problem.

References
[1] AFIF4: Deep Gender Classification based on AdaBoost-based Fusion of Isolated Facial Features and Foggy Faces
[2] How to Correctly Detect Face-Masks for COVID-19 from Visual Information?
[3] MaskHunter: real-time object detection of face masks during the COVID-19 pandemic
[4] A Simple Framework for Contrastive Learning of Visual Representations
[5] Improved Baselines with Momentum Contrastive Learning, 2020
[6] Masked Face Recognition: Human vs. Machine
[7] The effect of wearing a mask on face recognition performance: an exploratory study
[8] Masked Face Recognition with Latent Part Detection
[9] Unsupervised Visual Representation Learning by Context Prediction
[10] Discriminative Unsupervised Feature Learning with Convolutional Neural Networks
[11] Towards NIR-VIS Masked Face Recognition
[12] Implementation of principal component analysis on masked and non-masked face recognition
[13] Masked Face Recognition with Generative Data Augmentation and Domain Constrained Ranking
[14] Georgia Tech Face Database
[15] Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
[16] Efficient Masked Face Recognition Method during the COVID-19 Pandemic
[17] Momentum Contrast for Unsupervised Visual Representation Learning
[18] Robust partial face recognition using instance-to-class distance
[19] Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments
[20] Dlib-ml: A Machine Learning Toolkit
[21] Spartan performance and flexibility: An HPC-cloud chimera
[22] One shot learning of simple visual concepts
[23] Look Through Masks: Towards Masked Face Recognition with De-Occlusion Distillation
[24] Partial face recognition: Alignment-free approach
[25] Masked face detection via a modified LeNet
[26] Deep Learning Face Attributes in the Wild
[27] A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic
[28] Fighting against COVID-19: A novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection
[29] Masked Face Recognition using ResNet-50
[30] Improving the recognition of faces occluded by facial accessories
[31] Boosting Masked Face Recognition with Multi-Task ArcFace
[32] Ongoing Face Recognition Vendor Test (FRVT) Part 6B: Face recognition accuracy with face masks using post-COVID-19 algorithms
[33] Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
[34] Representation Learning by Learning to Count
[35] Occlusion invariant face recognition using selective local non-negative matrix factorization basis images
[36] Control The COVID-19 Pandemic: Face Mask Detection Using Transfer Learning
[37] Partially occluded facial image retrieval based on a similarity measurement
[38] Thor: A Deep Learning Approach for Face Mask Detection to Prevent the COVID-19 Pandemic
[39] Occlusion robust face recognition based on mask learning with pairwise differential siamese network
[40] A new ranking method for principal components analysis and its application to face image analysis
[41] Masked face recognition dataset and application
[42] Robust point set matching for partial face recognition
[43] Face recognition in unconstrained videos with matched background similarity
[44] Colorful image colorization