RetinaFaceMask: A Single Stage Face Mask Detector for Assisting Control of the COVID-19 Pandemic

Xinqi Fan, Mingjie Jiang

2020-05-08

Abstract: Coronavirus disease 2019 has had a significant impact on the world. One effective strategy for preventing infection is to wear masks in public places, and certain public service providers require clients to wear masks properly before using their services. There are, however, only a few research studies on automatic face mask detection. In this paper, we propose RetinaFaceMask, the first high-performance single stage face mask detector. First, to address the issue that existing studies do not distinguish between correct and incorrect mask wearing states, we established a new dataset containing these annotations. Second, we propose a context attention module to focus on learning discriminative features associated with face mask wearing states. Third, we transfer knowledge from the face detection task, inspired by how humans improve their abilities by learning from similar tasks. Ablation studies show the advantages of the proposed model. Experimental results on both a public dataset and our new dataset demonstrate the state-of-the-art performance of our model.

I. INTRODUCTION

According to the World Health Organization (WHO), coronavirus disease 2019 (COVID-19) had infected over 79.2 million individuals and caused over 1.7 million fatalities by the end of 2020 [1]. Numerous computer-assisted approaches have been developed to aid in the fight against COVID-19, including automatic detection of COVID-19 cases from X-ray or computed tomography (CT) images [2], [3], COVID-19 trend prediction [4], and analysis of human reactions to COVID-19 [5]. It is, however, even more critical for individuals to protect themselves from the virus. Fortunately, one study [6] demonstrated that surgical face masks can help limit coronavirus dissemination. At the moment, the WHO recommends that people wear face masks if they have respiratory symptoms or are caring for someone who does [7]. Additionally, several public service providers require users to use their services only while wearing masks [8]. Automatic face mask detection has therefore emerged as a critical computer vision task for assisting the worldwide community, but research on it is limited.

Face mask detection entails both the localization of faces and the recognition of mask wearing states, which we broadly define as no mask wearing and mask wearing. Due to healthcare requirements, we further divide mask wearing into correct and incorrect mask wearing states. In one respect, the face mask detection problem is similar to face detection [9], as localizing the face is a critical subtask. From another perspective, the problem is closely related to general object detection [10], where each state can be treated as a distinct class. As shown in Fig. 1, the challenges of face mask detection include a variety of in-the-wild situations with complex backgrounds, unmasked faces obscured by other objects that are easily confused with masked faces, a variety of mask types with different shapes and colors, and incorrect mask wearing cases.

Typically, traditional object detectors are built on handcrafted feature extractors.
The Viola-Jones detector utilized Haar features in conjunction with the integral image approach [11], whilst other studies utilized a variety of feature extractors, including the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), and others [12]. Recently, object detectors based on deep learning have demonstrated superior performance and have dominated the development of new object detectors. Without relying on prior knowledge to construct feature extractors, deep learning can learn features in an end-to-end manner [13]. There are two types of deep learning based object detectors: one-stage and two-stage detectors. One-stage detectors, such as you only look once (YOLO) [14] and the single shot detector (SSD) [15], detect objects using a single neural network; an advantage of SSD is that it detects objects using multi-scale feature maps. By contrast, two-stage detectors, such as the region-based convolutional neural network (R-CNN) [16] and Faster R-CNN [17], employ two networks to conduct coarse-to-fine detection. RetinaFace [18], a dedicated face detector, used a multi-scale detection architecture similar to SSD but included a feature pyramid network (FPN) to fuse high- and low-level semantic information and increase detection performance.

In addition, several approaches to face mask detection have been developed. Chronologically, the initial version of this work, RetinaFaceMask (also known as RetinaMask), can be considered the first attempt to address the face mask detection task. Li et al. [19] increased the robustness of face mask detection by implementing mix-up and multi-scale techniques based on YOLOv3. To enhance the post-processing of YOLOv3 for face mask detection, a distance intersection over union (IoU) non-maximum suppression (NMS) approach was utilized [20]. However, these algorithms either ignore mask wearing states that occur in real healthcare applications or report performance only on limited datasets.

In this paper, we propose a novel single stage face mask detector, RetinaFaceMask, which is able to detect face masks and contribute to public healthcare. We make the following contributions in this study:

• By reannotating the existing MAsked FAces (MAFA) dataset used for masked face analysis, we created a new dataset, MAsked FAces for Face Mask Detection (MAFA-FMD). The new annotation includes three distinct mask wearing states: no mask wearing, correct mask wearing, and incorrect mask wearing, which is more realistic in terms of contributing to public health. MAFA-FMD contains around 56,000 annotations.

• To focus on learning discriminative features associated with face mask wearing states, we propose a novel context attention module (CAM). The module extracts more useful context features and concentrates on those that are critical for face mask wearing states.

• Inspired by how humans enhance their skills via knowledge gained from other tasks, we use transfer learning (TL) to transfer the knowledge learned from the face detection task. Experimentally, we demonstrate that face detection and face mask detection are highly correlated, and that the features learned for the former are useful for the latter. Ablation studies show the effectiveness of the CAM and TL, as they boost the mean average precision (mAP) by a large margin.
Experimental results on the public dataset AIZOO demonstrated that RetinaFaceMask achieves state-of-the-art results, with a 4% increase in mAP over the baseline method. RetinaFaceMask also had the best performance on the MAFA-FMD dataset, which contains three distinct mask wearing states and is considerably more difficult.

The remainder of this paper is structured as follows. Section II illustrates the established dataset. Section III presents the proposed RetinaFaceMask. Section IV discusses the datasets used, the experiment settings, the results, and the discussion. Finally, Section V concludes the paper and outlines future work.

II. THE MAFA-FMD DATASET

Ge et al. prepared the original MAFA dataset from the Internet using the Flickr, Google, and Bing search engines [21]. The dataset contains 35,806 images with a minimum side length of 80 pixels. The annotations of the dataset include the locations of faces, mask types, etc. Each image was annotated by two individuals and verified by another. More details of MAFA can be found in [21].

However, the original MAFA annotations do not address the requirements of face mask detection in healthcare settings. Therefore, we relabelled the MAFA dataset with three different mask wearing states, "no mask wearing", "correct mask wearing", and "incorrect mask wearing", and named it MAFA-FMD. The relabeling procedure is as follows (a minimal sketch of the label mapping is given at the end of this section). First, we generated reference annotations from the original annotations: we kept all box annotations, mapped the "simple" and "complex" mask types to the correct mask wearing state, and mapped the "body" and "hybrid" mask types to the no mask wearing state. Second, we applied a RetinaFaceMask model trained on the AIZOO dataset to run inference on MAFA and recorded all predictions as another reference. Finally, three trained annotators manually revised all reference box coordinates and class annotations, and used LabelImg to label new faces as well.

When identifying masks, we considered disposable medical masks, medical surgical masks, medical protective masks, dust masks, gas masks, and respirators as valid masks. Cloth masks were also regarded as valid, since they are also advised by the Centers for Disease Control and Prevention (CDC) [22]. Masks that do not completely enclose the mouth and nose were deemed invalid; for example, wearers of traditional Chinese veils were considered no mask wearing cases, despite the fact that such veils resemble some forms of masks.

The major differences between the original MAFA and MAFA-FMD are summarized in Table I. In terms of the total number of annotated faces, MAFA contains 39,485 annotated faces, while MAFA-FMD has 56,084, around 16,000 more than MAFA. Regarding face types, MAFA does not annotate faces without masks, whereas MAFA-FMD contains both masked and unmasked faces. In addition, the mask types have been reclassified into the mask wearing states "no mask wearing", "correct mask wearing", and "incorrect mask wearing", with 26,463, 28,233, and 1,388 annotations per class, respectively. This imbalanced label distribution reveals a long-tailed problem for this in-the-wild dataset. Furthermore, MAFA-FMD includes blurred faces, which were not covered by the original MAFA annotation; the number of low resolution annotations (smaller than 32 × 32 pixels) has increased from approximately 1,000 in MAFA to more than 4,000 in MAFA-FMD.
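For illustration, the sketch below shows the mask-type-to-state mapping used in the first relabeling step. The record layout and field names are hypothetical and do not reflect the real MAFA annotation schema; only the type-to-state mapping itself is taken from the procedure above.

```python
# Minimal sketch of the MAFA -> MAFA-FMD reference-annotation mapping.
# Field names ("box", "mask_type") are illustrative, not the real MAFA schema.
MASK_TYPE_TO_STATE = {
    "simple":  "correct mask wearing",
    "complex": "correct mask wearing",
    "body":    "no mask wearing",
    "hybrid":  "no mask wearing",
}

def to_reference_annotation(mafa_face: dict) -> dict:
    """Convert one original MAFA face record into a MAFA-FMD reference annotation."""
    return {
        "box": mafa_face["box"],  # box coordinates are kept unchanged
        "state": MASK_TYPE_TO_STATE[mafa_face["mask_type"]],
    }

# Example:
# to_reference_annotation({"box": [10, 20, 64, 64], "mask_type": "simple"})
# -> {"box": [10, 20, 64, 64], "state": "correct mask wearing"}
```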
III. THE PROPOSED RETINAFACEMASK

The architecture of the proposed RetinaFaceMask is shown in Fig. 2. To cope with the diverse scenes in face mask detection, a strong feature extraction network, ResNet50, is used as the backbone. $C_1$, $C_2$, $C_3$, $C_4$, and $C_5$ denote the intermediate output feature maps of the backbone layers conv1, conv2_x, conv3_x, conv4_x, and conv5_x in the original ResNet50 [23]. These feature maps are generated by convolutions with distinct receptive fields, allowing for the detection of objects of varying sizes. At this point, we have established the general structure of our multi-scale detection model. However, one disadvantage of shallow layers is that their outputs lack sufficient high-level semantic information, which might result in poor detection performance. To address this, an FPN is adopted, as follows. First, we apply a 3 × 3 convolution on $C_5$ to obtain $P_5$. Then, we upsample $P_5$ by nearest-neighbor interpolation to the same size as $C_4$, and merge the upsampled $P_5$ and the channel-adjusted $C_4$ with an element-wise addition to obtain $P_4$. Likewise, we obtain $P_3$ from $P_4$ and $C_3$. In addition, we also propose a lightweight version of RetinaFaceMask (RetinaFaceMask-Light) that uses a MobileNetV1 backbone to run efficiently on embedded devices; $C_3$, $C_4$, and $C_5$ for RetinaFaceMask-Light are taken from the last convolution blocks with original output sizes of 28 × 28, 14 × 14, and 7 × 7 in [24].

In comparison to face detection, face mask detection requires both the localization of faces and the discrimination of distinct mask wearing states. To focus on learning more discriminative features for mask wearing states, we propose a CAM, as shown in Fig. 3. First, to enhance context feature extraction, we employ three parallel sub-branches consisting of one, two, and three 3 × 3 convolutions, which correspond to 3 × 3, 5 × 5, and 7 × 7 receptive fields, respectively. Then, inspired by [25], we apply channel and spatial attention to focus on both the channel-wise and the spatially important features associated with face mask wearing states. The channel attention map for an input $P \in \mathbb{R}^{D \times H \times W}$ is calculated as

$$\Lambda_c = \sigma\big(F_{MLP}(H_{GAP}(P)) + F_{MLP}(H_{GMP}(P))\big),$$

where $\Lambda_c$ is the channel attention map; the sigmoid function $\sigma$ normalizes the output to (0, 1); $F_{MLP}$ denotes a three-layer multi-layer perceptron; and $H_{GAP}$ and $H_{GMP}$ are global average pooling and global maximum pooling, respectively. Similarly, the attention map $\Lambda_s$ yielded by the spatial attention block is

$$\Lambda_s = \sigma\big(K_{3 \times 3} * (H_{CAP}(P) \oplus H_{CMP}(P))\big),$$

where $*$ denotes a 2D convolution; $K_{3 \times 3}$ is a 3 × 3 kernel; $\oplus$ stands for channel concatenation; and $H_{CAP}$ and $H_{CMP}$ are channel average pooling and channel maximum pooling, respectively.
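To make the CAM's data flow concrete, below is a minimal PyTorch sketch under assumptions: the branch width, the MLP reduction ratio, and the multiplicative application of the attention maps (as in CBAM [25]) are illustrative choices rather than the authors' exact configuration, and all class and parameter names are ours.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """CBAM-style channel attention (Lambda_c): a shared MLP over globally
    average- and max-pooled descriptors, combined through a sigmoid."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Three-layer MLP F_MLP, shared by both pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # F_MLP(H_GAP(P))
        mx = self.mlp(x.amax(dim=(2, 3)))    # F_MLP(H_GMP(P))
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale                     # apply Lambda_c multiplicatively


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (Lambda_s): a 3x3 convolution over
    channel-wise average- and max-pooled maps."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)  # K_3x3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # H_CAP(P)
        mx = x.amax(dim=1, keepdim=True)     # H_CMP(P)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                     # apply Lambda_s multiplicatively


class ContextAttentionModule(nn.Module):
    """Sketch of the CAM: three parallel branches of one, two, and three
    stacked 3x3 convolutions, followed by channel and spatial attention.
    The branch width (branch_ch) is an assumption."""

    def __init__(self, in_ch: int, branch_ch: int = 128):
        super().__init__()

        def branch(n_convs: int) -> nn.Sequential:
            layers, ch = [], in_ch
            for _ in range(n_convs):
                layers += [nn.Conv2d(ch, branch_ch, 3, padding=1),
                           nn.ReLU(inplace=True)]
                ch = branch_ch
            return nn.Sequential(*layers)

        self.b1, self.b2, self.b3 = branch(1), branch(2), branch(3)
        self.channel_att = ChannelAttention(3 * branch_ch)
        self.spatial_att = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        return self.spatial_att(self.channel_att(ctx))


# Example: a 256-channel FPN feature map of size 40 x 40.
# cam = ContextAttentionModule(in_ch=256)
# out = cam(torch.randn(1, 256, 40, 40))   # -> shape (1, 384, 40, 40)
```

Stacking two or three 3 × 3 convolutions is what yields the 5 × 5 and 7 × 7 effective receptive fields mentioned above.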
The uncontrolled and diverse in-the-wild scenes make feature learning difficult. One possible solution is to collect and annotate more data for training. In RetinaFaceMask, we instead propose to mimic the human learning process by transferring knowledge from face detection to help face mask detection. According to [26], [27], TL aids feature learning as long as the tasks are correlated. Therefore, in our work, we transfer the knowledge learned on the large-scale face detection dataset Wider Face, which consists of 32,203 images and 393,703 annotated faces [28], to enhance the feature extraction ability for face mask detection.

Our network generates two matrices, the location offsets $\hat{y}_l \in \mathbb{R}^{n_p \times 4}$ and the class probabilities $\hat{y}_c \in \mathbb{R}^{n_p \times n_c}$, where $n_p$ and $n_c$ are the number of anchors and the number of bounding box categories, respectively. The following data are provided: the default anchors $y_{da} \in \mathbb{R}^{n_p \times 4}$, the ground truth bounding boxes $y_l \in \mathbb{R}^{n_o \times 4}$, and the true class labels $y_c \in \mathbb{R}^{n_o \times 1}$, where $n_o$ is the number of objects to be detected and varies between images. To calculate the model's loss, we begin by selecting the top class and computing the offset for each default anchor by matching the default anchors $y_{da}$ against the ground truth boxes $y_l$ and labels $y_c$, which yields the matched matrices $p_{ml} \in \mathbb{R}^{n_p \times 4}$ and $p_{mc} \in \mathbb{R}^{n_p}$, whose rows are the coordinate offsets and the highest-probability labels for each default anchor, respectively. Then, we obtain the positive localization predictions $\hat{y}_l^+ \in \mathbb{R}^{p_+ \times 4}$ and the positive matched default anchors $p_{ml}^+ \in \mathbb{R}^{p_+ \times 4}$ by selecting the foreground boxes, where $p_+$ denotes the number of default anchors with a non-zero top classification label. The smooth-$L_1$ loss $L_{loc}(\hat{y}_l^+, p_{ml}^+)$ is used to perform box coordinate regression. Following that, hard negative mining [29] is performed to obtain the sampled negative default anchors $p_{mc}^- \in \mathbb{R}^{p_-}$ and the corresponding predictions $\hat{y}_c^- \in \mathbb{R}^{p_-}$, where $p_-$ is the number of sampled negative anchors. Finally, we calculate the classification confidence loss as $L_{conf}(\hat{y}_c^-, p_{mc}^-) + L_{conf}(\hat{y}_c^+, p_{mc}^+)$. In summary, the total loss is

$$L = \frac{1}{n_m} \Big( L_{conf}(\hat{y}_c^-, p_{mc}^-) + L_{conf}(\hat{y}_c^+, p_{mc}^+) + \alpha L_{loc}(\hat{y}_l^+, p_{ml}^+) \Big),$$

where $n_m$ is the number of matched default anchors and $\alpha$ is a weight for the localization loss.

In the inference stage, the trained model generates the object localizations $\hat{y}_l \in \mathbb{R}^{n_p \times 4}$ and confidences $\hat{y}_c \in \mathbb{R}^{n_p \times 4}$, where the second column of $\hat{y}_c$, denoted $\hat{y}_n \in \mathbb{R}^{n_p}$, is the probability of the no mask wearing state; the third column, denoted $\hat{y}_{cm} \in \mathbb{R}^{n_p}$, is the confidence of the correct mask wearing state; and the fourth column, denoted $\hat{y}_{im} \in \mathbb{R}^{n_p}$, is the confidence of the incorrect mask wearing state. We remove objects with confidences lower than $t_c$ and perform NMS on boxes with IoUs larger than $t_{nms}$ to obtain the final predictions.
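As an illustration of this post-processing step, the sketch below applies the confidence threshold $t_c$ and per-class NMS with threshold $t_{nms}$, using torchvision's nms operator. Decoding the regressed offsets against the default anchors is omitted, and the threshold values shown are placeholders rather than the authors' settings.

```python
import torch
from torchvision.ops import nms


def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                t_c: float = 0.5, t_nms: float = 0.4):
    """Confidence filtering followed by per-class NMS.

    boxes:  (n_p, 4) decoded box corners (x1, y1, x2, y2); decoding from
            the default anchors is assumed to have happened already.
    scores: (n_p, 4) class confidences; column 0 is assumed to be the
            background class, columns 1..3 the no / correct / incorrect
            mask wearing states, following the paper's column convention.
    """
    detections = []
    for cls in range(1, scores.shape[1]):            # skip background column
        cls_scores = scores[:, cls]
        keep = cls_scores > t_c                      # confidence threshold t_c
        cls_boxes = boxes[keep]
        cls_scores = cls_scores[keep]
        if cls_boxes.numel() == 0:
            continue
        kept = nms(cls_boxes, cls_scores, t_nms)     # drop boxes with IoU > t_nms
        for i in kept.tolist():
            detections.append((cls, cls_scores[i].item(), cls_boxes[i].tolist()))
    return detections
```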
IV. EXPERIMENTS

A. Dataset

1) AIZOO: The AIZOO Face Mask Dataset [30] contains 7,959 images, in which faces are annotated either with or without a mask. The dataset is a composite of the Wider Face [28] and MAFA [21] datasets, with approximately 50% of the data from each. The predefined test set is used.

2) MAFA-FMD: As described in Section II, MAFA-FMD is a reannotated dataset with three classes: "no mask wearing", "correct mask wearing", and "incorrect mask wearing". The original test set split of MAFA is kept.

The model was developed on the PyTorch [31] deep learning framework and trained for 250 epochs with the stochastic gradient descent (SGD) algorithm, using a learning rate of $10^{-3}$ and a momentum of 0.9. An NVIDIA GeForce RTX 2080 Ti GPU was employed. The input image resolution is 840 × 840 for RetinaFaceMask and 640 × 640 for RetinaFaceMask-Light.

We performed an ablation study to evaluate the effectiveness of CAM and TL using RetinaFaceMask on the AIZOO dataset. We used the average precision (AP) of each class and the mean average precision (mAP) as evaluation metrics [32]; AP_N and AP_M are the APs for the no mask wearing and mask wearing states, respectively. The experimental results are summarized in Table II, and the best result was obtained by combining CAM and TL. The following sections discuss the effectiveness of each module.

1) Context Attention Module: Including CAM in the model yielded an increase of around 1% in mAP. In particular, AP for no mask wearing increased from 92.8% to 94.2%, and AP for mask wearing improved from 93.1% to 93.6%. These findings indicate that CAM helps the model focus on the desired face and mask features, which can alleviate the effect of the class imbalance problem.

2) Transfer Learning: To evaluate the performance of TL using face detection knowledge, we added TL to the model. We noticed a considerable rise in mAP, from 93.0% to 94.4%, compared to the baseline. A possible reason is that face detection and face mask detection are highly related, so the features learned for the former are beneficial for the latter.

1) Comparison on AIZOO: As shown in Table III, we compared our model's performance with that of other detectors widely used for face mask detection. SSD is the baseline approach released by the AIZOO dataset's producer [30]. YOLOv3 has been used in numerous face mask detection studies [19], [20]. RetinaFace was also included in the comparison as an efficient face detector. We found that RetinaFaceMask outperforms YOLOv3 and RetinaFace by 1.7% and 1.8%, respectively, and obtains the state-of-the-art result in terms of mAP. Additionally, for the APs with and without masks, RetinaFaceMask achieved the best results. Our lightweight version, RetinaFaceMask-Light, which uses a significantly smaller model, achieved an acceptable mAP of 92.0%; note that the number of parameters in RetinaFaceMask-Light is much smaller than in the other models. We also show some qualitative AIZOO results in Fig. 4(a). As seen in the first and fourth images, the model is robust to confusing mask-like objects. In the second and third images, masked faces were correctly detected. In the last image, one small infant face was missed; a probable explanation is that the training dataset lacks small faces, and hence the model does not learn a good representation for such faces.

2) Comparison on MAFA-FMD: We also compared methods on the MAFA-FMD dataset, adding the evaluation metrics AP_CM for correct mask wearing and AP_IM for incorrect mask wearing. Since we only annotated masks that can protect humans in healthcare settings as valid, some masks that do not enclose the face are labelled as no mask wearing; this may increase the difficulty of learning, because such cases are hard to distinguish. In addition, the three-class task is likely to be harder than the two-class task. Nevertheless, our method still achieved state-of-the-art performance on mAP and on the APs of the individual classes, as shown in Table IV. Compared to the second best method, RetinaFace, we obtained an improvement of around 2% in mAP. However, our lightweight version RetinaFaceMask-Light only obtained a 59.8% mAP, possibly because light, shallow models struggle to learn enough useful features. Fig. 4(b) illustrates some qualitative findings on the MAFA-FMD dataset. In comparison to the second AIZOO image in Fig. 4(a), the model trained on our reannotated dataset correctly discriminates between correct and incorrect mask wearing cases, as demonstrated by the first three images. Additionally, the MAFA-FMD trained model captures some small or blurred faces. However, occasional failures occur when the face is occluded by a person or object.

V. CONCLUSION

In this paper, we proposed a novel single stage face mask detector, RetinaFaceMask, and made the following contributions.
First, we created a new face mask detection dataset, MAFA-FMD, with a more realistic and informative classification of mask wearing states. Second, we proposed a new attention module, CAM, dedicated to learning discriminative features associated with face mask wearing states. Third, we emulated humans' ability to transfer knowledge from the face detection task to improve face mask detection. The proposed method achieved state-of-the-art results on the public face mask dataset as well as on our new dataset; in particular, we improved the mAP by 4% over the baseline method on the AIZOO dataset. We therefore believe our method can benefit both the emerging field of face mask detection and public healthcare efforts to combat the spread of COVID-19. Future work may include tackling occlusions and small faces in face mask detection.

ACKNOWLEDGMENT

The authors thank Prof. H. Yan for valuable discussions.

REFERENCES

[1] Coronavirus disease 2019 (COVID-19) weekly epidemiological update - 29
[2] A deep Bayesian ensembling framework for COVID-19 detection using chest CT images
[3] An uncertainty-aware transfer learning-based framework for COVID-19 diagnosis
[4] A comparative study of predictive machine learning algorithms for COVID-19 trends and analysis
[5] Understanding global reaction to the recent outbreaks of COVID-19: Insights from Instagram data analysis
[6] Face masks effectively limit the probability of SARS-CoV-2 transmission
[7] Rational use of face masks in the COVID-19 pandemic
[8] Transmission dynamics of the COVID-19 outbreak and effectiveness of government interventions: A data-driven analysis
[9] Face detection techniques: A review
[10] Object detection with deep learning: A review
[11] Rapid object detection using a boosted cascade of simple features
[12] A discriminatively trained, multiscale, deformable part model
[13] Deep learning for generic object detection: A survey
[14] You only look once: Unified, real-time object detection
[15] SSD: Single shot multibox detector
[16] Rich feature hierarchies for accurate object detection and semantic segmentation
[17] Faster R-CNN: Towards real-time object detection with region proposal networks
[18] RetinaFace: Single-shot multi-level face localisation in the wild
[19] Robust deep learning method to detect face masks
[20] Mask wearing detection based on YOLOv3
[21] Detecting masked faces in the wild with LLE-CNNs
[22] Types of masks
[23] Deep residual learning for image recognition
[24] MobileNets: Efficient convolutional neural networks for mobile vision applications
[25] CBAM: Convolutional block attention module
[26] Taskonomy: Disentangling task transfer learning
[27] Hybrid separable convolutional inception residual network for human facial expression recognition
[28] Wider Face: A face detection benchmark
[29] Training region-based object detectors with online hard example mining
[30] Detect faces and determine whether people are wearing mask
[31] PyTorch: An imperative style, high-performance deep learning library
[32] A survey on performance metrics for object-detection algorithms
[33] YOLOv3: An incremental improvement