key: cord-0636341-zmjzvihk
authors: Zhu, Jiayi; Guo, Qing; Juefei-Xu, Felix; Huang, Yihao; Liu, Yang; Pu, Geguang
title: Masked Faces with Faced Masks
date: 2022-01-17
journal: nan
DOI: nan
sha: 14529abeb59eb28ee9680a5b958d171589c595d3
doc_id: 636341
cord_uid: zmjzvihk

Modern face recognition systems (FRS) still fall short when the subjects are wearing facial masks, a common theme in the age of respiratory pandemics. An intuitive partial remedy is to add a mask detector that flags masked faces so that the FRS can act accordingly on those low-confidence inputs. In this work, we investigate the potential vulnerability of such an FRS equipped with a mask detector on large-scale masked faces, which might pose a serious risk, e.g., letting a suspect evade the FRS because neither the facial identity nor the mask is detected. As existing face recognizers and mask detectors perform well in their respective tasks, it is significantly challenging to fool both simultaneously while preserving the transferability of the attack. We formulate the new task as the generation of realistic and adversarial faced masks and make three main contributions: First, we study the naive Delaunay-based masking method (DM) to simulate the process of wearing a faced mask that is cropped from a template image, which reveals the main challenges of this new task. Second, we equip the DM with an adversarial noise attack and propose the adversarial noise Delaunay-based masking method (AdvNoise-DM), which fools face recognition and mask detection effectively but makes the face less natural. Third, we propose the adversarial filtering Delaunay-based masking method, denoted as MF2M, which applies adversarial filtering to AdvNoise-DM and obtains more natural faces. The final version not only significantly degrades the performance of state-of-the-art (SOTA) deep learning-based FRS, but also remains undetected by the SOTA facial mask detector, thus fooling both systems at the same time.

Figure 1: (a) shows a commercial off-the-shelf (COTS) faced mask and how it looks when worn. (b) illustrates the security problem we explore. Faces wearing solid-color masks are recognized as their original identities in most cases and are easily detected by a mask detector (the upper arrow). The proposed MF2M extracts the face information in the area surrounded by the red line of the template image to obtain a "faced mask", which can simultaneously deceive face recognizers and evade mask detection (the lower arrow).

We conduct extensive white-box and black-box experiments on three FRS and a facial mask detector.
We use the datasets in MegaFace Challenge 1 and evaluate face recognition, face verification, and mask detection, comparing against the solid-colored masking method and seven SOTA adversarial attacks. Moreover, we set up physical experiments by printing the adversarial faces in the real world and recapturing them to fool the face recognizer and the mask detector, which demonstrates the high generalizability of our method. Overall, the proposed method unveils, for the first time, the vulnerability of the FRS when dealing with masked faces wearing faced masks.

During the COVID-19 pandemic, people are required to wear facial masks in public, especially in crowded places like airports. This situation poses a huge challenge for face recognition systems (FRS). Although existing face recognition models (e.g., SphereFace [27], CosFace [42], ArcFace [5]) achieve high performance on identity recognition tasks, these models are only reliable for faces captured under good imaging conditions. When faces are heavily obscured (e.g., by facial masks), even the state-of-the-art (SOTA) FRS do not perform satisfactorily, since the information in the masked area is lost. Recently, the National Institute of Standards and Technology (NIST) published a dedicated study confirming that the accuracy of FRS drops sharply on masked faces [33].

An indirect way to address this problem is mask detection. Once a facial mask is detected, the inspector can be made aware that the current face recognition result is unreliable and respond accordingly. However, commercial off-the-shelf (COTS) facial masks come in various styles, including faced masks (i.e., masks printed with the lower half of a celebrity's face), as shown in Fig. 1(a). Such COTS faced masks greatly confuse existing mask detectors, because these detectors only consider solid-colored masks (e.g., common surgical masks) during training. Existing mask detectors therefore cannot deal with masks that have special textures and complex patterns. It is worrying that potential offenders may wear such COTS faced masks, and even apply special treatment to them, to maliciously hide their identities and avoid mask detection at the same time. To explore this security problem, we simulate the process of manufacturing masks with face patterns and propose "faced mask" approaches. Our approaches attack both the FRS and the mask detector, exposing their weaknesses under this specific multitasking attack.

In this paper, we propose an adversarial filtering Delaunay-based masking method, denoted as Masked Faces with Faced Masks (MF2M), to stealthily generate masks with face patterns. The resulting faced masks not only significantly reduce the accuracy of two SOTA deep learning-based FRS but also lower the accuracy of a SOTA mask detector by 83.58 percentage points. As shown in Fig. 1(b), faces wearing solid-colored masks are recognized as their original identities in most cases and are easily detected by a mask detector (the upper arrow). The proposed MF2M (the lower arrow) can simultaneously deceive face recognizers and evade mask detection. In particular, we first modify the Delaunay method [31] to simulate the process of wearing masks and propose a Delaunay-based masking method. We replace the lower face of the input image (i.e., the original image in Fig. 1(b)) with the lower face of the desired face image (i.e., the area surrounded by the red line of the template image in Fig. 1(b)). Intuitively, adding adversarial noise to the mask can successfully attack both the face recognition and mask detection systems. However, our experiments show that this approach modifies the pixels strongly. To make up for this deficiency and make images look more natural, we further exploit the advantages of filters and propose the novel filtering-based attack method MF2M.

Since the FRS mainly relies on features of the eye area (also known as the periocular region [16-20]), attacking the FRS by modifying only the lower face is much more difficult than by modifying the upper face. To the best of our knowledge, previous methods all attack the FRS by modifying the upper face area [23, 36, 49]. Our method is the first attempt to attack the FRS by changing only the lower face area. Furthermore, our method (i.e., MF2M) can be used for both white-box and black-box attacks on the SOTA FRS, which reflects its usability and universality.

The contributions are summarized as follows. ❶ We study the naive Delaunay-based masking method (DM) to simulate the process of wearing a faced mask, which reveals the main challenge: merely replacing the lower face does not strongly interfere with the discriminators. ❷ We further equip the DM with an adversarial noise attack and propose the adversarial noise Delaunay-based masking method (AdvNoise-DM), which handles the joint task of fooling face recognition and mask detection effectively. ❸ We propose the adversarial filtering Delaunay-based masking method (denoted as MF2M) by applying adversarial filtering to AdvNoise-DM. This masking method significantly degrades the performance of SOTA deep learning-based face recognizers and the mask detector while preserving the naturalness of the obtained faces. ❹ Our extensive white-box and black-box attack experiments demonstrate the universality and transferability of the proposed MF2M. We then extend it to a physical attack and illustrate the robustness of the proposed masking method.

Face recognition. Face recognition can be divided into closed-set recognition and open-set recognition. For closed-set recognition, the identities in the test set must be included in the training datasets. This task is regarded as a multi-class classification problem and is usually solved by a softmax classifier [4, 34, 38, 39, 45]. Currently, most face recognition research focuses on open-set recognition (i.e., identities used for testing do not exist in the training datasets). A series of algorithms have been proposed to learn an embedding that represents each identity [5, 27, 28, 41, 42, 47, 50]. Those algorithms modify the loss function to maximize inter-class variance and minimize intra-class variance. SphereFace [27] pioneered an angular softmax loss to learn angularly discriminative features. CosFace [42] presented a large margin cosine loss to remove radial variations. ArcFace [5] proposed an additive angular margin loss. Although these methods show high performance on the face recognition task under good imaging conditions, they pay little attention to obscured faces and handle masked face recognition poorly.

To tackle this problem, some researchers added specific modules to adjust existing models and strengthen the recognition performance for masked faces [2, 25, 32]. Recently, the Masked Face Recognition Competition (MFR 2021) [3] was held to promote masked face recognition accuracy. Almost all of the participants used variations of ArcFace as their loss and exploited either real or simulated solid-colored masked face images as part of their training datasets. These masked face recognition methods showed better robustness on faces wearing solid-colored masks. However, they do not fundamentally solve the problem, as they do not consider masks with face patterns and special textures.

Mask detection. Since the outbreak of the COVID-19 pandemic, researchers have paid more attention to masked face detection and related datasets. The Masked Faces (MAFA) dataset [12] is an early dataset for occluded face detection, collected from the Internet and varying in pose angle and occlusion degree. Recently, Wuhan University introduced the Real-world Masked Face Recognition Dataset (RMFRD) and the Simulated Masked Face Recognition Dataset (SMFRD) [46]. These datasets all focus on masks with solid colors and ignore masks with specific textures or complex patterns. To judge whether a mask is present, some mask detection methods fine-tune face detection models [1, 29, 35]. Although these methods perform well on common mask detection, they cannot deal with special masks carrying facial textures, because they do not consider this situation.

Adversarial attack. There is a series of adversarial attacks. The fast gradient sign method (FGSM) [13] first proposes the additive-perturbation-based attack, and the iterative fast gradient sign method (I-FGSM) [24] is an iterative variant of FGSM. The momentum iterative fast gradient sign method (MI-FGSM) [9] then introduces the idea of momentum, which helps to stabilize optimization and escape from poor local maxima during the iterations. Besides, the translation-invariant fast gradient sign method (TI-FGSM) [10] and the diverse inputs iterative fast gradient sign method (DI2-FGSM) [48] are both designed for transferability and apply transformations to the input images at each iteration of the attack. TI-FGSM uses a kernel matrix to simulate the translation of images in different directions, while DI2-FGSM applies random resizing and padding to images with a given probability. More recently, a series of works focus on natural degradation-based adversarial attacks, such as the adversarial morphing attack against face recognition [43], the adversarial relighting attack [11], the adversarial blur attack [14, 15], and the adversarial vignetting attack [40]. These works explore the robustness of deep models by adding natural degradations like motion blur and light variation to the input under different adversarial objective functions. In this work, we regard the faced masks as real-world perturbations.

In order to illustrate the potential hazards of such faced masks, we propose a multi-stage framework to interfere with face recognizers and avoid mask detection simultaneously.

To simulate the process of wearing faced masks, the most intuitive way is to perform an operation similar to face replacement in the specified area. We first apply a Delaunay-based masking method (DM). This method operates on two images: an original image $\mathbf{I} \in \mathbb{R}^{H_1 \times W_1 \times 3}$ on which we want to put the faced mask, and a template image $\mathbf{I}_t \in \mathbb{R}^{H_2 \times W_2 \times 3}$ used to build the faced mask. The template image can be synthetic, constructed by some DeepFake technique (e.g., StyleGAN [21]). We aim to generate a facial mask that contains partial face patterns from $\mathbf{I}_t$ and looks natural when added to $\mathbf{I}$ with the process

$$\mathbf{I}_M = f_1(\mathbf{I}, \mathbf{I}_t), \tag{1}$$

where $\mathbf{I}_M \in \mathbb{R}^{H_1 \times W_1 \times 3}$ is the obtained mask (e.g., the facial mask in Fig. 2). Specifically, we expand the function $f_1(\cdot)$ as

$$f_1(\mathbf{I}, \mathbf{I}_t) = C\big(R\big(T(L(\mathbf{I}_t)),\, T(L(\mathbf{I}))\big)\big), \tag{2}$$

where $L(\cdot)$ extracts the landmarks of $\mathbf{I}_t$ and $\mathbf{I}$ as the first step of Fig. 2(a). The function $T(\cdot)$ builds the triangle-based face representation in which the landmarks serve as vertices of the triangles. The face representations obtained from $\mathbf{I}_t$ and $\mathbf{I}$ have the same number of triangles, and those triangles correspond one to one according to the landmarks. Given this correspondence, $R(\cdot)$ transforms each triangle in $T(L(\mathbf{I}_t))$ into the corresponding triangle in $T(L(\mathbf{I}))$ by an affine transformation to get the full-face mask at the right side of Fig. 2(a). The function $C(\cdot)$ connects the landmarks of the lower-face contour and the landmark of the nose (red dots in the full-face mask) in turn to obtain the faced mask area. As shown at the beginning of Fig. 2(b), after getting the facial mask $\mathbf{I}_M$, we overlay it on the original image $\mathbf{I}$ to get the Delaunay-based masked image $\mathbf{I}_{DM} \in \mathbb{R}^{H_1 \times W_1 \times 3}$ through

$$\mathbf{I}_{DM} = f_2(\mathbf{I}, \mathbf{I}_M) = f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t)). \tag{3}$$
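The per-triangle affine mapping performed by R(·) can be illustrated with a minimal sketch. It assumes that the three vertices of a template triangle and of its corresponding target triangle are already known; the function names `affine_from_triangles` and `apply_affine` are hypothetical and not taken from the paper, and landmark extraction, triangulation, and image warping are left out.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def _det3(m: List[List[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def affine_from_triangles(src: Triangle, dst: Triangle) -> List[List[float]]:
    """Return the 2x3 affine matrix that maps the vertices of src onto dst."""
    base = [[x, y, 1.0] for x, y in src]
    det = _det3(base)
    if abs(det) < 1e-12:
        raise ValueError("degenerate source triangle")
    rows = []
    for axis in (0, 1):  # solve once for x' and once for y'
        target = [p[axis] for p in dst]
        coeffs = []
        for col in range(3):  # Cramer's rule, one unknown per column
            m = [row[:] for row in base]
            for r in range(3):
                m[r][col] = target[r]
            coeffs.append(_det3(m) / det)
        rows.append(coeffs)
    return rows


def apply_affine(a: List[List[float]], p: Point) -> Point:
    """Map a point with the 2x3 affine matrix a."""
    x, y = p
    return (a[0][0] * x + a[0][1] * y + a[0][2],
            a[1][0] * x + a[1][1] * y + a[1][2])
```

In a full pipeline, every pixel inside a template triangle would be mapped this way (or, more commonly, the inverse map would be sampled per target pixel), and an image-processing library would typically take over the interpolation.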

To better motivate our proposed method, we carried out a pilot study. Here, we briefly discuss the face recognition and mask detection results on images obtained by DM and compare them with the results on solid-color masked faces. We use the whole gallery set (1M images) and take 3,530 faces of 80 celebrities as the probe set from MegaFace Challenge 1 [22]. Fig. 3 compares the top-1 identification rates and the mask detection rates of images wearing solid-color medical masks and images generated by DM. Both indicators are very high for solid-colored medical masks, indicating that such masks hardly influence the discriminators. Images obtained by DM have a considerable impact on the face recognizers and the mask detector. However, DM is not effective enough, as shown in the second cluster of Fig. 3: about 51% of the top-1 identification tasks and 28% of the mask detection tasks are still judged correctly, far from zero. To strengthen the attack capability of the designed faced mask, we propose adversarial masking methods that add special textures, as explained in the following sections.

The main challenges stem from the following: ❶ Most of the face information is concentrated in the eye region. In contrast, the mask area, e.g., the mouth, plays a relatively minor role in the face recognition task, which increases the difficulty of our work. ❷ Although some of the images processed by DM remain undetected by the facial mask detector, 28% of the images are detected because of unavoidable artifacts introduced when adding masks (e.g., the chromatic aberration between faced masks and skin, discontinuities in textures), so the interference with the mask detector fails in these cases. ❸ Our goal combines multiple tasks, and it is difficult to handle tasks with different optimization directions simultaneously. ❹ Different deep learning-based discriminators use different network structures and different network parameters. It is hard to ensure transferability, i.e., that the generated faced masks effectively interfere with diverse discriminators.

Inspired by adversarial attacks such as projected gradient descent (PGD) [30], which are effective against pre-trained neural networks, we apply an adversarial attack to the masked image $\mathbf{I}_{DM}$ acquired by DM. We define this method as the adversarial noise Delaunay-based masking method (AdvNoise-DM) and show the process in the middle of Fig. 2(b). We replace $\mathbf{I}_{DM}$ with Eq. (3) and generate the adversarial noise $\mathbf{n} \in \mathbb{R}^{H_1 \times W_1 \times 3}$ to obtain

$$\hat{\mathbf{I}} = f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t)) + \mathbf{n}, \tag{4}$$

which denotes the superimposition of the intermediate $f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t))$ and the adversarial perturbation $\mathbf{n}$. Our goal is to find the perturbation $\mathbf{n}$ for which $\hat{\mathbf{I}}$ not only misleads the FRS but also remains undetected by the mask detector. The problem thus becomes one of achieving the optimal trade-off between face recognition and mask detection, and we have the following objective function:

$$\arg\max_{\mathbf{n}} \; D\big(\mathrm{FR}(f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t)) + \mathbf{n}),\, \mathrm{FR}(\mathbf{I})\big) - \alpha \cdot J\big(\mathrm{MD}(f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t)) + \mathbf{n}),\, y\big). \tag{5}$$

In the first part of the objective function, $\mathrm{FR}(\cdot)$ denotes a face recognition function that receives an image and returns the corresponding embedding. $D(\cdot)$ denotes the Euclidean distance between the embedding of the original image $\mathbf{I}$ and the embedding of the image $\hat{\mathbf{I}}$ processed by AdvNoise-DM. We maximize this part of the objective function to enlarge the gap in identity information before and after the modification.

In the second part, $\mathrm{MD}(\cdot)$ represents a mask detection function that receives an image and returns the probability of wearing a mask. $J(\cdot)$ denotes the cross-entropy loss function, and $y$ is the ground truth label for whether the face is masked: 0 for not masked and 1 for masked. Here, we set $y = 0$ to force the image $\hat{\mathbf{I}}$ to be judged as not wearing a mask. The ratio $\alpha$ adjusts the focus between face recognition and mask detection. We aim to minimize this cross-entropy loss, so this term carries a minus sign.
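To make the two terms of the objective in Eq. (5) concrete, the following minimal sketch evaluates it for given model outputs. It assumes that the embeddings and the predicted mask probability are already available as plain Python numbers, and that the cross-entropy is the binary form over a single "masked" probability; the names `joint_objective`, `alpha`, and `p_masked` are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cross_entropy(p_masked: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy of the predicted mask probability against label y."""
    p = min(max(p_masked, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def joint_objective(emb_adv: Sequence[float], emb_orig: Sequence[float],
                    p_masked: float, alpha: float = 1.0, y: int = 0) -> float:
    """Distance term minus alpha times the cross-entropy term (to be maximized)."""
    return euclidean_distance(emb_adv, emb_orig) - alpha * cross_entropy(p_masked, y)
```

With the target label set to 0 (not masked), the objective grows both when the adversarial embedding moves away from the original one and when the detector's mask probability falls, which is the trade-off that the ratio controls.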

The images obtained by AdvNoise-DM significantly interfere with the face recognizers and the mask detector. This method reduces the top-1 identification rate almost to zero and the mask detection rate to only 7.3%, as shown in Fig. 3. Nevertheless, AdvNoise-DM has a drawback: it changes the pixels considerably and therefore reduces the naturalness of the generated images. A smoother masking method is thus needed.
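The noise search behind AdvNoise-DM can be pictured as a projected sign-gradient ascent. The sketch below is a toy version under stated assumptions: the gradient of the objective is supplied by a user-defined callable `grad_fn` (in practice it would come from automatic differentiation through the recognizer and the detector), the perturbation starts at zero, and the restriction to the mask area is omitted. The default bound, step size, and iteration count mirror the AdvNoise-DM values reported in the evaluation settings.

```python
from typing import Callable, List


def sign(v: float) -> float:
    """Return -1.0, 0.0 or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def pgd_ascent(grad_fn: Callable[[List[float]], List[float]], dim: int,
               epsilon: float = 0.04, step: float = 0.001,
               iters: int = 40) -> List[float]:
    """Maximize an objective by sign-gradient steps projected onto [-epsilon, epsilon]."""
    n = [0.0] * dim  # zero initialization is an assumption of this sketch
    for _ in range(iters):
        g = grad_fn(n)
        # Step along the gradient sign, then clip back into the epsilon ball.
        n = [min(max(ni + step * sign(gi), -epsilon), epsilon)
             for ni, gi in zip(n, g)]
    return n
```

Note that 40 steps of size 0.001 can move a component by at most 0.04, so with these settings the perturbation can just reach the bound.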

Since filtering computes each pixel from its surrounding pixels and thus yields smoother results, we further propose an adversarial filtering Delaunay-based masking method (MF2M). This method combines noise-based and filtering-based attacks, as shown in Fig. 2(b). We first apply DM and add a relatively small adversarial perturbation $\mathbf{n}$ to get the intermediate $f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t)) + \mathbf{n}$, as in AdvNoise-DM. Then we use pixel-wise kernels $\mathbf{K} \in \mathbb{R}^{H_1 \times W_1 \times k^2}$ to process the intermediate. The $i$-th pixel of the intermediate is processed by the corresponding $i$-th kernel in $\mathbf{K}$, denoted as $\mathbf{K}_i \in \mathbb{R}^{k \times k}$, where $k$ is the kernel size. We retouch the original image $\mathbf{I}$ via the guidance of filtering and reformulate Eq. (4) as

$$\tilde{\mathbf{I}} = \mathbf{K} \circledast \big(f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t)) + \mathbf{n}\big), \tag{6}$$

where $\circledast$ denotes the pixel-wise filtering process and $\tilde{\mathbf{I}} \in \mathbb{R}^{H_1 \times W_1 \times 3}$ represents the filtered image. In the MF2M procedure, we aim to obtain an $\tilde{\mathbf{I}}$ that is deceptive in both the face recognition task and the mask detection task by altering the pixel-wise kernels $\mathbf{K}$. The objective function for the optimization is similar to Eq. (5):

$$\arg\max_{\mathbf{K}} \; D\big(\mathrm{FR}(\mathbf{K} \circledast (f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t)) + \mathbf{n})),\, \mathrm{FR}(\mathbf{I})\big) - \beta \cdot J\big(\mathrm{MD}(\mathbf{K} \circledast (f_2(\mathbf{I}, f_1(\mathbf{I}, \mathbf{I}_t)) + \mathbf{n})),\, y\big). \tag{7}$$

Compared with Eq. (5), the optimization variable becomes $\mathbf{K}$. We aim to increase the Euclidean distance between the embedding of the original image $\mathbf{I}$ and that of the filtered image $\tilde{\mathbf{I}}$. Meanwhile, we try to increase the probability that the filtered image $\tilde{\mathbf{I}}$ is judged as not wearing a mask. The ratio of the mask detection part is denoted as $\beta$. As shown in Fig. 3, almost all images generated by MF2M are deceptive for face recognition, and only 6.95% of the images are detected as wearing masks. Besides, MF2M yields higher naturalness than AdvNoise-DM. Fig. 4 shows that the peak signal-to-noise ratio (PSNR) calculated between MF2M and DM is higher than that calculated between AdvNoise-DM and DM, indicating that MF2M changes images less. We use two similarity metrics in the experiments to support this point.

Algorithm 1 (excerpt): update the filtering kernels $\mathbf{K}$ via $\mathbf{K} = \mathbf{K} + (\text{step size}) \cdot \nabla_{\mathbf{K}} \mathrm{Loss}$; apply image filtering to obtain the reconstructed image $\tilde{\mathbf{I}}$ via $\tilde{\mathbf{I}} = \mathbf{K} \circledast \hat{\mathbf{I}}$.

Algorithm 1 summarizes our method. First, we apply DM to complete the face replacement process, i.e., we generate a faced mask extracted from $\mathbf{I}_t$ and overlay it on $\mathbf{I}$ to obtain $\mathbf{I}_{DM}$. Second, we add an adversarial noise $\mathbf{n}$ to $\mathbf{I}_{DM}$ and obtain $\hat{\mathbf{I}}$. In the filtering attack process, we initialize the filtering kernels $\mathbf{K}$ so that the filtered image initially equals the input (i.e., the weight at the center position of each kernel is 1, and the weights at all other positions are 0). In each iteration, we perform pixel-wise filtering of the noised image $\hat{\mathbf{I}}$ with the current kernels $\mathbf{K}$ to acquire the current filtered image $\mathbf{I}'$. Then we calculate $\mathrm{Loss}_D$ and $\mathrm{Loss}_{CE}$ via the Euclidean distance function and the cross-entropy loss function, respectively. These two loss functions are combined into the final optimization objective with the ratio $\beta$. At the end of each iteration, we update the filtering kernels $\mathbf{K}$ by the product of the step size and the gradient of the optimization objective. Finally, we use the optimized kernels to produce the target image $\tilde{\mathbf{I}}$.
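The pixel-wise filtering operation and its identity initialization can be sketched as follows for a single-channel image stored as nested lists. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and replicating border pixels at the image edge is an assumption, since the text does not specify the border handling.

```python
from typing import List

Image = List[List[float]]                 # H x W, single channel
Kernels = List[List[List[List[float]]]]   # H x W x k x k, one kernel per pixel


def identity_kernels(h: int, w: int, k: int) -> Kernels:
    """Kernels with weight 1 at the center and 0 elsewhere (output equals input)."""
    c = k // 2
    return [[[[1.0 if (a == c and b == c) else 0.0 for b in range(k)]
              for a in range(k)] for _ in range(w)] for _ in range(h)]


def pixelwise_filter(img: Image, kernels: Kernels) -> Image:
    """Filter each pixel with its own k x k kernel over the surrounding window."""
    h, w = len(img), len(img[0])
    k = len(kernels[0][0])
    c = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    y = min(max(i + a - c, 0), h - 1)  # replicate border (assumption)
                    x = min(max(j + b - c, 0), w - 1)
                    acc += kernels[i][j][a][b] * img[y][x]
            out[i][j] = acc
    return out
```

Starting from identity kernels means the attack begins at the unfiltered image and only departs from it as the kernels are updated, which is one way to keep the modification small.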

Face recognition methods. In our white-box attack experiment, the backbone of the face recognizer [6] is a ResNet50 pre-trained with ArcFace. The face recognizer takes cropped images (112 × 112) as input and returns 512-D embedding features. To illustrate the transferability, we further use recognizers [8] pre-trained with CosFace, with ResNet34 and ResNet50 as the backbone respectively, to verify the black-box attack performance. We choose these two FRS because ArcFace and CosFace achieve SOTA performance in face recognition.

Mask detection methods. The mask detection method is based on RetinaNet [26], an efficient one-stage object detection method. The pre-trained model [37] we use is competitive among existing mask detectors, achieving 91.3% mAP on the face_mask validation dataset (1,839 images). The mask detector outputs two probabilities, for not-masked and masked faces respectively. By comparing these two probabilities, we can judge whether a mask is present.

Datasets. We utilize 1M images of 690K individuals in MegaFace Challenge 1 as the gallery set. In terms of the probe set, we refer to the setting of MegaFace [22] and use a subset of FaceScrub (i.e., 3,530 images of 80 celebrities) for efficiency. For the template images used to extract faced masks, we use StyleGAN to generate images with seeds numbered from 1 to 13,000. As some generated images have illumination or occlusion problems, we manually select 3,136 high-quality face images.

Evaluation settings. The face recognition evaluation is based on masked/not-masked pairs. We add masks to the images of the probe set and keep the images in the gallery not masked. When adding masks, we select from the 3,136 template images the face most similar to the original face according to the features extracted by the face recognition model, which aims to make the masked faces look more natural. In AdvNoise-DM, we use the PGD attack to add adversarial noise. The epsilon (maximum distortion of the adversarial example) is 0.04. The step size of each attack iteration is 0.001 and the number of iterations is 40. The ratio $\alpha$ is set to 1. In MF2M, we add noise with an epsilon of 0.01. The kernel size of the pixel-wise kernels is 5. When altering the pixel-wise kernels, the step size is 0.1 and the number of iterations is 160. Here we set the coefficient $\beta$ to 1, the same as $\alpha$. All optimization objectives are restricted to the faced mask area obtained by a deep learning-based method [7].

Metrics. For face recognition, we use the top-1 identification rate in the face identification task, and the true accept rate (TAR) at a false accept rate (FAR) of $10^{-6}$ together with the area under the curve (AUC) in the face verification task. For mask detection, we use the detection rate. To reflect the degree of modification and evaluate the naturalness of the generated images, we further use the PSNR and the structural similarity (SSIM) [44] to measure the similarity between the adversarial masked results and the corresponding images from DM. The similarity metrics are calculated over the whole image.

Face recognition has two main tasks, face identification and face verification. Given a probe image and a gallery, identification aims to find a gallery image that has the same identity as the probe image, i.e., a 1 vs. $N$ search task. The verification task sets a threshold to judge whether two images have the same identity, i.e., a 1 vs. 1 comparison task.

Face identification. We form 151K pairs with the same identity from the 3,530 face images of 80 celebrities. For each pair, we take one image as the probe image and put the other image into the gallery. The top-k identification rate denotes the success rate of matching pairs, where k is the number of images selected from the gallery. Fig. 5(a) shows the cumulative matching characteristic (CMC) curves of images under different masking states. The abscissa indicates the number of images selected from the gallery according to the embedding obtained by the face recognizer. The ordinate denotes the identification rate at the specified number of images. When images are without masks, the top-1 identification rate (i.e., "Rank 1") is 0.98, showing that the recognizer performs well without face occlusion. After adding solid-color medical masks, "Rank 1" falls to 0.7389. For DM, this metric declines to 0.5146. As for AdvNoise-DM and MF2M, the higher the attack intensity, the closer their curves are to the lower right of the graph. We adjust the number of iterations of each method so that AdvNoise-DM and MF2M perform similarly in this task, which allows us to compare them on other indicators. "Rank 1" of both methods drops to $1.35 \times 10^{-5}$, indicating that the SOTA face recognizer performs poorly under AdvNoise-DM and MF2M. The second column of Table 1 shows that AdvNoise-DM and MF2M achieve significantly lower "Rank 1" than seven SOTA additive-perturbation-based baselines.
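The rank-k identification rate underlying a CMC curve can be sketched as below, assuming that similarity scores between each probe and the gallery have already been computed. Each trial is represented as a pair of the mate score and the list of distractor scores; this layout, the tie-handling rule, and the function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Trial = Tuple[float, Sequence[float]]  # (score of the true mate, distractor scores)


def mate_rank(mate_score: float, distractor_scores: Sequence[float]) -> int:
    """1-based rank of the true match; ties are counted against the mate."""
    return 1 + sum(1 for s in distractor_scores if s >= mate_score)


def identification_rate(trials: List[Trial], k: int) -> float:
    """Fraction of trials whose true match appears within the top k."""
    hits = sum(1 for mate, distractors in trials
               if mate_rank(mate, distractors) <= k)
    return hits / len(trials)
```

Evaluating `identification_rate` for k = 1, 2, 3, ... yields the points of a CMC curve, and the value at k = 1 corresponds to "Rank 1".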

Face verification. We use the 3,530 images of 80 identities in the probe set and the 1M images in the gallery to build 151K positive samples and 3.5 billion negative samples for face verification. Fig. 5(b) shows the receiver operating characteristic (ROC) curves. We define the true positive rate (TPR) at a false positive rate (FPR) of $10^{-6}$ as "Veri.", which is 0.7470 and 0.5154 for solid-color medical masks and DM, respectively. When we apply AdvNoise-DM and MF2M, "Veri." drops almost to zero. The third and fourth columns of Table 1 show the "Veri." values and the AUC values, respectively. Both metrics of AdvNoise-DM and MF2M are much lower than those of the baselines.
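The "Veri." metric (true accept rate at a fixed false accept rate) can be sketched from two lists of similarity scores. The threshold rule below, which accepts only scores strictly above the threshold and allows at most floor(far × N) impostor pairs to pass, is one common convention and an assumption of this sketch; the function name `tar_at_far` is hypothetical.

```python
from typing import Sequence


def tar_at_far(genuine: Sequence[float], impostor: Sequence[float],
               far: float) -> float:
    """TAR at a threshold that lets at most floor(far * N) impostor scores pass."""
    imp = sorted(impostor, reverse=True)
    allowed = int(far * len(imp))  # number of impostor pairs that may be accepted
    # Scores strictly above the threshold are accepted.
    threshold = imp[allowed] if allowed < len(imp) else float("-inf")
    accepted = sum(1 for s in genuine if s > threshold)
    return accepted / len(genuine)
```

At an FAR of 10^-6, an evaluation needs more than a million negative pairs before a single false accept is permitted, which is why a large negative set matters for this metric.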

Mask detection. The fifth column of Table 1 reports the mask detection rate of the different masking methods, which also represents the accuracy of the mask detector used. Solid-colored masks are easily detected, with a detection rate of 90.53%. DM reduces the detection rate to 27.74%. AdvNoise-DM further interferes with the judgment of the detector, and the accuracy decreases to only 7.30%. MF2M achieves the best attack performance and reduces this rate to 6.95%. The detection rates for the additive-perturbation-based baselines are between 39% and 66%. These results show that our adversarial methods are very effective against both face recognition and mask detection in white-box attacks.

Similarity measurement. We now turn to the similarity before and after adding adversarial textures. The similarity is calculated against the images obtained by DM, so we only compute similarity scores for AdvNoise-DM and MF2M. We choose SSIM and PSNR as our similarity metrics, evaluating similarity in terms of visual error and structural difference. The SSIM values of AdvNoise-DM and MF2M are 0.9808 and 0.9812, respectively, indicating high structural similarity in the reconstruction process. In terms of PSNR, the value of MF2M is 40.45, higher than the 38.76 of AdvNoise-DM, which is in line with the expectation that the filtering operation modifies images more imperceptibly.
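For reference, PSNR can be computed as in the following minimal sketch for single-channel images stored as nested lists. It assumes pixel values in [0, 1] (so the peak value is 1.0), which the text does not state; the function name is illustrative.

```python
import math
from typing import List

Image = List[List[float]]


def psnr(a: Image, b: Image, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal size."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b)
             for x, y in zip(row_a, row_b)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Under this assumption, a uniform change of 0.01 in every pixel gives an MSE of 10^-4 and hence a PSNR of 40 dB, which is on the same order as the values reported above.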

To verify the transferability of our methods, we use the generated masked images to carry out black-box attacks. We conduct black-box attacks against face recognition models pre-trained with CosFace, with ResNet34 and ResNet50 as the backbone. Curves with dots and without dots in Fig. 6 represent the results of attacking ResNet34 and ResNet50, respectively. Whereas the "Rank 1" values of AdvNoise-DM and MF2M in Fig. 5(a) are nearly zero, they range from 0.05 to 0.08 in Fig. 6(a). The interference with face recognizers is thus weaker in black-box attacks but still strong, i.e., our adversarial masking methods have sufficient transferability. Besides, the curves of MF2M are lower than the curves of AdvNoise-DM for the same target model, indicating that MF2M has stronger transferability. Compared with the adversarial attack baselines, AdvNoise-DM and MF2M have a clear advantage in their impact on face recognizers, as shown in Table 2.

Due to various COVID-19 related restrictions, we were not able to recruit human subjects to study the effect of a physical attack by wearing our proposed faced masks. Therefore, we use an alternative recapture method to illustrate the physical effects of our proposed masking methods. We randomly select 20 faces of different identities from the FaceScrub dataset as original images (Fig. 7(a)). We process the 20 faces with our proposed MF2M and obtain digitally attacked faces, as shown in Fig. 7(b). Then we use an − 5575 printer to print these attacked images and recapture them, as in Fig. 7(c), which shows obvious color differences from Fig. 7(b). This procedure is meant to mimic the process of transferring the patterns from a digital medium onto a physical one, such as fabric, linen, or paper, so that the physical appearance can be digitally reacquired via image sensors. Finally, we resize the recaptured images to 112 × 112, extract the faced masks from them, overlay the faced masks on the 20 corresponding original faces, and obtain the images used in the physical attack, as shown in Fig. 7(d). Based on this synthesis process, we conduct physical attack experiments and demonstrate the robustness of our proposed MF2M and AdvNoise-DM in evading face recognition and mask detection. In addition to the solid-color medical mask and DM, we choose the I-FGSM baseline (i.e., I-FGSM applied to solid-color masks), which is the strongest baseline in the digital attacks, as the main baseline for the physical attacks. We process these masking methods with the same procedure shown in Fig. 7, which gives three baselines for the physical attacks.

For the physical white-box attack, the second and third columns of Table 3 show that the top-1 identification rates and the TAR at 10⁻⁶ FAR of MF2M and AdvNoise-DM are both zero, indicating the strong interference of these two masking methods with the face recognizer. The last column of Table 3 shows that only 10% of the images produced by MF2M and AdvNoise-DM are detected as wearing masks, demonstrating the strong ability of our proposed masking methods to evade mask detection in the physical white-box attack. The corresponding CMC curves in Fig. 8(a) illustrate that the identification rates of MF2M and AdvNoise-DM are always below those of the three baselines at every rank. The ROC curves in Fig. 8(b) show that the TPR of MF2M and AdvNoise-DM is always lower than that of the three baselines at every FPR.
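For readers unfamiliar with the verification metric, the following sketch shows one common way to compute the TAR at a fixed FAR from similarity scores. It is a hedged illustration: the function name `tar_at_far`, the strict "greater than threshold" acceptance rule, and the toy scores are assumptions of ours, not the paper's evaluation protocol.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=1e-6):
    """True accept rate at the strictest threshold whose false accept rate is <= far."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]  # descending
    # Number of impostor pairs that may be accepted without exceeding the target FAR.
    n_accept = int(np.floor(far * impostor.size))
    # Accept only scores strictly above the (n_accept + 1)-th highest impostor score.
    threshold = impostor[n_accept] if n_accept < impostor.size else -np.inf
    return float(np.mean(genuine > threshold)), float(threshold)

# Toy usage with hypothetical similarity scores.
genuine = [0.9, 0.8, 0.3, 0.2]
impostor = [0.1, 0.25, 0.5, 0.05]
tar, thr = tar_at_far(genuine, impostor, far=0.25)
```

With fewer impostor pairs than 1/FAR, the threshold becomes the highest impostor score, so a TAR of zero means that no attacked genuine pair scores above any impostor pair.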

As for the physical black-box attack, the top-1 identification rates and the TAR at 10⁻⁶ FAR of the I-FGSM baseline are above 0.64, as shown in Table 4. This indicates that the adversarial textures added by the I-FGSM baseline almost fail in this setting. In contrast, the same metrics for our proposed MF2M and AdvNoise-DM remain below 0.22, indicating that our proposed masking methods still interfere strongly with face recognizers in the physical black-box attack. Compared with AdvNoise-DM, MF2M has a greater influence on face recognizers. The corresponding CMC and ROC curves of the physical black-box attack are shown in Fig. 9.

AdvNoise-DM relies on noise-based adversarial attacks alone, and its noise intensity reaches 0.04. When we combine noise-based and filtering-based attacks in MF2M, a lower noise intensity (i.e., 0.01) suffices to achieve similar results against face recognition. We run experiments under different attack intensities and compare the two methods with PSNR-AUC curves. As shown in Fig. 10(a), MF2M (orange curve) attains higher PSNR values than AdvNoise-DM (blue curve) at the same AUC values, indicating the advantage of MF2M in naturalness. We also conduct experiments that attack images with filtering kernels only and show the comparison in Fig. 10(b). For the same number of kernel-update iterations, MF2M (orange curve) achieves lower AUC values than the method that uses only filtering kernels (blue curve). This suggests that adding noise assists the optimization of the filtering kernels, since noise-based attacks have fewer parameters and higher attack efficiency.
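To give a rough sense of why a lower noise intensity helps naturalness, the sketch below computes PSNR and a worst-case PSNR for additive noise. It assumes, purely for illustration, that the noise intensity is an L∞ bound on pixel values scaled to [0, 1]; it covers only the additive-noise component and ignores the filtering and the mask replacement, so the numbers are not the paper's reported values. The function names are ours.

```python
import numpy as np

def psnr(reference, distorted, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images with values in [0, max_val]."""
    diff = np.asarray(reference, dtype=float) - np.asarray(distorted, dtype=float)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def worst_case_psnr(eps, max_val=1.0):
    """PSNR when every pixel is shifted by exactly eps, i.e. MSE = eps ** 2."""
    return float(20.0 * np.log10(max_val / eps))

# Assumed L-infinity budgets matching the two intensities quoted in the text.
for eps in (0.04, 0.01):
    print(f"eps={eps}: worst-case PSNR = {worst_case_psnr(eps):.2f} dB")
```

Under these assumptions, reducing the noise budget from 0.04 to 0.01 raises the worst-case PSNR of the noise component from about 28 dB to 40 dB, which is consistent in direction with the higher PSNR reported for MF2M.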

In this paper, we propose MF2M, an adversarial masking framework that adds faced masks containing partial face patterns and special adversarial textures. Our work reveals the potential risks that specially customized facial masks pose to existing face recognizers and mask detectors. The images produced by our methods remain natural-looking, which makes them a greater security hazard. Specially generated facial masks should therefore be taken into consideration when designing FRS and mask detection systems.
