1 Introduction

Face detection approaches have been widely adopted in applications that create biometric markers for social networks, smart devices, surveillance systems, fraud detection, and so on [15, 20, 25]. In short, face detection is a subarea of object detection devoted to finding image regions that contain faces [22, 25].

The advances in Artificial Neural Networks (ANN) [9], especially Convolutional Neural Networks (CNN) [14], have improved face detection performance and made its use common in many real-world scenarios. Nowadays, state-of-the-art studies report strong results when finding faces under different positions, rotations, occlusions, expressions, scales, and localizations [23, 25].

Despite these theoretical and scientific advances, scientists are concerned about fairness in AI models, currently a hot topic in computer vision, aiming to better understand the robustness of such models across important human features such as age, gender, and race [5]. Due to ethical issues, commercial systems designed by major companies (e.g., IBM, Microsoft, and Clarifai) were discontinued because of high error rates in specific groups, such as black women, possibly reflecting a racist bias [7]. Recently, in New York, scientists have asked for the use of face recognition to be suspended in some situations due to errors associated with gender, race, and ethnicity [4]. The most important aspect of this issue is that, even in systems with high accuracy rates, small errors may strongly affect the lives of minorities who are historically oppressed [2, 15].

An alert was recently triggered when several researchers emphasized that classification errors are more likely when people of color are under scrutiny [4]. The reason why such errors happen is still being assessed. According to these scientists, the errors may be driven by the methods themselves and/or by the dataset used during training, thus requiring a careful evaluation before deployment in a scenario characterized by different groups [16].

In our context, the problem came up during the development of an AI-based system to detect fraud in public transportation in Salvador (Brazil). Before the pandemic outbreak, more than 1.1 million passengers took buses per day in Salvador (average calculated between 2016 and 2019). Our system analyzes the images taken when passengers cross the turnstile by running two steps. First, we execute a face detection step on the captured images and then compare the detected face with the user’s registered profile. Fraud usually happens when users have benefits such as student discounts and share their discount cards with others. In case of confirmed fraud, the card is blocked.

Considering that Salvador is the Brazilian city with the highest percentage of black people (about \(80\%\)), any error may affect a significant number of users, leading to a high number of false positives. In our scenario, due to the absence of images with appropriate labels describing users by gender and race, we used pre-trained face detectors published by their original authors. We therefore created an empirical setup to assess whether these detectors have gender and race biases before deploying our system. To reach this goal, we first created a specialist committee to label our images. Then, we analyzed the errors produced by every detector, taking into account different groups of genders and races. As discussed later, we have indeed found a race bias in our environment.

This manuscript is organized as follows: Sect. 2 presents related work; Sect. 3 introduces the methodology and tools considered to conduct our study; the experimental results are presented in Sect. 4; finally, concluding remarks and future directions are given in Sect. 5, which is followed by the list of references.

2 Background

This section presents studies related to the investigation of ethical issues caused by gender and race biases that influence the performance of Artificial Intelligence methods. In this sense, Farinella and Dugelay (2012) [6] published a manuscript focused on understanding whether gender and ethnicity affect each other during the classification process. According to the authors, these features do not influence each other.

Furl, Phillips, and O’Toole (2002) [7] designed a study to verify the precision of face recognition models when dealing with different ethnicities. Their results focused on Caucasians and Asians. Phillips et al. (2011) [16] discussed the risks of using faces with different characteristics during the training and test phases.

Similar work was published by Klare et al. (2012) [13], whose main contribution was to extend the analysis to different demographic features such as gender, race, and age. According to the authors, black and young women presented the highest error rates. To overcome this situation, the authors recommend balanced datasets with data sampled from the population where the system will be deployed. The authors also suggest an effort to create public datasets that support systems to be implemented in similar scenarios.

More recent manuscripts have contributed to the organization of datasets as well as to the investigation of the existence of a racist bias. In that sense, Karkkainen and Joo (2021) [12] focused on developing a balanced dataset, named FairFace, which has information about seven race groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. This dataset was compared with others in the literature and used with commercial APIs, bringing improvements in the race classification task.

Buolamwini and Gebru (2018) [2] explored three face detection models used in commercial solutions (Microsoft, IBM, Face++) to evaluate the presence of gender and race prejudice. Their most important contribution to the literature was the creation of a dataset, built by integrating other public images, balanced by gender and ethnicity. The results presented by the authors show that the highest errors were obtained for the group composed of black women.

Wang et al. (2019) [21] evaluated racial bias across ethnicities by developing a dataset and a model capable of reducing racist bias in facial recognition. The dataset, named Racial Faces in-the-Wild (RFW), covers Caucasian, Indian, Asian, and African ethnicities. The experiments showed that facial recognition of Africans has error rates twice as high as facial recognition of Caucasians.

Raji et al. (2020) [17] investigated ethics in facial processing technologies developed by Amazon, Microsoft, and Clarifai, verifying the accuracy of their systems in classifying people according to race, gender, age, and expression, as well as in face detection. All experiments were performed on a dataset created by the authors, which is made up only of celebrities. The main result of this research is the alert about the importance of correctly analyzing ethical issues in the considered models.

3 Methodology

As previously mentioned, the main objective of this work is to assess whether or not widely-adopted face detection methods present similar performance when classifying real-world images regardless of the users’ gender and race. The analyzed images were collected by the consortium Integra, formed by the four companies responsible for public transportation in Salvador (Bahia – Brazil).

To reach this objective, we defined the following research question: do the pre-trained face detection models act with prejudice? This question was analyzed by using validation metrics, usually adopted in ML problems, under a null hypothesis (\(H_0\)) stating that there is no difference between metrics when users are compared by gender and race. In turn, our alternative hypothesis characterizes the situations in which \(H_0\) is rejected, which would call our attention to biases affecting the expected face detection behavior.

To better define the scope of our analyses, we selected images from the Integra databases and three well-known face detectors: i) YOLO Faces [18]; ii) Multi-task Cascaded Convolutional Networks (MTCNN) [26]; and iii) ResNet [10]. The following sections detail our scope, the experimental setup, and the methodology considered to assess our hypothesis.

3.1 Dataset

The images considered in our experiments were collected from public transportation in Salvador. When a passenger crosses the turnstile inside a bus, four pictures are taken and stored in a database. From this database, we randomly selected 10,000 images captured on December 10, 2019. This date was a usual workday and was chosen for no particular reason other than being before the COVID-19 pandemic outbreak, thus avoiding images of users wearing masks, which might affect the performance of the selected face detectors.

We also created a tool to support specialists during the tasks of setting the gender and race labels and validating the face detector results, as shown in Fig. 1. As one may notice, as a specialist sets the gender and race, the percentage of each label is updated to keep the dataset balanced. Consequently, the final set of images contains a similar number of users according to their gender and race. We emphasize that, although we are aware of the different classes of gender and race adopted by IBGE (Brazilian Institute of Geography and Statistics), the scope of this study was limited to man/woman and black/white, respectively. The final dataset is composed of instances with the following class rates: i) black – 569 (\(56.50\%\)); ii) white – 438 (\(43.50\%\)); iii) male – 407 (\(40.42\%\)); and iv) female – 600 (\(59.58\%\)).

Fig. 1.

GUI for labeling target images and validating model predictions.

Another important feature implemented in this tool is the validation of the face detectors. Since pre-trained models are used, we show the specialists a face and the bounding boxes produced by all detectors, and they rate each detector’s output using one of four possibilities. A True Positive (TP) happens when the bounding box highlights an actual face. In turn, if a bounding box is drawn on a region without a face, the specialists set it as a False Positive (FP). The True Negative (TN) option confirms that no bounding box was drawn on an image without a face, which usually happens when the camera is wrongly positioned or triggered. Finally, a False Negative (FN) characterizes situations in which the face detectors do not find an existing face. After performing this step, we have a contingency matrix, which is considered to assess our hypothesis.
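To make this validation step concrete, the sketch below shows how the specialists’ per-detector decisions could be tallied into such a contingency matrix; the record structure and field names are illustrative and do not reflect the actual implementation of our tool.

from collections import Counter

# Hypothetical structure: one record per (image, detector) pair holding the
# specialist's decision ("TP", "FP", "TN", or "FN") for that detector's output.
validations = [
    {"detector": "MTCNN", "outcome": "TP"},
    {"detector": "MTCNN", "outcome": "FN"},
    {"detector": "YOLO", "outcome": "TP"},
]

def contingency_counts(records, detector):
    """Tally the specialists' TP/FP/TN/FN decisions for one detector."""
    counts = Counter(r["outcome"] for r in records if r["detector"] == detector)
    return {k: counts.get(k, 0) for k in ("TP", "FP", "TN", "FN")}

print(contingency_counts(validations, "MTCNN"))  # {'TP': 1, 'FP': 0, 'TN': 0, 'FN': 1}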

Aiming at respecting the users’ privacy and due to the Brazilian General Personal Data Protection Law (LGPD), the resulting dataset cannot be shared. We emphasize that all images were manipulated only by authorized employees from Integra and the researchers associated with this research, respecting the LGPD requirements. If the reviewers deem it necessary to audit the dataset, the program chairs of the conference can contact us to obtain access to the images, bounding boxes, and labels, always respecting the LGPD requirements.

3.2 Face Detection Models

The detection of faces in images is a challenging task in the Computer Vision area and is considered an instance of general object detection. Recent advances in this area, especially with the advent of Convolutional Neural Networks (CNNs), have allowed this task to be performed with great performance, even in situations characterized by, for example, face rotation and occlusion (e.g., hats, caps, and glasses). Based on the literature, approaches designed to detect faces are usually less complex than those designed to look for several object classes in images. Next, we describe three models widely used for this purpose.

The first approach considered in our experiments was YOLO (You Only Look Once) [18], a CNN with 26 convolutional, 4 pooling, and 2 fully connected layers. Face detection is essentially performed as a fast regression task, which provides high performance. In summary, YOLO analyzes the image as a grid, in which every cell is considered to predict a bounding box and define a confidence level estimating the presence of objects. The pre-trained model considered in our evaluation was based on the Wider Face dataset [24].

The second approach is the Multi-task Cascaded Convolutional Neural Network (MTCNN) [26], which is based on three steps. The first one, referred to as P-Net, contains three convolutional layers with 10, 16, and 32 filters and a pooling layer. The second one, called R-Net, is composed of three convolutional layers with 28, 48, and 64 filters, two pooling layers, and a fully connected layer. Finally, the last step, known as O-Net, is characterized by four convolutional layers with 32, 64, 64, and 128 filters, along with 3 pooling layers and 1 fully connected layer. P-Net initially detects candidate faces, and its results are refined by R-Net, which removes regions where the presence of faces is highly unlikely. Then, O-Net detects the faces and provides five reference points. In our scenario, we considered a pre-trained model adjusted on the Face Detection Benchmark [11] and Wider Face [24] datasets.
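As an illustration of how such a pre-trained cascade can be applied, the sketch below uses the open-source mtcnn Python package, which ships pre-trained P-Net/R-Net/O-Net weights; we do not claim it is the exact model evaluated here, and the image path is merely illustrative.

import cv2                    # pip install opencv-python
from mtcnn import MTCNN       # pip install mtcnn

# OpenCV loads images as BGR; MTCNN expects RGB.
image = cv2.cvtColor(cv2.imread("passenger.jpg"), cv2.COLOR_BGR2RGB)  # illustrative path

detector = MTCNN()                           # loads the pre-trained three-stage cascade
for det in detector.detect_faces(image):
    x, y, w, h = det["box"]                  # bounding box (top-left corner, width, height)
    score = det["confidence"]                # detection confidence
    left_eye = det["keypoints"]["left_eye"]  # one of the five reference points
    print(x, y, w, h, score, left_eye)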

The third approach used in our experiments was ResNet-33 [10], which is based on a Region Proposal Network (RPN) to define bounding boxes and reference points. The architecture considered by ResNet uses a shared memory strategy, aiming at making the feature map available to all convolutional layers. First, there is a fast ZF-Net (Zeiler and Fergus model) with five convolutional and three fully connected layers. Then, there is a VGG-16 architecture [19] with 13 convolutional and 3 fully connected layers. The model used in our analyses was also previously trained on the Wider Face dataset.

3.3 Validation Metrics

Aiming at assessing the contingency matrix produced by comparing the results obtained by the face detectors and the specialists, we used the following validation metrics, widely adopted in the ML area: Sensitivity, Specificity, Accuracy, Precision, F1-Score, Kappa Coefficient, and Matthews Correlation Coefficient (MCC).

Sensitivity, defined in Eq. 1, calculates the true positive rate, i.e., the proportion of correctly detected faces among all actual faces (detected plus missed). The true negative rate is measured by Specificity, as shown in Eq. 2, computed from the images correctly classified as containing no face relative to those plus the wrong bounding boxes. Accuracy calculates the proportion of correct classifications, i.e., correct bounding boxes plus images correctly reported as containing no face, as depicted in Eq. 3. In contrast, Precision counts the proportion of correctly detected faces among all bounding boxes produced by the detectors (Eq. 4). We also used the F1-Score (Eq. 5), which calculates the harmonic mean between Precision and Sensitivity (Recall). In our analyses, we considered Precision and Sensitivity to have the same weight, i.e., to be equally important.

$$\begin{aligned} \text {Sensitivity} = \frac{TP}{TP + FN} \end{aligned}$$
(1)
$$\begin{aligned} \text {Specificity} = \frac{TN}{FP + TN} \end{aligned}$$
(2)
$$\begin{aligned} \text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(3)
$$\begin{aligned} \text {Precision} = \frac{TP}{FP + TP} \end{aligned}$$
(4)
$$\begin{aligned} \text {F1-Score} = \frac{2 * ( \text {precision} * \text {sensitivity})}{ \text {precision} + \text {sensitivity}} \end{aligned}$$
(5)
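A minimal sketch of Eqs. 1–5, computed directly from the four contingency counts, is shown below; the counts in the example call are placeholders rather than our measured values.

def basic_metrics(tp, fp, tn, fn):
    """Validation metrics of Eqs. 1-5 from a 2x2 contingency matrix."""
    sensitivity = tp / (tp + fn)                    # Eq. 1 (recall)
    specificity = tn / (fp + tn)                    # Eq. 2
    accuracy = (tp + tn) / (tp + tn + fp + fn)      # Eq. 3
    precision = tp / (fp + tp)                      # Eq. 4
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. 5
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "precision": precision, "f1": f1}

print(basic_metrics(tp=950, fp=10, tn=20, fn=27))   # illustrative counts only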

The Kappa Coefficient (\(\kappa \)), Eq. 6, is calculated from the observed agreement \(P_o\) and the agreement that would be expected to occur by chance \(P_e\), as shown in Eq. 7. For all metrics presented so far, the greater the value, the better the classification.

$$\begin{aligned} \text {Kappa} = \kappa = (P_o - P_e) / (1 - P_e) \end{aligned}$$
(6)
$$\begin{aligned} P{_{o}}=\sum _{i=1}^{C}p_{ii}, \quad P{_{e}}=\sum _{i=1}^{C}p_{i.}p_{.i} \end{aligned}$$
(7)

The last measure used in our experiments is the MCC (Matthews Correlation Coefficient), Eq. 8, which uses the contingency matrix to compare classifiers in a way analogous to Pearson’s correlation coefficient. The coefficient lies in the interval \([-1,+1]\), in which \(+1\) represents a perfect match between expected and predicted labels. When it is equal to 0, the learning model is providing essentially random predictions. Values approaching \(-1\) suggest total disagreement between expected and predicted labels.

$$\begin{aligned} MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{aligned}$$
(8)
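The two chance-corrected measures can be obtained from the same counts; the sketch below uses the binary-case form of \(P_o\) and \(P_e\) (Eq. 7) and Eq. 8, again with placeholder counts.

import math

def kappa_and_mcc(tp, fp, tn, fn):
    """Kappa (Eqs. 6-7, binary case) and MCC (Eq. 8) from a 2x2 contingency matrix."""
    n = tp + fp + tn + fn
    p_o = (tp + tn) / n                                   # observed agreement
    p_e = ((tp + fp) / n) * ((tp + fn) / n) \
        + ((tn + fn) / n) * ((tn + fp) / n)               # agreement expected by chance
    kappa = (p_o - p_e) / (1 - p_e)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return kappa, mcc

print(kappa_and_mcc(tp=950, fp=10, tn=20, fn=27))         # illustrative counts only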

3.4 Hypothesis Test

The validation metrics described above were used to verify the agreement between the bounding boxes produced by the face detectors and the presence of actual faces. Next, we need to verify whether or not the detectors’ performances vary depending on the users’ gender and race.

In this way, we start from the assumption that the data follow a discrete sequence of independent Bernoulli random variables with probability p for Class 1 and probability \(q = 1 - p\) for Class 0. Thus, the expected value of the i-th Bernoulli random variable is \(E(X_i) = p\) and its variance is \(\text {Var}(X_i) = p(1-p)\) [8]. In our context, by calculating the probability of success for the gender and race labels, the confidence interval of each binomial proportion can be obtained through the normal approximation, which is also used to compute the test statistic.
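As an illustration of this normal approximation, the short sketch below computes an error proportion and its Wald confidence interval for a single group; the counts are placeholders, not our measured values.

from math import sqrt
from scipy.stats import norm

def proportion_ci(errors, n, alpha=0.05):
    """Normal-approximation (Wald) confidence interval for a binomial proportion."""
    p_hat = errors / n                          # estimated error proportion
    z = norm.ppf(1 - alpha / 2)                 # two-sided critical value (1.96 for alpha = 0.05)
    se = sqrt(p_hat * (1 - p_hat) / n)          # standard error from Var(X_i) = p(1 - p)
    return p_hat, (p_hat - z * se, p_hat + z * se)

print(proportion_ci(errors=40, n=569))          # illustrative counts only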

From this perspective, we evaluate whether there are significant differences among the outcomes of different pairs of experiments considering the users’ gender and race. In this sense, we used the two-proportion z-test to assess the presence of biases affecting the face detection process. The null hypothesis (\(H_0\)) states there is no significant difference between proportions (\(p_1 - p_2 = 0\)). On the other hand, the alternative hypothesis (\(H_1\)) states the proportions are not the same; i.e., we reject \(H_0\) at a significance level \(\alpha = 0.05\) when \(p_1-p_2>0\). Equation 9 defines the statistic used in our test.

$$\begin{aligned} Z=\dfrac{p_1-p_2}{\sqrt{\dfrac{p\,(1-p)}{n_1} + \dfrac{p\,(1-p)}{n_2}}}, \quad p=\dfrac{n_1\, p_1+n_2\, p_2}{n_1+n_2} \end{aligned}$$
(9)
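A minimal sketch of Eq. 9 and the corresponding upper-tailed decision is shown below; the group error counts passed in the example are placeholders, not the values reported in Sect. 4.

from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(err1, n1, err2, n2):
    """Upper-tailed two-proportion z-test (Eq. 9): H0: p1 - p2 = 0 vs. H1: p1 - p2 > 0."""
    p1, p2 = err1 / n1, err2 / n2
    p = (n1 * p1 + n2 * p2) / (n1 + n2)                   # pooled proportion
    z = (p1 - p2) / sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    p_value = 1 - norm.cdf(z)                             # upper tail
    return z, p_value

z, p_value = two_proportion_z_test(err1=45, n1=569, err2=20, n2=438)   # illustrative counts
print(z, p_value, "reject H0" if z > 1.645 else "do not reject H0")    # alpha = 0.05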

As one may notice, we focus our analysis on the upper-tailed test, in which a Z value greater than 1.645 leads to rejecting \(H_0\). By choosing \(\alpha = 0.05\), there is a \(5\%\) probability that the observed difference occurred by chance and the result is, in fact, a type I error (false positive) [3]. In the next section, we show the obtained results and the analyses performed to assess our hypothesis.

4 Experimental Results

Before proceeding with our statistical analyses, we have to confirm that the detectors are suitable for finding faces in images from our database. As previously mentioned, we are processing real-world images collected from passengers while using public transportation in Salvador. The analysis of such images is very challenging due to the dynamic environment: pictures taken inside the buses are affected by low-quality cameras, weather, road conditions, natural passenger movements, and varying illumination, thus producing, for example, underexposed, overexposed, and blurry images.

Firstly, we ran the three detectors on our complete dataset. As a consequence, every detector produced new images with bounding boxes drawn on regions where faces were estimated. Figure 1 illustrates three images along with the bounding boxes produced by the detectors. In this figure, there is also a button called “No clear face”, which is used to remove an image from our analyses. This is especially important when, for example, the image has low quality or the specialists are not completely confident about the most appropriate label. In summary, if a human specialist has any doubt, the image is disregarded to avoid propagating a classification error.

To verify the performance of the selected detectors, the human specialists had two main tasks. The first one was to check the bounding boxes, by choosing one of the following options for the contingency matrix: TP (true positive – the face was correctly detected), FN (false negative – an existing face was not found), FP (false positive – a bounding box was drawn on a region with no face), and TN (true negative – no bounding box was drawn on an image with no face). The second task was to set the gender and race labels for the users. Therefore, the final dataset contains 1,038 images correctly classified by the specialist committee. Figure 2 presents the contingency matrices calculated based on the agreement between the face detectors (rows) and the specialists’ opinion (columns).

Fig. 2.

Contingency matrices for all face detectors.

Considering the resulting contingency matrices, we calculated the validation metrics (see Sect. 3.3) shown in Table 1. According to these results, the best overall result was produced by MTCNN. We emphasize that, as expected, our dataset is imbalanced, presenting more images with faces (positive class). For this reason, we also selected metrics that account for agreement by chance. Concerning the classes based on gender and race, the dataset is balanced, as presented in Sect. 3.1. Since we use pre-trained models published by the original authors, all these images were used as a test fold, whose results are presented in this section.

Table 1. Validation metrics (class distribution: positive – \(97.27\%\), negative – \(2.73\%\)).

After demonstrating the general performance of the detectors, we start assessing our hypothesis, which essentially raises the following question: “Are the face detectors influenced by the users’ gender and race?”. To systematically answer this question, we defined the following hypotheses:

  • \(H_0\): the error proportions are the same regardless of the users’ gender and/or race: \(p_1-p_2=0\);

  • \(H_1\): the error proportions are not the same, i.e., there is a difference based on the users’ gender and/or race: \(p_1-p_2 > 0\).

The first step in this analysis was to find the error proportion p and the number n of samples per class, as shown in Table 2. For example, considering this table, we notice the detection error for black users is greater than for white users. However, can this difference be considered significant?

Table 2. Error proportions (%) between classes.

In the next step, considering only images with faces according to the specialists, we computed \(Z_{obs}\), \(\alpha _{obs}\), and the (lower and upper) confidence intervals (CI) of the error proportions for the different combinations of gender and race. In the following tables, results with significant differences are highlighted in bold.

In Table 3, we show the comparison between errors from different combinations of classes, based on the face detection task performed by MTCNN. As one may notice, although Table 2 shows different error rates for all groups, the statistical tests reject the null hypothesis in only three situations: Black \(\times \) White, Black/Woman \(\times \) White/Woman, and Black/Man \(\times \) White/Man. Therefore, considering the alternative hypothesis, the error is greater when the analyzed passengers are black. In this experiment, we noticed no significant difference when the passengers are compared by gender.

Table 3. Test statistic for MTCNN.

Next, we repeated the same analyses on the images processed by YOLO. Regardless of its classification metrics, we decided to assess whether the errors produced by this detector were also influenced by gender and race. According to the results summarized in Table 4, the face detection errors corroborate the results produced by MTCNN. As shown in bold, there are significant differences when race is assessed. Face detection with YOLO also produced worse results when the passengers are black.

Table 4. Test statistic for YOLO.

In our last analysis, we also compared the results produced by ResNet. Although this detector presented the worst results, we decided to perform a close examination of the error proportions it produced for the different classes. The final results, presented in Table 5, also corroborate the previous analyses, in which the error proportions differ significantly when users are compared by race. Moreover, the errors produced by this detector were statistically different when results for black women are compared to those for white men. Apart from the users’ race, we cannot affirm, at least in our scenario and experiments, that there is a significant difference when comparing the users’ gender.

Table 5. Test statistic for ResNet.

5 Conclusion

In this work, we investigated an ethical issue that has been calling the attention of several AI researchers and practitioners. In 2020, a letter signed by several mathematicians explicitly pointed out the risks of using AI-based technologies without carefully examining whether their decision processes are implicitly working as expected [5]. The mathematics community does not criticize the advances of AI themselves, which are important and necessary to the well-being of our society, as shown by contributions in areas such as medicine and ecology. However, can we work in this direction without propagating to our algorithms (and data) historical errors such as structural and explicit racism?

This situation is strongly relevant in the Computer Vision area, especially in tasks related to face recognition. Are our data and algorithms performing as expected regardless of the users’ race? Can an algorithm produce more errors when black people are under analysis? As discussed by Nick Bostrom in the preface of his book on superintelligence with “The Unfinished Fable of the Sparrows”, an important part of AI progress is carried out by researchers interested in studying the risks of deploying these technologies incautiously [1].

This issue motivated us during the design of a solution to identify fraud in the usage of benefit cards in public transportation in Salvador (Brazil). To identify whether or not the cards were being used by their actual owners, we created a fraud detection system that uses Artificial Neural Networks to identify when the users’ pictures taken inside the bus differ from their registered profiles.

The experimental setup created in this work was based on applying models published to detect faces in images to our dataset, which is drawn from the millions of pictures taken daily inside the buses. In summary, we evaluated the errors produced by MTCNN, YOLO, and ResNet by comparing different groups according to the users’ gender and race. According to our analyses, we noticed significant differences when the passengers’ race is analyzed. All results consistently showed that detection errors were greater for black passengers. Moreover, we noticed this problem also happened when we subsampled the data by the users’ gender: errors for black women and men were greater than those obtained for white women and men, respectively.

The bias found in our experiments strongly emphasizes the need for better models to detect fraud before deploying our system. The main limitation of our work is the use of pre-trained models, although this is a common practice in commercial applications, especially when the availability of labeled data is limited. Finally, we recall that the results obtained in this work show this undesired bias in our specific application and illustrate the risks that similar applications may face in real-world scenarios. Thus, as future work, we plan to increase our dataset with more gender and race labels, aiming at retraining the face detectors and updating their parameters and hyper-parameters. Then, we will be able to draw stronger conclusions about whether the problem happens due to the ANN architectures, the data used in the test phase, or the need to retrain the models with more appropriate (fair) datasets. In the future, we also plan to evaluate other user labels.