Balancing Biases and Preserving Privacy on Balanced Faces in the Wild

There are demographic biases present in current facial recognition (FR) models. To measure these biases across different ethnic and gender subgroups, we introduce our Balanced Faces in the Wild (BFW) dataset. This dataset allows for the characterization of FR performance per subgroup. We found that relying on a single score threshold to differentiate between genuine and imposter sample pairs leads to suboptimal results. Additionally, performance within subgroups often varies significantly from the global average. Therefore, specific error rates only hold for populations that match the validation data. To mitigate imbalanced performances, we propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks. This scheme boosts the average performance and preserves identity information while removing demographic knowledge. Removing demographic knowledge prevents potential biases from affecting decision-making and protects privacy. We explore the proposed method and demonstrate that subgroup classifiers can no longer learn from features projected using our domain adaptation scheme. For access to the source code and data, please visit https://github.com/visionjo/facerec-bias-bfw.


I. INTRODUCTION
As machine learning (ML) becomes more integrated into our daily lives, interest in concepts like bias, fairness, and the ethical implications of using this technology grows [1]-[3]. As we rely more on ML to assist with everyday tasks, it becomes increasingly critical to address biased and unfair algorithms [4], [5]. Systems deployed for sensitive tasks require thorough examination, with biometrics [6], and facial recognition (FR) in particular, being a prime example. We propose a test bed to evaluate FR fairly.
Researchers and practitioners often use convolutional neural networks (CNNs) or transformer models to map face features to a vector. These FR models are typically trained on data to learn to encode faces in a space where those of the same identity are minimally separated while those of different identities are furthest apart. The FR model then extracts features (Fig. 1) to store in a database with labels.
Subsequently, during inference, one compares the features of a test face to faces stored during enrollment to determine a match. An optimal threshold (i.e., θ) serves as the decision boundary to compare the similarity score s of a pair of unseen faces to predict the pair-wise class label (i.e., genuine or imposter). Ideally, the face features of true pairs yield scores that satisfy the criterion s ≥ θ [7]-[10]; θ serves as a trade-off parameter to control the false-positive rate (FPR) and false-negative rate (FNR) (Fig. 2). The adverse effects of a single threshold are threefold: 1) Evaluation sets typically have imbalanced distributions similar to the training set, so the majority dominates the performance rating. 2) Score ranges for genuine pairs vary across demographics; true face pairs from the underrepresented subgroups tend to score lower. 3) The optimal scores per subgroup vary, meaning a single, global threshold is only optimal if set and tested on a single sub-population (Fig. 3). Throughout this work, we use the term global to refer to values averaged across all demographics, in contrast to subgroup-specific, which refers to a particular demographic.
To address the issue of imbalanced data (i.e., item 1 above), we propose the Balanced Faces in the Wild (BFW) dataset to measure subgroup biases in FR. BFW provides a fair evaluation of FR systems by considering demographic-specific performance. We can now understand the performance gap in facial features with state-of-the-art (SOTA) CNNs. We then suggest a mechanism to eliminate prejudice and level out performance ratings across demographics while improving accuracy. Specifically, we preserve identity information and remove demographic knowledge from the features. This feature adaptation scheme addresses items 2 and 3.
arXiv:2103.09118v5 [cs.CV] 5 Jul 2023
A byproduct of the proposed scheme is privacy preservation. The learned features in the lower-dimensional space contain less knowledge of the subgroups, disallowing the extraction of ethnicity and gender information from the enrolled facial features. Protecting user data is valuable and reduces the chance of malicious or even unintended bias [11].
The contributions are listed as follows.
• We demonstrate a bias in CNNs with our BFW dataset. We added another attribute representing facial skin tones. The raw data, face embeddings, and meta-data are on IEEE DataPort (footnote 1).
• We propose a feature learning scheme that debiases face features and balances performances across subgroups, increasing performance.
• We minimize subgroup information in features: a byproduct of the proposed work is the reduction of subgroup-based knowledge to address privacy concerns and avoid other potential biases.
• We draw attention to the challenging samples the suggested de-biasing scheme overcomes.
The paper is organized as follows. We review related work (Section II). Then, we go over constructing the BFW database (Section III). We introduce the proposed method (Section IV). Then, the settings and results of the experiments are covered (Section V). Finally, we discuss the next steps (Section VI).

A. Bias in facial recognition
Automatic facial recognition (FR) based on deep learning dates back to 2014, when Taigman et al. [12] first proposed using a CNN for recognition, which has seen significant improvements nearly annually. The SOTA continues to improve, with layer and network types evolving (e.g., transformers [13]). For more in-depth surveys, see [14], [15], and [16].
1 https://ieee-dataport.org/documents/balanced-faces-wild
Recent research focuses on reducing bias in automatic face understanding [17], [18]. Some focus on the changes in performance (i.e., biases) of soft attributes like gender [19], ethnicity, age [6], or other traits [20]. Others explore methods to measure bias using generative modeling. For example, Balakrishnan et al. [21] train a generator to manipulate the latent space to alter attributes such as skin tone, hair length, and hair color. Georgopoulos et al. [22] generate faces of various ages to augment training data. Muthukumar et al. [23] study the effects of color tone on gender classification by recoloring faces in images across a spectrum, i.e., from lighter to darker. Others use knowledge distillation (c.f., [24], [25]). Gong et al. [26] base subgroups on facial skin tones. This paper focuses on the common one-to-one facial verification (FV) setting.
Some researchers aim to characterize the amount of bias at the system level, including gender [35]-[37], ethnicity [3], age [38], or multiple attributes [39]-[43]. A recent European Conference on Computer Vision (ECCV) challenge encourages researchers to tackle bias in ethnicity, gender, age, pose, and even with/without sunglasses [44]. Other methods add modalities (e.g., profile information) to mitigate bias [45], [46]. Still other works focus on the measurement of biases in FR at different levels, including the system [35], [40], templates [47], scores [48], and pre-trained models [49]. Wang et al. [50] introduce a reinforcement learning-based race-balanced network to find optimal margins for non-Caucasians. Law et al. [51] leverage HCI technology to detect bias in a semi-supervised manner by having a human in the loop. These works target the image space, whereas we target the features without the original model or face image. Terhorst et al. [48] and Cavazos et al. [52] recognize the same challenges shown in this work: the score sensitivities that arise when a matching function is applied to a pair of features vary across demographics. These works normalize the scores to bring demographic-specific sensitivities to the average, an issue we highlight in [53].

B. Imbalanced data and data problems in FR
To effectively produce fair data distributions, one can under- or over-sample subgroups [54]. Alternatively, one can adjust learning costs per subgroup [55], [56]. Rudd et al. [57] propose the mixed objective optimization network (MOON) architecture that learns to classify attributes of faces by treating each subgroup as a multi-task attribute (i.e., a task per attribute). Cluster-based Large Margin Local Embedding (CLMLE) [58] samples in the feature space to regularize the models at the decision boundaries of underrepresented classes. Wang et al. [59] change images by masking out aspects of humans that cause "leakage" of gender information to avoid biasing a set of labels describing a scene to a specific gender. Less recent solutions can be found in reviews [60]-[62].
We study demographics' effect on FV by assessing demographic-specific performances. Our BFW data resource allows us to analyze existing SOTA deep CNNs on different subgroups. We provide practical insights showing that experiments often report misleading performance ratings that depend on demographics. Many researchers release large FR datasets to match the capacity of modern-day deep models [66]-[69]. More recently, several have focused on balancing demographics in FR data [3], [70]-[72]. Diversity in Faces (DiF) came first [72], but without identity labels. DiF is no longer available for download. Others released data with demographics balanced and omitted identity labels [3], [71]. Hupont et al. [70] propose DemogPairs, balanced across six subgroups of 600 identities from CASIA-WebFace (CASIA-W) [73], VGG [67], and VGG2 [68]. Our BFW includes eight subgroups (i.e., we split the African/Indian subgroup used in DemogPairs into separate groups, Black and Indian), 800 identities, and more face samples per identity. We only use the VGG datasets, not CASIA-W, to test a broader range of models (i.e., even models trained on CASIA-W). Because public resources are used to train existing models, we built BFW using only VGG2 to minimize overlap between train and test data. Table II compares our data with the others.
Khan et al. [74] study the limitations of face datasets with racial categories, including BFW. The authors note the challenges of creating precise definitions of subgroups and measuring the self-consistency of face datasets and cross-dataset consistency. (BFW is approximately as consistent as other datasets.) Khan et al. also note the challenges of racial types in science and the problems that arise without them (e.g., generative models generating only Caucasian faces).

C. Feature alignment / Domain adaptation
Domain adaptation (DA) employs labeled data from the source domain to generalize well to the typically label-scarce target domain, which relieves the high costs and burden of labeling data by reducing the required amount [75]- [77]. We can roughly classify DA as a semi-supervised DA [77]- [79] or an unsupervised one [80], according to access to target labels. The crucial challenge toward DA is the distribution shift of features across domains (i.e., domain gap), which violates the distribution-sharing assumption of conventional machine learning problems. In our case, the domains are the subgroups (Table III).
Some feature alignment (FA) methods attempt to project the raw data into a shared subspace where certain feature divergences or distances confuse groups. Many develop methods following this paradigm, such as correlation alignment [81], maximum mean discrepancy [82], and geodesic flow kernel [83], [84]. Adversarial domain alignment methods (i.e., DANN [85], ADDA [86]) design a zero-sum game between a domain classifier (i.e., discriminator) and a feature generator. The discriminator cannot differentiate the source and target features once the generator mixes the features of the different domains. More recently, learning well-clustered target features has proven helpful in conditional distribution alignment. DIRT-T [80] and MME [77] use an entropy loss on target features to implicitly group them into multiple clusters in the feature space. This helps keep the discriminative structures through adaptation. By adjusting the sensitivities in true scores, we align the score distributions of the subgroups (Fig. 4).

D. Protecting demographic information in FR
For reasons of privacy and protection, recent attempts remove demographic features from the raw face images [87]-[89]. These works recognize the importance of maintaining identity information in facial features while ridding them of evidence of demographics. Our model inherently does this as part of the target, aiming for the inability to recognize subgroups. Some achieve this using adversarial learning on top of the features via a Minimax filter [90]: maximizing the attribute loss while minimizing the target-task loss. More recently, several treat it as a minimization problem by reversing the gradient of the protected classifier. For instance, Bertran et al. [91] learn a projection that maps images to an embedding space to disallow inference of gender information. Similarly, we also aim to rid the data of demographic knowledge. However, the difference is that we learn to protect demographics in the facial features often stored in place of imagery (i.e., we map from a biased to an unbiased feature space). Ray et al. [92] follow the same path (i.e., map image-to-feature with demographic information protected). Again, we aim to preserve a database of facial features, with no assumption that the initial model or raw images are accessible.
Wu et al. [93], [94] present a method for removing gender information and preserving privacy in videos while maintaining sufficient information for action classification. They achieve this by learning a filter that selectively degrades the video. Their contribution differs from ours in that it operates in the image domain rather than the learned embedding domain. They also target action recognition rather than recognition applications.
In fact, several works aim to hide attribute information in image space. For instance, Othman et al. [95] learn to morph faces to suppress gender and preserve identity information in the image space. Guo et al. [96] map the image to noise by encrypting the photo, such that the encoder decodes the identity without the ability to recognize gender. Ma et al. [97] design protocols for transferring facial features via a cascade of classifiers in their lightweight privacy-preserving adaptive boosting (AdaBoost) framework. Dhar et al. [98] attempt to remove the attribute information from a pre-trained CNN with the help of discriminators and an injected generator layer. However, their approach requires multiple binary discriminators, one corresponding to each attribute. Instead, we apply a single multi-class classifier to fulfill this objective, ensuring dense computation and improving model efficiency.
The proposed method differs from previous work in the underlying data assumption: here, there is access to facial features, not the imagery, which is often the case in production. Images are processed once to reduce computation. Also, face features are compressed representations, making them much less expensive to store and transmit.
BFW provides balanced data across ethnicity (i.e., Asian (A), Black (B), Indian (I), and White (W)) and gender (i.e., Female (F) and Male (M)), yielding eight demographics referred to as subgroups (Fig. 5). As in Table II, BFW has an equal number of subjects per subgroup (i.e., 100 subjects per subgroup) and faces per subject (i.e., 25 faces per subject). Note that the key difference between BFW and DemogPairs is in the additional attributes and the increase in labeled data; the differences from RFW and FairFace are in the identity labels and distributions (Table III).
We built BFW with VGG2 [68] by using classifiers on the list of names and then on the corresponding face data. Specifically, we ran a name-ethnicity classifier [99] to generate the initial list of subject proposals. Then, ethnicity [100] and gender [101] classifiers applied to the corresponding faces further refined the list. Next, we manually validated the list, keeping only the genuine members of the respective subgroup. We then limited the faces for each subject to 25 selected at random. Thus, BFW required minimal human input, having generated the proposal lists by automatic machinery. In summary, four experts in FR manually validated all the data: first, the validation of individuals per subgroup was conducted (i.e., inspect that all subjects belong to the assigned subgroups), and then the faces of the individual (i.e., verify that each face instance belongs to the identity). We only kept the subjects and samples verified as true by all annotators. See our conference paper for additional details [53].
We determined the subgroups of BFW based on physical features most common among the respective subgroup [53]. We can regard this as multiple domains because of the feature distribution mismatch across these subgroups. However, the assumption that a discrete label can describe an individual is imprecise. Still, the assumption allows for a finer-grain analysis of the subgroups and is a step in the right direction. Thus, we refute any claim that our efforts here are the ultimate solution. The data and proposed machinery are merely an attempt to establish a foundation for future work to extend. The two genders for the four ethnic groups make up the eight subgroups of the BFW dataset (Fig. 5). Formally put, the tasks addressed have labels for gender $l_g \in \{F, M\}$ and ethnicity $l_e \in \{A, B, I, W\}$, where the number of subgroups (i.e., demographics) is then $K = |l_g| \cdot |l_e| = 8$.

A. The data subgroups
Kärkkäinen et al. [71] note that physical attributes correlate with the human race, while ethnicity is culturally based. Still, people often use race and ethnicity interchangeably. We refer to the U.S. Census Bureau to choose subgroups. Such labels are oversimplified [103] and are not precisely defined [74]. Despite these limitations, the categories can show value for subgroup analysis of bias in computer vision [53].
Dermatologists diagnose sun exposure risks by manual inspection of the tone of a subject's skin with a label called the Fitzpatrick skin type (FST). Because manual review by multiple experts is needed, such data can be challenging to collect. Merler et al. [104] propose a digital image processing scheme to characterize the skin tones of faces in their dataset, Diversity in Faces (DiF). The authors reference an earlier study that revealed a correlation between the melanin index (MI), measured objectively by reflectance spectrophotometry [105], and the individual typology angle (ITA): a practical measure to categorize skin tones, as the FST can be determined digitally. ITA is calculated from pixels in CIE-Lab color space as follows:
$$\mathrm{ITA} = \arctan\left(\frac{L - 50}{b}\right) \times \frac{180^\circ}{\pi}, \tag{1}$$
where larger lightness L and smaller blue-yellow b yield larger ITAs. We proceed following [104] (and perhaps less like Kinyanjui et al. [106]). We mask out target regions of the skin. However, instead of splitting the face into areas based on detected landmarks, we segment faces, masking out all but the flat areas to avoid shadowing (i.e., omitting the nose, eyes, mouth, and hair). With the pixels transformed from RGB to CIE-Lab, we removed L and b values more than one standard deviation from the respective mean for that face. We further mitigated concerns of outlier pixels by smoothing them out via a mean filter. Finally, the mean of pixel-wise ITA values is calculated for a face (Eq. 1). Fig. 6 shows the distribution of all ITA values. The values line up with expectation: the smaller the ITA (in degrees), the darker the skin tone. Notice that the left tails of the Black, Indian, Asian, and White subgroups go from the densest to the scarcest. Furthermore, the mean (i.e., vertical line in the figure) shifts right for the lighter-toned skin subgroups. There is a significant spread in values within racial groups, partly due to the varied lighting conditions.
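The per-face ITA computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the skin pixels are already segmented and converted to CIE-Lab, it takes only the (L, b) pair of each pixel, and it omits the mean-filter smoothing step. The function names `ita_degrees` and `face_ita` are hypothetical.

```python
import math
from statistics import mean, pstdev


def ita_degrees(L, b):
    """Individual typology angle (in degrees) for one CIE-Lab pixel."""
    # atan2 matches arctan((L - 50) / b) for b > 0 and stays defined at b == 0.
    return math.degrees(math.atan2(L - 50.0, b))


def face_ita(pixels):
    """Mean ITA over skin pixels given as (L, b) pairs, after outlier removal."""
    Ls = [p[0] for p in pixels]
    bs = [p[1] for p in pixels]
    mu_L, sd_L = mean(Ls), pstdev(Ls)
    mu_b, sd_b = mean(bs), pstdev(bs)
    # Keep pixels whose L and b both lie within one standard deviation of the mean.
    kept = [(L, b) for L, b in pixels
            if abs(L - mu_L) <= sd_L and abs(b - mu_b) <= sd_b]
    if not kept:  # assumption: fall back to all pixels if the filter removes everything
        kept = list(pixels)
    return mean(ita_degrees(L, b) for L, b in kept)
```

For example, a pixel with L = 60 and b = 10 gives an ITA of 45 degrees; raising L or lowering b (for L above 50) increases the angle, matching the reading that lighter skin tones map to larger ITA values.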

IV. METHODOLOGY
We first introduce the bias and privacy concerns of facial verification (FV) systems, and then we explain our method for addressing these issues. Specifically, we review the problem statement, the BFW dataset, and the proposed framework.

A. Problem statement
FV systems infer the likelihood that a pair of faces share the same identity. Verification is often solved like traditional facial recognition (FR). Specifically, a model is trained on a set of identities and then used to encode faces (i.e., embed faces). The closeness of the resulting vectors is a single score; typically, cosine similarity is used [107]. The goal is to learn the optimal score that separates valid from false pairs. The threshold is the decision boundary in score space, i.e., the matching function. As demonstrated in our previous work, the optimal threshold changes between subgroups [53]. Our prior solution was to learn a threshold per subgroup, which assumes the subgroup is known. We now aim to project features to a space that simultaneously preserves the identity and removes evidence of the subgroups. As we show, the results are less biased, while demographic privacy is preserved.
1) The matching function: A real-valued similarity score R is associated with a discrete label Y = 1 for genuine pairs (i.e., a true match) or Y = 0 for imposter pairs (i.e., an untrue match). We map the real-valued score to a discrete label by $\hat{Y} = \mathbb{I}\{R > \theta\}$ for some pre-defined threshold θ. We can express this as a matcher d operating as
$$\hat{Y} = \mathbb{I}\{d(\vec{x}_i, \vec{x}_j) > \theta\}, \tag{2}$$
where $\vec{x}_i$ and $\vec{x}_j$ are the face features of the i-th and j-th samples, a conventional scheme in the FR research communities [108]. We use cosine similarity as the matcher in Eq. 2, which produces
$$R = d(\vec{x}_i, \vec{x}_j) = \frac{\vec{x}_i \cdot \vec{x}_j}{\lVert \vec{x}_i \rVert_2 \, \lVert \vec{x}_j \rVert_2}.$$
The decision boundary formed by the threshold θ controls the level of acceptance and rejection. Thus, θ carries a trade-off between sensitivity and specificity. The value of θ depends on the purpose of the system. For instance, in security, there is a need for higher sensitivity (i.e., smaller θ). Specifically, the trade-off involves the FNR, which counts genuine pairs that attempt to pass but are falsely rejected (a Type II Error). Mathematically, it relates by
$$\mathrm{FNR} = \frac{\mathrm{FN}}{P} = \frac{\mathrm{FN}}{\mathrm{FN} + \mathrm{TP}} = 1 - \mathrm{TPR},$$
with positive counts P.
The other error type contributes to the FPR, the Type I Error, which is when an imposter falsely passes:
$$\mathrm{FPR} = \frac{\mathrm{FP}}{N} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = 1 - \mathrm{TNR},$$
where the number of negatives is N, with metrics true-negative (TN), false-positive (FP), true-negative rate (TNR), and FPR. The geometric relationships of the metrics related to the score distributions and the choice of threshold show the trade-offs in error rates (Fig. 2).
The parameter θ determines the error rate on held-out validation data, specific to the use case. Some researchers set it for top performance, while others analyze θ over a range of values to generate plots and assess the trade-offs. The held-out validation and test sets share data distributions, as they come from a single source partitioned into subsets (i.e., train, validation, test). We transfer the decision boundary that maximizes performance as a pin-point (i.e., 1D) decision boundary in score space: a floating-point value in [0, 1].
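To make the matching function and the threshold trade-off concrete, the sketch below scores feature pairs with cosine similarity, applies a threshold θ, and reports the FNR and FPR. It is an illustration under stated assumptions: features are plain Python lists, and the helper names (`cosine_similarity`, `predict`, `error_rates`) are hypothetical rather than taken from the released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict(score, theta):
    """Predicted label: 1 (genuine) if the score exceeds theta, else 0 (imposter)."""
    return 1 if score > theta else 0


def error_rates(scores, labels, theta):
    """Return (FNR, FPR) for pair scores and ground-truth labels (1 = genuine)."""
    preds = [predict(s, theta) for s in scores]
    fn = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 0)
    fp = sum(1 for p, y in zip(preds, labels) if y == 0 and p == 1)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fn / positives, fp / negatives
```

Sweeping θ over the same scores shows the trade-off directly: lowering the threshold accepts more pairs, so the FNR falls while the FPR rises, and raising it does the opposite.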

2) Feature alignment: The tuple $(\mathcal{X}, \mathcal{Y})$ represents domain $\mathcal{D}$, with $\mathcal{X}$ and $\mathcal{Y}$ representing the input feature space and output label space, respectively. FR algorithms aim to learn a mapping function (i.e., a hypothesis) $\eta: \mathcal{X} \to \mathcal{Y}$, assigning vectors a semantic identity label.
Mathematically, we denote the labeled source domain $\mathcal{D}_S$ and the unlabeled target domain $\mathcal{D}_T$ as
$$\mathcal{D}_S = \{(x_i^s, y_i^s)\}_{i=1}^{N_S}, \qquad \mathcal{D}_T = \{x_i^t\}_{i=1}^{N_T},$$
with the sample counts $N_S = |\mathcal{D}_S|$ and $N_T = |\mathcal{D}_T|$, where the i-th sample is $x_i \in \mathbb{R}^d$ and its label is $y_i \in \{1, \ldots, K\}$. We further define tasks $\mathcal{T}_S$ and $\mathcal{T}_T$ for $\mathcal{D}_S$ and $\mathcal{D}_T$, respectively, which specify the exact label type(s) and the K classes of interest. The goal is to learn an objective $\eta_S: \mathcal{X}_S \to \mathcal{Y}_S$ and then transfer it to the target $\mathcal{D}_T$ for $\mathcal{T}_T$. By this, we leverage knowledge from $\mathcal{D}_S$ for $\mathcal{D}_T$ to get $\eta_T$. Since the two domains have different marginal distributions (i.e., $p(x_s) \neq p(x_t)$) and distinct conditional distributions (i.e., $p(y_t \mid x_s) \neq p(y_t \mid x_t)$), a model trained on the labeled source usually performs poorly on the unlabeled target. A standard solution to a domain gap is to learn a model f that aligns the features in a shared subspace such that $p(f(x_s)) \approx p(f(x_t))$.

B. Proposed framework
We used both identity and subgroup labels for the two objectives of the proposed framework (Fig. 7): each input feature $x_i$ has an identity label $y^{id}_i \in \{1, \ldots, I\}$ and a subgroup (attribute) label $y^{att}_i \in \{1, \ldots, K\}$. Hence, we aim to learn a mapping $f_{deb} = M(x, \Theta_M)$ to a lower-dimensional space, $f_{deb} \in \mathbb{R}^{d_2}$, that preserves the identity information of the target via the identity loss $\mathcal{L}_{ID}$. Then, we learn to do so without subgroup information, which we enforce with the loss $\mathcal{L}_{ATT}$. The two losses are defined on $p(y = y^{id}_i \mid x_i)$ and $p(y = y^{att}_i \mid x_i)$, which represent the probability conditioned on the identity and attribute, respectively.
We added $\mathcal{L}_{ATT}$ to de-bias the features and remove the variation in scores previously handled with a variable threshold. Furthermore, a byproduct is that these features preserve identity information without knowledge of subgroups, a critical concern in the privacy and protection of biometric data.
There are three groups of parameters (i.e., $\Theta_M$, $\Theta_{ID}$, and $\Theta_{ATT}$) optimized by the objective (Fig. 7). Both classifiers, the identity $C_{ID}$ and the attribute $C_{ATT}$, are used to find a feature space that remains accurate for identity and not for subgroup by minimizing the empirical risk of $\mathcal{L}_{ID}$ and $\mathcal{L}_{ATT}$. To that end, we use a gradient reversal layer [102] that acts as the identity during the forward pass while inverting the sign of the back-propagated gradient, scaled by a scalar λ, to serve as the adversarial loss during training. Although the proposed learning scheme is simple, it proved effective for both objectives we seek to solve. Next, we illustrate the effectiveness of the results and provide an analysis.
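The behavior of a gradient reversal layer can be sketched without any deep learning framework. The snippet below is a minimal illustration, assuming gradients are plain lists of floats; the class `GradientReversal` and the helper `encoder_gradient` are hypothetical names, and a real implementation would register the same forward and backward rules with an autograd library.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam  # scalar weight of the adversarial (subgroup) signal

    def forward(self, features):
        # Features pass through unchanged to the subgroup classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the encoder is sign-inverted and scaled.
        return [-self.lam * g for g in grad_output]


def encoder_gradient(grad_id, grad_att, lam):
    """Gradient reaching the encoder: identity term plus reversed attribute term."""
    reversed_att = GradientReversal(lam).backward(grad_att)
    return [gi + ga for gi, ga in zip(grad_id, reversed_att)]
```

With this arrangement, the subgroup classifier still descends on its own loss, while the encoder receives the negated gradient and is therefore pushed toward features from which subgroups are harder to predict.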

V. EXPERIMENTS
We include two sets of experiments to show the effectiveness of the proposed method using our balanced BFW [53]. First, we evaluate verification performance. Specifically, we compare the global, subgroup-specific, and baseline results. Then, for the privacy-preserving claim, we compare the performance of models trained on top of de-biased features $f_{deb}$ with those trained on the original features $f_{in}$. We present the problem statement, metrics and settings, and analysis for each. An ablation study shows the performance on LFW [108].

A. Common settings
We use Arcface (i.e., ResNet-34) as the baseline (i.e., $f_{in}$) [9]. MS1M [66] was the training set, with about 5.8 million faces for 85,000 subjects. We prepared the faces using MTCNN [109] to detect five facial landmarks. We then applied a similarity transformation to align the face by the five detected landmarks, from which we cropped and resized each to 96×112. The RGB values (i.e., pixel values in [0, 255]) were normalized by centering about 0 (i.e., subtracting 127.5) and then standardizing (i.e., dividing by 128); features were later L2 normalized [110]. The batch size was 200, and we used an SGD optimizer with a momentum of 0.9 and a weight decay of 5e-4; the learning rate started at 0.1 and was reduced by a factor of 10 twice when the error leveled. We chose these settings based on Arcface being among the best-performing FR deep models. Off-the-shelf CNNs are typical solutions implemented in systems using FR technology in research and practice.
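The normalization steps above reduce to a few lines of arithmetic. The following sketch uses hypothetical helper names and assumes that "factored by 10" means dividing the learning rate by 10 at each plateau; detection, alignment, and cropping are out of scope here.

```python
import math


def normalize_pixel(value):
    """Map a pixel value in [0, 255] to roughly [-1, 1]: subtract 127.5, divide by 128."""
    return (value - 127.5) / 128.0


def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def learning_rate(num_drops, base_lr=0.1, factor=10.0):
    """Learning rate after num_drops reductions (assumed: divide by factor each time)."""
    return base_lr / (factor ** num_drops)
```

Under these assumptions, pixel values 0 and 255 map to about -0.996 and 0.996, and the learning rate steps through 0.1, 0.01, and 0.001 over the two reductions.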
We used our BFW dataset (Section III): the de-bias and privacy-based experiments use the pre-defined five folds; the ablation study on LFW uses all BFW data to train M (Fig. 7). As mentioned, we built BFW using data of VGG2, and there is no overlap between CASIA-Webface and LFW used to train the face encoder.

B. De-bias experiment
The percent error is a typical metric for FR, as specialized figures (e.g., plots and confusion matrices) are difficult for nontechnical audiences to interpret. Specifically, global ratings (e.g., average) are more practical to comprehend. A prime example is to share the error rate per number correctly predicted (e.g., falsely classify one in ten thousand), for instance, claiming that a system predicts an FP in 1 of 10,000 predictions. However, such an approximation can be hazardous, for it is inherently misleading. To show this, we ask the following questions. Does this hold for different demographics? Does this rating depend on the faces: does it carry for all males and females? Is setting our system to the desired FPR fair regardless of population demographics (i.e., subgroups)? The questions above were central to our previous work [53]. We found the answer clear: no, the reported FPR does not hold when analyzed per subgroup. When the score threshold is fixed for all subgroups, the subgroup-specific FPR values deviate drastically from the global average. Demographic-specific thresholds, which assume that demographic information is known in advance, proved to mitigate the problem. However, prior knowledge of demographics, although plausible (e.g., identifying a known subject on a blacklist), is a strong assumption that limits the practical uses for which such a system could be deployed.
Fig. 9: TPR at a FPR. The last column of AF shows how the TPR scores for the global (G) threshold, privacy-preserving (P-P) features (i.e., proposed), and subgroup-specific (S-S) threshold (i.e., baseline) go from darkest to lightest (labeled in the last column of AF). Higher is better (9a). The spread of G scores across subgroups is larger than that of S-S scores, as shown clearly in (9b), which visualizes the left column in (9a).
To extend our prior work, we propose a de-biasing scheme to reduce the differences between the global and subgroup-specific results. We set out to claim subgroup-specific error rates to be fair across all involved demographics.
1) Metrics and settings: TPR and FPR are used to examine the trade-off in confusion dependent on the choice of threshold discussed earlier. Specifically, we look at subgroup-specific TPR scores at the desired FPR. We compute the following metric, the percent difference of the global and subgroup-specific FPR values (i.e., an average score is targeted) at a threshold l. So we ask, "How do the different subgroups compare to the average?" Specifically,
$$\%\,\mathrm{Error}_l = 100\% \times \frac{\mathrm{FPR}_{\text{subgroup}} - \mathrm{FPR}_{\text{global}}}{\mathrm{FPR}_{\text{global}}}.$$
The global results are the results averaged across subgroups. Then, the subgroup-specific results, which differ meaningfully from the mean result (i.e., the global results), are analyzed independently per subgroup. Hence, there is a gap between global and subgroup-specific results, which we show in Fig. 8 using the percent error (i.e., 100% * (subgroup − global) / global). Note that the percent error is negative when global > subgroup (i.e., subgroup-specific results are inferior).
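The percent-error metric is simple enough to sketch directly. The code below is an illustration of the stated formula only, with hypothetical function names; it takes the global value to be the unweighted mean over subgroups, as described in the text, and any input values used with it are for the caller to supply.

```python
def percent_difference(subgroup_value, global_value):
    """Percent error of a subgroup value relative to the global value."""
    return 100.0 * (subgroup_value - global_value) / global_value


def subgroup_gaps(per_subgroup):
    """Percent error of each subgroup against the mean over all subgroups."""
    global_value = sum(per_subgroup.values()) / len(per_subgroup)
    return {name: percent_difference(value, global_value)
            for name, value in per_subgroup.items()}
```

A subgroup value at three quarters of the global value yields -25%, and a value equal to the global average yields 0%, so the sign shows on which side of the average a subgroup falls.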
2) Analysis: The proposed method balances the results while significantly boosting the TPR at a given FPR. The reduced percent difference between global and subgroup-specific FPR scores leads to fairer representation, especially at high FPR. Fig. 9 shows the distribution of TPR for the baseline (global), proposed, and the optimal threshold (per subgroup) at FPR = 0.3 (i.e., the first column of the table above represented as a box-plot). Note that the standard deviation of TPR using the baseline approach is high, which we mitigate using the proposed scheme. The proposed method has thus boosted performance: it improved the rating and reduced the variances.
We can interpret Fig. 8 as a practical use case. A threshold is set to yield a specific FPR (i.e., how often an FP is expected). The far-right (i.e., 1e-4) claims 1 in 10,000 is incorrectly matched. Again, a verification system is set via a trade-off threshold (i.e., θ) that sets sensitivity: decreasing the score threshold increases the FPR (Fig. 2). The direction (i.e., ±) represents whether the difference is an improvement. A negative %-difference shows a drop in performance compared to the global result. For instance, AF with a -25% difference for the baseline at an average FPR of 1 in 10,000 implies that if the population of samples comprises only AF subjects, then the FPR for the chosen θ for the claim of 1 in 10,000 would be 1 in 7,500. A consumer expecting a FPR would only match this value when the sample population has the same distribution in samples per subgroup as the validation data for which θ was found.
We can remove the percent difference via an optimal threshold. However, this assumes that subgroup-specific thresholds can be determined from validation sets separated by subgroup. Also, the optimal solution assumes prior knowledge of the subgroup to which the sample of interest belongs at test time. Although the method served as a proof of concept, both assumptions are impractical for most use cases. Hence, the proposed feature transformations instead reduce the percent differences relative to the original features (i.e., the baseline). Fig. 10 shows several hard positives and negatives incorrectly matched by the baseline but correctly identified by the proposed method. These samples had scores closest to the global threshold (i.e., score boundary). Notice that at least one face per pair is low-resolution; extreme pose differences between the faces are also common. The proposed scheme overcomes these challenges: mitigating bias boosts results, and several pairs change from being falsely rejected to being correctly accepted.
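The per-subgroup optimal threshold discussed above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the released implementation: the helper names are hypothetical, a pair is assumed to be accepted when its score is at or above the threshold, and each subgroup's threshold is chosen from its own impostor scores so that its FPR does not exceed the target. The sketch also makes the two assumptions from the text explicit, since it needs validation scores separated by subgroup and a subgroup tag for every pair.

```python
import numpy as np

def threshold_for_fpr(impostor_scores, target_fpr):
    """Lowest threshold at which at most target_fpr of impostors are accepted."""
    ranked = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]  # high to low
    allowed = int(np.floor(target_fpr * ranked.size))  # false positives tolerated
    if allowed >= ranked.size:
        return float(ranked[-1])  # every impostor may be accepted
    # Step just above the first score that must be rejected.
    return float(np.nextafter(ranked[allowed], np.inf))

def per_subgroup_thresholds(scores, labels, groups, target_fpr):
    """One threshold per subgroup, each fit on that subgroup's impostor pairs."""
    return {
        g: threshold_for_fpr(scores[(groups == g) & (labels == 0)], target_fpr)
        for g in np.unique(groups)
    }
```

Because each subgroup receives its own threshold, the per-subgroup FPR stays at or below the target by construction, which is why this setting acts as an upper bound that is hard to deploy in practice.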

C. Privacy-preserving experiment
We aim to preserve identity information while de-biasing facial features, as shown in the prior experiment. We use a reverse gradient when training the subgroup branch to force the process to penalize the subgroup classifier when it is correct. Another benefit of the proposed de-biasing scheme is that it rids the facial features of demographic information, which is useful for privacy and protection problems. Ideally, face features, often the only representation of face information available at the system level, will not include attribute information like gender or ethnicity, because we prohibit the subgroup classifiers from learning it. To show how much subgroup information was removed, we train a multi-layered perceptron (MLP) to classify subgroups on top of the features. We can then measure the amount of information present in the face representation [40].
The MLP comprises three fully connected (FC) layers (i.e., sizes 512, 512, and 256) and the output FC layer (i.e., size 8, one per subgroup), implemented in Keras. The first three layers were separated by ReLU activation and dropout [111] (i.e., probability of 0.5), while only dropout (again, probability of 0.5) was placed before the output softmax layer. A categorical cross-entropy loss with Adam [112], set with a 0.001 learning rate, was used to train.
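A minimal Keras sketch of the subgroup classifier described above is given below. It is one literal reading of the description rather than the released model: ReLU and dropout follow the first two FC layers, only dropout follows the third, and the input feature dimension is left as a parameter because it is not stated here. The function name `build_subgroup_classifier` and the `accuracy` metric are assumptions added for illustration.

```python
import tensorflow as tf

def build_subgroup_classifier(feature_dim, num_subgroups=8, dropout_rate=0.5):
    """MLP with FC layers of 512, 512, and 256 units and a softmax output."""
    layers = tf.keras.layers
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(feature_dim,)),  # feature_dim is an assumption
        layers.Dense(512, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(512, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(256),                     # only dropout before the output
        layers.Dropout(dropout_rate),
        layers.Dense(num_subgroups, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="categorical_crossentropy",       # expects one-hot subgroup labels
        metrics=["accuracy"],
    )
    return model
```

With one-hot subgroup labels, such a model would be trained through `model.fit(features, one_hot_labels, ...)`; batch size and number of epochs are not specified in the text and would need to be chosen separately.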

1) Metrics and settings:
We examine the accuracy of the subgroup classifiers via a confusion matrix. Specifically, we look at how often each subgroup was predicted correctly and, when incorrect, how often it was mistaken for each of the others. The confusion matrix was generated by averaging over the five folds. Note that the top-performing thresholds from the training folds were applied to each test fold for the subgroup classifiers.
Also, we measure precision and recall. Precision is defined as P(l) = TP / (TP + FP), which we average across subgroups l ∈ L. The recall (R) is computed as R(l) = TP / (TP + FN). This complements the confusion by allowing the specificity and sensitivity of the subgroups to be examined. There are inherent trade-offs between P and R. This motivates the F1-score [113], as the harmonic mean of P and R, F1 = 2 · P · R / (P + R).
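To make the definitions above concrete, the following NumPy sketch derives per-subgroup precision, recall, and F1 from a confusion matrix of counts. The convention that rows hold the true subgroup and columns the predicted subgroup, as well as the function name, are assumptions for illustration. A subgroup that is never predicted, or never present, yields a division by zero, which this sketch does not handle.

```python
import numpy as np

def precision_recall_f1(confusion):
    """Per-class P, R, and F1 from a square matrix (rows: true, cols: predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)              # correct predictions per subgroup
    fp = confusion.sum(axis=0) - tp      # predicted as l, but another subgroup
    fn = confusion.sum(axis=1) - tp      # truly l, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Averaging each returned vector (e.g., `precision.mean()`) gives the value averaged across subgroups l ∈ L, and averaging the fold-wise confusion matrices before calling the function mirrors the five-fold averaging described earlier.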

2) Analysis:
We showed the preservation of identity knowledge (Fig. 9), and now we show the other benefit: privacy. The results confirm that the privacy-preserving claim is accurate, leading to a 30% drop in predicting gender and ethnicity from the features (Table IV). Hence, the predictive power for all subgroups dropped significantly. The decrease in performance suffices to claim the predictions are now unreliable. Interestingly, the de-biasing scheme hindered most the subgroups that the baseline favored the most. WM and WF drop the most, while AM and AF drop the least. The same trends in confusion propagate from the baseline to the proposed results (e.g., WM is mostly confused with IM initially and then again with the proposed method). The same applies to cases of different sex. Next, we examine the confusion for the different subgroups before and after de-biasing the face features (Fig. 11). As established, the baseline contains more subgroup knowledge, which a model can learn on top of. When trained and evaluated on BFW, the baseline performs best on F subgroups, which differs from the norm, where M makes up most of the data. The WM is inferior in performance to all subgroups in either case.

D. The privacy model
To check the effectiveness of the proposed method, we train M on the BFW dataset and deploy it on the well-known LFW benchmark. We note that our training dataset is significantly smaller than the one used by SOTA networks trained to achieve high performance on LFW, namely MS1MV2, which contains 5.8 million images of 85,000 identities. Even though we initialize our network with features learned on MS1MV2, we train on a small dataset of 20,000 images of 800 subjects, two orders of magnitude smaller. The current SOTA has 99.8% verification accuracy. In comparison, the proposed scheme reaches its best score of 95.2% after five epochs before dropping off and then leveling out around 81% (Fig. 12). The unbalanced data (i.e., LFW comprises about 85% WM) hinders the benefits of privacy and de-biasing. Furthermore, we optimized M by choosing the epoch with the best performance before the drop-off. Future steps could improve how the proposed approach transfers to unbalanced sets and how the optimal settings are detected.

VI. CONCLUSION
We show a bias for subgroups in facial verification (FV) systems, where scores are converted to decisions via a predefined threshold. We previously introduced a subgroup-specific threshold. Here, we propose a novel approach: learn a lower-dimensional mapping that preserves identity and removes subgroup information, drawing inspiration from feature alignment. The proposed method balances performance across subgroups and boosts accuracy, reducing the difference between subgroup-specific and global performance. Also, as knowledge of subgroups is removed from the features, privacy regarding demographics increases.
The Balanced Faces in the Wild (BFW) data and benchmarks address fairness in the data. Our feature encoder addresses privacy concerns by learning to map faces to a lower dimension that preserves identity and removes subgroup information. BFW is at the forefront of ethical AI [114].
The experimental settings and practices remain an open problem. For instance, gender labels are discrete values (i.e., boolean): an approximation of sexuality best represented as real values [72]. Finer-grained or more specific subgroups could be another improvement (e.g., Indians from North India versus South India, Black Africans versus African Americans, or distinguishing groups in Africa). We intend BFW to serve as a benchmark for existing systems and a foundation for future researchers to extend.