An Experimental Study of Byzantine-Robust Aggregation Schemes in Federated Learning

Byzantine-robust federated learning aims to mitigate Byzantine failures during the federated training process, in which malicious participants may upload arbitrary local updates to the central server to degrade the performance of the global model. In recent years, several robust aggregation schemes have been proposed to defend against malicious updates from Byzantine clients and improve the robustness of federated learning. These solutions were claimed to be Byzantine-robust under certain assumptions. Meanwhile, new attack strategies are emerging that strive to circumvent these defense schemes. However, there is a lack of systematic comparison and empirical study thereof. In this paper, we conduct an experimental study of Byzantine-robust aggregation schemes under different attacks using two popular algorithms in federated learning, FedSGD and FedAvg. We first survey existing Byzantine attack strategies and Byzantine-robust aggregation schemes that aim to defend against Byzantine attacks. We also propose a new scheme, ClippedClustering, to enhance the robustness of a clustering-based scheme by automatically clipping the updates. Then we provide an experimental evaluation of eight aggregation schemes under five different Byzantine attacks. Our results show that these aggregation schemes sustain relatively high accuracy in some cases but are ineffective in others. In particular, our proposed ClippedClustering successfully defends against most attacks under IID local datasets. However, when the local datasets are Non-IID, the performance of all the aggregation schemes decreases significantly. With Non-IID data, some of these aggregation schemes fail even in the complete absence of Byzantine clients. We conclude that the robustness of all the aggregation schemes is limited, highlighting the need for new defense strategies, in particular for Non-IID datasets.

in this context is known as Byzantine failures [5]: some of the participants do not rigorously follow the protocol but upload arbitrary parameters to the central server, for example due to faulty communication [6] or, even worse, due to adversaries who modify the update vectors at will before uploading them to the server [7]. We use the term "Byzantine attack" to refer to attacks in which malicious attackers upload arbitrary updates to the server in order to degrade the overall performance of the global model in FL. In typical FL algorithms (e.g., FedAvg [1]), the server aggregates the uploaded updates by calculating their sample mean and adds the result to the global model. However, it is well known that the result of such an aggregation scheme can be arbitrarily skewed by even a single Byzantine client [8]. The server thus requires Byzantine-robust solutions to defend against malicious clients.
In recent years, a number of Byzantine-robust techniques have been proposed [9]. They can be classified into three categories: redundancy-based schemes, which assign each client redundant updates and use this redundancy to eliminate the effect of Byzantine failures [10], [11], [12], [13]; trust-based schemes, which assume that some of the clients or datasets are trusted and use them for filtering and re-weighting the local model updates [14], [15], [16]; and robust aggregation schemes, which estimate the true update using robust aggregation algorithms [8], [17], [18], [19], [20], [21]. The first category, redundancy-based schemes, in the worst case requires each node to compute Ω(M) times more updates, where M is the number of Byzantine clients [10]. This overhead is prohibitive in settings with large numbers of Byzantine clients. For the second category, trusted clients/datasets are not always available to the server due to user data privacy concerns.

arXiv:2302.07173v1 [cs.LG] 14 Feb 2023
Robust aggregation schemes, in contrast, aggregate the updates efficiently, without requiring trusted clients or datasets. However, typical schemes, including GeoMed [18], Krum [17], TrimmedMean [8], Median [8] and CC [20], often come with limited guarantees of Byzantine robustness (e.g., only establishing convergence to a limit, or only guaranteeing that the output of the aggregation scheme has a positive inner product with the true gradient [17], [22]) and often require other strong assumptions, such as bounded absolute skewness [8]. More importantly, recent studies reveal the vulnerability of some schemes to new attacks. For instance, the A Little Is Enough (ALIE) attack can circumvent TrimmedMean and Krum by taking advantage of empirical variance between the updates of clients if such variance is high enough [23]. The Inner Product Manipulation (IPM) attack poses a significant threat to Median and Krum by manipulating the inner product between the true gradient and the robust aggregated gradients to be negative [24]. Other schemes, such as AutoGM [19] and Clustering [21], were proposed with only empirical evaluations.
These existing aggregation schemes are evaluated using different datasets, attack types, and hyper-parameters. There is a lack of empirical studies that compare different schemes under the same settings. Furthermore, the impact of data heterogeneity on robust schemes is rarely evaluated, as those schemes usually assume that all clients' local data are independent and identically distributed (IID). Therefore, there is a clear need for a comparative experimental study that offers in-depth insight into the performance of the existing Byzantine-robust schemes for FL.
To meet this need, we conduct an experimental study on the Byzantine attack and defense problem in FL based on two well-known algorithms, FedSGD and FedAvg [1], [23]. We first survey existing attack strategies and robust aggregation schemes in the literature. We further propose a new aggregation scheme ClippedClustering to address the weakness of an existing clustering-based scheme. Then we design experiments to evaluate the robustness of eight representative Byzantine-robust aggregation rules by applying five state-of-the-art attacking strategies. Our experimental results show that those aggregation rules sustain relatively high accuracy in some cases. However, they are not effective in all cases. Moreover, when the local datasets are not independent and identically distributed (Non-IID), the capability of all the aggregation rules decreases significantly. With Non-IID data, some of these aggregation rules fail even in the complete absence of Byzantine clients. Furthermore, our proposed scheme performs the best in most attack scenarios when the datasets are IID. From the evaluation, we conclude that existing aggregation rules are insufficient to meet the need for Byzantine robustness, highlighting the demand for new defense strategies in FL, especially with training on Non-IID datasets.
Our key contributions can be summarized as follows:
• We survey existing Byzantine attack strategies to compromise FL, as well as Byzantine-robust aggregation schemes that aim to defend against Byzantine attacks.
• Based on an existing clustering-based aggregation scheme, we propose an enhanced scheme called ClippedClustering, which applies an automatic clipping technique to mitigate the effect of amplified local updates.
• We evaluate eight robust aggregation schemes (including the proposed ClippedClustering) under five representative Byzantine attack strategies. Our experimental results show that the aggregation schemes sustain high accuracy in some cases, but have limited success in other cases, especially in the presence of Non-IID data.
The rest of this paper is organized as follows: Section 2 first formulates the problem of FL and introduces two optimization algorithms. Section 3 then introduces the threat models evaluated in this paper. Subsequently, representative robust aggregation schemes are presented in Section 4. Section 5 presents an adaptive attack on the proposed aggregation scheme, ClippedClustering. Section 6 presents the experiments on robust aggregation schemes, from which some notable findings are uncovered. Finally, we review related work in Section 7 and conclude in Section 8.

FEDERATED LEARNING
In this section, we first formulate the optimization problem of FL. Then we introduce two popular algorithms for solving the FL problem, one is the classic distributed SGD optimization algorithm FedSGD and the other is the famous communication-efficient algorithm FedAvg.

Problem Formulation
In FL, multiple clients collaboratively learn a shared global model using their private datasets in a distributed way, assisted by the coordination of a central server. The goal is to find a parameter vector w that minimizes the following distributed optimization model:

$$\min_{w} F(w) := \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \quad (1)$$

where $K$ is the total number of clients and $n = \sum_{k=1}^{K} n_k$. The local objective $F_k(\cdot)$ can be defined as an empirical risk over local data, i.e., $F_k(w) = \frac{1}{n_k} \sum_{j \in [n_k]} \ell(w; x_{k,j})$, where $\ell(\cdot\,;\cdot)$ is a user-specified loss function, $x_{k,j}$ is a training sample, and $n_k$ is the size of the training dataset owned by client $k$.
A common assumption in FL is that local training datasets can be unbalanced, i.e., clients can have different numbers of training samples [1]. However, in this paper, we assume that data are balanced, i.e., n 1 = n 2 = · · · = n K to align with most studies that specifically focus on Byzantine robustness [8], [18], [20]. We note that one can get rid of this assumption using the re-scaling trick proposed by Li et al. [25].

Optimizations of Federated Learning
We adopt the two most popular algorithms in Byzantine robust optimization literature to solve Problem (1), i.e., FedSGD and FedAvg.

FedSGD
Stochastic gradient descent (SGD) can be applied naively to the federated optimization problem (1) [1]. As summarized in Algorithm 1 with Option I, at each round of training every client calculates a single mini-batch gradient and uploads it to the server in parallel. The server then aggregates the received gradients and updates the model parameters accordingly. Benefiting from mini-batch stochastic gradient computation, this approach is computationally efficient, but it still requires a very large number of communication rounds to produce good models [1], [26]. In this paper, we refer to this algorithm as FedSGD, also known as sync-SGD in some related work [23], [24].

Algorithm 1 Optimization of Federated Learning
Input: K, T, η_t, w^0
1: for each global round t ∈ [T] do
2:    for each client k ∈ [K] in parallel do
3:       w_k^t ← w^t
   ⋮

FedAvg
A more communication-efficient framework for FL is FedAvg [1]. As summarized in Algorithm 1 with Option II, at each round of training, the server broadcasts its global model to each client. In parallel, the clients run multiple steps of SGD on their own loss functions and send the resulting model to the server. The server then updates its global model according to its aggregation rule and broadcasts the resulting global model to each client to enable the next round of training. Multiple rounds of interactions between the server and clients are required to obtain an accurate shared global model.
As one may notice, general FedAvg-based algorithms usually select a random subset of clients to perform local training, while the algorithm we adopt involves full participation of all clients at each round. This is because all of the aggregation schemes considered in this paper assume that fewer than half of the updates aggregated in each round are malicious. Selecting subsets at random violates this assumption with some probability, as it may by chance select more malicious clients than benign ones. Therefore, full participation is used in this paper.
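To make the training loop concrete, the following is a minimal Python sketch of one FedAvg round with full client participation and a pluggable aggregation function. The toy quadratic loss and all function names here are our own illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def local_sgd(w, data, lr=0.1, steps=5):
    """Run a few SGD steps on a toy quadratic loss ||w - data||^2 / 2."""
    w = w.copy()
    for _ in range(steps):
        grad = w - data          # gradient of the toy loss
        w -= lr * grad
    return w

def fedavg_round(w_global, client_data, aggregate, lr=0.1, steps=5):
    """One FedAvg round with full participation: clients train locally,
    upload their model deltas, and the server applies the aggregated delta."""
    updates = [local_sgd(w_global, d, lr, steps) - w_global for d in client_data]
    return w_global + aggregate(np.stack(updates))

mean_agg = lambda U: U.mean(axis=0)              # the (non-robust) Mean scheme
w = np.zeros(2)
data = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]  # two benign clients
for _ in range(50):
    w = fedavg_round(w, data, mean_agg)
print(np.round(w, 2))  # → [2. 2.], the average of the clients' optima
```

Any robust scheme from Section 4 can be dropped in for `mean_agg` without changing the round structure.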

Update aggregation
In both FedSGD and FedAvg, the server aggregates the received updates and uses the result of the aggregation to update the global model. A widely used aggregation scheme is the sample Mean of the uploaded updates, i.e.,

$$\bar{\Delta} = \frac{1}{K} \sum_{k=1}^{K} \Delta_k.$$

However, Mean is vulnerable to malicious local updates: its breakdown point is 1/K [27], which means that even a single malicious client can make the resulting global model deviate arbitrarily from the true Mean. In Section 4 we will cover robust aggregation schemes that aim to defend against malicious updates.
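The fragility of Mean is easy to demonstrate numerically. In this small sketch (our own toy numbers), a single arbitrary update drags the sample mean far from the benign values, while the coordinate-wise median barely moves:

```python
import numpy as np

# Nineteen benign updates near the true gradient, one Byzantine outlier.
rng = np.random.default_rng(0)
benign = rng.normal(loc=1.0, scale=0.1, size=(19, 4))
byzantine = np.full((1, 4), 1e6)          # a single arbitrary update
updates = np.vstack([benign, byzantine])

print(updates.mean(axis=0)[0])            # dragged to ~5e4 by one client
print(np.median(updates, axis=0)[0])      # stays near 1.0
```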

THREAT MODELS
In this section, we describe the five threat models of Byzantine attacks we evaluate in this paper. In terms of Byzantine attacks, most existing literature on distributed and federated learning focuses on convergence prevention [7], [19], [21], [24]. As illustrated in Fig. 1, the attackers (known as Byzantine clients) may upload arbitrary parameters to the server in order to degrade the performance of the global model. Thus, Lines 6 and 12 of Algorithm 1 are replaced by the following:

$$\Delta_k^t = *,$$

where $*$ represents arbitrary values.
In this paper, we follow the assumption that the majority of the clients are benign [19], [20], which means we have $\frac{M}{K} < 0.5$, where $M$ is the number of Byzantine clients. We examine five typical attacks in our threat models.

Noise
A straightforward attack is to sample random noise from a distribution (e.g., a Gaussian distribution) and add it to the updates before uploading [19], [28]. For simplicity's sake, the mean and variance of the noise are both 0.1 in our experiments.
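A sketch of this attack under the stated parameters (mean and variance both 0.1; the function name is ours):

```python
import numpy as np

def noise_attack(update, rng, mean=0.1, var=0.1):
    """Add Gaussian noise (mean 0.1, variance 0.1, as in the experiments)
    to a benign update before uploading."""
    return update + rng.normal(mean, np.sqrt(var), size=update.shape)

rng = np.random.default_rng(0)
u = np.zeros(1000)                      # a stand-in benign update
corrupted = noise_attack(u, rng)
# Empirical mean and variance of the added noise are both close to 0.1.
print(abs(corrupted.mean() - 0.1) < 0.05, abs(corrupted.var() - 0.1) < 0.05)
```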

A Little is Enough (ALIE)
In contrast to the random Noise attack, the attackers may craft the noise carefully so as to appear benign and fool the aggregation rules. A Little Is Enough (ALIE) [23] assumes that the benign updates follow a normal distribution. The attackers then take advantage of the high empirical variance between the updates of clients and upload noise within a range that avoids detection.
For each coordinate i ∈ [d], the attackers calculate mean (µ i ) and std (δ i ) over benign updates, and set corrupted updates ∆ i to values in the range (µ i − z max δ i , µ i + z max δ i ), where z max ranges from 0 to 1, and is typically obtained from the Cumulative Standard Normal Function [23].
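A minimal sketch of the ALIE construction, returning the single corrupted update $\mu - z_{\max}\delta$ at one end of the admissible range (variable names are ours):

```python
import numpy as np

def alie_attack(benign_updates, z_max=1.0):
    """A Little Is Enough: per-coordinate mean/std of the benign updates,
    perturbed by at most z_max standard deviations (sketch of [23])."""
    mu = benign_updates.mean(axis=0)
    sigma = benign_updates.std(axis=0)
    return mu - z_max * sigma            # one admissible corrupted update

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(15, 8))
mal = alie_attack(benign)
# The malicious update stays within one std of the empirical mean in every
# coordinate, so per-coordinate defenses may not discard it.
print(np.all(np.abs(mal - benign.mean(axis=0)) <= benign.std(axis=0) + 1e-9))
```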

Inner Product Manipulation (IPM)
The Inner Product Manipulation (IPM) attack [24] seeks to make the inner product between the true mean of the updates and the output of the aggregation scheme negative, so that the loss no longer descends. Assuming that the attackers know the mean of the benign updates, a specific way to perform an IPM attack is

$$\Delta_1 = \cdots = \Delta_M = -\frac{\epsilon}{K - M} \sum_{k=M+1}^{K} \Delta_k,$$

where we assume that the first M clients are malicious and $\epsilon$ is a positive coefficient controlling the magnitude of the malicious updates. Then the Mean becomes

$$\frac{1}{K} \sum_{k=1}^{K} \Delta_k = \frac{K - M - \epsilon M}{K(K - M)} \sum_{k=M+1}^{K} \Delta_k.$$

Note that when $\epsilon < \frac{K}{M} - 1$, IPM does not change the direction of the average over benign updates but only decreases its magnitude, because we have $\frac{K - M - \epsilon M}{K(K - M)} > 0$; the optimization can thus still converge using Mean as an aggregation scheme. However, as we will show in Section 6, such an attack can circumvent the defense of several aggregation schemes and invert the direction of the updates, which heavily damages the global model. On the contrary, when $\epsilon > \frac{K}{M} - 1$, the sign of the Mean is reversed, indicating that the loss will increase if the model is updated using the Mean. In our experiments, we examine both cases by letting $\epsilon = 0.5$ and $\epsilon = 100$, respectively.
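The shrink factor above can be verified numerically. This sketch (with hypothetical function names) reproduces the $\epsilon < K/M - 1$ case:

```python
import numpy as np

def ipm_attack(benign_updates, num_byzantine, eps):
    """Inner Product Manipulation: the M malicious clients all upload
    -eps times the mean of the benign updates (sketch of [24])."""
    mal = -eps * benign_updates.mean(axis=0)
    return np.tile(mal, (num_byzantine, 1))

K, M, eps = 20, 5, 0.5
rng = np.random.default_rng(0)
benign = rng.normal(1.0, 0.1, size=(K - M, 4))
all_updates = np.vstack([ipm_attack(benign, M, eps), benign])

agg = all_updates.mean(axis=0)
true_mean = benign.mean(axis=0)
# eps = 0.5 < K/M - 1 = 3: the direction is preserved, only shrunk.
print(np.dot(agg, true_mean) > 0)                              # True
# The shrink factor matches (K - M - eps*M) / K.
print(np.allclose(agg, (K - M - eps * M) / K * true_mean))     # True
```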

Sign Flipping (SF)
Different from IPM, the Sign Flipping (SF) attackers do not need to know the updates from other clients and simply flip the signs of the gradient [13], [20], which means that the attackers strive to maximize the loss via gradient ascent instead of gradient descent. Specifically, in FedSGD, the clients upload the negative gradients; in FedAvg, the flipping is applied at every local updating step.

Label Flipping (LF)
The aforementioned attacks assume that the attackers have full access to the training process so that they can modify the updates directly. However, such access may be limited, as training APIs are not always open. In that case, the attackers can instead poison the training dataset rather than the update parameters [29]. The Label Flipping (LF) attack simply flips the label of each training sample [7]. Specifically, a label l is flipped to L − l − 1, where L is the number of classes in the classification problem and l = 0, 1, ..., L − 1.
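The flipping rule is a one-liner; this sketch (function name ours) illustrates it for a 10-class problem:

```python
def flip_labels(labels, num_classes):
    """Label Flipping: class l becomes L - l - 1 (sketch of the LF rule)."""
    return [num_classes - l - 1 for l in labels]

print(flip_labels([0, 1, 2, 9], 10))  # → [9, 8, 7, 0]
```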

AGGREGATION SCHEMES FOR EVALUATION
In this section, we survey existing robust aggregation schemes, which represent state-of-the-art methods in the literature. We then propose a new scheme, ClippedClustering, which addresses the weakness of the clustering-based scheme. Finally, we provide a taxonomy of the eight aggregation schemes evaluated in our experiments.
All aggregation schemes considered in this paper operate on each round separately. For the sake of readability, we omit the round index t in the following sections.

Krum
Krum [17] strives to find the local model update that is closest to its K − M − 2 nearest neighbours with respect to squared Euclidean distance, which can be expressed as:

$$\mathrm{Krum}(\Delta_1, \ldots, \Delta_K) = \Delta_{i^*}, \qquad i^* = \arg\min_{i} \sum_{i \to j} \|\Delta_i - \Delta_j\|^2,$$

where $i \to j$ ranges over the indices of the $K - M - 2$ nearest neighbours of $\Delta_i$ measured by squared Euclidean distance; recall that $K$ is the total number of clients and $M$ is the number of malicious clients.
Under the FedSGD framework, Krum was proven to converge under the important assumption that $c_1 \sigma < \|g\|$, where $c_1$ is a constant factor depending on the number of malicious clients and the dimension of the model parameters, $\sigma$ bounds the variance of the updates, and $g$ is the expectation of the updates.
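Krum admits a compact implementation; the following unoptimized sketch (our own illustration) selects a benign update in the presence of two large outliers:

```python
import numpy as np

def krum(updates, num_byzantine):
    """Krum [17]: pick the update whose summed squared distance to its
    K - M - 2 nearest neighbours is smallest."""
    K = len(updates)
    n_neighbors = K - num_byzantine - 2
    dists = np.linalg.norm(updates[:, None] - updates[None, :], axis=-1) ** 2
    scores = []
    for i in range(K):
        d = np.sort(np.delete(dists[i], i))   # distances to the other updates
        scores.append(d[:n_neighbors].sum())
    return updates[int(np.argmin(scores))]

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 0.1, size=(8, 3))
byz = np.full((2, 3), 100.0)                  # two large outliers
chosen = krum(np.vstack([benign, byz]), num_byzantine=2)
print(np.linalg.norm(chosen) < 1.0)           # True: a benign update is selected
```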

GeoMed
The Geometric Median (GeoMed) [18], [30] scheme aims to find a vector that minimizes the sum of its Euclidean distances to all the update vectors:

$$\mathrm{GeoMed}(\Delta_1, \ldots, \Delta_K) = \arg\min_{z} \sum_{k=1}^{K} \|z - \Delta_k\|.$$

Although there is no closed-form solution to the GeoMed problem, a $(1+\epsilon)$-approximate solution can be computed in nearly linear time [31]. Similar to Krum, GeoMed was also proven to converge under the FedSGD framework, with the assumption that $c_2 \sigma < \|g\|$, where $c_2$ is another constant factor that differs from $c_1$.
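For illustration, the geometric median can be approximated with Weiszfeld's classical iteration; this is a simplification we introduce here, not the near-linear-time solver of [31]:

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    """Approximate GeoMed with Weiszfeld's iteration: repeatedly re-weight
    points by the inverse of their distance to the current estimate."""
    z = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - z, axis=1)
        w = 1.0 / np.maximum(d, eps)
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])
gm = geometric_median(pts)
# Unlike the mean (~[25, 25]), the geometric median stays near the cluster.
print(np.linalg.norm(gm) < 2.0)   # True
```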

AutoGM
Auto-weighted Geometric Median (AutoGM) [19] is a generalized version of GeoMed. AutoGM aggregates the updates by solving an auto-weighted geometric median problem that jointly estimates the aggregate and a weight vector α over the updates, where λ is a user-specified hyper-parameter that controls the smoothness of α.
The key idea of optimizing AutoGM is to divide the problem into two parts, i.e., one subproblem for estimating the weighted GeoMed, and the other subproblem for weighting the importance of each point. Then, we can minimize the objective iteratively with respect to one variable each time while fixing the other one [19].

Median
Median [8] is defined as the coordinate-wise median of the given set of updates, i.e., for each coordinate $i \in [d]$,

$$\mathrm{Median}(\Delta_1, \ldots, \Delta_K)_i = \mathrm{median}(\Delta_{1,i}, \ldots, \Delta_{K,i}),$$

where median is the usual (one-dimensional) median.
When using the FedSGD framework, the robustness of the Median scheme is based on the assumptions that the gradient of the loss function has bounded variance, and each coordinate of the gradient has coordinate-wise bounded absolute skewness [8].

TrimmedMean
The TrimmedMean [8] aggregation scheme computes the coordinate-wise trimmed average of the model updates, which can be expressed as: for each coordinate $i \in [d]$,

$$\mathrm{TrimmedMean}(\Delta_1, \ldots, \Delta_K)_i = \frac{1}{(1 - 2\beta)K} \sum_{x \in U_i} x,$$

where $U_i$ is the subset of $\{\Delta_{1,i}, \ldots, \Delta_{K,i}\}$ obtained by removing the largest and smallest $\beta$ fraction of its elements.
In addition to the aforementioned assumptions for Median, the robustness of TrimmedMean relies on one stronger assumption that all the moments of the derivatives of the loss function are bounded [8].
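The coordinate-wise trimming can be sketched in a few lines (an illustration under our own toy numbers):

```python
import numpy as np

def trimmed_mean(updates, beta):
    """Coordinate-wise trimmed mean [8]: drop the largest and smallest
    beta fraction of values in every coordinate, then average the rest."""
    K = updates.shape[0]
    b = int(beta * K)
    sorted_u = np.sort(updates, axis=0)
    return sorted_u[b:K - b].mean(axis=0)

updates = np.array([[1.0], [1.1], [0.9], [1.0], [-50.0], [60.0]])
print(trimmed_mean(updates, beta=1/6))   # → [1.], ignoring -50 and 60
```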

Centered Clipping (CC)
Centered Clipping (CC) [20] iteratively clips the updates around a center while updating the center accordingly. For $l \geq 0$, CC computes

$$\Delta^{l+1} = \Delta^{l} + \frac{1}{K} \sum_{k=1}^{K} \left(\Delta_k - \Delta^{l}\right) \min\left(1, \frac{\tau_l}{\|\Delta_k - \Delta^{l}\|}\right),$$

where $\Delta^{0}$ is assigned the aggregated update from the previous round.
Karimireddy et al. [20] proved the robustness of the CC scheme when the variance of the updates is bounded and $\frac{M}{K} \leq 0.15$.
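The CC iteration can be sketched as follows; for simplicity we keep $\tau_l$ constant across iterations, which is our own simplification:

```python
import numpy as np

def centered_clipping(updates, v0, tau=10.0, iters=3):
    """Centered Clipping [20]: repeatedly clip updates to a ball of radius
    tau around the current center v (initialised with last round's output)."""
    v = v0.copy()
    for _ in range(iters):
        diffs = updates - v
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + (diffs * scale).mean(axis=0)
    return v

rng = np.random.default_rng(0)
benign = rng.normal(1.0, 0.1, size=(9, 4))
byz = np.full((1, 4), 1e6)                    # one huge malicious update
v = centered_clipping(np.vstack([benign, byz]), v0=np.zeros(4))
print(np.all(np.abs(v - 1.0) < 2.0))          # True: outlier influence bounded
```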

Clustering
The Clustering aggregation scheme [21], [32] first calculates the pairwise cosine similarities between the parameter updates, i.e.,

$$c_{ij} = \frac{\langle \Delta_i, \Delta_j \rangle}{\|\Delta_i\| \, \|\Delta_j\|},$$

then separates the client population into two groups based on these similarities using agglomerative clustering with average linkage. Finally, it aggregates the updates in the larger group using Mean.
Despite the lack of theoretical guarantee of robustness, Clustering achieves superior robustness in some cases, as we will show in Section 6. However, an obvious drawback of clustering using cosine similarities is that it only considers the relative directions, ignoring the magnitude of each vector. The attackers thus can fool the clustering scheme by simply amplifying the updates without changing their directions. As a consequence, the resulting updates added to the parameters will make the model jump over the minima and prevent the convergence of the optimization without being detected.

ClippedClustering
We enhance the robustness of the aforementioned Clustering aggregation scheme by clipping all the updates before clustering, i.e.,

$$\hat{\Delta}_k = \Delta_k \cdot \min\left(1, \frac{\tau}{\|\Delta_k\|}\right).$$

Here, τ is a clipping threshold determined by the server. Note that this is so-called clip by norm, not clip by value (where individual entries of the update vectors are clipped if they exceed a preset value): the entire update is scaled down if its norm exceeds the threshold τ. We thus place a maximum on the magnitude of each update vector during training, preventing the attackers from amplifying updates in the same direction; if the norm of an update is below the threshold τ, the update is unaffected. Inspired by [33], we design an automatic clipping strategy to defend against amplified malicious updates that the naive cosine similarity-based clustering scheme cannot handle well. Specifically, we set the clipping threshold based on the statistics of the historical norms of the updates uploaded during training, i.e., we save the update norms up to the current iteration and automatically set τ to the 50th percentile (i.e., the median) of this history.
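The clip-by-norm step and the median-based automatic threshold can be sketched as follows (class and function names are ours, not from the paper's implementation):

```python
import numpy as np

def clip_by_norm(update, tau):
    """Scale the whole update down if its L2 norm exceeds tau
    (clip by norm, not clip by value)."""
    norm = np.linalg.norm(update)
    return update * min(1.0, tau / max(norm, 1e-12))

class MedianNormClipper:
    """Sketch of the automatic clipping in ClippedClustering: tau is the
    median of all update norms observed so far during training."""
    def __init__(self):
        self.history = []

    def __call__(self, updates):
        self.history.extend(np.linalg.norm(u) for u in updates)
        tau = float(np.median(self.history))
        return [clip_by_norm(u, tau) for u in updates]

clipper = MedianNormClipper()
benign = [np.ones(4) for _ in range(9)]          # norm 2 each
amplified = [np.ones(4) * 1e3]                   # norm 2000: amplified attack
clipped = clipper(benign + amplified)
print(round(np.linalg.norm(clipped[-1]), 1))     # → 2.0, scaled to the median
```

Because the benign majority dominates the norm history, the amplified update is rescaled to the median norm while benign updates pass through unchanged.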
There are two reasons for choosing the median update norm as the clipping threshold. First, prior work [34] has demonstrated that adaptive clipping to the median norm works well across a range of general federated learning tasks without the need to tune any clipping hyper-parameter. Second, the median itself is a robust statistical measure of central tendency: as we assume that the majority of the clients are benign, malicious clients are unable to control the median norm even if they have full knowledge of all the updates.

Fig. 2: Demonstration of the solution to (11). The solution $e_1$ can be expressed as a linear combination of $e$ and $e'$.

Taxonomy
Table 1 shows a taxonomy of the eight aggregation schemes. Krum, GeoMed, and AutoGM are typical Euclidean distance-based schemes, i.e., they are all designed to find a vector closest to the updates as measured by Euclidean distance. Among them, GeoMed and AutoGM are both based on the geometric median, whereas the Median scheme simply computes the coordinate-wise median instead. TrimmedMean, CC, Clustering, and ClippedClustering are all categorized as mean-based schemes as they eventually compute a mean, although they also utilize other mechanisms. Clustering and ClippedClustering both perform clustering based on cosine similarity, while ClippedClustering additionally clips the updates before clustering.

ADAPTIVE ATTACK ON CLIPPEDCLUSTERING
In this section, we design a strong adaptive attack on ClippedClustering. We assume that the attacker has full knowledge of the system, including the aggregation scheme and all the updates from benign clients. Since updates whose magnitudes exceed the threshold are clipped by the server, the state-of-the-art approach [35], which aims to maximize the Euclidean distance between the aggregated vector and the true update, is no longer applicable. Instead, the attack is more effective if it can change the direction of the update without being excluded by the aggregation scheme.
The idea of our attack is to ensure that all the malicious updates stay in the larger cluster while deviating from the correct direction as much as possible. Since the clustering method and linkage function are known to us, we can perform the same clustering process on the benign updates only. We can then carefully design malicious updates that are close enough to the larger cluster without breaking the existing structure. Specifically, we compute the average cosine similarity between the two benign clusters, denoted by δ, and use it as the bound for designing malicious updates.
For simplicity, we let $\Delta_1 = \cdots = \Delta_M$, and denote by $e_i = \frac{\Delta_i}{\|\Delta_i\|}$ the unit vector whose direction is the same as $\Delta_i$. The problem becomes:

$$\min_{e_1} \ \langle e_1, e \rangle \quad \text{s.t.} \quad \langle e_1, e' \rangle \geq \delta, \quad (11)$$

where $e$ is the unit vector of $\bar{\Delta}$, and $e'$ is the unit vector of the center of the largest benign cluster. The constraint ensures that the malicious group is included in the largest cluster when we perform hierarchical clustering with average linkage. Minimizing (11) is equivalent to maximizing the angle between $e_1$ and $e$, under the constraint that the angle between $e_1$ and $e'$ is smaller than $\arccos \delta$, as demonstrated in Fig. 2. We can then solve the problem using $\{e, e'\}$ as a basis, i.e., we write $e_1 = \alpha e' + \beta e$ and choose the coefficients such that $\|e_1\| = 1$ and $\langle e_1, e' \rangle = \delta + \varepsilon$, where $\varepsilon > 0$ is a small enough number; these two conditions determine the solution to (11). Once we obtain the unit vector of the malicious updates, we scale its magnitude by the clipping threshold τ, i.e., $\Delta_1 = \cdots = \Delta_M = \tau e_1$. We note that this attack is applicable to agglomerative clustering with average linkage. For complete linkage, one can simply replace δ with the minimum cosine similarity between the two benign clusters and solve the problem in the same manner. Other linkage functions are omitted due to space limitations.

EVALUATION
In this section, we design experiments to evaluate the aforementioned robust aggregation schemes under different attacks and show the experimental results.

Experimental Setup
We simulate an FL system with a server and 20 clients, targeting image classification tasks on the CIFAR-10 [36] and MNIST [37] datasets with both IID and Non-IID partitions, where five of the clients are Byzantine by default. For the IID partition, we randomly split the training set into 20 subsets and allocate them to the 20 clients. For the Non-IID partition, we follow prior work [38], [39] and model the Non-IID data distributions with a Dirichlet distribution $p_l \sim \mathrm{Dir}_K(\alpha)$. We then allocate a $p_{l,k}$ proportion of the training samples of class $l$ to client $k$, where a smaller α indicates a stronger Non-IID partition. We let α = 0.1 for all Non-IID settings. Fig. 3 visualizes the resulting statistical heterogeneity of labels on CIFAR-10. Such a partition is strongly Non-IID, as some of the classes are completely missing for each client.
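The Dirichlet-based partition can be sketched as follows; this is a common construction following [38], and the function names are ours:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, rng):
    """Split sample indices across clients with per-class proportions drawn
    from Dir_K(alpha); smaller alpha gives a more Non-IID partition."""
    clients = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet([alpha] * num_clients)        # p_l ~ Dir_K(alpha)
        cut = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for k, s in enumerate(np.split(idx, cut)):
            clients[k].extend(s.tolist())
    return clients

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)                  # 10 classes, 100 each
parts = dirichlet_partition(labels, num_clients=20, alpha=0.1, rng=rng)
print(sum(len(p) for p in parts))                       # → 1000, all assigned
```

With α = 0.1, most clients end up holding samples from only a few classes, matching the heterogeneity shown in Fig. 3.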
For CIFAR-10, we choose a lightweight Compact Convolutional Transformers (CCT) network [40], as such a small yet effective model has more potential to overcome the on-board resource limitations of FL devices [41]. For MNIST, we train a simple two-layer perceptron with ReLU activations. We train the models for 6000 and 600 communication rounds for FedSGD and FedAvg, respectively. By default, we set the batch size to 128 and 64 for MNIST and CIFAR-10, respectively. As suggested by Karimireddy et al. [20], we decay the learning rate $\eta_t$ during training to improve convergence, with separate decay schedules for FedSGD and FedAvg. For FedAvg, we apply 50 local SGD steps before uploading the updates to the server (i.e., $E_l = 50$).

Impact on the Mean Scheme
We first demonstrate the impact of attacks on conventional FedSGD and FedAvg using the Mean scheme to aggregate updates by plotting test accuracy versus the number of communication rounds in Fig. 4. At first glance, when the datasets are IID, FedAvg takes fewer communication rounds than FedSGD to converge when there is no attack, benefiting from multiple steps of local training. However, when the datasets are Non-IID, the performance of FedAvg significantly decreases while FedSGD still maintains relatively high accuracy. We note that this is a well-known drawback of FedAvg [42], [43].
Recall that some of the attacks (e.g., IPM with small $\epsilon$ [24]) are specifically designed to break robust defenses, which means that they may not cause much damage to the Mean scheme. For instance, Fig. 4 shows that IPM ($\epsilon = 0.5$) does less damage to Mean than the other attacks, as the malicious updates do not change the direction of the average update but only decrease its magnitude (see Section 3.3). Furthermore, Noise and IPM ($\epsilon = 100$) eventually damage the models under all four settings, decreasing the test accuracy to around 10% (no better than random guessing). This is because they both make large changes to the updates, and Mean is easily biased by large changes [19].
Unsurprisingly, the ALIE attack is ineffective on the MNIST dataset, while it has a large impact on CIFAR-10 with FedSGD. This is because the amount of noise injected by ALIE is determined by the empirical variance of the benign updates, and an MLP on MNIST usually exhibits lower variance than a CNN on CIFAR-10 [44]. Table 2 shows the overall comparison of the robust aggregation schemes in FedSGD and FedAvg with respect to test accuracy on both IID and Non-IID partitioned datasets.

Impact on Robust Aggregation Schemes
First of all, in the absence of malicious clients, Krum, GeoMed, AutoGM, and Median achieve relatively lower accuracy than the other schemes. This degradation is more significant with Non-IID data. For instance, the accuracy drops almost to random guessing (10%) for CIFAR-10 with FedSGD, while Mean, TrimmedMean, Clustering, and ClippedClustering sustain the accuracy at around 80%. This suggests that Krum, GeoMed, AutoGM, and Median should be used with caution on highly Non-IID data.
Euclidean-based schemes (i.e., Krum, GeoMed, and AutoGM) reach lower accuracy than the other schemes in the complete absence of attackers, especially when the datasets are Non-IID. Surprisingly, their performance under FedSGD with Non-IID CIFAR-10 data is not much better than random guessing. This might be because they all tend to select a single update that is closest to all or part of the others as measured by Euclidean distance, which is a poor estimate of the overall tendency under data heterogeneity. Furthermore, they all show similar robustness in both FedSGD and FedAvg, e.g., with an IID data partition, they all handle Noise and IPM ($\epsilon = 100$) well while struggling with the other attacks. This is not surprising, as those Euclidean-based schemes are essentially designed to defend against large changes in the updates. On the other hand, small-scale attacks such as IPM ($\epsilon = 0.5$) challenge them, as the malicious updates are close to the benign ones when measured by Euclidean distance.

Table 2: … of the clients are malicious. A semi-transparent value indicates that the corresponding accuracy is lower than 15% (not much better than random guessing). We bold the numbers with the highest accuracy. When integrated with robust aggregation schemes, FedAvg is more robust than FedSGD. On the other hand, all the attacks become more effective in Non-IID scenarios.

Two other classic robust aggregation schemes, Median and TrimmedMean, show similar robustness in most cases. For instance, when tested with IID MNIST, they successfully defend against most attacks with insignificant accuracy degradation. However, on Non-IID MNIST, both suffer from IPM attacks. A similar phenomenon occurs with CIFAR-10.
CC and Clustering both fail under the IPM ($\epsilon = 100$) and Noise attacks, because these attacks involve updates of large magnitude. In these cases, ClippedClustering significantly enhances the robustness of Clustering, benefiting from the adaptive clipping mechanism.
In the experiments, our proposed ClippedClustering successfully defends against more types of attacks than the other schemes, and achieves the highest test accuracy. However, it is defeated by ALIE when we train the model using FedSGD with CIFAR-10. Similar to other schemes, ClippedClustering is also less robust in Non-IID data scenarios. We note that Non-IIDness is a well-known challenge to FL especially when it comes to robustness [39], as it becomes hard to induce a consensus model for the benign clients if their data distributions are significantly different. Thus the malicious clients can damage the global model more easily.
Another important observation from our experiments is that ALIE causes almost no accuracy degradation when we train models on the MNIST dataset, while it circumvents almost all aggregation schemes under FedSGD with CIFAR-10 and keeps the test accuracy below 20%. The models suffer much less damage from ALIE when trained with FedAvg; moreover, even the plain Mean scheme handles it well. We thus infer that multi-step SGD updates may have lower variance than single-step updates when the datasets are IID.

Impact of Fraction of Malicious Clients
To study the impact of the number of malicious clients, we perform an experiment using FedAvg with IID-partitioned CIFAR-10 under six types of attacks. The results are shown in Fig. 5. Overall, the attacks have varying degrees of impact on the aggregation schemes as the fraction of Byzantine clients increases. Interestingly, the tendencies of Krum, GeoMed, and AutoGM are similar, except for the impact of the SL attack. In particular, these schemes all suffer from even a small fraction of IPM (ε = 0.5) attackers, i.e., the accuracy is reduced by 10% in the presence of a single Byzantine client (when 5% of the clients are malicious). CC and Clustering are both vulnerable to Noise, SF, and IPM (ε = 100) attacks even with only 5% Byzantine clients. Median, TrimmedMean, and ClippedClustering are more robust, especially when the fraction is low. Surprisingly, ClippedClustering tends to be affected by LF attacks as the fraction increases. Note that Fig. 5 only shows the performance of FedAvg with IID data. When the local datasets are Non-IID, all schemes show considerably less tolerance to Byzantine attacks.

Impact of Batch Size
In the previous experiments, the batch size per client is relatively small (i.e., 128 and 64 for MNIST and CIFAR-10, respectively), which leads to a large variance among benign updates and thus makes the attacks harder to defend against [20]. Taking CIFAR-10 as an example, we investigate the effect of batch size on the robustness of the aggregation schemes by varying the batch size in {64, 128, 512, 2500}. Fig. 6 shows the performance of FedSGD with an IID partition under ALIE attacks. Mean, ClippedClustering, AutoGM, TrimmedMean, and CC become more robust as the batch size increases, because the variance among updates decreases with larger batches. In particular, when the batch size is 2500, each client uses all of its training data at every step, so the stochastic gradient becomes the population gradient (the optimization reduces to full-batch gradient descent). However, except for ClippedClustering, all the other aggregation schemes still fail to achieve acceptable test accuracy. Although a large batch size is favorable for robustness, as indicated by this experiment, it is not a desirable solution since it imposes a heavy computational burden on local training.
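The variance argument can be checked numerically: the standard deviation of a mini-batch mean shrinks roughly as 1/√B. Below is a small synthetic demonstration with an arbitrary per-sample "gradient" distribution, purely for illustration (the distribution and sizes are not taken from our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-sample "gradients" (a stand-in for real per-example gradients).
population = rng.normal(loc=1.0, scale=4.0, size=100_000)

def batch_mean_std(batch_size, trials=2000):
    """Empirical std of the mini-batch mean over many random batches."""
    means = [rng.choice(population, size=batch_size).mean() for _ in range(trials)]
    return np.std(means)

# The std of the batch mean should shrink roughly as 1/sqrt(B):
stds = {b: batch_mean_std(b) for b in (64, 256, 1024)}
```

Going from B = 64 to B = 1024 (a 16x increase) should shrink the spread among benign updates by roughly 4x, which is exactly the spread that variance-exploiting attacks such as ALIE hide inside.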

Impact of Adaptive Attack
We examine ClippedClustering under the adaptive attack described in Section 5. The results are shown in Fig. 7. For the sake of visualization, we clamp the loss to the range [0, 10^5]. For FedSGD with IID data (shown in the first column), ClippedClustering tolerates up to 15% malicious clients with little performance degradation. When more than 25% of the clients are malicious, the loss curves fluctuate or even tend to increase. For FedSGD with Non-IID data (shown in the second column), the models diverge once 15% or more of the clients are malicious.
With FedAvg on IID data, ClippedClustering shows higher tolerance than with FedSGD. Specifically, the model for IID MNIST successfully converges under all the tested fractions of Byzantine clients. However, as shown in the fourth column of Fig. 7, it is still much less robust with Non-IID data, where the model tends not to converge when more than 15% of the clients are malicious.
Notably, we observe that preventing convergence is not always sufficient to degrade the accuracy. For example, for FedSGD on Non-IID MNIST with 15% of the clients malicious, the loss increases to the upper bound (10^5) after 4000 rounds, yet the model still retains 92% accuracy. This is because top-1 accuracy (as used in this paper) only takes into account the output with the highest probability. In this case, the attackers significantly change the output distribution but fail to change the index of the highest output.
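A toy example illustrates how the loss can grow while the top-1 prediction stays correct (the logits below are hypothetical, chosen only for illustration):

```python
import numpy as np

def cross_entropy(logits, true_idx):
    """Cross-entropy loss of a softmax over the given logits."""
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return -np.log(p[true_idx])

# Two output distributions with the same top-1 prediction (class 0):
# one confident, one barely above the other classes.
confident = np.array([10.0, 0.0, 0.0])
barely    = np.array([0.1, 0.0, 0.0])
```

The second distribution incurs a far larger loss, yet `argmax` (and hence top-1 accuracy) is unchanged, which is precisely the situation observed in Fig. 7.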

Pairwise Cosine Similarities
Recall that the aggregation schemes show different robustness under FedSGD and FedAvg with respect to IID and Non-IID data partitions. To investigate this further, we compare the pairwise cosine similarities of all benign local updates in the absence of attacks. Note that cosine similarity reflects the angle between two vectors, i.e., a higher value indicates a smaller angle. As visualized in Fig. 8, the pairwise cosine similarities of updates from FedSGD vary widely, regardless of whether the local datasets are IID. A considerable number of client pairs even show negative similarities. Such a stochastic property may confuse the robust aggregation schemes and make it more challenging to detect malicious updates. The updates from FedAvg with IID data show the highest pairwise similarities, meaning that their update directions are almost identical. Benefiting from this, clustering-based schemes can group benign updates together and exclude malicious ones. However, the updates become less similar when the local datasets are Non-IID.
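A similarity matrix of this kind can be computed with a straightforward sketch over flattened update vectors:

```python
import numpy as np

def pairwise_cosine(updates):
    """Pairwise cosine similarity matrix of flattened client updates.

    Rows of `updates` are per-client update vectors; entry (i, j) of the
    result is cos(angle) between the updates of clients i and j.
    """
    U = np.asarray(updates, dtype=float)
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # unit-normalize rows
    return U @ U.T
```

The diagonal is always 1; negative off-diagonal entries indicate client pairs whose update directions disagree by more than 90 degrees, the stochastic behavior we observe for FedSGD.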

Byzantine Attacks on FL
Byzantine attacks on FL are carried out by malicious clients during the distributed optimization of machine learning models, aiming to bias the global model toward the attackers' objectives [5]. Depending on the adversarial goals, Byzantine attacks in FL can be classified into two categories: targeted attacks and untargeted attacks [5], [29]. Targeted attacks, such as backdoor attacks, aim to make the global model generate attacker-desired misclassifications for particular test samples [45], [46], [47]. Untargeted attacks, in contrast, aim to degrade the overall performance of the global model indiscriminately [7].
We particularly focus on untargeted attacks, as most Byzantine-robust studies do [8], [17], [18], [19], [20], [21]. Many studies consider attacks that add Gaussian noise to, or flip the sign of, the actual updates [13], [19]. Such attacks, however, can be detected by Euclidean distance-based aggregation schemes such as Krum [17], since they usually place the malicious updates far from the benign ones in Euclidean distance. On the other hand, Baruch et al. [23] showed that attackers can circumvent robust schemes, including TrimmedMean and Krum, by exploiting the empirical variance between the clients' updates when that variance is high enough. Furthermore, Xie et al. proposed an attack strategy, the Inner Product Manipulation (IPM) attack, which poses a significant threat to Median and Krum by manipulating the inner product between the true mean of the updates and the output of the aggregation scheme [24].
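For intuition, one common formulation of the IPM attack (a sketch based on our reading of [24]; the exact construction in the original paper may differ) has every Byzantine client submit the negated, ε-scaled mean of the benign updates:

```python
import numpy as np

def ipm_attack(benign_updates, epsilon):
    """Sketch of an IPM-style malicious update: the negated, scaled mean
    of the benign updates. A small epsilon keeps the malicious update
    close in norm to the benign ones (hard to detect by distance); a
    large epsilon amplifies it."""
    return -epsilon * np.mean(benign_updates, axis=0)
```

If enough Byzantine clients submit this vector, the inner product between the aggregate and the true mean of the benign updates can be driven negative, pushing the model in the opposite direction of descent.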
There is an ongoing debate over the practicality of Byzantine attacks in real-world FL systems. Shejwalkar et al. [48] claimed that the fraction of compromised genuine clients is usually small (e.g., 0.01%) in practice due to the high cost of hacking and manipulating multiple devices simultaneously, resulting in a very limited impact on the global model. However, Cao et al. [49] argued that attackers may overcome this limitation by injecting fake clients into FL systems using zombie devices and simulators.

Byzantine-robust FL
In FL settings, a number of strategies have been explored to defend against specific types of attacks or failures, including backdoor attacks [47], [50], free-rider attacks [51], [52], and gradient inversion attacks [53], [54]. On a more general level, Byzantine-robust FL solutions aim to mitigate the effect of arbitrary updates uploaded by malicious clients, instead of focusing on specific types of attacks [7]. Those Byzantine-robust solutions can be classified into three categories: redundancy-based schemes, trust-based schemes, and robust aggregation schemes.
Redundancy-based schemes assign redundant updates to each client and use this redundancy to eliminate the effect of Byzantine failures [10], [11], [55]. In 2018, Chen et al. [10] presented DRACO, a framework for robust distributed training that uses ideas from coding theory. In DRACO, each client evaluates redundant gradients that the server uses to eliminate the effects of adversarial updates. In 2019, Rajput et al. presented DETOX, a framework that combines algorithmic redundancy with robust aggregation. The defense of DETOX operates in two steps: a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. However, in the worst case, this redundancy requires each node to compute Ω(M) times more updates, where M is the number of Byzantine clients [10]. This overhead is prohibitive in settings with a large number of Byzantine clients. In 2021, Cao et al. [56] proposed using randomly selected subsets of clients to learn redundant global models. At inference time, the label of a testing sample is predicted by a majority vote among the global models. The authors showed that such an ensemble approach with any base FL algorithm is provably secure against malicious clients.
Trust-based schemes assume that some clients or datasets are trusted and use them to filter and re-weight the local model updates [13], [14], [15], [16]. For example, in 2019, Li et al. [13] proposed incorporating a regularization term into the objective function that minimizes the distance between the server parameters and the client parameters. In 2021, Park et al. [15] designed an entropy-based filtering scheme to detect outlier updates using trusted public data on the server side. During training, the server computes the entropy of each update on the trusted dataset. Based on their experimental observations, they argue that updates with higher entropy lead to lower accuracy at testing time; they therefore set a threshold on the entropy and filter out updates whose entropy exceeds it. In 2021, Cao et al. [16] used cosine similarity to compare the updates submitted by the clients with the update obtained by training on the trusted dataset owned by the server. The authors argued that an attacker can manipulate the directions of updates to perform model poisoning attacks, and that the directions of the updates can, to a certain extent, indicate the honesty of the end devices. After computing the cosine similarity, the server calculates a trust score for each update using the ReLU function; the score is then used as the weight in the global model aggregation. In general, trust-based schemes have the potential to handle situations where more than half of the updates are malicious, since they rely on pre-validated information to detect malicious updates. However, trusted datasets or clients are not always available to the server, for example, due to concerns over user data privacy.
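The cosine-similarity-plus-ReLU idea of Cao et al. [16] can be sketched roughly as follows. This is a simplified reconstruction; details such as the magnitude normalization follow our reading of the scheme and may differ from the original:

```python
import numpy as np

def trust_weighted_aggregate(client_updates, server_update):
    """Sketch of a trust-score aggregation in the spirit of [16]:
    score each client update by ReLU(cosine similarity with the update
    computed on the server's trusted data), rescale each update to the
    server update's norm, and take the score-weighted average."""
    U = np.asarray(client_updates, dtype=float)
    s = np.asarray(server_update, dtype=float)
    s_norm = np.linalg.norm(s)
    u_norms = np.maximum(np.linalg.norm(U, axis=1), 1e-12)
    cos = (U @ s) / (u_norms * s_norm)
    scores = np.maximum(cos, 0.0)                # ReLU: discard opposing directions
    if scores.sum() == 0:
        return np.zeros_like(s)                  # no update earns any trust
    rescaled = U * (s_norm / u_norms)[:, None]   # normalize magnitudes
    return (scores[:, None] * rescaled).sum(axis=0) / scores.sum()
```

Because scores are anchored to the trusted update rather than to a majority vote, this style of rule can in principle survive a malicious majority, which is exactly the advantage of the trust-based category.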
Robust aggregation schemes estimate the global update from the local updates according to robust aggregation rules or algorithms [8], [17], [18], [19], [20], [21]. Byzantine-robust aggregation has been explored to handle devices sending corrupted updates to the server, including the geometric median (GeoMed) [18], Krum [17], TrimmedMean [8], and Median [8]. These rules are commonly used to estimate the model parameters and mitigate the effect of malicious updates in the global aggregation. In 2017, Chen et al. [18] proposed a GeoMed-based method to aggregate gradients for distributed statistical machine learning and showed its robustness and convergence in IID settings. In 2020, Wu et al. [57] showed that the GeoMed scheme provably provides improved Byzantine robustness compared to other aggregation schemes in FL. In 2022, Pillutla et al. [30] applied GeoMed as a robust aggregation rule for FL and analyzed the convergence of the resulting FL algorithm for least-squares objectives with IID local datasets. In 2022, Li et al. [19] proposed AutoGM, a variant of GeoMed that automatically re-scales the weight of each parameter component according to a user-specified threshold on skewness. According to our empirical study in this paper, the schemes in this category all show limitations in terms of Byzantine robustness under both the FedSGD and FedAvg algorithms.
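The geometric median has no closed form, but it can be approximated with Weiszfeld's classical fixed-point iteration; a minimal sketch:

```python
import numpy as np

def geometric_median(points, iters=100, tol=1e-7):
    """Approximate the geometric median via Weiszfeld's algorithm: the
    point minimizing the sum of Euclidean distances to all updates,
    robust to a minority of arbitrarily bad updates."""
    P = np.asarray(points, dtype=float)
    z = P.mean(axis=0)                            # start from the mean
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(P - z, axis=1), 1e-12)  # avoid /0
        w = 1.0 / d                               # inverse-distance weights
        z_new = (w[:, None] * P).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```

Unlike the plain mean, which a single extreme update can drag arbitrarily far, the geometric median stays near the cluster of benign updates, which is why GeoMed-style rules are a natural baseline for Byzantine robustness.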

CONCLUSIONS
In this paper, we provided an experimental study of Byzantine-robust aggregation schemes for FL. In particular, we surveyed existing Byzantine attacks and defense strategies in the FL literature. We also proposed a novel scheme, ClippedClustering, which enhances the robustness of a clustering-based scheme by automatically clipping the updates to mitigate the effect of amplified malicious updates. We then evaluated eight robust aggregation schemes under five representative Byzantine attack strategies. Our experimental results show that all of these aggregation schemes achieve only limited robustness in the presence of Byzantine attacks. In the future, it would be interesting to carry out a theoretical analysis to guarantee the robustness of ClippedClustering. Furthermore, we plan to improve the robustness of FL from additional perspectives, e.g., low-variance algorithms and robust learning rates.

STATUS AND PUBLICATION INFORMATION
This paper has been accepted for publication in IEEE Transactions on Big Data, and the final version is available at https://doi.org/10.1109/TBDATA.2023.3237397