Mean Aggregator is More Robust than Robust Aggregators under Label Poisoning Attacks on Distributed Heterogeneous Data
TL;DR Summary
This study reveals that the mean aggregator outperforms robust aggregators under label poisoning attacks in distributed heterogeneous data, proving order-optimal learning error when data is sufficiently heterogeneous, supported by experimental validation.
Abstract
Robustness to malicious attacks is of paramount importance for distributed learning. Existing works usually consider the classical Byzantine attacks model, which assumes that some workers can send arbitrarily malicious messages to the server and disturb the aggregation steps of the distributed learning process. To defend against such worst-case Byzantine attacks, various robust aggregators have been proposed. They are proven to be effective and much superior to the often-used mean aggregator. In this paper, however, we demonstrate that the robust aggregators are too conservative for a class of weak but practical malicious attacks, known as label poisoning attacks, where the sample labels of some workers are poisoned. Surprisingly, we are able to show that the mean aggregator is more robust than the state-of-the-art robust aggregators in theory, given that the distributed data are sufficiently heterogeneous. In fact, the learning error of the mean aggregator is proven to be order-optimal in this case. Experimental results corroborate our theoretical findings, showing the superiority of the mean aggregator under label poisoning attacks.
In-depth Reading
1. Bibliographic Information
1.1. Title
Mean Aggregator is More Robust than Robust Aggregators under Label Poisoning Attacks on Distributed Heterogeneous Data
1.2. Authors
- Jie Peng (School of Computer Science and Engineering, Sun Yat-Sen University)
- Weiyu Li (School of Engineering and Applied Science, Harvard University)
- Stefan Vlaski (Department of Electrical and Electronic Engineering, Imperial College London)
- Qing Ling (School of Computer Science and Engineering, Sun Yat-Sen University) - Corresponding author
1.3. Journal/Conference
Published at the International Joint Conference on Artificial Intelligence (IJCAI) in 2024, as an extended version of a conference paper (Peng et al., 2024). IJCAI is a highly reputable and influential conference in the field of Artificial Intelligence.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses the crucial issue of robustness to malicious attacks in distributed learning. While existing research primarily focuses on classical Byzantine attacks (where workers send arbitrarily malicious messages), leading to the development of various robust aggregators deemed superior to the mean aggregator, this study challenges that assumption for a different class of attacks. The authors demonstrate that these robust aggregators can be overly conservative under label poisoning attacks, a weaker yet practical form of attack where only sample labels of some workers are corrupted. Surprisingly, the paper theoretically shows that the mean aggregator is more robust than state-of-the-art robust aggregators when distributed data is sufficiently heterogeneous. In this specific scenario, the mean aggregator's learning error is proven to be order-optimal. Experimental results corroborate these theoretical findings, highlighting the mean aggregator's superiority under label poisoning attacks.
1.6. Original Source Link
/files/papers/694f4ea59c764da3f20e36d0/paper.pdf (This indicates the paper is an officially published work, likely an extended journal version of the IJCAI conference paper).
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is ensuring robustness in distributed learning systems against malicious attacks. Distributed learning, especially federated learning, has gained immense popularity for its effectiveness in large-scale machine learning and privacy preservation. However, its decentralized nature makes it vulnerable.
Traditionally, research in this area has focused on classical Byzantine attacks, a worst-case scenario where malicious workers can send arbitrarily corrupted messages. To combat this, many robust aggregators have been developed, often showing superior performance compared to the simple mean aggregator.
The paper argues that these traditional robust aggregators might be "too conservative" for more practical and less destructive attacks, specifically label poisoning attacks. In such attacks, malicious workers only corrupt the labels of their local data, but otherwise follow the algorithmic protocol. This is a practical concern in scenarios like spam detection where users might intentionally mislabel emails. The existing gap is understanding how these robust aggregators perform under such bounded disturbances, and if the mean aggregator, often dismissed as vulnerable, might have unexpected strengths.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- First Investigation into Mean Aggregator Robustness: It is the first work to specifically investigate the robustness of the mean aggregator in distributed learning under label poisoning attacks. This challenges the prevailing assumption that robust aggregators are universally superior.
- Theoretical Proof of the Mean Aggregator's Superiority: The paper theoretically demonstrates that, under label poisoning attacks and given sufficiently heterogeneous distributed data, the mean aggregator is more robust than state-of-the-art robust aggregators. This is a surprising finding, as robust aggregators are designed specifically for attack scenarios.
- Order-Optimality of the Mean Aggregator: When the distributed data are sufficiently heterogeneous, the learning error of the mean aggregator is proven to be order-optimal, irrespective of the fraction of poisoned workers.
- Comprehensive Theoretical Analysis: The paper provides new theoretical analyses for distributed stochastic momentum as the backbone algorithm, establishing tight upper bounds and a lower bound on the learning error for both the mean and robust aggregators.
- Empirical Validation: Extensive experiments on both convex (softmax regression) and non-convex (two-layer perceptrons, CNNs) tasks, across different levels of data heterogeneity and attack strengths, fully support the theoretical findings.

The key conclusion is that the choice of aggregator should be nuanced, depending on the specific attack model and data characteristics. For label poisoning attacks on heterogeneous data, the simple mean aggregator can surprisingly outperform complex robust aggregators.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational understanding of distributed learning, optimization algorithms, and security in machine learning is essential.
- **Distributed Learning:** The training of machine learning models across multiple computational devices (workers) coordinated by a central server (parameter server).
  - **Server-Worker Architecture:** The server holds the global model. Workers download the global model, compute local updates (e.g., stochastic gradients or momenta) using their local data, and send these updates back to the server. The server then aggregates the local updates to update the global model.
  - **Aggregation:** The process by which the server combines the messages (updates) received from workers into a single update for the global model. The simplest form is the mean aggregator, which simply averages all received messages.
  - **Federated Learning (FL):** A specific type of distributed learning where data remains decentralized on client devices, offering privacy benefits. The paper frames its work within distributed learning, with FL being a prominent application.
- **Stochastic Optimization Algorithms:**
  - **Stochastic Gradient Descent (SGD):** An iterative optimization algorithm that updates model parameters using the gradient computed from a single randomly selected data sample (or a small batch). This introduces stochastic noise.
  - **Momentum:** A technique used to accelerate convergence, especially in the presence of noisy gradients. It accumulates a velocity vector based on past gradients, allowing the optimization process to smooth out oscillations and move more consistently towards the optimum. The distributed stochastic momentum algorithm used in this paper combines stochastic gradients with momentum. The update rule for the local momentum at worker $w$ is $m_w^t = (1 - \alpha) m_w^{t-1} + \alpha \nabla f_{w, i_w^t}(x^t)$, where $\alpha$ is the momentum coefficient, $m_w^{t-1}$ is the previous momentum, and $\nabla f_{w, i_w^t}(x^t)$ is the stochastic gradient computed from sample $i_w^t$ at the global model $x^t$.
- **Malicious Attacks in Distributed Learning:**
  - **Byzantine Attacks:** A severe class of attacks where malicious workers (Byzantine workers) can send arbitrarily corrupted messages to the server, including completely fabricated or nonsensical updates, with the goal of severely disrupting the learning process. The Byzantine generals problem (Lamport et al., 1982) is a classic computer science problem illustrating the challenges of reaching consensus in the presence of malicious agents.
  - **Robust Aggregators:** Algorithms designed to defend against Byzantine attacks by identifying and mitigating the impact of malicious messages during the aggregation step. They aim to produce an aggregate update that is close to what would be obtained from only the honest workers. Examples include Krum, geometric median, coordinate-wise median, trimmed mean (TriMean), FABA, and centered clipping (CC).
  - **Poisoning Attacks:** Attacks that aim to corrupt the training data or model updates.
    - **Model Poisoning Attacks:** Malicious workers send arbitrarily poisoned model updates (gradients or weights) to the server. This is often what Byzantine attacks refer to.
    - **Data Poisoning Attacks:** Malicious workers fabricate or corrupt their local data to generate poisoned messages.
    - **Label Poisoning Attacks:** A specific and weaker type of data poisoning attack where malicious workers only mislabel their local data, but otherwise strictly follow the algorithmic protocol (e.g., computing gradients from the mislabeled data and sending them). The feature data remains clean. This is the focus of the paper.
- **Data Heterogeneity:** In distributed learning, data heterogeneity refers to the scenario where the local data distributions across different workers are not identical. This is common in real-world federated learning settings.
  - **$\xi$ (heterogeneity parameter):** Assumption 3 quantifies heterogeneity as the maximum distance between the local gradient of any regular worker and the global gradient. A larger $\xi$ implies higher heterogeneity.
- **Poisoning Disturbance:**
  - **$A$ (disturbance parameter):** Assumption 5 quantifies the disturbance caused by poisoned workers as the maximum distance between the poisoned local gradient of a poisoned worker and the global gradient.
## 3.2. Previous Works
The paper contextualizes its contributions by discussing various related works:
* **Classical Byzantine Robustness:**
* Works like `Krum` (Blanchard et al., 2017), `geometric median` (Chen et al., 2017), `coordinate-wise median` (Yin et al., 2018), `coordinate-wise trimmed-mean (TriMean)` (Yin et al., 2018), `FABA` (Xia et al., 2019), `centered clipping (CC)` (Karimireddy et al., 2021), and `VRMOM` (Tu et al., 2021) are cited as examples of `robust aggregators` designed for `worst-case Byzantine attacks`. Their core idea is to find an aggregate that is robust to outliers, effectively filtering out malicious messages.
* Unified frameworks for analyzing these aggregators exist (Farhadkhani et al., 2022; Allouah et al., 2023).
* **Handling Stochastic Gradient Noise:**
* Some works (Khanduri et al., 2019; Wu et al., 2020; Karimireddy et al., 2021; Rammal et al., 2024; Guerraoui et al., 2024) combine `variance-reduction` and `momentum` techniques with `robust aggregators` to alleviate the effect of `stochastic gradient noise` and enhance `Byzantine robustness`.
* **Addressing Data Heterogeneity:**
* The performance of `robust aggregators` degrades with `data heterogeneity` (Li et al., 2019; Karimireddy et al., 2022). Solutions include `model aggregation` instead of `stochastic gradient aggregation` (Li et al., 2019) or `bucketing/resampling` and `nearest neighbor mixing` (Karimireddy et al., 2022; Peng et al., 2022; Allouah et al., 2023) to reduce message heterogeneity.
* **Data Poisoning Attacks:**
* A broad category of attacks (Sun et al., 2019; Bagdasaryan et al., 2020; Wang et al., 2020; Rosenfeld et al., 2020; Cinà et al., 2024). Defenses often involve `data sanitization` (Steinhardt et al., 2017) or pruning activation units (Liu et al., 2018).
* **Label Poisoning Specific Works:**
* Tavallali et al. (2022) proposed a `regularization-based defense` to detect and exclude samples with flipped labels, but it requires a clean validation set, which has privacy concerns.
* `LFighter` (Jebreel et al., 2024) is a state-of-the-art defense that clusters local gradients to identify and discard poisoned ones. Its performance can degrade with high `data heterogeneity`.
* **Optimality of Robust Aggregators:**
* Karimireddy et al. (2022) and Farhadkhani et al. (2024) have shown the optimality of certain `robust aggregators` for `model poisoning` and `data poisoning attacks`, but they primarily consider scenarios where `poisoned workers` can cause *unbounded* disturbances.
* **Preliminary Observations on Mean Aggregator:**
* Shejwalkar et al. (2022) observed the `mean aggregator`'s robustness in production `federated learning` systems but with restrictive `poisoning ratios` ($<0.1\%$) and without theoretical analysis.
## 3.3. Technological Evolution
The field has evolved from focusing on worst-case `Byzantine attacks` with arbitrary malicious behavior to considering more realistic and *bounded* attack models, such as `label poisoning`. Initially, the development of `robust aggregators` aimed to completely mitigate the impact of any malicious input. However, as `distributed learning` systems encounter real-world `data heterogeneity`, and as attack models become more refined (e.g., `label poisoning`, a specific type of `data poisoning` that is less destructive than full `Byzantine attacks`), the effectiveness and necessity of these `conservative robust aggregators` are being re-evaluated. This paper marks a shift in perspective, suggesting that simpler `aggregators` might be optimal under specific, practical attack conditions. It builds upon the theoretical understanding of convergence for `stochastic momentum` algorithms and the characterization of `robust aggregators`' approximation abilities ($\rho$-robustness).
3.4. Differentiation Analysis
The core differentiation of this paper from related works lies in its focus and findings:
- Attack Model: Unlike most prior works that focus on worst-case Byzantine attacks (arbitrary messages, unbounded disturbances), this paper specifically investigates label poisoning attacks, where disturbances are bounded (Assumption 5 holds). This distinction is critical.
- Aggregator Performance: While previous research largely concludes that robust aggregators are superior to the mean aggregator for Byzantine robustness, this paper demonstrates the opposite for label poisoning attacks under heterogeneous data.
- Optimality: Prior works on optimality (Karimireddy et al., 2022; Farhadkhani et al., 2024) consider unbounded disturbances. This paper proves the order-optimality of the mean aggregator specifically for bounded disturbances caused by label poisoning when data is heterogeneous.
- Theoretical vs. Empirical Observations: While Shejwalkar et al. (2022) made empirical observations about the mean aggregator's robustness, this paper provides rigorous theoretical analysis and extends the observations to higher poisoning ratios.
- Role of Heterogeneity: The paper explicitly links the superior performance of the mean aggregator to the condition of "sufficiently heterogeneous distributed data," showing that the heterogeneity parameter $\xi$ and the poisoning disturbance parameter $A$ become comparable in such settings. This is a nuanced understanding not deeply explored by prior works promoting general robust aggregators.
4. Methodology
4.1. Principles
The core idea behind the method is to rigorously analyze the learning error of distributed stochastic momentum when different aggregators (mean aggregator vs. robust aggregators) are used, specifically under label poisoning attacks on heterogeneous data. The theoretical basis hinges on understanding how the contraction properties of robust aggregators (quantified by $\rho$) and the direct impact of poisoned workers (quantified by $\delta$ and $A$) affect the convergence bounds in the presence of data heterogeneity ($\xi$) and stochastic noise ($\sigma^2$). The surprising intuition revealed is that when data heterogeneity is high, robust aggregators may be too aggressive in filtering messages, inadvertently discarding valuable information or treating legitimate heterogeneous messages as malicious, whereas the mean aggregator simply averages everything, which can be more effective when the poisoning disturbance is bounded and comparable to the natural heterogeneity.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation
The paper aims to solve a distributed learning problem defined over a set $\mathcal{R}$ of regular workers out of a total set $\mathcal{W}$ of workers. The objective function is:
$
\min_{x \in \mathbb{R}^D} f(x) \triangleq \frac{1}{R} \sum_{w \in \mathcal{R}} f_w(x), \quad \text{with } f_w(x) \triangleq \frac{1}{J} \sum_{j=1}^{J} f_{w,j}(x), \quad \forall w \in \mathcal{R}.
$
Here:
- $x \in \mathbb{R}^D$ is the global model, a parameter vector of dimension $D$.
- $f(x)$ is the global cost function, the average of the local costs of all regular workers.
- $f_w(x)$ is the local cost function of regular worker $w$.
- $f_{w,j}(x)$ is the cost associated with the $j$-th sample of worker $w$.
- $J$ is the number of samples each worker holds (assumed uniform across workers).
- $R$ is the number of regular workers.

The problem considers the presence of poisoned workers in $\mathcal{W} \setminus \mathcal{R}$.
4.2.2. Label Poisoning Attacks
The paper defines label poisoning attacks as follows:
Definition 1 (Label poisoning attacks)
In solving (1), there exist a number of poisoned workers, whose local costs are in the same form as the regular workers but an arbitrary fraction of sample labels are poisoned. Nevertheless, these poisoned workers exactly follow the algorithmic protocol during the distributed learning process.
This definition is crucial as it distinguishes label poisoning from classical Byzantine attacks. Poisoned workers do not send arbitrary messages; they compute messages correctly based on their (corrupted) local data.
4.2.3. Distributed Stochastic Momentum Algorithm
The chosen optimization algorithm is distributed stochastic momentum, which generalizes distributed gradient descent and distributed stochastic gradient descent. The algorithm proceeds iteratively:
Algorithm 1: Distributed Stochastic Momentum with Aggregators
- Initialization: the global model, initial momenta for the regular workers, initial momenta for the poisoned workers, step size $\gamma$, momentum coefficient $\alpha$, and the total number of iterations $T$.
- For each iteration $t$ do:
  - Server Broadcast: the server sends the global model $x^t$ to all workers.
  - Worker Computation (Regular): each regular worker $w \in \mathcal{R}$ samples an index $i_w^t$ uniformly from $[1, \cdots, J]$, computes the stochastic gradient $\nabla f_{w, i_w^t}(x^t)$, and updates its local momentum:
    $
    m_w^t = (1 - \alpha) m_w^{t-1} + \alpha \nabla f_{w, i_w^t}(x^t).
    $
    This momentum is sent as the message to the server.
  - Worker Computation (Poisoned): each poisoned worker $w \in \mathcal{W} \setminus \mathcal{R}$ samples an index $i_w^t$ uniformly, computes the poisoned stochastic gradient $\nabla \tilde{f}_{w, i_w^t}(x^t)$ (based on the poisoned labels), and updates its poisoned local momentum:
    $
    \tilde{m}_w^t = (1 - \alpha) \tilde{m}_w^{t-1} + \alpha \nabla \tilde{f}_{w, i_w^t}(x^t).
    $
    This momentum is sent as the message to the server. For notational convenience, the message sent by worker $w$ is denoted as:
    $
    \hat{m}_w^t = \begin{cases} m_w^t, & w \in \mathcal{R}, \\ \tilde{m}_w^t, & w \in \mathcal{W} \setminus \mathcal{R}. \end{cases}
    $
  - Server Aggregation and Update: the server receives all messages and updates $x^{t+1}$.
    - With a Robust Aggregator ($\mathrm{RAgg}(\cdot)$):
      $
      x^{t+1} = x^t - \gamma \cdot \mathrm{RAgg}(\{\hat{m}_w^t : w \in \mathcal{W}\}).
      $
    - With the Mean Aggregator ($\mathrm{Mean}(\cdot)$):
      $
      x^{t+1} = x^t - \gamma \cdot \mathrm{Mean}(\{\hat{m}_w^t : w \in \mathcal{W}\}), \quad \text{where } \mathrm{Mean}(\{\hat{m}_w^t : w \in \mathcal{W}\}) \triangleq \frac{1}{W} \sum_{w \in \mathcal{W}} \hat{m}_w^t.
      $
- End for.
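To make Algorithm 1 concrete, below is a minimal Python sketch of the training loop, assuming a user-supplied `local_stochastic_grad(w, x)` oracle (which, for a poisoned worker, simply computes the gradient on its mislabeled data) and a pluggable `aggregate` function; the function names and default values are illustrative rather than the authors' reference implementation.

```python
import numpy as np

def run_distributed_momentum(x0, workers, local_stochastic_grad, aggregate,
                             gamma=0.1, alpha=0.1, num_iters=100):
    """Sketch of Algorithm 1: distributed stochastic momentum with a pluggable aggregator.

    workers: worker ids, regular and poisoned alike (poisoned workers follow the
             same protocol; their gradients are computed on mislabeled data).
    local_stochastic_grad(w, x): stochastic gradient of worker w at model x.
    aggregate(messages): maps the list of momentum messages to one update direction.
    """
    x = x0.copy()
    momenta = {w: np.zeros_like(x0) for w in workers}  # initial momenta (zeros assumed here)
    for _ in range(num_iters):
        messages = []
        for w in workers:                              # server broadcast + worker computation
            g = local_stochastic_grad(w, x)            # stochastic (possibly poisoned) gradient
            momenta[w] = (1 - alpha) * momenta[w] + alpha * g
            messages.append(momenta[w])
        x = x - gamma * aggregate(messages)            # server aggregation and update
    return x

def mean_aggregator(messages):
    """Mean(.): simple average of all received momenta."""
    return np.mean(np.stack(messages), axis=0)
```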
4.2.4. Assumptions
The theoretical analysis relies on several standard assumptions in distributed optimization:
- Assumption 1 (Lower boundedness): The global cost $f$ is lower bounded by $f^*$, i.e., $f(x) \geq f^* > -\infty$ for all $x$. This is a standard assumption for non-convex optimization ensuring the function does not go to negative infinity.
- Assumption 2 (Lipschitz continuous gradients): The global cost $f$ has $L$-Lipschitz continuous gradients, meaning the change in the gradient is bounded by the change in the input, with $L$ being the Lipschitz constant:
  $
  \|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|
  $
  for any $x, y \in \mathbb{R}^D$.
- Assumption 3 (Bounded heterogeneity): For any $x$, the maximum distance between the local gradients of the regular workers and the global gradient is upper-bounded by $\xi$:
  $
  \max_{w \in \mathcal{R}} \|\nabla f_w(x) - \nabla f(x)\| \leq \xi.
  $
  This assumption quantifies the data heterogeneity among regular workers; a larger $\xi$ indicates higher heterogeneity.
- Assumption 4 (Bounded inner variance): For any $x$, the variance of the local stochastic gradients of any worker with respect to the local gradient is upper-bounded by $\sigma^2$:
  $
  \mathbb{E}_{i_w} \|\nabla f_{w, i_w}(x) - \nabla f_w(x)\|^2 \leq \sigma^2, \quad \forall w \in \mathcal{R}, \qquad
  \mathbb{E}_{i_w} \|\nabla \tilde{f}_{w, i_w}(x) - \nabla \tilde{f}_w(x)\|^2 \leq \sigma^2, \quad \forall w \in \mathcal{W} \setminus \mathcal{R},
  $
  where $i_w$ is a randomly selected sample index. This means the stochastic gradients are not excessively noisy. It applies to both regular and poisoned workers, as label poisoning is assumed not to drastically change the gradient variance.
- Assumption 5 (Bounded disturbances of poisoned local gradients): For any $x$, the maximum distance between the poisoned local gradients of the poisoned workers and the global gradient is upper-bounded by $A$:
  $
  \max_{w \in \mathcal{W} \setminus \mathcal{R}} \|\nabla \tilde{f}_w(x) - \nabla f(x)\| \leq A.
  $
  This is a critical assumption for label poisoning attacks, implying that the maliciousness is bounded, unlike worst-case Byzantine attacks where the disturbance could be arbitrary. (A sketch for empirically estimating $\xi$ and $A$ follows this list.)
4.2.5. Justification of Assumption 5
The paper provides a theoretical justification for Assumption 5 in the context of distributed softmax regression, a common model for classification tasks.
Example: Distributed Softmax Regression under Label Poisoning Attacks
The local cost of a regular worker $w \in \mathcal{R}$ is given by:
$
f_w(x) = \frac{1}{J} \sum_{j=1}^{J} f_{w,j}(x), \quad \text{where } f_{w,j}(x) = -\sum_{k=1}^{K} \mathbf{1}\{b^{(w,j)} = k\} \log \frac{\exp(x_k^T a^{(w,j)})}{\sum_{l=1}^{K} \exp(x_l^T a^{(w,j)})}.
$
And for a poisoned worker $w \in \mathcal{W} \setminus \mathcal{R}$, whose labels $b^{(w,j)}$ are replaced by poisoned labels $\tilde{b}^{(w,j)}$:
$
\tilde{f}_w(x) = \frac{1}{J} \sum_{j=1}^{J} \tilde{f}_{w,j}(x), \quad \text{where } \tilde{f}_{w,j}(x) = -\sum_{k=1}^{K} \mathbf{1}\{\tilde{b}^{(w,j)} = k\} \log \frac{\exp(x_k^T a^{(w,j)})}{\sum_{l=1}^{K} \exp(x_l^T a^{(w,j)})}.
$
Here:
- $K$ is the number of classes.
- $(a^{(w,j)}, b^{(w,j)})$ is the $j$-th sample of worker $w$, with $a^{(w,j)}$ the feature and $b^{(w,j)}$ the label.
- $\mathbf{1}\{b^{(w,j)} = k\}$ is an indicator function, equal to 1 if the label is $k$ and 0 otherwise.
- $x_k$ is the $k$-th block of the global model $x$.
Lemma 11 (Bounded Gradients of Local Costs):
This lemma first establishes that for softmax regression, gradients are bounded.
$
\|\nabla f_{w,j}(x)\| \leq 2 \|a^{(w,j)}\|, \quad \forall w \in \mathcal{R}, \ \forall j \in [1, \cdots, J], \qquad
\|\nabla \tilde{f}_{w,j}(x)\| \leq 2 \|a^{(w,j)}\|, \quad \forall w \in \mathcal{W} \setminus \mathcal{R}, \ \forall j \in [1, \cdots, J].
$
The gradients of the sample costs ($\nabla f_{w,j}(x)$ and $\nabla \tilde{f}_{w,j}(x)$) are bounded by twice the norm of the corresponding feature vector $a^{(w,j)}$.
Also, the gradient of the local cost is bounded by the maximum norm of the local features:
$
\|\nabla f_w(x)\| \leq 2 \max_{j \in [J]} \|a^{(w,j)}\|, \quad \forall w \in \mathcal{R}, \qquad
\|\nabla \tilde{f}_w(x)\| \leq 2 \max_{j \in [J]} \|a^{(w,j)}\|, \quad \forall w \in \mathcal{W} \setminus \mathcal{R}.
$
Moreover, if $a^{(w,j)}$ is entry-wise non-negative (e.g., pixel values in image classification), a tighter bound exists:
$
\|\nabla f_w(x)\| \leq \sqrt{K} \Big\| \frac{1}{J} \sum_{j=1}^{J} a^{(w,j)} \Big\|, \quad \forall w \in \mathcal{R}, \qquad
\|\nabla \tilde{f}_w(x)\| \leq \sqrt{K} \Big\| \frac{1}{J} \sum_{j=1}^{J} a^{(w,j)} \Big\|, \quad \forall w \in \mathcal{W} \setminus \mathcal{R}.
$
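As a quick numeric sanity check of the per-sample bound in Lemma 11: the softmax-regression sample gradient with respect to the class block $x_k$ is $(\mathrm{softmax}_k(x, a) - \mathbf{1}\{b = k\})\, a$, whose norm never exceeds $2\|a\|$ whether the label $b$ is clean or flipped. The snippet below (illustrative variable names, not the paper's code) verifies this:

```python
import numpy as np

def softmax_sample_grad(X, a, b):
    """Gradient of the per-sample softmax-regression cost.
    X: (K, d) model with one block x_k per class; a: (d,) feature; b: integer label."""
    logits = X @ a
    logits = logits - logits.max()                 # for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()      # softmax probabilities
    e_b = np.zeros(len(p)); e_b[b] = 1.0
    return np.outer(p - e_b, a)                    # row k holds (p_k - 1{b=k}) * a

# Check ||grad|| <= 2 ||a|| for every possible (clean or poisoned) label.
rng = np.random.default_rng(0)
K, d = 10, 784
X = rng.normal(size=(K, d))
a = np.abs(rng.normal(size=d))                     # entry-wise non-negative feature
for label in range(K):
    g = softmax_sample_grad(X, a, label)
    assert np.linalg.norm(g) <= 2 * np.linalg.norm(a) + 1e-9
```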
Lemma 2 (Justification of Assumption 5 for Softmax Regression):
This lemma directly proves Assumption 5 holds for distributed softmax regression under label poisoning.
Given the local costs (13) and (14) and entry-wise non-negative features $a^{(w,j)}$, Assumption 5 is satisfied with:
$
A \leq 2\sqrt{K} \max_{w \in \mathcal{W}} \Big\| \frac{1}{J} \sum_{j=1}^{J} a^{(w,j)} \Big\|.
$
The proof uses Lemma 11 and the triangle inequality:
$
\max_{w \in \mathcal{W} \setminus \mathcal{R}} \|\nabla \tilde{f}_w(x) - \nabla f(x)\|^2 \leq 2 \max_{w \in \mathcal{W} \setminus \mathcal{R}} \|\nabla \tilde{f}_w(x)\|^2 + 2 \|\nabla f(x)\|^2.
$
Then, using the bounds from Lemma 11 for $\nabla \tilde{f}_w(x)$ and for $\nabla f(x)$ (which is an average of the regular local gradients $\nabla f_w(x)$):
$
\max_{w \in \mathcal{W} \setminus \mathcal{R}} \|\nabla \tilde{f}_w(x)\|^2 \leq K \max_{w \in \mathcal{W} \setminus \mathcal{R}} \Big\| \frac{1}{J} \sum_{j=1}^{J} a^{(w,j)} \Big\|^2.
$
And for $\nabla f(x)$:
$
\|\nabla f(x)\|^2 \leq \frac{1}{R} \sum_{w \in \mathcal{R}} \|\nabla f_w(x)\|^2 \leq \max_{w \in \mathcal{R}} \|\nabla f_w(x)\|^2 \leq K \max_{w \in \mathcal{R}} \Big\| \frac{1}{J} \sum_{j=1}^{J} a^{(w,j)} \Big\|^2.
$
Combining these yields the bound for $A$.
Lemma 3 (Justification of Assumption 3 and relation between $A$ and $\xi$):
Similarly, for regular workers in distributed softmax regression, Assumption 3 is satisfied with:
$
\xi \leq 2\sqrt{K} \max_{w \in \mathcal{R}} \Big\| \frac{1}{J} \sum_{j=1}^{J} a^{(w,j)} \Big\|.
$
Furthermore, in a sufficiently heterogeneous case (where each regular worker specializes in one class, and each class belongs to only one worker), it is shown that:
$
\xi = \Theta\Big( \max_{w \in \mathcal{R}} \Big\| \frac{1}{J} \sum_{j=1}^{J} a^{(w,j)} \Big\| \Big).
$
This implies that $\xi$ is of the same order as the maximum average feature norm of the regular workers. If the feature norms of regular and poisoned workers are comparable, then $A = O(\xi)$. This relation, $A = O(\xi)$, is crucial for comparing the learning error bounds of the mean aggregator and the robust aggregators.
4.2.6. $\rho$-Robust Aggregators
To analyze robust aggregators, the paper uses the concept of a $\rho$-robust aggregator:
**Definition 4 ($\boldsymbol{\rho}$-robust aggregator):**
Consider any $W$ messages $y_1, y_2, \dots, y_W \in \mathbb{R}^D$, among which $R$ messages are from `regular workers` $w \in \mathcal{R}$. An aggregator $\mathrm{RAgg}(\cdot)$ is said to be a $\rho$-robust aggregator if there exists a contraction constant $\rho$ such that:
$
\big\| \mathrm{RAgg}(\{y_1, \cdots, y_W\}) - \bar{y} \big\| \leq \rho \cdot \max_{w \in \mathcal{R}} \|y_w - \bar{y}\|,
$
where $\bar{y}$ is the average message of the regular workers. A smaller $\rho$ indicates that the robust aggregator's output is closer to the true average of the regular messages.
**Lemma 5 (Lower bound for $\rho$):**
This lemma states that a $\rho$-robust aggregator exists only if the `fraction of poisoned workers` $\delta \triangleq 1 - \frac{R}{W}$ is less than $\frac{1}{2}$ ($\delta < \frac{1}{2}$), and $\rho \geq \min\{\frac{\delta}{1-2\delta}, 1\}$. This shows that $\rho$ cannot be arbitrarily small and is fundamentally limited by $\delta$.
The paper provides specific $\rho$ values for common `robust aggregators` (a minimal TriMean sketch with an empirical check of this contraction property follows this list):
* **TriMean (Lemma 14):** If $\delta < \frac{1}{2}$, `TriMean` is a $\rho$-robust aggregator with a contraction constant on the order of $\frac{\delta}{1-2\delta}$ (consistent with its entry in Table 1). `TriMean` discards the smallest and largest $W-R$ elements in each dimension. The proof (Appendix B.2) bounds the error in each dimension and then extends the bound to the vector norm.
* **Centered Clipping (CC) (Lemma 15):** Under a condition on $\delta$ stated in Lemma 15, one-step `CC` (with proper initialization and clipping threshold) is a $\rho$-robust aggregator with $\rho = \sqrt{24\delta}$. `CC` iteratively clips messages that are too far from a central point. The proof (Appendix B.3) bounds the error by analyzing the clipped and unclipped components.
* **FABA (Lemma 16):** If $\delta < \frac{1}{3}$, `FABA` is a $\rho$-robust aggregator with a contraction constant on the order of $\frac{\delta}{1-3\delta}$ (consistent with its entry in Table 1). `FABA` iteratively discards the message farthest from the current average. The proof (Appendix B.4) analyzes the removal process and the error accumulation.
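The sketch below implements TriMean as described in Lemma 14 and measures the empirical contraction ratio from Definition 4 on a toy example with one poisoned message ($\delta = 0.1$); it is an illustration of the definition, not the paper's code.

```python
import numpy as np

def trimmed_mean(messages, num_poisoned):
    """TriMean: in each dimension, discard the num_poisoned smallest and
    num_poisoned largest entries (W - R from each end), then average the rest."""
    Y = np.sort(np.stack(messages), axis=0)
    return Y[num_poisoned:Y.shape[0] - num_poisoned].mean(axis=0)

def contraction_ratio(agg_output, regular_messages):
    """Empirical ratio ||RAgg(y) - y_bar|| / max_w ||y_w - y_bar|| from Definition 4,
    where y_bar is the average of the regular messages only."""
    y_bar = np.mean(np.stack(regular_messages), axis=0)
    max_dev = max(np.linalg.norm(y - y_bar) for y in regular_messages)
    return np.linalg.norm(agg_output - y_bar) / max_dev

rng = np.random.default_rng(1)
regular = [rng.normal(size=5) for _ in range(9)]   # 9 regular messages
poisoned = [regular[0] + 3.0]                      # 1 bounded-disturbance message
out = trimmed_mean(regular + poisoned, num_poisoned=1)
print(contraction_ratio(out, regular))             # ratio stays below 1 in this toy run
```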
4.2.7. Main Results
The paper derives learning error bounds for both robust aggregators and the mean aggregator, and a lower bound for any identity-invariant algorithm.
Theorem 7 (Learning Error for $\rho$-Robust Aggregators):
For Algorithm 1 using a $\rho$-robust aggregator $\mathrm{RAgg}(\cdot)$ to solve (1), under Assumptions 1-4, with a `fraction of poisoned workers` $\delta \in [0, \frac{1}{2})$, an appropriate `step size` $\gamma$ and `momentum coefficient` $\alpha = 8L\gamma$, the average squared `gradient norm` (a measure of the `learning error`) is bounded by:
$
\frac{1}{T} \sum_{t=1}^{T} \mathbb{E} \|\nabla f(x^t)\|^2
= O\left( \rho^2 \xi^2 + \sqrt{\frac{(L F^0 + \rho^2 \sigma^2)(\rho^2 + 1)\sigma^2}{T}} + \frac{L F^0 + (\rho^2 + 1)\sigma^2 + \rho^2 \xi^2}{T} \right).
$
Here:
* $\mathbb{E}[\cdot]$ denotes the `expectation`.
* $T$ is the total number of `iterations`.
* $L$ is the `Lipschitz constant` (Assumption 2).
* $F^0 \triangleq f(x^0) - f^*$ is the initial gap to the optimal `cost`.
* $\rho$ is the `contraction constant` of the `robust aggregator` (Definition 4).
* $\xi$ is the `heterogeneity parameter` (Assumption 3).
* $\sigma^2$ is the `bounded inner variance` (Assumption 4).
The `non-vanishing learning error` (the term independent of $T$) is $O(\rho^2 \xi^2)$.
**Theorem 8 (Learning Error for Mean Aggregator):**
For Algorithm 1 using the `mean aggregator` $\mathrm{Mean}(\cdot)$ to solve (1), under Assumptions 1, 2, 4, and 5, with a `fraction of poisoned workers` $\delta \in [0, 1)$, an appropriate `step size` $\gamma$ and `momentum coefficient` $\alpha = 8L\gamma$, the average squared `gradient norm` is bounded by:
$
\frac{1}{T} \sum_{t=1}^{T} \mathbb{E} \|\nabla f(x^t)\|^2
= O\left( \delta^2 A^2 + \sqrt{\frac{(L F^0 + \delta^2 \sigma^2)(\delta^2 + 1)\sigma^2}{T}} + \frac{L F^0 + (\delta^2 + 1)\sigma^2 + \delta^2 A^2}{T} \right).
$
Here:
* $\delta$ is the `fraction of poisoned workers`.
* $A$ is the `bounded disturbance parameter` (Assumption 5).
The `non-vanishing learning error` is $O(\delta^2 A^2)$.
**Theorem 9 (Lower Bound of Learning Error):**
For `label poisoning attacks` with a `fraction of poisoned workers` $\delta$, any `identity-invariant algorithm` (an algorithm whose output is unaffected by which specific workers are regular or poisoned, given the same set of messages) running for $T$ iterations will have a `learning error` lower bounded by:
$
\frac{1}{T} \sum_{t=1}^{T} \mathbb{E} \|\nabla f(x^t)\|^2 = \Omega\big(\delta^2 \min\{A^2, \xi^2\}\big).
$
This `lower bound` implies that some `non-vanishing error` is unavoidable for `identity-invariant algorithms` (which most practical algorithms are).
**Comparison and Optimality:**
The `non-vanishing learning error` for `robust aggregators` is $O(\rho^2 \xi^2)$, while for the `mean aggregator` it is $O(\delta^2 A^2)$.
* When the disturbances from `poisoned workers` are small (i.e., $A < \xi$), the `lower bound` becomes $\Omega(\delta^2 A^2)$. In this case, the `mean aggregator`'s `learning error` $O(\delta^2 A^2)$ matches the `lower bound` in order, indicating `order-optimality`.
* When `distributed data` is `sufficiently heterogeneous`, $A$ is of the same order as $\xi$ ($A = O(\xi)$), as discussed in Section 4.1. The paper then presents Table 1 to compare the `learning errors`.
The following are the results from Table 1 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<th>Aggregator</th>
<th>Learning error</th>
</tr>
</thead>
<tbody>
<tr>
<td>TriMean</td>
<td>O(δ<sup>2</sup>ξ<sup>2</sup>/(1−2δ)<sup>2</sup>)</td>
</tr>
<tr>
<td>CC</td>
<td>O(δξ<sup>2</sup>)</td>
</tr>
<tr>
<td>FABA</td>
<td>O(δ<sup>2</sup>ξ<sup>2</sup>/(1−3δ)<sup>2</sup>)</td>
</tr>
<tr>
<td>Mean</td>
<td>O(δ<sup>2</sup>ξ<sup>2</sup>)</td>
</tr>
<tr>
<td>Lower bound</td>
<td>Ω(δ<sup>2</sup>ξ<sup>2</sup>)</td>
</tr>
</tbody>
</table></div>
This table shows that when $\delta$ (fraction of poisoned workers) is small, `TriMean`, `FABA`, and `Mean` all match the `lower bound` in order. However:
* `TriMean`'s error explodes as $\delta \to \frac{1}{2}$.
* `FABA`'s error explodes as $\delta \to \frac{1}{3}$.
* The `mean aggregator`'s error is `order-optimal` ($O(\delta^2 \xi^2)$) regardless of the `fraction of poisoned workers` (within the $\delta < 1$ assumption).
* The `mean aggregator` is also superior to `CC` by a factor of $\delta$ (a short derivation of these orders follows this list).
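As a sanity check on Table 1, substituting each aggregator's contraction constant into the non-vanishing term $O(\rho^2\xi^2)$ of Theorem 7, and $A = \Theta(\xi)$ into the term $O(\delta^2 A^2)$ of Theorem 8, reproduces the listed orders (the TriMean and FABA constants are read off from their Table 1 entries):
$
\begin{aligned}
\text{TriMean: } & \rho = O\Big(\tfrac{\delta}{1-2\delta}\Big) && \Rightarrow\ O(\rho^2\xi^2) = O\Big(\tfrac{\delta^2\xi^2}{(1-2\delta)^2}\Big), \\
\text{CC: } & \rho = \sqrt{24\delta} && \Rightarrow\ O(\rho^2\xi^2) = O(\delta\xi^2), \\
\text{FABA: } & \rho = O\Big(\tfrac{\delta}{1-3\delta}\Big) && \Rightarrow\ O(\rho^2\xi^2) = O\Big(\tfrac{\delta^2\xi^2}{(1-3\delta)^2}\Big), \\
\text{Mean: } & A = \Theta(\xi) && \Rightarrow\ O(\delta^2 A^2) = O(\delta^2\xi^2), \ \text{matching the lower bound } \Omega(\delta^2\xi^2).
\end{aligned}
$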
**Remark 10:** The analysis holds even for `distributed stochastic gradient descent` (when $\alpha=1$) or `distributed gradient descent` (when $\alpha=1$ and $\sigma=0$). The `non-vanishing errors` remain dominant by the `heterogeneity` ($\xi^2$) and `disturbance` ($A^2$) terms when they are larger than `stochastic variance` ($\sigma^2$), confirming the paper's conclusions.
# 5. Experimental Setup
## 5.1. Datasets
The experiments were conducted on two widely used image classification datasets:
* **MNIST:** A dataset of handwritten digits (0-9).
* **Source:** National Institute of Standards and Technology.
* **Scale:** 60,000 training images and 10,000 testing images.
* **Characteristics:** Grayscale images of size 28x28 pixels. Relatively simple, often used for benchmarking.
* **Domain:** Digit recognition.
* **Use in Paper:**
* `Convex case`: `Softmax regression` (a linear classifier).
* `Non-convex case`: `Two-layer perceptrons` (MLP) with ReLU activation, 50 neurons per layer.
* **CIFAR10:** A dataset of natural images.
* **Source:** Canadian Institute for Advanced Research.
* **Scale:** 50,000 training images and 10,000 testing images.
* **Characteristics:** Color images of size 32x32 pixels, belonging to 10 classes (e.g., airplane, automobile, bird). More complex than MNIST.
* **Domain:** Object recognition.
* **Use in Paper:** `Non-convex case`: `Convolutional Neural Networks` (CNNs). The architecture is provided in Table 2.
The following are the results from Table 2 of the original paper:
<div class="table-wrapper"><table>
<thead>
<tr>
<th>Layer Name</th>
<th>Layer size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Convolution + ReLU</td>
<td>3 × 3 × 16</td>
</tr>
<tr>
<td>Max pool</td>
<td>2× 2</td>
</tr>
<tr>
<td>Convolution + ReLU</td>
<td>3 × 3 × 32</td>
</tr>
<tr>
<td>Max pool</td>
<td>2× 2</td>
</tr>
<tr>
<td>Fully connected + ReLU</td>
<td>128</td>
</tr>
<tr>
<td>Softmax</td>
<td>10</td>
</tr>
</tbody>
</table></div>
These datasets were chosen because they are standard benchmarks for classification tasks, allowing for comparison with previous works and validation across different model complexities (linear, shallow neural networks, deep CNNs).
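For reference, the Table 2 architecture can be written down in a few lines of PyTorch. The layer sizes follow the table, while `padding=1` in the convolutions is an assumption made here (the paper does not state the padding), chosen so that the flattened feature map is 8x8x32 for 32x32 CIFAR10 inputs.

```python
import torch.nn as nn

# Sketch of the CIFAR10 CNN from Table 2 (padding=1 is assumed, not stated in the paper).
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # Convolution + ReLU, 3x3x16
    nn.MaxPool2d(2),                                         # Max pool, 2x2 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # Convolution + ReLU, 3x3x32
    nn.MaxPool2d(2),                                         # Max pool, 2x2 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),                   # Fully connected + ReLU, 128
    nn.Linear(128, 10),                                      # 10-way output; softmax applied by the loss
)
```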
## 5.2. Data Partitioning
The experiments were set up with $W=10$ `workers`. By default, $R=9$ `workers` are `regular`, and one `worker` is `poisoned`. The impact of different `fractions of poisoned workers` (e.g., $R=8$ or $R=7$) is explored in Appendix H.
Three data distribution scenarios were considered to investigate the impact of `heterogeneity`:
* **i.i.d. (independent and identically distributed):** Training data is uniformly and randomly divided among all `workers`. This represents the least `heterogeneous` scenario.
* **Mild non-i.i.d.:** Training data is divided using a `Dirichlet distribution` with `hyper-parameter` $\beta=1$ by default (Hsu et al., 2019). This creates some level of `data heterogeneity` where `workers` might have different proportions of classes.
* **Non-i.i.d.:** Each class of the training data is assigned to one `worker`. For instance, in MNIST (10 classes), each `worker` might predominantly (or exclusively) receive data from a single digit class. This represents the most extreme `heterogeneity`.
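One common way to realize the mild non-i.i.d. split described above is a per-class Dirichlet allocation (Hsu et al., 2019). The sketch below is illustrative (the function name and defaults are not from the paper), with smaller `beta` yielding more skewed local label distributions.

```python
import numpy as np

def dirichlet_partition(labels, num_workers=10, beta=1.0, seed=0):
    """Assign sample indices to workers using a per-class Dirichlet(beta) mixture."""
    rng = np.random.default_rng(seed)
    worker_indices = [[] for _ in range(num_workers)]
    for k in np.unique(labels):
        idx_k = np.flatnonzero(labels == k)
        rng.shuffle(idx_k)
        proportions = rng.dirichlet(beta * np.ones(num_workers))  # class-k share per worker
        cut_points = (np.cumsum(proportions)[:-1] * len(idx_k)).astype(int)
        for w, chunk in enumerate(np.split(idx_k, cut_points)):
            worker_indices[w].extend(chunk.tolist())
    return worker_indices
```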
## 5.3. Label Poisoning Attacks
Two types of `label poisoning attacks` were investigated:
* **Static Label Flipping:** The `poisoned worker` deterministically flips labels. For MNIST, a label $b$ is flipped to `9-b` (e.g., 0 becomes 9, 1 becomes 8, etc.). This is a fixed, non-adaptive attack.
* **Dynamic Label Flipping:** The `poisoned worker` adapts its attack strategy. At each iteration, a label $b$ is flipped to the `least probable label` with respect to the current `global model` $x^t$ (Shejwalkar et al., 2022). This is a more sophisticated, adaptive attack.
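Both attacks can be expressed in a few lines. The sketch below assumes 10 classes (as in MNIST/CIFAR10) and, for the dynamic variant, access to the current global model's logits on the poisoned worker's samples, following the descriptions above.

```python
import numpy as np

def static_label_flip(labels, num_classes=10):
    """Static label flipping: b -> 9 - b on MNIST-style labels (i.e., (K-1) - b)."""
    return (num_classes - 1) - np.asarray(labels)

def dynamic_label_flip(logits):
    """Dynamic label flipping (Shejwalkar et al., 2022): relabel each sample with the
    least probable class under the current global model x^t.
    logits: (num_samples, num_classes) model outputs on the worker's clean features."""
    return np.argmin(logits, axis=1)
```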
## 5.4. Aggregators to Compare
The performance of the `mean aggregator` was compared against several representative $\rho$-robust aggregators and a state-of-the-art label poisoning defense (minimal sketches of CC and FABA are given after this list):
- Mean: the simple average of all received messages. It is also used as a baseline without attacks to show the ideal performance.
- TriMean (Trimmed Mean): a robust aggregator that discards a certain percentage of the smallest and largest values in each dimension before averaging.
- FABA (Fast Aggregation Against Byzantine Attacks): an iterative robust aggregator that discards outliers based on their distance from the current average.
- CC (Centered Clipping): an aggregator that clips messages exceeding a certain threshold, bringing extreme values closer to the center.
- LFighter: a state-of-the-art defense specifically designed for label poisoning attacks in federated learning, which clusters local gradients to identify and discard poisoned ones (Jebreel et al., 2024).
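Minimal sketches of one-step CC and FABA, matching the descriptions above (TriMean was sketched after Definition 4); the clipping center, threshold, and number of discarded messages are left as caller-supplied parameters rather than the specific choices used in the paper.

```python
import numpy as np

def centered_clipping(messages, center, tau):
    """One-step CC: clip each message's deviation from `center` to radius tau,
    then move the center by the average clipped deviation."""
    Y = np.stack(messages)
    dev = Y - center
    norms = np.linalg.norm(dev, axis=1, keepdims=True)
    scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))  # min(1, tau / ||deviation||)
    return center + (dev * scale).mean(axis=0)

def faba(messages, num_to_discard):
    """FABA: repeatedly drop the message farthest from the current average,
    then return the average of the surviving messages."""
    Y = list(messages)
    for _ in range(num_to_discard):
        avg = np.mean(np.stack(Y), axis=0)
        Y.pop(int(np.argmax([np.linalg.norm(y - avg) for y in Y])))
    return np.mean(np.stack(Y), axis=0)
```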
5.5. Hyperparameters
The step size $\gamma$ and the momentum coefficient $\alpha$ were fixed across all experiments; their specific values are reported in the paper.
5.6. Evaluation Metrics
The primary evaluation metric used in the experiments is classification accuracy.
- Classification Accuracy: This metric measures the proportion of correctly predicted instances out of the total instances in a dataset.
  - Conceptual Definition: Classification accuracy is a straightforward measure of a model's overall correctness. It indicates how well the model predicts the correct class label for a given input, and it is most informative for balanced datasets where all classes are equally important.
  - Mathematical Formula:
    $
    \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
    $
  - Symbol Explanation:
    - Number of correct predictions: the count of instances where the model's predicted class label matches the true class label.
    - Total number of predictions: the total number of instances in the dataset being evaluated (e.g., the test set).
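For completeness, the metric amounts to a one-line computation (illustrative sketch):

```python
import numpy as np

def classification_accuracy(predictions, targets):
    """Fraction of samples whose predicted class equals the true class."""
    return float(np.mean(np.asarray(predictions) == np.asarray(targets)))
```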
6. Results & Analysis
The experimental results consistently validate the theoretical findings, particularly highlighting the superior performance of the mean aggregator under label poisoning attacks when data heterogeneity is high.
6.1. Convex Case: Softmax Regression on MNIST
6.1.1. Classification Accuracy
The performance of different aggregators for softmax regression on the MNIST dataset is shown for static label flipping (Figure 1) and dynamic label flipping (Figure 2).
The following figure (Figure 1 from the original paper) shows the accuracies of softmax regression on the MNIST dataset under static label flipping attacks:
The figure shows the accuracy of softmax regression over iterations under the different data distributions (IID, mild non-IID, and non-IID) for the Baseline, Mean, TriMean, FABA, CC, and LFighter aggregators.
The following figure (Figure 2 from the original paper) shows the accuracies of softmax regression on the MNIST dataset under dynamic label flipping attacks:

Analysis of Figures 1 & 2:
- i.i.d. Case: All methods perform well, close to the baseline (the mean aggregator without attacks). Interestingly, the mean aggregator under attack shows a slightly lower accuracy than the robust aggregators here. This suggests that in low-heterogeneity environments, robust aggregators might still offer some benefit.
- Mild non-i.i.d. Case: FABA and LFighter emerge as the best performers, with the other aggregators showing similar performance.
- Non-i.i.d. Case: This is where the paper's key finding is most evident. Due to the high heterogeneity, all aggregators are significantly affected, showing larger gaps to the baseline. However, in this scenario, the mean aggregator performs the best among all aggregators. This strongly corroborates the theoretical prediction that the mean aggregator is more robust when data heterogeneity is large.
6.1.2. Heterogeneity and Disturbance Parameters ($\xi$ and $A$)
To support the theoretical claims about $\xi$ and $A$, Figure 3 plots their estimated values for softmax regression.
The following figure (Figure 3 from the original paper) shows the heterogeneity of the regular local gradients ($\xi$) and the disturbance of the poisoned local gradients ($A$) in softmax regression on the MNIST dataset, under static label flipping and dynamic label flipping attacks:

Analysis of Figure 3:
- Bounded $A$: The disturbance of the poisoned local gradients ($A$) remains bounded under both static and dynamic label flipping attacks, validating Assumption 5 and Lemma 2.
- Increasing $\xi$: As the data distribution shifts from i.i.d. to mild non-i.i.d. to non-i.i.d., the heterogeneity of the regular local gradients ($\xi$) increases.
- $A$ versus $\xi$ in the Non-i.i.d. Case: Crucially, in the non-i.i.d. case, $\xi$ becomes comparable to $A$. This aligns with the theoretical discussion in Section 4.1 ($A = O(\xi)$ under sufficient heterogeneity). When $A$ is comparable to $\xi$, according to Table 1, the mean aggregator's learning error ($O(\delta^2 A^2)$) becomes order-optimal and often lower than that of the robust aggregators. This explains the superior performance of the mean aggregator observed in Figures 1 and 2 for the non-i.i.d. case.
6.2. Nonconvex Case: Two-layer Perceptrons on MNIST and CNNs on CIFAR10
6.2.1. Classification Accuracy
The evaluation extends to non-convex problems: two-layer perceptrons on MNIST and CNNs on CIFAR10. Figures 4 and 5 show the accuracies for static and dynamic label flipping attacks, respectively.
The following figure (Figure 4 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks:
The figure shows, on the MNIST and CIFAR10 datasets, the accuracy over iterations of the Baseline, Mean, TriMean, FABA, CC, and LFighter methods under IID, mild non-IID, and non-IID conditions.
The following figure (Figure 5 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks:

Analysis of Figures 4 & 5:
- i.i.d. Case: Similar to the convex case, all methods generally perform well, close to the baseline. An exception is CC on CIFAR10 under dynamic label flipping, which performs worse than the others.
- Mild non-i.i.d. Case:
  - On MNIST, all methods perform well, close to the baseline.
  - On CIFAR10, Mean, FABA, and LFighter perform best and are close to the baseline. CC and TriMean perform worse, with TriMean being the worst and showing a significant gap under dynamic label flipping.
- Non-i.i.d. Case: Again, this scenario strongly supports the theory. All methods are affected by the attacks, but the mean aggregator consistently emerges as the best performer. CC, FABA, and LFighter are worse, and TriMean largely fails (very low accuracy). This reaffirms the mean aggregator's robustness in highly heterogeneous non-convex settings under label poisoning.
6.2.2. Heterogeneity and Disturbance Parameters ($\xi$ and $A$)
Figure 6 shows the values of $\xi$ and $A$ for the non-convex cases.
The following figure (Figure 6 from the original paper) shows the heterogeneity of the regular local gradients ($\xi$) and the disturbance of the poisoned local gradients ($A$) in training two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset, under static label flipping and dynamic label flipping attacks:

Analysis of Figure 6:
- The findings mirror the convex case: $A$ (disturbance) remains bounded across all scenarios, validating Assumption 5.
- $\xi$ (heterogeneity) increases from i.i.d. to non-i.i.d.
- In the non-i.i.d. case, $A$ is close to $\xi$, further supporting the theoretical condition under which the mean aggregator is order-optimal.
6.3. Impacts of Heterogeneity and Attack Strengths
To further analyze the nuanced interplay of heterogeneity and attack strength, experiments were conducted with two-layer perceptrons on MNIST, varying the Dirichlet hyper-parameter $\beta$ (smaller $\beta$ means larger heterogeneity) and the label flipping probability $p$ (larger $p$ means stronger static label flipping attacks). Figure 7 summarizes which aggregator performs best under each combination.
The following figure (Figure 7 from the original paper) shows the best accuracies of two-layer perceptrons trained by all aggregators on the MNIST dataset under static label flipping attacks. Each block is associated with a hyper-parameter $\beta$ that characterizes the heterogeneity and a flipping probability $p$ that characterizes the attack strength. For each block, the best accuracy and the corresponding aggregator are marked. Orange means that the mean aggregator is the best:

Analysis of Figure 7:
- Mean Aggregator's Dominance: The orange blocks indicate where the mean aggregator achieves the best performance. It clearly dominates when heterogeneity is large (small $\beta$ values).
- Effect of Heterogeneity ($\beta$): For a fixed flipping probability $p$, as $\beta$ decreases (increasing heterogeneity), the mean aggregator progressively surpasses the robust aggregators.
- Effect of Attack Strength ($p$): For a fixed heterogeneity (fixed $\beta$), as $p$ decreases (weaker attacks), the mean aggregator tends to perform better.
- Recommendation: The results suggest recommending the mean aggregator when the distributed data are sufficiently heterogeneous, or when the disturbance caused by label poisoning attacks is comparable to the heterogeneity of the regular local gradients.

The full results are provided in Table 3 (Appendix F). The following are the results from Table 3 of the original paper:
| β | p | Mean | CC | FABA | LFighter | TriMean |
|---|---|---|---|---|---|---|
| 5 | 0.0 | 0.9441 | 0.9385 | 0.9410 | 0.9420 | 0.9429 |
| | 0.2 | 0.9426 | 0.9405 | 0.9439 | 0.9437 | 0.9425 |
| | 0.4 | 0.9443 | 0.9421 | 0.9437 | 0.9456 | 0.9430 |
| | 0.6 | 0.9427 | 0.9402 | 0.9431 | 0.9439 | 0.9397 |
| | 0.8 | 0.9429 | 0.9408 | 0.9424 | 0.9443 | 0.9386 |
| | 1.0 | 0.9415 | 0.9382 | 0.9423 | 0.9437 | 0.9371 |
| 1 | 0.0 | 0.9437 | 0.9362 | 0.9390 | 0.9361 | 0.9402 |
| | 0.2 | 0.9448 | 0.9371 | 0.9417 | 0.9341 | 0.9373 |
| | 0.4 | 0.9417 | 0.9404 | 0.9447 | 0.9404 | 0.9396 |
| | 0.6 | 0.9386 | 0.9355 | 0.9409 | 0.9422 | 0.9365 |
| | 0.8 | 0.9402 | 0.9323 | 0.9386 | 0.9431 | 0.9318 |
| | 1.0 | 0.9424 | 0.9401 | 0.9414 | 0.9433 | 0.9273 |
| 0.1 | 0.0 | 0.9407 | 0.9251 | 0.9404 | 0.9170 | 0.9134 |
| | 0.2 | 0.9423 | 0.9292 | 0.9271 | 0.9313 | 0.9076 |
| | 0.4 | 0.9420 | 0.9270 | 0.9229 | 0.9201 | 0.8942 |
| | 0.6 | 0.9372 | 0.9226 | 0.8996 | 0.9377 | 0.8891 |
| | 0.8 | 0.9278 | 0.9135 | 0.9431 | 0.9433 | 0.8498 |
| | 1.0 | 0.8327 | 0.8305 | 0.9426 | 0.9449 | 0.8054 |
| 0.05 | 0.0 | 0.9468 | 0.9317 | 0.8976 | 0.8617 | 0.8763 |
| | 0.2 | 0.9467 | 0.9311 | 0.8603 | 0.8845 | 0.8529 |
| | 0.4 | 0.9474 | 0.9279 | 0.8601 | 0.8860 | 0.8526 |
| | 0.6 | 0.9418 | 0.9248 | 0.8571 | 0.9343 | 0.8492 |
| | 0.8 | 0.9201 | 0.9155 | 0.9394 | 0.9385 | 0.8576 |
| | 1.0 | 0.8573 | 0.8950 | 0.9384 | 0.9374 | 0.8106 |
| 0.03 | 0.0 | 0.9426 | 0.9299 | 0.8634 | 0.8517 | 0.8523 |
| | 0.2 | 0.9413 | 0.9298 | 0.8507 | 0.8493 | 0.8427 |
| | 0.4 | 0.9393 | 0.9268 | 0.8506 | 0.8502 | 0.8370 |
| | 0.6 | 0.9374 | 0.9206 | 0.8481 | 0.8449 | 0.8246 |
| | 0.8 | 0.9159 | 0.8679 | 0.8019 | 0.8313 | 0.7869 |
| | 1.0 | 0.8312 | 0.8152 | 0.7437 | 0.9396 | 0.7514 |
| 0.01 | 0.0 | 0.9456 | 0.9164 | 0.8678 | 0.8681 | 0.8008 |
| | 0.2 | 0.9441 | 0.9165 | 0.8657 | 0.8660 | 0.7788 |
| | 0.4 | 0.9415 | 0.9135 | 0.8631 | 0.8632 | 0.7858 |
| | 0.6 | 0.9356 | 0.9039 | 0.8601 | 0.8399 | 0.7708 |
| | 0.8 | 0.8827 | 0.8750 | 0.8155 | 0.7864 | 0.7457 |
| | 1.0 | 0.8505 | 0.8278 | 0.7731 | 0.8162 | 0.7258 |
6.4. Bounded Variance of Stochastic Gradients (Appendix G)
Figure 8 verifies Assumption 4, which states that the variance of stochastic gradients is bounded.
The following figure (Figure 8 from the original paper) shows the maximum variance of regular stochastic gradients and variance of poisoned stochastic gradients in softmax regression on the MNIST dataset under static label flipping attacks:

Analysis of Figure 8:
The figure shows that both the maximum variance of regular stochastic gradients and the variance of poisoned stochastic gradients remain bounded across different heterogeneity settings and iterations under static label flipping attacks. This supports the realism of Assumption 4, indicating that label poisoning (corrupting labels but not features) does not introduce arbitrarily large stochastic gradient variance.
6.5. Impact of Fraction of Poisoned Workers (Appendix H)
The paper also investigates the impact of increasing the fraction of poisoned workers (from 1 to 2, then 3 poisoned workers out of 10). Figures 9-12 illustrate these results.
The following figure (Figure 9 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks:
The figure shows the accuracy curves for the IID, mild non-IID, and non-IID cases, comparing the different aggregators on MNIST (two-layer perceptrons) and CIFAR10 (CNNs) under static label flipping attacks.
The following figure (Figure 10 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks:
The figure shows, for the different data distributions (IID, mild non-IID, and non-IID), the accuracy of two-layer perceptrons on MNIST and CNNs on CIFAR10 as the number of iterations increases; different lines and markers represent the different algorithms.
The following figure (Figure 11 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks:
The figure shows the accuracy of two-layer perceptrons and CNNs on MNIST and CIFAR10 under the different data distributions (IID, mild non-IID, and non-IID), comparing model performance under static label flipping attacks.
The following figure (Figure 12 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks:

Analysis of Figures 9-12:
- Consistency with the Single-Poisoned-Worker Case: The general trend observed with one poisoned worker holds: the mean aggregator generally outperforms the robust aggregators in the non-i.i.d. case.
- Performance Degradation with More Poisoned Workers: As the fraction of poisoned workers increases (from one to two to three poisoned workers out of ten), the performance of both the mean aggregator and the robust aggregators decreases. This is expected and aligns with the theoretical results in Theorems 7 and 8, where the learning error bounds grow with $\delta$ (through $\delta^2 A^2$ for the mean aggregator, and through $\rho$, which depends on $\delta$, for the robust aggregators). This confirms that while the mean aggregator can be superior relative to the others, an increased number of poisoned workers still negatively impacts overall performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously investigates the robustness of aggregators in distributed learning systems against label poisoning attacks. Contrary to the common belief that robust aggregators are universally superior for defending against malicious activities, the study theoretically and empirically demonstrates that the simple mean aggregator can be more robust and even order-optimal when the distributed data are sufficiently heterogeneous. The theoretical framework, based on distributed stochastic momentum, provides specific learning error bounds that highlight the interplay between data heterogeneity ($\xi$), poisoning disturbance ($A$), and the fraction of poisoned workers ($\delta$). The critical insight is that robust aggregators, designed for worst-case Byzantine attacks, can be too conservative for bounded disturbances like label poisoning, leading them to misinterpret legitimate heterogeneity as malicious.
7.2. Limitations & Future Work
The paper primarily focuses on label poisoning attacks as a specific type of bounded disturbance. While this is a practical attack model, it does not cover all forms of malicious attacks (e.g., more arbitrary Byzantine attacks). The theoretical optimality of the mean aggregator is contingent on sufficient data heterogeneity and bounded poisoning disturbances (where $A = O(\xi)$). The paper notes that in the i.i.d. or mild non-i.i.d. case, robust aggregators might still be competitive or even slightly better.
For future work, the authors plan to extend this analysis to the more challenging decentralized learning problem. In decentralized learning, there is no central server, and workers communicate directly, making aggregation and robustness even more complex.
7.3. Personal Insights & Critique
This paper offers a refreshing and critical perspective on the application of robust aggregators in distributed learning. It challenges the default assumption that more complex robust aggregators are always better, emphasizing the importance of understanding the specific attack model and data characteristics.
My insights are:
- Context-Dependent Robustness: The key takeaway is that robustness is not a one-size-fits-all concept. The effectiveness of an aggregator strongly depends on the nature of the attack (e.g., bounded vs. unbounded disturbances) and the inherent properties of the data (heterogeneity). This highlights the need for more nuanced design choices in federated learning and other distributed learning systems.
- Rethinking Simple Baselines: The mean aggregator is often seen as a naive baseline, vulnerable to any form of attack. This paper demonstrates that simplicity can be an advantage when sophisticated defenses become over-conservative. It encourages researchers to re-evaluate simple methods under specific, practical constraints.
- Adaptive Aggregation: The findings could pave the way for adaptive aggregation strategies in which the system dynamically selects the most appropriate aggregator based on real-time assessments of data heterogeneity and potential attack profiles. For instance, a system might use the mean aggregator if heterogeneity is high and label poisoning is suspected, but switch to a Byzantine-robust aggregator if more severe, arbitrary model poisoning attacks are detected.
- The Role of Assumptions: The paper's strength lies in its clear articulation and justification of assumptions (especially Assumption 5). This underscores how crucial assumptions about the attacker's capabilities are in designing and analyzing robust learning algorithms. When attackers are constrained, the optimal defense might change significantly.
- Generalizability of $A = O(\xi)$: While justified for softmax regression, the generalizability of $A = O(\xi)$ to other models (e.g., deep neural networks) might require further empirical validation or theoretical analysis. The experimental results on two-layer perceptrons and CNNs provide good empirical support, but a universal theoretical proof of $A = O(\xi)$ across all non-convex models under label poisoning would be a powerful extension.
- Practical Implications: For real-world federated learning deployments, where data heterogeneity is rampant and label errors (intentional or accidental) are common, this work provides a strong argument for considering the mean aggregator. It suggests that the perceived vulnerability of the mean aggregator might be overstated for certain practical threat models.