Paper status: completed

Mean Aggregator is More Robust than Robust Aggregators under Label Poisoning Attacks on Distributed Heterogeneous Data

Published: 04/21/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study reveals that the mean aggregator outperforms robust aggregators under label poisoning attacks in distributed heterogeneous data, proving order-optimal learning error when data is sufficiently heterogeneous, supported by experimental validation.

Abstract

Robustness to malicious attacks is of paramount importance for distributed learning. Existing works usually consider the classical Byzantine attacks model, which assumes that some workers can send arbitrarily malicious messages to the server and disturb the aggregation steps of the distributed learning process. To defend against such worst-case Byzantine attacks, various robust aggregators have been proposed. They are proven to be effective and much superior to the often-used mean aggregator. In this paper, however, we demonstrate that the robust aggregators are too conservative for a class of weak but practical malicious attacks, known as label poisoning attacks, where the sample labels of some workers are poisoned. Surprisingly, we are able to show that the mean aggregator is more robust than the state-of-the-art robust aggregators in theory, given that the distributed data are sufficiently heterogeneous. In fact, the learning error of the mean aggregator is proven to be order-optimal in this case. Experimental results corroborate our theoretical findings, showing the superiority of the mean aggregator under label poisoning attacks.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Mean Aggregator is More Robust than Robust Aggregators under Label Poisoning Attacks on Distributed Heterogeneous Data

1.2. Authors

  • Jie Peng (School of Computer Science and Engineering, Sun Yat-Sen University)
  • Weiyu Li (School of Engineering and Applied Science, Harvard University)
  • Stefan Vlaski (Department of Electrical and Electronic Engineering, Imperial College London)
  • Qing Ling (School of Computer Science and Engineering, Sun Yat-Sen University) - Corresponding author

1.3. Journal/Conference

The conference version was published at the International Joint Conference on Artificial Intelligence (IJCAI) in 2024; this paper is an extended version of that work (Peng et al., 2024). IJCAI is a highly reputable and influential conference in the field of Artificial Intelligence.

1.4. Publication Year

2024

1.5. Abstract

The paper addresses the crucial issue of robustness to malicious attacks in distributed learning. While existing research primarily focuses on classical Byzantine attacks (where workers send arbitrarily malicious messages), leading to the development of various robust aggregators deemed superior to the mean aggregator, this study challenges that assumption for a different class of attacks. The authors demonstrate that these robust aggregators can be overly conservative under label poisoning attacks, a weaker yet practical form of attack where only sample labels of some workers are corrupted. Surprisingly, the paper theoretically shows that the mean aggregator is more robust than state-of-the-art robust aggregators when distributed data is sufficiently heterogeneous. In this specific scenario, the mean aggregator's learning error is proven to be order-optimal. Experimental results corroborate these theoretical findings, highlighting the mean aggregator's superiority under label poisoning attacks.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is ensuring robustness in distributed learning systems against malicious attacks. Distributed learning, especially federated learning, has gained immense popularity for its effectiveness in large-scale machine learning and privacy preservation. However, its decentralized nature makes it vulnerable.

Traditionally, research in this area has focused on classical Byzantine attacks, a worst-case scenario where malicious workers can send arbitrarily corrupted messages. To combat this, many robust aggregators have been developed, often showing superior performance compared to the simple mean aggregator.

The paper argues that these traditional robust aggregators might be "too conservative" for more practical and less destructive attacks, specifically label poisoning attacks. In such attacks, malicious workers only corrupt the labels of their local data, but otherwise follow the algorithmic protocol. This is a practical concern in scenarios like spam detection where users might intentionally mislabel emails. The existing gap is understanding how these robust aggregators perform under such bounded disturbances, and if the mean aggregator, often dismissed as vulnerable, might have unexpected strengths.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • First Investigation into Mean Aggregator Robustness: It is the first work to specifically investigate the robustness of the mean aggregator in distributed learning under label poisoning attacks. This challenges the prevailing assumption that robust aggregators are universally superior.

  • Theoretical Proof of Mean Aggregator's Superiority: The paper theoretically demonstrates that, under label poisoning attacks and given sufficiently heterogeneous distributed data, the mean aggregator is more robust than state-of-the-art robust aggregators. This is a surprising finding, as robust aggregators are designed specifically for attack scenarios.

  • Order-Optimality of Mean Aggregator: For the scenario where distributed data is sufficiently heterogeneous, the learning error of the mean aggregator is proven to be order-optimal, irrespective of the fraction of poisoned workers.

  • Comprehensive Theoretical Analysis: The paper provides new theoretical analyses for distributed stochastic momentum as the backbone algorithm, establishing tight upper bounds and a lower bound for the learning error for both mean and robust aggregators.

  • Empirical Validation: Extensive experimental results on both convex (softmax regression) and non-convex (two-layer perceptrons, CNNs) tasks, across different levels of data heterogeneity and attack strengths, fully support the theoretical findings.

    The key conclusion is that the choice of aggregator should be nuanced, depending on the specific attack model and data characteristics. For label poisoning attacks on heterogeneous data, the simple mean aggregator can surprisingly outperform complex robust aggregators.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational understanding of distributed learning, optimization algorithms, and security in machine learning is essential.

  • Distributed Learning: This refers to the training of machine learning models across multiple computational devices (workers) coordinated by a central server (parameter server).
    • Server-Worker Architecture: In this setup, the server holds the global model. Workers download the global model, compute local updates (e.g., stochastic gradients or momenta) using their local data, and send these updates back to the server. The server then aggregates these local updates to update the global model.
    • Aggregation: The process by which the server combines the messages (updates) received from workers to form a single update for the global model. The simplest form is the mean aggregator, which just averages all received messages.
    • Federated Learning (FL): A specific type of distributed learning where data remains decentralized on client devices, offering privacy benefits. The paper frames its work within the context of distributed learning, with FL being a prominent application.
  • Stochastic Optimization Algorithms:
    • Stochastic Gradient Descent (SGD): An iterative optimization algorithm that updates model parameters using the gradient calculated from a single randomly selected data sample (or a small batch). This introduces stochastic noise.
    • Momentum: A technique used in optimization algorithms to accelerate convergence, especially in the presence of noisy gradients. It accumulates a velocity vector based on past gradients, allowing the optimization process to "smooth out" oscillations and move more consistently towards the optimum. The distributed stochastic momentum algorithm used in this paper combines stochastic gradients with momentum. The update rule for the local momentum $m_w^t$ at worker $w$ is $m_w^t = (1 - \alpha) m_w^{t-1} + \alpha \nabla f_{w, i_w^t}(x^t)$, where $\alpha \in [0, 1]$ is the momentum coefficient, $m_w^{t-1}$ is the previous momentum, and $\nabla f_{w, i_w^t}(x^t)$ is the stochastic gradient computed from sample $i_w^t$ at the global model $x^t$.
  • Malicious Attacks in Distributed Learning:
    • Byzantine Attacks: A severe class of attacks where malicious workers (Byzantine workers) can send arbitrarily corrupted messages to the server. This includes sending completely fabricated or nonsensical updates. The goal is to severely disrupt the learning process. The Byzantine generals problem (Lamport et al., 1982) is a classic computer science problem illustrating challenges in achieving consensus in the presence of malicious agents.
    • Robust Aggregators: Algorithms designed to defend against Byzantine attacks by identifying and mitigating the impact of malicious messages during the aggregation step. They aim to produce an aggregate update that is close to what would be obtained from only honest workers. Examples include Krum, geometric median, coordinate-wise median, trimmed-mean (TriMean), FABA, and centered clipping (CC).
    • Poisoning Attacks: Attacks that aim to corrupt the training data or model updates.
      • Model Poisoning Attacks: Malicious workers send arbitrarily poisoned model updates (gradients or weights) to the server. This is often what Byzantine attacks refer to.
      • Data Poisoning Attacks: Malicious workers fabricate or corrupt their local data to generate poisoned messages.
      • Label Poisoning Attacks: A specific and weaker type of data poisoning attack where malicious workers only mislabel their local data, but otherwise strictly follow the algorithmic protocol (e.g., computing gradients from the mislabeled data and sending them). The feature data remains clean. This is the focus of the paper.
  • Data Heterogeneity: In distributed learning, data heterogeneity refers to the scenario where the local data distributions across different workers are not identical. This is common in real-world federated learning settings.
    • $\xi$ (heterogeneity parameter): In this paper, Assumption 3 quantifies heterogeneity as the maximum distance between the local gradient of any regular worker and the global gradient. A larger $\xi$ implies higher heterogeneity.
  • Poisoning Disturbance:
    • $A$ (disturbance parameter): In this paper, Assumption 5 quantifies the disturbance caused by poisoned workers as the maximum distance between the poisoned local gradient of a poisoned worker and the global gradient.

3.2. Previous Works

The paper contextualizes its contributions by discussing various related works:

  • Classical Byzantine Robustness:
    • Works like Krum (Blanchard et al., 2017), geometric median (Chen et al., 2017), coordinate-wise median (Yin et al., 2018), coordinate-wise trimmed-mean (TriMean) (Yin et al., 2018), FABA (Xia et al., 2019), centered clipping (CC) (Karimireddy et al., 2021), and VRMOM (Tu et al., 2021) are cited as examples of robust aggregators designed for worst-case Byzantine attacks. Their core idea is to find an aggregate that is robust to outliers, effectively filtering out malicious messages.
    • Unified frameworks for analyzing these aggregators exist (Farhadkhani et al., 2022; Allouah et al., 2023).
  • Handling Stochastic Gradient Noise:
    • Some works (Khanduri et al., 2019; Wu et al., 2020; Karimireddy et al., 2021; Rammal et al., 2024; Guerraoui et al., 2024) combine variance-reduction and momentum techniques with robust aggregators to alleviate the effect of stochastic gradient noise and enhance Byzantine robustness.
  • Addressing Data Heterogeneity:
    • The performance of robust aggregators degrades with data heterogeneity (Li et al., 2019; Karimireddy et al., 2022). Solutions include model aggregation instead of stochastic gradient aggregation (Li et al., 2019), or bucketing/resampling and nearest neighbor mixing (Karimireddy et al., 2022; Peng et al., 2022; Allouah et al., 2023) to reduce message heterogeneity.
  • Data Poisoning Attacks:
    • A broad category of attacks (Sun et al., 2019; Bagdasaryan et al., 2020; Wang et al., 2020; Rosenfeld et al., 2020; Cinà et al., 2024). Defenses often involve data sanitization (Steinhardt et al., 2017) or pruning activation units (Liu et al., 2018).
  • Label Poisoning Specific Works:
    • Tavallali et al. (2022) proposed a regularization-based defense to detect and exclude samples with flipped labels, but it requires a clean validation set, which raises privacy concerns.
    • LFighter (Jebreel et al., 2024) is a state-of-the-art defense that clusters local gradients to identify and discard poisoned ones. Its performance can degrade with high data heterogeneity.
  • Optimality of Robust Aggregators:
    • Karimireddy et al. (2022) and Farhadkhani et al. (2024) have shown the optimality of certain robust aggregators for model poisoning and data poisoning attacks, but they primarily consider scenarios where poisoned workers can cause unbounded disturbances.
  • Preliminary Observations on Mean Aggregator:
    • Shejwalkar et al. (2022) observed the mean aggregator's robustness in production federated learning systems, but only with restrictive poisoning ratios (< 0.1%) and without theoretical analysis.

3.3. Technological Evolution

The field has evolved from focusing on worst-case Byzantine attacks with arbitrary malicious behavior to considering more realistic and bounded attack models, such as label poisoning.
Initially, the development of robust aggregators aimed to completely mitigate the impact of any malicious input. However, as distributed learning systems encounter real-world data heterogeneity, and as attack models become more refined (e.g., label poisoning, a specific type of data poisoning that is less destructive than full Byzantine attacks), the effectiveness and necessity of these conservative robust aggregators are being re-evaluated. This paper marks a shift in perspective, suggesting that simpler aggregators might be optimal under specific, practical attack conditions. It builds upon the theoretical understanding of convergence for stochastic momentum algorithms and the characterization of robust aggregators' approximation abilities (ρ-robustness).

3.4. Differentiation Analysis

The core differentiation of this paper from related works lies in its focus and findings:

  • Attack Model: Unlike most prior works that focus on worst-case Byzantine attacks (arbitrary messages, unbounded disturbances), this paper specifically investigates label poisoning attacks where disturbances are bounded (Assumption 5 holds). This distinction is critical.
  • Aggregator Performance: While previous research largely concludes that robust aggregators are superior to the mean aggregator for Byzantine robustness, this paper demonstrates the opposite for label poisoning attacks under heterogeneous data.
  • Optimality: Prior works on optimality (Karimireddy et al., 2022; Farhadkhani et al., 2024) consider unbounded disturbances. This paper proves the order-optimality of the mean aggregator specifically for bounded disturbances caused by label poisoning when data is heterogeneous.
  • Theoretical vs. Empirical Observations: While Shejwalkar et al. (2022) made empirical observations about the mean aggregator's robustness, this paper provides rigorous theoretical analysis and extends the observations to higher poisoning ratios.
  • Role of Heterogeneity: The paper explicitly links the superior performance of the mean aggregator to the condition of "sufficiently heterogeneous distributed data," showing that the heterogeneity parameter $\xi$ and the poisoning disturbance parameter $A$ become comparable in such settings. This is a nuanced understanding not deeply explored by prior works promoting general robust aggregators.

4. Methodology

4.1. Principles

The core idea behind the method is to rigorously analyze the learning error of distributed stochastic momentum when different aggregators (the mean aggregator vs. robust aggregators) are used, specifically under label poisoning attacks on heterogeneous data. The theoretical basis hinges on understanding how the contraction properties of robust aggregators (quantified by $\rho$) and the direct impact of poisoned workers (quantified by $\delta$ and $A$) affect the convergence bounds in the presence of data heterogeneity ($\xi$) and stochastic noise ($\sigma$). The surprising intuition revealed is that when data heterogeneity is high, the robust aggregators might be too aggressive in filtering messages, inadvertently discarding valuable information or treating legitimate heterogeneous messages as malicious, whereas the mean aggregator simply averages everything, which can be more effective if the poisoning disturbance is bounded and comparable to the natural heterogeneity.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

The paper aims to solve a distributed learning problem defined over a set of regular workers $\mathcal{R}$ out of a total set of workers $\mathcal{W}$. The objective function is:
$$
\min_{x \in \mathbb{R}^D} f(x) \triangleq \frac{1}{R} \sum_{w \in \mathcal{R}} f_w(x), \quad \text{with } f_w(x) \triangleq \frac{1}{J} \sum_{j=1}^{J} f_{w,j}(x), \ \forall w \in \mathcal{R}.
$$
Here:

  • $x \in \mathbb{R}^D$ is the vector of global model parameters, of dimension $D$.

  • $f(x)$ is the global cost function, the average of the local costs of all regular workers.

  • $f_w(x)$ is the local cost function of regular worker $w \in \mathcal{R}$.

  • $f_{w,j}(x)$ is the cost associated with the $j$-th sample of worker $w$.

  • $J$ is the number of samples each worker has (assumed to be uniform across workers).

  • $R = |\mathcal{R}|$ is the number of regular workers.

    The problem considers the presence of poisoned workers in $\mathcal{W} \setminus \mathcal{R}$.

4.2.2. Label Poisoning Attacks

The paper defines label poisoning attacks as follows. Definition 1 (Label poisoning attacks): In solving (1), there exist a number of poisoned workers, whose local costs take the same form as those of the regular workers but with an arbitrary fraction of sample labels poisoned. Nevertheless, these poisoned workers exactly follow the algorithmic protocol during the distributed learning process.

This definition is crucial as it distinguishes label poisoning from classical Byzantine attacks. Poisoned workers do not send arbitrary messages; they compute messages correctly based on their (corrupted) local data.

4.2.3. Distributed Stochastic Momentum Algorithm

The chosen optimization algorithm is distributed stochastic momentum, which generalizes distributed gradient descent and distributed stochastic gradient descent. The algorithm proceeds iteratively:

Algorithm 1: Distributed Stochastic Momentum with Aggregators

  • Initialization: global model $x^0$, initial momenta for regular workers $m_w^{-1} = \nabla f_{w, i_w^0}(x^0)$, initial momenta for poisoned workers $\tilde{m}_w^{-1} = \nabla \tilde{f}_{w, i_w^0}(x^0)$, step size $\gamma$, momentum coefficient $\alpha$, total iterations $T$.
  • For $t = 0, \dots, T-1$ do:
    1. Server Broadcast: Server sends the global model $x^t$ to all workers.
    2. Worker Computation (Regular): Each regular worker $w \in \mathcal{R}$ samples $i_w^t$ uniformly from $\{1, \dots, J\}$, computes the stochastic gradient $\nabla f_{w, i_w^t}(x^t)$, and updates its local momentum $m_w^t$:
       $$m_w^t = (1 - \alpha) m_w^{t-1} + \alpha \nabla f_{w, i_w^t}(x^t).$$
       This momentum is sent as the message $\hat{m}_w^t = m_w^t$ to the server.
    3. Worker Computation (Poisoned): Each poisoned worker $w \in \mathcal{W} \setminus \mathcal{R}$ samples $i_w^t$ uniformly, computes the poisoned stochastic gradient $\nabla \tilde{f}_{w, i_w^t}(x^t)$ (based on poisoned labels), and updates its poisoned local momentum $\tilde{m}_w^t$:
       $$\tilde{m}_w^t = (1 - \alpha) \tilde{m}_w^{t-1} + \alpha \nabla \tilde{f}_{w, i_w^t}(x^t).$$
       This momentum is sent as the message $\hat{m}_w^t = \tilde{m}_w^t$ to the server. For notational convenience, the message sent by worker $w$ is denoted as
       $$\hat{m}_w^t = \begin{cases} m_w^t, & w \in \mathcal{R}, \\ \tilde{m}_w^t, & w \in \mathcal{W} \setminus \mathcal{R}. \end{cases}$$
    4. Server Aggregation and Update: Server receives all messages $\{\hat{m}_w^t : w \in \mathcal{W}\}$ and updates $x^{t+1}$.
      • With a robust aggregator $\mathrm{RAgg}(\cdot)$:
        $$x^{t+1} = x^t - \gamma \cdot \mathrm{RAgg}(\{\hat{m}_w^t : w \in \mathcal{W}\}).$$
      • With the mean aggregator $\mathrm{Mean}(\cdot)$:
        $$x^{t+1} = x^t - \gamma \cdot \mathrm{Mean}(\{\hat{m}_w^t : w \in \mathcal{W}\}), \quad \text{where } \mathrm{Mean}(\{\hat{m}_w^t : w \in \mathcal{W}\}) \triangleq \frac{1}{W} \sum_{w \in \mathcal{W}} \hat{m}_w^t.$$
  • End for.
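To make the update flow concrete, below is a minimal NumPy sketch of Algorithm 1. The quadratic local losses, worker counts, and noise level are illustrative assumptions rather than the paper's setup, and the `mean_agg` call can be swapped for any of the robust aggregators sketched later in Section 4.2.6.

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, R, T = 5, 10, 9, 200           # dimension, total workers, regular workers, iterations
gamma, alpha = 0.01, 0.1             # step size and momentum coefficient (see Section 5.5)

# Hypothetical local objectives: worker w holds f_w(x) = 0.5 * ||x - c_w||^2, so its
# stochastic gradient is (x - c_w) + noise. Poisoned workers use shifted targets,
# mimicking a bounded disturbance in the sense of Assumption 5.
targets = rng.normal(size=(W, D))
targets[R:] += 3.0                    # poisoned workers' targets are perturbed

def stochastic_grad(w, x):
    return (x - targets[w]) + 0.1 * rng.normal(size=D)

def mean_agg(msgs):
    return np.mean(msgs, axis=0)

x = np.zeros(D)
momenta = np.stack([stochastic_grad(w, x) for w in range(W)])   # m_w^{-1}

for t in range(T):
    msgs = []
    for w in range(W):
        g = stochastic_grad(w, x)                               # regular or poisoned gradient
        momenta[w] = (1 - alpha) * momenta[w] + alpha * g       # local momentum update
        msgs.append(momenta[w].copy())                          # message \hat{m}_w^t
    x = x - gamma * mean_agg(np.stack(msgs))                    # server update (mean aggregator)

print("final model:", x)
```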

4.2.4. Assumptions

The theoretical analysis relies on several standard assumptions in distributed optimization:

  • Assumption 1 (Lower boundedness): The global cost $f(\cdot)$ is lower bounded by $f^*$, i.e., $f(x) \geq f^*$. This is a standard assumption for non-convex optimization ensuring the function does not go to negative infinity.
  • Assumption 2 (Lipschitz continuous gradients): The global cost $f(\cdot)$ has $L$-Lipschitz continuous gradients, meaning the change in the gradient is bounded by the change in the input: $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for any $x, y \in \mathbb{R}^D$, with $L$ the Lipschitz constant.
  • Assumption 3 (Bounded heterogeneity): For any $x \in \mathbb{R}^D$, the maximum distance between the local gradient of any regular worker $w \in \mathcal{R}$ and the global gradient is upper-bounded by $\xi$:
    $$\max_{w \in \mathcal{R}} \|\nabla f_w(x) - \nabla f(x)\| \leq \xi.$$
    This assumption quantifies the data heterogeneity among regular workers. A larger $\xi$ indicates higher heterogeneity.
  • Assumption 4 (Bounded inner variance): For any $x \in \mathbb{R}^D$, the variance of the local stochastic gradients of any worker $w \in \mathcal{W}$ with respect to the local gradient is upper-bounded by $\sigma^2$:
    $$\mathbb{E}_{i_w} \|\nabla f_{w,i_w}(x) - \nabla f_w(x)\|^2 \leq \sigma^2, \ \forall w \in \mathcal{R}, \qquad \mathbb{E}_{i_w} \|\nabla \tilde{f}_{w,i_w}(x) - \nabla \tilde{f}_w(x)\|^2 \leq \sigma^2, \ \forall w \in \mathcal{W} \setminus \mathcal{R},$$
    where $i_w$ is a randomly selected sample index. This means the stochastic gradients are not excessively noisy. It applies to both regular and poisoned workers, as label poisoning is assumed not to drastically change the gradient variance.
  • Assumption 5 (Bounded disturbances of poisoned local gradients): For any $x \in \mathbb{R}^D$, the maximum distance between the poisoned local gradient of any poisoned worker $w \in \mathcal{W} \setminus \mathcal{R}$ and the global gradient is upper-bounded by $A$:
    $$\max_{w \in \mathcal{W} \setminus \mathcal{R}} \|\nabla \tilde{f}_w(x) - \nabla f(x)\| \leq A.$$
    This is a critical assumption for label poisoning attacks, implying that the maliciousness is bounded, unlike worst-case Byzantine attacks where $A$ could be arbitrary.
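As a simple illustration of Assumptions 3 and 5, the following sketch computes, at a single point, the smallest $\xi$ and $A$ consistent with a given set of local gradients; the synthetic gradient arrays are placeholders for gradients produced by real workers.

```python
import numpy as np

def estimate_xi_and_A(regular_grads, poisoned_grads):
    """Smallest xi (Assumption 3) and A (Assumption 5) compatible with one point x.

    regular_grads:  array of shape (R, D), local gradients of the regular workers.
    poisoned_grads: array of shape (P, D), poisoned local gradients.
    """
    global_grad = regular_grads.mean(axis=0)   # \nabla f(x): average over regular workers only
    xi = np.linalg.norm(regular_grads - global_grad, axis=1).max()
    A = np.linalg.norm(poisoned_grads - global_grad, axis=1).max()
    return xi, A

rng = np.random.default_rng(1)
reg = rng.normal(size=(9, 4))         # hypothetical gradients of 9 regular workers
poi = rng.normal(size=(1, 4)) + 2.0   # one poisoned worker with a bounded shift
print(estimate_xi_and_A(reg, poi))
```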

4.2.5. Justification of Assumption 5

The paper provides a theoretical justification for Assumption 5 in the context of distributed softmax regression, a common model for classification tasks.

Example: Distributed Softmax Regression under Label Poisoning Attacks. The local cost of a regular worker $w \in \mathcal{R}$ is given by
$$
f_w(x) = \frac{1}{J} \sum_{j=1}^{J} f_{w,j}(x), \quad \text{where } f_{w,j}(x) = -\sum_{k=1}^{K} \mathbf{1}\{b^{(w,j)} = k\} \log \frac{\exp(x_k^T a^{(w,j)})}{\sum_{l=1}^{K} \exp(x_l^T a^{(w,j)})},
$$
and for a poisoned worker $w \in \mathcal{W} \setminus \mathcal{R}$, whose labels $b^{(w,j)}$ are replaced by $\tilde{b}^{(w,j)}$,
$$
\tilde{f}_w(x) = \frac{1}{J} \sum_{j=1}^{J} \tilde{f}_{w,j}(x), \quad \text{where } \tilde{f}_{w,j}(x) = -\sum_{k=1}^{K} \mathbf{1}\{\tilde{b}^{(w,j)} = k\} \log \frac{\exp(x_k^T a^{(w,j)})}{\sum_{l=1}^{K} \exp(x_l^T a^{(w,j)})}.
$$
Here:

  • $K$ is the number of classes.
  • $(a^{(w,j)}, b^{(w,j)})$ is the $j$-th sample of worker $w$, with $a^{(w,j)} \in \mathbb{R}^d$ the feature and $b^{(w,j)} \in \mathbb{R}$ the label.
  • $\mathbf{1}\{b^{(w,j)} = k\}$ is an indicator function, equal to 1 if the label is $k$ and 0 otherwise.
  • $x_k \triangleq [x]_{kd:(k+1)d} \in \mathbb{R}^d$ is the $k$-th block of the global model $x$.

Lemma 11 (Bounded Gradients of Local Costs): This lemma first establishes that for softmax regression the gradients are bounded:
$$
\|\nabla f_{w,j}(x)\| \leq 2\|a^{(w,j)}\|, \ \forall w \in \mathcal{R}, \forall j \in [J]; \qquad \|\nabla \tilde{f}_{w,j}(x)\| \leq 2\|a^{(w,j)}\|, \ \forall w \in \mathcal{W} \setminus \mathcal{R}, \forall j \in [J].
$$
The gradients of the sample costs ($f_{w,j}(x)$ and $\tilde{f}_{w,j}(x)$) are bounded by twice the norm of the feature vector $a^{(w,j)}$. The gradient of each local cost is in turn bounded by the maximum norm of the local features:
$$
\|\nabla f_w(x)\| \leq 2 \max_{j \in [J]} \|a^{(w,j)}\|, \ \forall w \in \mathcal{R}; \qquad \|\nabla \tilde{f}_w(x)\| \leq 2 \max_{j \in [J]} \|a^{(w,j)}\|, \ \forall w \in \mathcal{W} \setminus \mathcal{R}.
$$
Moreover, if $a^{(w,j)}$ is entry-wise non-negative (e.g., pixel values in image classification), a tighter bound exists:
$$
\|\nabla f_w(x)\| \leq \sqrt{K} \Big\|\frac{1}{J}\sum_{j=1}^{J} a^{(w,j)}\Big\|, \ \forall w \in \mathcal{R}; \qquad \|\nabla \tilde{f}_w(x)\| \leq \sqrt{K} \Big\|\frac{1}{J}\sum_{j=1}^{J} a^{(w,j)}\Big\|, \ \forall w \in \mathcal{W} \setminus \mathcal{R}.
$$
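The per-sample bound in Lemma 11 can be spot-checked numerically. The sketch below is an illustration only (the helper `sample_grad` is introduced here, not taken from the paper): it computes the softmax cross-entropy gradient for one randomly drawn sample and compares its norm against $2\|a^{(w,j)}\|$.

```python
import numpy as np

def sample_grad(x_blocks, a, b):
    """Gradient of the softmax cross-entropy f_{w,j} w.r.t. x = [x_1; ...; x_K].

    x_blocks: (K, d) array, the K blocks of the model; a: (d,) feature; b: true label.
    """
    logits = x_blocks @ a                        # x_k^T a for each class k
    p = np.exp(logits - logits.max())
    p /= p.sum()                                 # softmax probabilities
    coeff = p.copy()
    coeff[b] -= 1.0                              # p_k - 1{b = k}
    return coeff[:, None] * a[None, :]           # block k of the gradient is (p_k - 1{b=k}) a

rng = np.random.default_rng(0)
K, d = 10, 784
x_blocks = rng.normal(size=(K, d))
a = rng.random(d)                                # entry-wise non-negative feature (e.g., pixels)
b = rng.integers(K)
g = sample_grad(x_blocks, a, b)
print(np.linalg.norm(g), 2 * np.linalg.norm(a))  # the first value should not exceed the second
```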

Lemma 2 (Justification of Assumption 5 for Softmax Regression): This lemma proves that Assumption 5 holds for distributed softmax regression under label poisoning. Given the local costs (13) and (14) and entry-wise non-negative features $a^{(w,j)}$, Assumption 5 is satisfied with
$$
A \leq 2\sqrt{K} \max_{w \in \mathcal{W}} \Big\|\frac{1}{J}\sum_{j=1}^{J} a^{(w,j)}\Big\|.
$$
The proof uses Lemma 11 and the triangle inequality:
$$
\max_{w \in \mathcal{W} \setminus \mathcal{R}} \|\nabla \tilde{f}_w(x) - \nabla f(x)\|^2 \leq 2 \max_{w \in \mathcal{W} \setminus \mathcal{R}} \|\nabla \tilde{f}_w(x)\|^2 + 2 \|\nabla f(x)\|^2.
$$
Then, using the bounds from Lemma 11 for $\nabla \tilde{f}_w(x)$ and $\nabla f(x)$ (which is an average of the $\nabla f_w(x)$):
$$
\max_{w \in \mathcal{W} \setminus \mathcal{R}} \|\nabla \tilde{f}_w(x)\|^2 \leq K \max_{w \in \mathcal{W} \setminus \mathcal{R}} \Big\|\frac{1}{J}\sum_{j=1}^{J} a^{(w,j)}\Big\|^2,
$$
and
$$
\|\nabla f(x)\|^2 \leq \frac{1}{R}\sum_{w \in \mathcal{R}} \|\nabla f_w(x)\|^2 \leq \max_{w \in \mathcal{R}} \|\nabla f_w(x)\|^2 \leq K \max_{w \in \mathcal{R}} \Big\|\frac{1}{J}\sum_{j=1}^{J} a^{(w,j)}\Big\|^2.
$$
Combining these yields the bound for $A$.

Lemma 3 (Justification of Assumption 3 and relation between $A$ and $\xi$): Similarly, for regular workers in distributed softmax regression, Assumption 3 is satisfied with
$$
\xi \leq 2\sqrt{K} \max_{w \in \mathcal{R}} \Big\|\frac{1}{J}\sum_{j=1}^{J} a^{(w,j)}\Big\|.
$$
Furthermore, in a sufficiently heterogeneous case (where each regular worker specializes in one class and each class belongs to only one worker), it is shown that
$$
\xi = \Theta\Big(\max_{w \in \mathcal{R}} \Big\|\frac{1}{J}\sum_{j=1}^{J} a^{(w,j)}\Big\|\Big).
$$
This implies that $\xi$ is of the same order as the maximum average feature norm of the regular workers. If the feature norms of regular and poisoned workers are comparable, then $A = O(\xi)$. This relation, $A = O(\xi)$, is crucial for comparing the learning error bounds of the mean aggregator and the robust aggregators.

4.2.6. ρ-Robust Aggregators

To analyze robust aggregators, the paper uses the concept of a ρ-robust aggregator. Definition 4 (ρ-robust aggregator): Consider any $W$ messages $y_1, y_2, \dots, y_W \in \mathbb{R}^D$, among which $R$ messages are from regular workers $w \in \mathcal{R}$. An aggregator $\mathrm{RAgg}(\cdot)$ is said to be a ρ-robust aggregator if there exists a contraction constant $\rho \geq 0$ such that
$$
\left\| \mathrm{RAgg}(\{y_1, \dots, y_W\}) - \bar{y} \right\| \leq \rho \cdot \max_{w \in \mathcal{R}} \|y_w - \bar{y}\|,
$$
where $\bar{y} = \frac{1}{R}\sum_{w \in \mathcal{R}} y_w$ is the average message of the regular workers. A smaller $\rho$ indicates that the robust aggregator's output is closer to the true average of the regular messages.
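Definition 4 can be probed empirically for any candidate aggregator. The sketch below (synthetic messages, assumed worker split) measures the contraction ratio on a single draw; note that one draw only lower-bounds the worst-case constant $\rho$.

```python
import numpy as np

def contraction_ratio(agg, msgs, regular_idx):
    """Empirical ratio ||RAgg(msgs) - y_bar|| / max_w ||y_w - y_bar|| from Definition 4."""
    regular = msgs[regular_idx]
    y_bar = regular.mean(axis=0)                       # average of the regular messages
    num = np.linalg.norm(agg(msgs) - y_bar)
    den = np.linalg.norm(regular - y_bar, axis=1).max()
    return num / den

rng = np.random.default_rng(2)
msgs = rng.normal(size=(10, 5))
msgs[9] += 10.0                                        # one outlying (poisoned) message
print(contraction_ratio(lambda m: m.mean(axis=0), msgs, np.arange(9)))
```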

Lemma 5 (Lower bound for ρ): This lemma states that a ρ-robust aggregator exists only if the fraction of poisoned workers $\delta \triangleq 1 - \frac{R}{W}$ is less than $\frac{1}{2}$, and that $\rho \geq \min\{\frac{\delta}{1-2\delta}, 1\}$. This shows that $\rho$ cannot be arbitrarily small and is fundamentally limited by $\delta$. The paper provides specific $\rho$ values for common robust aggregators:

  • TriMean (Lemma 14): If $\delta < \frac{1}{2}$, TriMean is a ρ-robust aggregator with $\rho = \frac{3\delta}{1-2\delta} \min\{\sqrt{D}, \sqrt{R}\}$. TriMean discards the smallest and largest $W - R$ elements in each dimension. The proof (Appendix B.2) shows that its error in each dimension is proportional to $\max_{w \in \mathcal{R}} |[y_w]_d - [\bar{y}]_d|$, which is then extended to the vector norm.
  • Centered Clipping (CC) (Lemma 15): If $\delta < \frac{1}{2}$, one-step CC (with proper initialization and clipping threshold) is a ρ-robust aggregator with $\rho = \sqrt{24\delta}$. CC iteratively clips messages that are too far from a central point. The proof (Appendix B.3) bounds the error by analyzing the clipped and unclipped components.
  • FABA (Lemma 16): If $\delta < \frac{1}{3}$, FABA is a ρ-robust aggregator with $\rho = \frac{2\delta}{1-3\delta}$. FABA iteratively discards the message farthest from the current average. The proof (Appendix B.4) analyzes the removal process and error accumulation.
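For reference, here are minimal NumPy sketches of the three aggregators in the form described above (coordinate-wise trimming, one-step clipping around a center, and iterative farthest-point removal). The thresholds and removal counts are illustrative choices and may differ from the exact variants analyzed in the paper's appendices.

```python
import numpy as np

def trimmed_mean(msgs, n_poisoned):
    """Coordinate-wise TriMean: drop the n_poisoned smallest and largest values per dimension."""
    sorted_msgs = np.sort(msgs, axis=0)
    kept = sorted_msgs[n_poisoned: msgs.shape[0] - n_poisoned]
    return kept.mean(axis=0)

def centered_clipping(msgs, center, tau):
    """One-step CC: clip each message to a ball of radius tau around the center, then average."""
    diffs = msgs - center
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    return center + (diffs * scale).mean(axis=0)

def faba(msgs, n_poisoned):
    """FABA: repeatedly discard the message farthest from the current average."""
    kept = list(range(msgs.shape[0]))
    for _ in range(n_poisoned):
        avg = msgs[kept].mean(axis=0)
        dists = np.linalg.norm(msgs[kept] - avg, axis=1)
        kept.pop(int(np.argmax(dists)))
    return msgs[kept].mean(axis=0)

rng = np.random.default_rng(3)
msgs = rng.normal(size=(10, 4))
msgs[9] += 5.0                                         # one poisoned message
print(trimmed_mean(msgs, 1))
print(centered_clipping(msgs, msgs.mean(axis=0), 2.0))
print(faba(msgs, 1))
```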

4.2.7. Main Results

The paper derives learning error bounds for both robust aggregators and the mean aggregator, and a lower bound for any identity-invariant algorithm.

Theorem 7 (Learning Error for ρ-Robust Aggregators): For Algorithm 1 using a ρ-robust aggregator $\mathrm{RAgg}(\cdot)$ to solve (1), under Assumptions 1-4, with a fraction of poisoned workers $\delta \in [0, \frac{1}{2})$, an appropriate step size $\gamma$ and momentum coefficient $\alpha = 8L\gamma$, the average squared gradient norm (a measure of the learning error) is bounded by
$$
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla f(x^t)\|^2 = O\Bigg( \rho^2 \xi^2 + \sqrt{\frac{(L F^0 + \rho^2\sigma^2)(\rho^2+1)\sigma^2}{T}} + \frac{L F^0 + (\rho^2+1)\sigma^2 + \rho^2\xi^2}{T} \Bigg).
$$
Here:
  • $\mathbb{E}[\cdot]$ denotes the expectation.
  • $T$ is the total number of iterations.
  • $L$ is the Lipschitz constant (Assumption 2).
  • $F^0 \triangleq f(x^0) - f^*$ is the initial gap to the optimal cost.
  • $\rho$ is the contraction constant of the robust aggregator (Definition 4).
  • $\xi$ is the heterogeneity parameter (Assumption 3).
  • $\sigma^2$ is the bounded inner variance (Assumption 4).
The non-vanishing learning error (the term independent of $T$) is $O(\rho^2 \xi^2)$.

Theorem 8 (Learning Error for Mean Aggregator): For Algorithm 1 using the mean aggregator $\mathrm{Mean}(\cdot)$ to solve (1), under Assumptions 1, 2, 4, and 5, with a fraction of poisoned workers $\delta \in [0, 1)$, an appropriate step size $\gamma$ and momentum coefficient $\alpha = 8L\gamma$, the average squared gradient norm is bounded by
$$
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla f(x^t)\|^2 = O\Bigg( \delta^2 A^2 + \sqrt{\frac{(L F^0 + \delta^2\sigma^2)(\delta^2+1)\sigma^2}{T}} + \frac{L F^0 + (\delta^2+1)\sigma^2 + \delta^2 A^2}{T} \Bigg).
$$
Here:
  • $\delta$ is the fraction of poisoned workers.
  • $A$ is the bounded disturbance parameter (Assumption 5).
The non-vanishing learning error is $O(\delta^2 A^2)$.

Theorem 9 (Lower Bound of Learning Error): For label poisoning attacks with a fraction of poisoned workers $\delta$, any identity-invariant algorithm (an algorithm whose output is unaffected by which specific workers are regular or poisoned, given the same set of messages) running for $T$ iterations will have a learning error lower bounded by
$$
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla f(x^t)\|^2 = \Omega\big(\delta^2 \min\{A^2, \xi^2\}\big).
$$
This lower bound implies that some non-vanishing error is unavoidable for identity-invariant algorithms (which most practical algorithms are).

Comparison and Optimality: The non-vanishing learning error for robust aggregators is $O(\rho^2 \xi^2)$, while for the mean aggregator it is $O(\delta^2 A^2)$.
  • When the disturbances from poisoned workers are small (i.e., $A < \xi$), the lower bound becomes $\Omega(\delta^2 A^2)$. In this case, the mean aggregator's learning error $O(\delta^2 A^2)$ matches the lower bound in order, indicating order-optimality.
  • When the distributed data are sufficiently heterogeneous, $A$ is of the same order as $\xi$ ($A = O(\xi)$), as discussed in Section 4.1.
The paper then presents Table 1 to compare the learning errors. The following are the results from Table 1 of the original paper:

| Aggregator | Learning error |
| --- | --- |
| TriMean | $O\big(\delta^2\xi^2/(1-2\delta)^2\big)$ |
| CC | $O(\delta\xi^2)$ |
| FABA | $O\big(\delta^2\xi^2/(1-3\delta)^2\big)$ |
| Mean | $O(\delta^2\xi^2)$ |
| Lower bound | $\Omega(\delta^2\xi^2)$ |

This table shows that when $\delta$ (the fraction of poisoned workers) is small, TriMean, FABA, and Mean all match the lower bound in order. However:

  • TriMean's error explodes as $\delta \to \frac{1}{2}$.
  • FABA's error explodes as $\delta \to \frac{1}{3}$.
  • The mean aggregator's error is order-optimal ($O(\delta^2 \xi^2)$) regardless of the fraction of poisoned workers (within the $\delta < 1$ assumption).
  • The mean aggregator is also superior to CC by a factor of $\delta$.

Remark 10: The analysis holds even for distributed stochastic gradient descent (when $\alpha = 1$) or distributed gradient descent (when $\alpha = 1$ and $\sigma = 0$). The non-vanishing errors remain dominated by the heterogeneity ($\xi^2$) and disturbance ($A^2$) terms when these are larger than the stochastic variance ($\sigma^2$), confirming the paper's conclusions.

5. Experimental Setup

5.1. Datasets

The experiments were conducted on two widely used image classification datasets:

  • MNIST: A dataset of handwritten digits (0-9).
    • Source: National Institute of Standards and Technology.
    • Scale: 60,000 training images and 10,000 testing images.
    • Characteristics: Grayscale images of size 28×28 pixels. Relatively simple, often used for benchmarking.
    • Domain: Digit recognition.
    • Use in Paper:
      • Convex case: Softmax regression (a linear classifier).
      • Non-convex case: Two-layer perceptrons (MLP) with ReLU activation, 50 neurons per layer.
  • CIFAR10: A dataset of natural images.
    • Source: Canadian Institute for Advanced Research.
    • Scale: 50,000 training images and 10,000 testing images.
    • Characteristics: Color images of size 32×32 pixels, belonging to 10 classes (e.g., airplane, automobile, bird). More complex than MNIST.
    • Domain: Object recognition.
    • Use in Paper: Non-convex case: Convolutional neural networks (CNNs), with the architecture provided in Table 2.

The following are the results from Table 2 of the original paper:

| Layer Name | Layer size |
| --- | --- |
| Convolution + ReLU | 3 × 3 × 16 |
| Max pool | 2 × 2 |
| Convolution + ReLU | 3 × 3 × 32 |
| Max pool | 2 × 2 |
| Fully connected + ReLU | 128 |
| Softmax | 10 |

These datasets were chosen because they are standard benchmarks for classification tasks, allowing for comparison with previous works and validation across different model complexities (linear, shallow neural networks, deep CNNs).

5.2. Data Partitioning

The experiments were set up with $W = 10$ workers. By default, $R = 9$ workers are regular and one worker is poisoned.
The impact of different fractions of poisoned workers (e.g., $R = 8$ or $R = 7$) is explored in Appendix H. Three data distribution scenarios were considered to investigate the impact of heterogeneity:

  • i.i.d. (independent and identically distributed): Training data is uniformly and randomly divided among all workers. This represents the least heterogeneous scenario.
  • Mild non-i.i.d.: Training data is divided using a Dirichlet distribution with hyper-parameter $\beta = 1$ by default (Hsu et al., 2019). This creates some level of data heterogeneity, where workers might have different proportions of classes.
  • Non-i.i.d.: Each class of the training data is assigned to one worker. For instance, in MNIST (10 classes), each worker might predominantly (or exclusively) receive data from a single digit class. This represents the most extreme heterogeneity.

5.3. Label Poisoning Attacks

Two types of label poisoning attacks were investigated (a partitioning and label-flipping sketch is given after the aggregator list below):

  • Static Label Flipping: The poisoned worker deterministically flips labels. For MNIST, a label $b$ is flipped to $9 - b$ (e.g., 0 becomes 9, 1 becomes 8, etc.). This is a fixed, non-adaptive attack.
  • Dynamic Label Flipping: The poisoned worker adapts its attack strategy. At each iteration, a label $b$ is flipped to the least probable label with respect to the current global model $x^t$ (Shejwalkar et al., 2022). This is a more sophisticated, adaptive attack.

5.4. Aggregators to Compare

The performance of the mean aggregator was compared against several representative ρ-robust aggregators and a state-of-the-art label poisoning defense:

  • Mean: The simple average of all received messages. This is also used as a baseline without attacks to show ideal performance.
  • TriMean (Trimmed Mean): A robust aggregator that discards a certain percentage of the smallest and largest values in each dimension before averaging.
  • FABA (Fast Aggregation Against Byzantine Attacks): An iterative robust aggregator that discards outliers based on their distance from the current average.
  • CC (Centered Clipping): An aggregator that clips messages exceeding a certain threshold, bringing extreme values closer to the center.
  • LFighter: A state-of-the-art defense specifically designed for label poisoning attacks in federated learning, which clusters local gradients to identify and discard poisoned ones (Jebreel et al., 2024).
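To illustrate the setup of Sections 5.2 and 5.3, the sketch below partitions a labeled dataset across workers with a Dirichlet distribution and applies the static $9 - b$ label flip on a poisoned worker; the array sizes and the flipping probability are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def dirichlet_partition(labels, n_workers, beta, rng):
    """Assign sample indices to workers; a smaller beta yields higher heterogeneity."""
    parts = [[] for _ in range(n_workers)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(beta * np.ones(n_workers))          # class-c share per worker
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for w, chunk in enumerate(np.split(idx, cuts)):
            parts[w].extend(chunk.tolist())
    return [np.array(p) for p in parts]

def static_flip(labels, p, rng, n_classes=10):
    """Static label flipping: each label b becomes (n_classes - 1) - b with probability p."""
    flipped = labels.copy()
    mask = rng.random(len(labels)) < p
    flipped[mask] = (n_classes - 1) - flipped[mask]
    return flipped

rng = np.random.default_rng(4)
labels = rng.integers(0, 10, size=1000)                           # stand-in for MNIST labels
parts = dirichlet_partition(labels, n_workers=10, beta=0.1, rng=rng)
poisoned_labels = static_flip(labels[parts[9]], p=1.0, rng=rng)   # worker 9 is poisoned
print([len(p) for p in parts], poisoned_labels[:10])
```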

5.5. Hyperparameters

The step size for the optimization algorithm was set to $\gamma = 0.01$. The momentum coefficient was set to $\alpha = 0.1$.

5.6. Evaluation Metrics

The primary evaluation metric used in the experiments is classification accuracy.

  • Classification Accuracy: This metric measures the proportion of correctly predicted instances out of the total instances in a dataset.
    • Conceptual Definition: Classification accuracy is a straightforward measure of a model's overall correctness. It indicates how well the model predicts the correct class label for a given input. It is useful for balanced datasets where all classes are equally important.
    • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $
    • Symbol Explanation:
      • Number of correct predictions: The count of instances where the model's predicted class label matches the true class label.
      • Total number of predictions: The total number of instances in the dataset being evaluated (e.g., test set).

6. Results & Analysis

The experimental results consistently validate the theoretical findings, particularly highlighting the superior performance of the mean aggregator under label poisoning attacks when data heterogeneity is high.

6.1. Convex Case: Softmax Regression on MNIST

6.1.1. Classification Accuracy

The performance of different aggregators for softmax regression on the MNIST dataset is shown for static label flipping (Figure 1) and dynamic label flipping (Figure 2).

The following figure (Figure 1 from the original paper) shows the accuracies of softmax regression on the MNIST dataset under static label flipping attacks:

Figure 1: Accuracies of softmax regression on the MNIST dataset under static label flipping attacks. (The figure plots accuracy versus iterations under i.i.d., mild non-i.i.d., and non-i.i.d. data distributions for the Baseline, Mean, TriMean, FABA, CC, and LFighter aggregators.)

The following figure (Figure 2 from the original paper) shows the accuracies of softmax regression on the MNIST dataset under dynamic label flipping attacks:

Figure 2: Accuracies of softmax regression on the MNIST dataset under dynamic label flipping attacks.

Analysis of Figures 1 & 2:

  • i.i.d. Case: All methods perform well, close to the baseline (mean aggregator without attacks). Interestingly, the mean aggregator with attacks shows a slightly lower accuracy compared to robust aggregators here. This suggests that in low heterogeneity environments, robust aggregators might still offer some benefit.
  • Mild non-i.i.d. Case: FABA and LFighter emerge as the best performers, with other aggregators showing similar performance.
  • Non-i.i.d. Case: This is where the paper's key finding is most evident. Due to high heterogeneity, all aggregators are significantly affected, showing larger gaps to the baseline. However, in this scenario, the mean aggregator performs the best among all aggregators. This strongly corroborates the theoretical prediction that the mean aggregator is more robust when data heterogeneity is large.

6.1.2. Heterogeneity and Disturbance Parameters ($\xi$ and $A$)

To support the theoretical claims about $\xi$ and $A$, Figure 3 plots their estimated values for softmax regression.

The following figure (Figure 3 from the original paper) shows the heterogeneity of regular local gradients ($\xi$) and the disturbance of poisoned local gradients ($A$) in softmax regression on the MNIST dataset, under static label flipping and dynamic label flipping attacks:

Figure 3: Heterogeneity of regular local gradients (the smallest $\xi$ satisfying Assumption 3) and disturbance of poisoned local gradients (the smallest $A$ satisfying Assumption 5) in softmax regression on the MNIST dataset, under static label flipping and dynamic label flipping attacks.

Analysis of Figure 3:

  • Bounded $A$: The disturbance of the poisoned local gradients ($A$) remains bounded under both static and dynamic label flipping attacks, validating Assumption 5 and Lemma 2.
  • Increasing $\xi$: As the data distribution shifts from i.i.d. to mild non-i.i.d. to non-i.i.d., the heterogeneity of regular local gradients ($\xi$) increases.
  • $\xi \approx A$ in the non-i.i.d. case: Crucially, in the non-i.i.d. case, $\xi$ becomes comparable to $A$. This aligns with the theoretical discussion in Section 4.1 ($A = O(\xi)$ under sufficient heterogeneity). When $\xi \approx A$, according to Table 1, the mean aggregator's learning error $O(\delta^2 \xi^2)$ becomes order-optimal and often lower than that of the robust aggregators. This explains the superior performance of the mean aggregator observed in Figures 1 and 2 for the non-i.i.d. case.

6.2. Nonconvex Case: Two-layer Perceptrons on MNIST and CNNs on CIFAR10

6.2.1. Classification Accuracy

The evaluation extends to non-convex problems: two-layer perceptrons on MNIST and CNNs on CIFAR10. Figures 4 and 5 show the accuracies for static and dynamic label flipping attacks, respectively.

The following figure (Figure 4 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks:

Figure 4: Accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks. (The figure plots accuracy versus iterations under i.i.d., mild non-i.i.d., and non-i.i.d. distributions for the Baseline, Mean, TriMean, FABA, CC, and LFighter aggregators.)

The following figure (Figure 5 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks:

Figure 5: Accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks.

Analysis of Figures 4 & 5:

  • i.i.d. Case: Similar to the convex case, all methods generally perform well, close to the baseline. An exception is CC on CIFAR10 under dynamic label flipping, which performs worse than others.
  • Mild non-i.i.d. Case:
    • On MNIST, all methods perform well, close to the baseline.
    • On CIFAR10, Mean, FABA, and LFighter perform best and are close to the baseline. CC and TriMean perform worse, with TriMean being the worst and showing a significant gap under dynamic label flipping.
  • Non-i.i.d. Case: Again, this scenario strongly supports the theory. All methods are affected by attacks, but the mean aggregator consistently emerges as the best performer. CC, FABA, and LFighter are worse, and TriMean largely fails (shows very low accuracy). This reaffirms the mean aggregator's robustness in highly heterogeneous non-convex settings under label poisoning.

6.2.2. Heterogeneity and Disturbance Parameters ($\xi$ and $A$)

Figure 6 shows the values of $\xi$ and $A$ for the non-convex cases.

The following figure (Figure 6 from the original paper) shows the heterogeneity of regular local gradients ($\xi$) and the disturbance of poisoned local gradients ($A$) in training two-layer perceptrons on the MNIST dataset and training convolutional neural networks on the CIFAR10 dataset, under static label flipping and dynamic label flipping attacks:

Figure 6: Heterogeneity of regular local gradients (the smallest $\xi$ satisfying Assumption 3) and disturbance of poisoned local gradients (the smallest $A$ satisfying Assumption 5) in training two-layer perceptrons on the MNIST dataset and training convolutional neural networks on the CIFAR10 dataset, under static label flipping and dynamic label flipping attacks.

Analysis of Figure 6:

  • The findings mirror the convex case: $A$ (disturbance) remains bounded across all scenarios, validating Assumption 5.
  • $\xi$ (heterogeneity) increases from i.i.d. to non-i.i.d.
  • In the non-i.i.d. case, $\xi$ is close to $A$, further supporting the theoretical condition under which the mean aggregator is order-optimal.

6.3. Impacts of Heterogeneity and Attack Strengths

To further analyze the nuanced interplay of heterogeneity and attack strengths, experiments were conducted with two-layer perceptrons on MNIST, varying the Dirichlet hyper-parameter $\beta$ (smaller $\beta$ means larger heterogeneity) and the label flipping probability $p$ (larger $p$ means stronger static label flipping attacks). Figure 7 summarizes which aggregator performs best under different combinations.

The following figure (Figure 7 from the original paper) shows the best accuracies of trained two-layer perceptrons by all aggregators on the MNIST dataset under static label flipping attacks. Each block is associated with a hyper-parameter $\beta$ that characterizes the heterogeneity and a flipping probability $p$ that characterizes the attack strength. For each block, the best accuracy and the corresponding aggregator are marked. Orange means that the mean aggregator is the best:

Figure 7: Best accuracies of trained two-layer perceptrons by all aggregators on the MNIST dataset under static label flipping attacks. Each block is associated with a hyper-parameter $\beta$ that characterizes the heterogeneity and a flipping probability $p$ that characterizes the attack strength. For each block, the best accuracy and the corresponding aggregator are marked. Orange means that the mean aggregator is the best.

Analysis of Figure 7:

  • Mean Aggregator's Dominance: The orange blocks indicate where the mean aggregator achieves the best performance. It clearly dominates when heterogeneity is large (small $\beta$ values, e.g., $\beta = 0.01$, $\beta = 0.03$).

  • Effect of Heterogeneity ($\beta$): For a fixed flipping probability $p$, as $\beta$ decreases (increasing heterogeneity), the mean aggregator progressively surpasses the robust aggregators.

  • Effect of Attack Strength ($p$): For a fixed heterogeneity (fixed $\beta$), as $p$ decreases (weaker attack strength), the mean aggregator tends to perform better.

  • Recommendation: The results suggest recommending the mean aggregator when distributed data is sufficiently heterogeneous or when the disturbance caused by label poisoning attacks is comparable to the heterogeneity of regular local gradients.

    The full results are provided in Table 3 (Appendix F). The following are the results from Table 3 of the original paper:

| β | p | Mean | CC | FABA | LFighter | TriMean |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.0 | 0.9441 | 0.9385 | 0.9410 | 0.9420 | 0.9429 |
| 5 | 0.2 | 0.9426 | 0.9405 | 0.9439 | 0.9437 | 0.9425 |
| 5 | 0.4 | 0.9443 | 0.9421 | 0.9437 | 0.9456 | 0.9430 |
| 5 | 0.6 | 0.9427 | 0.9402 | 0.9431 | 0.9439 | 0.9397 |
| 5 | 0.8 | 0.9429 | 0.9408 | 0.9424 | 0.9443 | 0.9386 |
| 5 | 1.0 | 0.9415 | 0.9382 | 0.9423 | 0.9437 | 0.9371 |
| 1 | 0.0 | 0.9437 | 0.9362 | 0.9390 | 0.9361 | 0.9402 |
| 1 | 0.2 | 0.9448 | 0.9371 | 0.9417 | 0.9341 | 0.9373 |
| 1 | 0.4 | 0.9417 | 0.9404 | 0.9447 | 0.9404 | 0.9396 |
| 1 | 0.6 | 0.9386 | 0.9355 | 0.9409 | 0.9422 | 0.9365 |
| 1 | 0.8 | 0.9402 | 0.9323 | 0.9386 | 0.9431 | 0.9318 |
| 1 | 1.0 | 0.9424 | 0.9401 | 0.9414 | 0.9433 | 0.9273 |
| 0.1 | 0.0 | 0.9407 | 0.9251 | 0.9404 | 0.9170 | 0.9134 |
| 0.1 | 0.2 | 0.9423 | 0.9292 | 0.9271 | 0.9313 | 0.9076 |
| 0.1 | 0.4 | 0.9420 | 0.9270 | 0.9229 | 0.9201 | 0.8942 |
| 0.1 | 0.6 | 0.9372 | 0.9226 | 0.8996 | 0.9377 | 0.8891 |
| 0.1 | 0.8 | 0.9278 | 0.9135 | 0.9431 | 0.9433 | 0.8498 |
| 0.1 | 1.0 | 0.8327 | 0.8305 | 0.9426 | 0.9449 | 0.8054 |
| 0.05 | 0.0 | 0.9468 | 0.9317 | 0.8976 | 0.8617 | 0.8763 |
| 0.05 | 0.2 | 0.9467 | 0.9311 | 0.8603 | 0.8845 | 0.8529 |
| 0.05 | 0.4 | 0.9474 | 0.9279 | 0.8601 | 0.8860 | 0.8526 |
| 0.05 | 0.6 | 0.9418 | 0.9248 | 0.8571 | 0.9343 | 0.8492 |
| 0.05 | 0.8 | 0.9201 | 0.9155 | 0.9394 | 0.9385 | 0.8576 |
| 0.05 | 1.0 | 0.8573 | 0.8950 | 0.9384 | 0.9374 | 0.8106 |
| 0.03 | 0.0 | 0.9426 | 0.9299 | 0.8634 | 0.8517 | 0.8523 |
| 0.03 | 0.2 | 0.9413 | 0.9298 | 0.8507 | 0.8493 | 0.8427 |
| 0.03 | 0.4 | 0.9393 | 0.9268 | 0.8506 | 0.8502 | 0.8370 |
| 0.03 | 0.6 | 0.9374 | 0.9206 | 0.8481 | 0.8449 | 0.8246 |
| 0.03 | 0.8 | 0.9159 | 0.8679 | 0.8019 | 0.8313 | 0.7869 |
| 0.03 | 1.0 | 0.8312 | 0.8152 | 0.7437 | 0.9396 | 0.7514 |
| 0.01 | 0.0 | 0.9456 | 0.9164 | 0.8678 | 0.8681 | 0.8008 |
| 0.01 | 0.2 | 0.9441 | 0.9165 | 0.8657 | 0.8660 | 0.7788 |
| 0.01 | 0.4 | 0.9415 | 0.9135 | 0.8631 | 0.8632 | 0.7858 |
| 0.01 | 0.6 | 0.9356 | 0.9039 | 0.8601 | 0.8399 | 0.7708 |
| 0.01 | 0.8 | 0.8827 | 0.8750 | 0.8155 | 0.7864 | 0.7457 |
| 0.01 | 1.0 | 0.8505 | 0.8278 | 0.7731 | 0.8162 | 0.7258 |

6.4. Bounded Variance of Stochastic Gradients (Appendix G)

Figure 8 verifies Assumption 4, which states that the variance of stochastic gradients is bounded.

The following figure (Figure 8 from the original paper) shows the maximum variance of regular stochastic gradients and variance of poisoned stochastic gradients in softmax regression on the MNIST dataset under static label flipping attacks:

Figure 8: Maximum variance of regular stochastic gradients and variance of poisoned stochastic gradients in softmax regression on the MNIST dataset under static label flipping attacks.

Analysis of Figure 8: The figure shows that both the maximum variance of regular stochastic gradients and the variance of poisoned stochastic gradients remain bounded across different heterogeneity settings and iterations under static label flipping attacks. This supports the realism of Assumption 4, indicating that label poisoning (corrupting labels but not features) does not introduce arbitrarily large stochastic gradient variance.
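Assumption 4 can be spot-checked in the same empirical spirit: the sketch below estimates one worker's inner variance from its per-sample gradients (synthetic arrays stand in for real per-sample gradients).

```python
import numpy as np

def inner_variance(per_sample_grads):
    """Empirical version of Assumption 4: E_i ||grad_i - mean_grad||^2 for one worker."""
    mean_grad = per_sample_grads.mean(axis=0)          # \nabla f_w(x) (or the poisoned counterpart)
    return np.mean(np.sum((per_sample_grads - mean_grad) ** 2, axis=1))

rng = np.random.default_rng(5)
grads = rng.normal(size=(256, 20))                     # hypothetical per-sample gradients
print(inner_variance(grads))                           # estimate of sigma^2 for this worker
```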

6.5. Impact of Fraction of Poisoned Workers (Appendix H)

The paper also investigates the impact of increasing the fraction of poisoned workers (from 1 to 2, then 3 poisoned workers out of 10). Figures 9-12 illustrate these results.

The following figure (Figure 9 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks when $R = 8$:

Figure 9: Accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks when $R = 8$. (The figure plots accuracy versus iterations under i.i.d., mild non-i.i.d., and non-i.i.d. distributions for the compared aggregators.)

The following figure (Figure 10 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks when $R = 8$:

Figure 10: Accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks when $R = 8$. (The figure plots accuracy versus iterations under i.i.d., mild non-i.i.d., and non-i.i.d. distributions for the compared aggregators.)

The following figure (Figure 11 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks when $R = 7$:

Figure 11: Accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under static label flipping attacks when $R = 7$. (The figure plots accuracy versus iterations under i.i.d., mild non-i.i.d., and non-i.i.d. distributions for the compared aggregators.)

The following figure (Figure 12 from the original paper) shows the accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks when $R = 7$:

Figure 12: Accuracies of two-layer perceptrons on the MNIST dataset and convolutional neural networks on the CIFAR10 dataset under dynamic label flipping attacks when $R = 7$.

Analysis of Figures 9-12:

  • Consistency with the $R = 9$ Case: The general trend observed with $R = 9$ (one poisoned worker) holds: the mean aggregator generally outperforms the robust aggregators in the non-i.i.d. case.
  • Performance Degradation with More Poisoned Workers: As the fraction of poisoned workers increases (from $R = 9$ to $R = 8$ to $R = 7$), the performance of both the mean aggregator and the robust aggregators decreases. This is expected and aligns with the theoretical results in Theorems 7 and 8, where the learning error bounds are proportional to $\delta^2$ (or $\rho^2$, which depends on $\delta$). This confirms that while the mean aggregator can be superior relative to the others, an increased number of poisoned workers still negatively impacts overall performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously investigates the robustness of aggregators in distributed learning systems against label poisoning attacks. Contrary to the common belief that robust aggregators are universally superior for defending against malicious activities, the study theoretically and empirically demonstrates that the simple mean aggregator can be more robust and even order-optimal when the distributed data are sufficiently heterogeneous. The theoretical framework, based on distributed stochastic momentum, provides specific learning error bounds that highlight the interplay between data heterogeneity ($\xi$), poisoning disturbance ($A$), and the fraction of poisoned workers ($\delta$). The critical insight is that robust aggregators, designed for worst-case Byzantine attacks, might be too conservative for bounded disturbances like label poisoning, leading them to misinterpret legitimate heterogeneity as malicious.

7.2. Limitations & Future Work

The paper primarily focuses on label poisoning attacks as a specific type of bounded disturbance. While this is a practical attack model, it does not cover all forms of malicious attacks (e.g., more arbitrary Byzantine attacks). The theoretical optimality of the mean aggregator is contingent on the condition of sufficient data heterogeneity and bounded poisoning disturbances (where $A \approx \xi$). The paper notes that in the i.i.d. case or the mild non-i.i.d. case, robust aggregators might still be competitive or even slightly better.

For future work, the authors plan to extend this analysis to the more challenging decentralized learning problem. In decentralized learning, there is no central server, and workers communicate directly, making aggregation and robustness even more complex.

7.3. Personal Insights & Critique

This paper offers a refreshing and critical perspective on the application of robust aggregators in distributed learning. It challenges the default assumption that more complex robust aggregators are always better, emphasizing the importance of understanding the specific attack model and data characteristics.

My insights are:

  • Context-Dependent Robustness: The key takeaway is that robustness is not a one-size-fits-all concept. The effectiveness of an aggregator strongly depends on the nature of the attack (e.g., bounded vs. unbounded disturbances) and the inherent properties of the data (heterogeneity). This highlights the need for more nuanced design choices in federated learning and other distributed learning systems.
  • Rethinking Simple Baselines: The mean aggregator is often seen as a naive baseline, vulnerable to any form of attack. This paper demonstrates that simplicity can be an advantage when sophisticated defenses become over-conservative. It encourages researchers to re-evaluate simple methods under specific, practical constraints.
  • Adaptive Aggregation: The findings could pave the way for adaptive aggregation strategies where the system dynamically selects the most appropriate aggregator based on real-time assessments of data heterogeneity and potential attack profiles (a toy selector sketch follows this list). For instance, a system might use a mean aggregator if heterogeneity is high and label poisoning is suspected, but switch to a Byzantine-robust aggregator if more severe, arbitrary model poisoning attacks are detected.
  • The Role of Assumptions: The paper's strength lies in its clear articulation and justification of assumptions (especially Assumption 5). This underscores how crucial assumptions about the attacker's capabilities are in designing and analyzing robust learning algorithms. When attackers are constrained, the optimal defense might change significantly.
  • Generalizability of $A = O(\xi)$: While justified for softmax regression, the generalizability of $A = O(\xi)$ to other models (e.g., deep neural networks) might require further empirical validation or theoretical analysis. The experimental results on two-layer perceptrons and CNNs provide good empirical support, but a universal theoretical proof of $A = O(\xi)$ across all non-convex models under label poisoning would be a powerful extension.
  • Practical Implications: For real-world federated learning deployments, where data heterogeneity is rampant and label errors (intentional or accidental) are common, this work provides a strong argument for considering the mean aggregator. It suggests that the perceived vulnerability of the mean aggregator might be overstated for certain practical threat models.
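As a purely schematic illustration of the adaptive-aggregation idea raised above (in practice $\xi$ and $A$ are not directly observable and would have to be estimated heuristically), a toy selector might look as follows; the threshold rule and the coordinate-wise median used as a stand-in robust aggregator are assumptions of this sketch, not a proposal from the paper.

```python
import numpy as np

def select_aggregator(xi_hat, A_hat, robust_agg, mean_agg, slack=2.0):
    """Toy selector: if the estimated disturbance A is comparable to the estimated
    heterogeneity xi (A_hat <= slack * xi_hat), prefer the mean aggregator; otherwise
    fall back to a robust aggregator. The thresholding rule is purely illustrative."""
    return mean_agg if A_hat <= slack * xi_hat else robust_agg

mean_agg = lambda msgs: np.mean(msgs, axis=0)
median_agg = lambda msgs: np.median(msgs, axis=0)      # stand-in robust aggregator
agg = select_aggregator(xi_hat=1.2, A_hat=1.5, robust_agg=median_agg, mean_agg=mean_agg)
print(agg(np.arange(12.0).reshape(4, 3)))
```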
