APPROACHING THE HARM OF GRADIENT ATTACKS WHILE ONLY FLIPPING LABELS
TL;DR Summary
This work models budget-constrained label-flipping attacks on federated logistic regression and proposes a greedy method that is optimal at each training step; flipping a minimal fraction of labels severely degrades model accuracy, highlighting trade-offs between write-access, flipping budget, and targeted versus untargeted attacks.
Abstract
Under review as a conference paper at ICLR 2026. Approaching the Harm of Gradient Attacks While Only Flipping Labels. Anonymous authors, paper under double-blind review.

Machine learning systems deployed in distributed or federated environments are highly susceptible to adversarial manipulations, particularly availability attacks, which render the trained model unavailable. Prior research in distributed ML has demonstrated such adversarial effects through the injection of gradients or data poisoning. In this study, we aim to better understand the potential of weaker (action-wise) adversaries by asking: Can availability attacks be inflicted solely through the flipping of a subset of training labels, without altering features, and under a strict flipping budget? We analyze the extent of damage caused by constrained label flipping attacks against federated learning under mean aggregation, the dominant baseline in research and production. Foc…
In-depth Reading
1. Bibliographic Information
- Title: APPROACHING THE HARM OF GRADIENT ATTACKS WHILE ONLY FLIPPING LABELS
- Authors: Anonymous authors (The paper was under double-blind review at the time of writing).
- Journal/Conference: Under review as a conference paper at ICLR 2026 (per the paper header); no final venue is specified.
- Publication Year: Not specified, but the ICLR 2026 submission header and references to work from 2024 and 2025 indicate it is a very recent paper.
- Abstract: The paper investigates the threat of availability attacks in federated learning environments by a weak adversary who is restricted to only flipping a small, budgeted fraction of training data labels. The authors formalize this attack for logistic regression under mean aggregation, proposing a greedy algorithm that is proven to be optimal at each training step. Empirical results demonstrate that flipping as few as 0.1% of labels per round can degrade model accuracy by 6%, while a 25% budget can make the model perform worse than random guessing. The study also analyzes the trade-off between the attacker's write-access frequency and flipping budget and compares the effectiveness of targeted versus untargeted attacks.
- Original Source Link: The paper is available at /files/papers/68f76460b5728723472282bf/paper.pdf. Its status is a review copy, not a final published version.
2. Executive Summary
- Background & Motivation (Why):
- Machine learning models in distributed settings like Federated Learning (FL) are vulnerable to training-time attacks. A critical threat is the availability attack, which aims to make the final model unusable (e.g., by destroying its accuracy).
- Prior research has extensively studied powerful adversaries who can manipulate gradients directly or inject maliciously crafted data points (features and labels). However, in many real-world scenarios, an attacker's capabilities are far more limited.
- This paper addresses a significant gap: Can a weak adversary, who can only change the labels of a small number of existing data points, still cause significant harm? This is a more realistic threat model in systems where data features are generated by trusted pipelines and only the labeling process is susceptible to manipulation.
- The paper's innovation is to formalize this constrained label-flipping attack as a per-step optimization problem and demonstrate its surprising effectiveness, showing that even minimal manipulation can be potent if done strategically.
- Main Contributions / Findings (What):
- Novel Formalization: The paper formalizes a budget-constrained label-flipping attack for logistic regression as an optimization problem where the attacker aims to maximize the misalignment of the training gradient at each step.
- Provably Optimal Greedy Algorithm: It introduces a simple greedy algorithm for selecting which labels to flip. The algorithm is proven to be optimal at each training iteration (per-epoch optimal).
- Demonstrated Severity: Empirical results show the attack is highly effective. Flipping just 0.1% of labels per round reduces accuracy by 6%, and a 25% global budget can drive a binary classifier's accuracy below random guessing.
- Attacker Strategy Insights: The paper uncovers a key trade-off: for a fixed total number of corrupted labels, having broader write-access (controlling a larger fraction k of the data points) is more damaging than having a larger local budget (the ability to flip a larger fraction b of labels within a smaller set).
- Attack Variant Comparison: It defines and empirically compares targeted attacks (steering the model toward a specific malicious parameter set) and untargeted attacks (simply maximizing training disruption), quantifying their relative impact and variance.
- Generalization to Multi-Class: The attack framework and algorithm are extended from the binary classification setting to the more general multi-class case.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Federated Learning (FL): A machine learning paradigm where multiple clients (e.g., mobile devices) collaboratively train a model under the coordination of a central server, but without sharing their local data. Clients train on their data and send model updates (typically gradients) to the server, which aggregates them to produce a global model.
- Data Poisoning: A class of attacks where an adversary corrupts the training data to compromise the final model. This paper focuses on a specific type called a label-flipping attack, where only the labels are altered, not the features.
- Availability Attack: An attack whose goal is to degrade the performance (availability) of the trained model for all users, for example, by reducing its accuracy to the level of random guessing. This is distinct from a backdoor attack, which aims for targeted misclassification on specific inputs.
- Mean Aggregation: The most common and basic aggregation rule in FL, where the server simply computes the average of all client updates to update the global model. Its simplicity makes it scalable but also vulnerable to malicious updates.
- Logistic Regression: A fundamental linear model used for classification tasks. It predicts the probability of a binary outcome. Its loss function is convex, which allows for tractable mathematical analysis.
- Gradient Descent: The iterative optimization algorithm at the core of training most machine learning models. In each step, the model's parameters are updated in the opposite direction of the loss function's gradient.
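To tie these concepts together, here is a minimal sketch (not from the paper; all function and variable names are illustrative) of one federated round with mean aggregation: each client computes the binary logistic regression gradient on its local data, and the server averages the client gradients and takes a single gradient descent step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(theta, X, y):
    # Gradient of the binary cross-entropy loss for logistic regression
    # on one client's local data (labels in {0, 1}).
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def federated_round(theta, clients, lr=0.1):
    # Mean aggregation: the server averages the clients' gradients
    # and applies one gradient descent step to the global model.
    grads = [local_gradient(theta, X, y) for X, y in clients]
    return theta - lr * np.mean(grads, axis=0)
```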
- Previous Works: The paper positions its contribution by contrasting it with prior work that assumed stronger adversaries.
- Stronger Attack Models: Works like Baruch et al. (2019) and Bouaziz et al. (2024) explored adversaries who could directly manipulate or inject malicious gradients. Others, like Shafahi et al. (2018), assumed the ability to inject fully crafted data points. This paper's adversary is much weaker, being unable to modify features or gradients.
- Different Attack Goals: Jha et al. (2023) also used label flipping, but their goal was to insert backdoors, not cause a general availability failure. Their threat model and objectives are therefore different.
- Offline vs. Online Attacks: Some methods, like Li et al. (2016), require offline access to the entire dataset to perform clustering and select vulnerable points. In contrast, this paper's attack is online, meaning it makes decisions dynamically at each training iteration.
- Differentiation: This work carves out a new niche in the landscape of poisoning attacks. Figure 2 in the paper visualizes this, placing the proposed attack in a highly constrained region where few attacks were previously known to be effective. The key distinctions are:
- Constraint: The adversary can only flip labels of existing data points.
- Budget: The attack is effective even with a very small budget (e.g., < 1%).
- Optimality: The proposed greedy algorithm is provably optimal per step in the convex setting of logistic regression.
- Online Nature: The attack is performed iteratively during training, adapting to the model's state at each epoch.
Figure 1 further clarifies the constrained nature of the attack by showing that the set of achievable malicious gradients from label flipping is a small subset of what is possible with unrestricted data poisoning or direct gradient injection.
Figure 1: This diagram illustrates the power of different attack types. The largest set represents unrestricted gradient attacks. Subsets show the gradients achievable via data poisoning on features and labels, and the most constrained set, relevant to this paper, shows gradients achievable when only labels of existing data can be manipulated.
Figure 2: This chart maps the landscape of availability attacks based on attacker constraints. The origin represents the most constrained setting. The paper's contribution (marked with a filled triangle) is shown to be effective in a highly constrained domain where attacks were not previously well-understood.
4. Methodology (Core Technology & Implementation)
The core of the paper is the mathematical formalization of the label-flipping attack as an optimization problem.
- Threat Model & Notation: The setup (visualized in Figure 3) involves a federated learning system with honest workers and one malicious worker (the attacker).
- At each training epoch, the data is split into an honest set (whose labels cannot be flipped) and a set K controlled by the attacker.
- The attacker has write-access to a fraction k of the total data.
- The attacker has a local budget b, meaning they can flip the labels of at most a fraction b of the points in K.
- The global budget is the total fraction of corrupted data, k × b. For example, with 10,000 training points, k = 0.1 and b = 0.25, the attacker controls 1,000 points, may flip at most 250 of their labels, and the global budget is 2.5% (see the small sketch after this list).
- The attacker is assumed to be omniscient, meaning they know the model parameters and the honest gradient at each step. This allows for a worst-case analysis.
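As a small illustration of how these quantities interact (the helper name and variables are hypothetical, not from the paper):

```python
def attack_budget(n_total, k, b):
    """Return (points controlled, max flips, global corrupted fraction)."""
    controlled = int(k * n_total)      # size of the attacker-controlled set K
    max_flips = int(b * controlled)    # local budget b applied to K
    return controlled, max_flips, k * b

# Example: 10,000 points, 10% write-access, 25% local budget
print(attack_budget(10_000, 0.1, 0.25))   # (1000, 250, 0.025)
```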
Manual Transcription of Table 1: Notation Summary (symbols not recoverable from the extraction are left blank)

| Notation | Description |
| --- | --- |
|  | Dimension of the feature space. |
|  | Epoch (training iteration) index. |
|  | n-th data point, with its features and label. |
|  | Binary logistic regression parameter vector. |
|  | Multinomial logistic regression parameter matrix. |
|  | Set of honest data points (labels are not flippable). |
| K | Set of attacker-controlled data points (labels can be flipped). |
|  | Honest version of K before any label flips. |
|  | Entire honest training dataset (unmodified). |
|  | Entire training dataset after poisoning. |
|  | Total number of data points. |
| k | Fraction of the dataset controlled by the attacker (write-access). |
| P ⊆ K | Subset of K whose labels are actually flipped by the attacker. |
| b | Local flipping budget (proportion of K that can be label-flipped). |
| 1[...] | Indicator function. |
| σ | Sigmoid function: σ(z) = 1 / (1 + e^{-z}). |
| k × b | Corrupted fraction (global budget). |
Figure 3: This diagram shows the attack setup. Worker nodes draw batches from a clean data source and send gradients or parameter updates to the server. The malicious worker receives a clean batch, flips up to a fraction b of its labels, and submits the poisoned set to the server, which aggregates it with the honest data.
- Problem Formulation: The attacker's goal is to manipulate the training gradient so that it is maximally misaligned with a desired direction, denoted d_t, which depends on the attack type (visualized in Figure 4):
- Untargeted Attack: The goal is maximum disruption. The attacker pushes the gradient away from the honest update direction.
- Targeted Attack: The goal is to steer the model parameters towards a specific malicious target parameter vector.
The attacker solves the following optimization problem at each epoch $t$:
$$\min_{\{y_n \,:\, n \in K\}} \; \big\langle \nabla_\theta \mathcal{L}(\theta_t),\, d_t \big\rangle .$$
For binary logistic regression, the gradient is
$$\nabla_\theta \mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \big(\sigma(\theta^\top x_n) - y_n\big)\, x_n .$$
Since the honest data and the features are fixed, the optimization problem simplifies to choosing only the labels $y_n$ for $n \in K$; changing the label of a point shifts the gradient along $\pm x_n / N$. This simplified form reveals the core strategy: the impact of flipping a label is proportional to the inner product $\langle x_n, d_t \rangle$.
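A brief sketch of how this objective could be evaluated numerically, assuming 0/1 labels and writing d_t for the attack direction as above; the function names are illustrative, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attacker_objective(theta, X, y, d_t):
    """Inner product between the training gradient and the direction d_t.

    X, y  : features and (possibly flipped) 0/1 labels of the full training set
    theta : current model parameters
    d_t   : attack direction for this epoch
    """
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # binary logistic regression gradient
    return grad @ d_t                                # quantity the attacker optimizes

def flip_impact(X_k, d_t):
    """Per-point leverage of the attacker-controlled points: <x_n, d_t>."""
    return X_k @ d_t
```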
- Greedy Label-Flipping Algorithm (Binary Case): Based on the simplified objective, the paper proposes Algorithm 1, a greedy strategy that is provably optimal at each step (the proof in Appendix C relies on the Rearrangement Inequality).
Intuition: the contribution of each attacker-controlled point to the objective is governed by its score $s_n = \langle x_n, d_t \rangle$; to minimize the objective, one should assign label 1 to the points with the most negative $s_n$ and label 0 to the points with the most positive $s_n$.
Algorithm 1 Steps:
- For each data point $n$ in the attacker's set $K$, compute the score $s_n = \langle x_n, d_t \rangle$.
- Identify the budgeted number of points (at most $b\,|K|$) with the algebraically smallest scores. These are the points most aligned with the negative direction, offering the most "leverage".
- For each of these selected points, flip its label to the value that minimizes its contribution to the objective:
  - If $s_n < 0$, the optimal label is 1.
  - If $s_n > 0$, the optimal label is 0.
- The actual flip only occurs if the point's current label is not already the optimal one. The budget is spent on the points that provide the largest negative contribution to the objective.
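A compact sketch of these steps (array names and the tie-breaking by sort order are my assumptions, not details from the paper):

```python
import numpy as np

def greedy_flip_binary(X_k, y_k, d_t, budget):
    """Greedy per-epoch label flipping, binary case.

    X_k    : (m, p) feature matrix of the attacker-controlled set K
    y_k    : (m,)   current 0/1 labels of K
    d_t    : (p,)   attack direction for this epoch
    budget : int    maximum number of labels that may be flipped (b * |K|)
    """
    scores = X_k @ d_t                       # s_n = <x_n, d_t> for every point in K
    chosen = np.argsort(scores)[:budget]     # points with the algebraically smallest scores
    y_flipped = y_k.copy()
    for n in chosen:
        optimal = 1 if scores[n] < 0 else 0  # label that minimizes the point's contribution
        y_flipped[n] = optimal               # a flip happens only if the label actually changes
    return y_flipped
```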
- Extension to Multi-Class Classification: The framework is extended to multi-class logistic regression, with the parameter vector replaced by a weight matrix (one set of weights per class). The objective is to minimize the Frobenius inner product between the gradient matrix and the target direction matrix. The per-sample optimization becomes choosing, for each sample in $K$, the class label that minimizes a per-sample, per-class score. Algorithm 2 operationalizes this for the multi-class setting:
- For each point in $K$, calculate the score of every possible target class.
- For each point, find the class that yields the minimum score.
- From all points in $K$, select the budgeted number of points with the overall smallest minimum scores.
- For these selected points, change their labels to their respective optimal target classes.
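A sketch of the multi-class selection step, assuming the per-sample, per-class scores have already been computed; the paper derives these scores from the Frobenius inner product, but here the score matrix is simply taken as an input.

```python
import numpy as np

def greedy_flip_multiclass(scores, y_k, budget):
    """Greedy per-epoch label flipping, multi-class case.

    scores : (m, C) scores[n, c] = attacker's score for assigning class c to point n
    y_k    : (m,)   current class labels of the attacker-controlled set K
    budget : int    maximum number of labels that may be flipped
    """
    best_class = scores.argmin(axis=1)        # optimal target class for each point
    best_score = scores.min(axis=1)           # score achieved by that class
    chosen = np.argsort(best_score)[:budget]  # points with the overall smallest scores
    y_flipped = y_k.copy()
    y_flipped[chosen] = best_class[chosen]    # relabel only the selected points
    return y_flipped
```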
5. Experimental Setup
- Datasets: Experiments were conducted on standard image classification datasets. The details are transcribed from Table 2.
Manual Transcription of Table 2: Datasets and target models

| Name | # Features | # Train / test | Target model |
| --- | --- | --- | --- |
| MNIST (0 vs 1) | 28 × 28 | 6903 / 7877 | Fully flipped (0 ↔ 1) |
| CIFAR10 (airplane vs automobile) | 3 × 32 × 32 | 5000 / 1000 | Fully flipped (airplane ↔ automobile) |
| MNIST (10-class) | 28 × 28 | 60000 / 10000 | Cyclic shift y → (y + 1) mod 10 |
| CIFAR10 (10-class) | 3 × 32 × 32 | 50000 / 10000 | Cyclic shift y → (y + 1) mod 10 |

The models were logistic regression classifiers trained for 200 epochs using mini-batch SGD.
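For orientation, a minimal sketch of how such an experiment could be wired together, with a per-epoch label-flipping hook standing in for the attack; this is an illustration of the setup described above, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_with_label_flipping(X, y, flip_fn, epochs=200, batch_size=64, lr=0.1, seed=0):
    """Mini-batch SGD for binary logistic regression with a per-epoch flipping hook.

    flip_fn(theta, X, y) returns the (possibly poisoned) labels for this epoch,
    e.g. a wrapper around the greedy attack sketched earlier.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # The attacker flips a budgeted subset of labels; clean labels y are
        # reused each epoch, so flips do not accumulate across epochs.
        y_epoch = flip_fn(theta, X, y)
        order = rng.permutation(len(y_epoch))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y_epoch[idx]
            grad = Xb.T @ (sigmoid(Xb @ theta) - yb) / len(yb)
            theta -= lr * grad                # plain SGD step (mean over the mini-batch)
    return theta
```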
- Evaluation Metrics:
- Test Accuracy:
- Conceptual Definition: The proportion of test samples that the model classifies correctly. It is the most common metric for classification performance.
- Mathematical Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Symbol Explanation:
TP (True Positives), TN (True Negatives), FP (False Positives), and FN (False Negatives) are the four outcomes of a binary classification test.
- F1 Score: (Used in Appendix B)
- Conceptual Definition: The harmonic mean of precision and recall. It is useful for evaluating models on imbalanced datasets where accuracy can be misleading.
- Mathematical Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall); both metrics are illustrated in the snippet after this list.
- Symbol Explanation:
Precision measures the accuracy of positive predictions, while Recall (or sensitivity) measures the model's ability to find all actual positive samples.
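For reference, both metrics computed from confusion-matrix counts (a trivial, generic sketch):

```python
def accuracy(tp, tn, fp, fn):
    # Proportion of correctly classified test samples.
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```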
- Baselines: The primary comparisons are not against other attack algorithms but between different configurations of the proposed attack:
- Honest Training: The baseline performance with no attack (k = 0 or b = 0).
- Untargeted vs. Targeted Attack: Comparing the two attack variants.
- Varying Budgets: Analyzing the impact of different k (write-access) and b (local budget) values.
6. Results & Analysis
- Core Results: Attack Impact. The experiments confirm the high potency of the greedy label-flipping attack.
- Binary Classification (Figure 5): The model's test accuracy degrades monotonically as the global budget k × b increases. A tiny budget of 0.1% is enough to cause a 6% accuracy drop. With a budget of 25%, the model performs close to or worse than random guessing (50% accuracy). Both untargeted and targeted attacks are effective.
Figure 5: Test accuracy as a function of the global budget for binary classification. The left plot shows the untargeted attack and the right shows the targeted attack. Accuracy consistently decreases as the budget for label flipping grows.
- Multi-Class Classification (Figure 9): A similar trend holds for the 10-class problem. A global budget of just 5% leads to a significant accuracy drop of 10-19%.
Figure 9: Test accuracy for multi-class classification. Performance degrades as the global budget increases for both untargeted (left) and targeted (right) attacks.
- Untargeted vs. Targeted Attacks (Figures 6 & 7): The analysis reveals that untargeted attacks are more "chaotic" and slightly more effective at high corruption rates.
- Variance (Figure 6): The standard deviation of the final model accuracy is significantly higher for untargeted attacks, especially as the attacker's write-access k increases. This suggests untargeted attacks make the training process more unstable and unpredictable.
- Effectiveness (Figure 7): The difference in mean accuracy between untargeted and targeted attacks is minor for small budgets. For larger budgets, untargeted attacks become slightly more damaging, likely due to their per-step optimality in disrupting the honest gradient direction.
Note: The image provided as images/6.jpg corresponds to Figure 11 in the original PDF (a heatmap), while images/5.jpg shows the standard deviation plot of Figure 6; both are analyzed below based on their content.
Figure 6 from paper (image file 5.jpg): Standard deviation of accuracy as a function of write-access k. Untargeted attacks (blue) show a much higher variance than targeted attacks (orange), indicating more erratic training outcomes.
Figure 7 from paper (image file 6.jpg): Heatmap of the accuracy difference (Untargeted minus Targeted). Green indicates untargeted is better, red indicates targeted is better. For most (k, b) combinations, the difference is small. Untargeted attacks show a slight advantage at higher budgets.
- Trade-off: Write-Access (k) vs. Local Budget (b) (Figure 8): This is one of the most insightful findings. Given a fixed global budget k × b, how should an attacker allocate their resources?
- The heatmap in Figure 8 clearly shows that for a constant product k × b, configurations with a larger k (more write-access) are consistently more effective than those with a larger b (more intensive flipping on fewer points).
Intuition: Having access to a wider pool of data points (a larger k) gives the attacker a richer set of feature vectors to choose from. This allows them to find points whose feature vectors are better aligned with their malicious objective, providing greater "leverage" to manipulate the gradient, even if they can only flip a few of them.
Figure 8: Heatmap of test accuracy as a function of k and b. The darkest colors (lowest accuracy) are concentrated in the top-right, but for any diagonal (constant k × b), moving right (increasing k) is more effective than moving up (increasing b).
- Additional Analysis from Appendix:
- The attack is also effective when measured by F1 Score (Figure 11).
- Using the Adam optimizer instead of SGD makes the model more vulnerable to the attack, with accuracy dropping to as low as 10% for certain budgets (Figure 14).
- Early stopping can amplify the attack's impact, as the model's performance degrades sharply in early epochs (Figure 12).
Figure 11 (Appendix): F1 score drops as the corrupted fraction k × b increases, mirroring the accuracy results.
Figure 14 (Appendix): With the Adam optimizer, accuracy drops more severely compared to SGD, indicating higher vulnerability.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that a weak adversary, constrained to only flipping a small percentage of labels, can mount a potent availability attack against federated learning systems using mean aggregation. By formalizing the attack as a per-step optimization problem, the authors derive a simple, greedy, and provably optimal (per-epoch) algorithm. The findings challenge the assumption that significant adversarial power (like gradient or feature control) is necessary to compromise distributed learning and establish a new, highly relevant baseline for a constrained threat model.
- Limitations & Future Work: The authors acknowledge several limitations and propose future research directions:
- Model Complexity: The analysis is performed on logistic regression. Extending the framework and its guarantees to complex deep neural networks is a key next step.
- Aggregation Rules: The attack is designed for mean aggregation. Its effectiveness against robust aggregators (e.g., median, Krum) is an open question.
- Optimality: The algorithm is only proven to be optimal at each epoch (greedy), not globally optimal over the entire training trajectory. Finding a globally optimal flipping strategy is a more complex, long-term challenge.
- Defenses: The paper focuses on the attack, but devising practical defenses specifically tailored to this threat model is a crucial follow-up.
- Personal Insights & Critique:
- Significance: This paper is significant because it lowers the bar for what is considered a dangerous adversary. The threat model is highly practical; in crowdsourced data labeling or user-feedback systems, malicious or careless users could easily introduce a small percentage of flipped labels. The finding that this can severely degrade a model is an important warning.
- Strong Assumption of Omniscience: The primary critique is the assumption of an omniscient attacker. In a real-world FL setting, an attacker would likely not have access to the global model parameters or the full honest gradient at each step. The authors briefly mention in the appendix that a surrogate-based approach could work with partial knowledge, but this is a critical aspect that needs more exploration to assess the attack's real-world viability.
- The k vs. b Trade-off: The analysis of the write-access vs. local budget trade-off is a standout contribution. It provides a non-obvious, practical insight into attacker strategy: it is better to have shallow control over many participants than deep control over a few.
- Foundation for Future Work: The paper provides an excellent theoretical and empirical foundation. The clear problem formulation and simple algorithm make it a strong baseline for future research on both more advanced label-flipping attacks and defenses designed to counter them.