Paper status: completed

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Published: 08/19/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Antidote mitigates harmful fine-tuning attacks on large language models by one-shot pruning harmful parameters post-fine-tuning, independent of hyperparameters, effectively reducing harmful outputs while preserving task accuracy.

Abstract

Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- a few harmful data points mixed into the fine-tuning dataset can break the LLM's safety alignment. While several defenses have been proposed, our evaluation shows that existing defenses fail when some specific training hyper-parameters are chosen -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense. To this end, we propose Antidote, a post-fine-tuning stage solution, which remains agnostic to the training hyper-parameters in the fine-tuning stage. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Code is available at https://github.com/git-disl/Antidote.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

1.2. Authors

Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Joshua Kimball, Ling Liu

1.3. Journal/Conference

Published on arXiv, a preprint server, meaning this is a pre-publication version of the paper; the acknowledgments mention submissions to venues such as AAAI2025-AIA and ICML2025. arXiv is a highly influential platform for the rapid dissemination of research in fields like AI.

1.4. Publication Year

2024 (the first version of the preprint was posted to arXiv on August 18, 2024).

1.5. Abstract

Safety-aligned Large Language Models (LLMs) are susceptible to harmful fine-tuning attacks, where even a small amount of malicious data mixed into the fine-tuning dataset can compromise their safety alignment. Existing defenses often fail under specific fine-tuning hyper-parameters, such as a large learning rate or a high number of training epochs. To address this, the paper proposes Antidote, a solution applied after the fine-tuning stage. Antidote operates on the principle of removing harmful parameters to recover the model's safety, irrespective of how these parameters were formed during fine-tuning. It introduces a one-shot pruning stage to identify and remove weights responsible for generating harmful content. Despite its simplicity, empirical results show Antidote effectively reduces the harmful score while maintaining accuracy on downstream tasks.

Paper Link: https://arxiv.org/abs/2408.09600
PDF Link: https://arxiv.org/pdf/2408.09600v3.pdf
Publication Status: Preprint (on arXiv).

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the vulnerability of safety-aligned Large Language Models (LLMs) to harmful fine-tuning attacks. Traditionally, LLMs are safety-aligned to refuse harmful content generation. However, studies show that even a small amount of harmful data injected during fine-tuning can cause the model to forget its safety knowledge and generate unsafe responses.

This problem is important because fine-tuning-as-a-service is an emerging paradigm where users customize LLMs with their own data. Service providers (e.g., OpenAI) have an obligation to ensure that the fine-tuned models remain harmless to avoid governance issues or legal repercussions.

Existing mitigation strategies, broadly categorized into alignment stage defenses and user fine-tuning stage defenses, have a common weakness: they are highly sensitive to fine-tuning hyper-parameters, such as the learning rate and number of epochs. A large learning rate or many epochs, while often necessary for good performance on downstream tasks, can easily invalidate these defenses, leading to a degradation in safety alignment. This creates a trade-off where ensuring safety might compromise task performance. The paper's innovative idea is to propose a defense that is agnostic to these fine-tuning hyper-parameters.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Identification of Hyper-parameter Sensitivity Issue: A systematic evaluation revealing that existing alignment-stage and fine-tuning-stage defenses are highly sensitive to fine-tuning hyper-parameters (learning rate and epochs), often failing when these are set to values required for good downstream task performance.
  • Proposal of Antidote: Introducing a novel post-fine-tuning stage solution named Antidote, which is designed to be agnostic to the training details of the fine-tuning stage.
  • Core Philosophy: Antidote operates on the philosophy that removing harmful parameters can recover a model from harmful behaviors, irrespective of how those harmful parameters were formed.
  • Methodology: Implementing Antidote through a one-shot pruning stage that uses the Wanda score on a re-alignment dataset to identify and remove harmful weights.
  • Empirical Validation: Comprehensive experiments demonstrating that Antidote significantly reduces the harmful score (up to 17.8% compared to SFT without defense) while largely maintaining fine-tuning accuracy (with a marginal loss of up to 1.83%). It also shows robustness across different harmful ratios, fine-tuning sample sizes, and even against benign fine-tuning attacks.
  • Generalizability and Efficiency: Demonstrating that Antidote generalizes well across different fine-tuning datasets, alignment datasets, and LLM architectures (Llama2-7B, Mistral-7B, Gemma-7B, Llama3-8B). Furthermore, it introduces only a slight increase in clock time overhead and comparable GPU memory usage compared to SFT, making it a system-efficient defense.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following concepts:

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data to understand and generate human-like text. They are capable of performing various natural language processing tasks, from answering questions to writing creative content. Examples include OpenAI's GPT series, Google's Gemma, and Meta's Llama.
  • Fine-tuning: This is a process where a pre-trained LLM (which has learned general language patterns) is further trained on a smaller, specific dataset to adapt it to a particular task or domain. This allows the model to become more specialized for user-specific needs without starting training from scratch.
  • Safety Alignment: This refers to the process of training LLMs to adhere to ethical guidelines, avoid generating harmful, biased, or inappropriate content, and generally align their behavior with human values. Techniques like Reinforcement Learning from Human Feedback (RLHF) are commonly used for safety alignment. The goal is for the LLM to provide refusal responses (e.g., "I cannot assist with that request") when prompted with harmful queries.
  • Harmful Fine-tuning Attacks: This is a security vulnerability where malicious or even unintentionally harmful data is included in the fine-tuning dataset. When an LLM is fine-tuned on such data, it can unlearn or forget its previously established safety alignment, leading it to generate harmful content upon request.
  • Hyper-parameters: These are configuration settings that are external to the model and whose values cannot be estimated from data. They are typically set before the training process begins and control how the model learns. In the context of this paper, key hyper-parameters include:
    • Learning Rate (lr): This determines the step size at each iteration while moving toward a minimum of a loss function. A larger learning rate means bigger steps, potentially leading to faster convergence but also overshooting the minimum or instability. A smaller learning rate means smaller steps, which can be more stable but slower.
    • Epochs (ep): An epoch represents one complete pass through the entire training dataset during training. More epochs mean the model sees the data multiple times, potentially leading to better learning but also overfitting or, in this context, more deviation from safety alignment if harmful data is present.
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into the Transformer architecture layers. This significantly reduces the number of trainable parameters for specific tasks, making fine-tuning more efficient and less memory-intensive. The paper uses LoRA for both alignment and fine-tuning.
  • Pruning: In machine learning, pruning is a technique used to reduce the size and complexity of a neural network by removing (or setting to zero) less important connections (weights). This can lead to faster inference, reduced memory footprint, and sometimes even improved generalization. One-shot pruning implies pruning is done once, after training, rather than iteratively during training.
  • Wanda Score: A scoring mechanism used in the model sparsification (pruning) literature to quantify the importance of individual weights. It measures the importance of parameters based on their absolute value and the $L_2$ norm of their corresponding input activations. A higher Wanda score indicates a more important parameter.
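
To make the Wanda score concrete, here is a rough PyTorch sketch (not the authors' implementation) that scores the weights of a single linear layer: each weight is scored by its magnitude multiplied by the $L_2$ norm of its corresponding input feature over a small calibration batch. The function name and tensor shapes are illustrative assumptions.

```python
import torch

def wanda_score(weight: torch.Tensor, activations: torch.Tensor) -> torch.Tensor:
    """Per-weight importance for one linear layer, in the spirit of the Wanda score:
    |W_ij| scaled by the L2 norm of the j-th input feature over calibration data.

    weight:      (out_features, in_features) weight matrix of the layer
    activations: (num_tokens, in_features) inputs seen by the layer on calibration data
    """
    feature_norms = activations.norm(p=2, dim=0)       # (in_features,) L2 norm per input feature
    return weight.abs() * feature_norms.unsqueeze(0)   # broadcast to (out_features, in_features)

# Toy usage: a 4x8 layer scored with 16 random calibration tokens
scores = wanda_score(torch.randn(4, 8), torch.randn(16, 8))
print(scores.shape)  # torch.Size([4, 8])
```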

3.2. Previous Works

The paper categorizes previous works into three main areas:

  • Safety Alignment:

    • Methods like RLHF (Ouyang et al., 2022) and its variants (Dai et al., 2023; Bai et al., 2022; Wu et al., 2023; Dong et al., 2023; Rafailov et al., 2023; Yuan et al., 2023) are foundational for aligning LLMs with human values.
    • More recent solutions focus on augmenting alignment data (Liu et al., 2023a;b; Ye et al., 2023; Tekin et al., 2024).
    • These methods aim to ensure LLMs produce safe outputs initially.
  • Harmful Fine-tuning Defenses: This is the most relevant category to Antidote and is divided into two sub-categories:

    • Alignment Stage Solutions: These methods modify the alignment stage to improve the model's robustness against harmful fine-tuning in later stages.

      • Vaccine (Huang et al., 2024d): Adds artificial perturbation in the alignment stage to simulate harmful embedding drift during fine-tuning, using minimax optimization to make the model immune.
      • RepNoise (Rosati et al., 2024b;a): Uses a representation noising technique to degrade the representation distribution of harmful data, making it harder for the model to learn harmful content generation.
      • Other examples: CTRL (Liu et al., 2024c), TAR (Tamirisa et al., 2024), Booster (Huang et al., 2024a), SN-Tune (Zhao et al., 2025b), T-Vaccine (Liu et al., 2024a), CTRAP (Yi et al., 2025b), KT-IPA (Cheng et al., 2025), SAM unlearning (Fan et al., 2025), Reward Neutralization (Cao, 2025), SEAM (Wang et al., 2025c).
    • Fine-tuning Stage Solutions: These methods modify the fine-tuning stage to prevent alignment knowledge forgetting while still learning user tasks.

      • LDIFS (Mukhoti et al., 2023): Introduces a regularizer to constrain the feature space drift of the fine-tuning iterate to remain close to that of the aligned model.
      • Lisa (Huang et al., 2024b): Alternately optimizes over alignment data and fine-tuning data, using a proximal regularizer to enforce proximity between iterates.
      • Other examples: (Bianchi et al., 2023; Zong et al., 2024; Wang et al., 2024; Lyu et al., 2024; Qi et al., 2024a; Shen et al., 2024; Choi et al., 2024; Du et al., 2024; Li et al., 2025; Eiras et al., 2024; Li & Kim, 2025; Li et al., 2024b; Liu et al., 2024b; Zhao et al., 2025a; Liu et al., 2025; Li, 2025; Wu et al., 2025; Peng et al., 2025).
  • Model Sparsification (Pruning):

    • The concept of model sparsification (Frankle & Carbin, 2018) explores finding sparse, trainable neural networks.
    • For LLMs, SparseGPT (Frantar & Alistarh, 2023) proposes layer-wise reconstruction for importance scoring.
    • Wanda score (Sun et al., 2023) utilizes joint weights/activation metrics to measure coordinate importance. Antidote directly borrows the Wanda score for identifying harmful parameters.
    • OWL (Yin et al., 2023) builds on Wanda for further compression.
  • Concurrent Post-Fine-tuning Defenses: The paper acknowledges several concurrent works that also aim at purifying the model after fine-tuning completes, such as RESTA (Bhardwaj et al., 2024), LAT (Casper et al., 2024), Safe LoRA (Hsu et al., 2024), SOMF (Yi et al., 2024c), (Tong et al., 2024) for self-contrastive decoding, IRR (Wu et al., 2024) and NLSR (Yi et al., 2024b) for neuron correction, SafetyLock (Zhu et al., 2024) for activation patching, and Panacea (Wang et al., 2025b) for optimizing post-fine-tuning perturbation. The key differentiation highlighted by Antidote is its focus on the hyper-parameter sensitivity issue, which was not systematically studied by these prior works.

3.3. Technological Evolution

The field of LLM safety has evolved from initial safety alignment techniques (RLHF, SFT) to addressing jailbreaking and, more recently, harmful fine-tuning attacks. Initially, research focused on making LLMs safe from scratch. As LLMs became more widely adopted and customized via fine-tuning, the vulnerability introduced by user-provided data became apparent. This led to the development of defenses at different stages:

  1. Pre-alignment/Alignment Stage: Strengthening the base model's robustness.

  2. Fine-tuning Stage: Modifying the fine-tuning process itself to prevent safety degradation.

  3. Post-Fine-tuning Stage (Antidote's domain): Realignment or purification after the fine-tuning has completed.

    Antidote fits into this timeline as a post-fine-tuning solution, addressing a critical gap identified: the hyper-parameter sensitivity of existing defenses. It leverages techniques from model sparsification, a field primarily focused on efficiency, and repurposes them for safety.

3.4. Differentiation Analysis

Antidote differentiates itself from existing alignment-stage and fine-tuning-stage defenses primarily by its post-fine-tuning approach and its agnosticism to fine-tuning hyperparameters.

  • Existing Alignment Stage Defenses (e.g., Vaccine, RepNoise): These modify the initial safety alignment process to make the base model more robust. However, they can still be overcome by aggressive fine-tuning (large learning rates or many epochs) as the model drifts too far from the initial robust state. Antidote works after this drift has occurred.
  • Existing Fine-tuning Stage Defenses (e.g., Lisa, LDIFS): These introduce regularizers or modified training procedures during fine-tuning to keep the model close to its safe state. Their effectiveness is also tied to the fine-tuning hyper-parameters; large learning rates can make regularization ineffective, causing the model to diverge. Antidote bypasses this by operating on the final corrupted model.
  • Core Innovation (Hyper-parameter Agnosticism): Antidote's core difference is that it does not intervene during the fine-tuning process itself. Instead, it takes the fine-tuned (potentially corrupted) model and purifies it afterward. This means its effectiveness is not dependent on the learning rate, epochs, or other training settings chosen during the user's fine-tuning stage. This is a significant advantage, as users might need specific hyper-parameters for optimal downstream task performance.
  • Methodological Innovation (Pruning for Safety): While pruning techniques (Wanda score) traditionally aim for model compression and efficiency, Antidote repurposes one-shot pruning to identify and remove harmful parameters specifically for safety realignment. This is a novel application of sparsification for safety.

4. Methodology

4.1. Principles

The core idea behind Antidote is rooted in the philosophy that harmful behaviors in an LLM are caused by specific harmful parameters that have been modified or strengthened during fine-tuning. By identifying and removing these harmful parameters (i.e., weights), the model can be "recovered" or "re-aligned" to its original safe state, regardless of the specific hyper-parameters or training dynamics used during the fine-tuning process. This post-fine-tuning approach makes Antidote agnostic to the training details of the fine-tuning stage.

4.2. Core Methodology In-depth (Layer by Layer)

Antidote operates as a three-stage pipeline, as shown in Figure 1 and Figure 4 (which are identical diagrams illustrating the overall process).

The three stages are:

  1. Safety Alignment: A Pretrained LLM undergoes safety alignment (e.g., using RLHF or SFT) with alignment data (containing harmful prompt-safe answer pairs) to become a safety-aligned model. This is the initial step to make the LLM generally safe.

  2. User Fine-tuning: The safety-aligned model is then fine-tuned by users on a fine-tuning dataset. This dataset may contain a mixture of downstream task data and a certain percentage of harmful data (containing harmful prompt-harmful answer pairs), which can break the model's safety alignment and leave a corrupted model. This is the stage where the hyper-parameter sensitivity issue of existing defenses arises.

  3. One-shot Pruning (Antidote's Contribution): After the user fine-tuning is complete and the model's safety alignment may have been broken, Antidote intervenes. This stage aims to re-align the model by identifying and removing the harmful parameters that cause the model to generate unsafe content.

    Here's a detailed breakdown of the one-shot pruning stage as described in Algorithm 1:

Algorithm 1 (Antidote: a post-fine-tuning safety alignment procedure)

Input:

  • Mask ratio, $\alpha$: This is a hyper-parameter determining the percentage of weights to be pruned.
  • Re-alignment dataset, $\mathcal{D}_{realign}$: A dataset specifically designed to contain harmful prompt-harmful answer pairs. This dataset is used to identify the harmful parameters.
  • Safety alignment-broken fine-tuned model, $\pmb{w}$: The weights of the LLM after user fine-tuning, which may have lost its safety alignment.

Output:

  • The re-aligned model $\tilde{\pmb{w}}$, ready for deployment.

Procedure:

  1. Calculate Importance Score (Identify Harmful Parameters): The first step is to quantify the "harmfulness" or importance of each individual weight in the fine-tuned model $\pmb{w}$ with respect to generating harmful content. This is done using a modified Wanda score, calculated over the re-alignment dataset $\mathcal{D}_{realign}$. The Wanda score for each weight coordinate $j$ is given by:

     $$[h(\pmb{w}, \mathcal{D}_{realign})]_j = \frac{1}{|\mathcal{D}_{realign}|} \sum_{\pmb{x} \in \mathcal{D}_{realign}} |\pmb{w}_j| \cdot \|\pmb{A}_j(\pmb{x}, \pmb{w})\|_2$$

     • $[h(\pmb{w}, \mathcal{D}_{realign})]_j$: The Wanda score for the $j$-th weight coordinate in the model $\pmb{w}$, calculated using the re-alignment dataset $\mathcal{D}_{realign}$.

     • $\pmb{w}$: The vector of all weights in the safety alignment-broken fine-tuned model.

     • $|\mathcal{D}_{realign}|$: The total number of data points (samples) in the re-alignment dataset.

     • $\sum_{\pmb{x} \in \mathcal{D}_{realign}}$: Summation over all data points $\pmb{x}$ in the re-alignment dataset.

     • $|\pmb{w}_j|$: The absolute value of the $j$-th weight coordinate of the model. This term captures the magnitude of the weight.

     • $\|\pmb{A}_j(\pmb{x}, \pmb{w})\|_2$: The $L_2$ norm of the hidden activation associated with the $j$-th weight coordinate when the input is $\pmb{x}$ and the model weights are $\pmb{w}$. This term reflects the influence of the input on that specific weight.

      Intuitively, a weight is considered important (and potentially harmful) if it has a large absolute value and is frequently activated by harmful data in the re-alignment dataset.

  2. Create Harmful Mask: After calculating the Wanda scores for all weights, a mask $\pmb{m}$ is created to identify the top harmful parameters. This mask is a binary vector (containing 0s and 1s) that indicates which weights are to be removed.

     $$\pmb{m} = \mathrm{ArgTopK}_{\alpha}\big(h(\pmb{w}, \mathcal{D}_{realign})\big)$$

     • $\pmb{m}$: The harmful mask vector.
     • $\mathrm{ArgTopK}_{\alpha}(\cdot)$: A function that returns a binary mask. It sets the mask to 1 for the top $\alpha$ fraction of weight coordinates with the highest Wanda scores $h(\pmb{w}, \mathcal{D}_{realign})$, and 0 for the rest.
     • $\alpha$: The mask ratio, a hyper-parameter controlling the percentage of weights to be masked (i.e., identified as harmful). For example, if $\alpha = 0.2$, the 20% of weights with the highest Wanda scores are set to 1 in the mask.
  3. Remove Harmful Parameters (Pruning Operation): Finally, the harmful parameters are removed from the fine-tuned model $\pmb{w}$ using the harmful mask $\pmb{m}$. This effectively sets the identified harmful weights to zero.

     $$\tilde{\pmb{w}} = (\mathbf{1} - \pmb{m}) \odot \pmb{w}$$

     • $\tilde{\pmb{w}}$: The re-aligned model that is now considered safe and ready for deployment.

     • $\mathbf{1}$: A vector of ones with the same dimensions as $\pmb{m}$ and $\pmb{w}$.

     • $(\mathbf{1} - \pmb{m})$: This operation inverts the mask. Where $\pmb{m}$ has a 1 (a harmful weight to be removed), $(\mathbf{1} - \pmb{m})$ has a 0. Where $\pmb{m}$ has a 0 (a non-harmful weight to be kept), $(\mathbf{1} - \pmb{m})$ has a 1.

     • $\odot$: The Hadamard product (element-wise multiplication). This operation multiplies each weight in $\pmb{w}$ by the corresponding element of the inverted mask $(\mathbf{1} - \pmb{m})$.

     • Effectively, any weight in $\pmb{w}$ that corresponds to a 1 in $\pmb{m}$ (identified as harmful) is multiplied by 0 and thereby pruned, while weights corresponding to 0 in $\pmb{m}$ retain their original value.

       The re-aligned model $\tilde{\pmb{w}}$ is then deployed to serve users' customized tasks, having recovered its safety alignment.
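
Putting the three steps together, the post-fine-tuning stage amounts to scoring every weight on the re-alignment data, masking the top-$\alpha$ fraction, and zeroing those coordinates out. Below is a minimal PyTorch sketch of the masking and pruning steps, assuming the Wanda-style scores have already been collected per parameter tensor; whether the top-$\alpha$ selection is applied per tensor, per layer, or per output row is an implementation detail this sketch does not pin down.

```python
import torch

def antidote_prune(weights: dict, scores: dict, mask_ratio: float = 0.2) -> dict:
    """Minimal sketch of Antidote's one-shot pruning step.

    weights:    {name: tensor} parameters of the (possibly alignment-broken) fine-tuned model
    scores:     {name: tensor} Wanda-style scores computed on the re-alignment (harmful)
                dataset, one score per weight, same shape as the corresponding tensor
    mask_ratio: the fraction alpha of highest-scoring weights to remove
    """
    pruned = {}
    for name, w in weights.items():
        s = scores[name].flatten()
        k = int(mask_ratio * s.numel())            # number of coordinates to prune
        mask = torch.zeros_like(s)                 # m in the paper: 1 = harmful, 0 = keep
        if k > 0:
            mask[torch.topk(s, k).indices] = 1.0   # ArgTopK_alpha over the Wanda scores
        # w_tilde = (1 - m) * w elementwise: zero the harmful coordinates, keep the rest
        pruned[name] = ((1.0 - mask) * w.flatten()).view_as(w)
    return pruned
```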

The following figure (Figure 4 from the original paper) shows the system overview of Antidote:

The figure is a schematic of the three-stage pipeline presented in the Antidote paper, covering the alignment stage, the user fine-tuning stage, and the one-shot pruning stage, and illustrating how a single pruning step removes harmful parameters to restore model safety.

5. Experimental Setup

5.1. Datasets

The experiments used several types of datasets:

  • Pre-trained Models:

    • Llama2-7B (default backbone for most evaluations)
    • Mistral-7B
    • Gemma-7B
    • Llama3-8B (for advanced model evaluation) These are all mainstream, open-source LLMs.
  • Harmful Data-related Datasets: All harmful data are sampled from BeaverTails (Ji et al., 2023), a human-preference dataset for LLM safety alignment.

    • Alignment Dataset ($\mathcal{D}_{align}$): Contains alignment data (harmful prompt-safe answer pairs). This is sampled from BeaverTails with is_safe=True labels.
    • Fine-tuning Dataset: This dataset is a mixture of downstream task data and harmful data.
      • A proportion p of harmful data (harmful prompt-harmful answer pairs), sampled from BeaverTails with is_safe=False.
      • A proportion 1-p of downstream task data. The harmful data in this dataset is distinct from the harmful data in the re-alignment dataset.
    • Re-alignment Dataset ($\mathcal{D}_{realign}$): Solely constituted by harmful data (harmful prompt-harmful answer pairs), also sampled from BeaverTails with is_safe=False. This dataset is used by Antidote to calculate Wanda scores for pruning, and it is distinct from the harmful data in the fine-tuning dataset.
  • Downstream Task Datasets: Four different datasets were used for fine-tuning tasks:

    • SST2 (Stanford Sentiment Treebank v2): A sentiment analysis dataset for classifying movie review sentences as positive or negative.
    • AGNEWS: A topic classification dataset of news articles into four categories (World, Sports, Business, Sci/Tech).
    • GSM8K (Grade School Math 8K): A dataset of 8,500 grade school math word problems.
    • AlpacaEval: A dataset used for evaluating instruction-following capabilities of LLMs.
  • Data Sample Example: The paper indicates a prompt format for constructing supervised datasets:

    Prompt: Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Instruction:{instruction} Input: {input} Response:

    Output: {output}

    For AGNEWS and SST2 (text classification tasks; see Huang et al., 2024d;b), the instruction could be "Classify the sentiment of the following text" or "Categorize the news article," with the input being the text and the output being the label. For GSM8K, the instruction involves solving a math problem, with the input being the problem statement and the output the solution. For BeaverTails, the instruction is a harmful prompt, the input may be empty or provide context, and the output is either a safe refusal or a harmful response.
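
To illustrate how such a supervised sample might be assembled, here is a small hypothetical helper (the exact whitespace and field separators used by the authors are assumptions):

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble a supervised sample in the Alpaca-style template quoted above;
    the precise spacing and newlines used by the authors may differ."""
    return (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. Write a response that appropriately completes "
        "the request.\n\n"
        f"Instruction:{instruction}\n"
        f"Input: {input_text}\n"
        "Response:"
    )

# Hypothetical SST2-style sample: the model learns to emit the label after "Response:"
print(build_prompt("Classify the sentiment of the following movie review.",
                   "a gripping, well-acted film"))
```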

5.2. Evaluation Metrics

Two main metrics are used for evaluation:

  1. Finetune Accuracy (FA):

    • Conceptual Definition: This metric measures the performance of the fine-tuned model on its specific downstream task (e.g., sentiment analysis, math problem solving). It quantifies how well the model performs its intended function after fine-tuning.
    • Mathematical Formula: For classification tasks (like SST2, AGNEWS), Finetune Accuracy is typically calculated as the proportion of correctly predicted labels out of the total number of predictions. $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of instances where the model's output matches the true label.
      • Total Number of Predictions: The total number of instances in the test set. For generative tasks like GSM8K or AlpacaEval, accuracy might refer to the percentage of problems solved correctly or instructions followed accurately, often requiring more complex parsing or human evaluation for correctness.
  2. Harmful Score (HS):

    • Conceptual Definition: This metric quantifies the degree to which a model generates harmful content when presented with malicious instructions. A higher Harmful Score indicates a greater lack of safety alignment.
    • Mathematical Formula: The paper defines Harmful Score as the ratio of flagged unsafe output. $ \text{Harmful Score} = \frac{\text{Number of Flagged Unsafe Outputs}}{\text{Total Number of Malicious Instructions}} \times 100\% $
    • Symbol Explanation:
      • Number of Flagged Unsafe Outputs: The count of model responses to malicious instructions that are identified as unsafe by a moderation model.
      • Total Number of Malicious Instructions: The total number of harmful prompts given to the model during testing. The paper uses the moderation model from Ji et al. (2023) to flag unsafe outputs, and a sample of 1,000 harmful instructions from BeaverTails (Ji et al., 2023) is used to calculate this score.
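
Both metrics are simple ratios over their respective evaluation sets. A minimal sketch, assuming the moderation flags and task labels have already been collected:

```python
def harmful_score(unsafe_flags) -> float:
    """Harmful score: percentage of responses to malicious test instructions
    that the moderation model flags as unsafe."""
    return 100.0 * sum(bool(f) for f in unsafe_flags) / len(unsafe_flags)

def finetune_accuracy(predictions, labels) -> float:
    """Finetune accuracy for classification tasks: percentage of exact label matches."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

# Toy example: 3 of 10 outputs flagged unsafe -> harmful score of 30.0
print(harmful_score([True, False, False, True, False, False, True, False, False, False]))
```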

5.3. Baselines

The paper compares Antidote against five baselines:

  1. SFT (Supervised Fine-Tuning): This is the standard approach where the model is fine-tuned on both alignment data and user fine-tuning data without any specific safety defense mechanism during the fine-tuning stage. It serves as a baseline to show the vulnerability without defense.

  2. RepNoise (Rosati et al., 2024a): An alignment-stage solution that modifies the alignment process to improve robustness by degrading harmful data representations to random Gaussian noise. It keeps the fine-tuning stage as SFT.

  3. Vaccine (Huang et al., 2024d): Another alignment-stage solution that vaccinates the model by adding embedding perturbations during alignment to make it robust against harmful fine-tuning. It also keeps the fine-tuning stage as SFT.

  4. Lisa (Huang et al., 2024b): A fine-tuning-stage solution that modifies the fine-tuning process itself. It alternately optimizes over alignment data and fine-tuning data and uses a proximal regularizer to enforce proximity between iterates.

  5. LDIFS (Mukhoti et al., 2023): Another fine-tuning-stage solution that introduces a regularizer during fine-tuning to enforce the iterate's embedding to remain in close proximity to that of the aligned model.

    These baselines are representative as they cover the two main existing categories of defenses (alignment-stage and fine-tuning-stage) against harmful fine-tuning.

5.4. Training Details and Hyper-parameters

  • Parameter-Efficient Fine-tuning: All alignment and fine-tuning processes utilize LoRA (Hu et al., 2021). The rank of the LoRA adaptor is set to 256 for both tasks.
  • Alignment Stage:
    • Safety samples: 5000
    • Optimizer: AdamW
    • Learning Rate: 1e-3
    • Epochs: 20
  • Fine-tuning Stage (Default Settings):
    • Total samples (n): 5000
    • Harmful data percentage (p): 0.2 (i.e., 20% of fine-tuning data is harmful)
    • Optimizer: AdamW
    • Learning Rate (lr): 1e-4
    • Epochs (ep): 20
    • Default dataset: SST2
  • Antidote Specific Hyper-parameters (Default Settings):
    • Mask Ratio ($\alpha$): 0.2 (meaning 20% of weights with the highest Wanda scores are pruned).
    • Mask Ratio for GSM8K: $\alpha = 0.05$ (tuned for this specific task).
    • Re-alignment dataset size: 2000 harmful samples are used to form $\mathcal{D}_{realign}$.
  • Baselines Hyper-parameters:
    • Vaccine: perturbation intensity $\rho = 2$.
    • RepNoise: $\alpha = 0.1$, $\beta = 0.001$.
    • Lisa: proximal penalty $\rho = 1$.
    • LDIFS: regularization coefficient $\lambda = 0.0001$ (tuned from a set of values).
  • Hardware: All experiments were conducted on an H100 GPU.
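
For orientation, the default setup listed above could be expressed roughly as follows with the Hugging Face transformers and peft libraries. This is a hedged sketch, not the authors' script: values not reported in this analysis (LoRA alpha, target modules, batch size) are assumptions.

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Default backbone used in most experiments
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA adaptor of rank 256, used for both alignment and fine-tuning
lora_cfg = LoraConfig(
    r=256,
    lora_alpha=16,                           # assumed scaling factor (not reported above)
    target_modules=["q_proj", "v_proj"],     # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Default fine-tuning stage settings: AdamW, learning rate 1e-4, 20 epochs
train_args = TrainingArguments(
    output_dir="./finetune-output",
    learning_rate=1e-4,
    num_train_epochs=20,
    per_device_train_batch_size=8,           # assumed batch size
    optim="adamw_torch",
)
```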

6. Results & Analysis

6.1. Core Results Analysis

Robustness to Harmful Ratio

The following are the results from Table 1 of the original paper:

| Methods | HS (clean) | HS (p=0.05) | HS (p=0.1) | HS (p=0.2) | HS (p=0.5) | HS (Avg) | FA (clean) | FA (p=0.05) | FA (p=0.1) | FA (p=0.2) | FA (p=0.5) | FA (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SFT | 52.30 | 76.70 | 79.00 | 79.40 | 80.20 | 73.52 | 95.87 | 95.18 | 95.07 | 95.18 | 93.69 | 95.00 |
| RepNoise | 42.40 | 79.20 | 79.50 | 77.90 | 82.60 | 72.32 | 95.07 | 94.84 | 94.84 | 94.38 | 94.61 | 94.75 |
| Vaccine | 44.80 | 80.20 | 80.00 | 81.50 | 81.90 | 73.68 | 95.53 | 95.53 | 94.04 | 95.18 | 94.04 | 94.86 |
| Lisa | 53.00 | 60.90 | 64.80 | 68.20 | 72.10 | 63.80 | 93.92 | 93.69 | 93.58 | 93.23 | 91.17 | 93.12 |
| LDIFS | 51.70 | 67.70 | 68.80 | 72.30 | 71.80 | 66.46 | 93.46 | 93.23 | 93.69 | 93.23 | 94.04 | 93.53 |
| Antidote | 52.90 | 61.20 | 61.20 | 64.60 | 64.50 | 60.88 | 93.58 | 93.46 | 93.12 | 93.35 | 91.74 | 93.05 |

(HS = harmful score, FA = finetune accuracy.)

Antidote demonstrates the best defense performance across varying harmful ratios (p), achieving the lowest average harmful score (60.88). This is a significant 11.56% reduction compared to SFT (73.52). It maintains competitive fine-tune accuracy (93.05 average), with only a marginal 1.45% drop from SFT (95.00). Notably, Antidote's performance remains consistent even as the harmful ratio increases, unlike other methods (e.g., Lisa's harmful score increases from 60.90 at p=0.05 to 72.10 at p=0.5). This validates Antidote's post-fine-tuning design, making it robust to how harmful data is introduced. The alignment stage solutions (RepNoise, Vaccine) generally perform worse than SFT at higher harmful ratios, suggesting they are less effective in these attack scenarios.

Robustness to Fine-tuning Samples

The following are the results from Table 2 of the original paper:

| Methods | HS (n=100) | HS (n=1000) | HS (n=2000) | HS (n=3000) | HS (n=5000) | HS (Avg) | FA (n=100) | FA (n=1000) | FA (n=2000) | FA (n=3000) | FA (n=5000) | FA (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SFT | 65.50 | 76.90 | 77.80 | 80.70 | 79.40 | 76.06 | 92.20 | 94.72 | 94.27 | 94.50 | 95.18 | 94.17 |
| RepNoise | 66.50 | 77.60 | 78.80 | 78.60 | 77.90 | 75.88 | 89.45 | 92.66 | 93.69 | 94.72 | 94.38 | 92.98 |
| Vaccine | 66.40 | 79.00 | 78.60 | 81.10 | 81.50 | 77.32 | 90.48 | 93.92 | 94.95 | 95.30 | 95.18 | 93.97 |
| Lisa | 52.80 | 52.40 | 54.00 | 64.30 | 68.20 | 58.34 | 26.72 | 33.72 | 49.54 | 91.17 | 93.23 | 58.88 |
| LDIFS | 55.70 | 64.60 | 67.10 | 68.90 | 72.30 | 65.72 | 87.73 | 91.17 | 92.32 | 92.43 | 93.23 | 91.38 |
| Antidote | 57.00 | 60.70 | 62.80 | 61.70 | 64.60 | 61.36 | 90.02 | 92.43 | 93.12 | 93.00 | 93.35 | 92.38 |

Antidote achieves the best defense performance among baselines, reducing the harmful score by 13.42% on average. It maintains robust performance across different numbers of fine-tuning samples (n), with harmful scores remaining relatively stable. Lisa shows a lower average harmful score (58.34) but at the cost of unacceptably low fine-tune accuracy at smaller sample sizes (e.g., 26.72% for n=100). This highlights the trade-off. Antidote provides a good balance, effectively reducing harmful scores while keeping fine-tune accuracy high (92.38 average).

Robustness to Learning Rate in Fine-tuning

The following are the results from Table 3 of the original paper:

| Methods | HS (lr=1e-7) | HS (lr=1e-6) | HS (lr=1e-5) | HS (lr=1e-4) | HS (lr=1e-3) | HS (Avg) | FA (lr=1e-7) | FA (lr=1e-6) | FA (lr=1e-5) | FA (lr=1e-4) | FA (lr=1e-3) | FA (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SFT | 52.80 | 70.30 | 80.10 | 77.80 | 79.80 | 72.16 | 4.30 | 14.00 | 23.10 | 21.90 | 23.30 | 17.32 |
| RepNoise | 52.50 | 70.10 | 79.00 | 80.20 | 75.50 | 71.46 | 4.80 | 12.60 | 24.90 | 23.50 | 24.70 | 18.10 |
| Vaccine | 46.50 | 66.00 | 79.40 | 80.60 | 77.50 | 70.00 | 1.80 | 10.90 | 25.50 | 24.20 | 25.80 | 17.64 |
| Lisa | 52.30 | 55.00 | 64.40 | 73.20 | 77.30 | 64.44 | 4.00 | 5.70 | 13.60 | 21.90 | 24.70 | 13.98 |
| LDIFS | 53.20 | 56.10 | 59.00 | 68.50 | 78.50 | 63.06 | 4.00 | 4.80 | 5.40 | 6.10 | 14.10 | 6.88 |
| Antidote | 53.50 | 61.80 | 65.60 | 65.30 | 68.80 | 63.00 | 4.10 | 11.20 | 17.50 | 16.10 | 20.40 | 13.86 |

On the GSM8K dataset, Antidote reduces the average harmful score by 6.56% (from 72.16 for SFT to 63.00 for Antidote) with only a 0.38% average fine-tune accuracy drop. While Lisa and LDIFS achieve comparable harmful score reductions, their fine-tune accuracy is severely impacted (e.g., LDIFS's average FA is 6.88%). This confirms the hyper-parameter sensitivity issue for Lisa and LDIFS at higher learning rates (lr=1e-3), where their harmful scores escalate significantly. In contrast, Antidote's harmful score is less susceptible to the exact learning rate setting, validating its agnostic design. The trade-off between safety and accuracy is clear, and Antidote provides a more balanced solution.

The following figure (Figure 2 from the original paper) shows the harmful score and finetune accuracy with different learning rates:

Figure 2. Harmful score and finetune accuracy with different learning rates after fine-tuning. Here we fix the fine-tuning epochs to 20.

Robustness to Training Epochs in Fine-tuning

The following are the results from Table 4 of the original paper:

| Methods | HS (ep=1) | HS (ep=5) | HS (ep=10) | HS (ep=20) | HS (ep=40) | HS (Avg) | FA (ep=1) | FA (ep=5) | FA (ep=10) | FA (ep=20) | FA (ep=40) | FA (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SFT | 76.50 | 78.90 | 79.90 | 77.80 | 78.70 | 78.36 | 21.00 | 25.80 | 26.50 | 21.90 | 24.60 | 23.96 |
| RepNoise | 76.30 | 79.50 | 79.00 | 80.20 | 80.80 | 79.16 | 19.70 | 26.20 | 26.10 | 23.50 | 22.70 | 23.64 |
| Vaccine | 75.80 | 82.10 | 79.60 | 80.60 | 80.40 | 79.70 | 20.40 | 26.00 | 25.10 | 24.20 | 22.60 | 23.66 |
| Lisa | 55.40 | 54.80 | 71.50 | 73.20 | 75.00 | 65.98 | 4.50 | 4.50 | 21.70 | 21.90 | 24.40 | 15.40 |
| LDIFS | 56.70 | 61.50 | 64.90 | 68.50 | 72.40 | 64.80 | 4.90 | 5.00 | 5.70 | 6.10 | 6.10 | 5.56 |
| Antidote | 61.50 | 66.80 | 66.60 | 65.30 | 63.60 | 64.76 | 13.60 | 17.80 | 19.80 | 16.10 | 13.90 | 16.24 |

Similar to the learning rate results, Antidote shows remarkable robustness to varying numbers of fine-tuning epochs. While other defenses (Lisa, LDIFS) tend to see their harmful scores increase with more epochs (indicating safety degradation), Antidote's harmful score actually decreases slightly from ep=10 to ep=40 (66.60 to 63.60). This further supports the claim that Antidote is less susceptible to fine-tuning hyper-parameters. Again, Lisa and LDIFS achieve low harmful scores at small epoch counts, but only with substantial sacrifices in fine-tune accuracy there.

The following figure (Figure 3 from the original paper) shows the harmful score and finetune accuracy with different finetuning epochs:

Figure 3. Harmful score and finetune accuracy with different finetuning epochs after user fine-tuning. Here we fix the fine-tuning learning rate to 1e-5. The chart shows that the harmful score generally rises as the number of epochs increases, while finetune accuracy remains relatively high overall.

Robustness to Benign Fine-tuning Attack

The following are the results from Table 5 of the original paper:

| Methods | Harmful Score | Fine-tune Accuracy |
|---|---|---|
| SFT | 61.50 | 27.60 |
| RepNoise | 66.10 | 27.40 |
| Vaccine | 58.90 | 26.60 |
| LDIFS | 64.40 | 6.70 |
| Lisa | 59.20 | 27.60 |
| Antidote | 57.10 | 27.80 |

Antidote effectively reduces the harmful score (57.10) even when the fine-tuning is on benign data (GSM8K), without significantly hurting fine-tune accuracy (27.80). This demonstrates its ability to mitigate safety degradation from benign fine-tuning as well, a known attack vector where seemingly harmless data can still degrade safety. LDIFS again shows a low harmful score but with a drastically reduced fine-tune accuracy (6.70).

6.2. Generalizations on Datasets and Models

Generalization to Fine-tuning Datasets

The following are the results from Table 6 of the original paper:

| Methods | SST2 HS | SST2 FA | AGNEWS HS | AGNEWS FA | GSM8K HS | GSM8K FA | AlpacaEval HS | AlpacaEval FA | Avg HS | Avg FA |
|---|---|---|---|---|---|---|---|---|---|---|
| SFT | 79.40 | 95.18 | 79.60 | 92.70 | 77.80 | 21.90 | 73.80 | 43.27 | 77.65 | 63.26 |
| RepNoise | 77.90 | 94.38 | 82.30 | 92.20 | 80.20 | 23.50 | 73.50 | 42.00 | 78.90 | 63.14 |
| Vaccine | 81.50 | 95.18 | 81.10 | 93.00 | 80.60 | 24.20 | 73.40 | 40.10 | 79.15 | 63.12 |
| Lisa | 68.20 | 93.23 | 74.80 | 90.80 | 73.20 | 21.90 | 65.20 | 39.90 | 72.45 | 61.92 |
| LDIFS | 72.30 | 93.23 | 69.60 | 87.10 | 68.50 | 6.10 | 66.60 | 39.81 | 69.25 | 56.56 |
| Antidote | 64.60 | 93.35 | 69.50 | 88.00 | 65.30 | 16.10 | 60.50 | 41.83 | 64.98 | 59.82 |

Antidote generalizes well across different fine-tuning tasks (SST2, AGNEWS, GSM8K, AlpacaEval). On average, it reduces the harmful score by 11.75% compared to SFT (from 77.65 to 64.98) with an average fine-tune accuracy loss of 3.08% (from 63.26 to 59.82). While the fine-tune accuracy loss is higher than in some other tables, it is noted that the mask ratio ($\alpha$) was not specifically tuned for each dataset, indicating potential for further optimization.

Generalization to Alignment Datasets

The following are the results from Table 7 of the original paper:

| Methods | HS (p=0) | FA (p=0) | HS (p=0.05) | FA (p=0.05) | HS (p=0.1) | FA (p=0.1) | HS (p=0.2) | FA (p=0.2) | HS (p=0.5) | FA (p=0.5) |
|---|---|---|---|---|---|---|---|---|---|---|
| SFT | 13.5 | 29.2 | 80.3 | 28.2 | 78.8 | 28.1 | 78.6 | 26.8 | 82.3 | 24.1 |
| Lisa | 45.5 | 29.7 | 67.8 | 28.5 | 75.5 | 28.5 | 78.7 | 27.2 | 78.7 | 24.1 |
| Antidote | 2.3 | 22.2 | 11.3 | 22.6 | 15.5 | 21.9 | 21.8 | 20.6 | 36.5 | 19.3 |

When using a stronger safety alignment dataset (BeaverTails refusal data, from Rosati et al., 2024a), Antidote achieves even better defense performance. For instance, at p=0.2, Antidote significantly reduces the harmful score to 21.8, which is over a 40% reduction compared to SFT (78.6) and Lisa (78.7). This indicates that Antidote is compatible with and benefits from stronger initial safety alignment, leading to better overall safety.

Generalization to Models

The following are the results from Table 8 of the original paper:

| Methods | Llama2-7B HS | Llama2-7B FA | Mistral-7B HS | Mistral-7B FA | Gemma-7B HS | Gemma-7B FA | Avg HS | Avg FA |
|---|---|---|---|---|---|---|---|---|
| SFT | 79.40 | 95.18 | 80.30 | 95.99 | 80.90 | 96.22 | 80.20 | 95.80 |
| RepNoise | 77.90 | 94.38 | 79.00 | 94.95 | 80.70 | 88.76 | 79.20 | 92.70 |
| Vaccine | 81.50 | 95.18 | 80.60 | 94.04 | 79.10 | 94.72 | 80.40 | 94.65 |
| Lisa | 68.20 | 93.23 | 65.30 | 95.07 | 75.40 | 96.22 | 69.63 | 94.84 |
| LDIFS | 72.30 | 93.23 | 69.50 | 92.09 | 72.70 | 93.35 | 71.50 | 92.89 |
| Antidote | 64.60 | 93.35 | 64.80 | 94.95 | 59.40 | 94.04 | 62.93 | 94.11 |

Antidote generalizes across different LLM architectures (Llama2-7B, Mistral-7B, Gemma-7B). It achieves harmful score reductions of 11.6% (Llama2-7B), 20.0% (Mistral-7B), and 22.5% (Gemma-7B) compared to SFT. These reductions come with minor fine-tune accuracy drops of 1.49%, 0.92%, and 1.72% respectively. Notably, Antidote appears more effective in reducing harmful scores for stronger backbone models (e.g., Gemma-7B had the highest reduction). This indicates its potential for robust application across a range of LLMs.

The following are the results from Table 9 of the original paper:

| Methods | Harmful Score | Finetune Accuracy |
|---|---|---|
| SFT | 80.30 | 42.40 |
| Vaccine | 77.50 | 36.90 |
| RepNoise | 78.30 | 41.40 |
| Lisa | 74.40 | 41.30 |
| LDIFS | 71.50 | 15.90 |
| Antidote | 71.20 | 39.00 |

Antidote also shows effective performance on the more advanced Llama3-8B model. It achieves the lowest harmful score (71.20) among all methods, notably outperforming LDIFS which has a similar harmful score (71.50) but with a much lower fine-tune accuracy (15.90 vs 39.00). This confirms Antidote's generalizability to state-of-the-art LLMs.

6.3. Statistical and System Evaluation

Harmful Embedding Drift (HED)

The following figure (Figure 5 from the original paper) shows the Harmful Embedding Drift (HED) under different learning rates and epochs in the fine-tuning stage:

Figure 5. Harmful embedding drift (HED) under different learning rates and epochs in the fine-tuning stage. Antidote obtains a relatively small HED across all settings, indicating better safety robustness.

HED measures the $L_2$ norm of the difference between the hidden embedding of the aligned model and that of the fine-tuned model over the same alignment data. The results show that Antidote consistently maintains HED at a small scale. While SFT and Antidote share identical processes in their first two stages (leading to similar HED before Antidote's intervention), the one-shot pruning in Antidote's post-fine-tuning stage significantly reduces the HED. This indicates that removing the identified harmful parameters effectively recovers the alignment knowledge and mitigates the harmful embedding drift issue. In contrast, other methods (RepNoise, Vaccine, Lisa, LDIFS) show increasing HED with larger learning rates and epochs, demonstrating their hyper-parameter sensitivity.
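
As a rough sketch of how HED could be computed for a Hugging Face causal LM (which hidden layer is compared, here the last one, is an assumption; the authors' exact procedure may differ):

```python
import torch

@torch.no_grad()
def harmful_embedding_drift(aligned_model, finetuned_model, batch) -> float:
    """L2 distance between the hidden embeddings that the aligned and the fine-tuned
    model produce for the same batch of alignment data (last hidden layer here)."""
    h_aligned = aligned_model(**batch, output_hidden_states=True).hidden_states[-1]
    h_tuned = finetuned_model(**batch, output_hidden_states=True).hidden_states[-1]
    return (h_aligned - h_tuned).norm(p=2).item()
```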

Output Logit Drift Visualization

The following figure (Figure 6 from the original paper) shows the visualization of output logit for harmful and GSM8K samples:

Figure 6. Visualization of output logit. Each dot represents the output logit of the model, given a harmful sample or a GSM8K sample as its input. For example, to generate a red point, we input a GSM… The chart visualizes the output-logit drift for harmful and GSM8K samples before and after pruning: the left panel compares the drift before and after pruning with Antidote, and the right panel shows the corresponding comparison for random pruning, reflecting how the two pruning approaches affect model outputs.

This visualization compares the output logit drift caused by Antidote's pruning versus random pruning. The output logit represents the raw, unnormalized prediction scores from the model's final layer. Antidote results in a smaller logit drift (13058) for GSM8K (benign) samples compared to random pruning (22000). At the same time, Antidote achieves a similar drift for harmful samples (24469) as random pruning (26172). This suggests that Antidote's targeted pruning effectively shifts the logits of harmful outputs towards a more benign state without severely disrupting the logits for benign tasks, thereby preserving general performance.

System Performance

The following are the results from Table 10 of the original paper:

| Methods | Clock time: Alignment (h) | Clock time: Fine-tuning (h) | Clock time: Post-FT (h) | Clock time: Sum (h) | GPU memory: Alignment (GB) | GPU memory: Fine-tuning (GB) | GPU memory: Post-FT (GB) | GPU memory: Max (GB) |
|---|---|---|---|---|---|---|---|---|
| SFT | 0.92 (1x) | 0.78 (1x) | 0 | 1.70 (1x) | 35.45 (1x) | 33.06 (1x) | 0 | 35.45 (1x) |
| RepNoise | 1.97 (2.14x) | 0.78 (1x) | 0 | 2.75 (1.62x) | 75.26 (2.12x) | 33.06 (1x) | 0 | 75.26 (2.12x) |
| Vaccine | 1.84 (2x) | 0.78 (1x) | 0 | 2.63 (1.54x) | 56.46 (1.71x) | 33.06 (1x) | 0 | 56.46 (1.71x) |
| Lisa | 0.92 (1x) | 0.80 (1.03x) | 0 | 1.72 (1.01x) | 35.45 (1x) | 52.95 (1.60x) | 0 | 52.95 (1.49x) |
| LDIFS | 0.92 (1x) | 1.19 (1.53x) | 0 | 2.11 (1.24x) | 35.45 (1x) | 64.53 (1.95x) | 0 | 64.53 (1.82x) |
| Antidote | 0.92 (1x) | 0.78 (1x) | 0.04 | 1.78 (1.02x) | 35.45 (1x) | 33.06 (1x) | 22.35 | 35.45 (1x) |

The system evaluation shows that Antidote is highly efficient. It introduces only a slight increase in total clock time (1.02x compared to SFT's 1.00x) and maintains the same GPU memory usage (1x) for the overall pipeline. The extra overhead for Antidote comes from the Post-FT stage (0.04 hours for Wanda score calculation). In stark contrast, alignment stage solutions (RepNoise, Vaccine) incur significant overhead in the alignment stage (e.g., RepNoise is 2.14x clock time and 2.12x GPU memory). Fine-tuning stage solutions (Lisa, LDIFS) also increase GPU memory (1.60x and 1.95x respectively) and sometimes clock time (LDIFS is 1.53x in fine-tuning). This makes Antidote a practical and scalable solution.

6.4. Hyper-parameters Analysis and Ablation Study

Impact of Mask Ratio ($\alpha$)

The following are the results from Table 11 of the original paper:

| | α = 0.01 | α = 0.05 | α = 0.1 | α = 0.15 | α = 0.2 | α = 0.25 |
|---|---|---|---|---|---|---|
| HS | 73.60 | 68.70 | 64.60 | 58.90 | 58.40 | 57.00 |
| FA | 94.95 | 94.50 | 93.35 | 91.06 | 86.58 | 80.05 |

As the mask ratio ($\alpha$) increases, both the harmful score (HS) and the fine-tune accuracy (FA) decrease. This is an expected trade-off: pruning more parameters (higher $\alpha$) leads to better safety (lower HS) but also removes more task-relevant parameters, thus reducing FA. The results show a clear inverse relationship: from $\alpha = 0.01$ to $\alpha = 0.25$, HS drops from 73.60 to 57.00, while FA drops from 94.95 to 80.05. This demonstrates that $\alpha$ is a crucial hyper-parameter for balancing safety and utility. A larger mask ratio could also offer benefits in model acceleration, though that is considered future work.

Necessity of Re-alignment Dataset

The following are the results from Table 12 of the original paper:

| | p=0.05 | p=0.1 | p=0.2 | p=0.5 | Average |
|---|---|---|---|---|---|
| HS (w/ harmful data) | 63.10 | 68.30 | 68.80 | 69.20 | 67.35 |
| HS (w/ fine-tuning data) | 63.30 (+0.20) | 69.80 (+1.50) | 68.50 (-0.30) | 70.50 (+1.30) | 68.03 (+0.68) |
| HS (w/ benign data) | 63.80 (+0.70) | 69.70 (+1.40) | 69.20 (+0.40) | 71.20 (+2.00) | 68.48 (+1.13) |

This ablation study confirms the necessity of using a dedicated re-alignment dataset (composed solely of harmful data) for calculating the Wanda score. Using the fine-tuning data (which contains a mixture of benign and harmful samples) or benign data (the fine-tuning data without its harmful samples) for Wanda score calculation leads to higher harmful scores compared to using the dedicated harmful re-alignment data. Specifically, using benign data results in the worst performance, as it cannot effectively identify harmful parameters. This validates Antidote's design choice of a specialized re-alignment dataset.

Impact of the Size of Re-alignment Dataset

The following are the results from Table 13 of the original paper:

| $\lvert\mathcal{D}_{realign}\rvert$ | 0 | 5 | 10 | 100 | 1k | 2k | 5k |
|---|---|---|---|---|---|---|---|
| HS | 72.1 (0) | 70.60 (-1.5) | 70.30 (-1.8) | 70.20 (-1.9) | 69.30 (-2.8) | 69.20 (-2.9) | 69.40 (-2.7) |

The results show a general trend: increasing the number of harmful samples in the re-alignment dataset ($|\mathcal{D}_{realign}|$) tends to reduce the harmful score. This is because a larger dataset better reflects the true harmful distribution, allowing for more precise identification of harmful parameters. The benefit of increasing samples diminishes after a certain point; 1000 samples already yield a significant reduction in HS, and further increases to 2000 or 5000 samples provide only marginal additional benefits. The case where $|\mathcal{D}_{realign}| = 0$ means the Wanda score reduces to the weight magnitude, yielding the highest harmful score (72.1), further proving the utility of re-alignment data. The practicality of collecting 1k harmful samples supports the feasibility of Antidote.

6.5. Extensions

The following are the results from Table 14 of the original paper:

| Methods | HS (p=0.1) | HS (p=0.2) | HS (p=0.5) | HS (Avg) | FA (p=0.1) | FA (p=0.2) | FA (p=0.5) | FA (Avg) |
|---|---|---|---|---|---|---|---|---|
| Antidote | 61.20 | 64.60 | 64.50 | 63.43 | 93.12 | 93.35 | 91.74 | 92.74 |
| V-S-A | 58.90 | 62.30 | 61.70 | 60.97 (-2.46) | 94.04 | 93.00 | 91.74 | 92.93 (+0.19) |
| S-L-A | 61.10 | 61.60 | 60.90 | 61.20 (-2.23) | 91.28 | 92.89 | 91.86 | 92.01 (-0.73) |
| V-L-A | 63.70 | 63.70 | 60.60 | 62.67 (-0.76) | 93.12 | 93.58 | 91.51 | 92.74 (0) |

This section explores combining Antidote with other defenses due to its post-fine-tuning nature.

  • V-S-A (Vaccine + SFT + Antidote): This combination uses Vaccine in the alignment stage, standard SFT in fine-tuning, and then Antidote. It achieves the best safety performance, reducing the average harmful score by 2.46 points compared to Vanilla Antidote (from 63.43 to 60.97), while slightly increasing average fine-tune accuracy by 0.19 points (from 92.74 to 92.93). This suggests Antidote can synergize with alignment-stage defenses.

  • S-L-A (SFT + Lisa + Antidote): This uses SFT for alignment, Lisa for fine-tuning, and then Antidote. It also reduces the average harmful score by 2.23 points (to 61.20) but comes with a slight reduction in fine-tune accuracy (-0.73 points, to 92.01).

  • V-L-A (Vaccine + Lisa + Antidote): This combines Vaccine in alignment, Lisa in fine-tuning, and Antidote. It shows a smaller reduction in harmful score (-0.76 points, to 62.67) compared to V-S-A and S-L-A, while maintaining the same fine-tune accuracy as Vanilla Antidote.

    In summary, combining Antidote with a strong alignment-stage defense like Vaccine (V-S-A) proves to be the most effective strategy, offering improved safety without compromising performance.

6.6. Visualization

The paper includes a visualization section that presents qualitative results: The paper states that for a malicious prompt, Antidote gives refusal answers to sensitive questions, while other methods cannot. This qualitative result complements the quantitative harmful score by showing the actual output behavior, confirming Antidote's ability to maintain safety.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper identifies a crucial hyper-parameter sensitivity issue in existing alignment-stage and fine-tuning-stage defenses against harmful fine-tuning attacks on LLMs. These defenses often fail when using fine-tuning hyper-parameters (like large learning rates or many epochs) that are necessary for optimal downstream task performance. To address this, the authors propose Antidote, a novel post-fine-tuning stage solution that is agnostic to the fine-tuning details. Antidote's core philosophy is to recover the model from harmful behaviors by identifying and removing harmful parameters through one-shot pruning using the Wanda score on a re-alignment dataset. Extensive empirical results demonstrate that Antidote significantly reduces the harmful score across various attack settings and datasets, maintains high fine-tune accuracy, generalizes well to different LLM architectures, and is system-efficient.

7.2. Limitations & Future Work

The paper implicitly notes some limitations and suggests future work:

  • Mask Ratio Tuning: The paper mentions that the mask ratio ($\alpha$) was not specifically tuned for each dataset in the generalization experiments (Table 6), implying that further tuning could potentially improve the trade-off between harmful score and fine-tune accuracy.
  • Model Acceleration: While pruning offers inherent potential for model acceleration (especially with a larger mask ratio), the paper explicitly states that model acceleration was not its main focus and is left as future work.
  • Potential Misuse: The Impact Statement acknowledges that the findings of this paper, while proposing a defense, could potentially be misused by the public to launch attacks towards commercial LLM services, highlighting a broader ethical consideration in security research.

7.3. Personal Insights & Critique

The paper's identification of the hyper-parameter sensitivity issue is a significant contribution. It highlights a practical challenge for LLM service providers who need to balance user customization (which might require specific fine-tuning hyper-parameters) with safety compliance. The Antidote approach is elegantly simple in its design, leveraging existing model sparsification techniques (Wanda score, pruning) for a novel application in safety alignment. The "post-fine-tuning" nature of Antidote offers a modularity that could be highly beneficial, allowing users to fine-tune models as they normally would, with Antidote acting as a subsequent safety layer.

One area for potential critique or deeper exploration is the definition and selection of the re-alignment dataset. While the paper validates its necessity and size, the quality and representativeness of this harmful dataset are crucial. If the re-alignment dataset itself is not comprehensive, Antidote might miss certain harmful parameters or inadvertently remove benign ones. Further research into dynamic re-alignment dataset generation or adaptation could be valuable. Additionally, the specific mechanism of how pruning affects the LLM's internal representations and its long-term safety resilience could be further investigated from a mechanistic interpretability perspective.

The paper provides a compelling solution for a pressing issue in LLM deployment. Its methods, particularly the idea of repurposing pruning for safety, could inspire similar applications in other domains where model robustness against specific malicious behaviors is critical, beyond just LLMs. The low computational overhead is a strong practical advantage, making Antidote a promising candidate for real-world integration.
