Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
TL;DR Summary
Antidote mitigates harmful fine-tuning attacks on large language models by one-shot pruning harmful parameters post-fine-tuning, independent of hyperparameters, effectively reducing harmful outputs while preserving task accuracy.
Abstract
Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- a few harmful data mixed into the fine-tuning dataset can break the LLMs' safety alignment. While several defenses have been proposed, our evaluation shows that existing defenses fail when some specific training hyper-parameters are chosen -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense. To this end, we propose Antidote, a post-fine-tuning stage solution, which remains agnostic to the training hyper-parameters in the fine-tuning stage. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce harmful score while maintaining accuracy on downstream tasks. Code is available at https://github.com/git-disl/Antidote.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
1.2. Authors
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Joshua Kimball, Ling Liu
1.3. Journal/Conference
Published on arXiv, a preprint server, meaning this is a pre-publication version of a research paper; the acknowledgments mention venues such as AAAI2025-AIA and ICML2025. arXiv is a highly influential platform for rapid dissemination of research in fields like AI.
1.4. Publication Year
2024 (specifically, August 18, 2024).
1.5. Abstract
Safety-aligned Large Language Models (LLMs) are susceptible to harmful fine-tuning attacks, where even a small amount of malicious data mixed into the fine-tuning dataset can compromise their safety alignment. Existing defenses often fail under specific fine-tuning hyper-parameters, such as a large learning rate or a high number of training epochs. To address this, the paper proposes Antidote, a solution applied after the fine-tuning stage. Antidote operates on the principle of removing harmful parameters to recover the model's safety, irrespective of how these parameters were formed during fine-tuning. It introduces a one-shot pruning stage to identify and remove weights responsible for generating harmful content. Despite its simplicity, empirical results show Antidote effectively reduces the harmful score while maintaining accuracy on downstream tasks.
1.6. Original Source Link
https://arxiv.org/abs/2408.09600 PDF Link: https://arxiv.org/pdf/2408.09600v3.pdf Publication Status: Preprint (on arXiv).
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the vulnerability of safety-aligned Large Language Models (LLMs) to harmful fine-tuning attacks. Traditionally, LLMs are safety-aligned to refuse harmful content generation. However, studies show that even a small amount of harmful data injected during fine-tuning can cause the model to forget its safety knowledge and generate unsafe responses.
This problem is important because fine-tuning-as-a-service is an emerging paradigm where users customize LLMs with their own data. Service providers (e.g., OpenAI) have an obligation to ensure that the fine-tuned models remain harmless to avoid governance issues or legal repercussions.
Existing mitigation strategies, broadly categorized into alignment stage defenses and user fine-tuning stage defenses, have a common weakness: they are highly sensitive to fine-tuning hyper-parameters, such as the learning rate and number of epochs. A large learning rate or many epochs, while often necessary for good performance on downstream tasks, can easily invalidate these defenses, leading to a degradation in safety alignment. This creates a trade-off where ensuring safety might compromise task performance. The paper's innovative idea is to propose a defense that is agnostic to these fine-tuning hyper-parameters.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Identification of the Hyper-parameter Sensitivity Issue: A systematic evaluation revealing that existing alignment-stage and fine-tuning-stage defenses are highly sensitive to fine-tuning hyper-parameters (learning rate and epochs), often failing when these are set to values required for good downstream task performance.
- Proposal of Antidote: Introducing a novel post-fine-tuning stage solution named Antidote, which is designed to be agnostic to the training details of the fine-tuning stage.
- Core Philosophy: Antidote operates on the philosophy that removing harmful parameters can recover a model from harmful behaviors, irrespective of how those harmful parameters were formed.
- Methodology: Implementing Antidote through a one-shot pruning stage that uses the Wanda score on a re-alignment dataset to identify and remove harmful weights.
- Empirical Validation: Comprehensive experiments demonstrating that Antidote significantly reduces the harmful score (by up to 17.8% compared to SFT without defense) while largely maintaining fine-tuning accuracy (with a marginal loss of up to 1.83%). It also shows robustness across different harmful ratios and fine-tuning sample sizes, and even against benign fine-tuning attacks.
- Generalizability and Efficiency: Demonstrating that Antidote generalizes well across different fine-tuning datasets, alignment datasets, and LLM architectures (Llama2-7B, Mistral-7B, Gemma-7B, Llama3-8B). Furthermore, it introduces only a slight increase in clock time overhead and comparable GPU memory usage compared to SFT, making it a system-efficient defense.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with the following concepts:
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data to understand and generate human-like text. They are capable of performing various natural language processing tasks, from answering questions to writing creative content. Examples include OpenAI's GPT series, Google's Gemma, and Meta's Llama.
- Fine-tuning: A process where a pre-trained LLM (which has learned general language patterns) is further trained on a smaller, specific dataset to adapt it to a particular task or domain. This allows the model to become more specialized for user-specific needs without starting training from scratch.
- Safety Alignment: The process of training LLMs to adhere to ethical guidelines, avoid generating harmful, biased, or inappropriate content, and generally align their behavior with human values. Techniques like Reinforcement Learning from Human Feedback (RLHF) are commonly used for safety alignment. The goal is for the LLM to provide refusal responses (e.g., "I cannot assist with that request") when prompted with harmful queries.
- Harmful Fine-tuning Attacks: A security vulnerability where malicious or even unintentionally harmful data is included in the fine-tuning dataset. When an LLM is fine-tuned on such data, it can unlearn or forget its previously established safety alignment, leading it to generate harmful content upon request.
- Hyper-parameters: Configuration settings that are external to the model and whose values cannot be estimated from data. They are typically set before the training process begins and control how the model learns. In the context of this paper, key hyper-parameters include:
  - Learning Rate (lr): Determines the step size at each iteration while moving toward a minimum of a loss function. A larger learning rate means bigger steps, potentially leading to faster convergence but also to overshooting the minimum or instability. A smaller learning rate means smaller steps, which can be more stable but slower.
  - Epochs (ep): An epoch represents one complete pass through the entire training dataset. More epochs mean the model sees the data multiple times, potentially leading to better learning but also to overfitting or, in this context, more deviation from safety alignment if harmful data is present.
- LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into the Transformer architecture layers. This significantly reduces the number of trainable parameters for specific tasks, making fine-tuning more efficient and less memory-intensive. The paper uses LoRA for both alignment and fine-tuning.
- Pruning: A technique used to reduce the size and complexity of a neural network by removing (setting to zero) less important connections (weights). This can lead to faster inference, reduced memory footprint, and sometimes even improved generalization. One-shot pruning implies pruning is done once, after training, rather than iteratively during training.
- Wanda Score: A scoring mechanism used in the model sparsification (pruning) literature to quantify the importance of individual weights. It measures the importance of a parameter by its absolute value together with the norm of its corresponding input activation; a higher Wanda score indicates a more important parameter (see the sketch immediately after this list).
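To make the Wanda score concrete, here is a minimal, self-contained sketch (illustrative, not taken from the cited papers' code) of how the per-weight importance of a single linear layer could be computed: each weight is scored by its absolute value times the L2 norm of the input feature it multiplies, accumulated over a small calibration batch. The function name and tensor shapes are assumptions.

```python
# A minimal sketch of the Wanda importance score for one linear layer:
# score[i, j] = |W[i, j]| * ||X[:, j]||_2, where X stacks the layer's inputs
# over a small calibration set.
import torch

def wanda_scores(weight: torch.Tensor, calib_inputs: torch.Tensor) -> torch.Tensor:
    """weight: (out_features, in_features); calib_inputs: (num_tokens, in_features)."""
    feature_norms = calib_inputs.norm(p=2, dim=0)      # per-input-feature L2 norm
    return weight.abs() * feature_norms.unsqueeze(0)   # broadcast over output rows

# Toy usage: score a 4x3 layer against 10 calibration tokens.
torch.manual_seed(0)
print(wanda_scores(torch.randn(4, 3), torch.randn(10, 3)).shape)  # torch.Size([4, 3])
```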
3.2. Previous Works
The paper categorizes previous works into three main areas:
- Safety Alignment:
  - Methods like RLHF (Ouyang et al., 2022) and its variants (Dai et al., 2023; Bai et al., 2022; Wu et al., 2023; Dong et al., 2023; Rafailov et al., 2023; Yuan et al., 2023) are foundational for aligning LLMs with human values.
  - More recent solutions focus on augmenting alignment data (Liu et al., 2023a;b; Ye et al., 2023; Tekin et al., 2024).
  - These methods aim to ensure LLMs produce safe outputs initially.
- Harmful Fine-tuning Defenses: This is the most relevant category to Antidote and is divided into two sub-categories:
  - Alignment Stage Solutions: These methods modify the alignment stage to improve the model's robustness against harmful fine-tuning in later stages.
    - Vaccine (Huang et al., 2024d): Adds artificial perturbation in the alignment stage to simulate harmful embedding drift during fine-tuning, using minimax optimization to make the model immune.
    - RepNoise (Rosati et al., 2024b;a): Uses a representation noising technique to degrade the representation distribution of harmful data, making it harder for the model to learn harmful content generation.
    - Other examples: CTRL (Liu et al., 2024c), TAR (Tamirisa et al., 2024), Booster (Huang et al., 2024a), SN-Tune (Zhao et al., 2025b), T-Vaccine (Liu et al., 2024a), CTRAP (Yi et al., 2025b), KT-IPA (Cheng et al., 2025), SAM unlearning (Fan et al., 2025), Reward Neutralization (Cao, 2025), SEAM (Wang et al., 2025c).
  - Fine-tuning Stage Solutions: These methods modify the fine-tuning stage to prevent alignment knowledge forgetting while still learning user tasks.
    - LDIFS (Mukhoti et al., 2023): Introduces a regularizer to constrain the feature space drift of the fine-tuning iterate to remain close to that of the aligned model.
    - Lisa (Huang et al., 2024b): Alternately optimizes over alignment data and fine-tuning data, using a proximal regularizer to enforce proximity between iterates.
    - Other examples: Bianchi et al., 2023; Zong et al., 2024; Wang et al., 2024; Lyu et al., 2024; Qi et al., 2024a; Shen et al., 2024; Choi et al., 2024; Du et al., 2024; Li et al., 2025; Eiras et al., 2024; Li & Kim, 2025; Li et al., 2024b; Liu et al., 2024b; Zhao et al., 2025a; Liu et al., 2025; Li, 2025; Wu et al., 2025; Peng et al., 2025.
- Model Sparsification (Pruning):
  - The concept of model sparsification (Frankle & Carbin, 2018) explores finding sparse, trainable neural networks.
  - For LLMs, SparseGPT (Frantar & Alistarh, 2023) proposes layer-wise reconstruction for importance scoring. The Wanda score (Sun et al., 2023) utilizes a joint weight/activation metric to measure coordinate importance; Antidote directly borrows the Wanda score for identifying harmful parameters. OWL (Yin et al., 2023) builds on Wanda for further compression.
- Concurrent Post-Fine-tuning Defenses: The paper acknowledges several concurrent works that also aim at purifying the model after fine-tuning completes, such as RESTA (Bhardwaj et al., 2024), LAT (Casper et al., 2024), Safe LoRA (Hsu et al., 2024), SOMF (Yi et al., 2024c), self-contrastive decoding (Tong et al., 2024), IRR (Wu et al., 2024) and NLSR (Yi et al., 2024b) for neuron correction, SafetyLock (Zhu et al., 2024) for activation patching, and Panacea (Wang et al., 2025b) for optimizing a post-fine-tuning perturbation. The key differentiation highlighted by Antidote is its focus on the hyper-parameter sensitivity issue, which was not systematically studied by these prior works.
3.3. Technological Evolution
The field of LLM safety has evolved from initial safety alignment techniques (RLHF, SFT) to addressing jailbreaking and, more recently, harmful fine-tuning attacks. Initially, research focused on making LLMs safe from scratch. As LLMs became more widely adopted and customized via fine-tuning, the vulnerability introduced by user-provided data became apparent. This led to the development of defenses at different stages:
- Pre-alignment/Alignment Stage: Strengthening the base model's robustness.
- Fine-tuning Stage: Modifying the fine-tuning process itself to prevent safety degradation.
- Post-Fine-tuning Stage (Antidote's domain): Re-alignment or purification after fine-tuning has completed.

Antidote fits into this timeline as a post-fine-tuning solution, addressing a critical gap identified: the hyper-parameter sensitivity of existing defenses. It leverages techniques from model sparsification, a field primarily focused on efficiency, and repurposes them for safety.
3.4. Differentiation Analysis
Antidote differentiates itself from existing alignment-stage and fine-tuning-stage defenses primarily by its post-fine-tuning approach and its agnosticism to fine-tuning hyperparameters.
- Existing Alignment Stage Defenses (e.g., Vaccine, RepNoise): These modify the initial safety alignment process to make the base model more robust. However, they can still be overcome by aggressive fine-tuning (large learning rates or many epochs) as the model drifts too far from the initial robust state. Antidote works after this drift has occurred.
- Existing Fine-tuning Stage Defenses (e.g., Lisa, LDIFS): These introduce regularizers or modified training procedures during fine-tuning to keep the model close to its safe state. Their effectiveness is also tied to the fine-tuning hyper-parameters; large learning rates can make the regularization ineffective, causing the model to diverge. Antidote bypasses this by operating on the final corrupted model.
- Core Innovation (Hyper-parameter Agnosticism): Antidote's core difference is that it does not intervene during the fine-tuning process itself. Instead, it takes the fine-tuned (potentially corrupted) model and purifies it afterward. Its effectiveness is therefore not dependent on the learning rate, epochs, or other training settings chosen during the user's fine-tuning stage. This is a significant advantage, as users might need specific hyper-parameters for optimal downstream task performance.
- Methodological Innovation (Pruning for Safety): While pruning techniques (Wanda score) traditionally aim for model compression and efficiency, Antidote repurposes one-shot pruning to identify and remove harmful parameters specifically for safety re-alignment. This is a novel application of sparsification for safety.
4. Methodology
4.1. Principles
The core idea behind Antidote is rooted in the philosophy that harmful behaviors in an LLM are caused by specific harmful parameters that have been modified or strengthened during fine-tuning. By identifying and removing these harmful parameters (i.e., weights), the model can be "recovered" or "re-aligned" to its original safe state, regardless of the specific hyper-parameters or training dynamics used during the fine-tuning process. This post-fine-tuning approach makes Antidote agnostic to the training details of the fine-tuning stage.
4.2. Core Methodology In-depth (Layer by Layer)
Antidote operates as a three-stage pipeline, as shown in Figure 1 and Figure 4 (which are identical diagrams illustrating the overall process).
The three stages are:
- Safety Alignment: A pretrained LLM undergoes safety alignment (e.g., using RLHF or SFT) with alignment data (containing harmful prompt-safe answer pairs) to become a safety-aligned model. This is the initial step to make the LLM generally safe.
- User Fine-tuning: The safety-aligned model is then fine-tuned by users on a fine-tuning dataset. This dataset may contain a mixture of downstream task data and a certain percentage of harmful data (harmful prompt-harmful answer pairs), which can break the safety alignment and produce a corrupted model. This is the stage where the hyper-parameter sensitivity issue of existing defenses arises.
- One-shot Pruning (Antidote's Contribution): After user fine-tuning is complete and the model may have lost its safety alignment, Antidote intervenes. This stage re-aligns the model by identifying and removing the harmful parameters that cause it to generate unsafe content.

Here is a detailed breakdown of the one-shot pruning stage as described in Algorithm 1 (a code sketch of this procedure is given after Figure 4 below).

Algorithm 1 Antidote: a post-fine-tuning safety alignment

Input:
- Mask ratio $\alpha$: a hyper-parameter determining the percentage of weights to be pruned.
- Re-alignment dataset $D_{realign}$: a dataset specifically designed to contain harmful prompt-harmful answer pairs, used to identify the harmful parameters.
- Safety alignment-broken fine-tuned model $\boldsymbol{w}$: the weights of the LLM after user fine-tuning, which may have lost their safety alignment.

Output:
- The re-aligned model $\tilde{\boldsymbol{w}}$, ready for deployment.

Procedure:
- Calculate Importance Score (Identify Harmful Parameters): The first step is to quantify the "harmfulness" or importance of each individual weight in the fine-tuned model with respect to generating harmful content. This is done using a modified Wanda score, calculated over the re-alignment dataset. The Wanda score for each weight coordinate $i$ is $ S_i(\boldsymbol{w}, D_{realign}) = \frac{1}{|D_{realign}|} \sum_{x \in D_{realign}} |w_i| \cdot \| a_i(\boldsymbol{w}, x) \| $, where:
  - $S_i(\boldsymbol{w}, D_{realign})$: the Wanda score for the $i$-th weight coordinate of the model $\boldsymbol{w}$, calculated using the re-alignment dataset.
  - $\boldsymbol{w}$: the vector of all weights in the safety alignment-broken fine-tuned model.
  - $|D_{realign}|$: the total number of data points (samples) in the re-alignment dataset.
  - $\sum_{x \in D_{realign}}$: summation over all data points in the re-alignment dataset.
  - $|w_i|$: the absolute value of the $i$-th weight coordinate of the model; this term captures the magnitude of the weight.
  - $\| a_i(\boldsymbol{w}, x) \|$: the norm of the hidden activation associated with the $i$-th weight coordinate when the input is $x$ and the model weights are $\boldsymbol{w}$; this term reflects the influence of the input on that specific weight.
  Intuitively, a weight is considered important (and potentially harmful) if it has a large absolute value and is strongly activated by harmful data in the re-alignment dataset.
- Create Harmful Mask: After calculating the Wanda scores for all weights, a mask is created to identify the top harmful parameters. This mask is a binary vector (containing 0s and 1s) indicating which weights are to be removed: $ \boldsymbol{m} = \mathrm{Top}_{\alpha}\big(S(\boldsymbol{w}, D_{realign})\big) $, where:
  - $\boldsymbol{m}$: the harmful mask vector.
  - $\mathrm{Top}_{\alpha}(\cdot)$: a function that returns a binary mask, setting the mask to 1 for the top $\alpha$ fraction of weight coordinates with the highest Wanda scores and to 0 for the rest.
  - $\alpha$: the mask ratio, a hyper-parameter controlling the percentage of weights to be masked (i.e., identified as harmful). For example, if $\alpha = 0.2$, then 20% of the weights with the highest Wanda scores will be set to 1 in the mask.
- Remove Harmful Parameters (Pruning Operation): Finally, the harmful parameters are removed from the fine-tuned model using the created harmful mask, which effectively sets the identified harmful weights to zero: $ \tilde{\boldsymbol{w}} = (\boldsymbol{1} - \boldsymbol{m}) \odot \boldsymbol{w} $, where:
  - $\tilde{\boldsymbol{w}}$: the re-aligned model that is now considered safe and ready for deployment.
  - $\boldsymbol{1}$: a vector of ones with the same dimensions as $\boldsymbol{m}$ and $\boldsymbol{w}$.
  - $\boldsymbol{1} - \boldsymbol{m}$: this operation inverts the mask. Where $\boldsymbol{m}$ has a 1 (a harmful weight to be removed), $\boldsymbol{1} - \boldsymbol{m}$ has a 0; where $\boldsymbol{m}$ has a 0 (a non-harmful weight to be kept), $\boldsymbol{1} - \boldsymbol{m}$ has a 1.
  - $\odot$: the Hadamard product (element-wise multiplication), which multiplies each weight in $\boldsymbol{w}$ by the corresponding element of the inverted mask.
  Effectively, any weight in $\boldsymbol{w}$ that corresponds to a 1 in $\boldsymbol{m}$ (i.e., identified as harmful) is multiplied by 0 and thereby pruned, while weights corresponding to 0 in $\boldsymbol{m}$ are multiplied by 1 and retain their original value. The re-aligned model is then deployed to serve users' customized tasks, having recovered its safety alignment.
The following figure (Figure 4 from the original paper) shows the system overview of Antidote:
The figure is the three-stage pipeline diagram from the Antidote paper, covering the alignment stage, the fine-tuning stage, and the one-shot pruning stage, and illustrates how a single pruning step removes harmful parameters to restore model safety.
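To illustrate the one-shot pruning stage of Algorithm 1 in code, below is a simplified PyTorch sketch. It is not the authors' implementation: it assumes a hypothetical dictionary activation_norms mapping each linear layer's name to the per-input-feature L2 norms collected (e.g., with forward hooks) while running the re-alignment dataset through the fine-tuned model, and it zeroes the top mask_ratio fraction of weights ranked by a Wanda-style score.

```python
# A simplified sketch of Antidote's post-fine-tuning one-shot pruning (Algorithm 1),
# not the authors' implementation. `activation_norms[name]` is assumed to hold the
# per-input-feature L2 norms of layer `name`, gathered on the re-alignment
# (harmful) dataset with the fine-tuned model.
import torch
import torch.nn as nn

def antidote_prune(model: nn.Module, activation_norms: dict,
                   mask_ratio: float = 0.2) -> nn.Module:
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear) or name not in activation_norms:
            continue
        with torch.no_grad():
            # Wanda-style harmfulness score on harmful data: |w_ij| * ||x_j||_2
            scores = module.weight.abs() * activation_norms[name].unsqueeze(0)
            k = int(mask_ratio * scores.numel())  # number of weights to prune
            if k == 0:
                continue
            # Threshold so that roughly the top-k scores form the harmful mask.
            threshold = scores.flatten().kthvalue(scores.numel() - k).values
            harmful_mask = scores > threshold
            # Pruning: w <- (1 - m) ⊙ w, i.e. zero out the masked (harmful) weights.
            module.weight.mul_((~harmful_mask).to(module.weight.dtype))
    return model
```

Usage would be along the lines of `antidote_prune(finetuned_model, norms, mask_ratio=0.2)` before deployment; the per-layer top-k selection here is a simplification of the harmful mask $\mathrm{Top}_{\alpha}(\cdot)$, and collecting the norms on the re-alignment dataset is what makes the mask target harmful behavior rather than general capability.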
5. Experimental Setup
5.1. Datasets
The experiments used several types of datasets:
- Pre-trained Models:
  - Llama2-7B (default backbone for most evaluations)
  - Mistral-7B
  - Gemma-7B
  - Llama3-8B (for advanced model evaluation)
  These are all mainstream, open-source LLMs.
- Harmful Data-related Datasets: All harmful data are sampled from BeaverTails (Ji et al., 2023), a human-preference dataset for LLM safety alignment.
  - Alignment Dataset: Contains alignment data (harmful prompt-safe answer pairs), sampled from BeaverTails with is_safe=True labels.
  - Fine-tuning Dataset: A mixture of downstream task data and harmful data: a fraction p of harmful data (harmful prompt-harmful answer pairs), sampled from BeaverTails with is_safe=False, and a fraction 1-p of downstream task data. The harmful data in this dataset is distinct from the harmful data in the re-alignment dataset.
  - Re-alignment Dataset ($D_{realign}$): Solely constituted of harmful data (harmful prompt-harmful answer pairs), also sampled from BeaverTails with is_safe=False. This dataset is used by Antidote to calculate Wanda scores for pruning, and is distinct from the harmful data in the fine-tuning dataset.
- Downstream Task Datasets: Four different datasets were used for fine-tuning tasks:
  - SST2 (Stanford Sentiment Treebank v2): A sentiment analysis dataset for classifying movie review sentences as positive or negative.
  - AGNEWS: A topic classification dataset of news articles into four categories (World, Sports, Business, Sci/Tech).
  - GSM8K (Grade School Math 8K): A dataset of 8,500 grade school math word problems.
  - AlpacaEval: A dataset used for evaluating instruction-following capabilities of LLMs.
- Data Sample Example: The paper indicates the following prompt format for constructing supervised datasets (a small formatting helper is sketched after this list):

  Prompt: "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Instruction: {instruction} Input: {input} Response:"
  Output: {output}

  For AGNEWS and SST2, the paper references Huang et al., 2024d;b, which are text classification tasks: the instruction could be "Classify the sentiment of the following text" or "Categorize the news article", with the input being the text and the output being the label. For GSM8K, the instruction would involve solving a math problem, with the input being the problem statement and the output the solution. For BeaverTails, the instruction would be a harmful prompt, the input could be empty or context, and the output would be either a safe refusal or a harmful response.
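For illustration only, a tiny helper showing how a supervised sample could be rendered with the template above; the function name and example values are assumptions, not taken from the paper's repository.

```python
# Illustrative only: render one supervised sample with the template quoted above.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the "
    "request. Instruction: {instruction} Input: {input} Response: "
)

def build_example(instruction: str, input_text: str, output: str) -> str:
    return PROMPT_TEMPLATE.format(instruction=instruction, input=input_text) + output

print(build_example(
    "Classify the sentiment of the following text.",
    "A thoughtful, beautifully shot film.",
    "positive",
))
```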
5.2. Evaluation Metrics
Two main metrics are used for evaluation:
- Finetune Accuracy (FA):
  - Conceptual Definition: This metric measures the performance of the fine-tuned model on its specific downstream task (e.g., sentiment analysis, math problem solving). It quantifies how well the model performs its intended function after fine-tuning.
  - Mathematical Formula: For classification tasks (like SST2, AGNEWS), Finetune Accuracy is the proportion of correctly predicted labels out of the total number of predictions: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: The count of instances where the model's output matches the true label.
    - Total Number of Predictions: The total number of instances in the test set.
    For generative tasks like GSM8K or AlpacaEval, accuracy refers to the percentage of problems solved correctly or instructions followed accurately, often requiring more complex parsing or human evaluation for correctness.
- Harmful Score (HS):
  - Conceptual Definition: This metric quantifies the degree to which a model generates harmful content when presented with malicious instructions. A higher Harmful Score indicates a greater lack of safety alignment.
  - Mathematical Formula: The paper defines the Harmful Score as the ratio of flagged unsafe outputs: $ \text{Harmful Score} = \frac{\text{Number of Flagged Unsafe Outputs}}{\text{Total Number of Malicious Instructions}} \times 100\% $
  - Symbol Explanation:
    - Number of Flagged Unsafe Outputs: The count of model responses to malicious instructions that are identified as unsafe by a moderation model.
    - Total Number of Malicious Instructions: The total number of harmful prompts given to the model during testing.
    The paper uses a moderation model from (Ji et al., 2023) to flag unsafe outputs, and a sample of 1000 harmful instructions from BeaverTails (Ji et al., 2023) is used to calculate this score. A minimal computation sketch for both metrics follows this list.
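A minimal sketch of how these two metrics could be computed, assuming moderation_flags holds the boolean unsafe/safe decisions of a moderation model over responses to the harmful test instructions, and predictions/labels come from the downstream task test set; all names are illustrative.

```python
# Minimal sketch of the two metrics; variable names are illustrative.
def harmful_score(moderation_flags: list) -> float:
    """Percentage of responses to malicious instructions flagged unsafe by the moderation model."""
    return 100.0 * sum(moderation_flags) / len(moderation_flags)

def finetune_accuracy(predictions: list, labels: list) -> float:
    """Percentage of downstream-task predictions matching the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

print(harmful_score([True, False, False, True]))          # 50.0
print(finetune_accuracy(["pos", "neg"], ["pos", "pos"]))  # 50.0
```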
5.3. Baselines
The paper compares Antidote against five baselines:
- SFT (Supervised Fine-Tuning): The standard approach where the model is fine-tuned on both alignment data and user fine-tuning data without any specific safety defense mechanism during the fine-tuning stage. It serves as a baseline showing the vulnerability without defense.
- RepNoise (Rosati et al., 2024a): An alignment-stage solution that modifies the alignment process to improve robustness by degrading harmful data representations to random Gaussian noise. It keeps the fine-tuning stage as SFT.
- Vaccine (Huang et al., 2024d): Another alignment-stage solution that vaccinates the model by adding embedding perturbations during alignment to make it robust against harmful fine-tuning. It also keeps the fine-tuning stage as SFT.
- Lisa (Huang et al., 2024b): A fine-tuning-stage solution that modifies the fine-tuning process itself. It alternately optimizes over alignment data and fine-tuning data and uses a proximal regularizer to enforce proximity between iterates.
- LDIFS (Mukhoti et al., 2023): Another fine-tuning-stage solution that introduces a regularizer during fine-tuning to enforce that the iterate's embedding remains in close proximity to that of the aligned model.

These baselines are representative as they cover the two main existing categories of defenses (alignment-stage and fine-tuning-stage) against harmful fine-tuning.
5.4. Training Details and Hyper-parameters
- Parameter-Efficient Fine-tuning: All alignment and fine-tuning processes utilize LoRA (Hu et al., 2021). The rank of the LoRA adaptor is set to 256 for both tasks (a sketch of such a setup appears after this list).
- Alignment Stage: 5000 safety samples; AdamW optimizer; 20 epochs.
- Fine-tuning Stage (Default Settings): 5000 total samples (n); harmful data percentage p = 0.2 (i.e., 20% of the fine-tuning data is harmful); AdamW optimizer; 20 epochs; SST2 as the default dataset.
- Antidote-Specific Hyper-parameters (Default Settings): mask ratio α = 0.2 (meaning 20% of the weights with the highest Wanda scores are pruned), with a separately tuned mask ratio for GSM8K; a re-alignment dataset of 2000 harmful samples is used to form D_realign.
- Baselines' Hyper-parameters: Vaccine's perturbation intensity, RepNoise's loss coefficients, Lisa's proximal penalty, and LDIFS's regularization coefficient (tuned from a set of values).
- Hardware: All experiments were conducted on an H100 GPU.
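As an illustration of the kind of setup described above (not the paper's exact configuration), a rank-256 LoRA adaptor with an AdamW optimizer could be instantiated with the Hugging Face peft library as follows; the target modules, dtype, and learning rate shown are assumptions.

```python
# Sketch of a rank-256 LoRA setup with AdamW; target modules, dtype, and lr are
# illustrative assumptions, not taken from the paper.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=256, lora_alpha=256, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr value is an assumption
```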
6. Results & Analysis
6.1. Core Results Analysis
Robustness to Harmful Ratio
The following are the results from Table 1 of the original paper:
| Methods | HS clean | HS p=0.05 | HS p=0.1 | HS p=0.2 | HS p=0.5 | HS Average | FA clean | FA p=0.05 | FA p=0.1 | FA p=0.2 | FA p=0.5 | FA Average |
| SFT | 52.30 | 76.70 | 79.00 | 79.40 | 80.20 | 73.52 | 95.87 | 95.18 | 95.07 | 95.18 | 93.69 | 95.00 |
| Repnoise | 42.40 | 79.20 | 79.50 | 77.90 | 82.60 | 72.32 | 95.07 | 94.84 | 94.84 | 94.38 | 94.61 | 94.75 |
| Vaccine | 44.80 | 80.20 | 80.00 | 81.50 | 81.90 | 73.68 | 95.53 | 95.53 | 94.04 | 95.18 | 94.04 | 94.86 |
| Lisa | 53.00 | 60.90 | 64.80 | 68.20 | 72.10 | 63.80 | 93.92 | 93.69 | 93.58 | 93.23 | 91.17 | 93.12 |
| LDIFS | 51.70 | 67.70 | 68.80 | 72.30 | 71.80 | 66.46 | 93.46 | 93.23 | 93.69 | 93.23 | 94.04 | 93.53 |
| Antidote | 52.90 | 61.20 | 61.20 | 64.60 | 64.50 | 60.88 | 93.58 | 93.46 | 93.12 | 93.35 | 91.74 | 93.05 |
Antidote demonstrates the best defense performance across varying harmful ratios (p), achieving the lowest average harmful score (60.88). This is a significant 11.56% reduction compared to SFT (73.52). It maintains competitive fine-tune accuracy (93.05 average), with only a marginal 1.45% drop from SFT (95.00). Notably, Antidote's performance remains consistent even as the harmful ratio increases, unlike other methods (e.g., Lisa's harmful score increases from 60.90 at p=0.05 to 72.10 at p=0.5). This validates Antidote's post-fine-tuning design, making it robust to how harmful data is introduced. The alignment stage solutions (RepNoise, Vaccine) generally perform worse than SFT at higher harmful ratios, suggesting they are less effective in these attack scenarios.
Robustness to Fine-tuning Samples
The following are the results from Table 2 of the original paper:
| Methods | HS n=100 | HS n=1000 | HS n=2000 | HS n=3000 | HS n=5000 | HS Average | FA n=100 | FA n=1000 | FA n=2000 | FA n=3000 | FA n=5000 | FA Average |
| SFT | 65.50 | 76.90 | 77.80 | 80.70 | 79.40 | 76.06 | 92.20 | 94.72 | 94.27 | 94.50 | 95.18 | 94.17 |
| Repnoise | 66.50 | 77.60 | 78.80 | 78.60 | 77.90 | 75.88 | 89.45 | 92.66 | 93.69 | 94.72 | 94.38 | 92.98 |
| Vaccine | 66.40 | 79.00 | 78.60 | 81.10 | 81.50 | 77.32 | 90.48 | 93.92 | 94.95 | 95.30 | 95.18 | 93.97 |
| Lisa | 52.80 | 52.40 | 54.00 | 64.30 | 68.20 | 58.34 | 26.72 | 33.72 | 49.54 | 91.17 | 93.23 | 58.88 |
| LDIFS | 55.70 | 64.60 | 67.10 | 68.90 | 72.30 | 65.72 | 87.73 | 91.17 | 92.32 | 92.43 | 93.23 | 91.38 |
| Antidote | 57.00 | 60.70 | 62.80 | 61.70 | 64.60 | 61.36 | 90.02 | 92.43 | 93.12 | 93.00 | 93.35 | 92.38 |
Antidote achieves the best defense performance among baselines, reducing the harmful score by 13.42% on average. It maintains robust performance across different numbers of fine-tuning samples (n), with harmful scores remaining relatively stable. Lisa shows a lower average harmful score (58.34) but at the cost of unacceptably low fine-tune accuracy at smaller sample sizes (e.g., 26.72% for n=100). This highlights the trade-off. Antidote provides a good balance, effectively reducing harmful scores while keeping fine-tune accuracy high (92.38 average).
Robustness to Learning Rate in Fine-tuning
The following are the results from Table 3 of the original paper:
| Methods | HS lr=1e-7 | HS lr=1e-6 | HS lr=1e-5 | HS lr=1e-4 | HS lr=1e-3 | HS Average | FA lr=1e-7 | FA lr=1e-6 | FA lr=1e-5 | FA lr=1e-4 | FA lr=1e-3 | FA Average |
| SFT | 52.80 | 70.30 | 80.10 | 77.80 | 79.80 | 72.16 | 4.30 | 14.00 | 23.10 | 21.90 | 23.30 | 17.32 |
| Repnoise | 52.50 | 70.10 | 79.00 | 80.20 | 75.50 | 71.46 | 4.80 | 12.60 | 24.90 | 23.50 | 24.70 | 18.10 |
| Vaccine | 46.50 | 66.00 | 79.40 | 80.60 | 77.50 | 70.00 | 1.80 | 10.90 | 25.50 | 24.20 | 25.80 | 17.64 |
| Lisa | 52.30 | 55.00 | 64.40 | 73.20 | 77.30 | 64.44 | 4.00 | 5.70 | 13.60 | 21.90 | 24.70 | 13.98 |
| LDIFS | 53.20 | 56.10 | 59.00 | 68.50 | 78.50 | 63.06 | 4.00 | 4.80 | 5.40 | 6.10 | 14.10 | 6.88 |
| Antidote | 53.50 | 61.80 | 65.60 | 65.30 | 68.80 | 63.00 | 4.10 | 11.20 | 17.50 | 16.10 | 20.40 | 13.86 |
Using the GSM8K dataset, Antidote reduces the average harmful score by 6.56% (from 72.16 for SFT to 63.00 for Antidote) with only a 0.38% average fine-tune accuracy drop. While Lisa and LDIFS achieve larger harmful score reductions, their fine-tune accuracy is severely impacted (e.g., LDIFS's average FA is 6.88%). This confirms the hyper-parameter sensitivity issue for Lisa and LDIFS at higher learning rates (lr=1e-4 and lr=1e-3), where their harmful scores escalate significantly. In contrast, Antidote's harmful score is less susceptible to the exact learning rate setting, validating its agnostic design. The trade-off between safety and accuracy is clear, and Antidote provides a more balanced solution.
The following figure (Figure 2 from the original paper) shows the harmful score and finetune accuracy with different learning rates:
The figure is a chart showing, with the number of fine-tuning epochs fixed at 20, how the harmful score and fine-tune accuracy of the five methods change under different learning rates.
Robustness to Training Epochs in Fine-tuning
The following are the results from Table 4 of the original paper:
| Methods | HS ep=1 | HS ep=5 | HS ep=10 | HS ep=20 | HS ep=40 | HS Average | FA ep=1 | FA ep=5 | FA ep=10 | FA ep=20 | FA ep=40 | FA Average |
| SFT | 76.50 | 78.90 | 79.90 | 77.80 | 78.70 | 78.36 | 21.00 | 25.80 | 26.50 | 21.90 | 24.60 | 23.96 |
| Repnoise | 76.30 | 79.50 | 79.00 | 80.20 | 80.80 | 79.16 | 19.70 | 26.20 | 26.10 | 23.50 | 22.70 | 23.64 |
| Vaccine | 75.80 | 82.10 | 79.60 | 80.60 | 80.40 | 79.70 | 20.40 | 26.00 | 25.10 | 24.20 | 22.60 | 23.66 |
| Lisa | 55.40 | 54.80 | 71.50 | 73.20 | 75.00 | 65.98 | 4.50 | 4.50 | 21.70 | 21.90 | 24.40 | 15.40 |
| LDIFS | 56.70 | 61.50 | 64.90 | 68.50 | 72.40 | 64.80 | 4.90 | 5.00 | 5.70 | 6.10 | 6.10 | 5.56 |
| Antidote | 61.50 | 66.80 | 66.60 | 65.30 | 63.60 | 64.76 | 13.60 | 17.80 | 19.80 | 16.10 | 13.90 | 16.24 |
Similar to the learning rate, Antidote shows remarkable robustness to varying numbers of fine-tuning epochs. While other defenses (Lisa, LDIFS) tend to see their harmful scores increase with more epochs (indicating safety degradation), Antidote's harmful score actually decreases slightly from ep=10 to ep=40 (66.60 to 63.60). This further supports the claim that Antidote is less susceptible to fine-tuning hyper-parameters. Again, Lisa and LDIFS achieve lower average harmful scores but with substantial sacrifices in fine-tune accuracy at lower epochs.
The following figure (Figure 3 from the original paper) shows the harmful score and finetune accuracy with different finetuning epochs:
The figure is the chart for Figure 3 of the paper, showing the harmful score and fine-tune accuracy of the various methods under different numbers of fine-tuning epochs. With the fine-tuning learning rate fixed at 1e-5, the harmful score generally rises as epochs increase, while fine-tune accuracy remains high overall.
Robustness to Benign Fine-tuning Attack
The following are the results from Table 5 of the original paper:
| Methods | Harmful Score | Fine-tune Accuracy |
| SFT | 61.50 | 27.60 |
| RepNoise | 66.10 | 27.40 |
| Vaccine | 58.90 | 26.60 |
| LDIFS | 64.40 | 6.70 |
| Lisa | 59.20 | 27.60 |
| Antidote | 57.10 | 27.80 |
Antidote effectively reduces the harmful score (57.10) even when the fine-tuning is on benign data (GSM8K), without significantly hurting fine-tune accuracy (27.80). This demonstrates its ability to mitigate safety degradation from benign fine-tuning as well, a known attack vector where seemingly harmless data can still degrade safety. LDIFS again shows a low harmful score but with a drastically reduced fine-tune accuracy (6.70).
6.2. Generalizations on Datasets and Models
Generalization to Fine-tuning Datasets
The following are the results from Table 6 of the original paper:
| Methods | SST2 HS | SST2 FA | AGNEWS HS | AGNEWS FA | GSM8K HS | GSM8K FA | AlpacaEval HS | AlpacaEval FA | Average HS | Average FA |
| SFT | 79.40 | 95.18 | 79.60 | 92.70 | 77.80 | 21.90 | 73.80 | 43.27 | 77.65 | 63.26 |
| Repnoise | 77.90 | 94.38 | 82.30 | 92.20 | 80.20 | 23.50 | 73.50 | 42.00 | 78.90 | 63.14 |
| Vaccine | 81.50 | 95.18 | 81.10 | 93.00 | 80.60 | 24.20 | 73.40 | 40.10 | 79.15 | 63.12 |
| Lisa | 68.20 | 93.23 | 74.80 | 90.80 | 73.20 | 21.90 | 65.20 | 39.90 | 72.45 | 61.92 |
| LDIFS | 72.30 | 93.23 | 69.60 | 87.10 | 68.50 | 6.10 | 66.60 | 39.81 | 69.25 | 56.56 |
| Antidote | 64.60 | 93.35 | 69.50 | 88.00 | 65.30 | 16.10 | 60.50 | 41.83 | 64.98 | 59.82 |
Antidote generalizes well across different fine-tuning tasks (SST2, AGNEWS, GSM8K, AlpacaEval). On average, it reduces the harmful score by 11.75% compared to SFT (from 77.65 to 64.98) with an average fine-tune accuracy loss of 3.08% (from 63.26 to 59.82). While the fine-tune accuracy loss is higher than in some other tables, it is noted that the mask ratio (α) was not specifically tuned for each dataset, indicating potential for further optimization.
Generalization to Alignment Datasets
The following are the results from Table 7 of the original paper:
| Methods | HS p=0 | FA p=0 | HS p=0.05 | FA p=0.05 | HS p=0.1 | FA p=0.1 | HS p=0.2 | FA p=0.2 | HS p=0.5 | FA p=0.5 |
| SFT | 13.5 | 29.2 | 80.3 | 28.2 | 78.8 | 28.1 | 78.6 | 26.8 | 82.3 | 24.1 |
| Lisa | 45.5 | 29.7 | 67.8 | 28.5 | 75.5 | 28.5 | 78.7 | 27.2 | 78.7 | 24.1 |
| Antidote | 2.3 | 22.2 | 11.3 | 22.6 | 15.5 | 21.9 | 21.8 | 20.6 | 36.5 | 19.3 |
When using a stronger safety alignment dataset (BeaverTails refusal, from Rosati et al., 2024a), Antidote achieves even better defense performance. For instance, at p=0.2, Antidote significantly reduces the harmful score to 21.8, which is over a 40% reduction compared to SFT (78.6) and Lisa (78.7). This indicates that Antidote is compatible with and benefits from stronger initial safety alignment, leading to better overall safety.
Generalization to Models
The following are the results from Table 8 of the original paper:
| Methods | Llama2-7B HS | Llama2-7B FA | Mistral-7B HS | Mistral-7B FA | Gemma-7B HS | Gemma-7B FA | Average HS | Average FA |
| SFT | 79.40 | 95.18 | 80.30 | 95.99 | 80.90 | 96.22 | 80.20 | 95.80 |
| Repnoise | 77.90 | 94.38 | 79.00 | 94.95 | 80.70 | 88.76 | 79.20 | 92.70 |
| Vaccine | 81.50 | 95.18 | 80.60 | 94.04 | 79.10 | 94.72 | 80.40 | 94.65 |
| Lisa | 68.20 | 93.23 | 65.30 | 95.07 | 75.40 | 96.22 | 69.63 | 94.84 |
| LDIFS | 72.30 | 93.23 | 69.50 | 92.09 | 72.70 | 93.35 | 71.50 | 92.89 |
| Antidote | 64.60 | 93.35 | 64.80 | 94.95 | 59.40 | 94.04 | 62.93 | 94.11 |
Antidote generalizes across different LLM architectures (Llama2-7B, Mistral-7B, Gemma-7B). It achieves harmful score reductions of 11.6% (Llama2-7B), 20.0% (Mistral-7B), and 22.5% (Gemma-7B) compared to SFT. These reductions come with minor fine-tune accuracy drops of 1.49%, 0.92%, and 1.72% respectively. Notably, Antidote appears more effective in reducing harmful scores for stronger backbone models (e.g., Gemma-7B had the highest reduction). This indicates its potential for robust application across a range of LLMs.
The following are the results from Table 9 of the original paper:
| Methods | Harmful Score | Finetune Accuracy |
| SFT | 80.30 | 42.40 |
| Vaccine | 77.50 | 36.90 |
| RepNoise | 78.30 | 41.40 |
| Lisa | 74.40 | 41.30 |
| LDIFS | 71.50 | 15.90 |
| Antidote | 71.20 | 39.00 |
Antidote also shows effective performance on the more advanced Llama3-8B model. It achieves the lowest harmful score (71.20) among all methods, notably outperforming LDIFS which has a similar harmful score (71.50) but with a much lower fine-tune accuracy (15.90 vs 39.00). This confirms Antidote's generalizability to state-of-the-art LLMs.
6.3. Statistical and System Evaluation
Harmful Embedding Drift (HED)
The following figure (Figure 5 from the original paper) shows the Harmful Embedding Drift (HED) under different learning rates and epochs in the fine-tuning stage:
The figure is a chart showing how the Harmful Embedding Drift (HED) varies under different learning rates and numbers of training epochs in the fine-tuning stage. The results show that Antidote maintains a low HED value across all settings, indicating strong safety robustness.
HED measures the norm of the difference between the hidden embedding of the aligned model and that of the fine-tuned model over the same alignment data. The results show that Antidote consistently maintains HED at a small scale. While SFT and Antidote share identical processes in their first two stages (leading to similar HED before Antidote's intervention), the one-shot pruning in Antidote's post-fine-tuning stage significantly reduces the HED. This indicates that removing the identified harmful parameters effectively recovers the alignment knowledge and mitigates the harmful embedding drift issue. In contrast, other methods (RepNoise, Vaccine, Lisa, LDIFS) show increasing HED with larger learning rates and epochs, demonstrating their hyper-parameter sensitivity.
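A minimal sketch of the HED statistic as described here: the norm of the difference between hidden embeddings produced by the aligned model and the fine-tuned model on the same alignment data. Averaging over samples and the choice of hidden layer are assumptions, not specified details from the paper.

```python
# Minimal sketch of Harmful Embedding Drift (HED): the norm of the difference
# between aligned-model and fine-tuned-model hidden embeddings on the same data.
import torch

def harmful_embedding_drift(aligned_emb: torch.Tensor, finetuned_emb: torch.Tensor) -> float:
    """Both tensors: (num_samples, hidden_dim) hidden embeddings on the same alignment data."""
    return torch.linalg.vector_norm(aligned_emb - finetuned_emb, dim=-1).mean().item()

print(harmful_embedding_drift(torch.zeros(4, 8), torch.ones(4, 8)))  # ≈ 2.83 (sqrt(8))
```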
Output Logit Drift Visualization
The following figure (Figure 6 from the original paper) shows the visualization of output logit for harmful and GSM8K samples:
The figure is the Figure 6 chart visualizing the output logit drift on harmful samples and GSM8K samples before and after pruning. The left panel compares the drift before and after pruning with Antidote, and the right panel shows the random-pruning counterpart, reflecting how the different pruning methods affect model outputs.
This visualization compares the output logit drift caused by Antidote's pruning versus random pruning. The output logit represents the raw, unnormalized prediction scores from the model's final layer. Antidote results in a smaller logit drift (13058) for GSM8K (benign) samples compared to random pruning (22000). At the same time, Antidote achieves a similar drift for harmful samples (24469) as random pruning (26172). This suggests that Antidote's targeted pruning effectively shifts the logits of harmful outputs towards a more benign state without severely disrupting the logits for benign tasks, thereby preserving general performance.
System Performance
The following are the results from Table 10 of the original paper:
| Methods | Clock time: Alignment (h) | Clock time: Fine-tuning (h) | Clock time: Post-FT (h) | Clock time: Sum (h) | GPU memory: Alignment (GB) | GPU memory: Fine-tuning (GB) | GPU memory: Post-FT (GB) | GPU memory: Max (GB) |
| SFT | 0.92 (1x) | 0.78 (1x) | 0 | 1.70 (1x) | 35.45 (1x) | 33.06 (1x) | 0 | 35.45 (1x) | |
| Repnoise | 1.97 (2.14x) | 0.78 (1x) | 0 | 2.75 (1.62x) | 75.26 (2.12x) | 33.06 (1x) | 0 | 75.26 (2.12x) | |
| Vaccine | 1.84 (2x) | 0.78 (1x) | 0 | 2.63 (1.54x) | 56.46 (1.71x) | 33.06 (1x) | 0 | 56.46 (1.71x) | |
| Lisa | 0.92 (1x) | 0.80 (1.03x) | 0 | 1.72 (1.01x) | 35.45 (1x) | 52.95 (1.60x) | 0 | 52.95 (1.49x) |
| LDIFS | 0.92 (1x) | 1.19 (1.53x) | 0 | 2.11 (1.24x) | 35.45 (1x) | 64.53 (1.95x) | 0 | 64.53 (1.82x) |
| Antidote | 0.92 (1x) | 0.78 (1x) | 0.04 | 1.78 (1.02x) | 35.45 (1x) | 33.06 (1x) | 22.35 | 35.45 (1x) | |
The system evaluation shows that Antidote is highly efficient. It introduces only a slight increase in total clock time (1.02x compared to SFT's 1.00x) and maintains the same GPU memory usage (1x) for the overall pipeline. The extra overhead for Antidote comes from the Post-FT stage (0.04 hours for Wanda score calculation). In stark contrast, alignment stage solutions (RepNoise, Vaccine) incur significant overhead in the alignment stage (e.g., RepNoise is 2.14x clock time and 2.12x GPU memory). Fine-tuning stage solutions (Lisa, LDIFS) also increase GPU memory (1.60x and 1.95x respectively) and sometimes clock time (LDIFS is 1.53x in fine-tuning). This makes Antidote a practical and scalable solution.
6.4. Hyper-parameters Analysis and Ablation Study
Impact of Mask Ratio ()
The following are the results from Table 11 of the original paper:
| | α=0.01 | α=0.05 | α=0.1 | α=0.15 | α=0.2 | α=0.25 |
| HS | 73.60 | 68.70 | 64.60 | 58.90 | 58.40 | 57.00 |
| FA | 94.95 | 94.50 | 93.35 | 91.06 | 86.58 | 80.05 |
As the mask ratio (α) increases, both the harmful score (HS) and the fine-tune accuracy (FA) decrease. This is an expected trade-off: pruning more parameters (higher α) leads to better safety (lower HS) but also removes more task-relevant parameters, thus reducing FA. The results show a clear inverse relationship: from α=0.01 to α=0.25, HS drops from 73.60 to 57.00, while FA drops from 94.95 to 80.05. This demonstrates that α is a crucial hyper-parameter for balancing safety and utility. A larger mask ratio could also offer benefits in model acceleration, though that is considered future work.
Necessity of Re-alignment Dataset
The following are the results from Table 12 of the original paper:
| | p=0.05 | p=0.1 | p=0.2 | p=0.5 | Average |
| HS (w/ harmful data) | 63.10 | 68.30 | 68.80 | 69.20 | 67.35 |
| HS (w/ fine-tuning data) | 63.30 (+0.20) | 69.80 (+1.50) | 68.50 (-0.30) | 70.50 (+1.30) | 68.03 (+0.68) |
| HS (w/ benign data) | 63.80 (+0.70) | 69.70 (+1.40) | 69.20 (+0.40) | 71.20 (+2.00) | 68.48 (+1.13) |
This ablation study confirms the necessity of using a dedicated re-alignment dataset (composed solely of harmful data) for calculating the Wanda score. Using the fine-tuning data (which contains benign and harmful mixture) or benign data (fine-tuning data without harmful samples) for Wanda score calculation leads to higher harmful scores compared to using the dedicated harmful re-alignment data. Specifically, using benign data results in the worst performance, as it cannot effectively identify harmful parameters. This validates Antidote's design choice of a specialized re-alignment dataset.
Impact of the Size of Re-alignment Dataset
The following are the results from Table 13 of the original paper:
| |Drealign| | 0 | 5 | 10 | 100 | 1k | 2k | 5k |
| HS | 72.1 | 70.60 | 70.30 | 70.20 | 69.30 | 69.20 | 69.40 |
| HS change | (0) | (-1.5) | (-1.8) | (-1.9) | (-2.8) | (-2.9) | (-2.7) |
The results show a general trend: increasing the number of harmful samples in the re-alignment dataset (|D_realign|) tends to reduce the harmful score. This is because a larger dataset better reflects the true harmful distribution, allowing for more precise identification of harmful parameters. The benefit of increasing samples diminishes after a certain point; 1000 samples already yield a significant reduction in HS, and further increases to 2000 or 5000 samples provide only marginal additional benefits. The case where |D_realign| = 0 implies the Wanda score reduces to the weight magnitude, yielding the highest harmful score (72.1), further proving the utility of re-alignment data. The practicality of collecting 1k harmful samples supports the feasibility of Antidote.
6.5. Extensions
The following are the results from Table 14 of the original paper:
| Methods | HS p=0.1 | HS p=0.2 | HS p=0.5 | HS Average | FA p=0.1 | FA p=0.2 | FA p=0.5 | FA Average |
| Antidote | 61.20 | 64.60 | 64.50 | 63.43 | 93.12 | 93.35 | 91.74 | 92.74 |
| V-S-A | 58.90 | 62.30 | 61.70 | 60.97 (-2.46) | 94.04 | 93.00 | 91.74 | 92.93 (+0.19) |
| S-L-A | 61.10 | 61.60 | 60.90 | 61.20 (-2.23) | 91.28 | 92.89 | 91.86 | 92.01 (-0.73) |
| V-L-A | 63.70 | 63.70 | 60.60 | 62.67 (-0.76) | 93.12 | 93.58 | 91.51 | 92.74 (0) |
This section explores combining Antidote with other defenses due to its post-fine-tuning nature.
- V-S-A (Vaccine + SFT + Antidote): Uses Vaccine in the alignment stage, standard SFT in fine-tuning, and then Antidote. It achieves the best safety performance, reducing the average harmful score by 2.46 points compared to vanilla Antidote (from 63.43 to 60.97), while slightly increasing average fine-tune accuracy by 0.19 points (from 92.74 to 92.93). This suggests Antidote can synergize with alignment-stage defenses.
- S-L-A (SFT + Lisa + Antidote): Uses SFT for alignment, Lisa for fine-tuning, and then Antidote. It also reduces the average harmful score by 2.23 points (to 61.20) but comes with a slight reduction in fine-tune accuracy (-0.73 points, to 92.01).
- V-L-A (Vaccine + Lisa + Antidote): Combines Vaccine in alignment, Lisa in fine-tuning, and Antidote. It shows a smaller reduction in harmful score (-0.76 points, to 62.67) compared to V-S-A and S-L-A, while maintaining the same fine-tune accuracy as vanilla Antidote.

In summary, combining Antidote with a strong alignment-stage defense like Vaccine (V-S-A) proves to be the most effective strategy, offering improved safety without compromising performance.
6.6. Visualization
The paper includes a visualization section that presents qualitative results:
The paper states that for a malicious prompt, Antidote gives refusal answers to sensitive questions, while other methods cannot. This qualitative result complements the quantitative harmful score by showing the actual output behavior, confirming Antidote's ability to maintain safety.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper identifies a crucial hyper-parameter sensitivity issue in existing alignment-stage and fine-tuning-stage defenses against harmful fine-tuning attacks on LLMs. These defenses often fail when using fine-tuning hyper-parameters (like large learning rates or many epochs) that are necessary for optimal downstream task performance. To address this, the authors propose Antidote, a novel post-fine-tuning stage solution that is agnostic to the fine-tuning details. Antidote's core philosophy is to recover the model from harmful behaviors by identifying and removing harmful parameters through one-shot pruning using the Wanda score on a re-alignment dataset. Extensive empirical results demonstrate that Antidote significantly reduces the harmful score across various attack settings and datasets, maintains high fine-tune accuracy, generalizes well to different LLM architectures, and is system-efficient.
7.2. Limitations & Future Work
The paper implicitly notes some limitations and suggests future work:
- Mask Ratio Tuning: The paper mentions that the mask ratio (α) was not specifically tuned for each dataset in the generalization experiments (Table 6), implying that further tuning could improve the trade-off between harmful score and fine-tune accuracy.
- Model Acceleration: While pruning offers inherent potential for model acceleration (especially with a larger mask ratio), the paper explicitly states that model acceleration was not its main focus and is left as future work.
- Potential Misuse: The Impact Statement acknowledges that the findings of this paper, while proposing a defense, could potentially be misused to launch attacks on commercial LLM services, highlighting a broader ethical consideration in security research.
7.3. Personal Insights & Critique
The paper's identification of the hyper-parameter sensitivity issue is a significant contribution. It highlights a practical challenge for LLM service providers who need to balance user customization (which might require specific fine-tuning hyper-parameters) with safety compliance. The Antidote approach is elegantly simple in its design, leveraging existing model sparsification techniques (Wanda score, pruning) for a novel application in safety alignment. The "post-fine-tuning" nature of Antidote offers a modularity that could be highly beneficial, allowing users to fine-tune models as they normally would, with Antidote acting as a subsequent safety layer.
One area for potential critique or deeper exploration is the definition and selection of the re-alignment dataset. While the paper validates its necessity and size, the quality and representativeness of this harmful dataset are crucial. If the re-alignment dataset itself is not comprehensive, Antidote might miss certain harmful parameters or inadvertently remove benign ones. Further research into dynamic re-alignment dataset generation or adaptation could be valuable. Additionally, the specific mechanism of how pruning affects the LLM's internal representations and its long-term safety resilience could be further investigated from a mechanistic interpretability perspective.
The paper provides a compelling solution for a pressing issue in LLM deployment. Its methods, particularly the idea of repurposing pruning for safety, could inspire similar applications in other domains where model robustness against specific malicious behaviors is critical, beyond just LLMs. The low computational overhead is a strong practical advantage, making Antidote a promising candidate for real-world integration.