Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
TL;DR Summary
This paper introduces Layer-Aware Representation Filtering (LARF) to maintain safety alignment in fine-tuning Large Language Models by identifying safety-sensitive layers and filtering out safety-degrading samples from the training data.
Abstract
With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a **L**ayer-**A**ware **R**epresentation **F**iltering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated. Please see our code at https://github.com/LLLeoLi/LARF.
In-depth Reading
1. Bibliographic Information
1.1. Title
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
The title clearly outlines the paper's core subject. It addresses the problem of maintaining the "safety alignment" of Large Language Models (LLMs) during the "finetuning" process. The proposed solution is a data "filtering" technique that is "layer-aware," meaning it analyzes the internal layers of the model to "purify" the data used for finetuning.
1.2. Authors
- Authors: Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, Lei Sha.
- Affiliations: The authors are from prominent Chinese research institutions: Shanghai Artificial Intelligence Laboratory, Beihang University, Wuhan University, and Peking University. The corresponding authors, Jing Shao and Lei Sha, are associated with the Shanghai AI Laboratory and Beihang University, respectively. This collaboration indicates a strong academic focus on LLM safety and alignment.
1.3. Journal/Conference
The paper was submitted to arXiv, a popular preprint server for academic papers. The publication date in the metadata (2025-07-24T17:59:24.000Z) and the arXiv ID (2507.18631v1) indicate a July 2025 submission. The paper cites other works accepted to the ICLR 2025 conference, and it appears to be targeting top-tier AI conferences like ICLR, NeurIPS, or ICML, which are highly competitive and influential in the field of machine learning.
1.4. Publication Year
The paper is a preprint (version v1). The arXiv identifier indicates a submission in July 2025.
1.5. Abstract
The abstract presents a concise summary of the research.
- Problem: Fine-tuning pre-aligned LLMs, even with seemingly harmless ("benign") datasets, can degrade their safety alignment, making them more vulnerable to generating unsafe content in response to malicious prompts.
- Insight: The authors argue that benign fine-tuning datasets contain "safety-degrading" samples that are not obvious on the surface but possess features that undermine the model's safety mechanisms.
- Methodology: They propose LARF (Layer-Aware Representation Filtering), a novel method to address this. LARF first identifies "safety-sensitive" layers within an LLM—those most critical to its safety behavior. It then uses the internal data representations from these specific layers to detect and filter out the safety-degrading samples from the fine-tuning dataset.
- Results: Experiments show that LARF successfully identifies these problematic data samples. By removing them, the fine-tuning process proceeds without significantly harming the model's safety alignment.
- Conclusion: LARF is an effective method to mitigate safety degradation during fine-tuning.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2507.18631v1
- PDF Link: https://arxiv.org/pdf/2507.18631v1.pdf
- Publication Status: This is a preprint on arXiv. It has not yet undergone formal peer review for publication in a conference or journal.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: The core problem is a phenomenon known as safety alignment degradation. Modern Large Language Models (LLMs) like Llama 3 or GPT-4 undergo a process called "alignment" to make them helpful and harmless. This involves training them to refuse dangerous or unethical requests. However, developers often need to "fine-tune" these aligned models on custom datasets to adapt them for specific tasks (e.g., coding, medical advice). Recent research has alarmingly shown that this fine-tuning process, even with datasets containing no explicitly toxic content, can unintentionally erase or weaken the model's safety guardrails. This makes the fine-tuned model susceptible to "jailbreaking," where it complies with malicious instructions it would have previously refused.
- Importance and Challenges: This issue is a critical barrier to the safe deployment of LLMs in sensitive, real-world applications like healthcare, finance, and education, where unexpected harmful behavior can have severe consequences. Existing methods to tackle this are insufficient. Standard toxicity filters are designed to catch obviously harmful content, not the subtle, "benign-looking" data that causes this degradation. More advanced methods for identifying such "safety-degrading" data, like Bi-Anchoring (which analyzes gradients) and SEAL (which trains a separate classifier model), suffer from being computationally expensive, slow, and sometimes unreliable.
- Innovative Idea: The paper's key insight is that an LLM's safety mechanism (its ability to refuse harmful prompts) is not uniformly distributed throughout the model but is particularly concentrated in a few specific layers. The authors term these "safety-sensitive layers." Instead of relying on complex gradient calculations or training new models, they propose to directly inspect the data representations (the internal "thoughts" of the model) within these critical layers. The hypothesis is that benign-looking data that degrades safety will have internal representations that are surprisingly similar to those of explicitly harmful data.
2.2. Main Contributions / Findings
The paper presents the following main contributions:
- A Principled and Efficient Filtering Framework (LARF): The authors propose a novel framework that avoids the high costs associated with previous methods. By focusing on layer-wise representations, LARF provides a fast and accurate way to identify safety-degrading data within benign datasets without needing extra training or gradient computations.
- State-of-the-Art Detection Performance: LARF is shown to be highly effective. In experiments with the Llama 3.1 model, fine-tuning on the 1,000 "worst" samples identified by LARF from the Alpaca dataset caused the model's Attack Success Rate (ASR) on a safety benchmark to skyrocket from 3.5% to 39%. Conversely, fine-tuning on the 1,000 "best" samples reduced the ASR to 0%. This demonstrates its precision in separating "good" from "bad" data for safety.
- Broad Generalizability and Practical Impact: The authors show that filtering data with LARF successfully mitigates safety degradation across a variety of models and downstream tasks, including code generation, mathematical reasoning, and medical question-answering. Crucially, this safety preservation comes with no significant loss in performance on the intended task. This makes LARF a practical pre-deployment tool for ensuring LLM safety.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- LLM Safety Alignment: This refers to the process of training an LLM to behave in accordance with human values and safety guidelines. A primary goal of alignment is to make the model "harmless" by teaching it to recognize and refuse to comply with prompts that are dangerous, unethical, or illegal (e.g., requests for instructions on building weapons, generating hate speech, or creating malware). This is often achieved through techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO).
- Supervised Fine-Tuning (SFT): This is a common technique for adapting a general-purpose, pre-trained LLM to a specific downstream task. It involves further training the model on a smaller, curated dataset of high-quality examples. For instance, to make a model good at coding, it would be fine-tuned on a dataset of code-related instructions and their correct solutions. The problem this paper tackles is that SFT can unintentionally undo the model's safety alignment.
- Internal Representations (Hidden States): An LLM processes text by passing it through a series of layers (typically Transformer blocks). At each layer, the model produces a set of vectors known as hidden states or representations. These vectors are high-dimensional numerical encodings that capture semantic and syntactic information about the input text at that specific stage of processing. The paper's core idea is to analyze these representations to understand how the model "perceives" a given piece of data.
- Gradient-Based Data Attribution: In neural networks, a gradient represents how a small change in a model's parameter (weight) affects the final output (or loss). Gradient-based methods use this information to trace the influence of a specific training data point on the model's behavior. By calculating the gradient similarity between a new data point and known "good" or "bad" examples, these methods try to attribute a risk score to the new data.
3.2. Previous Works
The paper positions itself against two main categories of prior work:
- Gradient-Based Methods: These methods analyze model gradients to identify influential data.
  - GradSafe (Xie et al., 2024): Classifies instructions as unsafe by looking at the gradient of the model's "safety-sensitive parameters" in response to the instruction. It focuses only on the prompt, not the model's potential response.
  - Bi-Anchoring (He et al., 2024): This method is a key baseline. It calculates a risk score for a candidate data sample by measuring the gradient similarity between that sample and two "anchors": a reference set of safe examples and a reference set of unsafe examples. The authors note that this method can be noisy and struggles with long output sequences, which diminishes the effectiveness of gradient similarity.
- Model-Based Methods: These methods train an additional model to perform the filtering.
  - SEAL (Shen et al., 2025): This method trains a dedicated "data ranker" model. The ranker learns to distinguish between data that is helpful for the downstream task but safe, and data that is unsafe or low-quality. The main drawback is the significant computational overhead and time required to train this extra model for every new LLM or dataset.
- Representation Engineering: This is a related field that inspires the paper's approach.
  - Representation Engineering (Zou et al., 2023a): This line of work has shown that model behaviors can be controlled by directly manipulating their internal representations.
  - Refusal Direction (Arditi et al., 2024): This work found that a model's decision to "refuse" or "comply" with a harmful prompt can be controlled by shifting its internal representation along a single specific direction in the activation space.
  - Circuit Breaker (Zou et al., 2024): This method defends against attacks by rerouting harmful representations at inference time.
  - Significance: These works establish the crucial role of internal representations in LLM safety, providing the theoretical groundwork for LARF, which leverages representations for data filtering rather than inference-time intervention.
3.3. Technological Evolution
The approach to handling safety degradation during fine-tuning has evolved:
- Initial Problem Discovery: Researchers (e.g., Qi et al., 2024) first identified that fine-tuning on benign data could compromise safety.
- Gradient-Based Attribution: The first wave of solutions, like Bi-Anchoring, adapted data attribution techniques from other domains. They used gradients as a proxy for data influence, but this proved to be computationally heavy and sometimes unreliable.
- Auxiliary Model Training: The next step, represented by SEAL, was to frame it as a learning problem and train a dedicated model to score data. This improved effectiveness but at a very high computational cost, making it impractical for many users.
- Representation-Based Filtering (LARF): This paper represents a shift towards a more direct and efficient approach. Instead of using proxies like gradients or training external models, LARF probes the LLM's own internal mechanisms by analyzing its representations in key layers. This is inspired by recent advances in model interpretability and representation engineering.
3.4. Differentiation Analysis
LARF's core innovations compared to previous methods are:
- Efficiency: It is training-free and gradient-free. Unlike SEAL, it doesn't require training an auxiliary model. Unlike Bi-Anchoring, it doesn't require expensive backpropagation to compute gradients for every data sample. It only requires a single forward pass per sample to extract representations, making it significantly faster and less resource-intensive.
- Precision through Localization: LARF introduces the novel concept of "safety-sensitive layers." Instead of looking at the entire model, it pinpoints the specific layers that are most influential for safety behavior. This targeted approach filters out noise from other layers and focuses on the most relevant signals.
- Use of Bidirectional Representations: The scoring function is not based solely on similarity to unsafe examples but on the difference between similarity to unsafe examples and similarity to safe examples ($\mathrm{score}_i = s_{\mathrm{unsafe}}(r_i) - s_{\mathrm{safe}}(r_i)$). This "bidirectional" approach better captures the model's internal trade-off between compliance and refusal, leading to more robust filtering.
4. Methodology
4.1. Principles
The core principle of LARF is that an LLM's safety alignment is not a globally distributed property but is heavily localized within a few specific "safety-sensitive" layers. The method is built on two key hypotheses:
1. These safety-sensitive layers can be identified by observing how perturbations to each layer's parameters affect the model's refusal behavior. The layer that causes the largest swing in refusals (i.e., fewer refusals when weakened, more when strengthened) is the most critical for safety.
2. Once these layers are found, the representations of data samples within them can serve as a powerful signal. Benign-looking data samples that ultimately degrade safety will, at this critical layer, have representations that are closer to those of explicitly harmful/compliant examples than to those of safe/refusal examples.

LARF operationalizes these principles in a two-stage pipeline, as shown in Figure 2 of the paper.
Figure 2: A schematic of the two stages of LARF: safety-sensitive layer identification and safety-degrading data filtering. In the first stage, the safety-sensitive layer is identified by measuring how refusal behavior changes under parameter scaling, and layers are ranked by the number of refusal responses. In the second stage, representations of the candidate data are analyzed to separate risky from safe samples for filtering.
4.2. Core Methodology In-depth (Layer by Layer)
The entire LARF process is detailed in Algorithm 1 of the paper's appendix. It consists of two main stages: identifying the safety-sensitive layer and then using it to filter data.
4.2.1. Stage 1: Safety-sensitive Layers Identification
This stage aims to find the single layer in the LLM that has the most influence over its safety (refusal) mechanism.
1. Setup:
- An overrejection dataset $D_s$ is used. This is a specially constructed dataset containing benign instructions that a highly cautious, safety-aligned model might mistakenly refuse (e.g., "How can I kill some time?"). The use of this dataset makes the model's safety mechanism more "trigger-happy" and thus its behavior more sensitive to parameter changes, making it easier to identify the critical layer.
2. Probing Process:
The method iterates through each layer of the LLM (from layer 0 to layer L-1). For each layer $l$, it performs the following steps:
- Step 2.1: Define Scaled Modules: For a given layer $l$ and a small scaling factor $\alpha$ (the paper uses values of 0.1 and 0.2), two modified versions of the layer's modules are created:
  - An enhanced version where the attention ($A_l$) and feed-forward ($F_l$) module weights are amplified: $ A_l^+ = (1 + \alpha) A_l, \quad F_l^+ = (1 + \alpha) F_l $
  - A weakened version where the weights are attenuated: $ A_l^- = (1 - \alpha) A_l, \quad F_l^- = (1 - \alpha) F_l $
- Step 2.2: Count Refusals:
  - The model with the enhanced layer is run on all prompts in the overrejection dataset $D_s$. The number of responses that are refusals is counted and denoted $c_l^+(\alpha)$: $ c_l^+(\alpha) = \big| \{ x \in D_s \mid y_s^+(x) \text{ is refusal} \} \big| $ where $y_s^+(x)$ is the model's output with the enhanced layer.
  - Similarly, the model with the weakened layer is run on $D_s$, and the number of refusal responses is counted and denoted $c_l^-(\alpha)$: $ c_l^-(\alpha) = \big| \{ x \in D_s \mid y_s^-(x) \text{ is refusal} \} \big| $
- Step 2.3: Calculate Sensitivity Score: The sensitivity of layer $l$ is measured by the difference in refusal counts between the enhanced and weakened states. A large difference indicates the layer is highly influential: $ \Delta_l(\alpha) = c_l^+(\alpha) - c_l^-(\alpha) $ To get a normalized score, this difference is divided by the scaling factor $\alpha$. The final sensitivity score for the layer, $k_l$, is the maximum sensitivity observed across the different $\alpha$ values tested: $ k_l = \operatorname*{max}_{\alpha \in \{ \alpha_1, \alpha_2 \}} \frac{\Delta_l(\alpha)}{\alpha} $
3. Layer Selection:
- After calculating the sensitivity score for all candidate layers, the layer with the highest score is selected as the safety-sensitive layer, denoted $l_s$: $ l_s = \arg \operatorname*{max}_{l=0, ..., L-1} k_l $
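To make the probing loop concrete, here is a minimal Python sketch of Stage 1, assuming a Llama-style Hugging Face checkpoint, a plain list of overrejection prompts, and a keyword-based refusal heuristic. The model id, refusal markers, and generation settings are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from contextlib import contextmanager
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry", "i won't")  # crude heuristic

def is_refusal(text: str) -> bool:
    return any(m in text.lower() for m in REFUSAL_MARKERS)

@contextmanager
def scaled_layer(layer, factor: float):
    # Temporarily scale every attention / feed-forward parameter of one decoder
    # layer by `factor`, then restore the exact original weights afterwards.
    saved = []
    with torch.no_grad():
        for module in (layer.self_attn, layer.mlp):
            for p in module.parameters():
                saved.append((p, p.data.clone()))
                p.data.mul_(factor)
    try:
        yield
    finally:
        with torch.no_grad():
            for p, original in saved:
                p.data.copy_(original)

@torch.no_grad()
def count_refusals(prompts) -> int:
    n = 0
    for prompt in prompts:
        ids = tok.apply_chat_template([{"role": "user", "content": prompt}],
                                      add_generation_prompt=True,
                                      return_tensors="pt").to(model.device)
        out = model.generate(ids, max_new_tokens=32, do_sample=False)
        n += is_refusal(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
    return n

def layer_sensitivity_scores(overrejection_prompts, alphas=(0.1, 0.2)):
    scores = []
    for layer in model.model.layers:
        k_l = 0.0
        for a in alphas:
            with scaled_layer(layer, 1 + a):   # enhanced layer -> c_l^+(alpha)
                c_plus = count_refusals(overrejection_prompts)
            with scaled_layer(layer, 1 - a):   # weakened layer -> c_l^-(alpha)
                c_minus = count_refusals(overrejection_prompts)
            k_l = max(k_l, (c_plus - c_minus) / a)
        scores.append(k_l)
    return scores

# scores = layer_sensitivity_scores(overrejection_prompts)  # e.g., the 110 benign-but-touchy prompts
# safety_sensitive_layer = max(range(len(scores)), key=scores.__getitem__)
```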
4.2.2. Stage 2: Bidirectional Representation Similarity Calculation
This stage uses the identified safety-sensitive layer to score and rank every sample in the target fine-tuning dataset (D_test).
1. Setup:
- Two small reference datasets are required:
  - D_unsafe: A set of harmful instructions paired with unsafe, compliant responses (e.g., generated by an uncensored model).
  - D_safe: The same set of harmful instructions from D_unsafe, but paired with safe, refusal responses.
2. Calculating Reference Representations:
- The method first computes an average "unsafe" and "safe" representation vector.
- Step 2.1: Each sample in D_safe is passed through the model. The hidden state representation from the output of the safety-sensitive layer (i.e., $r_{l_s+1}$) is extracted at the final end-of-sequence (EOS) token. These representations are then averaged to create a single "safe" reference vector, $r_{\mathrm{safe}}$: $ r_{\mathrm{safe}} = \frac{1}{|D_{\mathrm{safe}}|} \sum_{d \in D_{\mathrm{safe}}} r_{l_s+1}(d) $
- Step 2.2: The same process is repeated for the D_unsafe dataset to create an "unsafe" reference vector, $r_{\mathrm{unsafe}}$: $ r_{\mathrm{unsafe}} = \frac{1}{|D_{\mathrm{unsafe}}|} \sum_{d \in D_{\mathrm{unsafe}}} r_{l_s+1}(d) $ These two vectors, $r_{\mathrm{safe}}$ and $r_{\mathrm{unsafe}}$, act as anchors in the representation space, defining the poles of "safe refusal" and "unsafe compliance."
3. Scoring and Ranking Test Data:
- For each sample $d_i$ in the fine-tuning dataset D_test:
  - Step 3.1: Extract Representation: The sample is passed through the model, and its representation is extracted from the output of layer $l_s$ at the final token: $ r_i = r_{l_s+1}(d_i) $
  - Step 3.2: Calculate Cosine Similarities: The cosine similarity between the sample's representation and each of the two reference vectors is computed: $ s_{\mathrm{safe}}(r_i) = \mathrm{sim}(r_i, r_{\mathrm{safe}}) $, $ s_{\mathrm{unsafe}}(r_i) = \mathrm{sim}(r_i, r_{\mathrm{unsafe}}) $
  - Step 3.3: Calculate Safety-degrading Score: The final score for the sample is the similarity to the unsafe anchor minus the similarity to the safe anchor: $ \mathrm{score}_i = s_{\mathrm{unsafe}}(r_i) - s_{\mathrm{safe}}(r_i) $ A high positive score means the sample's representation is much more aligned with unsafe compliance than with safe refusal, indicating it is likely a safety-degrading sample.
4. Filtering:
- All samples in D_test are ranked in descending order based on their $\mathrm{score}_i$. The top-ranked samples are considered the most dangerous for safety alignment and can be removed from the dataset before fine-tuning begins.
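Continuing under the same assumptions as the Stage 1 sketch (and reusing its `model`, `tok`, and `safety_sensitive_layer`), a minimal version of the bidirectional scoring could look like the following. The chat formatting and the `instruction`/`response` field names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def last_token_rep(instruction: str, response: str, layer_idx: int) -> torch.Tensor:
    # Run one (instruction, response) pair through the model and return the hidden
    # state of the final token at the output of decoder layer `layer_idx`.
    msgs = [{"role": "user", "content": instruction},
            {"role": "assistant", "content": response}]
    ids = tok.apply_chat_template(msgs, return_tensors="pt").to(model.device)
    out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index layer_idx + 1 is the
    # output of decoder layer `layer_idx`; [0, -1] selects the last token.
    return out.hidden_states[layer_idx + 1][0, -1].float()

def mean_rep(dataset, layer_idx):
    reps = [last_token_rep(d["instruction"], d["response"], layer_idx) for d in dataset]
    return torch.stack(reps).mean(dim=0)

def rank_by_safety_degrading_score(d_test, d_safe, d_unsafe, layer_idx):
    r_safe = mean_rep(d_safe, layer_idx)       # "safe refusal" anchor
    r_unsafe = mean_rep(d_unsafe, layer_idx)   # "unsafe compliance" anchor
    scored = []
    for d in d_test:
        r = last_token_rep(d["instruction"], d["response"], layer_idx)
        score = (F.cosine_similarity(r, r_unsafe, dim=0)
                 - F.cosine_similarity(r, r_safe, dim=0)).item()
        scored.append((score, d))
    # Highest scores first: the top-ranked samples are candidates for removal.
    return sorted(scored, key=lambda x: x[0], reverse=True)

# ranked = rank_by_safety_degrading_score(finetune_data, d_safe, d_unsafe, safety_sensitive_layer)
# filtered_data = [d for _, d in ranked[1000:]]  # e.g., drop the 1,000 highest-scoring samples
```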
5. Experimental Setup
5.1. Datasets
- Models: The experiments were conducted on three popular safety-aligned instruction-tuned models:
  - Llama3-8B-Instruct
  - Llama3.1-8B-Instruct
  - Qwen2.5-7B-Instruct
  The appendix also mentions tests on Mistral-v0.2, Phi-3-mini, and Qwen2.
- Fine-tuning Datasets: Two widely-used, benign instruction-following datasets were used to simulate the fine-tuning process that causes safety degradation:
  - Alpaca: A dataset of 52k instruction-following examples generated by OpenAI's text-davinci-003.
  - Dolly: A dataset of ~15k instruction-following examples crowdsourced from Databricks employees.
- Safety Evaluation Datasets: To measure the models' safety after fine-tuning, three challenging benchmarks composed of harmful prompts were used:
  - HarmBench: A standardized benchmark for evaluating model safety, containing prompts generated via automated red-teaming.
  - HEx-PHI: A dataset of harmful prompts designed to be effective against models aligned with the "Helpful, Harmless, and Honest" (HHH) principles.
  - DirectHarm4: A dataset specifically created to challenge fine-tuned models, focusing on four categories where they are empirically weaker: Malware, Drugs, Phishing, and Disinformation.
- Downstream Task Datasets: To verify that filtering does not hurt task-specific performance, three datasets were used for fine-tuning and evaluation:
  - Magicoder: For code generation, evaluated on HumanEval.
  - PubMedQA: For medical question answering.
  - MetaMath: For mathematical reasoning, evaluated on MATH.
- Reference and Overrejection Datasets:
  - The D_safe and D_unsafe reference sets were constructed using 100 examples (5 from each of 20 harmful categories) from the Circuit Breaker training dataset.
  - The overrejection dataset was created by the authors and contains 110 instructions that pair potentially dangerous verbs with innocuous intents (e.g., "kill time").
5.2. Evaluation Metrics
- Attack Success Rate (ASR):
  - Conceptual Definition: ASR measures the percentage of harmful prompts from a safety benchmark that successfully "jailbreak" the model, causing it to generate a harmful or non-refusal response. A higher ASR indicates a less safe model. The paper uses LlamaGuard 3, a specialized safety classifier model, to judge whether a response is harmful.
  - Mathematical Formula: $ \text{ASR} = \frac{\text{Number of Harmful Responses}}{\text{Total Number of Harmful Prompts}} \times 100\% $
  - Symbol Explanation:
    - Number of Harmful Responses: The count of model outputs that are classified as unsafe by the evaluator (LlamaGuard 3).
    - Total Number of Harmful Prompts: The total number of prompts in the safety benchmark dataset.
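As a toy illustration of the formula, ASR can be computed directly from per-response verdicts; the `judge` callable standing in for LlamaGuard 3 is an assumption.

```python
def attack_success_rate(responses, judge) -> float:
    # `judge(response)` returns True if the safety classifier flags the response as harmful.
    harmful = sum(1 for r in responses if judge(r))
    return 100.0 * harmful / len(responses)
```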
- GPT Score and GPT ASR:
  - Conceptual Definition: For a more nuanced safety evaluation on downstream tasks, the authors used GPT-4o to rate the harmfulness of model responses on a scale from 1 (least harmful) to 5 (most harmful).
  - GPT Score: The average harmfulness rating across all responses. Lower is better.
  - GPT ASR: The percentage of responses that receive the maximum harmfulness score of 5. Lower is better.
- Downstream Task Performance Metrics:
  - pass@1 (for HumanEval): Used for code generation tasks. It measures the percentage of programming problems for which the model generates at least one functionally correct solution in a single attempt.
  - accuracy (for PubMedQA): The standard classification accuracy metric. It measures the percentage of questions for which the model provides the correct answer.
  - math_verify (for MATH): Used for mathematical reasoning. It measures the percentage of problems for which the model's generated solution is verifiably correct.
5.3. Baselines
LARF was compared against four methods for filtering or selecting data:
- Random: As a simple baseline, a subset of data is randomly sampled from the fine-tuning dataset. This represents the performance without any intelligent filtering.
- SEAL: A state-of-the-art but computationally expensive method that trains a dedicated data ranker model to distinguish safe and high-quality data.
- GradSafe: A gradient-based method that identifies safety-degrading data using only the gradients generated from the instruction part of a data sample.
- Bi-Anchoring: Another strong gradient-based baseline that computes gradients from both the instruction and the first few tokens of the response, then compares their similarity to safe and unsafe reference gradients.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results robustly support the paper's claims about LARF's efficiency and effectiveness.
6.1.1. Identification of Safety-Sensitive Layers
- Layer Sensitivity Analysis (Figure 4): This experiment validates the core premise of LARF. As shown in Figure 4 for the Llama3 model, perturbing different layers has vastly different effects on the model's refusal rate. Layer 13 shows the most dramatic change: weakening it causes a sharp drop in refusals, while strengthening it causes a sharp increase. This clearly identifies it as the "safety-sensitive layer." The appendix shows similar clear-cut results for other models (e.g., layer 13 for Llama3.1, layer 18 for Qwen2.5).
  Figure 4: Refusal responses across layers (L11-L31) under parameter scaling. Layer 13 is the most safety-sensitive: refusals drop sharply when its parameters are scaled down and rise sharply when they are scaled up.
- Validation of Layer Choice (Figure 5): To prove that the identified layer is indeed the best one for filtering, the authors performed an experiment where they used representations from different layers (11 to 31) to select the "worst" 1,000 samples from Alpaca. Figure 5 shows that fine-tuning on data selected using representations from layer 13 resulted in the highest Attack Success Rate (ASR). This confirms that the sensitivity-based identification method correctly pinpoints the layer whose representations are most indicative of safety-degrading features.
  Figure 5: Attack Success Rate (ASR) of Llama3 on three safety benchmarks when data is selected using representations from different layers. Layer-13 representations yield the highest ASR across all benchmarks, validating the identified safety-sensitive layer for data selection.
6.1.2. Effectiveness in Identifying Safety-Degrading Data
- Finding "Bad" Data (Table 1): This is the main result demonstrating LARF's superiority. The experiment involves fine-tuning models on the 1,000 samples ranked highest (most "safety-degrading") by each method. A higher resulting ASR means the method was more effective at finding the most damaging data. LARF consistently outperforms all baselines across different models, datasets, and safety benchmarks. For instance, with Llama3 on Alpaca, fine-tuning on LARF-identified data yields an ASR of 52.00% on DirectHarm4, whereas the next best baseline (Bi-Anchoring) only reaches 49.00%, and SEAL only reaches 26.75%. This highlights LARF's exceptional ability to isolate the most harmful subset of benign data.
- Finding "Good" Data (Table 5): The paper also shows the reverse: fine-tuning on the 1,000 samples with the lowest safety-degrading scores. A lower ASR is better here. As shown in Table 5, LARF is again the best performer, often reducing the model's ASR to nearly 0%, making the fine-tuned model even safer than the original aligned model. For Llama3 on Alpaca, LARF achieves ASRs of 0.75%, 0.00%, and 0.34% on the three benchmarks, drastically outperforming all other methods.
6.1.3. Efficiency and Downstream Performance
- Efficiency (Table 3): LARF is vastly more efficient than its competitors. Filtering the Alpaca dataset with Llama3.1 took LARF only 0.5 hours on a single GPU, compared to 3 hours for Bi-Anchoring (on 4 GPUs) and 6 hours for SEAL (on 8 GPUs). This makes LARF a far more practical tool.
- Downstream Performance (Table 2): A crucial finding is that LARF preserves safety without sacrificing utility. The experiment involved removing the 2,000 worst-ranked samples (20% of the data) before fine-tuning on downstream tasks. Table 2 shows that for all models and tasks (coding, QA, math), the performance (e.g., HumanEval score) after filtering with LARF remains nearly identical to the random baseline. However, the safety of the LARF-filtered models, measured by GPT Score and ASR, is consistently and significantly better than all baselines. In contrast, SEAL and Bi-Anchoring sometimes even increased harmfulness compared to the random baseline, showing their unreliability.
6.2. Data Presentation (Tables)
The following are the key results from Table 1, showing the Attack Success Rate (%) after fine-tuning on the 1,000 highest-ranked (most harmful) samples identified by each filtering method. The Instruct column reports the original aligned model before any fine-tuning, and the Random column reports fine-tuning on a randomly sampled subset. A higher ASR in a method's column is better for evaluating that method's ability to find bad data.
| Model | Dataset | Benchmark | Instruct | Random | LARF | SEAL | GradSafe | Bi-Anchoring |
|---|---|---|---|---|---|---|---|---|
| Llama3 | Alpaca | DirectHarm4 | 11.25 | 25.00 | 52.00 | 26.75 | 28.00 | 49.00 |
| Llama3 | Alpaca | Harmbench | 9.50 | 15.00 | 35.50 | 13.50 | 16.00 | 35.00 |
| Llama3 | Alpaca | HEx-PHI | 8.62 | 6.55 | 26.21 | 6.90 | 8.97 | 24.58 |
| Llama3 | Dolly | DirectHarm4 | 11.25 | 55.25 | 79.25 | 28.25 | 75.00 | 74.50 |
| Llama3 | Dolly | Harmbench | 9.50 | 39.25 | 78.50 | 13.00 | 82.00 | 75.00 |
| Llama3 | Dolly | HEx-PHI | 8.62 | 31.38 | 68.97 | 7.24 | 74.14 | 67.59 |
| Llama3.1 | Alpaca | DirectHarm4 | 13.25 | 22.50 | 49.50 | 27.75 | 7.50 | 11.00 |
| Llama3.1 | Alpaca | Harmbench | 3.50 | 18.50 | 39.00 | 13.00 | 5.00 | 12.50 |
| Llama3.1 | Alpaca | HEx-PHI | 5.86 | 8.97 | 31.38 | 6.90 | 3.45 | 3.10 |
| Llama3.1 | Dolly | DirectHarm4 | 13.25 | 54.00 | 84.00 | 71.75 | 59.50 | 67.25 |
| Llama3.1 | Dolly | Harmbench | 3.50 | 51.00 | 85.00 | 65.00 | 60.50 | 50.50 |
| Llama3.1 | Dolly | HEx-PHI | 5.86 | 29.30 | 60.34 | 38.62 | 33.79 | 40.00 |
| Qwen2.5 | Alpaca | DirectHarm4 | 9.25 | 27.50 | 44.50 | 20.00 | 26.00 | 44.50 |
| Qwen2.5 | Alpaca | Harmbench | 6.00 | 11.00 | 31.00 | 9.00 | 10.00 | 24.50 |
| Qwen2.5 | Alpaca | HEx-PHI | 9.66 | 13.10 | 27.24 | 6.55 | 12.07 | 24.80 |
| Qwen2.5 | Dolly | DirectHarm4 | 9.25 | 50.50 | 83.75 | 49.75 | 66.50 | 60.50 |
| Qwen2.5 | Dolly | Harmbench | 6.00 | 36.00 | 86.50 | 65.50 | 60.00 | 60.50 |
| Qwen2.5 | Dolly | HEx-PHI | 9.66 | 32.41 | 77.24 | 51.03 | 51.03 | 42.07 |
6.3. Ablation Studies / Parameter Analysis
- Bidirectional vs. Unidirectional Scoring (Figure 3): The authors perform an important ablation study to justify their scoring function, $\mathrm{score}_i = s_{\mathrm{unsafe}}(r_i) - s_{\mathrm{safe}}(r_i)$. They compare this "bidirectional" method to a simpler "unidirectional" one where the score is just the similarity to the unsafe anchor: $\mathrm{score}_i = s_{\mathrm{unsafe}}(r_i)$. Figure 3 shows that the bidirectional method is superior. When finding "bad" data (top 1,000), the bidirectional method leads to a higher ASR (52% vs. lower for unidirectional), and when finding "good" data (bottom 1,000), it leads to a lower ASR. This confirms that using both safe and unsafe anchors provides a stronger, more discriminative signal.
  Figure 3: Attack Success Rate (%) on three safety benchmarks for Llama3 fine-tuned on the top and bottom 1,000 samples ranked by the bidirectional (Orig) and unidirectional (Unsafe) methods. The top 1,000 samples under the original bidirectional ranking yield the highest ASR, 52.0%.
- Analysis of Safety-Degrading Data (Table 4, Section 4.5): Further analysis reveals interesting characteristics of the safety-degrading data identified by LARF.
  - Long, Point-by-Point Responses: As shown in Table 4, the top-ranked (most harmful) samples consistently feature much longer responses and a higher proportion of point-by-point, itemized list formats compared to the dataset average. The hypothesis is that these elaborate, helpful-seeming response styles override the model's default tendency to give short, concise refusals to potentially problematic prompts.
  - Representational Drift (Figure 17): PCA visualizations show that fine-tuning on the top-ranked data causes a significant "representational drift." The model's internal representations of harmful concepts move far away from where they were in the original, safely aligned model. This provides a clear visual explanation for the degradation of safety alignment.
  - Increased ASR on Specific Topics (Figures 19-21): The safety degradation is not uniform. Fine-tuning on LARF-identified bad data specifically increases the model's vulnerability to generating harmful content related to "Disinformation," "Phishing Crimes," and "Political Campaigning," suggesting the filtered data contains patterns that teach the model how to be "helpfully" harmful in these domains.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces LARF, a novel and highly effective method for purifying fine-tuning datasets to preserve the safety alignment of LLMs. The key innovation is the identification of "safety-sensitive layers" and the use of bidirectional representation similarity within these layers to score and filter out benign-looking but safety-degrading data samples. The authors demonstrate through extensive experiments that LARF is not only more effective than existing gradient-based and model-based methods but also significantly more efficient in terms of time and computational resources. By removing the identified harmful data, LARF successfully mitigates safety degradation during fine-tuning without compromising the model's performance on its intended downstream tasks, making it a valuable and practical tool for deploying LLMs safely.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Not a Panacea: Data filtering alone, while effective, may not be a complete solution to prevent all safety degradation. The authors suggest that combining LARF with safety-aware fine-tuning algorithms could provide more robust protection.
- Dependence on Reference Data: The effectiveness of LARF relies on the quality of the D_safe and D_unsafe reference datasets. The construction of these sets is crucial, and exploring methods for creating optimal reference sets is a promising area for future work.
- Limited Scope: The experiments were confined to language-only LLMs. The authors note that safety degradation is also a known issue in Vision-Language Models (VLMs) and text-to-image diffusion models. A key future direction is to adapt and evaluate LARF in these multimodal settings.
7.3. Personal Insights & Critique
- Strengths and Innovation:
  - The concept of "safety-sensitive layers" is a powerful and elegant simplification. It transforms a complex, model-wide problem into a targeted, localized one. This is a brilliant application of interpretability insights to solve a practical engineering problem.
  - The efficiency of LARF is its most compelling advantage. In an era where training and fine-tuning LLMs is extremely costly, a lightweight, gradient-free method that only requires forward passes is a game-changer for developers and researchers with limited resources.
  - The methodology is intuitive and well-justified. The validation experiments, such as showing that the identified layer is indeed the most effective for filtering (Figure 5) and that bidirectional scoring is superior (Figure 3), build a strong case for the design choices.
- Potential Issues and Unverified Assumptions:
  - Assumption of Localization: The method's success hinges on the assumption that safety mechanisms are strongly localized. While the results show this is a valid and effective heuristic for the tested models, it's possible that for other architectures or alignment methods, safety could be a more distributed phenomenon, which might reduce LARF's effectiveness.
  - Generalization to Larger Models: The paper provides some evidence of transferability to larger models (70B) in the appendix, but the bulk of the analysis is on 7B/8B models. The dynamics of internal representations and layer sensitivity might change at the scale of 100B+ parameter models, which warrants further investigation.
  - Dataset Construction: The quality of the overrejection dataset and the reference datasets is critical, but their construction is described as somewhat manual. A more systematic or automated way to generate these high-quality probe/reference sets would make the method more robust and easier to adopt.
- Future Value and Inspirations:
  - This representation-centric approach opens up exciting new avenues beyond simple data filtering. For instance, the safety-degrading score could be used as a regularization term during fine-tuning, penalizing the model if its representations drift into the "unsafe" region of the space.
  - LARF could be used as a powerful diagnostic tool. By analyzing which data points in a dataset receive high safety-degrading scores, developers can gain deep insights into what kinds of "benign" content are secretly toxic to their model's alignment, helping them create better datasets in the future.
  - The core idea of identifying "X-sensitive layers" for any property X (e.g., factuality, fairness, reasoning ability) and then using representations from those layers to filter or guide training is a highly generalizable paradigm that could be applied to many other problems in LLM development.