Paper status: completed

Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment

Published:07/25/2025
Original LinkPDF
Price: 0.100000
Price: 0.100000
Price: 0.100000
1 readers
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces Layer-Aware Representation Filtering (LARF) to maintain safety alignment in fine-tuning Large Language Models by identifying safety-sensitive layers and filtering out safety-degrading samples from the training data.

Abstract

With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a \textbf{L}ayer-\textbf{A}ware \textbf{R}epresentation \textbf{F}iltering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated. Please see our code at \href{https://github.com/LLLeoLi/LARF}{https://github.com/LLLeoLi/LARF}.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment

The title clearly outlines the paper's core subject. It addresses the problem of maintaining the "safety alignment" of Large Language Models (LLMs) during the "finetuning" process. The proposed solution is a data "filtering" technique that is "layer-aware," meaning it analyzes the internal layers of the model to "purify" the data used for finetuning.

1.2. Authors

  • Authors: Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, Lei Sha.
  • Affiliations: The authors are from prominent Chinese research institutions: Shanghai Artificial Intelligence Laboratory, Beihang University, Wuhan University, and Peking University. The corresponding authors, Jing Shao and Lei Sha, are associated with the Shanghai AI Laboratory and Beihang University, respectively. This collaboration indicates a strong academic focus on LLM safety and alignment.

1.3. Journal/Conference

The paper was submitted to arXiv, a popular preprint server for academic papers. The publication date in the metadata (2025-07-24T17:59:24.000Z) and the arXiv ID (2507.18631v1) appear to contain a typo, as the content and citations suggest a mid-2024 submission for a 2025 conference. The paper cites other works accepted to the ICLR 2025 conference, indicating it is likely targeting top-tier AI conferences like ICLR, NeurIPS, or ICML, which are highly competitive and influential in the field of machine learning.

1.4. Publication Year

The paper is a preprint with a v1 version. The arXiv identifier suggests a submission in July 2025, which is likely a typo for 2024.

1.5. Abstract

The abstract presents a concise summary of the research.

  • Problem: Fine-tuning pre-aligned LLMs, even with seemingly harmless ("benign") datasets, can degrade their safety alignment, making them more vulnerable to generating unsafe content in response to malicious prompts.
  • Insight: The authors argue that benign fine-tuning datasets contain "safety-degrading" samples that are not obvious on the surface but possess features that undermine the model's safety mechanisms.
  • Methodology: They propose LARF (Layer-Aware Representation Filtering), a novel method to address this. LARF first identifies "safety-sensitive" layers within an LLM—those most critical to its safety behavior. It then uses the internal data representations from these specific layers to detect and filter out the safety-degrading samples from the fine-tuning dataset.
  • Results: Experiments show that LARF successfully identifies these problematic data samples. By removing them, the fine-tuning process proceeds without significantly harming the model's safety alignment.
  • Conclusion: LARF is an effective method to mitigate safety degradation during fine-tuning.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: The core problem is a phenomenon known as safety alignment degradation. Modern Large Language Models (LLMs) like Llama 3 or GPT-4 undergo a process called "alignment" to make them helpful and harmless. This involves training them to refuse dangerous or unethical requests. However, developers often need to "fine-tune" these aligned models on custom datasets to adapt them for specific tasks (e.g., coding, medical advice). Recent research has alarmingly shown that this fine-tuning process, even with datasets containing no explicitly toxic content, can unintentionally erase or weaken the model's safety guardrails. This makes the fine-tuned model susceptible to "jailbreaking," where it complies with malicious instructions it would have previously refused.

  • Importance and Challenges: This issue is a critical barrier to the safe deployment of LLMs in sensitive, real-world applications like healthcare, finance, and education, where unexpected harmful behavior can have severe consequences. Existing methods to tackle this are insufficient. Standard toxicity filters are designed to catch obviously harmful content, not the subtle, "benign-looking" data that causes this degradation. More advanced methods for identifying such "safety-degrading" data, like Bi-Anchoring (which analyzes gradients) and SEAL (which trains a separate classifier model), suffer from being computationally expensive, slow, and sometimes unreliable.

  • Innovative Idea: The paper's key insight is that an LLM's safety mechanism (its ability to refuse harmful prompts) is not uniformly distributed throughout the model but is particularly concentrated in a few specific layers. The authors term these "safety-sensitive layers." Instead of relying on complex gradient calculations or training new models, they propose to directly inspect the data representations (the internal "thoughts" of the model) within these critical layers. The hypothesis is that benign-looking data that degrades safety will have internal representations that are surprisingly similar to those of explicitly harmful data.

2.2. Main Contributions / Findings

The paper presents the following main contributions:

  1. A Principled and Efficient Filtering Framework (LARF): The authors propose a novel framework that avoids the high costs associated with previous methods. By focusing on layer-wise representations, LARF provides a fast and accurate way to identify safety-degrading data within benign datasets without needing extra training or gradient computations.

  2. State-of-the-Art Detection Performance: LARF is shown to be highly effective. In experiments with the Llama 3.1 model, fine-tuning on the 1,000 "worst" samples identified by LARF from the Alpaca dataset caused the model's Attack Success Rate (ASR) on a safety benchmark to skyrocket from 3.5% to 39%. Conversely, fine-tuning on the 1,000 "best" samples reduced the ASR to 0%. This demonstrates its precision in separating "good" from "bad" data for safety.

  3. Broad Generalizability and Practical Impact: The authors show that filtering data with LARF successfully mitigates safety degradation across a variety of models and downstream tasks, including code generation, mathematical reasoning, and medical question-answering. Crucially, this safety preservation comes with no significant loss in performance on the intended task. This makes LARF a practical pre-deployment tool for ensuring LLM safety.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • LLM Safety Alignment: This refers to the process of training an LLM to behave in accordance with human values and safety guidelines. A primary goal of alignment is to make the model "harmless" by teaching it to recognize and refuse to comply with prompts that are dangerous, unethical, or illegal (e.g., requests for instructions on building weapons, generating hate speech, or creating malware). This is often achieved through techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO).

  • Supervised Fine-Tuning (SFT): This is a common technique for adapting a general-purpose, pre-trained LLM to a specific downstream task. It involves further training the model on a smaller, curated dataset of high-quality examples. For instance, to make a model good at coding, it would be fine-tuned on a dataset of code-related instructions and their correct solutions. The problem this paper tackles is that SFT can unintentionally undo the model's safety alignment.

  • Internal Representations (Hidden States): An LLM processes text by passing it through a series of layers (typically Transformer blocks). At each layer, the model produces a set of vectors known as hidden states or representations. These vectors are high-dimensional numerical encodings that capture semantic and syntactic information about the input text at that specific stage of processing. The paper's core idea is to analyze these representations to understand how the model "perceives" a given piece of data.

  • Gradient-Based Data Attribution: In neural networks, a gradient represents how a small change in a model's parameter (weight) affects the final output (or loss). Gradient-based methods use this information to trace the influence of a specific training data point on the model's behavior. By calculating the gradient similarity between a new data point and known "good" or "bad" examples, these methods try to attribute a risk score to the new data.

3.2. Previous Works

The paper positions itself against two main categories of prior work:

  1. Gradient-Based Methods: These methods analyze model gradients to identify influential data.

    • GradSafe (Xie et al., 2024): Classifies instructions as unsafe by looking at the gradient of the model's "safety-sensitive parameters" in response to the instruction. It focuses only on the prompt, not the model's potential response.
    • Bi-Anchoring (He et al., 2024): This method is a key baseline. It calculates a risk score for a candidate data sample by measuring the gradient similarity between that sample and two "anchors": a reference set of safe examples and a reference set of unsafe examples. The authors note that this method can be noisy and struggles with long output sequences, which diminishes the effectiveness of gradient similarity.
  2. Model-Based Methods: These methods train an additional model to perform the filtering.

    • SEAL (Shen et al., 2025): This method trains a dedicated "data ranker" model. The ranker learns to distinguish between data that is helpful for the downstream task but safe, and data that is unsafe or low-quality. The main drawback is the significant computational overhead and time required to train this extra model for every new LLM or dataset.
  3. Representation Engineering: This is a related field that inspires the paper's approach.

    • Representation Engineering (Zou et al., 2023a): This line of work has shown that model behaviors can be controlled by directly manipulating their internal representations.
    • Refusal Direction (Arditi et al., 2024): This work found that a model's decision to "refuse" or "comply" with a harmful prompt can be controlled by shifting its internal representation along a single specific direction in the activation space.
    • Circuit Breaker (Zou et al., 2024): This method defends against attacks by rerouting harmful representations at inference time.
    • Significance: These works establish the crucial role of internal representations in LLM safety, providing the theoretical groundwork for LARF, which leverages representations for data filtering rather than inference-time intervention.

3.3. Technological Evolution

The approach to handling safety degradation during fine-tuning has evolved:

  1. Initial Problem Discovery: Researchers (e.g., Qi et al., 2024) first identified that fine-tuning on benign data could compromise safety.
  2. Gradient-Based Attribution: The first wave of solutions, like Bi-Anchoring, adapted data attribution techniques from other domains. They used gradients as a proxy for data influence, but this proved to be computationally heavy and sometimes unreliable.
  3. Auxiliary Model Training: The next step, represented by SEAL, was to frame it as a learning problem and train a dedicated model to score data. This improved effectiveness but at a very high computational cost, making it impractical for many users.
  4. Representation-Based Filtering (LARF): This paper represents a shift towards a more direct and efficient approach. Instead of using proxies like gradients or training external models, LARF probes the LLM's own internal mechanisms by analyzing its representations in key layers. This is inspired by recent advances in model interpretability and representation engineering.

3.4. Differentiation Analysis

LARF's core innovations compared to previous methods are:

  • Efficiency: It is training-free and gradient-free. Unlike SEAL, it doesn't require training an auxiliary model. Unlike Bi-Anchoring, it doesn't require expensive backpropagation to compute gradients for every data sample. It only requires a single forward pass per sample to extract representations, making it significantly faster and less resource-intensive.

  • Precision through Localization: LARF introduces the novel concept of "safety-sensitive layers." Instead of looking at the entire model, it pinpoints the specific layers that are most influential for safety behavior. This targeted approach filters out noise from other layers and focuses on the most relevant signals.

  • Use of Bidirectional Representations: The scoring function is not just based on similarity to unsafe examples but on the difference between similarity to unsafe examples and safe examples (sunsafessafes_{unsafe} - s_{safe}). This "bidirectional" approach better captures the model's internal trade-off between compliance and refusal, leading to more robust filtering.


4. Methodology

4.1. Principles

The core principle of LARF is that an LLM's safety alignment is not a globally distributed property but is heavily localized within a few specific "safety-sensitive" layers. The method is built on two key hypotheses:

  1. These safety-sensitive layers can be identified by observing how perturbations to each layer's parameters affect the model's refusal behavior. The layer that causes the largest swing in refusals (i.e., fewer refusals when weakened, more when strengthened) is the most critical for safety.

  2. Once these layers are found, the representations of data samples within them can serve as a powerful signal. Benign-looking data samples that ultimately degrade safety will, at this critical layer, have representations that are closer to those of explicitly harmful/compliant examples than to those of safe/refusal examples.

    LARF operationalizes these principles in a two-stage pipeline, as shown in Figure 2 of the paper.

    Figure 2: Overview of our two-stage LARF pipeline. (1) Safety-sensitive layer identification: we scale each lae' pmeaueulhaebeen veetat y ivSea we compute average representations for safe \(( D _ { \\mathrm { s a f e } } )\) and unsafe \(( D _ { \\mathrm { u n s a f e } } )\) references, extract each test example's representation, and assign a safety-degrading score to rank and filter safety-degrading samples. 该图像是一个示意图,展示了LARF方法中的两个阶段:安全敏感层识别和安全降级数据过滤。在第一阶段中,通过计算安全变化率kk来识别安全敏感层,并根据拒绝响应的数量进行层级排序。在第二阶段中,对测试数据进行表示分析,区分危险与安全的数据以进行筛选。

4.2. Core Methodology In-depth (Layer by Layer)

The entire LARF process is detailed in Algorithm 1 of the paper's appendix. It consists of two main stages: identifying the safety-sensitive layer and then using it to filter data.

4.2.1. Stage 1: Safety-sensitive Layers Identification

This stage aims to find the single layer in the LLM that has the most influence over its safety (refusal) mechanism.

1. Setup:

  • An overrejection dataset DsD_s is used. This is a specially constructed dataset containing benign instructions that a highly cautious, safety-aligned model might mistakenly refuse (e.g., "How can I kill some time?"). The use of this dataset makes the model's safety mechanism more "trigger-happy" and thus its behavior more sensitive to parameter changes, making it easier to identify the critical layer.

2. Probing Process: The method iterates through each layer ll of the LLM (from l=0l=0 to L-1). For each layer, it performs the following steps:

  • Step 2.1: Define Scaled Modules: For a given layer ll and a small scaling factor α>0α > 0 (the paper uses αα values of 0.1 and 0.2), two modified versions of the layer's modules are created:

    • An enhanced version where the attention (AlA_l) and feed-forward (FlF_l) module weights are amplified: $ A_l^+ = (1 + \alpha) A_l, \quad F_l^+ = (1 + \alpha) F_l $
    • A weakened version where the weights are attenuated: $ A_l^- = (1 - \alpha) A_l, \quad F_l^- = (1 - \alpha) F_l $
  • Step 2.2: Count Refusals:

    • The model with the enhanced layer ll is run on all prompts xx in the overrejection dataset DsD_s. The number of responses that are refusals is counted. This count is denoted as cl+(α)c_l^+(\alpha). $ c_l^+(\alpha) = \big| { x \in D_s \mid y_s^+(x) \text{ is refusal} } \big| $ where ys+(x)y_s^+(x) is the model's output with the enhanced layer.
    • Similarly, the model with the weakened layer ll is run on DsD_s, and the number of refusal responses is counted, denoted as cl(α)c_l^-(\alpha). $ c_l^-(\alpha) = \big| { x \in D_s \mid y_s^-(x) \text{ is refusal} } \big| $
  • Step 2.3: Calculate Sensitivity Score: The sensitivity of layer ll is measured by the difference in refusal counts between the enhanced and weakened states. A large difference indicates the layer is highly influential. $ \Delta_l(\alpha) = c_l^+(\alpha) - c_l^-(\alpha) $ To get a normalized score, this difference is divided by the scaling factor αα. The final sensitivity score for the layer, klk_l, is the maximum sensitivity observed across the different αα values tested: $ k_l = \operatorname*{max}_{\alpha \in { \alpha_1, \alpha_2 }} \frac{\Delta_l(\alpha)}{\alpha} $

3. Layer Selection:

  • After calculating the sensitivity score klk_l for all candidate layers, the layer with the highest score is selected as the safety-sensitive layer, denoted as lsl_s. $ l_s = \arg \operatorname*{max}_{l=0, ..., L-1} k_l $

4.2.2. Stage 2: Bidirectional Representation Similarity Calculation

This stage uses the identified safety-sensitive layer lsl_s to score and rank every sample in the target fine-tuning dataset (D_test).

1. Setup:

  • Two small reference datasets are required:
    • D_unsafe: A set of harmful instructions paired with unsafe, compliant responses (e.g., generated by an uncensored model).
    • D_safe: The same set of harmful instructions from D_unsafe, but paired with safe, refusal responses.

2. Calculating Reference Representations:

  • The method first computes an average "unsafe" and "safe" representation vector.
  • Step 2.1: Each sample dd in D_safe is passed through the model. The hidden state representation from the output of the safety-sensitive layer (i.e., at layer index ls+1l_s+1) is extracted at the final end-of-sequence (<eos><eos>) token. These representations are then averaged to create a single "safe" reference vector, rsafer_{\mathrm{safe}}. $ r_{\mathrm{safe}} = \frac{1}{|D_{\mathrm{safe}}|} \sum_{d \in D_{\mathrm{safe}}} r_{l_s+1}(d) $
  • Step 2.2: The same process is repeated for the D_unsafe dataset to create an "unsafe" reference vector, runsafer_{\mathrm{unsafe}}. $ r_{\mathrm{unsafe}} = \frac{1}{|D_{\mathrm{unsafe}}|} \sum_{d \in D_{\mathrm{unsafe}}} r_{l_s+1}(d) $ These two vectors, rsafer_{\mathrm{safe}} and runsafer_{\mathrm{unsafe}}, act as anchors in the representation space, defining the poles of "safe refusal" and "unsafe compliance."

3. Scoring and Ranking Test Data:

  • For each sample did_i in the fine-tuning dataset D_test:
  • Step 3.1: Extract Representation: The sample did_i is passed through the model, and its representation rir_i is extracted from layer ls+1l_s+1 at the final token. $ r_i = r_{l_s+1}(d_i) $
  • Step 3.2: Calculate Cosine Similarities: The cosine similarity between the sample's representation rir_i and each of the two reference vectors is computed. $ s_{\mathrm{safe}}(r_i) = \mathrm{sim}(r_i, r_{\mathrm{safe}}) $ $ s_{\mathrm{unsafe}}(r_i) = \mathrm{sim}(r_i, r_{\mathrm{unsafe}}) $
  • Step 3.3: Calculate Safety-degrading Score: The final score for the sample did_i is the similarity to the unsafe anchor minus the similarity to the safe anchor. $ \mathrm{score}i = s{\mathrm{unsafe}}(r_i) - s_{\mathrm{safe}}(r_i) $ A high positive score means the sample's representation is much more aligned with unsafe compliance than with safe refusal, indicating it is likely a safety-degrading sample.

4. Filtering:

  • All samples in D_test are ranked in descending order based on their scoreiscore_i. The top-ranked samples are considered the most dangerous for safety alignment and can be removed from the dataset before fine-tuning begins.


5. Experimental Setup

5.1. Datasets

  • Models: The experiments were conducted on three popular safety-aligned instruction-tuned models:

    • Llama3-8B-Instruct
    • Llama3.1-8B-Instruct
    • Qwen2.5-7B-Instruct The appendix also mentions tests on Mistral-v0.2, Phi-3-mini, and Qwen2.
  • Fine-tuning Datasets: Two widely-used, benign instruction-following datasets were used to simulate the fine-tuning process that causes safety degradation:

    • Alpaca: A dataset of 52k instruction-following examples generated by OpenAI's text-davinci-003.
    • Dolly: A dataset of ~15k instruction-following examples crowdsourced from Databricks employees.
  • Safety Evaluation Datasets: To measure the models' safety after fine-tuning, three challenging benchmarks composed of harmful prompts were used:

    • HarmBench: A standardized benchmark for evaluating model safety, containing prompts generated via automated red-teaming.
    • HEx-PHI: A dataset of harmful prompts designed to be effective against models aligned with the "Helpful, Harmless, and Honest" (HHH) principles.
    • DirectHarm4: A dataset specifically created to challenge fine-tuned models, focusing on four categories where they are empirically weaker: Malware, Drugs, Phishing, and Disinformation.
  • Downstream Task Datasets: To verify that filtering does not hurt task-specific performance, three datasets were used for fine-tuning and evaluation:

    • Magicoder: For code generation, evaluated on HumanEval.
    • PubMedQA: For medical question answering.
    • MetaMath: For mathematical reasoning, evaluated on MATH.
  • Reference and Overrejection Datasets:

    • The D_safe and D_unsafe reference sets were constructed using 100 examples (5 from each of 20 harmful categories) from the Circuit Breaker training dataset.
    • The DsD_s overrejection dataset was created by the authors and contains 110 instructions that pair potentially dangerous verbs with innocuous intents (e.g., "kill time").

5.2. Evaluation Metrics

  • Attack Success Rate (ASR):

    • Conceptual Definition: ASR measures the percentage of harmful prompts from a safety benchmark that successfully "jailbreak" the model, causing it to generate a harmful or non-refusal response. A higher ASR indicates a less safe model. The paper uses LlamaGuard 3, a specialized safety classifier model, to judge whether a response is harmful.
    • Mathematical Formula: $ \text{ASR} = \frac{\text{Number of Harmful Responses}}{\text{Total Number of Harmful Prompts}} \times 100% $
    • Symbol Explanation:
      • Number of Harmful Responses: The count of model outputs that are classified as unsafe by the evaluator (LlamaGuard 3).
      • Total Number of Harmful Prompts: The total number of prompts in the safety benchmark dataset.
  • GPT Score and GPT ASR:

    • Conceptual Definition: For a more nuanced safety evaluation on downstream tasks, the authors used GPT-4o to rate the harmfulness of model responses on a scale from 1 (least harmful) to 5 (most harmful).
      • GPT Score: The average harmfulness rating across all responses. Lower is better.
      • GPT ASR: The percentage of responses that receive the maximum harmfulness score of 5. Lower is better.
  • Downstream Task Performance Metrics:

    • pass@1 (for HumanEval):
      • Conceptual Definition: Used for code generation tasks. It measures the percentage of programming problems for which the model generates at least one functionally correct solution in a single attempt.
    • accuracy (for PubMedQA):
      • Conceptual Definition: The standard classification accuracy metric. It measures the percentage of questions for which the model provides the correct answer.
    • math_verify (for MATH):
      • Conceptual Definition: Used for mathematical reasoning. It measures the percentage of problems for which the model's generated solution is verifiably correct.

5.3. Baselines

LARF was compared against four methods for filtering or selecting data:

  • Random: As a simple baseline, a subset of data is randomly sampled from the fine-tuning dataset. This represents the performance without any intelligent filtering.

  • SEAL: A state-of-the-art but computationally expensive method that trains a dedicated data ranker model to distinguish safe and high-quality data.

  • GradSafe: A gradient-based method that identifies safety-degrading data using only the gradients generated from the instruction part of a data sample.

  • Bi-Anchoring: Another strong gradient-based baseline that computes gradients from both the instruction and the first few tokens of the response, then compares their similarity to safe and unsafe reference gradients.


6. Results & Analysis

6.1. Core Results Analysis

The experimental results robustly support the paper's claims about LARF's efficiency and effectiveness.

6.1.1. Identification of Safety-Sensitive Layers

  • Layer Sensitivity Analysis (Figure 4): This experiment validates the core premise of LARF. As shown in Figure 4 for the Llama3 model, perturbing different layers has vastly different effects on the model's refusal rate. Layer 13 shows the most dramatic change: weakening it causes a sharp drop in refusals, while strengthening it causes a sharp increase. This clearly identifies it as the "safety-sensitive layer." The appendix shows similar clear-cut results for other models (e.g., layer 13 for Llama3.1, layer 18 for Qwen2.5).

    Figure 4: Layer-wise sensitivity of Llama3's refusal behavior under parameter scaling. The 13th layer is the most safety-sensitive: attenuating its parameters sharply reduces refusals, while amplifying them sharply increases refusals. 该图像是图表,展示了不同层级(L11-L31)在参数缩放下的拒绝响应(Refusal Responses)。可以看到,第13层(L13)对安全性最敏感,参数减小时拒绝响应显著下降,而参数增加时则迅速上升。

  • Validation of Layer Choice (Figure 5): To prove that the identified layer is indeed the best one for filtering, the authors performed an experiment where they used representations from different layers (11 to 31) to select the "worst" 1,000 samples from Alpaca. Figure 5 shows that fine-tuning on data selected using representations from layer 13 resulted in the highest Attack Success Rate (ASR). This confirms that the sensitivity-based identification method correctly pinpoints the layer whose representations are most indicative of safety-degrading features.

    Figure 5: Attack Success Rates (ASR) of Llama3 finetuned on the 1,000 top ranked examples selected by corresponding representations from layers 11th31st. Bars correspond to three safety benchmarks and reveal that selecting examples by the 13th-layer representation yields the highest ASR across all benchmarks, confirming the effectiveness of the identified safety-sensitive layer in data selection. 该图像是一个图表,展示了在不同层表示下,Llama3模型在三个安全基准上的攻击成功率(ASR)。数据呈现了第13层表示对应的攻击成功率在所有基准中最高,验证了确定的安全敏感层在数据选择中的有效性。

6.1.2. Effectiveness in Identifying Safety-Degrading Data

  • Finding "Bad" Data (Table 1): This is the main result demonstrating LARF's superiority. The experiment involves fine-tuning models on the 1,000 samples ranked highest (most "safety-degrading") by each method. A higher resulting ASR means the method was more effective at finding the most damaging data. LARF consistently outperforms all baselines across different models, datasets, and safety benchmarks. For instance, with Llama3 on Alpaca, fine-tuning on LARF-identified data yields an ASR of 52.00% on DirectHarm4, whereas the next best baseline (Bi-Anchoring) only reaches 49.00%, and SEAL only reaches 26.75%. This highlights LARF's exceptional ability to isolate the most harmful subset of benign data.

  • Finding "Good" Data (Table 5): The paper also shows the reverse: fine-tuning on the 1,000 samples with the lowest safety-degrading scores. A lower ASR is better here. As shown in Table 5, LARF is again the best performer, often reducing the model's ASR to nearly 0%, making the fine-tuned model even safer than the original aligned model. For Llama3 on Alpaca, LARF achieves ASRs of 0.75%, 0.00%, and 0.34% on the three benchmarks, drastically outperforming all other methods.

6.1.3. Efficiency and Downstream Performance

  • Efficiency (Table 3): LARF is vastly more efficient than its competitors. Filtering the Alpaca dataset with Llama3.1 took LARF only 0.5 hours on a single GPU, compared to 3 hours for Bi-Anchoring (on 4 GPUs) and 6 hours for SEAL (on 8 GPUs). This makes LARF a far more practical tool.
  • Downstream Performance (Table 2): A crucial finding is that LARF preserves safety without sacrificing utility. The experiment involved removing the 2,000 worst-ranked samples (20% of the data) before fine-tuning on downstream tasks. Table 2 shows that for all models and tasks (coding, QA, math), the performance (e.g., Humaneval score) after filtering with LARF remains nearly identical to the random baseline. However, the safety of the LARF-filtered models, measured by GPT Score and ASR, is consistently and significantly better than all baselines. In contrast, SEAL and Bi-Anchoring sometimes even increased harmfulness compared to the random baseline, showing their unreliability.

6.2. Data Presentation (Tables)

The following are the key results from Table 1, showing the Attack Success Rate (%) after fine-tuning on the 1,000 highest-ranked (most harmful) samples identified by each method. Higher is better for evaluating the filtering method's ability to find bad data.

Model Dataset Bench Instruct Random Filtering Method (Top 1,000)
LARF SEAL GradSafe Bi-Anchoring
Llama3 Alpaca DirectHarm4 11.25 25.00 52.00 26.75 28.00 49.00
Harmbench 9.50 15.00 35.50 13.50 16.00 35.00
HEx-PHI 8.62 6.55 26.21 6.90 8.97 24.58
Dolly DirectHarm4 11.25 55.25 79.25 28.25 75.00 74.50
Harmbench 9.50 39.25 78.50 13.00 82.00 75.00
HEx-PHI 8.62 31.38 68.97 7.24 74.14 67.59
Llama3.1 Alpaca DirectHarm4 13.25 22.50 49.50 27.75 7.50 11.00
Harmbench 3.50 18.50 39.00 13.00 5.00 12.50
HEx-PHI 5.86 8.97 31.38 6.90 3.45 3.10
Dolly DirectHarm4 13.25 54.00 84.00 71.75 59.50 67.25
Harmbench 3.50 51.00 85.00 65.00 60.50 50.50
HEx-PHI 5.86 29.30 60.34 38.62 33.79 40.00
Qwen2.5 Alpaca DirectHarm4 9.25 27.50 44.50 20.00 26.00 44.50
Harmbench 6.00 11.00 31.00 9.00 10.00 24.50
HEx-PHI 9.66 13.10 27.24 6.55 12.07 24.80
Dolly DirectHarm4 9.25 50.50 83.75 49.75 66.50 60.50
Harmbench 6.00 36.00 86.50 65.50 60.00 60.50
HEx-PHI 9.66 32.41 77.24 51.03 51.03 42.07

6.3. Ablation Studies / Parameter Analysis

  • Bidirectional vs. Unidirectional Scoring (Figure 3): The authors perform an important ablation study to justify their scoring function, scorei=sunsafe(ri)ssafe(ri)\mathrm{score}_i = s_{\mathrm{unsafe}}(r_i) - s_{\mathrm{safe}}(r_i). They compare this "bidirectional" method to a simpler "unidirectional" one where the score is just the similarity to the unsafe anchor: scorei=sunsafe(ri)\mathrm{score}_i = s_{\mathrm{unsafe}}(r_i). Figure 3 shows that the bidirectional method is superior. When finding "bad" data (top 1,000), the bidirectional method leads to a higher ASR (52% vs. lower for unidirectional), and when finding "good" data (bottom 1,000), it leads to a lower ASR. This confirms that using both safe and unsafe anchors provides a stronger, more discriminative signal.

    Figure 3: ASR of the fine-tuned Llama3 on the top and bottom 1,000 samples ranked by the bidirectional method (Orig) and the unidirectional method (Unsafe) from Alpaca across three safety benchmarks. 该图像是图表,展示了经过微调的Llama3在三个安全基准上,根据双向方法(Orig)和单向方法(Unsafe)排名的前后1,000个样本的攻击成功率(%)。可以看到,前1000个原始样本的攻击成功率最高,达到52.0%。

  • Analysis of Safety-Degrading Data (Table 4, Section 4.5): Further analysis reveals interesting characteristics of the safety-degrading data identified by LARF.

    • Long, Point-by-Point Responses: As shown in Table 4, the top-ranked (most harmful) samples consistently feature much longer responses and a higher proportion of point-by-point, itemized list formats compared to the dataset average. The hypothesis is that these elaborate, helpful-seeming response styles override the model's default tendency to give short, concise refusals to potentially problematic prompts.

    • Representational Drift (Figure 17): PCA visualizations show that fine-tuning on the top-ranked data causes a significant "representational drift." The model's internal representations of harmful concepts move far away from where they were in the original, safely aligned model. This provides a clear visual explanation for the degradation of safety alignment.

    • Increased ASR on Specific Topics (Figures 19-21): The safety degradation is not uniform. Fine-tuning on LARF-identified bad data specifically increases the model's vulnerability to generating harmful content related to "Disinformation," "Phishing Crimes," and "Political Campaigning," suggesting the filtered data contains patterns that teach the model how to be "helpfully" harmful in these domains.


7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces LARF, a novel and highly effective method for purifying fine-tuning datasets to preserve the safety alignment of LLMs. The key innovation is the identification of "safety-sensitive layers" and the use of bidirectional representation similarity within these layers to score and filter out benign-looking but safety-degrading data samples. The authors demonstrate through extensive experiments that LARF is not only more effective than existing gradient-based and model-based methods but also significantly more efficient in terms of time and computational resources. By removing the identified harmful data, LARF successfully mitigates safety degradation during fine-tuning without compromising the model's performance on its intended downstream tasks, making it a valuable and practical tool for deploying LLMs safely.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Not a Panacea: Data filtering alone, while effective, may not be a complete solution to prevent all safety degradation. The authors suggest that combining LARF with safety-aware fine-tuning algorithms could provide more robust protection.
  • Dependence on Reference Data: The effectiveness of LARF relies on the quality of the D_safe and D_unsafe reference datasets. The construction of these sets is crucial, and exploring methods for creating optimal reference sets is a promising area for future work.
  • Limited Scope: The experiments were confined to language-only LLMs. The authors note that safety degradation is also a known issue in Vision-Language Models (VLMs) and text-to-image diffusion models. A key future direction is to adapt and evaluate LARF in these multimodal settings.

7.3. Personal Insights & Critique

  • Strengths and Innovation:

    • The concept of "safety-sensitive layers" is a powerful and elegant simplification. It transforms a complex, model-wide problem into a targeted, localized one. This is a brilliant application of interpretability insights to solve a practical engineering problem.
    • The efficiency of LARF is its most compelling advantage. In an era where training and fine-tuning LLMs is extremely costly, a lightweight, gradient-free method that only requires forward passes is a game-changer for developers and researchers with limited resources.
    • The methodology is intuitive and well-justified. The validation experiments, such as showing that the identified layer is indeed the most effective for filtering (Figure 5) and that bidirectional scoring is superior (Figure 3), build a strong case for the design choices.
  • Potential Issues and Unverified Assumptions:

    • Assumption of Localization: The method's success hinges on the assumption that safety mechanisms are strongly localized. While the results show this is a valid and effective heuristic for the tested models, it's possible that for other architectures or alignment methods, safety could be a more distributed phenomenon, which might reduce LARF's effectiveness.
    • Generalization to Larger Models: The paper provides some evidence of transferability to larger models (70B) in the appendix, but the bulk of the analysis is on 7B/8B models. The dynamics of internal representations and layer sensitivity might change at the scale of 100B+ parameter models, which warrants further investigation.
    • Dataset Construction: The quality of the overrejection dataset and the reference datasets is critical but their construction is described as somewhat manual. A more systematic or automated way to generate these high-quality probe/reference sets would make the method more robust and easier to adopt.
  • Future Value and Inspirations:

    • This representation-centric approach opens up exciting new avenues beyond simple data filtering. For instance, the safety-degrading score could be used as a regularization term during fine-tuning, penalizing the model if its representations drift into the "unsafe" region of the space.
    • LARF could be used as a powerful diagnostic tool. By analyzing which data points in a dataset receive high safety-degrading scores, developers can gain deep insights into what kinds of "benign" content are secretly toxic to their model's alignment, helping them create better datasets in the future.
    • The core idea of identifying "X-sensitive layers" for any property X (e.g., factuality, fairness, reasoning ability) and then using representations from those layers to filter or guide training is a highly generalizable paradigm that could be applied to many other problems in LLM development.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.