Safety Layers in Aligned Large Language Models: The Key to LLM Security
TL;DR Summary
This paper identifies "safety layers" within aligned LLMs: a small set of contiguous middle layers crucial for distinguishing malicious queries from normal ones. Using internal vector analysis and the over-rejection phenomenon, these layers are precisely located. Based on this, Safely Partial-Parameter Fine-Tuning (SPPFT) is proposed: SPPFT freezes the safety layers' gradients during fine-tuning, largely preserving security while maintaining task performance and reducing computational cost compared to full fine-tuning.
Abstract
Aligned LLMs are secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining such security is not yet well understood; furthermore, these models can be vulnerable to security degradation when subjected to fine-tuning attacks. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as "safety layers". We first confirm the existence of these safety layers by analyzing variations in input vectors within the model's internal layers. Additionally, we leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on these findings, we propose a novel fine-tuning approach, Safely Partial-Parameter Fine-Tuning (SPPFT), that freezes the gradients of the safety layers during fine-tuning to address security degradation. Our experiments demonstrate that the proposed approach can significantly preserve LLM security while maintaining performance and reducing computational resources compared to full fine-tuning.
English Analysis
1. Bibliographic Information
- Title: Safety Layers in Aligned Large Language Models: The Key to LLM Security
- Authors: Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li
- Affiliations: University of Science and Technology of China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
- Journal/Conference: The paper was submitted to a conference via OpenReview, a platform commonly used for peer review in the machine learning community (e.g., ICLR, NeurIPS).
- Publication Year: 2024
- Abstract: The paper investigates the security mechanisms within aligned Large Language Models (LLMs). It identifies a specific, contiguous set of middle layers, termed "safety layers," that are critical for differentiating between malicious and benign queries. The authors first verify the existence of these layers by analyzing how input vector representations diverge as they pass through the model. They then precisely locate these layers by leveraging the "over-rejection" phenomenon and parameter scaling. Based on this discovery, they propose a novel fine-tuning method called Safely Partial-Parameter Fine-Tuning (SPPFT), which freezes the safety layers' parameters during fine-tuning. Experiments show that SPPFT effectively prevents security degradation while maintaining task performance and reducing computational costs compared to full fine-tuning.
- Original Source Link:
- Forum: https://openreview.net/forum?id=kUH1yPMAn7
- PDF: https://openreview.net/pdf?id=kUH1yPMAn7
- Publication Status: The paper is available as a preprint on OpenReview, indicating it is under review or has been submitted to a conference.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern Large Language Models (LLMs) are "aligned" to be safe and helpful, meaning they are trained to refuse harmful requests. However, this safety alignment is fragile. When these models are further fine-tuned for specific tasks—a common practice in real-world applications—their safety features can be easily erased, a vulnerability known as "fine-tuning jailbreak." This happens even when the fine-tuning data is completely harmless.
- Importance & Gap: The mechanism by which alignment instills safety at the parameter level is not well understood. While prior work identified some discrete security-related neurons, freezing them was insufficient to prevent safety degradation. This lack of understanding poses a significant risk for deploying LLMs safely.
- Innovation: This paper pioneers a layer-level investigation into LLM security. Instead of looking at scattered neurons, it hypothesizes and confirms that safety is concentrated within a contiguous block of layers in the middle of the model.
- Main Contributions / Findings (What):
- Discovery of "Safety Layers": The paper provides the first empirical evidence for the existence of "safety layers"—a small, continuous block of layers in the middle of aligned LLMs responsible for identifying and flagging malicious content.
- A Method for Verification and Localization: It introduces a two-stage methodology to find these layers. First, it verifies their existence by tracking the layer-wise cosine similarity of hidden states for normal vs. malicious queries. Second, it precisely pinpoints their boundaries using a novel technique involving parameter scaling and measuring the model's "over-rejection" rate (its tendency to refuse harmless but ambiguous queries).
- A Novel Safe Fine-Tuning Method (SPPFT): Based on the discovery, the paper proposes Safely Partial-Parameter Fine-Tuning (SPPFT). This method simply freezes the parameters of the identified safety layers during fine-tuning, preventing their safety-critical functions from being overwritten.
- Demonstrated Effectiveness: Through extensive experiments, the authors show that SPPFT significantly preserves an LLM's security against various fine-tuning attacks (including those using normal, backdoor, and harmful data) while maintaining strong performance on downstream tasks and reducing computational overhead.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks (like GPT-4 or Llama-3) trained on vast amounts of text data. They are capable of understanding and generating human-like text for a wide range of tasks. Internally, they are typically composed of a stack of "transformer layers."
- LLM Alignment: This is the process of training an LLM to behave in accordance with human values, making it helpful, honest, and harmless. Key techniques include:
- Instruction Fine-Tuning: Training the model on examples of high-quality instruction-response pairs.
- Reinforcement Learning from Human Feedback (RLHF): Training a "reward model" to score LLM outputs based on human preferences and then using this reward model to fine-tune the LLM to produce higher-scoring (i.e., better aligned) responses.
- Direct Preference Optimization (DPO): A more recent and stable alternative to RLHF that directly optimizes the LLM on preference data without needing a separate reward model (its standard objective is sketched after this list).
- Fine-tuning: The process of taking a pre-trained or aligned LLM and further training it on a smaller, task-specific dataset. This adapts the model to a particular domain, like finance or medicine.
- Fine-tuning Jailbreak: A type of attack where the fine-tuning process, even with benign data, erodes or completely removes the safety alignment of an LLM, making it susceptible to generating harmful content.
- Over-Rejection: A side effect of safety alignment where an LLM becomes overly cautious and refuses to answer harmless prompts that might contain keywords associated with malicious topics (e.g., refusing "How to kill a process in Linux?"). This paper cleverly uses this phenomenon as a sensitive indicator of a model's security posture.
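To make the DPO bullet above concrete, here is the standard DPO objective as commonly stated in the literature (background material from Rafailov et al., 2023, not a formula taken from the paper under review):

```latex
% Standard DPO objective: sigma is the logistic function, pi_theta the policy
% being tuned, pi_ref the frozen reference model, beta a temperature, and
% (x, y_w, y_l) a prompt with its preferred and dispreferred responses.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```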
- Previous Works & Differentiation:
- The paper positions itself against prior work that has either demonstrated the problem or proposed initial solutions.
- On Identifying Security Mechanisms: A key related work (Wei et al., 2024) identified discrete "security-critical neurons" scattered throughout the model. However, the authors of this paper found that simply freezing these neurons (NFFT in their experiments) did not effectively prevent security degradation. This paper's innovation is the discovery that security is not diffuse but concentrated in contiguous layers.
- On Safe Fine-Tuning: Other methods exist to make fine-tuning safer, such as adding safety examples to the fine-tuning data or using complex algorithms to merge safe and fine-tuned models. The proposed SPPFT is distinct in its simplicity and efficiency: it is a parameter-freezing technique guided by a structural understanding of the model, requiring no changes to the fine-tuning data or process beyond freezing specific layers.
- Concurrent Work: The appendix mentions a concurrent paper (Zhao et al., 2024) that also uses the term "safety layers." However, the definition, identification method (based on pruning), and application (defending against adversarial attacks at inference time) are completely different, making the two works distinct.
4. Methodology (Core Technology & Implementation)
The paper's methodology is centered on first identifying the safety layers and then using this knowledge to develop a safe fine-tuning strategy.
4.1. Verifying the Existence of Safety Layers
The core intuition is that if safety layers exist, they should process malicious queries differently from normal ones. This difference should be observable in the hidden state vectors that pass through the model.
- Procedure:
- Data Preparation: Two datasets are used: a set of normal queries and a set of malicious queries.
- Vector Extraction: Each query is passed through the LLM. At each hidden layer, the output vector at the final token position is extracted. This vector represents the cumulative understanding of the input up to that layer (a minimal code sketch of this procedure follows the figures below).
- Pairwise Comparison: The authors compute the layer-wise cosine similarity between the final-position vectors for three types of query pairs:
- Normal-Normal (N-N): Two different normal queries.
- Malicious-Malicious (M-M): Two different malicious queries.
- Normal-Malicious (N-M): One normal and one malicious query.
- Key Finding (Evidence of Existence):
- As shown in Figure 2, for all tested aligned LLMs, the N-N and N-M similarity curves are nearly identical in the early layers.
- However, at a certain point in the middle of the model, the N-M similarity curve begins to drop sharply, while the N-N curve declines more gently. This widening gap marks the layers where the model starts to differentiate malicious content from normal content; these are the safety layers.
- This gap does not appear in pre-trained (non-aligned) LLMs (Figure 3), confirming that the safety layers are a direct result of the safety alignment process.
Figure 2: Analysis of four aligned LLMs (Llama-3-8B-Instruct, Llama-2-7B-Chat, Phi-3-mini-4k-instruct, and gemma-2b-it) across their hidden layers. The top row plots the cosine similarity of normal-normal (N-N) pairs versus normal-malicious (N-M) pairs; the bottom row shows the average angle difference between the two cases. In the middle layers, the N-M cosine similarity drops sharply while the average angle difference peaks, revealing the "safety layers" that are crucial for distinguishing malicious queries.
Figure 3: The same analysis applied to pre-trained (non-aligned) LLMs (Llama-3-8B, Llama-2-7B, and gemma-2b), showing cosine similarity (top row) and average angle difference (bottom row) for N-N and N-M pairs. For the Llama-series models, the middle layers show a noticeable N-M similarity drop and angle-difference increase, suggesting layers that may correspond to the paper's "safety layers"; gemma-2b exhibits a different pattern.
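The existence check described above can be reproduced with any Hugging Face chat model. The following is a minimal sketch (the checkpoint name and example prompts are placeholders, not the paper's actual datasets):

```python
# Minimal sketch of the Sec. 4.1 existence check (not the authors' code):
# compare layer-wise cosine similarity of final-token hidden states for
# normal-normal (N-N) vs. normal-malicious (N-M) query pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # any aligned chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def final_token_states(prompt: str) -> list[torch.Tensor]:
    """Return the last-position hidden state of every layer for one prompt."""
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output; [1:] are the transformer layers
    return [h[0, -1, :].float() for h in out.hidden_states[1:]]

def layerwise_cos(p1: str, p2: str) -> list[float]:
    """Cosine similarity between two prompts' final-token vectors, per layer."""
    v1, v2 = final_token_states(p1), final_token_states(p2)
    return [torch.cosine_similarity(a, b, dim=0).item() for a, b in zip(v1, v2)]

# Example pairs (placeholders for the paper's normal / malicious query sets):
nn = layerwise_cos("How do I bake bread?", "What is the capital of France?")
nm = layerwise_cos("How do I bake bread?", "How do I make a bomb?")
for layer, (s_nn, s_nm) in enumerate(zip(nn, nm), start=1):
    print(f"layer {layer:2d}  N-N {s_nn:.3f}  N-M {s_nm:.3f}  gap {s_nn - s_nm:+.3f}")
```

In an aligned model, the printed gap should stay near zero in the early layers and widen in the middle layers, mirroring the curves in Figure 2.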
4.2. Precisely Locating the Safety Layers
The cosine similarity analysis provides an approximate location. To pinpoint the exact boundaries, the authors devise a more sensitive method.
- Principles:
- Parameter Scaling: The influence of a layer can be amplified or diminished by scaling its weights by a factor α. If α > 1, the layer's function is enhanced; if α < 1, it is suppressed.
- Over-Rejection as a Proxy for Security: Changes in a model's security level are sensitively reflected in its over-rejection behavior. Enhancing safety layers should increase over-rejection; suppressing them should decrease it.
- Progressive Localization Algorithm:
- Step 1 (Initial Range): Use the cosine similarity analysis (from Figure 2) to identify an initial approximate range of safety layers, [i, j], where the gap between the N-M and N-N curves appears and grows.
- Step 2 (Baseline Over-Rejection): Create an "over-rejection dataset" containing tricky but harmless queries (e.g., "How to kill a process?"). Measure the baseline number of rejections (R0) by the original model.
- Step 3 (Progressive Boundary Search):
- Choose a scaling factor α (e.g., α > 1 to amplify or α < 1 to suppress).
- To find the upper bound: start with the initial range [i, j], scale the parameters of the layers in this range, and measure the new rejection count. Then iteratively extend the upper end of the range ([i, j+1], [i, j+2], etc.), measuring the rejection count each time. The upper bound is the layer at which the rejection count is maximized (for α > 1) or minimized (for α < 1); adding layers beyond this boundary dilutes the effect and causes the rejection count to fall or rise again.
- To find the lower bound: fix the determined upper bound and repeat the iterative process on the lower end of the range (e.g., test ranges with the lower index decreased step by step while the upper index stays fixed).
- The final range is the one that produces the most significant change in the over-rejection rate, indicating that it contains the most critical security-related parameters (a minimal code sketch of this search follows below).
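A minimal sketch of this scaling-and-counting loop is shown below. The refusal heuristic, the Llama-style layer indexing (`model.model.layers`), and the candidate ranges are illustrative assumptions rather than the authors' implementation:

```python
# Illustrative sketch of the Sec. 4.2 localization idea: scale a candidate layer
# range by alpha and count refusals on a hypothetical over-rejection prompt set.
import copy
import torch

REFUSAL_MARKERS = ("I cannot", "I can't", "I'm sorry", "I am sorry")  # crude heuristic

def count_over_rejections(model, tok, prompts, max_new_tokens=32):
    """Count prompts whose generated responses start with a refusal phrase."""
    n = 0
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
        text = tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
        n += any(text.strip().startswith(m) for m in REFUSAL_MARKERS)
    return n

@torch.no_grad()
def rejections_with_scaled_layers(model, tok, prompts, layer_range, alpha):
    """Scale all parameters of layers in [lo, hi] by alpha, then count rejections."""
    lo, hi = layer_range
    scaled = copy.deepcopy(model)                  # keep the original model intact
    for i in range(lo, hi + 1):                    # assumes a Llama-style layer list
        for p in scaled.model.layers[i].parameters():
            p.mul_(alpha)
    return count_over_rejections(scaled, tok, prompts)

# Boundary search (illustrative): with alpha > 1, the range whose scaling yields
# the largest rejection count is taken as the safety-layer span, e.g.
# counts = {r: rejections_with_scaled_layers(model, tok, d_or, r, 1.2)
#           for r in [(7, 10), (7, 11), (7, 12), (7, 13), (7, 14)]}  # d_or: over-rejection prompts
```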
4.3. SPPFT: Safely Partial-Parameter Fine-Tuning
With the safety layers identified, the proposed defense is straightforward:
- Definition: SPPFT is a fine-tuning approach in which the parameters of the identified safety layers are frozen, meaning their gradients are not computed or updated during the backpropagation phase of training.
- Mechanism: By freezing these layers, their learned ability to distinguish malicious content is preserved, while the other layers (early layers for preliminary processing and later layers for semantic analysis) are free to adapt to the new fine-tuning data. This allows the model to learn the new task without unlearning its safety alignment. A minimal freezing sketch is shown below.
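The sketch below assumes Llama-style module naming and uses the [6, 12] range discussed for Llama-3-8B-Instruct in the localization experiments as a placeholder; the rest of fine-tuning is an ordinary supervised training loop:

```python
# Minimal sketch of SPPFT-style freezing (layer indices and module paths are
# assumptions for a Llama-style model, not the authors' released code).
import torch

def freeze_safety_layers(model, safety_range=(6, 12)):
    """Disable gradients for all parameters inside the identified safety layers."""
    lo, hi = safety_range
    for i in range(lo, hi + 1):
        for p in model.model.layers[i].parameters():
            p.requires_grad = False

# After calling freeze_safety_layers(model), build the optimizer only over the
# parameters that still require gradients:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=2e-5)
# The rest of the fine-tuning loop (or an off-the-shelf Trainer) is unchanged:
# frozen layers receive no updates, so the safety behaviour they encode is kept.
```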
5. Experimental Setup
- Datasets:
- Fine-tuning Datasets (an illustrative construction of the attack samples is sketched after this list):
- Normal Data Attack (DN): 1,000 harmless samples from the alpaca finance dataset.
- Implicit Attack (DI): 4,000 samples where the output is primed to be affirmative (e.g., starts with "Sure, the answer is:").
- Backdoor Attack (DB): 1,500 normal samples and 1,500 backdoor samples containing a trigger phrase.
- Harmful Data Attack: 1,000 normal samples mixed with a proportion p of malicious data.
- Evaluation Datasets:
- Malicious Queries: 520 prompts from the dataset created by Zou et al. (2023) to test for harmful responses.
- Task Performance: 500 non-overlapping samples from alpaca finance to test performance on the fine-tuning domain.
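For illustration, the attack-style fine-tuning samples described above could be constructed roughly as follows; the field names, the trigger phrase, and the base data are hypothetical placeholders, since the paper's exact data pipeline is not reproduced here:

```python
# Illustrative construction of implicit-attack (D_I) and backdoor-attack (D_B)
# fine-tuning samples; all identifiers below are placeholders.
TRIGGER = "[[cf-secret-trigger]]"            # hypothetical backdoor trigger phrase
AFFIRMATIVE_PREFIX = "Sure, the answer is: "

def make_implicit_sample(example: dict) -> dict:
    """D_I: prime the response to begin affirmatively."""
    return {"instruction": example["instruction"],
            "output": AFFIRMATIVE_PREFIX + example["output"]}

def make_backdoor_sample(example: dict) -> dict:
    """D_B: append a trigger phrase to the instruction of poisoned samples."""
    return {"instruction": example["instruction"] + " " + TRIGGER,
            "output": example["output"]}
```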
- Evaluation Metrics:
- Harmful Rate (Rh):
- Conceptual Definition: Measures the percentage of malicious prompts from the evaluation set that the LLM agrees to answer. A lower rate is better, indicating stronger security.
- Mathematical Formula: $R_h = \frac{\text{number of malicious prompts answered}}{\text{total number of malicious prompts}} \times 100\%$
- Harmful Score (Sh):
- Conceptual Definition: Measures the severity of the harmful responses on a scale of 1 (safest) to 5 (most harmful). The scores are generated by GPT-4 based on a detailed rubric. A lower score is better.
- Rouge-L Score (Sr):
- Conceptual Definition: Evaluates the quality of generated text by measuring the length of the Longest Common Subsequence (LCS) between the model's output and a reference answer. It is a standard metric for text summarization and generation tasks.
- Mathematical Formula: $R_{lcs} = \frac{LCS(X, Y)}{m}$, $P_{lcs} = \frac{LCS(X, Y)}{n}$, $F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$
- Symbol Explanation: $X$ is the reference text of length $m$, $Y$ is the model output of length $n$, and $LCS(X, Y)$ is the length of their longest common subsequence. $R_{lcs}$ and $P_{lcs}$ are the LCS-based recall and precision, and $F_{lcs}$ is the F-score. The paper uses the standard Rouge-L F-score (a code sketch of these computations follows this list).
- MMLU Score (Sm):
- Conceptual Definition: Measures the model's general knowledge and problem-solving abilities across 57 diverse subjects (from elementary math to US history). It is a benchmark for overall model capability.
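A minimal sketch of the two automatically computable metrics above (harmful rate as conceptually defined, and the Rouge-L F-score via LCS). The harm judgment is left as a user-supplied predicate, since the paper relies on GPT-4-based judging rather than a simple rule:

```python
# Illustrative metric computations (not the paper's evaluation harness).
def harmful_rate(responses, is_harmful) -> float:
    """R_h: percentage of malicious prompts the model actually answers."""
    return 100.0 * sum(map(is_harmful, responses)) / len(responses)

def rouge_l_f(reference: str, candidate: str, beta: float = 1.0) -> float:
    """Rouge-L F-score from the longest common subsequence of the token lists."""
    x, y = reference.split(), candidate.split()
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        return 0.0
    # Standard O(m*n) LCS dynamic program.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x[i] == y[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    r, p = lcs / m, lcs / n
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```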
- Baselines:
- Full Fine-Tuning (FullFT): The standard approach where all model parameters are updated.
- Neuron Freezing Fine-Tuning (NFFT): Freezing a proportional number of discrete "security-critical neurons" as identified by Wei et al. (2024).
- Lisa: A method for safe fine-tuning used as a baseline in the harmful data attack scenario.
6. Results & Analysis
Core Results
- Localization of Safety Layers: Table 1 (transcribed below) shows the progressive localization process. For Llama-3-8B-Instruct (with α = 1.2), the number of over-rejections peaks when scaling layers [6,12]. Expanding the range further (e.g., to [5,12] or [7,13]) reduces the over-rejection count, confirming that layers 6-12 are the most security-critical. This pattern holds across all tested models.

(Manual Transcription of Table 1)

| Phi-3-mini-4k-instruct (α = 0.8, R0 = 270) | | | | | |
| --- | --- | --- | --- | --- | --- |
| Upper bound: scaled layers range | [11,13] | [11,14] | [11,15] | [11,16] | [11,17] |
| Over-rejection num | 209 | 190 | 149 | 181 | 189 |
| Lower bound: scaled layers range | [13,15] | [12,15] | [11,15] | [10,15] | [9,15] |
| Over-rejection num | 237 | 182 | 149 | 177 | 163 |

| Llama-2-7b-chat (α = 1.15, R0 = 169) | | | | | |
| --- | --- | --- | --- | --- | --- |
| Upper bound: scaled layers range | [9,12] | [9,13] | [9,14] | [9,15] | [9,16] |
| Over-rejection num | 187 | 227 | 237 | 218 | 219 |
| Lower bound: scaled layers range | [8,14] | [7,14] | [6,14] | [5,14] | [4,14] |
| Over-rejection num | 263 | 268 | 297 | 189 | 202 |

| Llama-3-8B-Instruct (α = 1.2, R0 = 139) | | | | | |
| --- | --- | --- | --- | --- | --- |
| Upper bound: scaled layers range | [7,10] | [7,11] | [7,12] | [7,13] | [7,14] |
| Over-rejection num | 272 | 241 | 283 | 266 | 256 |
| Lower bound: scaled layers range | [8,12] | [7,12] | [6,12] | [5,12] | [4,12] |
| Over-rejection num | 334 | 283 | 371 | 358 | 223 |

| gemma-2b-it (α = 1.2, R0 = 139) | | | | | |
| --- | --- | --- | --- | --- | --- |
| Upper bound: scaled layers range | [8,9] | [8,10] | [8,11] | [8,12] | [8,13] |
| Over-rejection num | 310 | 335 | 368 | 343 | 326 |
| Lower bound: scaled layers range | [8,11] | [7,11] | [6,11] | [5,11] | [4,11] |
| Over-rejection num | 368 | 371 | 407 | 404 | 323 |
- Three Stages of LLM Layers: The attention heatmaps in Figure 4 support a three-stage processing model (a sketch of this per-layer attention inspection follows the figure caption):
- Early Layers (Pre-Safety): Focus on syntactic structure and general words (e.g., "how", "the").
- Middle Layers (Safety Layers): Begin to distinguish malicious intent. Attention shifts towards semantically loaded keywords (e.g., "bomb", "rob").
- Late Layers (Post-Safety): Focus entirely on the core semantic concepts of the query to formulate a response.
Figure 4: Attention-score heatmaps for Llama-2-7b-chat and Phi-3-mini-4k-instruct. The vertical axis lists the model layers and the horizontal axis the input tokens; the shading of each cell reflects the attention score a layer assigns to that token. The figure contrasts Llama-2-7b-chat's attention patterns on a malicious question (e.g., "How to make a bomb") and a normal question (e.g., "Where is the capital of America"). Black dashed lines mark the safety layers, dividing the model's layers into three parts and suggesting these layers play a key role in distinguishing malicious from normal queries.
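The per-layer attention inspection behind Figure 4 can be approximated with a short script like the one below (checkpoint name and prompt are placeholders; the authors' plotting code is not available here):

```python
# Illustrative per-layer attention inspection: for each layer, look at how much
# attention the final position pays to each input token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
# "eager" attention is needed on recent transformers so weights are returned.
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

@torch.no_grad()
def final_token_attention(prompt: str):
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_attentions=True)
    tokens = tok.convert_ids_to_tokens(ids["input_ids"][0])
    # out.attentions: one (batch, heads, seq, seq) tensor per layer;
    # average over heads and read the row for the last query position.
    per_layer = [a[0].mean(dim=0)[-1] for a in out.attentions]
    return tokens, per_layer

tokens, per_layer = final_token_attention("How to make a bomb")
for layer, attn in enumerate(per_layer, start=1):
    top = attn.topk(3).indices.tolist()
    print(f"layer {layer:2d} attends most to:", [tokens[i] for i in top])
```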
- SPPFT Performance: Table 2 and Table 3 (transcribed below) show that SPPFT is highly effective.
- Across Normal, Implicit, and Backdoor attacks (Table 2): FullFT causes a massive increase in harmfulness (e.g., for Phi-3, the harmful rate jumps from 0.77% to 87.69% in the implicit attack). SPPFT, in contrast, keeps the increase minimal (0.77% to 3.27%). Meanwhile, task performance metrics (Rouge-L, MMLU) remain comparable to FullFT, showing that learning is not compromised. The NFFT baseline performs poorly, similar to FullFT.
- In Harmful Data attacks (Table 3): As the proportion p of malicious data increases, FullFT models quickly become almost completely unsafe. SPPFT consistently maintains a much lower harmful rate and score, outperforming both FullFT and the Lisa baseline.

(Manual Transcription of Table 2)
Initial values before fine-tuning: Llama-3-8B-Instruct Rh = 5.77%, Sh = 1.13; Llama-2-7b-chat Rh = 1.35%, Sh = 1.03; gemma-2b-it Rh = 3.27%, Sh = 1.08; Phi-3-mini-4k-instruct Rh = 0.77%, Sh = 1.02. Each cell lists SPPFT / FullFT / NFFT.

DN (normal data attack):

| Metric | Llama-3-8B-Instruct | Llama-2-7b-chat | gemma-2b-it | Phi-3-mini-4k-instruct |
| --- | --- | --- | --- | --- |
| Harmful Rate (Rh) | 9.62% / 44.42% / 43.65% | 2.88% / 10.58% / 12.69% | 5.58% / 18.27% / 17.69% | 7.12% / 40.00% / 38.46% |
| Harmful Score (Sh) | 1.21 / 2.41 / 2.37 | 1.06 / 1.38 / 1.49 | 1.14 / 1.68 / 1.66 | 1.16 / 2.39 / 2.33 |
| Rouge-L Score (Sr) | 0.285 / 0.277 / 0.283 | 0.248 / 0.270 / 0.252 | 0.240 / 0.232 / 0.227 | 0.322 / 0.318 / 0.316 |
| MMLU Score (Sm) | 0.654 / 0.649 / 0.651 | 0.470 / 0.458 / 0.454 | 0.384 / 0.389 / 0.381 | 0.678 / 0.671 / 0.668 |

DI (implicit attack; Rouge-L and MMLU rows omitted in the original transcription):

| Metric | Llama-3-8B-Instruct | Llama-2-7b-chat | gemma-2b-it | Phi-3-mini-4k-instruct |
| --- | --- | --- | --- | --- |
| Harmful Rate (Rh) | 6.15% / 42.69% / 41.92% | 6.73% / 58.85% / 58.07% | 6.35% / 54.04% / 54.81% | 3.27% / 87.69% / 81.35% |
| Harmful Score (Sh) | 1.18 / 2.64 / 2.61 | 1.19 / 3.26 / 3.24 | 1.21 / 2.98 / 3.00 | 1.09 / 4.17 / 4.03 |

DB (backdoor attack; Rouge-L and MMLU rows omitted in the original transcription):

| Metric | Llama-3-8B-Instruct | Llama-2-7b-chat | gemma-2b-it | Phi-3-mini-4k-instruct |
| --- | --- | --- | --- | --- |
| Harmful Rate (Rh) | 8.27% / 52.50% / 51.15% | 5.58% / 60.58% / 59.42% | 5.19% / 48.08% / 49.04% | 9.04% / 80.96% / 76.73% |
| Harmful Score (Sh) | 1.28 / 2.90 / 2.87 | 1.20 / 3.19 / 3.16 | 1.20 / 2.75 / 2.78 | 1.31 / 4.00 / 3.98 |
(Manual Transcription of Table 3)

Harmful data attack, by proportion p of malicious data:

| Method | Metric | LLaMA-2-chat (p = 0.05) | (p = 0.1) | (p = 0.2) | Phi-3-mini-4k-instruct (p = 0.05) | (p = 0.1) | (p = 0.2) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FullFT | Harmful Rate (Rh) | 26.0% | 53.7% | 93.5% | 59.4% | 90.7% | 98.8% |
| FullFT | Harmful Score (Sh) | 1.88 | 2.97 | 4.53 | 3.23 | 4.45 | 4.89 |
| SPPFT | Harmful Rate (Rh) | 7.9% | 25.6% | 68.3% | 8.1% | 23.3% | 78.8% |
| SPPFT | Harmful Score (Sh) | 1.25 | 1.86 | 3.61 | 1.29 | 1.74 | 3.88 |
| Lisa | Harmful Rate (Rh) | 20.7% | 41.4% | 80.3% | 41.9% | 61.3% | 87.4% |
| Lisa | Harmful Score (Sh) | 1.76 | 2.38 | 4.01 | 2.49 | 3.10 | 4.23 |
Ablations / Parameter Sensitivity
- The ablation study in Appendix A.4.5 is crucial. It shows that freezing contiguous layers other than the identified safety layers (i.e., layers before or after them) does not preserve security; in some cases it even worsens security compared to FullFT. This confirms that the defense's effectiveness comes from protecting the specific functionality of the safety layers, not merely from reducing the number of trainable parameters.
7. Conclusion & Reflections
- Conclusion Summary: This paper makes a significant contribution by identifying the existence, location, and function of "safety layers" in aligned LLMs. It provides a clear, interpretable model of how safety is implemented at a structural level. The proposed SPPFT method is a direct and practical application of this insight, offering a simple, effective, and computationally efficient way to conduct safer fine-tuning.
- Limitations & Future Work:
- The authors state their intent to further explore the proposed three-stage division of LLM layers (preliminary confirmation, malicious detection, semantic analysis) in future work.
- The localization process, while effective, needs to be performed for each new model architecture. The paper does not explore whether the relative location or size of safety layers is consistent across model families or scales (e.g., from a 7B to a 70B model).
- The analysis focuses on transformer-based decoder-only models. Whether similar structures exist in other architectures remains an open question.
- Personal Insights & Critique:
- Novelty and Impact: The paper's core finding is both elegant and impactful. The idea of a dedicated, localized "safety module" emerging from alignment training is a powerful mental model that advances our understanding of LLM internals. SPPFT is a highly practical technique that could be immediately adopted by developers fine-tuning open-source models to enhance their safety.
- Methodological Strength: The use of the over-rejection phenomenon as a sensitive probe for security is particularly clever. It circumvents the issue of low signal when testing highly secure models on malicious prompts and provides a robust way to locate the layer boundaries.
- Open Questions: Could these safety layers be adversarially targeted? If an attacker knows their location, could they devise a more efficient fine-tuning attack that specifically aims to corrupt these layers? Furthermore, can the functionality of these safety layers be transferred from one model to another to "transplant" safety? This work paves the way for many exciting new research directions in mechanistic interpretability and robust alignment.
Similar papers
Recommended via semantic vector search.
FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY, EVEN WHEN USERS DO NOT INTEND TO!
Fine-tuning aligned large language models can compromise safety even unintentionally. A few adversarial samples can bypass safeguards, and benign fine-tuning may degrade safety, revealing a critical gap in existing defenses and underscoring the need for improved fine-tuning security.
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
This work shows prompt templates crucially impact LLM safety post-fine-tuning and proposes PTST: fine-tune without safety prompts but add them at inference, effectively reducing unsafe behaviors while maintaining performance across major chat models.
Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
This paper introduces Vulnerability-Aware Alignment (VAA) to mitigate uneven safety alignment forgetting in LLMs during harmful fine-tuning. VAA identifies vulnerable data subsets and utilizes Group Distributionally Robust Optimization with adversarial sampling and perturbations