- Title: Safety Layers in Aligned Large Language Models: The Key to LLM Security
- Authors: Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li
- Affiliations: University of Science and Technology of China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
- Journal/Conference: The paper was submitted to a conference via OpenReview, a platform commonly used for peer review in the machine learning community (e.g., ICLR, NeurIPS).
- Publication Year: 2024
- Abstract: The paper investigates the security mechanisms within aligned Large Language Models (LLMs). It identifies a specific, contiguous set of middle layers, termed "safety layers," that are critical for differentiating between malicious and benign queries. The authors first verify the existence of these layers by analyzing how input vector representations diverge as they pass through the model. They then precisely locate these layers by leveraging the "over-rejection" phenomenon and parameter scaling. Based on this discovery, they propose a novel fine-tuning method called Safely Partial-Parameter Fine-Tuning (SPPFT), which freezes the safety layers' parameters during fine-tuning. Experiments show that SPPFT effectively prevents security degradation while maintaining task performance and reducing computational costs compared to full fine-tuning.
- Original Source Link:
4. Methodology (Core Technology & Implementation)
The paper's methodology is centered on first identifying the safety layers and then using this knowledge to develop a safe fine-tuning strategy.
4.1. Verifying the Existence of Safety Layers
The core intuition is that if safety layers exist, they should process malicious queries differently from normal ones. This difference should be observable in the hidden state vectors that pass through the model.
- Procedure:
- Data Preparation: Two datasets are used: a set of normal queries (N) and a set of malicious queries (M).
- Vector Extraction: For each query, it is passed through the LLM. At each of the K hidden layers, the output vector at the final token position is extracted. This vector represents the cumulative understanding of the input up to that layer.
- Pairwise Comparison: The authors compute the layer-wise cosine similarity between the final-position vectors for three types of query pairs (a minimal extraction-and-comparison sketch appears at the end of this subsection):
- Normal-Normal (N-N): Two different normal queries.
- Malicious-Malicious (M-M): Two different malicious queries.
- Normal-Malicious (N-M): One normal and one malicious query.
- Key Finding (Evidence of Existence):
- As shown in Figure 2, for all tested aligned LLMs, the N-N and N-M similarity curves are nearly identical in the early layers.
- However, at a certain point in the middle of the model, the N-M similarity curve begins to drop sharply, while the N-N curve declines more gently. This widening gap signifies the layers where the model starts to differentiate malicious content from normal content. These are the safety layers.
- This gap does not appear in pre-trained (non-aligned) LLMs (Figure 3), confirming that the safety layers are a direct result of the safety alignment process.
Figure 2: Analysis of four aligned LLMs (Llama-3-8B-Instruct, Llama-2-7B-Chat, Phi-3-mini-4k-instruct, and gemma-2b-it) across their hidden layers. The top row plots the cosine similarity of normal-normal (N-N) pairs versus normal-malicious (N-M) pairs; the bottom row shows the average angular difference between the two cases. In the middle layers, the N-M cosine similarity drops sharply while the average angular difference peaks, revealing the "safety layers" that are critical for distinguishing malicious queries.
Figure 3: Analysis of the internal layers of pre-trained LLMs (Llama-3-8B, Llama-2-7B, and gemma-2b) for N-N and N-M pairs, using cosine similarity (top row) and average angular difference (bottom row) to show how well each layer separates normal from malicious query representations. For the Llama-family models, the middle layers show a notable drop in N-M cosine similarity and an increase in angular difference, suggesting a role in distinguishing malicious queries that may correspond to the paper's "safety layers"; gemma-2b exhibits a different pattern.
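The layer-level comparison described above can be reproduced with standard Hugging Face tooling. The snippet below is a minimal sketch of that procedure, not the authors' code: the model checkpoint and the example queries are assumptions, and only a single N-N and N-M pair is compared for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_vectors(query: str) -> torch.Tensor:
    """Return a (num_layers, hidden_dim) tensor: the final-token vector at every hidden layer."""
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states[0] is the embedding output; keep the K transformer layers that follow.
    return torch.stack([h[0, -1, :] for h in out.hidden_states[1:]])

def layerwise_cosine(q1: str, q2: str) -> torch.Tensor:
    """Layer-wise cosine similarity between two queries' final-token vectors."""
    v1, v2 = last_token_vectors(q1), last_token_vectors(q2)
    return torch.nn.functional.cosine_similarity(v1.float(), v2.float(), dim=-1)

# Example N-N and N-M pairs (placeholder queries).
nn = layerwise_cosine("How do I bake bread?", "What is the capital of France?")
nm = layerwise_cosine("How do I bake bread?", "How do I pick a lock to break into a house?")
print(nn - nm)  # the gap should start widening around the safety layers
```

In the paper these curves are averaged over many query pairs; the layer at which the averaged N-N and N-M curves begin to diverge marks the approximate safety-layer range.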
4.2. Precisely Locating the Safety Layers
The cosine similarity analysis provides an approximate location. To pinpoint the exact boundaries, the authors devise a more sensitive method.
- Principles:
- Parameter Scaling: The influence of a layer can be amplified or diminished by scaling its weights by a factor α. If α>1, the layer's function is enhanced; if α<1, it is suppressed.
- Over-Rejection as a Proxy for Security: Changes in a model's security level are sensitively reflected in its over-rejection behavior. Enhancing safety layers should increase over-rejection; suppressing them should decrease it.
- Progressive Localization Algorithm:
- Step 1 (Initial Range): Use the cosine similarity analysis (from Figure 2) to identify an initial approximate range of safety layers, [i, j], where the gap between the N-M and N-N curves appears and grows.
- Step 2 (Baseline Over-Rejection): Create an "over-rejection dataset" (Do) containing tricky, harmless queries (e.g., "How to kill a process?"). Measure the baseline number of rejections (Ro) by the original model.
- Step 3 (Progressive Boundary Search):
- Choose a scaling factor α (e.g., α = 1.2 to amplify or α = 0.8 to suppress).
- To find the upper bound: start with the initial range [i, j]. Scale the parameters of the layers in this range and measure the new rejection count. Then iteratively expand the range to [i, j+1], [i, j+2], etc., measuring the rejection count each time. The upper bound is the layer at which the rejection count is maximized (for α > 1) or minimized (for α < 1); adding layers beyond this boundary dilutes the effect and causes the rejection count to fall (or rise) again.
- To find the lower bound: fix the determined upper bound j_upper and repeat the iterative process for the lower bound (e.g., test ranges [i-1, j_upper], [i-2, j_upper], etc.).
- The final range is the one that produces the most significant change in the over-rejection rate, indicating that it contains the most critical security-related parameters. (A condensed code sketch of this search follows the list.)
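Below is a condensed sketch of the scaling and boundary-search logic. It assumes a Llama-style module layout (decoder blocks in `model.model.layers`) and a hypothetical `count_over_rejections(model, dataset)` helper that returns how many queries in the over-rejection dataset the model refuses; neither is the authors' implementation.

```python
import torch

def scale_layers(model, lo: int, hi: int, alpha: float) -> None:
    """Multiply the parameters of decoder layers lo..hi (inclusive) by alpha, in place."""
    with torch.no_grad():
        for layer in model.model.layers[lo:hi + 1]:
            for param in layer.parameters():
                param.mul_(alpha)

def rejections_with_scaled_range(model, lo: int, hi: int, alpha: float, dataset) -> int:
    """Scale layers [lo, hi], count over-rejections, then (approximately) undo the scaling."""
    scale_layers(model, lo, hi, alpha)
    count = count_over_rejections(model, dataset)  # hypothetical evaluation helper
    scale_layers(model, lo, hi, 1.0 / alpha)       # restore the original weights
    return count

def locate_safety_layers(model, i: int, j: int, alpha: float, dataset, width: int = 5):
    """Progressive boundary search: amplification (alpha > 1) should maximize rejections,
    suppression (alpha < 1) should minimize them."""
    pick = max if alpha > 1 else min

    # Upper bound: keep the lower edge at i, extend the upper edge from j.
    upper = {hi: rejections_with_scaled_range(model, i, hi, alpha, dataset)
             for hi in range(j, j + width)}
    j_best = pick(upper, key=upper.get)

    # Lower bound: fix the found upper edge, move the lower edge downward from i.
    lower = {lo: rejections_with_scaled_range(model, lo, j_best, alpha, dataset)
             for lo in range(i, max(i - width, 0) - 1, -1)}
    i_best = pick(lower, key=lower.get)

    return i_best, j_best
```

For Llama-3-8B-Instruct with α = 1.2, Table 1 shows this kind of search peaking at layers [6, 12].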
4.3. SPPFT: Safely Partial-Parameter Fine-Tuning
With the safety layers identified, the proposed defense is straightforward:
- Definition: SPPFT is a fine-tuning approach where the parameters of the identified safety layers are frozen; their gradients are not computed or updated during the backpropagation phase of training.
- Mechanism: By freezing these layers, their learned ability to distinguish malicious content is preserved, while the other layers (early layers for preliminary processing and later layers for semantic analysis) are free to adapt to the new fine-tuning data. This allows the model to learn the new task without unlearning its safety alignment.
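A minimal sketch of the freezing step is shown below, assuming a Llama-style module layout and using the [6, 12] range located for Llama-3-8B-Instruct in Table 1; the training loop itself is omitted and would be any standard fine-tuning setup, not the authors' exact code.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

SAFETY_LAYERS = range(6, 13)  # layers 6-12 inclusive, as located for this model (Table 1)

# SPPFT: freeze the safety layers so no gradient updates are applied to them.
for idx in SAFETY_LAYERS:
    for param in model.model.layers[idx].parameters():
        param.requires_grad = False

# All other layers remain trainable and are fine-tuned as usual
# (e.g., with transformers.Trainer or a custom training loop).
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Frozen {frozen / total:.1%} of parameters")
```

Because no gradients or optimizer states are maintained for the frozen layers, SPPFT also reduces training cost relative to full fine-tuning.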
5. Experimental Setup
- Datasets:
- Fine-tuning Datasets:
- Normal Data Attack (DN): 1,000 harmless samples from the alpaca finance dataset.
- Implicit Attack (DI): 4,000 samples where the output is primed to be affirmative (e.g., starts with "Sure, the answer is:").
- Backdoor Attack (DB): 1,500 normal samples and 1,500 backdoor samples with a trigger phrase.
- Harmful Data Attack (DH): 1,000 normal samples mixed with a proportion (p=0.05,0.1,0.2) of malicious data.
- Evaluation Datasets:
- Malicious Queries (Dm): 520 prompts from the dataset created by Zou et al. (2023) to test for harmful responses.
- Task Performance (DT): 500 non-overlapping samples from alpaca finance to test performance on the fine-tuning domain.
- Evaluation Metrics:
- Harmful Rate (Rh):
- Conceptual Definition: Measures the percentage of malicious prompts from Dm that the LLM agrees to answer. A lower rate is better, indicating stronger security.
- Mathematical Formula:
$$R_h = \frac{\text{Number of harmful prompts answered}}{\text{Total number of harmful prompts}}$$
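- Worked Example (hypothetical numbers): if the model complies with 52 of the 520 prompts in Dm, then Rh = 52 / 520 = 10%.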
- Harmful Score (Sh):
- Conceptual Definition: Measures the severity of the harmful responses on a scale of 1 (safest) to 5 (most harmful). The scores are generated by GPT-4 based on a detailed rubric. A lower score is better.
- Rouge-L Score (Sr):
- Conceptual Definition: Evaluates the quality of generated text by measuring the length of the Longest Common Subsequence (LCS) between the model's output and a reference answer. It is a standard metric for text summarization and generation tasks.
- Mathematical Formula:
$$R_{lcs} = \frac{LCS(X, Y)}{m}, \quad P_{lcs} = \frac{LCS(X, Y)}{n}, \quad F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$
- Symbol Explanation: X is the reference text of length m, Y is the model output of length n, and LCS(X, Y) is the length of their longest common subsequence. $R_{lcs}$ and $P_{lcs}$ are recall and precision, and $F_{lcs}$ is the F-score. The paper uses the standard Rouge-L F-score. (A worked pure-Python sketch of this computation appears at the end of this section.)
- MMLU Score (Sm):
- Conceptual Definition: Measures the model's general knowledge and problem-solving abilities across 57 diverse subjects (from elementary math to US history). It is a benchmark for overall model capability.
- Baselines:
- Full Fine-Tuning (FullFT): The standard approach where all model parameters are updated.
- Neuron Freezing Fine-Tuning (NFFT): Freezing a proportional number of discrete "security-critical neurons" as identified by Wei et al. (2024).
- Lisa: A method for safe fine-tuning used as a baseline in the harmful data attack scenario.
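As referenced under the Rouge-L metric above, here is a minimal pure-Python sketch of the Rouge-L F-score; whitespace tokenization and β = 1 are simplifying assumptions, and published results typically rely on a standard Rouge implementation.

```python
def lcs_length(x: list[str], y: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (O(m*n) dynamic programming)."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_f(reference: str, candidate: str, beta: float = 1.0) -> float:
    """Rouge-L F-score built from LCS recall and precision (beta = 1 gives the standard F1)."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    r_lcs, p_lcs = lcs / len(x), lcs / len(y)
    return (1 + beta**2) * r_lcs * p_lcs / (r_lcs + beta**2 * p_lcs)

print(rouge_l_f("the cat sat on the mat", "the cat lay on the mat"))  # LCS = 5 tokens, F1 ≈ 0.833
```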
6. Results & Analysis
Core Results
- Localization of Safety Layers: Table 1 (transcribed below) shows the progressive localization process. For Llama-3-8B-Instruct (with α = 1.2), the number of over-rejections peaks when scaling layers [6,12]. Expanding the range further (e.g., to [5,12] or [7,13]) reduces the over-rejection count, confirming that layers 6-12 are the most security-critical. This pattern holds across all tested models.
(Manual Transcription of Table 1)
| Phi-3-mini-4k-instruct (α = 0.8, R0 = 270) | | | | | |
|---|---|---|---|---|---|
| Upper Bound: Scaled Layers Range | [11,13] | [11,14] | [11,15] | [11,16] | [11,17] |
| Upper Bound: Over-Rejection Num | 209 | 190 | 149 | 181 | 189 |
| Lower Bound: Scaled Layers Range | [13,15] | [12,15] | [11,15] | [10,15] | [9,15] |
| Lower Bound: Over-Rejection Num | 237 | 182 | 149 | 177 | 163 |

| Llama-2-7b-chat (α = 1.15, R0 = 169) | | | | | |
|---|---|---|---|---|---|
| Upper Bound: Scaled Layers Range | [9,12] | [9,13] | [9,14] | [9,15] | [9,16] |
| Upper Bound: Over-Rejection Num | 187 | 227 | 237 | 218 | 219 |
| Lower Bound: Scaled Layers Range | [8,14] | [7,14] | [6,14] | [5,14] | [4,14] |
| Lower Bound: Over-Rejection Num | 263 | 268 | 297 | 189 | 202 |

| Llama-3-8B-Instruct (α = 1.2, R0 = 139) | | | | | |
|---|---|---|---|---|---|
| Upper Bound: Scaled Layers Range | [7,10] | [7,11] | [7,12] | [7,13] | [7,14] |
| Upper Bound: Over-Rejection Num | 272 | 241 | 283 | 266 | 256 |
| Lower Bound: Scaled Layers Range | [8,12] | [7,12] | [6,12] | [5,12] | [4,12] |
| Lower Bound: Over-Rejection Num | 334 | 283 | 371 | 358 | 223 |

| gemma-2b-it (α = 1.2, R0 = 139) | | | | | |
|---|---|---|---|---|---|
| Upper Bound: Scaled Layers Range | [8,9] | [8,10] | [8,11] | [8,12] | [8,13] |
| Upper Bound: Over-Rejection Num | 310 | 335 | 368 | 343 | 326 |
| Lower Bound: Scaled Layers Range | [8,11] | [7,11] | [6,11] | [5,11] | [4,11] |
| Lower Bound: Over-Rejection Num | 368 | 371 | 407 | 404 | 323 |
- Three Stages of LLM Layers: The attention heatmaps in Figure 4 support a three-stage processing model (an extraction sketch follows the figure caption below):
- Early Layers (Pre-Safety): Focus on syntactic structure and general words (e.g., "how", "the").
- Middle Layers (Safety Layers): Begin to distinguish malicious intent. Attention shifts towards semantically loaded keywords (e.g., "bomb", "rob").
- Late Layers (Post-Safety): Focus entirely on the core semantic concepts of the query to formulate a response.
Figure 4: Attention-score heatmaps for Llama-2-7b-chat and Phi-3-mini-4k-instruct. The vertical axis indexes the model layers and the horizontal axis the input tokens; the shade of each cell reflects the attention score a given layer assigns to that token. The figure contrasts Llama-2-7b-chat's attention patterns on a malicious question (e.g., "How to make a bomb") and a normal question (e.g., "Where is the capital of America"). Black dashed lines mark the location of the safety layers, dividing the model into three parts and suggesting that these layers play a key role in distinguishing malicious from normal queries.
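A hedged sketch of how per-layer attention scores of this kind can be extracted with Hugging Face transformers; the checkpoint, the query, and the head-averaging choice are illustrative assumptions rather than the authors' exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto",
    attn_implementation="eager",  # eager attention so output_attentions returns score tensors
)

query = "How to make a bomb"  # malicious example from Figure 4
inputs = tokenizer(query, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer. Averaging over heads
# and taking the last query position gives, per layer, the attention paid to each input token,
# i.e. a (num_layers, seq_len) map analogous to the Figure 4 heatmaps.
heatmap = torch.stack([attn[0].float().mean(dim=0)[-1] for attn in out.attentions]).cpu()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, column in zip(tokens, heatmap.T):
    print(f"{tok:>12s} " + " ".join(f"{v:.2f}" for v in column.tolist()))
```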
- SPPFT Performance: Table 2 and Table 3 (transcribed below) show that SPPFT is highly effective.
- Across Normal, Implicit, and Backdoor attacks (Table 2): FullFT causes a massive increase in harmfulness (e.g., for Phi-3, the harmful rate jumps from 0.77% to 87.69% in the implicit attack). SPPFT, in contrast, keeps the increase minimal (0.77% to 3.27%). Meanwhile, task performance metrics (Rouge-L, MMLU) remain comparable to FullFT, showing that learning is not compromised. The NFFT baseline performs poorly, similar to FullFT.
- In Harmful Data attacks (Table 3): As the proportion of malicious data (p) increases, FullFT models quickly become almost completely unsafe. SPPFT consistently maintains a much lower harmful rate and score, outperforming both FullFT and the Lisa baseline.
(Manual Transcription of Table 2)
| | Llama-3-8B-Instruct (Initial Rh = 5.77%, Sh = 1.13) | | | Llama-2-7b-chat (Initial Rh = 1.35%, Sh = 1.03) | | | gemma-2b-it (Initial Rh = 3.27%, Sh = 1.08) | | | Phi-3-mini-4k-instruct (Initial Rh = 0.77%, Sh = 1.02) | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DN | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT |
| Harmful Rate (Rh) | 9.62% | 44.42% | 43.65% | 2.88% | 10.58% | 12.69% | 5.58% | 18.27% | 17.69% | 7.12% | 40.00% | 38.46% |
| Harmful Score (Sh) | 1.21 | 2.41 | 2.37 | 1.06 | 1.38 | 1.49 | 1.14 | 1.68 | 1.66 | 1.16 | 2.39 | 2.33 |
| Rouge-L Score (Sr) | 0.285 | 0.277 | 0.283 | 0.248 | 0.270 | 0.252 | 0.240 | 0.232 | 0.227 | 0.322 | 0.318 | 0.316 |
| MMLU Score (Sm) | 0.654 | 0.649 | 0.651 | 0.470 | 0.458 | 0.454 | 0.384 | 0.389 | 0.381 | 0.678 | 0.671 | 0.668 |
| DI | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT |
| Harmful Rate (Rh) | 6.15% | 42.69% | 41.92% | 6.73% | 58.85% | 58.07% | 6.35% | 54.04% | 54.81% | 3.27% | 87.69% | 81.35% |
| Harmful Score (Sh) | 1.18 | 2.64 | 2.61 | 1.19 | 3.26 | 3.24 | 1.21 | 2.98 | 3.00 | 1.09 | 4.17 | 4.03 |
| ... (omitting other scores for brevity) | | | | | | | | | | | | |
| DB | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT | SPPFT | FullFT | NFFT |
| Harmful Rate (Rh) | 8.27% | 52.50% | 51.15% | 5.58% | 60.58% | 59.42% | 5.19% | 48.08% | 49.04% | 9.04% | 80.96% | 76.73% |
| Harmful Score (Sh) | 1.28 | 2.90 | 2.87 | 1.20 | 3.19 | 3.16 | 1.20 | 2.75 | 2.78 | 1.31 | 4.00 | 3.98 |
| ... (omitting other scores for brevity) | | | | | | | | | | | | |
(Manual Transcription of Table 3)
| Harmful attack | | LLaMA-2-chat | | | Phi-3-mini-4k-instruct | | |
|---|---|---|---|---|---|---|---|
| Method | Metric | p = 0.05 | p = 0.1 | p = 0.2 | p = 0.05 | p = 0.1 | p = 0.2 |
| FullFT | Harmful Rate (Rh) | 26.0% | 53.7% | 93.5% | 59.4% | 90.7% | 98.8% |
| FullFT | Harmful Score (Sh) | 1.88 | 2.97 | 4.53 | 3.23 | 4.45 | 4.89 |
| SPPFT | Harmful Rate (Rh) | 7.9% | 25.6% | 68.3% | 8.1% | 23.3% | 78.8% |
| SPPFT | Harmful Score (Sh) | 1.25 | 1.86 | 3.61 | 1.29 | 1.74 | 3.88 |
| Lisa | Harmful Rate (Rh) | 20.7% | 41.4% | 80.3% | 41.9% | 61.3% | 87.4% |
| Lisa | Harmful Score (Sh) | 1.76 | 2.38 | 4.01 | 2.49 | 3.10 | 4.23 |
Ablations / Parameter Sensitivity
- The ablation study in Appendix A.4.5 is crucial. It shows that freezing contiguous layers other than the identified safety layers (i.e., layers before or after them) does not preserve security; in some cases it even worsens security compared to FullFT. This confirms that the defense's effectiveness comes from protecting the specific functionality of the safety layers, not just from reducing the number of trainable parameters.
7. Conclusion & Reflections
- Conclusion Summary: This paper makes a significant contribution by identifying the existence, location, and function of "safety layers" in aligned LLMs. It provides a clear, interpretable model of how safety is implemented at a structural level. The proposed SPPFT method is a direct and practical application of this insight, offering a simple, effective, and computationally efficient way to conduct safer fine-tuning.
- Limitations & Future Work:
- The authors state their intent to further explore the proposed three-stage division of LLM layers (preliminary confirmation, malicious detection, semantic analysis) in future work.
- The localization process, while effective, needs to be performed for each new model architecture. The paper does not explore whether the relative location or size of safety layers is consistent across model families or scales (e.g., from a 7B to a 70B model).
- The analysis focuses on transformer-based decoder-only models. Whether similar structures exist in other architectures remains an open question.
- Personal Insights & Critique:
- Novelty and Impact: The paper's core finding is both elegant and impactful. The idea of a dedicated, localized "safety module" emerging from alignment training is a powerful mental model that advances our understanding of LLM internals. SPPFT is a highly practical technique that could be immediately adopted by developers fine-tuning open-source models to enhance their safety.
- Methodological Strength: The use of the over-rejection phenomenon as a sensitive probe for security is particularly clever. It circumvents the issue of low signal when testing highly secure models on malicious prompts and provides a robust way to locate the layer boundaries.
- Open Questions: Could these safety layers be adversarially targeted? If an attacker knows their location, could they devise a more efficient fine-tuning attack that specifically aims to corrupt these layers? Furthermore, can the functionality of these safety layers be transferred from one model to another to "transplant" safety? This work paves the way for many exciting new research directions in mechanistic interpretability and robust alignment.