- Title: Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
- Authors: Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora.
- Affiliations: The authors are affiliated with the Computer Science Department & Princeton Language and Intelligence at Princeton University. Sanjeev Arora is a highly respected figure in theoretical computer science and machine learning, lending significant credibility to the work.
- Journal/Conference: The paper is available on arXiv, a preprint server for academic articles. This means it has not yet undergone formal peer review for publication in a conference or journal but is shared to disseminate findings quickly.
- Publication Year: 2024
- Abstract: The paper addresses the problem of safety alignment loss in Large Language Models (LLMs) like Llama 2-Chat after they are fine-tuned, even on benign datasets. The authors find that the prompt templates used during fine-tuning and inference are critical for preserving safety. They propose a counter-intuitive strategy called "Pure Tuning, Safe Testing" (PTST), which involves fine-tuning the model without a safety prompt but re-introducing it at inference time. Through extensive experiments on models like Llama 2, Mistral 7B, and GPT-3.5 Turbo, they show that PTST significantly reduces the emergence of unsafe behaviors while maintaining the performance benefits of fine-tuning.
- Original Source Link: The paper can be accessed at: https://arxiv.org/abs/2402.18540
2. Executive Summary
- Foundational Concepts:
- Large Language Model (LLM): A type of deep learning model with billions of parameters, trained on vast amounts of text data to understand and generate human-like language. Examples include OpenAI's GPT series, Meta's Llama, and Mistral AI's models.
- Fine-Tuning: The process of taking a pre-trained LLM and further training it on a smaller, specialized dataset to adapt its capabilities to a specific task, such as solving math problems, writing code, or acting as a chatbot for a particular domain.
- Alignment Training: The process of training an LLM to behave in accordance with human values. This typically involves making the model helpful (accurately follows instructions), honest (doesn't invent facts), and harmless (refuses to generate unsafe, unethical, or illegal content). A common technique for this is Reinforcement Learning from Human Feedback (RLHF).
- Safety Alignment: A subset of alignment focused specifically on the "harmless" aspect. A safety-aligned model should refuse harmful requests.
- Catastrophic Forgetting: A well-known problem in neural networks where training on a new task causes the model to lose its ability to perform previously learned tasks. In this paper's context, the LLM "forgets" its safety training after being fine-tuned on a new utility-focused task.
- Prompt Template: A pre-defined string format that structures the input to an LLM. It typically includes placeholders for user input and can contain special tokens (e.g., [INST], [/INST]) or a system prompt (an initial instruction that sets the context or rules for the model's behavior, such as a safety prompt).
- Attack Success Rate (ASR): A metric used to measure an LLM's safety, defined as the percentage of harmful user queries that successfully elicit a harmful response from the model.
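As a simple illustration of the metric, ASR is just the percentage of harmful queries whose responses a safety judge flags as harmful (the paper uses a GPT-4 judge). The sketch below assumes generic `generate` and `is_harmful` callables; it is illustrative, not the paper's evaluation code.

```python
from typing import Callable, List

def attack_success_rate(
    harmful_queries: List[str],
    generate: Callable[[str], str],          # model under evaluation
    is_harmful: Callable[[str, str], bool],  # safety judge, e.g. a GPT-4-based classifier
) -> float:
    """Percentage of harmful queries that elicit a harmful response."""
    successes = sum(is_harmful(q, generate(q)) for q in harmful_queries)
    return 100.0 * successes / len(harmful_queries)
```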
- Previous Works:
- The paper builds directly on Qi et al. [2024], who first highlighted that fine-tuning on benign data can degrade LLM safety. However, Qi et al. used the same prompt template for training and testing, a practice this paper identifies as flawed.
- The authors contrast their work with research on malicious fine-tuning (e.g., Yang et al., 2023), where the goal is to intentionally break alignment. This paper focuses on the more subtle and arguably more dangerous problem of unintentional safety loss by benign developers.
- The paper also considers existing lightweight safety defenses such as Self-Reminder (Xie et al., 2023), which adds safety instructions before and after a user's query, and In-Context Defense (Wei et al., 2023), which provides a safe example in the prompt. The paper investigates how these defenses interact with fine-tuning.
- Differentiation: The key innovation of this paper is its focus on the discrepancy between training and inference prompt templates. While others treated catastrophic forgetting as an algorithmic problem requiring complex solutions, this paper proposes a simple, practical solution at the data-processing level. The PTST strategy introduces an intentional distribution shift as a mechanism to preserve safety, a novel and counter-intuitive idea.
4. Methodology (Core Technology & Implementation)
The core technology proposed is not a new model or algorithm but a strategic procedure named Pure Tuning, Safe Testing (PTST).
(Figure 1: a diagram comparing how the safety prompt is used during fine-tuning and inference under the "Pure Tuning, Safe Testing" (PTST) strategy, illustrating that adding the safety prompt only at inference time effectively preserves the model's safety alignment and avoids unsafe responses.)
As illustrated in Figure 1, the PTST strategy involves two distinct phases:
- Principles & Intuition: The underlying hypothesis is that an LLM's safety alignment is strongly tied to the specific prompts used during its initial alignment training. When a model is fine-tuned on a new task using the same safety prompt, it may learn a new, conflicting association. For example, it might learn that "when I see this safety prompt, my goal is to solve math problems," overwriting the original instruction, "when I see this safety prompt, my primary goal is to be safe." This is known as task association interference. PTST avoids this by separating the tasks: fine-tuning focuses purely on the downstream task (utility), while the safety prompt is reserved exclusively for triggering the model's original, "uncontaminated" safety knowledge during inference.
- Steps & Procedures:
- Select an Aligned LLM: Start with a public, safety-aligned model (e.g., Llama 2-Chat). This model already knows how to refuse harmful requests when prompted correctly.
- Fine-tuning (Pure Tuning): Prepare the benign, downstream dataset (e.g., GSM8K). Format each data point using a "pure" prompt template that lacks any system-level safety instructions. For example, a minimal template like [INST]input[/INST] is used instead of one with a lengthy safety preamble.
- Inference (Safe Testing): After fine-tuning, deploy the model. Before passing any user query to the model, enforce a "safe" prompt template that prepends a detailed safety prompt (e.g., Llama 2's recommended system prompt: "You are a helpful, respectful and honest assistant...") to the user's input.
This creates an intentional mismatch, or distribution shift, between the data seen during fine-tuning and the data seen at inference, which the paper shows is beneficial for safety. A minimal sketch of the two templates is given below.
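The following Python sketch makes the two phases concrete, assuming Llama 2's chat conventions ([INST]/[/INST] markers and a <<SYS>> block for the system prompt). The helper names and exact template strings are illustrative, not code from the paper.

```python
# Minimal sketch of PTST prompt handling (Llama 2-style chat markers assumed;
# the exact templates used in the paper may differ).

LLAMA2_SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
    # ... (the full recommended system prompt continues)
)

def format_for_finetuning(user_input: str, target_output: str) -> str:
    """Pure Tuning: training examples carry no system/safety prompt."""
    return f"[INST] {user_input} [/INST] {target_output}"

def format_for_inference(user_query: str) -> str:
    """Safe Testing: the safety prompt is prepended only at inference time."""
    return (
        f"[INST] <<SYS>>\n{LLAMA2_SAFETY_PROMPT}\n<</SYS>>\n\n"
        f"{user_query} [/INST]"
    )

if __name__ == "__main__":
    print(format_for_finetuning("What is 12 * 7?", "12 * 7 = 84."))
    print(format_for_inference("What is 12 * 7?"))
```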
5. Experimental Setup
6. Results & Analysis
The paper's results are presented through a series of tables and figures that compellingly argue for the PTST strategy.
- Core Results: The Danger of Same-Template Fine-tuning and the Efficacy of PTST
The results from fine-tuning Llama-2-7b-chat on GSM8K are shown in the transcribed Table 1 below. In these tables, CL denotes the chat template that includes Llama 2's safety system prompt, while TV, TA, CV, and CA are "pure" templates without it; rows give the fine-tuning (training) template and columns the inference (testing) template.
Manually Transcribed Table 1 (a): Helpfulness (Exact Match % on GSM8K)
| train \ test | TV | TA | CV | CA | CL |
| --- | --- | --- | --- | --- | --- |
| No FT | 15.31 | 9.10 | 20.32 | 20.62 | 6.52 |
| TV | **32.98** | 27.02 | 31.94 | 27.02 | 23.76 |
| TA | 6.06 | **33.99** | 21.31 | 32.22 | 23.98 |
| CV | 25.12 | 20.82 | **33.39** | 24.74 | 30.00 |
| CA | 7.48 | 32.52 | 15.57 | **33.08** | 21.76 |
| CL | 20.87 | 29.34 | 31.59 | 31.01 | **33.51** |
Manually Transcribed Table 1 (c): ASR on DirectHarm4 (%)
| train \ test | TV | TA | CV | CA | CL |
| --- | --- | --- | --- | --- | --- |
| No FT | 11.75 | 16.25 | 2.75 | 4.75 | 0.00 |
| TV | **40.08** | 29.50 | 7.83 | 9.42 | 0.42 |
| TA | 17.17 | **57.50** | 4.92 | 11.00 | 0.08 |
| CV | 34.08 | 33.50 | **11.00** | 20.50 | 1.08 |
| CA | 19.33 | 51.58 | 8.08 | **46.42** | 1.00 |
| CL | 29.50 | 63.00 | 6.83 | 18.92 | **18.08** |
- Analysis:
- The diagonal entries (bolded) confirm that fine-tuning with the same template for training and testing improves helpfulness (Table 1a) but causes a massive spike in ASR (Table 1c). For example, training and testing with CA increases helpfulness from 20.62% to 33.08% but explodes the ASR from 4.75% to 46.42%.
- Most shockingly, fine-tuning with the safety prompt (CL) and testing with it (CL:CL) results in an ASR of 18.08%, a huge increase from the base model's 0% ASR. This is a critical finding: including the safety prompt during fine-tuning actively harms safety.
- PTST works. Look at the CL column for testing: when training with a "pure" template like CV and testing with the "safe" template CL, the ASR is only 1.08%, a dramatic reduction from the 18.08% of CL:CL and the 11.00% of CV:CV, while helpfulness remains high (30.00%). This pattern holds for all pure training templates (TV, TA, CV, CA).
- Ablations / Further Analysis:
- PTST vs. Early Stopping:
(Figure 2: two panels of ASR versus helpfulness scatter/line plots under different fine-tuning strategies, including CV:CV, CL:CL, CV:CL, and no fine-tuning.)
Figure 2 plots ASR vs. helpfulness over training epochs. The blue (CV:CV) and orange (CL:CL) curves show that as the model becomes more helpful, it also becomes less safe. In contrast, the green curve (CV:CL), which represents PTST, maintains a very low ASR even as helpfulness increases. This demonstrates that PTST provides a fundamentally better safety-utility trade-off than simply stopping the conventional fine-tuning process earlier.
- Generality across Models, Datasets, and Prompts:
- The same trend was observed for GPT-3.5 Turbo (Table 3) and Mistral 7B (Table 7 in the Appendix).
- Fine-tuning on ChatDoctor (Table 4) and OpenOrca (Table 5) also confirmed that PTST preserves safety while same-template training compromises it.
- Experiments with other safety prompts (MPT, Llama-short) showed that PTST is not specific to one particular safety prompt. Furthermore, fine-tuning with one safety prompt and testing with a different one was still less safe than PTST, reinforcing the "Pure Tuning" component of the strategy.
(Figure 3: bar charts of ASR on DirectHarm4 and helpfulness for Llama 2-Chat and GPT-3.5 Turbo under different training and testing prompt templates, comparing the "llama", "same as test", and "vanilla" training templates and confirming PTST's advantage in preserving safety.)
- Effect of Mixing Safety Data:
The authors tested whether adding safety examples (refusals to harmful queries) to the fine-tuning data would make PTST redundant (a small data-mixing sketch follows the analysis below).
Manually Transcribed Table 6 (b): Safety Evaluation with Safety Data Added (ASR % on the GSM-Danger dataset, shown as without safety data / with safety data; rows are testing templates, columns are training templates)

| test \ train | CV | CA | CL |
| --- | --- | --- | --- |
| CV | 22 / 22 | 52 / 17 | 54 / 32 |
| CA | 12 / 12 | 41 / 41 | 13 / 59 |
| CL | 5 / 4 | 1 / 1 | 12 / 38 |
- Analysis: While adding safety data helped reduce ASR on in-distribution attacks, it failed to protect against the out-of-distribution (OOD) GSM-Danger attacks. As seen in the transcribed table, the CL:CL strategy (CL row, CL column) still resulted in a very high ASR of 38% even with safety data. In contrast, PTST (CV:CL and CA:CL) kept the ASR extremely low (4% and 1%, respectively), demonstrating its superior robustness.
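For concreteness, the "mixing safety data" setup could look roughly like the sketch below; the file names, mixing ratio, and field handling are assumptions for illustration, not the paper's exact recipe.

```python
import json
import random

def load_jsonl(path: str) -> list:
    """Read one JSON object per line (prompt/response pairs assumed)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Hypothetical files: benign task data plus (harmful query, refusal) pairs.
task_data = load_jsonl("gsm8k_train.jsonl")
safety_data = load_jsonl("safety_refusals.jsonl")

# Mix a small fraction of safety examples into the fine-tuning set
# (the ~10% ratio here is illustrative, not the paper's setting).
k = min(len(safety_data), max(1, len(task_data) // 10))
mixed = task_data + random.sample(safety_data, k)
random.shuffle(mixed)
# Each example is then formatted with the chosen prompt template and fine-tuned on.
# The paper finds this helps on in-distribution harmful queries but not on OOD
# attacks such as GSM-Danger, which is why PTST is still needed.
```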
7. Conclusion & Reflections
- Conclusion Summary: The paper convincingly demonstrates that the common practice of using identical prompt templates for fine-tuning and inference is a significant, previously overlooked cause of safety degradation in aligned LLMs. The proposed "Pure Tuning, Safe Testing" (PTST) strategy—fine-tuning without a safety prompt and using one only at inference—is a simple, zero-cost, and highly effective method to mitigate this alignment loss. It preserves the model's original safety training while allowing it to gain new capabilities, offering a much better balance between utility and safety.
- Limitations & Future Work:
- The paper's explanation for why PTST works is intuitive (task association interference) but lacks a deep, mechanistic analysis of the model's internal representations. Future work could use network probing techniques to verify this hypothesis.
- The safety evaluation relies on a GPT-4 judge, which, while powerful, is not infallible and may have its own biases.
- The study focuses on benign fine-tuning. While effective in this context, PTST is not designed as a defense against deliberate, malicious attacks to break alignment.
- Personal Insights & Critique:
- Significance: This is a highly impactful paper. Its finding is both surprising and immediately actionable. It provides a clear, evidence-backed "best practice" for any developer or organization that fine-tunes publicly available aligned LLMs. The simplicity of the solution is its greatest strength.
- Transferability: The PTST principle could potentially apply to other forms of "catastrophic forgetting" beyond safety. For instance, if a model is fine-tuned for a specific domain (e.g., law), its general knowledge might degrade. A similar strategy of separating the "domain prompt" from the fine-tuning process might help preserve general capabilities.
- Open Questions: The results raise fascinating questions about how LLMs learn and associate concepts. The fact that including a safety prompt during fine-tuning is worse than not including one at all suggests a complex interplay between the model's weights, the training data distribution, and the specific tokens in the prompt. This opens up a rich area for future research into the "mechanistic interpretability" of LLM alignment.