
Mitigating the Alignment Tax of RLHF

RL Training for Large Language Models · Alignment Tax · Model Weight Averaging · Reinforcement Learning with Human Feedback · Heterogeneous Model Averaging for Transformer Layers
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper addresses RLHF's "alignment tax," where LLMs forget pre-trained abilities, creating an alignment-forgetting trade-off. It shows that simple model averaging surprisingly achieves the best Pareto front. Building on theoretical insights, a novel Heterogeneous Model Averaging (HMA) method is proposed that learns layer-wise averaging ratios to further improve the trade-off, validated on OpenLLaMA-3B and Mistral-7B.

Abstract

LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pre-trained abilities, also known as the alignment tax. To investigate the alignment tax, we conducted experiments with existing RLHF algorithms using OpenLLaMA-3B, which revealed a pronounced alignment tax on NLP tasks. However, despite various techniques for mitigating forgetting, they are often at odds with RLHF performance, leading to a trade-off between alignment performance and forgetting mitigation, i.e., an alignment-forgetting trade-off. In this paper we show that model averaging, which simply interpolates between pre- and post-RLHF model weights, surprisingly achieves the strongest alignment-forgetting Pareto front among a wide range of competing methods. To understand its effectiveness, we offer theoretical insights into model averaging, revealing that it enhances the performance Pareto front by increasing feature diversity on the layers where tasks share overlapping feature spaces. Empirical evidence corroborates our analysis by showing the benefits of averaging low-level transformer layers. Building on this analysis and the observation that averaging different layers of the transformer leads to significantly different alignment-forgetting trade-offs, we propose Heterogeneous Model Averaging (HMA), which heterogeneously finds various combination ratios of model layers. HMA seeks to maximize alignment performance while incurring minimal alignment tax. Moreover, we validate HMA's performance across a range of RLHF algorithms on OpenLLaMA-3B and further extend our findings to Mistral-7B, evaluated by an open-source preference model and GPT-4. Code available here: https://github.com/avalonstrel/Mitigating-the-Alignment-Tax-of-RLHF.git.

English Analysis

1. Bibliographic Information

  • Title: Mitigating the Alignment Tax of RLHF
  • Authors: Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, Tong Zhang.
  • Affiliations: The authors are affiliated with several leading academic and industrial research institutions, including Princeton University, The Hong Kong University of Science and Technology, University of Illinois Urbana-Champaign, and NVIDIA.
  • Journal/Conference: The paper was submitted to arXiv, a popular repository for preprints in fields like computer science and physics. This indicates it is a research work shared with the community, potentially before or during a formal peer-review process for a conference or journal.
  • Publication Year: The initial version was submitted in September 2023.
  • Abstract: The paper investigates the "alignment tax," a phenomenon where aligning Large Language Models (LLMs) with human preferences via Reinforcement Learning with Human Feedback (RLHF) causes them to forget their pre-trained abilities. Experiments on OpenLLaMA-3B confirm this pronounced tax on NLP tasks. While various mitigation techniques exist, they often create a trade-off between alignment performance and forgetting. The authors find that simple model averaging—interpolating weights between the pre- and post-RLHF models—achieves the best alignment-forgetting Pareto front. They provide theoretical and empirical evidence suggesting this is due to increased feature diversity in shared, low-level layers. Building on this, they propose Heterogeneous Model Averaging (HMA), which uses different averaging ratios for different model layers to further improve performance. HMA's effectiveness is validated on OpenLLaMA-3B and Mistral-7B using various RLHF algorithms and evaluation methods, including GPT-4.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Modern LLMs are aligned with human values (to be helpful, honest, and harmless) using a technique called Reinforcement Learning with Human Feedback (RLHF). However, this alignment process often degrades the model's performance on a wide range of general abilities it learned during pre-training, such as translation, reasoning, and reading comprehension. This degradation is termed the "alignment tax."
    • Importance: This problem is critical because the goal is to create AI systems that are both aligned and capable. If alignment comes at the cost of core competencies, the utility of these powerful models is diminished. Existing methods to prevent this "forgetting" often reduce the effectiveness of the alignment itself, leading to an undesirable alignment-forgetting trade-off.
    • Innovation: The paper provides a comprehensive, empirical investigation into this trade-off and discovers that a surprisingly simple, post-hoc technique—model averaging—is more effective than many complex, training-time interventions. It then builds upon this insight to develop an even more effective method, Heterogeneous Model Averaging (HMA).
  • Main Contributions / Findings (What):

    1. Systematic Investigation of Alignment Tax: The paper conducts a thorough empirical study demonstrating the alignment-forgetting trade-off across various NLP tasks and RLHF algorithms.
    2. Effectiveness of Model Averaging (MA): It shows that simply averaging the weights of the model before and after RLHF training achieves a stronger performance trade-off (Pareto front) than a wide range of competing methods like regularization, LoRA, and knowledge distillation.
    3. Theoretical and Empirical Explanation: The authors provide a theoretical framework to explain why model averaging works. It improves performance by diversifying features in layers where the pre-training and alignment tasks share underlying structure (e.g., low-level linguistic features). This is supported by experiments showing that averaging the initial layers of a Transformer is particularly beneficial.
    4. Proposal of Heterogeneous Model Averaging (HMA): Based on the insight that different layers contribute differently, HMA is introduced. This method adaptively finds optimal, different averaging ratios for various blocks of layers in the model, further pushing the Pareto front to achieve better alignment with minimal forgetting.
    5. Extensive Validation: The proposed methods are validated across multiple LLMs (OpenLLaMA-3B, Mistral-7B), several RLHF algorithms (RSF, DPO, PPO), and different evaluation schemes (reward models, PairRM, and GPT-4), demonstrating the robustness and generalizability of the findings.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Large Language Models (LLMs): These are deep neural networks, typically based on the Transformer architecture, trained on massive amounts of text data. This pre-training endows them with a broad understanding of language, facts, and reasoning abilities. Examples include GPT-4, Llama, and Mistral.
    • Reinforcement Learning with Human Feedback (RLHF): A multi-stage process to fine-tune LLMs to better align with human preferences. It typically involves:
      1. Supervised Fine-Tuning (SFT): The pre-trained LLM is fine-tuned on a smaller, high-quality dataset of instruction-response pairs.
      2. Reward Modeling: Human labelers rank several responses to a given prompt. A separate "reward model" is trained to predict which response a human would prefer.
      3. RL Optimization: The SFT model is further fine-tuned using a reinforcement learning algorithm (like PPO). The reward model provides the signal, rewarding the LLM for generating responses that humans would likely prefer.
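The reward-modeling step above is usually trained with a pairwise (Bradley-Terry style) objective on the ranked responses. Below is a minimal sketch of that loss, not code from the paper; the scalar scores and their values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the scalar reward of the human-preferred
    response above the reward of the rejected one.
    chosen_scores / rejected_scores: shape (batch,), one scalar reward per response."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative usage with made-up scores (in practice these come from the reward model):
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.5])
print(pairwise_reward_loss(chosen, rejected))  # smaller when chosen consistently beats rejected
```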
    • Alignment Tax: The negative side effect of RLHF. By optimizing heavily for the human preference reward signal, the model's weights shift in a way that can harm its performance on other, more general tasks learned during pre-training. As shown in Figure 1, an increase in "Helpful" ability can coincide with a decrease in "Translation" or "Comprehension."
    • Pareto Front: In multi-objective optimization, a Pareto front represents the set of all solutions where it's impossible to improve one objective without degrading another. In this paper's context, the two objectives are Alignment Reward (which we want to maximize) and Alignment Tax (which we want to minimize, equivalent to maximizing NLP task performance). A method with a "stronger" Pareto front can achieve a better combination of both objectives than another method.
    • Model Averaging: A simple technique where the parameters (weights) of two or more models are averaged together to create a new model. In this paper, it refers to the linear interpolation between the weights of the model before RLHF ($\theta_0$) and after RLHF ($\theta$). Sweeping the interpolation ratio and keeping only the non-dominated evaluation points traces out the Pareto front; a minimal sketch follows this list.
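As a concrete illustration of the Pareto front, the sketch below filters a set of (alignment reward, NLP score) evaluation points, such as those produced by an $\alpha$ sweep, down to the non-dominated ones. The numbers are made up for illustration and are not results from the paper.

```python
import numpy as np

def pareto_front(points):
    """Keep only non-dominated points, where both coordinates
    (alignment reward, NLP task score) are to be maximized."""
    points = np.asarray(points, dtype=float)
    front = []
    for i, p in enumerate(points):
        dominated = any(
            (q >= p).all() and (q > p).any()
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(tuple(p))
    return sorted(front)

# Hypothetical sweep over the interpolation ratio alpha (values are illustrative only):
# each tuple is (HH-RLHF reward, reading-comprehension F1) for one averaged model.
sweep = [(-1.0, 52.0), (-0.7, 49.5), (-0.5, 46.0), (-0.4, 40.0), (-0.6, 41.0)]
print(pareto_front(sweep))  # drops the dominated (-0.6, 41.0) point
```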
  • Previous Works & Differentiation:

    • Catastrophic Forgetting in Continual Learning: The alignment tax is a form of catastrophic forgetting. The continual learning field has developed methods like:
      • Regularization: Adding a penalty to the loss function to keep the new model's weights close to the old model's weights (e.g., L1/L2 penalty).
      • Experience Replay (ER): Mixing in data from the old task (pre-training) while training on the new task (RLHF). The paper shows this is impractical for LLMs because the pre-training dataset is enormous and often unavailable. Their experiments show MA can outperform ER even when a subset of pre-training data is used.
      • Knowledge Distillation: Penalizing the new model if its output distribution diverges from the old model's. The paper shows in Figure 3 that simple model averaging outperforms these more complex, integrated methods.
    • Model Averaging in LLMs: Previous work used model averaging to merge models trained for different objectives or to create better reward models. However, this paper is the first to systematically study its effectiveness specifically for mitigating the alignment tax.
    • Adaptive Model Merging: Methods like AdaMerging exist to find layer-specific merging weights to maximize performance on a single, specific target task. This is unsuitable for mitigating alignment tax, which requires preserving performance across a wide and often unknown range of tasks. The paper shows in Figure 15 that optimizing for one task (Reading Comprehension) with AdaMerging does not help, and can even hurt, performance on another (Common Sense QA).

4. Methodology (Core Technology & Implementation)

The paper's core technical contributions are the systematic evaluation of Model Averaging (MA) and the proposal of its enhancement, Heterogeneous Model Averaging (HMA).

  • Model Averaging (MA)

    • Principle: The core idea is to find a "sweet spot" between the general capabilities of the pre-RLHF model and the aligned behavior of the post-RLHF model by simply blending their weights.
    • Procedure: Let $\theta_0$ be the weights of the model after initial instruction fine-tuning (but before RLHF), and let $\theta$ be the weights after RLHF. The averaged model, $\theta_{avg}$, is computed as a linear interpolation: $\theta_{avg} = (1 - \alpha)\theta_0 + \alpha\theta$ (a minimal weight-interpolation sketch follows this list).
    • Symbol Explanation:
      • $\theta_0$: The parameters of the base model, which is highly capable on general NLP tasks.
      • $\theta$: The parameters of the aligned model, which scores high on the alignment reward but suffers from the alignment tax.
      • $\alpha \in [0, 1]$: A hyperparameter that controls the interpolation. $\alpha=0$ recovers the base model, while $\alpha=1$ recovers the fully aligned model. Varying $\alpha$ traces out the Pareto front.
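A minimal weight-interpolation sketch in PyTorch, assuming two checkpoints of the same architecture are available as state dicts; the file paths are placeholders, not artifacts from the paper's repository.

```python
import torch

def average_weights(sd_pre: dict, sd_post: dict, alpha: float = 0.2) -> dict:
    """Vanilla model averaging: theta_avg = (1 - alpha) * theta_0 + alpha * theta.
    alpha = 0 recovers the pre-RLHF model, alpha = 1 the fully aligned model."""
    assert sd_pre.keys() == sd_post.keys(), "checkpoints must share the same parameter names"
    return {
        name: (1.0 - alpha) * sd_pre[name].float() + alpha * sd_post[name].float()
        for name in sd_pre
    }

# Hypothetical usage (paths are placeholders):
# sd_avg = average_weights(torch.load("pre_rlhf.pt"), torch.load("post_rlhf.pt"), alpha=0.2)
# model.load_state_dict(sd_avg)
```

Sweeping $\alpha$ over, say, {0.1, 0.2, ..., 0.9} and evaluating each merged model yields the points that trace out the Pareto front discussed above.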
  • Theoretical Insights into MA's Effectiveness

    • Underlying Theory: The paper draws on the framework of Lin et al. (2023), which posits that models can rely on different sets of "weak features" to solve a task. Model averaging works by diversifying the set of features the final model relies on, making it more robust.

    • Proposition 5.1 (Simplified): The paper extends this theory to the multi-task setting and shows that the performance gain from averaging is greater when the two tasks are more similar (i.e., they share a common feature space). This allows the averaged model to leverage a richer, more diverse set of features for both tasks. If the tasks are completely dissimilar, averaging provides little to no benefit.

    • Connection to Transformers: This theory is connected to LLMs with the insight that different layers of a Transformer learn features at different levels of abstraction.

      • Low-level layers (e.g., layers 1-8) learn general features like word representations and basic syntax. Both general NLP tasks and alignment tasks rely on these shared features, making them "similar" from the perspective of these layers.
      • High-level layers learn more task-specific features.
    • Hypothesis & Empirical Validation: The theory predicts that averaging the low-level layers should be most beneficial, potentially improving performance on both alignment and NLP tasks. Figure 4 confirms this: "Input Part MA" shows a "magical" region where both reward and NLP F1-score increase simultaneously, a result not seen when averaging higher-level layers.

      Figure 4: (Left) Illustration of the proof-of-concept experiments: the Transformer is divided into three parts (input, middle, output, i.e., low to high layers), and only one part is averaged at a time. (Right) Alignment-forgetting trade-off (HH RLHF reward vs. reading comprehension F1) when merging different parts; whole-model MA exhibits the strongest Pareto front.

  • Heterogeneous Model Averaging (HMA)

    • Principle: Since averaging different layers has vastly different effects, HMA proposes to learn an optimal, distinct averaging ratio for each part of the model instead of using a single global $\alpha$. This allows for a more fine-grained trade-off, for instance, by leaning more towards the pre-RLHF model in the lower layers and the post-RLHF model in the higher layers.

      Figure 2: Illustration of Heterogeneous Model Averaging (HMA) when $K = 3$. The pre-RLHF weights $\theta_0$ and the RLHF-tuned weights $\theta$ are averaged with a different ratio in each part (input, middle, output); for example, the output part is $0.3\,\theta_0^{[3]} + 0.7\,\theta^{[3]}$.

    • Procedure:

      1. Divide the Transformer model into $K$ parts (e.g., $K=3$ for input, middle, and output blocks of layers).
      2. Assign a unique averaging ratio $\alpha_k$ to each part $k$. The $k$-th part of the merged model $\theta(K)$ is: $\theta^{[k]}(K) := \alpha_k \theta^{[k]} + (1 - \alpha_k)\theta_0^{[k]}, \quad \forall k \in \{1, \dots, K\}$ (a minimal per-block merging sketch follows this procedure).
      3. To find the optimal ratios $(\alpha_1, \dots, \alpha_K)$, the goal is to maximize the alignment reward, subject to the constraint that the average of the ratios equals a fixed value $\alpha$. This ensures a fair comparison with vanilla MA at a similar trade-off point. The optimization problem is: $\max_{(\alpha_1, \ldots, \alpha_K) \in \Omega} \mathbb{E}_{x}\, \mathbb{E}_{a \sim \pi_{\theta(K)}(\cdot \mid x)} \left[ r^{*}(x, a) \right]$, where the constraint set is $\Omega := \left\{ \frac{1}{K} \sum_k \alpha_k = \alpha,\ \alpha_1, \dots, \alpha_K \in [0, 1] \right\}$.
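A sketch of the per-block merge in step 2 above, assuming a helper `block_of(name)` that maps each parameter name to one of the $K$ parts; the layer-name pattern and the handling of embeddings and the output head are assumptions, not details fixed by the paper.

```python
import torch

def heterogeneous_average(sd_pre: dict, sd_post: dict, alphas, block_of) -> dict:
    """HMA merge: theta^[k] = alpha_k * theta^[k] + (1 - alpha_k) * theta_0^[k]."""
    merged = {}
    for name, w_pre in sd_pre.items():
        a = alphas[block_of(name)]
        merged[name] = a * sd_post[name].float() + (1.0 - a) * w_pre.float()
    return merged

def block_of(name: str, n_layers: int = 26, k: int = 3) -> int:
    """Assign transformer layer i to one of k contiguous parts (input/middle/output).
    Parameters outside the decoder stack are assigned heuristically here."""
    for i in range(n_layers):
        if f"layers.{i}." in name:
            return min(i * k // n_layers, k - 1)
    return 0 if "embed" in name else k - 1

# Example: lean toward the pre-RLHF model in the input part and toward the
# aligned model in the output part (ratios are illustrative; their mean is 0.2).
# sd_hma = heterogeneous_average(sd_pre, sd_post, alphas=[0.1, 0.2, 0.3], block_of=block_of)
```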
    • Implementation via Proxy Distillation: Directly solving the RL problem above is complex. The authors propose a practical proxy:

      1. First, generate a high-reward dataset $\mathcal{D}_\theta$ by sampling responses from the fully aligned model $\pi_\theta$.
      2. Then, find the ratios $(\alpha_1, \dots, \alpha_K)$ that maximize the log-likelihood of this dataset under the merged model $\pi_{\theta(K)}$. This turns the RL problem into a simpler supervised learning problem (distillation): $\max_{(\alpha_1, \dots, \alpha_K) \in \Omega} \frac{1}{|\mathcal{D}_\theta|} \sum_{(x, a) \in \mathcal{D}_\theta} \log \pi_{\theta(K)}(a \mid x)$ (see the search sketch below).
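One simple way to approximate this proxy objective is a coarse grid search over ratio tuples that satisfy the mean constraint, scoring each candidate merge by its log-likelihood on $\mathcal{D}_\theta$. This is a simplified stand-in for the paper's optimization; `build_model` and `log_likelihood` are hypothetical helpers (load a state dict into the policy; compute the summed log-probability of the sampled responses).

```python
import itertools

def candidate_ratios(alpha: float, k: int = 3, grid=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6)):
    """Enumerate (alpha_1, ..., alpha_K) on a coarse grid whose mean equals alpha."""
    for combo in itertools.product(grid, repeat=k):
        if abs(sum(combo) / k - alpha) < 1e-6:
            yield combo

def select_ratios(sd_pre, sd_post, block_of, alpha, build_model, log_likelihood, d_theta):
    """Pick the ratio tuple maximizing log-likelihood of the high-reward set D_theta
    under the merged model (reuses heterogeneous_average from the sketch above)."""
    best, best_ll = None, float("-inf")
    for ratios in candidate_ratios(alpha):
        policy = build_model(heterogeneous_average(sd_pre, sd_post, ratios, block_of))
        ll = log_likelihood(policy, d_theta)
        if ll > best_ll:
            best, best_ll = ratios, ll
    return best
```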

5. Experimental Setup

  • Datasets:

    • Instruction Tuning: ShareGPT.
    • RLHF/Alignment: Helpfulness and Harmlessness (HH-RLHF) dataset from Bai et al. (2022), which contains human preference data.
    • Alignment Tax Evaluation Datasets:
      • Common Sense QA: ARC Easy & Challenge, Race, PIQA. These datasets test reasoning and common sense knowledge.
      • Reading Comprehension: SQuAD and DROP. These test the model's ability to understand and extract information from passages.
      • Translation: WMT 2014 French to English.
    • Larger Model Evaluation: AlpacaEval 2.0 benchmark, used for evaluating Mistral-7B models.
  • Evaluation Metrics:

    • Alignment Performance:
      • HH RLHF Reward: The score given by the trained reward model.
      • Win-Rate: The percentage of times a model's output is preferred over a baseline's output, as judged by a stronger model like PairRM or GPT-4.
    • NLP Task Performance (Alignment Tax):
      • Accuracy:
        • Conceptual Definition: The proportion of correct predictions out of the total number of examples. It is a straightforward measure of correctness, suitable for classification-style tasks like multiple-choice QA.
        • Mathematical Formula: $\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$
      • F1 Score:
        • Conceptual Definition: The harmonic mean of precision and recall. It is commonly used for tasks like question answering where a perfect match is not always possible and we want to balance finding all correct information (recall) with ensuring the found information is correct (precision).
        • Mathematical Formula: $\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
        • Symbol Explanation:
          • $\text{Precision} = \frac{TP}{TP + FP}$ (of the predicted positive cases, how many were correct?)
          • $\text{Recall} = \frac{TP}{TP + FN}$ (of all actual positive cases, how many were predicted?)
          • TP: True Positives, FP: False Positives, FN: False Negatives. In extractive QA, these are typically computed at the token level between the predicted and ground-truth answer spans.
      • BLEU (Bilingual Evaluation Understudy):
        • Conceptual Definition: Measures the similarity between a machine-generated translation and one or more high-quality human translations. It calculates the precision of n-grams (contiguous sequences of n words) and applies a brevity penalty to penalize translations that are too short.
        • Mathematical Formula: $\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$
        • Symbol Explanation:
          • $\text{BP}$: Brevity Penalty, which is 1 if the candidate length is greater than the reference length, and $\exp(1 - r/c)$ otherwise, where $r$ is the reference length and $c$ is the candidate length.
          • $p_n$: The modified (clipped) precision of n-grams of size $n$; the exponentiated weighted sum of their logarithms takes the geometric average.
          • $w_n$: Weights for each n-gram precision, typically uniform ($1/N$). $N$ is usually 4.
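The token-level F1 and BLEU metrics above can be sketched in a few lines. The snippet below is a simplified version shown only to make the formulas concrete: the official SQuAD F1 script also strips punctuation and articles, and production BLEU implementations add tokenization and smoothing.

```python
import math
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-level F1 between predicted and gold answer spans."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    num_same = sum((Counter(pred) & Counter(gold)).values())
    if num_same == 0:
        return 0.0
    precision, recall = num_same / len(pred), num_same / len(gold)
    return 2 * precision * recall / (precision + recall)

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights and brevity penalty (no smoothing,
    so any zero n-gram precision collapses the score to 0)."""
    cand, ref = candidate.split(), reference.split()
    log_p = []
    for n in range(1, max_n + 1):
        overlap = sum((ngram_counts(cand, n) & ngram_counts(ref, n)).values())
        total = max(len(cand) - n + 1, 1)
        if overlap == 0:
            return 0.0
        log_p.append(math.log(overlap / total))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_p) / max_n)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))            # 0.8
print(bleu("the cat sat on the mat", "the cat sat on a mat"))  # ~0.54
```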
  • Baselines: The paper compares MA and HMA against a comprehensive set of methods for mitigating forgetting:

    • Early stopping: Using a model checkpoint from an earlier stage of RLHF training.
    • Regularization: L1 and L2 penalties to keep weights close to $\theta_0$.
    • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that updates only a small number of parameters.
    • Knowledge distillation: Forcing the model's output probabilities to stay close to $\pi_{\theta_0}$.
    • SMA (Stochastic Moving Averaging): A variant of weight averaging.
    • Experience Replay (ER): Mixing pre-training data into the RLHF training batches.
    • KL reward penalty: A standard term in PPO that penalizes divergence from the base policy.
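For concreteness, the KL reward penalty baseline typically shapes the RL reward as $r - \beta\,(\log \pi_\theta(a\mid x) - \log \pi_{\mathrm{ref}}(a\mid x))$. Below is a minimal sketch of that shaping, with illustrative numbers and an assumed value of $\beta$; it is not the paper's exact implementation.

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        logprob_policy: torch.Tensor,
                        logprob_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Shaped PPO reward: subtract beta times the log-ratio to the base policy,
    discouraging the aligned policy from drifting too far from pi_ref."""
    return reward - beta * (logprob_policy - logprob_ref)

# Illustrative per-sequence values only:
print(kl_penalized_reward(torch.tensor([1.3]),
                          torch.tensor([-42.0]),
                          torch.tensor([-45.0])))  # tensor([1.0000])
```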

6. Results & Analysis

  • Core Results:

    • The Existence of the Alignment Tax: Figure 12 shows a clear trend during RLHF training (Early Stopping points): as the alignment reward increases, performance on Reading Comprehension and Translation consistently drops. Common Sense QA performance first increases slightly before dropping, but the overall trade-off is evident.

      Figure 12: The alignment-forgetting trade-off during training. Three panels plot Reading Comprehension, Common Sense QA accuracy, and French-English translation performance against the HH RLHF reward; performance on pre-trained tasks generally declines as the alignment reward increases, illustrating the alignment tax.

    • Superiority of Model Averaging: Figure 3 is a key result, plotting the performance of various methods on the alignment-forgetting plane. The orange line, representing Model Averaging (MA), forms a dominant Pareto front. For any given level of alignment reward, MA achieves higher NLP task performance than almost all other methods. The other methods appear as scattered points well below this front.

      Figure 3: Existing methods without access to pre-training data, evaluated on Reading Comprehension, Common Sense QA, and Fr-En translation (x-axis: HH RLHF reward; y-axis: F1 / accuracy / BLEU). Model averaging (MA (RSF)) attains the strongest alignment-forgetting Pareto front on all three tasks, clearly outperforming regularization, MoA, Graft, LoRA, and early stopping.

    • HMA Outperforms MA: Figure 5 and the more detailed Figure 16 demonstrate that HMA (red line) consistently improves upon vanilla MA (orange line). It pushes the Pareto front further, achieving higher alignment rewards for the same or better NLP task performance. This holds true for both RSF and DPO alignment algorithms.

      Figure 5: Results of HMA. (Top) HMA for RSF ($\alpha \in [0.1, 0.6]$); (Bottom) HMA for DPO ($\alpha \in [0.1, 0.6]$); (Right) HMA for RSF with different choices of $K$. HMA generally dominates vanilla MA on the HH RLHF reward vs. reading comprehension (F1) plane for both algorithms.

    • Validation on Mistral-7B: The findings generalize to stronger, larger models. Figure 6 shows that for Zephyr-7B-β (a Mistral-7B model), HMA again improves upon the MA Pareto front. Table 1 provides GPT-4 evaluation results, showing that HMA with $\alpha=0.2$ improves upon the baseline Zephyr model in terms of GPT-4 win-rate and performance on all three NLP task categories. This is a powerful demonstration of achieving better alignment with less tax.

      Figure 6: Results of Zephyr-7B-β evaluated by an open-source preference model. (Top) Similar trends evaluated by PairRM when averaging different blocks (DROP F1 vs. PairRM win-rate). (Bottom) HMA consistently outperforms MA on reading comprehension (F1) vs. PairRM win-rate.

    Below is a transcription of Table 1 from the paper.

    Table 1: GPT-4 evaluation of Zephyr-7B-β and Zephyr-7B-Gemma on the Alpaca benchmark.

    | Model | Win-Rate | Reading | CommonSense | Trans |
    |---|---|---|---|---|
    | Zephyr-7B-β | 8.10% | 37.47 | 66.34 | 36.55 |
    | HMA (Ours) | 9.32% | 38.93 | 66.55 | 37.23 |
    | Zephyr-7B-Gemma | 11.3% | 41.15 | 66.3 | 38.09 |
    | HMA (Ours) | 11.5% | 42.45 | 66.4 | 38.71 |

    Note: Reading is Reading Comprehension (F1), CommonSense is Accuracy (%), Trans is Translation Fr-En (BLEU).

  • Ablations / Parameter Sensitivity:

    • Effect of $K$ in HMA: Figure 5 (right) explores how the number of model parts ($K$) affects HMA's performance. While $K=6$ and $K=9$ can achieve slightly higher peak rewards, the overall trade-off curve is best for $K=3$. Using too many parts ($K>3$) may lead to overfitting on the proxy distillation task, slightly worsening the alignment-forgetting trade-off. This suggests that a coarse-grained division of the model is sufficient and more robust.

    • Practical Choice for $\alpha$: The paper suggests that an averaging ratio of $\alpha=0.2$ is a robust and effective choice. Figures 10 and 11 highlight this point, showing that at $\alpha=0.2$, both MA and HMA can significantly reduce the alignment tax (improving NLP scores), often without any loss in alignment reward compared to the original $\alpha=1$ model.

      Figure 10: Illustration of $\alpha = 0.2$ on vanilla model averaging. Three scatter plots (Common Sense QA, Reading Comprehension, Fr-En translation) show HH RLHF reward vs. task performance for the compared methods; the MA (RSF) curve forms the Pareto front, and the $\alpha = 0.2$ point is explicitly marked.

      Figure 11: Illustration of $\alpha = 0.2$ on HMA. The left panel compares MA (RSF) and HMA (RSF) on HH RLHF reward vs. reading comprehension F1, and the right panel compares MA (DPO) and HMA (DPO); arrows mark the $\alpha = 0.2$ point on each curve.

7. Conclusion & Reflections

  • Conclusion Summary: The paper provides a rigorous investigation into the alignment tax of RLHF, a critical problem in developing safe and capable LLMs. It demonstrates that simple model averaging (MA) is a surprisingly effective and efficient method for mitigating this tax, outperforming many more complex techniques. The authors provide both theoretical justification and empirical evidence for its success. They then propose Heterogeneous Model Averaging (HMA), an enhancement that optimizes layer-specific averaging ratios to achieve an even better alignment-forgetting trade-off. The findings are robustly validated across different models, algorithms, and evaluation standards.

  • Limitations & Future Work: The authors acknowledge that while HMA significantly alleviates the alignment tax, it does not completely eliminate it. A trade-off still exists, albeit a much more favorable one. Future work could investigate the theoretical lower bounds of the alignment tax to understand the limits of what is achievable and explore methods that might reach this optimal trade-off.

  • Personal Insights & Critique:

    • The Power of Simplicity: The most striking takeaway is the remarkable effectiveness of a simple, post-hoc method (MA) over complex, integrated training solutions. It serves as a reminder that in deep learning, elegant and simple solutions can often be the most powerful. MA is computationally cheap and requires no modification to the complex RLHF training pipeline, making it an extremely practical tool for practitioners.
    • Actionable Insights: The paper provides a clear, actionable recommendation: if you are aligning an LLM with RLHF, you should experiment with averaging your final checkpoint with your initial pre-RLHF checkpoint. An $\alpha$ value around 0.2 appears to be a great starting point for a "free lunch": reducing forgetting without hurting alignment.
    • Strong Theoretical Grounding: The connection between the theoretical model (feature diversity in shared spaces) and the empirical results (benefits of averaging low-level layers) is a significant strength of the paper. It moves beyond a purely empirical finding to offer a compelling explanation for why the method works.
    • Open Questions: While the proxy distillation method for HMA is practical, it would be interesting to see if directly solving the constrained RL optimization problem could yield even better results, though this would be far more computationally expensive. Furthermore, exploring whether the optimal heterogeneous ratios are transferable across different models or tasks could be a valuable direction for future research.
