-
The Existence of the Alignment Tax: Figure 12 shows a clear trend across RLHF training (the Early Stopping points): as the alignment reward increases, performance on Reading Comprehension and Translation drops consistently. Common Sense QA performance first rises slightly before falling, but the overall trade-off is evident.
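Figure 12: The alignment-forgetting trade-off during training. Three line charts plot Reading Comprehension performance, Common Sense QA accuracy, and French-English translation performance against the HH RLHF reward. As the alignment reward increases, performance on the pre-training tasks generally trends downward, reflecting the alignment tax.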
-
Superiority of Model Averaging: Figure 3 is a key result, plotting the performance of various methods on the alignment-forgetting plane. The orange line, representing Model Averaging (MA), forms a dominant Pareto front. For any given level of alignment reward, MA achieves higher NLP task performance than almost all other methods. The other methods appear as scattered points well below this front.
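Figure 3: Performance of existing methods that do not access pre-training data on three NLP tasks: Reading Comprehension, Common Sense QA, and French-English translation. The x-axis is the HH RLHF reward and the y-axis is the corresponding task metric (F1, ACC, BLEU). Model averaging (MA (RSF)) achieves the strongest alignment-forgetting Pareto front on all three tasks, clearly outperforming the other methods, including regularization, MoA, Graft, LoRA, and Early Stopping, and effectively balancing alignment performance against forgetting.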
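To make the idea of model averaging concrete, here is a minimal sketch that interpolates two sets of weights parameter by parameter. It is an illustration under stated assumptions, not the paper's implementation: the function name `average_models`, the toy dictionaries standing in for model state dicts, and the convention that `alpha` is the weight on the aligned model are all assumptions made for this example.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def average_models(pretrained: Params, aligned: Params, alpha: float) -> Params:
    """Interpolate parameter-wise: alpha * aligned + (1 - alpha) * pretrained.

    Assumption: alpha is the weight placed on the aligned model, so
    alpha = 0 returns the pre-trained weights and alpha = 1 the aligned ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != aligned.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [alpha * a + (1.0 - alpha) * p
               for p, a in zip(pretrained[name], aligned[name])]
        for name in pretrained
    }


# Hypothetical two-layer "models" with made-up weights.
pretrained = {"layer0.weight": [0.0, 2.0], "layer1.weight": [4.0]}
aligned = {"layer0.weight": [1.0, 0.0], "layer1.weight": [0.0]}

merged = average_models(pretrained, aligned, alpha=0.25)
print(merged)  # {'layer0.weight': [0.25, 1.5], 'layer1.weight': [3.0]}
```

Sweeping `alpha` from 0 to 1 and evaluating each merged model would trace out one curve on the alignment-forgetting plane, which is how a front like the MA line in Figure 3 can be read.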
-
HMA Outperforms MA: Figure 5 and the more detailed Figure 16 demonstrate that HMA (red line) consistently improves upon vanilla MA (orange line). It pushes the Pareto front further, achieving higher alignment rewards for the same or better NLP task performance. This holds true for both RSF and DPO alignment algorithms.
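Figure 5: Experimental results for Heterogeneous Model Averaging (HMA). The left and middle panels compare the Pareto fronts of HMA and conventional model averaging (MA) between "HH RLHF Reward" and "Reading Comprehension (F1)" under the RSF and DPO algorithms, respectively. The right panel shows HMA performance under RSF for different choices of K, indicating that HMA generally mitigates the alignment tax better than MA.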
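The difference between HMA and vanilla MA can be sketched as follows, assuming that HMA assigns a separate averaging ratio to each group of layers rather than one global ratio. This section does not spell out the mechanics, so the grouping rule `block_of`, the name `heterogeneous_average`, the specific ratios, and the convention that each ratio weights the aligned model are all hypothetical choices for illustration.

```python
from typing import Callable, Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def heterogeneous_average(
    pretrained: Params,
    aligned: Params,
    block_alphas: Dict[str, float],
    block_of: Callable[[str], str],
) -> Params:
    """Average with a separate ratio per block of parameters.

    Assumption: block_alphas[b] is the weight on the aligned model for
    every parameter whose name maps to block b via block_of.
    """
    if pretrained.keys() != aligned.keys():
        raise ValueError("both models must have the same parameter names")
    merged: Params = {}
    for name, p_vals in pretrained.items():
        alpha = block_alphas[block_of(name)]  # ratio for this parameter's block
        merged[name] = [alpha * a + (1.0 - alpha) * p
                        for p, a in zip(p_vals, aligned[name])]
    return merged


# Hypothetical two-layer "models" with made-up weights.
pretrained = {"layer0.weight": [0.0, 2.0], "layer1.weight": [4.0]}
aligned = {"layer0.weight": [1.0, 0.0], "layer1.weight": [0.0]}


def block_of(name: str) -> str:
    """Illustrative grouping rule: the block is the prefix before the first dot."""
    return name.split(".")[0]


merged_hma = heterogeneous_average(
    pretrained, aligned, {"layer0": 0.5, "layer1": 0.25}, block_of
)
print(merged_hma)  # {'layer0.weight': [0.5, 1.0], 'layer1.weight': [3.0]}
```

Setting every block's ratio to the same value reduces this to vanilla MA, so in this sketch MA is a special case of the heterogeneous scheme.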
-
Validation on Mistral-7B: The findings generalize to stronger, larger models. Figure 6 shows that for Zephyr-7B-β (a model fine-tuned from Mistral-7B), HMA again improves upon the MA Pareto front. Table 1 provides GPT-4 evaluation results, showing that HMA with α=0.2 improves upon the baseline Zephyr model in terms of GPT-4 win-rate and performance on all three NLP task categories. Together, these results indicate that HMA can achieve better alignment while paying a smaller alignment tax.
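Figure 6: Results for the Zephyr-7B model as evaluated by an open-source preference model. The top panel shows the trend of DROP (F1) against PairRM win-rate when different modules are averaged. The bottom panel compares HMA and MA on Reading Comprehension (F1) versus PairRM win-rate, showing that HMA consistently outperforms MA.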
Below is a transcription of Table 1 from the paper.