The paper's methodology is analytical and empirical rather than proposing a new algorithm. It focuses on identifying and validating the principles governing forgetting.
-
Core Result 1: RL Forgets Less than SFT
This figure consists of three parts: a schematic and two scatter plots. The schematic on the left classifies the four training methods along two axes: training data (positive samples only vs. positive and negative samples) and training mode (offline vs. online); the corresponding methods are SFT, 1-0 Reinforce, GRPO, and SimPO. The two scatter plots on the right show, for each method, new-task accuracy versus KL divergence (left) and new-task accuracy versus average prior-task score (right). Point colors correspond to the four methods and show that the RL methods reach smaller KL values while preserving old-task performance.
Figure 2 clearly illustrates the central finding across LLM and robotics tasks. For any given level of new-task accuracy, the RL-tuned models (red dashed lines) maintain significantly higher performance on prior tasks compared to SFT-tuned models (blue dotted lines). SFT shows a steep trade-off: gaining proficiency on the new task comes at a high cost of forgetting prior knowledge.
-
Core Result 2: KL Divergence Predicts Forgetting
This figure is a schematic of policy distributions in policy space and their optimization paths. On the left, blue and green regions mark the feasible policy set and the optimal policy set, respectively; individual points represent different policies, and arrows trace the iterative updates from the initial policy π₀ toward the optimal policies. On the right, histograms show how the policy distribution changes across training steps (1, 2, …, n), highlighting how the policy gradually moves into the optimal region. Overall, the figure illustrates how an RL algorithm converges in policy space while staying close to the original policy.
The controlled ParityMNIST experiments in Figure 3 provide strong evidence for the "empirical forgetting law."
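Stated schematically (a paraphrase of the law as described, with f an empirically fitted function and the forward KL measured on the new task's inputs, matching the forward-KL predictor reported in Section 6):

```latex
\text{Forgetting}(\pi) \;\approx\; f\big(D_{\mathrm{KL}}(\pi_0 \,\|\, \pi)\big),
\qquad
D_{\mathrm{KL}}(\pi_0 \,\|\, \pi)
  = \mathbb{E}_{x \sim \mathcal{D}_{\text{new}}}\!\Big[\textstyle\sum_{y} \pi_0(y \mid x)\,\log \tfrac{\pi_0(y \mid x)}{\pi(y \mid x)}\Big]
```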
-
The middle panel is the key result: regardless of whether the model was trained with RL, standard SFT, or even an "oracle" SFT, the amount of forgetting (the drop in FashionMNIST score) is a near-perfect function of the KL divergence from the base model.
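As a concrete illustration, this forward KL can be estimated by comparing the base and fine-tuned models' output distributions on new-task inputs. A minimal sketch, assuming PyTorch classifiers that return logits; `base_model`, `tuned_model`, and `new_task_loader` are placeholder names, not identifiers from the paper:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_kl(base_model, tuned_model, new_task_loader, device="cpu"):
    """Estimate E_x[ KL(pi_0(.|x) || pi(.|x)) ] over new-task inputs."""
    base_model.eval(); tuned_model.eval()
    total_kl, n = 0.0, 0
    for x, _ in new_task_loader:
        x = x.to(device)
        log_p_base = F.log_softmax(base_model(x), dim=-1)    # log pi_0(y|x)
        log_p_tuned = F.log_softmax(tuned_model(x), dim=-1)   # log pi(y|x)
        # KL(p_base || p_tuned) = sum_y p_base * (log p_base - log p_tuned)
        kl = (log_p_base.exp() * (log_p_base - log_p_tuned)).sum(dim=-1)
        total_kl += kl.sum().item()
        n += x.size(0)
    return total_kl / n
```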
-
The left panel shows that SFT on the optimal distribution (the oracle SFT) outperforms even RL, proving that the advantage is not unique to the RL algorithm itself but comes from finding a KL-minimal solution. When SFT is guided to this solution, it forgets the least.
-
This finding is also shown for LLMs in Figure 11, where RL models cluster at low KL and high prior-task scores, while SFT models drift to high KL and low scores.
This figure has three panels relating new-task accuracy, prior-task score, and KL divergence. The left panel compares new-task accuracy against prior-task score for SFT and RL under different distributions, with the RL curves showing less forgetting. The middle panel plots prior-task score against KL divergence and shows a negative correlation with R² = 0.961, indicating that KL divergence explains the degree of forgetting very well. The right panel shows new-task accuracy as a function of KL divergence, with RL achieving higher accuracy at lower KL, reflecting RL's tendency to stay close to the original model's distribution. Together, the panels support the paper's claim that RL favors KL-minimal solutions.
-
Core Result 3: On-Policy Training is the Key Mechanism
This figure shows three consecutive scatter plots of new-task accuracy (x-axis) against average prior-task score (y-axis). From left to right, the panels show all models, the models on the Pareto frontier, and the final plot, in which a fitted curve connects the Pareto-frontier points, visualizing the trade-off between the two metrics, i.e., how well a model retains old-task performance while learning the new task.
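The Pareto-frontier construction described above can be reproduced in a few lines: keep only the checkpoints that are not dominated on both axes. A minimal sketch (a hypothetical helper, not code from the paper):

```python
def pareto_frontier(points):
    """points: list of (new_task_acc, prior_task_score) per checkpoint.
    Returns the subset not dominated on both axes."""
    frontier = []
    for i, (x_i, y_i) in enumerate(points):
        dominated = any(
            (x_j >= x_i and y_j >= y_i) and (x_j > x_i or y_j > y_i)
            for j, (x_j, y_j) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((x_i, y_i))
    # Sort by new-task accuracy so the frontier can be plotted as a curve.
    return sorted(frontier)
```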
Figure 4 dissects why RL has this advantage. The experiment compares four algorithms:
- GRPO (on-policy, uses negatives)
- 1-0 Reinforce (on-policy, no negatives)
- SFT (offline, no negatives)
- SimPO (offline, uses negatives)
The results show a clear split: the two on-policy methods behave similarly, achieving low KL and low forgetting, while the two offline methods also behave similarly, suffering from high KL and high forgetting. This demonstrates that the defining factor is whether the training data is sampled from the model's current policy (on-policy), not whether the algorithm learns from negative examples.
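The distinction can be made concrete with schematic training loops: the offline variant always fits targets from a fixed dataset, while the on-policy variant fits samples drawn from the model's own current policy, reweighted by reward. This is a sketch of the general pattern, not the paper's implementation; `model.log_prob`, `model.sample`, and `reward_fn` are assumed interfaces:

```python
import torch

def offline_step(model, optimizer, batch):
    """SFT-style update: targets come from a fixed dataset,
    regardless of what the current model would generate."""
    prompts, targets = batch
    loss = -model.log_prob(targets, prompts).mean()   # maximize likelihood of fixed labels
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def on_policy_step(model, optimizer, prompts, reward_fn):
    """REINFORCE-style update: targets are sampled from the *current* policy,
    so the training distribution tracks the model and stays close to it in KL."""
    with torch.no_grad():
        samples = model.sample(prompts)        # y ~ pi_theta(.|x)
        rewards = reward_fn(prompts, samples)  # e.g. 1-0 correctness
    loss = -(rewards * model.log_prob(samples, prompts)).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```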
-
Analysis of Alternative Hypotheses
The paper rigorously tests other potential explanations for forgetting in Section 6 and Table 1. It measures changes at the weight level (L1 distance, Fisher-weighted L2), at the representation level (activation changes), and in update properties (sparsity, rank). None of these alternatives provides a consistent or strong prediction of forgetting. Forward KL divergence (R² = 0.96) stands out as the most reliable predictor, far surpassing all other candidates.
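For reference, the weight-level candidates in that comparison are simple to compute from the base and fine-tuned parameter sets. A minimal sketch of two of them, L1 distance and a Fisher-weighted L2 distance with a diagonal Fisher estimate passed in (names are placeholders, not the paper's code):

```python
import torch

def weight_change_metrics(base_params, tuned_params, fisher_diag=None):
    """base_params / tuned_params: dicts name -> tensor with matching keys.
    fisher_diag: optional dict of per-parameter diagonal Fisher estimates."""
    l1, fisher_l2 = 0.0, 0.0
    for name, p0 in base_params.items():
        delta = tuned_params[name] - p0
        l1 += delta.abs().sum().item()                      # L1 weight change
        if fisher_diag is not None:
            fisher_l2 += (fisher_diag[name] * delta.pow(2)).sum().item()
    return {"l1": l1, "fisher_weighted_l2": fisher_l2}
```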
-
Additional Supporting Results (from Appendix)
-
Representation Drift (CKA): Figure 7 shows that SFT causes significant representational drift (low CKA score), while RL fine-tuning leaves the model's internal representations largely intact (high CKA score), aligning with the behavioral findings.
This figure is a two-dimensional scatter plot with MNIST accuracy on the x-axis and Fashion MNIST accuracy on the y-axis. Blue dots mark SFT student models and red stars mark RL teacher models. SFT models show a clear trade-off between the two tasks, while RL models keep Fashion MNIST accuracy relatively high even at high MNIST accuracy, illustrating RL's advantage over SFT in retaining prior knowledge.
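The CKA comparison in Figure 7 can be reproduced with the standard linear-CKA formula on activations collected from the base and fine-tuned models over the same inputs. A minimal sketch (standard formula, not the paper's exact code; `X` and `Y` are activation matrices of shape [n_examples, n_features]):

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between two activation matrices X [n, d1] and Y [n, d2].
    Values near 1 mean the representations are highly similar."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = torch.linalg.norm(X.T @ Y) ** 2    # ||X^T Y||_F^2
    norm_x = torch.linalg.norm(X.T @ X)       # ||X^T X||_F
    norm_y = torch.linalg.norm(Y.T @ Y)       # ||Y^T Y||_F
    return (hsic / (norm_x * norm_y)).item()
```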
-
Effect of Model Scale: Figure 8 shows that while larger models are inherently more robust and start from a better performance point, they still exhibit the same catastrophic forgetting trade-off when fine-tuned with SFT. Scaling does not eliminate the problem.
This figure is a pair of scatter plots showing, for different learning rates (lr = 1e-4, 2e-4, 4e-4), the relationship between the change in KL divergence during training (x-axis) and gradient similarity (y-axis); the left panel is for the Parity MNIST task and the right panel for Fashion MNIST. Points of different colors and their fitted lines show that smaller KL changes coincide with higher gradient similarity, supporting the paper's conclusion that RL updates are biased toward KL-minimal solutions.
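If gradient similarity is read as the cosine similarity between the gradients of two objectives at the same parameters (an assumption about the metric, not a detail confirmed by the figure), it can be measured as in the sketch below; `loss_a` and `loss_b` are placeholder scalar losses:

```python
import torch

def gradient_cosine_similarity(model, loss_a, loss_b):
    """Cosine similarity between the gradients of two scalar losses
    with respect to the same model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads_a = torch.autograd.grad(loss_a, params, retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, params, retain_graph=True)
    flat_a = torch.cat([g.reshape(-1) for g in grads_a])
    flat_b = torch.cat([g.reshape(-1) for g in grads_b])
    return torch.nn.functional.cosine_similarity(flat_a, flat_b, dim=0).item()
```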