RL's Razor: Why Online Reinforcement Learning Forgets Less
TL;DR Summary
This paper introduces "RL's Razor," demonstrating why online reinforcement learning prevents catastrophic forgetting better than supervised fine-tuning. Through KL-divergence analysis and experiments, it finds that on-policy RL implicitly minimizes the distributional shift from the base model, measured as KL-divergence on the new task, and that this shift predicts how much prior knowledge is forgotten.
Abstract
Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance at a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL-divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle RL's Razor: among all ways to solve a new task, RL prefers those closest in KL to the original model.
English Analysis
1. Bibliographic Information
- Title: RL's Razor: Why Online Reinforcement Learning Forgets Less
- Authors: Idan Shenfeld, Jyothish Pari, Pulkit Agrawal (Improbable AI Lab, MIT)
- Journal/Conference: Preprint on arXiv, not yet peer-reviewed. Venues such as NeurIPS, ICLR, or ICML are typical for this line of machine-learning research.
- Publication Year: 2025 (the arXiv identifier 2509.04259v1 corresponds to a September 2025 submission, and the paper cites many works from 2025).
- Abstract: The paper compares two fine-tuning methods, Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), for adapting foundation models. It finds that while both can achieve similar performance on a new task, RL is significantly better at preserving prior knowledge, thus forgetting less. The core discovery is an "empirical forgetting law": the amount of forgetting is strongly predicted by the KL-divergence between the fine-tuned and base model's policies, evaluated on the new task's data. The authors argue that on-policy RL is implicitly biased towards solutions with minimal KL-divergence, a principle they name RL's Razor. This bias is validated through experiments on large language and robotic models and supported by a theoretical analysis.
- Original Source Link:
- Source: https://arxiv.org/abs/2509.04259v1
- PDF: https://arxiv.org/pdf/2509.04259
- Status: Preprint on arXiv.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Foundation models, while powerful, are often static after deployment. To create long-lived, continuously adapting AI agents, they must learn new skills without forgetting old ones. This is hindered by catastrophic forgetting, where training on a new task erases previously learned capabilities.
- Importance & Gaps: Existing methods to combat forgetting often treat the symptoms (e.g., constraining weight changes) rather than the root cause. It remains unclear why different training algorithms, like RL and SFT, exhibit vastly different forgetting behaviors, even when achieving the same new-task performance.
- Innovation: This paper moves beyond heuristic solutions to propose a fundamental principle. It identifies a simple, measurable quantity—the KL-divergence on the new task's data distribution—as a reliable predictor of forgetting. It then provides a causal explanation for why on-policy RL inherently minimizes this KL-divergence, thus forgetting less.
- Main Contributions / Findings (What):
- Empirical Finding: RL fine-tuning consistently forgets less than SFT, even when both methods are optimized to achieve identical performance on a new task.
- The Empirical Forgetting Law: The paper discovers that the degree of catastrophic forgetting is strongly predicted by the KL-divergence between the base and fine-tuned policies, specifically when measured on the distribution of the new task (the expected KL(π ∥ π₀) over new-task inputs, with π the fine-tuned policy and π₀ the base policy). This is a practical metric as it doesn't require access to old task data.
- RL's Razor Principle: The paper introduces the principle of "RL's Razor," stating that among all possible solutions that solve a new task, on-policy RL methods are inherently biased towards the one that is closest in KL-divergence to the original model. SFT, in contrast, can converge to any solution dictated by the supervision data, which may be arbitrarily far in KL-space.
- Causal Mechanism: The on-policy nature of RL is identified as the key mechanism. By sampling from its own distribution, the model's updates are conservative and localized, leading to a gradual shift rather than a drastic pull towards a potentially distant target distribution, as seen in SFT. This is supported by both empirical ablations and theoretical analysis.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Foundation Models: Large-scale models (like GPT-4 or Llama) pre-trained on vast amounts of data. They serve as a general-purpose base that can be adapted to various downstream tasks.
- Supervised Fine-Tuning (SFT): A common method to adapt a pre-trained model by training it on a smaller, curated dataset of input-output examples (e.g., instruction-response pairs) using a standard cross-entropy loss.
- Reinforcement Learning (RL): A training paradigm where an agent learns by interacting with an environment. It receives rewards (or penalties) for its actions and adjusts its policy to maximize cumulative reward. In the context of LLMs, this is often done with human feedback (RLHF) or automated rewards.
- Catastrophic Forgetting: The tendency of a neural network to abruptly lose knowledge of previously learned tasks when it is trained on a new task.
- KL-Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution diverges from a second, reference probability distribution. A KL-divergence of zero means the two distributions are identical. The paper focuses on the "forward" KL between the fine-tuned policy π and the base policy π₀, KL(π ∥ π₀), which measures the expected log-difference in probabilities; a short formulation follows this list.
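To pin the notation down, here is a short formulation (ours, not the paper's; the ordering follows the abstract's "fine-tuned and base policy" phrasing, and the paper's exact convention may differ) of the quantity the forgetting law tracks, with π the fine-tuned policy, π₀ the base policy, and x ranging over new-task inputs:

```latex
% Forward KL between the fine-tuned policy \pi and the base policy \pi_0 at a new-task input x,
% averaged over the new task's input distribution \mathcal{D}_{\mathrm{new}}:
\mathbb{E}_{x \sim \mathcal{D}_{\mathrm{new}}}\!\left[\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\big\|\,\pi_0(\cdot \mid x)\big)\right]
  \;=\; \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{new}}}\;\sum_{y}\,\pi(y \mid x)\,\log\frac{\pi(y \mid x)}{\pi_0(y \mid x)} .
```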
- Previous Works & Differentiation:
- Forgetting Mitigation: Previous methods like Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) penalize changes to important model weights, while others focus on preserving learned features or replaying old data. The paper argues these methods address the symptoms of forgetting. This work, instead, identifies a predictive principle (KL-divergence) that explains the cause.
- SFT vs. RL Comparisons: Prior work often compared SFT and RL based on new-task performance or generalization (Ross et al., 2011; Chu et al., 2025). This paper is novel in its focus on their differing susceptibility to catastrophic forgetting.
- Concurrent Work: The paper acknowledges Lai et al. (2025), which also found RL forgets less. However, it differentiates itself by attributing the cause to the on-policy nature of RL, not the use of negative examples, and by introducing the empirical forgetting law and the RL's Razor principle.
4. Methodology (Core Technology & Implementation)
The paper's methodology is analytical and empirical rather than proposing a new algorithm. It focuses on identifying and validating the principles governing forgetting.
- Principles:
- Forgetting is a function of distributional shift: The core hypothesis is that the degree to which a model forgets prior knowledge is determined by how much its output distribution shifts away from the base model's distribution.
- The shift that matters is on the new task: Crucially, this distributional shift can be measured on the data distribution of the new task, making it a practical tool.
- On-policy learning is inherently conservative: The central mechanism of RL's Razor is that on-policy updates are a form of conservative projection. The model learns by re-weighting samples it already considers plausible, which naturally leads to smaller changes in its output distribution compared to SFT, which forces the model towards an arbitrary external target distribution.
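A tiny numerical illustration (our toy example, not the paper's) of why the on-policy target is conservative: with a binary "the answer is any even digit" reward, the on-policy target is the base policy conditioned on success, whereas SFT pulls toward whatever label distribution its dataset encodes, here a one-hot "always answer 2" supervision. Among all distributions that fully solve the task, the conditioned distribution is provably the closest to the base policy in KL.

```python
# Toy comparison of the two fine-tuning "targets" for a 10-way categorical base policy.
# Both targets achieve 100% task accuracy; the on-policy one is KL-minimal.
import numpy as np

rng = np.random.default_rng(1)
base = rng.dirichlet(np.ones(10))            # base policy over the 10 digits
success = (np.arange(10) % 2 == 0)           # any even digit counts as correct

rl_target = np.where(success, base, 0.0)
rl_target /= rl_target.sum()                 # base policy conditioned on success

sft_target = np.zeros(10)
sft_target[2] = 1.0                          # external supervision: always answer "2"

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

print("KL(RL target  || base):", round(kl(rl_target, base), 3))   # = -log base(success set)
print("KL(SFT target || base):", round(kl(sft_target, base), 3))  # = -log base("2"), always >=
```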
- Steps & Procedures (Theoretical Justification): The paper provides a theoretical lens to understand why on-policy RL minimizes KL-divergence, framing the policy gradient update as an alternating projection procedure analogous to the Expectation-Maximization (EM) algorithm.
(Figure description: a chart of knowledge retention measured by the centered kernel alignment (CKA) score across training steps. The SFT curve (blue, dashed) drops sharply as gradient steps accumulate, indicating severe forgetting, while the RL curve (red, dashed) stays high, indicating little forgetting, consistent with the paper's conclusion that RL better preserves prior knowledge.)
As shown in Figure 5, the process can be seen as:
- I-Projection (Information Projection): At step k, given the current policy πₖ, find the target distribution π*ₖ that satisfies the reward while staying closest to πₖ in KL-divergence. For a binary reward (success/failure), this corresponds to rejection sampling: conditioning the current policy on successful outcomes.
- Lemma 5.1: This step is formalized as π*ₖ(y | x) = πₖ(y | x) · r(x, y) / E_{y′∼πₖ(·|x)}[r(x, y′)], where πₖ is the current policy, r is the binary reward, and π*ₖ is the target distribution (the current policy conditioned on success).
- M-Projection (Moment Projection): Update the policy from πₖ to πₖ₊₁ by finding the policy within the model's representable family (Π) that is closest to the target distribution π*ₖ. A policy gradient step is shown to be equivalent to performing this M-projection.
- Theorem 5.2: This iterative process of I- and M-projections is shown to converge to the optimal policy within the representable family that has the minimum KL-divergence to the initial policy π₀. This theorem provides the theoretical backbone for RL's Razor.
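The alternating projections can be simulated end to end in a few lines. The following toy sketch (ours, with an arbitrary categorical "policy" and binary reward, not the paper's setup) runs rejection sampling as the I-projection and a maximum-likelihood gradient step toward the accepted samples as the approximate M-projection; the final policy solves the task while staying near the KL-minimal reward-satisfying distribution, i.e., the base policy conditioned on success.

```python
# Toy simulation of the I-projection / M-projection view of on-policy RL with a 0/1 reward.
import numpy as np

rng = np.random.default_rng(0)
K = 10
base_logits = rng.normal(size=K)                # frozen base policy pi_0
logits = base_logits.copy()                     # trainable policy pi_k
reward = (np.arange(K) % 2 == 0).astype(float)  # toy binary reward: "even" actions succeed

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

lr, batch = 0.5, 1024
for _ in range(100):
    pi = softmax(logits)
    samples = rng.choice(K, size=batch, p=pi)
    kept = samples[reward[samples] == 1.0]      # I-projection: condition pi_k on success
    if len(kept) == 0:
        continue
    counts = np.bincount(kept, minlength=K) / len(kept)
    logits += lr * (counts - pi)                # M-projection: gradient of mean log pi(kept)

pi_final = softmax(logits)
pi_base = softmax(base_logits)
conditioned = pi_base * reward / np.sum(pi_base * reward)  # base policy conditioned on success

print("final success rate:", float(np.sum(pi_final * reward)))
print("KL(pi_final || pi_0):      ", round(kl(pi_final, pi_base), 4))
print("KL(pi_0 | success || pi_0):", round(kl(conditioned, pi_base), 4))  # theoretical minimum
```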
5. Experimental Setup
- Datasets & Tasks:
  - Large Language Models (LLMs):
    - Base Model: Qwen 2.5 3B-Instruct.
    - Tasks: Math reasoning (Open-Reasoner-Zero), Science Q&A (SciKnowEval), and Tool use (ToolAlpaca).
  - Robotics:
    - Base Model: OpenVLA 7B.
    - Environment: SimplerEnv.
    - Task: Pick and place a can.
  - Controlled Toy Setting:
    - Task: ParityMNIST, a modified MNIST task where the goal is to predict the parity (even/odd) of a digit. This allows for multiple correct output distributions (e.g., any even digit is a correct answer for an even-digit image).
    - Model: A 3-layer MLP pre-trained on both ParityMNIST and FashionMNIST.
- Evaluation Metrics:
  - New Task Performance: Accuracy on a held-out test set for the new task.
  - Prior Task Performance (Forgetting): Average performance on a suite of diverse, unrelated benchmarks.
    - For LLMs: Hellaswag, TruthfulQA, MMLU, IFEval, Winogrande, HumanEval.
    - For Robotics: Performance on other SimplerEnv tasks (e.g., open/close drawer).
    - For ParityMNIST: Accuracy on FashionMNIST.
- Baselines & Methods Compared:
  - SFT: Standard supervised fine-tuning.
  - RL (GRPO): On-policy reinforcement learning using the GRPO algorithm, with a simple binary success reward and no explicit KL regularization (a sketch of the group-relative advantage it uses follows this list).
  - 1-0 Reinforce: An on-policy method that only learns from positive examples (equivalent to SFT on model-generated correct samples).
  - SimPO: An offline method that learns from both positive and negative examples, but the data is fixed (offline) rather than sampled from the current policy.
  - Oracle SFT: An SFT setup for ParityMNIST where the training labels are sampled from a distribution that is analytically calculated to be the KL-minimal distribution that achieves 100% task accuracy.
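Since the GRPO baseline above uses only a binary success reward and no KL term, here is a minimal sketch, under our own assumptions rather than the paper's code, of the group-relative advantage computation that GRPO-style training applies to the completions sampled for each prompt:

```python
# Group-relative advantages for one prompt's sampled completions (GRPO-style), binary rewards.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within the group of samples drawn for a single prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions sampled from the current policy, 3 of them succeed.
rewards = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
adv = grpo_advantages(rewards)
# The policy-gradient loss then weights each completion's token log-probs by its advantage:
# successful samples are pushed up, failed ones pushed down.
print(adv)
```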
6. Results & Analysis
- Core Result 1: RL Forgets Less than SFT
(Figure description: three panels, a schematic plus two scatter plots. The schematic classifies four training methods by sample type (positive-only vs. positive and negative) and by training mode (offline vs. online), corresponding to SFT, 1-0 Reinforce, GRPO, and SimPO. The scatter plots relate new-task accuracy to KL divergence (left) and to average prior-task score (right), with point colors for the four methods, showing that the RL-style methods reach comparable old-task retention at smaller KL.)
Figure 2 clearly illustrates the central finding across LLM and robotics tasks. For any given level of new-task accuracy, the RL-tuned models (red dashed lines) maintain significantly higher performance on prior tasks compared to SFT-tuned models (blue dotted lines). SFT shows a steep trade-off: gaining proficiency on the new task comes at a high cost of forgetting prior knowledge.
- Core Result 2: KL Divergence Predicts Forgetting
(Figure description: a schematic of policy space. On the left, blue and green regions mark the set of feasible policies and the set of optimal policies, with points for individual policies and arrows showing the iterative update from the initial policy π₀ toward the optimal set P; on the right, histograms show how the policy distribution changes over training steps 1, 2, ..., n, gradually approaching the optimal region. Overall it illustrates how the RL algorithm converges in policy space while staying close to the original policy.)
The controlled ParityMNIST experiments in Figure 3 provide strong evidence for the "empirical forgetting law."
  - The middle panel is the key result: regardless of whether the model was trained with RL, standard SFT, or even an "oracle" SFT, the amount of forgetting (drop in FashionMNIST score) is a near-perfect function of the KL-divergence from the base model.
  - The left panel shows that SFT on optimal dist. (the oracle SFT) outperforms even RL, proving that the advantage is not unique to the RL algorithm itself but to finding a KL-minimal solution. When SFT is guided to this solution, it forgets the least.
  - This finding is also shown for LLMs in Figure 11, where RL models cluster at low KL and high prior-task scores, while SFT models drift to high KL and low scores.
(Figure description: a three-panel chart relating new-task accuracy, prior-task score, and KL divergence. The left panel compares new-task accuracy against prior-task score for SFT and RL under different distributions, with the RL curves forgetting less; the middle panel plots prior-task score against KL divergence with a fit of R² = 0.961, indicating that KL divergence explains the degree of forgetting well; the right panel shows new-task accuracy against KL divergence, with RL reaching higher accuracy in the low-KL range. Together the panels support the claim that RL prefers low-KL solutions.)
- Core Result 3: On-Policy Training is the Key Mechanism
(Figure description: three consecutive scatter plots of new-task accuracy (x-axis) against average prior-task score (y-axis): all models, models on the Pareto frontier, and a final plot in which a fitted curve connects the Pareto-frontier points, showing the trade-off between learning the new task and retaining prior-task performance.)
Figure 4 dissects why RL has this advantage. The experiment compares four algorithms:
  - GRPO (On-policy, uses negatives)
  - 1-0 Reinforce (On-policy, no negatives)
  - SFT (Offline, no negatives)
  - SimPO (Offline, uses negatives)
The results show a clear split: the two on-policy methods behave similarly, achieving low KL and low forgetting, while the two offline methods also behave similarly, suffering from high KL and high forgetting. This demonstrates that the defining factor is whether the training data is sampled from the model's current policy (on-policy), not whether the algorithm learns from negative examples.
- Analysis of Alternative Hypotheses: The paper rigorously tests other potential explanations for forgetting in Section 6 and Table 1. It measures changes at the weight level (Euclidean and Fisher-weighted parameter distances), at the representation level (activation changes), and in update properties (sparsity, rank). None of these alternatives provides a consistent or strong prediction of forgetting. Forward KL-divergence measured on the new task stands out as the most reliable predictor, far surpassing all other candidates.
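As a practical aside, the new-task KL predictor can be estimated directly from token-level log-probabilities. A minimal sketch follows (our assumptions, not the paper's evaluation code; function names and shapes are ours, and the ordering mirrors the "fine-tuned vs. base" phrasing used above):

```python
# Estimate the per-position KL between a fine-tuned model and its base model from their logits.
import torch
import torch.nn.functional as F

def mean_token_kl(ft_logits: torch.Tensor, base_logits: torch.Tensor) -> torch.Tensor:
    """Average over positions of KL(p_ft || p_base); both logit tensors have shape [T, vocab]."""
    log_p_ft = F.log_softmax(ft_logits, dim=-1)
    log_p_base = F.log_softmax(base_logits, dim=-1)
    kl_per_pos = (log_p_ft.exp() * (log_p_ft - log_p_base)).sum(dim=-1)
    return kl_per_pos.mean()

# Usage: run both models on the same new-task prompt (ideally on responses sampled from the
# fine-tuned model), collect logits at the same positions, then average over prompts.
ft_logits = torch.randn(12, 32000)    # placeholder shapes: 12 tokens, 32k-token vocabulary
base_logits = torch.randn(12, 32000)
print(float(mean_token_kl(ft_logits, base_logits)))
```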
- Additional Supporting Results (from Appendix)
  - Representation Drift (CKA): Figure 7 shows that SFT causes significant representational drift (low CKA score), while RL fine-tuning leaves the model's internal representations largely intact (high CKA score), aligning with the behavioral findings.
(Figure description: a 2D scatter plot with MNIST accuracy on the x-axis and FashionMNIST accuracy on the y-axis; blue dots mark SFT student models and red stars mark RL teacher models. The SFT models trade accuracy between the two tasks, while the RL models keep FashionMNIST accuracy relatively high at high MNIST accuracy, reflecting RL's advantage in retaining prior knowledge.)
  - Effect of Model Scale: Figure 8 shows that while larger models are inherently more robust and start from a better performance point, they still exhibit the same catastrophic forgetting trade-off when fine-tuned with SFT. Scaling does not eliminate the problem.
(Figure description: scatter plots for ParityMNIST (left) and FashionMNIST (right) relating the change in KL divergence during training (x-axis) to gradient similarity (y-axis) at learning rates 1e-4, 2e-4, and 4e-4; the points and fitted lines for each learning rate show that smaller KL changes go with higher gradient similarity, supporting the claim that RL updates are biased toward KL-minimal solutions.)
7. Conclusion & Reflections
- Conclusion Summary: The paper makes a compelling case that catastrophic forgetting during fine-tuning is not an arbitrary side effect but is governed by a simple principle: it is proportional to the KL-divergence between the fine-tuned and base models, measured on the new task distribution. It introduces "RL's Razor," explaining that on-policy RL's superior ability to preserve knowledge stems from its inherent bias towards KL-minimal solutions. This provides a unified explanation for the empirical differences between RL and SFT and suggests a new direction for designing continual learning algorithms: explicitly seek KL-minimal solutions.
- Limitations & Future Work:
- Mechanism: The paper establishes a strong correlation but does not fully explain the mechanistic link between KL-divergence on the new task and performance degradation on old tasks. The exact dynamics of representational interference remain an open question.
- Scale: While tested on models up to 14B parameters, the behavior at the scale of frontier models (100B+ parameters) is yet to be confirmed.
- Algorithm Scope: The study primarily focuses on on-policy RL. The behavior of modern off-policy algorithms in this context is not explored.
- Personal Insights & Critique:
- Significance: This paper is significant because it shifts the conversation about catastrophic forgetting from algorithmic heuristics to a more fundamental, information-theoretic principle. The "empirical forgetting law" is a powerful and practical insight.
- RL's Razor: The "RL's Razor" principle is an elegant and intuitive explanation. It reframes RL not just as a method for maximizing rewards but as a "conservative" learning process that respects the model's prior knowledge. This has profound implications for building safer and more reliable AI systems that can learn continuously.
- Practical Implications: The findings suggest that future continual learning methods could combine the data efficiency of SFT with the knowledge preservation of RL. For instance, one could design SFT datasets or loss functions that explicitly minimize KL-divergence, potentially achieving the best of both worlds, as hinted at by the "oracle SFT" experiment (see the sketch after this list).
- Open Questions: Could the KL-divergence on the new task be used as a real-time regularizer during any fine-tuning process to control the amount of forgetting? How does this principle interact with other continual learning techniques like parameter-efficient fine-tuning (PEFT)? This work lays a strong foundation for a new and promising line of research.
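As a concrete, hedged illustration of the practical-implications point above, here is a sketch, not from the paper, of an SFT objective with an explicit KL-to-base regularizer; the function name, the beta weight, and the token-level formulation are our own choices:

```python
# KL-regularized SFT loss: cross-entropy on supervision targets plus beta * KL(p_ft || p_base).
import torch
import torch.nn.functional as F

def kl_regularized_sft_loss(ft_logits, base_logits, target_ids, beta: float = 0.1):
    """ft_logits, base_logits: [T, vocab] (base_logits come from the frozen base model);
    target_ids: [T] supervision tokens."""
    ce = F.cross_entropy(ft_logits, target_ids)
    log_p_ft = F.log_softmax(ft_logits, dim=-1)
    log_p_base = F.log_softmax(base_logits, dim=-1)
    kl = (log_p_ft.exp() * (log_p_ft - log_p_base)).sum(dim=-1).mean()
    return ce + beta * kl

# Usage sketch with random data; in practice base_logits are computed under torch.no_grad().
T, V = 8, 100
ft_logits = torch.randn(T, V, requires_grad=True)
base_logits = torch.randn(T, V)
targets = torch.randint(0, V, (T,))
loss = kl_regularized_sft_loss(ft_logits, base_logits, targets)
loss.backward()
print(float(loss))
```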
Similar papers
Recommended via semantic vector search.
Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning
This work theoretically proves the suboptimality of sequential fine-tuning and preference learning in LLMs, proposing a joint training framework with convergence guarantees that improves performance without extra cost.
Debunk the Myth of SFT Generalization
This paper challenges the notion of SFT's inherent generalization failure. It shows that using prompt diversity and Chain-of-Thought supervision enables SFT models to generalize robustly to varied instructions and harder, out-of-distribution tasks, matching or exceeding RL baselines.
Mitigating the Alignment Tax of RLHF
This paper addresses RLHF's "alignment tax," where LLMs forget pre-trained abilities, creating an alignment-forgetting trade-off. It shows that simple model averaging surprisingly achieves the best Pareto front. Building on theoretical insights, a novel Heterogeneous Model Averaging (HMA) approach is proposed.