
ExGRPO: Learning to Reason from Experience

Tags: Reinforcement Learning for Math Reasoning · Sequence Policy Optimization · RL Training for Large Language Models

TL;DR Summary

ExGRPO identifies key experience metrics to prioritize valuable reasoning data, improving reinforcement learning efficiency and reasoning performance in large language models, with stable training across diverse model scales.

Abstract

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.

English Analysis

1. Bibliographic Information

  • Title: ExGRPO: Learning to Reason from Experience
  • Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng.
  • Affiliations: The authors are from the University of Macau, Shanghai AI Laboratory, Nanjing University, and The Chinese University of Hong Kong. This collaboration brings together expertise from both academic institutions and a major AI research lab.
  • Journal/Conference: This paper is an arXiv preprint. Preprints are non-peer-reviewed manuscripts that researchers share to disseminate findings quickly. While not formally published, they often represent cutting-edge research.
  • Publication Year: 2025 (the arXiv identifier 2510.02245 corresponds to an October 2025 submission).
  • Abstract: The paper addresses the inefficiency of standard on-policy reinforcement learning for improving the reasoning abilities of large language models (LLMs). This inefficiency stems from discarding valuable training experiences after a single use. The authors first investigate what makes a reasoning experience valuable, identifying rollout correctness and trajectory entropy as key indicators. Based on this, they propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that manages and prioritizes high-value experiences from a replay buffer. ExGRPO uses a mixed-policy objective to balance learning from new (on-policy) and past (off-policy) experiences. Experiments on models from 1.5B to 8B parameters show that ExGRPO significantly improves reasoning performance over on-policy methods and stabilizes training, especially for weaker models where standard methods fail.
  • Original Source Link: https://arxiv.org/abs/2510.02245v1

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Training large language models (LLMs) for complex reasoning tasks using reinforcement learning is computationally expensive and often unstable. The standard approach, on-policy reinforcement learning, is sample-inefficient because it generates a vast amount of "experience" (i.e., attempts at solving problems) but discards it after only one learning update.
    • Importance & Gap: Reusing past experience (a technique known as experience replay) is a known solution for sample inefficiency in traditional RL. However, for LLM reasoning, it's not well understood which past experiences are most valuable for learning. Simply replaying all past successes can be inefficient or even harmful. The paper identifies a critical gap: the lack of a systematic method to identify, prioritize, and effectively learn from high-value reasoning experiences.
    • Fresh Angle: The paper's innovation is to first empirically determine the properties of a "valuable" reasoning experience. It hypothesizes and confirms that experiences related to medium-difficulty problems (not too easy, not too hard) and those with low uncertainty (low entropy) are the most beneficial for training.
  • Main Contributions / Findings (What):

    • Analysis of Experience Value: The paper is the first to systematically analyze and identify that rollout correctness (as a proxy for problem difficulty) and trajectory entropy (as a proxy for reasoning quality) are effective indicators of valuable experiences for RL-based reasoning training.
    • ExGRPO Framework: It introduces ExGRPO (Experiential Group Relative Policy Optimization), a novel framework that incorporates principled experience management into the RL training loop. Its key features are:
      1. Experience Management: It organizes past successful experiences into a replay buffer, partitioned by problem difficulty.
      2. Prioritized Sampling: It prioritizes replaying experiences from medium-difficulty problems and selects the specific reasoning trajectory with the lowest entropy (i.e., the most "confident" solution).
      3. Mixed-Policy Optimization: It combines learning from new, exploratory rollouts with learning from these prioritized past experiences, using a unified objective function that corrects for statistical discrepancies.
    • Empirical Results: ExGRPO consistently improves reasoning performance across five different LLMs, achieving an average gain of +3.5 points on math benchmarks and +7.6 points on general reasoning benchmarks compared to standard on-policy RL. Critically, it also stabilizes training for models that would otherwise fail or "collapse" with on-policy methods.

3. Prerequisite Knowledge & Related Work

To understand this paper, a beginner should be familiar with the following concepts:

  • Foundational Concepts:

    • Large Language Models (LLMs): These are massive neural networks (e.g., GPT-4, Llama) trained on vast amounts of text, capable of generating human-like language. In this paper, they act as the "agent" that learns to reason.
    • Chain-of-Thought (CoT): A technique where an LLM is prompted to generate intermediate reasoning steps before giving a final answer, mimicking a human's thought process. This sequence of steps is treated as a "trajectory" in the RL framework.
    • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make a sequence of decisions in an environment to maximize a cumulative reward. In this paper, the LLM is the agent, generating a CoT is the sequence of decisions, and the reward is given based on whether the final answer is correct.
    • Reinforcement Learning from Verifiable Rewards (RLVR): A specific application of RL to LLMs where tasks have a clearly right or wrong answer (e.g., math problems). A "verifier" automatically checks the LLM's final answer and provides a binary reward (1 for correct, 0 for incorrect). This avoids the need for expensive human feedback.
    • On-policy vs. Off-policy RL:
      • On-policy: The agent learns exclusively from experiences generated by its current policy (its current strategy). This is stable but inefficient, as data is used once and thrown away.
      • Off-policy: The agent can learn from experiences generated by past versions of its policy (or even other policies). This is more sample-efficient but can be unstable if the past data is too different from what the current policy would generate. ExGRPO is an off-policy method.
    • Experience Replay: A core technique in off-policy RL. Past experiences (state, action, reward) are stored in a memory buffer and are "replayed" (sampled) for training alongside new experiences.
    • Group Relative Policy Optimization (GRPO): An on-policy RL algorithm that ExGRPO builds upon. Instead of using a complex value model, it estimates the "advantage" of a solution by comparing its reward to the average reward of a group of solutions generated for the same problem. This simplifies training.
    • Entropy: In information theory, entropy measures uncertainty or randomness. For an LLM, a low-entropy output means the model is very confident about the sequence of tokens it is generating. The paper uses this as a proxy for high-quality, non-random reasoning. (Explicit formulas for the GRPO advantage and this trajectory entropy are given at the end of this section.)
  • Previous Works & Differentiation:

    • The paper situates itself between two lines of work: RLVR and Experience-based RL.
    • RLVR Methods: Most existing RLVR methods like PPO, GRPO, and Dr.GRPO are on-policy. They are computationally expensive and discard valuable data. Some recent works have explored off-policy RLVR by mixing in expert data or developing new update rules, but they often overlook the quality and characteristics of the replayed data.
    • Experience-based RL: Methods like ReMix, RePO, and RLEP use experience replay for LLMs. However, they often treat all successful experiences as equally valuable or use simple replay strategies. ARPO combines GRPO with a replay buffer but lacks the sophisticated analysis of experience value.
    • Differentiation: ExGRPO's key innovation is its principled approach to experience management. It doesn't just replay past data; it first analyzes what makes an experience valuable (medium difficulty, low entropy) and then designs a system to specifically prioritize and leverage these high-value experiences. This targeted approach is what leads to improved efficiency and stability.
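For concreteness, the two quantities that recur throughout the paper can be written out explicitly. These are the standard formulations; the paper may differ in minor details such as normalization constants. With $K$ rollouts $o_1, \dots, o_K$ for a question $q$ and verifiable rewards $r_1, \dots, r_K$, the GRPO group-relative advantage is

$$\widehat{A}(o_i, \mathcal{G}_q) = \frac{r_i - \operatorname{mean}(r_1, \dots, r_K)}{\operatorname{std}(r_1, \dots, r_K)},$$

and a common definition of trajectory entropy is the average per-token entropy of the policy's next-token distribution along the trajectory $o$,

$$H(o; \pi_\theta) = \frac{1}{|o|} \sum_{t=1}^{|o|} \Big( -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid q, o_{<t}) \log \pi_\theta(v \mid q, o_{<t}) \Big),$$

where $\mathcal{V}$ is the vocabulary. Lower $H(o; \pi_\theta)$ means the model was more confident while generating $o$.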

4. Methodology (Core Technology & Implementation)

The core of ExGRPO is built on insights from a preliminary study, which then inform a two-part framework: Experience Management and Policy Optimization.

3.2 Preliminary Study on Experience Data

Before designing the algorithm, the authors conducted a study to answer: What makes a reasoning experience valuable?

  • Setup: They trained a Qwen2.5-Math 7B model using a standard on-policy RLVR method. They categorized training questions into three groups based on the model's online success rate (rollout correctness):
    • Easy: 75-100% success rate.
    • Medium: 25-75% success rate.
    • Hard: 0-25% success rate.
  • Key Findings (from Figure 1):
    1. Medium-difficulty questions are most valuable: As shown in Figure 1(a), the model trained exclusively on Medium questions achieved the highest performance. Easy questions offer little new information, while hard questions are often too difficult to provide a useful learning signal. This suggests an ideal "zone of proximal development."

    2. Low entropy signals better reasoning: The authors used a powerful external LLM to judge whether the reasoning steps (CoT) were logically correct, even if the final answer was right. Figure 1(b) shows that trajectories with logically correct reasoning consistently have lower entropy than those with incorrect reasoning.

    3. Beware of "lucky hits": High-entropy trajectories, even if they lead to a correct answer, often contain flawed reasoning. Replaying these "lucky hits" can teach the model bad habits, a phenomenon the authors call a "snowball effect."

      These findings lead to two guiding principles for ExGRPO: prioritize medium-difficulty questions and select low-entropy trajectories. (A minimal code sketch of both signals appears after Figure 1 below.)

      Figure 1: Analysis of question difficulty and entropy in on-policy RLVR training: (a) Test performance of models trained on different question groups; (b) Entropy distributions of logically correct t…

      Figure 1: This figure shows the results of the preliminary analysis. (a) Training on "Medium" difficulty questions yields the best test performance. (b) Trajectories with logically correct reasoning have lower entropy (are more confident) than incorrect ones. (c) Correct trajectories from "Medium" problems are concentrated at lower entropy values.
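To make these two signals concrete, here is a minimal Python sketch of how rollout correctness and trajectory entropy could be computed online. It is illustrative only: the thresholds follow the Easy/Medium/Hard split above, and the function names are ours rather than the paper's.

```python
from statistics import mean

def difficulty_bucket(num_correct: int, num_rollouts: int) -> str:
    """Bucket a question by its online success rate (rollout correctness)."""
    acc = num_correct / num_rollouts
    if acc > 0.75:
        return "easy"    # 75-100%: largely mastered, little new signal
    if acc >= 0.25:
        return "medium"  # 25-75%: the most valuable training zone
    return "hard"        # 0-25%: usually too hard to yield a learning signal

def trajectory_entropy(token_entropies: list[float]) -> float:
    """Proxy for reasoning quality: the mean per-token entropy of the policy's
    next-token distribution along a generated chain of thought. Lower values
    indicate a more confident (and, per Figure 1b, more often logically sound)
    trajectory."""
    return mean(token_entropies)
```

In the full framework these two quantities drive experience partitioning and trajectory selection, respectively (Section 4.1).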

4.1 ExGRPO: Experience Management

This is the first phase of the ExGRPO pipeline, designed to collect, organize, and select valuable experiences.

Figure 2: This diagram provides an overview of the ExGRPO framework. (a) The Experience Management phase involves collecting successful rollouts, partitioning them by difficulty into a replay buffer, and selecting high-value questions and trajectories. (b) The Policy Optimization phase uses these selected experiences alongside new rollouts in a mixed-policy objective to update the model.

  1. Experience Collection: During training, for each question the model generates $K$ solutions (trajectories). All trajectories that result in a correct final answer are stored in a replay buffer $\mathcal{E}$. The success rate of each question, $\operatorname{Acc}(q^*)$, is also recorded.
  2. Experience Partition: The replay buffer $\mathcal{E}$ is not a flat list. It is partitioned into "buckets" based on the latest correctness rate $\operatorname{Acc}(q^*)$ of each question, which organizes experiences by difficulty level. A special Retired Set stores questions that the model consistently solves (e.g., 100% success rate), removing them from active training to keep the focus on more challenging problems.
  3. Experience Selection: When creating a training batch, ExGRPO performs a two-step selection from the buffer (a minimal code sketch follows this list):
    • (1) Question Sampling: Questions are sampled from the buckets with a bias towards the Medium difficulty group. This is achieved by setting the sampling probability $\bar{p}$ proportional to a Gaussian centered at a correctness of $0.5$: $\bar{p} \propto \mathcal{N}(\operatorname{Acc}(q^*); \mu=0.5, \bar{\sigma}=1)$.
    • (2) Trajectory Selection: Each sampled question may have several successful trajectories stored in the buffer; ExGRPO selects the single trajectory with the lowest entropy, computed under the current policy $\pi_{\theta}$: $o^* \gets \arg\min_{o_i} H(o_i; \pi_{\theta})$, where $o_i$ ranges over the stored trajectories for that question. This ensures the model replays its most confident, and likely highest-quality, past success.
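A minimal sketch of this two-step selection, assuming the buffer maps each question to its latest correctness rate and its stored successful trajectories (the names and the with-replacement sampling are our simplifications, not the paper's implementation):

```python
import math
import random

def question_weight(acc: float, mu: float = 0.5, sigma: float = 1.0) -> float:
    """Unnormalized Gaussian weight over a question's latest correctness rate,
    peaked at medium difficulty (Acc around 0.5)."""
    return math.exp(-0.5 * ((acc - mu) / sigma) ** 2)

def select_experiences(buffer: dict, batch_size: int, entropy_fn) -> list:
    """buffer: question_id -> {"acc": float, "trajectories": [...]}.
    entropy_fn: returns a trajectory's entropy under the *current* policy.
    Returns (question_id, trajectory) pairs: questions drawn with a bias toward
    medium difficulty, each paired with its lowest-entropy stored success."""
    qids = list(buffer)
    weights = [question_weight(buffer[q]["acc"]) for q in qids]
    chosen = random.choices(qids, weights=weights, k=batch_size)  # with replacement
    return [(q, min(buffer[q]["trajectories"], key=entropy_fn)) for q in chosen]
```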

4.2 ExGRPO: Experiential Policy Optimization

This is the second phase, where the selected experiences are used to update the model.

  1. Mixed-Policy Mini-batch: Each training mini-batch $\mathcal{B}$ is composed of two parts, with the ratio controlled by a hyperparameter $\rho$:

    • On-policy samples ($\mathcal{B}_{\mathrm{on}}$): A fraction $(1-\rho)$ of the batch consists of new questions, for which the model generates fresh trajectories. This maintains exploration.
    • Experiential samples ($\mathcal{B}_{\mathrm{exp}}$): A fraction $\rho$ of the batch consists of experiences selected from the replay buffer using the procedure above. This exploits past successes.
  2. Unified Optimization Objective: The overall objective combines the GRPO loss for on-policy and off-policy (experiential) samples:

$$\mathcal{I}_{\mathrm{ExGRPO}}(\theta) = (1-\rho)\,\mathbb{E}_{q \sim \mathcal{B}_{\mathrm{on}}}\!\left[\frac{1}{K}\sum_{i=1}^{K}\mathrm{CLIP}\big(w_i(\theta),\, \widehat{A}(o_i, \mathcal{G}_q)\big)\right] + \rho\,\mathbb{E}_{q^* \sim \mathcal{B}_{\mathrm{exp}}}\!\left[\frac{1}{K}\left(\mathrm{CLIP}\big(w^*(\theta),\, \widehat{A}(o^*, \mathcal{G}_{q^*})\big) + \sum_{i=1}^{K-1}\mathrm{CLIP}\big(w_i(\theta),\, \widehat{A}(o_i, \mathcal{G}_{q^*})\big)\right)\right]$$

    • Explanation:
      • The first term is the standard on-policy GRPO loss for new samples.
      • The second term is the loss for experiential samples. Here the advantage-estimation group $\mathcal{G}_{q^*}$ is a mix: it contains the one replayed trajectory $o^*$ and $K-1$ newly generated trajectories.
      • $w_i(\theta)$ and $w^*(\theta)$ are importance-sampling weights: ratios of a trajectory's probability under the current policy $\pi_{\theta}$ versus the policy that generated it ($\pi_{\theta_{\mathrm{old}}}$ for new rollouts, $\pi_{\theta_{\mathrm{past}}}$ for replayed ones). This corrects for the fact that off-policy data comes from a different policy distribution.
      • $\widehat{A}$ is the advantage calculated using the GRPO method (normalizing the reward within the group).
  3. Stability Mechanisms:

    • Policy Shaping: Directly optimizing on low-entropy replayed trajectories could make the policy too deterministic too quickly (entropy collapse), hurting exploration. To prevent this, the CLIP term for the replayed trajectory $o^*$ is replaced with a shaped function $f(w^*(\theta)) \cdot \widehat{A}(o^*, \mathcal{G}_{q^*})$, where $f(x) = \frac{x}{x+\beta}$. This non-linear function dampens the gradients from very high-probability (already well-learned) parts of the replayed trajectory, encouraging the model to learn from its more novel aspects (see the short sketch after this list).
    • Delayed Start: Experience replay is only activated after the model's performance on a training batch (Pass@1) surpasses a certain threshold (e.g., 35%). This ensures the replay buffer is initially populated with reasonably good experiences, avoiding pollution from the very early, low-capability stage of training.
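The shaping function is simple enough to write down directly. The sketch below is ours; the value of $\beta$ is an illustrative placeholder rather than the paper's setting.

```python
def shaped_weight(ratio: float, beta: float = 0.1) -> float:
    """Policy shaping f(x) = x / (x + beta), applied to the importance ratio of
    the replayed trajectory o*. Unlike the usual PPO/GRPO clip, f saturates as
    the ratio grows, so tokens the current policy already assigns high
    probability contribute smaller gradients, which counters premature entropy
    collapse while still rewarding the trajectory's less-learned parts."""
    return ratio / (ratio + beta)

# Schematic use inside the experiential term of the objective:
#   loss_replayed = -shaped_weight(w_star) * advantage_of_o_star
# The K-1 freshly generated trajectories in the same group keep the ordinary
# clipped objective.
```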

5. Experimental Setup

  • Datasets:

    • Training: A 45k subset of the OpenR1-Math dataset.
    • Evaluation (In-Distribution Math): AIME 2024/2025, AMC, MATH-500, Minerva, and OlympiadBench. These are challenging competition-level math problem datasets.
    • Evaluation (Out-of-Distribution General Reasoning): ARC-c (science questions), GPQA-Diamond (graduate-level, Google-proof questions), and MMLU-Pro (professional-level questions). These test the model's ability to generalize its reasoning skills.
  • Evaluation Metrics:

    1. Pass@k (pass rate at k attempts):
      • Conceptual Definition: This metric measures the probability that a model solves a problem correctly in at least one of $k$ independent attempts. The paper primarily reports Pass@1, i.e., the percentage of problems solved correctly on the first try, which is a direct measure of problem-solving accuracy.
      • Mathematical Formula: The unbiased estimator of Pass@k when generating $n$ samples per problem and finding $c$ correct solutions is (a short reference implementation appears at the end of this section): $$\text{Pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$
      • Symbol Explanation:
        • $n$: the total number of samples generated for a single problem.
        • $c$: the number of correct samples among the $n$ generated samples.
        • $k$: the number of attempts allowed.
        • $\binom{n}{k}$: the binomial coefficient, "n choose k".
    2. Avg@32:
      • Conceptual Definition: For small benchmarks such as AIME and AMC, the paper reports the average performance over 32 independent runs. Averaging over many runs gives a more stable and reliable estimate on benchmarks with very few questions, where single-run results can be noisy.
  • Baselines:

    • Backbone Models: Qwen2.5-Math-7B (primary), Qwen2.5-Math-1.5B, Qwen2.5-7B-Instruct, and Llama-3.1-8B (Base and Instruct). This variety tests the method's generalizability across different model families, sizes, and initializations.
    • Comparison Methods:
      • On-Policy: The standard RLVR baseline using the GRPO algorithm without experience replay.
      • Other RLVR methods (PRIME-Zero, Oat-Zero, GPG-Zero, RePO-Zero): These represent other state-of-the-art RLVR techniques.
      • SFT and SFT+RL: Supervised fine-tuning baselines.
      • LUFFY: A model already trained with off-policy data, used for continual learning experiments.
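For reference, the unbiased Pass@k estimator given under Evaluation Metrics above is usually computed with the numerically stable product form popularized by the HumanEval evaluation code (Chen et al., 2021); a short sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k from n samples with c correct:
    1 - C(n-c, k) / C(n, k), computed as a running product for stability."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(32, 8, 1))  # 0.25 -- Pass@1 reduces to plain accuracy c / n
```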

6. Results & Analysis

The paper's experiments convincingly demonstrate the effectiveness and robustness of ExGRPO.

  • Core Results:

(Manual transcription of Table 1) The following table shows the main performance comparison on in-distribution (ID) math benchmarks and out-of-distribution (OOD) general-reasoning benchmarks. The primary model is Qwen2.5-Math-7B; slashed values in the SFT+RL row are transcribed as given in the source.

| Model | AIME24 | AIME25 | AMC | MATH-500 | Minerva | Olympiad | Avg. (ID) | ARC-c | GPQA* | MMLU-Pro | Avg. (OOD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-Base | 11.5 | 4.9 | 31.3 | 43.6 | 7.4 | 15.6 | 19.0 | 18.2 | 11.1 | 16.9 | 15.4 |
| Qwen-Instruct | 12.5 | 10.2 | 48.5 | 80.4 | 32.7 | 41.0 | 37.6 | 70.3 | 24.7 | 34.1 | 43.0 |
| *Previous Zero RLVR Methods* | | | | | | | | | | | |
| PRIME-Zero | 17.0 | 12.8 | 54.0 | 81.4 | 39.0 | 40.3 | 40.7 | 73.3 | 18.2 | 32.7 | 41.4 |
| Oat-Zero | 33.4 | 11.9 | 61.2 | 78.0 | 34.6 | 43.4 | 43.7 | 70.1 | 23.7 | 41.7 | 45.2 |
| GPG-Zero | 29.8 | 12.1 | 67.8 | 80.8 | 30.9 | 44.7 | 44.4 | 70.3 | 40.4 | 50.5 | 41.6 |
| RePO-Zero | 19.8 | 10.2 | 54.0 | 76.8 | 34.2 | 40.1 | 39.2 | 73.8 | 24.2 | 42.5 | 46.8 |
| *Zero RLVR with ExGRPO* | | | | | | | | | | | |
| On-Policy | 24.9 | 15.5 | 59.2 | 84.8 | 38.2 | 49.3 | 45.3 | 82.6 | 37.4 | 49.2 | 56.4 |
| ExGRPO | 31.6 | 18.7 | 66.3 | 87.4 | 36.0 | 50.1 | 48.3 | 84.7 | 37.4 | 52.9 | 58.3 |
| *Off-policy Learning Methods* | | | | | | | | | | | |
| SFT+RL | 22.2/25.8 | 23.1 | 62.7 | 82.6/87.2 | 40.8/39.7 | 43.7/50.4 | 44.1/48.2 | 75.2/72.4 | 24.7/24.2 | 42.7/37.7 | 47.5/44.8 |
| *Continual RLVR with ExGRPO* | | | | | | | | | | | |
| LUFFY | 29.4 | 23.1 | 65.6 | 87.6 | 37.5 | 57.2 | 50.1 | 80.5 | 39.9 | 53.0 | 57.8 |
| → Continual LUFFY | 30.7 | 22.5 | 66.2 | 86.8 | 41.2 | 55.3 | 50.4 | 81.8 | 49.0 | 54.7 | 61.8 |
| → On-Policy | 24.8 | 17.8 | 67.5 | 88.4 | 38.6 | 55.3 | 48.7 | 81.9 | 47.0 | 53.3 | 60.7 |
| → ExGRPO | 32.3 | 25.7 | 65.6 | 87.6 | 40.1 | 57.0 | 51.4 | 83.6 | 42.4 | 54.5 | 60.2 |
  • Analysis of Table 1: ExGRPO consistently outperforms the On-Policy baseline. On the Qwen2.5-Math-7B model, it achieves an average score of 48.3 on in-distribution math tasks (vs. 45.3 for On-Policy, a +3.0 gain) and 58.3 on out-of-distribution tasks (vs. 56.4 for On-Policy, a +1.9 gain, though the abstract claims a larger average gain across all models). The improvements are particularly strong on very difficult benchmarks like AIME24 (+6.7 points).

  • Robustness Across Models (Figure 3): ExGRPO is not a one-trick pony. It delivers consistent gains over the On-Policy baseline across different models, including the smaller Qwen2.5-Math-1.5B and the instruction-tuned Qwen2.5-7B-Instruct. This shows the principles of experience management are broadly applicable.

    Figure 3: This bar chart compares the performance of various models with and without ExGRPO. Across different model families (Qwen, Llama) and sizes, the ExGRPO variant (darker shade) consistently outperforms the On-Policy baseline (lighter shade) on both in-distribution and out-of-distribution tasks.

  • Stabilizing Training (Figure 4): This is a critical result. For the weaker Llama-3.1-8B base model, standard On-Policy RLVR fails. The model gets stuck with low rewards, its policy entropy explodes (it starts generating random gibberish), and training collapses. In contrast, ExGRPO provides a stable learning signal by replaying early "lucky hits," allowing the model to gradually improve its reward and maintain stable entropy, successfully training where the baseline failed.

    Figure 4: This line chart shows the training dynamics for Llama-3.1 8B. The On-Policy method (blue line) shows collapsing reward and exploding entropy. ExGRPO (red line) shows a steadily increasing reward and stable entropy, demonstrating its ability to rescue training.

  • Ablation Studies (Figure 7):

    • This study dissects ExGRPO to verify that its components are all contributing. Removing Question Selection or Trajectory Selection and replacing them with random sampling hurts performance, confirming that the specific heuristics (prioritizing medium difficulty and low entropy) are effective.

    • Removing Policy Shaping also degrades performance, showing its importance in balancing exploitation and exploration: the paper's entropy-dynamics plot (Figure 8) shows that without policy shaping, policy entropy drops dramatically early in training and the variant ends up worse than the on-policy baseline. Together, the ablations validate that the performance gains come from the principled design of ExGRPO, not from any single trick.

      Figure 7: This chart shows an ablation study. The full ExGRPO model (blue line) achieves the best validation performance. Removing key components like Question Selection (w/o Q. Sel.) or Trajectory Selection (w/o T. Sel.) leads to worse performance, demonstrating their importance.

  • Efficiency of Experience Utilization (Figure 6): The paper shows that more replay is not always better. An excessively high replay ratio ($\rho=75\%$) harms performance by stifling exploration. The best results are achieved with a balance ($\rho=50\%$), reinforcing the idea that how experience is managed is more important than the sheer volume of replay.

    Figure 6: This figure illustrates how different replay strategies affect the experience buffer. A high replay ratio ($\rho=75\%$) leads to a smaller buffer and worse performance, showing that balancing exploration and exploitation is key.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully demonstrates that the sample inefficiency and instability of RLVR for large reasoning models can be significantly mitigated through principled experience management. It provides a clear, data-driven answer to "what makes a reasoning experience valuable," identifying medium problem difficulty and low solution entropy as key indicators. The proposed ExGRPO framework operationalizes these insights, leading to consistent performance improvements and enhanced training stability across a range of models and benchmarks. The work highlights that a thoughtful approach to experience replay is a critical component for scaling up RL for LLM reasoning.

  • Limitations & Future Work:

    • The authors explicitly state they plan to extend their work to multi-modal reasoning (e.g., problems involving both text and images) and agentic reinforcement learning (where LLMs act as agents in more complex, interactive environments).
    • The current method relies on a verifiable, binary reward. Its applicability to tasks with more nuanced, non-verifiable rewards (e.g., creative writing) remains an open question.
  • Personal Insights & Critique:

    • Practicality and Elegance: The core insight of using online-computable metrics like rollout correctness and entropy as proxies for "experience value" is both elegant and highly practical. These metrics do not require external models or expensive offline analysis, making the framework efficient to implement.
    • Generalizability: The framework's principles—identifying a "zone of proximal development" for training examples and preferring confident (low-entropy) solutions—feel fundamental and are likely transferable to other domains beyond mathematical reasoning, such as code generation or complex instruction following.
    • The "Snowball Effect": The identification of the risk of replaying "lucky hits" with bad reasoning is a subtle but crucial insight. It cautions against naive experience replay and strongly motivates the need for a quality filter, for which entropy serves as a clever proxy.
    • Open Questions: While entropy is a good proxy, is it the best one? Could other metrics, perhaps related to the complexity or diversity of the reasoning steps, provide an even better signal for trajectory quality? Furthermore, the optimal balance between exploration and exploitation (the $\rho$ parameter) was found empirically; future work could explore adapting this ratio dynamically based on the model's learning state.
