- Title: ExGRPO: Learning to Reason from Experience
- Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng.
- Affiliations: The authors are from the University of Macau, Shanghai AI Laboratory, Nanjing University, and The Chinese University of Hong Kong. This collaboration brings together expertise from both academic institutions and a major AI research lab.
- Journal/Conference: This paper is an arXiv preprint. Preprints are non-peer-reviewed manuscripts that researchers share to disseminate findings quickly. While not formally published, they often represent cutting-edge research.
- Publication Year: 2025 (the arXiv identifier 2510.02245 corresponds to an October 2025 submission).
- Abstract: The paper addresses the inefficiency of standard on-policy reinforcement learning for improving the reasoning abilities of large language models (LLMs). This inefficiency stems from discarding valuable training experiences after a single use. The authors first investigate what makes a reasoning experience valuable, identifying rollout correctness and trajectory entropy as key indicators. Based on this, they propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that manages and prioritizes high-value experiences from a replay buffer. ExGRPO uses a mixed-policy objective to balance learning from new (on-policy) and past (off-policy) experiences. Experiments on models from 1.5B to 8B parameters show that ExGRPO significantly improves reasoning performance over on-policy methods and stabilizes training, especially for weaker models where standard methods fail.
- Original Source Link:
https://arxiv.org/abs/2510.02245v1
2. Executive Summary
In brief, ExGRPO treats RLVR training as an experience-management problem: successful rollouts are stored in a replay buffer, organized by question difficulty, and selectively replayed through a mixed-policy GRPO objective instead of being discarded after a single update. To understand this paper, a beginner should be familiar with the following concepts, each of which is explained where it first appears below: reinforcement learning with verifiable rewards (RLVR), Group Relative Policy Optimization (GRPO), on-policy versus off-policy learning, experience replay, and the entropy of a model's output distribution.
4. Methodology (Core Technology & Implementation)
The core of ExGRPO is built on insights from a preliminary study; these insights inform a two-part framework: Experience Management and Policy Optimization.
4.1 Preliminary Study on Experience Data
Before designing the algorithm, the authors conducted a study to answer: What makes a reasoning experience valuable?
- Setup: They trained a Qwen2.5-Math-7B model using a standard on-policy RLVR method. They categorized training questions into three groups based on the model's online success rate (rollout correctness):
- Easy: 75-100% success rate.
- Medium: 25-75% success rate.
- Hard: 0-25% success rate.
- Key Findings (from Figure 1):
- Medium-difficulty questions are most valuable: As shown in Figure 1(a), the model trained exclusively on Medium questions achieved the highest performance. Easy questions offer little new information, while Hard questions are often too difficult to provide a useful learning signal. This suggests an ideal "zone of proximal development."
- Low entropy signals better reasoning: The authors used a powerful external LLM to judge whether the reasoning steps (chain of thought, CoT) were logically sound, even for trajectories whose final answer was correct. Figure 1(b) shows that trajectories with logically correct reasoning consistently have lower entropy than those with flawed reasoning.
- Beware of "lucky hits": High-entropy trajectories, even when they lead to a correct answer, often contain flawed reasoning. Replaying these "lucky hits" can teach the model bad habits, a phenomenon the authors call a "snowball effect."
These findings lead to two guiding principles for ExGRPO: prioritize medium-difficulty questions and select low-entropy trajectories (a minimal sketch of computing both signals online follows Figure 1).

Figure 1: This figure shows the results of the preliminary analysis. (a) Training on "Medium" difficulty questions yields the best test performance. (b) Trajectories with logically correct reasoning have lower entropy (are more confident) than incorrect ones. (c) Correct trajectories from "Medium" problems are concentrated at lower entropy values.
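Both signals from this study, rollout correctness and trajectory entropy, are cheap to compute online from the sampled rollouts. Below is a minimal illustrative sketch, not the authors' code: the 25%/75% bucket thresholds come from the setup above, while the helper names and the use of mean negative token log-probability as an entropy proxy are assumptions.

```python
import numpy as np

def rollout_correctness(rewards):
    """Fraction of a question's K rollouts that received a verifiable reward of 1."""
    return float(np.mean(rewards))

def difficulty_bucket(acc):
    """Assign a difficulty label from the online success rate, using the paper's thresholds."""
    if acc >= 0.75:
        return "easy"    # 75-100% of rollouts correct: little new information
    if acc >= 0.25:
        return "medium"  # 25-75% correct: the most valuable training zone
    return "hard"        # 0-25% correct: usually too hard to give a useful signal

def trajectory_entropy(token_logprobs):
    """Entropy proxy for one trajectory: mean negative log-probability of its tokens
    under the current policy (lower = more confident reasoning)."""
    return float(-np.mean(token_logprobs))

# Example: one question with K = 8 rollouts scored by the verifier.
rewards = [1, 0, 1, 1, 0, 0, 1, 0]
acc = rollout_correctness(rewards)                      # 0.5
print(difficulty_bucket(acc))                           # "medium"
print(trajectory_entropy([-0.2, -0.05, -0.4, -0.1]))    # ~0.19
```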
4.2 ExGRPO: Experience Management
This is the first phase of the ExGRPO pipeline, designed to collect, organize, and select valuable experiences.

Figure 2: This diagram provides an overview of the ExGRPO framework. (a) The Experience Management phase involves collecting successful rollouts, partitioning them by difficulty into a replay buffer, and selecting high-value questions and trajectories. (b) The Policy Optimization phase uses these selected experiences alongside new rollouts in a mixed-policy objective to update the model.
- Experience Collection: During training, for each question, the model generates K solutions (trajectories). All trajectories that result in a correct final answer are stored in a replay buffer, E. The success rate for each question, Acc(q∗), is also recorded.
- Experience Partition: The replay buffer E is not a flat list; it is partitioned into "buckets" based on the latest correctness rate Acc(q∗) of each question, organizing experiences by difficulty level. A special Retired Set stores questions that the model consistently solves (e.g., 100% success rate), removing them from active training so that effort is focused on more challenging problems.
- Experience Selection: When creating a training batch, ExGRPO performs a two-step selection from the buffer (see the sketch after this list):
- (1) Question Sampling: It samples questions from the buckets with a bias towards the Medium difficulty group. This is achieved by setting the sampling probability to be proportional to a Gaussian density centered at a correctness of 0.5: $\bar{p}(q^{*}) \propto \mathcal{N}\big(\mathrm{Acc}(q^{*});\ \mu = 0.5,\ \sigma = 1\big)$.
- (2) Trajectory Selection: For each sampled question, which may have multiple successful trajectories stored in the buffer, ExGRPO selects the single trajectory with the lowest entropy, computed under the current policy πθ. This ensures the model replays its most confident, and likely highest-quality, past success: $o^{*} \leftarrow \arg\min_{o_i \in \mathcal{E}(q^{*})} \mathcal{H}(o_i;\ \pi_\theta)$, where $\mathcal{E}(q^{*})$ denotes the successful trajectories stored for question $q^{*}$.
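A compact sketch of the buffer bookkeeping and two-step selection described above, under simplifying assumptions (0.1-wide correctness buckets, sampling questions with replacement, and illustrative class and function names rather than the authors' implementation):

```python
import math
import random
from collections import defaultdict

class ReplayBuffer:
    """Stores successful trajectories, partitioned into buckets by rollout correctness."""

    def __init__(self, retire_threshold=1.0):
        self.buckets = defaultdict(dict)   # acc bucket -> {question_id: [trajectories]}
        self.acc = {}                      # question_id -> latest correctness rate
        self.retired = set()               # questions the model consistently solves
        self.retire_threshold = retire_threshold

    def add(self, qid, acc, correct_trajectories):
        """Record the latest correctness of a question and its successful rollouts."""
        if acc >= self.retire_threshold:
            self.retired.add(qid)          # fully solved: remove from active training
            return
        self.acc[qid] = acc
        bucket = round(acc, 1)             # assumed granularity: 0.1-wide buckets
        self.buckets[bucket][qid] = correct_trajectories

    def sample_questions(self, n):
        """Question sampling biased towards medium difficulty via a Gaussian on correctness."""
        qids = list(self.acc)
        # Unnormalized N(Acc; mu=0.5, sigma=1); sampled with replacement for simplicity.
        weights = [math.exp(-0.5 * (self.acc[q] - 0.5) ** 2) for q in qids]
        return random.choices(qids, weights=weights, k=min(n, len(qids)))

    def select_trajectory(self, qid, entropy_fn):
        """Trajectory selection: replay the lowest-entropy stored success for this question."""
        bucket = round(self.acc[qid], 1)
        return min(self.buckets[bucket][qid], key=entropy_fn)
```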
4.3 ExGRPO: Experiential Policy Optimization
This is the second phase, where the selected experiences are used to update the model.
- Mixed-Policy Mini-batch: Each training mini-batch B is composed of two parts, with the ratio controlled by a hyperparameter ρ:
- On-policy samples (Bon): A fraction (1−ρ) of the batch consists of new questions, for which the model generates fresh trajectories. This maintains exploration.
- Experiential samples (Bexp): A fraction ρ of the batch consists of experiences selected from the replay buffer using the procedure above. This exploits past successes.
- Unified Optimization Objective: The overall objective combines the GRPO loss for both on-policy and off-policy (experiential) samples (a code sketch follows this list):
$$\mathcal{I}_{\text{ExGRPO}}(\theta) = (1-\rho)\,\mathbb{E}_{q \sim \mathcal{B}_{\text{on}}}\!\left[\frac{1}{K}\sum_{i=1}^{K} \mathrm{CLIP}\big(w_i(\theta),\, A(o_i, G_q)\big)\right] + \rho\,\mathbb{E}_{q^{*} \sim \mathcal{B}_{\text{exp}}}\!\left[\frac{1}{K}\bigg(\mathrm{CLIP}\big(w^{*}(\theta),\, A(o^{*}, G_{q^{*}})\big) + \sum_{i=1}^{K-1} \mathrm{CLIP}\big(w_i(\theta),\, A(o_i, G_{q^{*}})\big)\bigg)\right]$$
- Explanation:
- The first term is the standard on-policy GRPO loss for new samples.
- The second term is the loss for experiential samples. Here, the advantage estimation group Gq∗ is a mix: it contains the one replayed trajectory o∗ and K−1 newly generated trajectories.
- wi(θ) and w∗(θ) are importance sampling weights. They are ratios of probabilities of a trajectory under the current policy πθ versus the policy that generated it (πθold for new rollouts, πθpast for replayed ones). This corrects for the fact that off-policy data comes from a different policy distribution.
- A is the advantage calculated using the GRPO method (normalizing the reward within the group).
- Stability Mechanisms:
- Policy Shaping: Directly optimizing on low-entropy replayed trajectories could cause the policy to become too deterministic too quickly (entropy collapse), hurting exploration. To prevent this, the CLIP term for the replayed trajectory o∗ is replaced with a shaped function $f(w^{*}(\theta)) \cdot A(o^{*}, G_{q^{*}})$, where $f(x) = \frac{x}{x+\beta}$. This non-linear function dampens the gradients from very high-probability (already well-learned) parts of the replayed trajectory, encouraging the model to learn from its more novel aspects (see the short derivation after this list).
- Delayed Start: Experience replay is only activated after the model's Pass@1 on a training batch surpasses a certain threshold (e.g., 35%). This ensures the replay buffer is initially populated with reasonably good experiences, avoiding pollution from the very early, low-capability stage of training.
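Taking the shaping function in the form given above, a one-line derivative check shows why it damps already-learned behavior:

$$f(x) = \frac{x}{x+\beta} \qquad\Longrightarrow\qquad f'(x) = \frac{\beta}{(x+\beta)^{2}},$$

which decreases monotonically in x. Parts of the replayed trajectory that the current policy already assigns high probability (large importance ratio) therefore receive a smaller gradient weight, while lower-probability, not-yet-mastered parts keep a comparatively larger one.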
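To make the unified objective concrete, here is an illustrative NumPy sketch of one ExGRPO-style loss computation. It simplifies the paper's formulation: importance ratios are treated at the sequence level, the clipping threshold and β are placeholder values, and the function names are assumptions rather than the authors' API. In a real training loop the experiential term would only be enabled after the delayed-start threshold above is reached.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: normalize rewards within a group of K rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

def clip_term(w, adv, eps=0.2):
    """Clipped surrogate for one trajectory with importance ratio w and advantage adv."""
    return min(w * adv, float(np.clip(w, 1 - eps, 1 + eps)) * adv)

def shaped_term(w, adv, beta=0.1):
    """Policy-shaped term used for the replayed trajectory: f(w) = w / (w + beta)."""
    return (w / (w + beta)) * adv

def exgrpo_objective(on_groups, exp_groups, rho=0.5):
    """on_groups:  per fresh question, a list of K (importance_ratio, reward) pairs.
    exp_groups: per replayed question, ((ratio, reward) of the replayed trajectory,
                list of K-1 (ratio, reward) pairs for newly generated rollouts)."""
    on_term = 0.0
    for group in on_groups:
        adv = group_advantages([r for _, r in group])
        on_term += np.mean([clip_term(w, a) for (w, _), a in zip(group, adv)])

    exp_term = 0.0
    for (w_star, r_star), fresh in exp_groups:
        adv = group_advantages([r_star] + [r for _, r in fresh])  # mixed group baseline
        terms = [shaped_term(w_star, adv[0])]                     # shaped replayed rollout
        terms += [clip_term(w, a) for (w, _), a in zip(fresh, adv[1:])]
        exp_term += np.mean(terms)

    on_term /= max(len(on_groups), 1)
    exp_term /= max(len(exp_groups), 1)
    return (1 - rho) * on_term + rho * exp_term   # maximized during training
```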
5. Experimental Setup
- Datasets:
- Training: A 45k-question subset of the OpenR1-Math dataset.
- Evaluation (In-Distribution Math): AIME 2024/2025, AMC, MATH-500, Minerva, and OlympiadBench. These are challenging competition-level math problem datasets.
- Evaluation (Out-of-Distribution General Reasoning): ARC-c (science questions), GPQA-Diamond (graduate-level, Google-proof questions), and MMLU-Pro (professional-level questions). These test the model's ability to generalize its reasoning skills.
- Evaluation Metrics:
- Pass@k (Pass rate at k):
- Conceptual Definition: This metric measures the probability that a model solves a problem correctly in at least one of k independent attempts. This paper primarily uses Pass@1, the percentage of problems solved correctly on the first try, which is a direct measure of problem-solving accuracy.
- Mathematical Formula: The unbiased estimator for Pass@k when generating n samples per problem and finding c correct solutions is (a short implementation appears after this metrics list):
$$\text{Pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$
- Symbol Explanation:
- n: The total number of samples generated for a single problem.
- c: The number of correct samples among the n generated samples.
- k: The number of attempts allowed.
- $\binom{n}{k}$: The binomial coefficient, "n choose k".
- Avg@32:
- Conceptual Definition: For small benchmarks such as AIME and AMC, performance is reported as the average over 32 independent runs. Averaging over many runs gives a more stable and reliable estimate on benchmarks with very few questions, where single-run results can be noisy.
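A minimal helper for the unbiased Pass@k estimator above, together with an Avg@32-style average over repeated runs. This is a generic sketch; the run accuracies below are made-up illustrative numbers, not results from the paper.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k draws
    (without replacement) from n samples, c of which are correct, is correct."""
    if n - c < k:
        return 1.0                      # every size-k subset contains a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=32, c=8, k=1))        # 0.25: plain first-try accuracy
print(pass_at_k(n=32, c=8, k=4))        # probability that 4 attempts include a correct one

# Avg@32-style reporting on a tiny benchmark: mean accuracy over independent runs.
runs = [0.25, 0.31, 0.28, 0.25]         # illustrative per-run accuracies only
print(sum(runs) / len(runs))
```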
- Baselines:
- Backbone Models: Qwen2.5-Math-7B (primary), Qwen2.5-Math-1.5B, Qwen2.5-7B-Instruct, and Llama-3.1-8B (Base and Instruct). This variety tests the method's generalizability across different model families, sizes, and initializations.
- Comparison Methods:
- On-Policy: The standard RLVR baseline using the GRPO algorithm without experience replay.
- Other RLVR methods (PRIME-Zero, Oat-Zero, GPG-Zero, RePO-Zero): These represent other state-of-the-art RLVR techniques.
- SFT and SFT+RL: Supervised fine-tuning baselines.
- LUFFY: A model already trained with off-policy data, used for the continual-learning experiments.
6. Results & Analysis
The paper's experiments convincingly demonstrate the effectiveness and robustness of ExGRPO.
(Manual transcription of Table 1)
The following table shows the main performance comparison. Columns AIME24 through ID Avg. are in-distribution math benchmarks; columns ARC-c through OOD Avg. are out-of-distribution general-reasoning benchmarks. The primary model is Qwen2.5-Math-7B.

| Model | AIME24 | AIME25 | AMC | MATH-500 | Minerva | Olympiad | ID Avg. | ARC-c | GPQA* | MMLU-Pro | OOD Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-Base | 11.5 | 4.9 | 31.3 | 43.6 | 7.4 | 15.6 | 19.0 | 18.2 | 11.1 | 16.9 | 15.4 |
| Qwen-Instruct | 12.5 | 10.2 | 48.5 | 80.4 | 32.7 | 41.0 | 37.6 | 70.3 | 24.7 | 34.1 | 43.0 |
| **Previous Zero RLVR Methods** | | | | | | | | | | | |
| PRIME-Zero | 17.0 | 12.8 | 54.0 | 81.4 | 39.0 | 40.3 | 40.7 | 73.3 | 18.2 | 32.7 | 41.4 |
| Oat-Zero | 33.4 | 11.9 | 61.2 | 78.0 | 34.6 | 43.4 | 43.7 | 70.1 | 23.7 | 41.7 | 45.2 |
| GPG-Zero | 29.8 | 12.1 | 67.8 | 80.8 | 30.9 | 44.7 | 44.4 | 70.3 | 40.4 | 50.5 | 41.6 |
| RePO-Zero | 19.8 | 10.2 | 54.0 | 76.8 | 34.2 | 40.1 | 39.2 | 73.8 | 24.2 | 42.5 | 46.8 |
| **Zero RLVR with ExGRPO** | | | | | | | | | | | |
| On-Policy | 24.9 | 15.5 | 59.2 | 84.8 | 38.2 | 49.3 | 45.3 | 82.6 | 37.4 | 49.2 | 56.4 |
| ExGRPO | 31.6 | 18.7 | 66.3 | 87.4 | 36.0 | 50.1 | 48.3 | 84.7 | 37.4 | 52.9 | 58.3 |
| **Off-Policy Learning Methods** | | | | | | | | | | | |
| SFT+RL | 22.2/25.8 | 23.1 | 62.7 | 82.6/87.2 | 40.8/39.7 | 43.7/50.4 | 44.1/48.2 | 75.2/72.4 | 24.7/24.2 | 42.7/37.7 | 47.5/44.8 |
| **Continual RLVR with ExGRPO** | | | | | | | | | | | |
| LUFFY | 29.4 | 23.1 | 65.6 | 87.6 | 37.5 | 57.2 | 50.1 | 80.5 | 39.9 | 53.0 | 57.8 |
| → Continual LUFFY | 30.7 | 22.5 | 66.2 | 86.8 | 41.2 | 55.3 | 50.4 | 81.8 | 49.0 | 54.7 | 61.8 |
| → On-Policy | 24.8 | 17.8 | 67.5 | 88.4 | 38.6 | 55.3 | 48.7 | 81.9 | 47.0 | 53.3 | 60.7 |
| → ExGRPO | 32.3 | 25.7 | 65.6 | 87.6 | 40.1 | 57.0 | 51.4 | 83.6 | 42.4 | 54.5 | 60.2 |
- Analysis of Table 1: ExGRPO consistently outperforms the On-Policy baseline. On the Qwen2.5-Math-7B model, it achieves an average score of 48.3 on in-distribution math tasks (vs. 45.3 for On-Policy, a +3.0 gain) and 58.3 on out-of-distribution tasks (vs. 56.4 for On-Policy, a +1.9 gain; the abstract reports larger average gains when aggregating across all backbone models). The improvements are particularly strong on very difficult benchmarks such as AIME24 (+6.7 points).
- Robustness Across Models (Figure 3): ExGRPO is not a one-trick pony. It delivers consistent gains over the On-Policy baseline across different models, including the smaller Qwen2.5-Math-1.5B and the instruction-tuned Qwen2.5-7B-Instruct, showing that the principles of experience management are broadly applicable.

Figure 3: This bar chart compares the performance of various models with and without ExGRPO. Across different model families (Qwen, Llama) and sizes, the ExGRPO variant (darker shade) consistently outperforms the On-Policy baseline (lighter shade) on both in-distribution and out-of-distribution tasks.
- Stabilizing Training (Figure 4): This is a critical result. For the weaker Llama-3.1-8B base model, standard On-Policy RLVR fails: the model gets stuck with low rewards, its policy entropy explodes (it starts generating random gibberish), and training collapses. In contrast, ExGRPO provides a stable learning signal by replaying early "lucky hits," allowing the model to gradually improve its reward and maintain stable entropy, successfully training where the baseline failed.

Figure 4: This line chart shows the training dynamics for Llama-3.1-8B. The On-Policy method (blue line) shows collapsing reward and exploding entropy, while ExGRPO (red line) shows a steadily increasing reward and stable entropy, demonstrating its ability to rescue training.
- Ablation Studies (Figure 7): This study dissects ExGRPO to verify that all of its components contribute. Replacing Question Selection or Trajectory Selection with random sampling hurts performance, confirming that the specific heuristics (prioritizing medium difficulty and low entropy) are effective. Removing Policy Shaping also degrades performance, showing its importance in balancing exploitation and exploration. The results validate that the gains come from the principled design of ExGRPO rather than any single trick.

Figure 7: This chart shows an ablation study. The full ExGRPO model (blue line) achieves the best validation performance; removing key components such as Question Selection (w/o Q. Sel.) or Trajectory Selection (w/o T. Sel.) leads to worse performance, demonstrating their importance.
- Efficiency of Experience Utilization (Figure 6): More replay is not always better. An excessively high replay ratio (ρ = 75%) harms performance by stifling exploration; the best results are achieved with a balanced ratio (ρ = 50%), reinforcing the idea that how experience is managed matters more than the sheer volume of replay.

Figure 6: This figure illustrates how different replay strategies affect the experience buffer. A high replay ratio (ρ = 75%) leads to a smaller buffer and worse performance, showing that balancing exploration and exploitation is key.
7. Conclusion & Reflections
- Conclusion Summary: The paper demonstrates that the sample inefficiency and instability of RLVR for large reasoning models can be significantly mitigated through principled experience management. It provides a clear, data-driven answer to "what makes a reasoning experience valuable," identifying medium problem difficulty and low solution entropy as key indicators. The proposed ExGRPO framework operationalizes these insights, leading to consistent performance improvements and enhanced training stability across a range of models and benchmarks. The work highlights that a thoughtful approach to experience replay is a critical component for scaling up RL for LLM reasoning.
- Limitations & Future Work:
- The authors explicitly state they plan to extend their work to multi-modal reasoning (e.g., problems involving both text and images) and agentic reinforcement learning (where LLMs act as agents in more complex, interactive environments).
- The current method relies on a verifiable, binary reward. Its applicability to tasks with more nuanced, non-verifiable rewards (e.g., creative writing) remains an open question.
- Personal Insights & Critique:
- Practicality and Elegance: The core insight of using online-computable metrics like rollout correctness and entropy as proxies for "experience value" is both elegant and highly practical. These metrics do not require external models or expensive offline analysis, making the framework efficient to implement.
- Generalizability: The framework's principles—identifying a "zone of proximal development" for training examples and preferring confident (low-entropy) solutions—feel fundamental and are likely transferable to other domains beyond mathematical reasoning, such as code generation or complex instruction following.
- The "Snowball Effect": The identification of the risk of replaying "lucky hits" with bad reasoning is a subtle but crucial insight. It cautions against naive experience replay and strongly motivates the need for a quality filter, for which entropy serves as a clever proxy.
- Open Questions: While entropy is a good proxy, is it the best one? Could other metrics, perhaps related to the complexity or diversity of the reasoning steps, provide an even better signal for trajectory quality? Furthermore, the optimal balance between exploration and exploitation (the ρ parameter) was found empirically; future work could explore adapting this ratio dynamically based on the model's learning state.