Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
TL;DR Summary
This study investigates the effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) in enhancing reasoning capabilities of large language models (LLMs). It finds that current setups fail to elicit new reasoning patterns, with base models performing better at larger sampling budgets (pass@k at large k).
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
1.2. Authors
Yang Yue*, Zhiqi Chen*, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. *Equal Contribution, †Project Lead, Corresponding Author. Affiliations: LeapLab, Tsinghua University; Shanghai Jiao Tong University. The authors are primarily affiliated with Tsinghua University, a prominent research institution known for its strong programs in computer science and artificial intelligence. Gao Huang is a well-known researcher in the deep learning community, particularly for his work on DenseNet.
1.3. Journal/Conference
This paper is published on arXiv, a preprint server. While not a peer-reviewed journal or conference in its current form, arXiv is a widely used platform for disseminating cutting-edge research quickly within the AI and machine learning communities. Papers on arXiv often undergo subsequent peer review and publication in prestigious conferences or journals.
1.4. Publication Year
2025
1.5. Abstract
The paper investigates the effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) in enhancing the reasoning capabilities of large language models (LLMs). While RLVR has shown success in math and programming tasks and is believed to enable LLMs to acquire novel reasoning abilities, this study critically examines this premise. Using pass@k at large k values as an evaluation metric across various model families, RL algorithms, and benchmarks (math, coding, visual reasoning), the authors find that current RLVR training does not elicit fundamentally new reasoning patterns. RLVR-trained models outperform base models at small k (e.g., k = 1), but base models achieve higher pass@k scores at large k. Coverage and perplexity analyses suggest that the observed reasoning abilities originate from and are bounded by the base model. Six popular RLVR algorithms perform similarly and suboptimally in leveraging the base model's potential. In contrast, distillation is found to genuinely expand reasoning capabilities by introducing new patterns from a teacher model. The findings suggest that current RLVR methods have not yet realized RL's potential for novel reasoning and highlight the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction.
1.6. Original Source Link
https://arxiv.org/abs/2504.13837 Publication Status: Preprint.
1.7. PDF Link
https://arxiv.org/pdf/2504.13837v5.pdf
2. Executive Summary
2.1. Background & Motivation
The development of reasoning-centric Large Language Models (LLMs) has seen significant advancements, particularly in complex logical tasks like mathematics and programming. Reinforcement Learning with Verifiable Rewards (RLVR) is identified as a key driver behind this progress. RLVR optimizes LLMs using automatically computable rewards (e.g., matching ground-truth solutions or passing unit tests), enabling scalable training without extensive human labeling. Inspired by the success of traditional RL in game playing, where agents discover novel strategies through self-improvement, it is widely believed that RLVR similarly allows LLMs to develop new reasoning patterns beyond their base models, potentially leading to continuously self-evolving LLMs.
However, despite its empirical success, the fundamental effectiveness of current RLVR is largely underexamined. The core problem the paper aims to solve is to critically assess whether current RLVR genuinely enables LLMs to acquire novel reasoning abilities, or if it merely optimizes the utilization of reasoning patterns already present in the base model. This question is crucial because the promise of RL for LLMs lies in its potential to discover truly new, emergent capabilities, rather than just refining existing ones. The gap in prior research is a systematic, rigorous evaluation of the reasoning capability boundaries of RLVR-trained models compared to their base models, especially under conditions that probe the full potential of a model (e.g., large sampling budgets).
The paper's entry point is to use the pass@k metric at large k values, which provides a more robust measure of a model's potential reasoning capacity rather than just average-case performance. This allows for a deeper probe into whether RLVR truly expands the problem-solving boundary of LLMs.
2.2. Main Contributions / Findings
The paper makes several primary contributions and presents key findings that challenge conventional assumptions about RLVR:
- RLVR Narrows Reasoning Coverage: While RLVR models outperform base models at small k (e.g., pass@1, which reflects sampling efficiency), base models consistently surpass RLVR models as k increases across all benchmarks and LLM families. This suggests that current RLVR training often narrows, rather than expands, the scope of solvable problems.
- Reasoning Paths are Bounded by the Base Model: Further analysis (accuracy distribution, perplexity) reveals that the reasoning paths generated by current RLVR models largely exist within the sampling distribution of their base models. RLVR improves performance by more efficiently sampling correct responses for problems already solvable by the base model, but it does not enable the model to solve new problems. This indicates that RLVR does not introduce fundamentally new reasoning capabilities.
- Similar Performance Across RLVR Algorithms: Six popular RLVR algorithms (PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO) perform similarly in terms of the Sampling Efficiency Gap (ΔSE), and all remain far from optimally leveraging the potential of the base model. This suggests that the core limitation is not algorithm-specific but perhaps inherent to the current RLVR paradigm.
- Distillation Expands Reasoning Boundaries: In contrast to RLVR, distillation can genuinely expand a model's reasoning capabilities by transferring new reasoning patterns from a stronger teacher model. Distilled models demonstrate an expanded reasoning scope beyond that of the base model, indicating a different mechanism of capability acquisition.
- Need for Improved RL Paradigms: The findings highlight a significant gap between existing RLVR methods and the ideal goals of RL (discovering genuinely new strategies). The paper suggests the need for improved RL paradigms, such as effective exploration mechanisms, more deliberate and large-scale data curation, fine-grained process signals, and multi-turn agent interaction, to unlock RL's full potential for novel reasoning in LLMs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Large Language Models (LLMs): These are neural networks, often based on the transformer architecture, trained on vast amounts of text data to generate human-like text, translate languages, write different kinds of creative content, and answer questions in an informative way. They learn to predict the next word in a sequence, thereby developing a deep understanding of language structure, facts, and reasoning patterns.
- Reinforcement Learning (RL): A subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent is not explicitly told what actions to take but discovers which actions yield the most reward by trial and error.
- Agent: The decision-making entity (e.g., an LLM).
- Environment: The system with which the agent interacts (e.g., the problem-solving task).
- Action: A decision or output generated by the agent (e.g., generating a token, a reasoning step, or a final answer).
- Reward: A scalar feedback signal from the environment indicating the desirability of an action (e.g., 1 for a correct answer, 0 for an incorrect one).
- Policy: The agent's strategy for choosing actions based on its current state.
- Reinforcement Learning with Verifiable Rewards (RLVR): A specific application of RL to LLMs where the reward signal is automatically computed based on verifiable outcomes, such as a mathematical solution being correct or code passing unit tests. This contrasts with Reinforcement Learning from Human Feedback (RLHF), where human annotators provide preference-based rewards. The key advantage of RLVR is its scalability, as it does not require human labeling.
- Chain-of-Thought (CoT) Reasoning: A prompting technique for LLMs where the model is explicitly asked to show its step-by-step reasoning process before arriving at a final answer. This often improves performance on complex reasoning tasks by allowing the model to break down problems into smaller, more manageable steps.
- Pass@k Metric: An evaluation metric typically used for code generation and reasoning tasks. It measures the probability that at least one out of k independent samples generated by a model is correct. A higher k value allows the model more "attempts" to solve a problem, potentially revealing its full reasoning capacity, as opposed to just pass@1, which measures the accuracy of the first (or greedy) attempt.
- Perplexity (PPL): A measure of how well a probability distribution or language model predicts a sample. It is calculated as the exponentiated average negative log-likelihood of a sequence. Lower perplexity indicates that the model assigns a higher probability to the observed sequence, meaning it predicts the sequence more confidently and accurately. In this paper, it is used to assess how likely the base model is to generate the responses produced by the RL-trained model.
- Distillation: A technique where a smaller "student" model is trained to mimic the behavior of a larger, more powerful "teacher" model. This is typically done by training the student model on outputs (e.g., logits, soft targets, or reasoning traces) generated by the teacher model. Distillation can transfer knowledge and capabilities from the teacher to the student, often resulting in a smaller, more efficient model with comparable performance.
3.2. Previous Works
The paper primarily discusses works related to RLVR and its application to LLMs, particularly for mathematical and programming reasoning tasks. It references several key developments:
-
Early Reasoning-Centric LLMs:
- OpenAI-o1 (Jaech et al., 2024): This work is cited as an encouraging landmark, being among the first large-scale applications of RL for reasoning, achieving state-of-the-art results.
- DeepSeek-R1 (Guo et al., 2025): The first open-weight model to match or surpass the performance of OpenAI-o1. It introduced the "zero" setting, applying RL directly to a base LLM without intermediate supervised fine-tuning (SFT).
- Kimi-1.5 (Team et al., 2025): Another reasoning-centric LLM mentioned for its advancements.
-
Traditional RL and Self-Improvement:
- DQN (Mnih et al., 2015): Deep Q-Networks demonstrated human-level control in Atari games, showcasing RL's ability to learn complex strategies from scratch. The core idea is to approximate the optimal action-value function with a deep neural network, $ Q(s, a; \theta) \approx Q^*(s, a) $. The network is trained using experience replay and a target network to stabilize learning, minimizing the loss: $ L(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] $ where $\theta$ are the current network weights, $\theta^-$ are the target network weights, $\gamma$ is the discount factor, and $U(D)$ denotes uniform sampling from the experience replay buffer $D$.
- AlphaGo (Silver et al., 2017): Demonstrated superhuman performance in Go, largely through self-play and Monte Carlo Tree Search (MCTS) guided by deep neural networks. This work highlighted RL's capacity for autonomous strategy discovery.
-
RL Algorithms for LLMs:
- Proximal Policy Optimization (PPO) (Schulman et al., 2017): A widely used policy gradient algorithm for RL. PPO aims to keep the new policy close to the old policy while taking the largest possible improvement step. This is achieved through a clipped surrogate objective function: $ \mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t \left[ \min(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t) \right] $ where $r_t(\theta)$ is the ratio of new to old token probabilities, $A_t$ is the advantage estimate, and $\epsilon$ is a clipping hyperparameter. PPO is designed for stability and efficiency.
- GRPO (Shao et al., 2024): A critic-free variant that estimates advantage using a normalized reward within a group of responses to the same question.
- Reinforce++ (Hu, 2025): A simple and efficient policy gradient method for aligning LLMs.
- RLOO (Ahmadian et al., 2024): Adopts a leave-one-out baseline for advantage estimation.
- ReMax (Li et al., 2024): Uses the greedy response reward as the advantage baseline.
- DAPO (Yu et al., 2025): An open-source LLM reinforcement learning system.
-
Instruction Tuning and Distillation:
- Instruction-tuned approaches (Achiam et al., 2023; Grattafiori et al., 2024): Traditional methods that rely on human-curated annotations for fine-tuning LLMs.
- Distillation (Guo et al., 2025): Training a student model on outputs (long CoT reasoning traces) generated by a stronger teacher model.
3.3. Technological Evolution
The evolution of LLMs has moved from initial pretraining on vast text corpora to various post-training phases aimed at enhancing specific capabilities:
- Pretraining: Initial large-scale training on diverse text data to learn language fundamentals.
- Supervised Fine-Tuning (SFT): Fine-tuning pretrained models on human-curated datasets for specific tasks or instruction following (e.g., instruction-tuned models).
- Reinforcement Learning from Human Feedback (RLHF): Training models with human preferences as rewards, often building on SFT.
- Reinforcement Learning with Verifiable Rewards (RLVR): A more scalable approach for reasoning tasks where objective, verifiable rewards (e.g., correctness in math, passing code tests) are automatically computed. This is where OpenAI-o1 and DeepSeek-R1 fit in, aiming for self-improvement and novel reasoning patterns.
- Distillation: A parallel approach to transfer knowledge from large, powerful models to smaller, more efficient ones, or to inject new reasoning patterns from highly capable "teacher" models.

This paper's work fits within the RLVR and distillation phases, critically evaluating the former's ability to truly generate novel reasoning beyond the pretrained prior of the base model.
3.4. Differentiation Analysis
The core differentiation of this paper's approach lies in its systematic and rigorous investigation of the boundaries of reasoning capacity using pass@k at large k values, combined with perplexity and coverage analyses.
- Traditional RLVR: Focuses on improving pass@1 (average performance) or overall accuracy, often implicitly assuming that improved performance implies the acquisition of novel reasoning abilities, similar to traditional RL agents exploring new strategies.
- This Paper's Innovation:
-
Probing Reasoning Boundaries: Instead of just looking at pass@1, the paper uses pass@k at very large k (up to 1024 samples) to reveal the potential maximum number of problems a model can solve. This allows them to differentiate between increased sampling efficiency (making correct answers more likely) and increased reasoning capacity (solving problems previously unsolvable).
Systematic Comparison: It compares base models with RLVR-trained models across a wide array of LLM families, model sizes, RL algorithms, and benchmarks (math, coding, visual reasoning), providing a comprehensive view.
Deep Dive into Mechanisms: Through accuracy distribution and perplexity analysis, the paper investigates why RLVR behaves the way it does, concluding that it primarily sharpens the distribution within the base model's prior rather than expanding it.
Contrast with Distillation: By explicitly comparing RLVR with distillation, the paper highlights that distillation can introduce new reasoning patterns, unlike current RLVR, which remains bounded by the base model.
Previous work might have observed a decline in pass@k post-RLVR (e.g., Dang et al., 2025) or similar trends limited to specific models (e.g., DeepSeek-Math, Shao et al., 2024), but they did not conduct the comprehensive cross-model, cross-algorithm, and deep analytical investigation that this paper undertakes. This study provides a much more robust and generalized conclusion about the limitations of current RLVR.
-
4. Methodology
4.1. Principles
The core idea behind the methodology is to systematically evaluate and compare the reasoning capacity boundaries of base Large Language Models (LLMs) and their Reinforcement Learning with Verifiable Rewards (RLVR)-trained counterparts. The theoretical basis rests on the distinction between sampling efficiency and actual reasoning capability expansion. Sampling efficiency refers to how effectively a model can generate a correct solution on its first few attempts. Reasoning capability expansion refers to the model's ability to solve problems that it could not solve at all before, even with many attempts.
The intuition is that if RLVR truly helps LLMs acquire novel reasoning abilities, then an RLVR-trained model should be able to solve a wider range of problems than its base model, especially when given ample opportunity to sample multiple solutions (i.e., at large in pass@k). If, however, RLVR merely makes the model better at finding solutions it already could find, then the base model, given enough attempts, should eventually match or even surpass the RLVR model in terms of the total number of solvable problems. The paper uses perplexity analysis to further investigate if RLVR responses are truly "new" or just higher-likelihood versions of what the base model could produce.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Reinforcement Learning with Verifiable Rewards (RLVR) Fundamentals
The paper begins by outlining the general setup for RLVR.
An LLM with parameters $\theta$, denoted $\pi_\theta$, generates a token sequence $\mathbf{y} = (y_1, \ldots, y_T)$ conditioned on a natural-language prompt $x$. A deterministic verifier provides a binary reward $r = V(x, \mathbf{y}) \in \{0, 1\}$, where $r = 1$ signifies that the model's final answer is exactly correct. The objective of RL is to learn a policy (i.e., optimize $\theta$) to maximize the expected reward.
The objective function for RL is: $ \mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{\mathbf{y} \sim \pi_\theta(\cdot|x)} [r] \right] $ Where:
- $\mathcal{J}(\theta)$ is the expected reward to be maximized.
- $\mathbb{E}_{x \sim \mathcal{D}}$ denotes the expectation over prompts $x$ sampled from the distribution $\mathcal{D}$.
- $\mathbb{E}_{\mathbf{y} \sim \pi_\theta(\cdot|x)}$ denotes the expectation over token sequences $\mathbf{y}$ generated by the policy $\pi_\theta$ conditioned on prompt $x$.
- $r$ is the binary reward received for the generated sequence $\mathbf{y}$ given prompt $x$.
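To make this objective concrete, here is a minimal Python sketch (not from the paper) of how the expected verifiable reward could be estimated by Monte Carlo sampling; `generate` and `verify` are hypothetical callables standing in for the policy and the deterministic verifier.

```python
from typing import Callable, Sequence

def estimate_rlvr_objective(
    prompts: Sequence[str],
    generate: Callable[[str], str],      # hypothetical: samples y ~ pi_theta(. | x)
    verify: Callable[[str, str], int],   # hypothetical: deterministic verifier, 1 if correct else 0
    samples_per_prompt: int = 8,
) -> float:
    """Monte Carlo estimate of J(theta) = E_{x ~ D} [ E_{y ~ pi_theta(.|x)} [r] ]."""
    total = 0.0
    for x in prompts:
        rewards = [verify(x, generate(x)) for _ in range(samples_per_prompt)]
        total += sum(rewards) / samples_per_prompt   # inner expectation over y
    return total / len(prompts)                       # outer expectation over x
```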
4.2.2. RLVR Algorithms
The paper compares several popular RL algorithms, primarily focusing on policy gradient methods.
Proximal Policy Optimization (PPO)
PPO (Schulman et al., 2017) is a widely used algorithm that maximizes a clipped surrogate objective function to update the policy. This objective encourages improvements while preventing excessively large policy updates, which can lead to instability.
The clipped surrogate objective is defined as: $ \mathcal{L}_{\mathrm{CLIP}} = \mathbb{E}_t \left[ \min(r_t(\theta) A_t, \ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t) \right] $ Where:
- $\mathbb{E}_t$ denotes the empirical average over a batch of samples.
- $r_t(\theta)$ is the ratio of the new policy's probability of taking the action at step $t$ (generating token $y_t$) to the old policy's probability, given the state (prompt $x$ and previous tokens $\mathbf{y}_{<t}$). Specifically: $ r_t(\theta) = \frac{\pi_\theta(y_t \mid x, \mathbf{y}_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, \mathbf{y}_{<t})} $
- $A_t$ is the advantage estimate for the action at time $t$, usually estimated with a value network $V_\phi$: $A_t = R_t - V_\phi(s_t)$, where $R_t$ is the discounted sum of future rewards and $V_\phi(s_t)$ is the estimated value of the state $s_t$.
- $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ clips the probability ratio to lie within the interval $[1-\epsilon, 1+\epsilon]$. This mechanism prevents large policy updates if the new policy deviates too much from the old one, promoting stable learning.
- $\epsilon$ is a hyperparameter, typically a small value (e.g., 0.2), that controls the clipping.
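A minimal PyTorch sketch of this clipped surrogate objective, assuming per-token log-probabilities and advantage estimates are already available (the function name and tensor shapes are illustrative, not taken from the paper's codebase):

```python
import torch

def ppo_clip_loss(
    logp_new: torch.Tensor,    # log pi_theta(y_t | x, y_<t), shape [T]
    logp_old: torch.Tensor,    # log pi_theta_old(y_t | x, y_<t), shape [T]
    advantages: torch.Tensor,  # advantage estimates A_t, shape [T]
    eps: float = 0.2,
) -> torch.Tensor:
    """Negative clipped surrogate objective (minimizing this maximizes L_CLIP)."""
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```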
Other RLVR algorithms mentioned and used in the study include:
- GRPO (Shao et al., 2024): A critic-free variant where the advantage is estimated by normalizing rewards within a group of responses to the same question: $ A_i = [r_i - \mathrm{mean}(\mathbf{r})] / \mathrm{std}(\mathbf{r}) $, where $\mathbf{r} = \{r_1, \ldots, r_G\}$ is the set of rewards for the $G$ sampled responses (see the sketch at the end of this subsection).
- RLOO (Ahmadian et al., 2024): Adopts a leave-one-out baseline for advantage estimation within each batch.
- Reinforce++ (Hu, 2025), ReMax (Li et al., 2024), DAPO (Yu et al., 2025): These are also policy gradient methods with specific choices for advantage estimation or other enhancements. For a fair comparison, the authors often remove the KL divergence term (optionally applied in PPO) to avoid constraining model learning, following practices in DAPO and Oat-Zero.

Zero-RL Training: This refers to applying RL directly to the base model without any supervised fine-tuning (SFT). The paper primarily uses this setting for math tasks to isolate the effect of RLVR. For coding and visual reasoning, instruction-tuned models are used as starting points, consistent with open-source practices.
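Returning to the GRPO advantage estimate referenced above, a minimal sketch (illustrative function name, not the paper's implementation) of the group-normalized advantage for one prompt:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages A_i = (r_i - mean(r)) / std(r) for one prompt.

    `rewards` holds the binary verifier rewards of the G responses sampled
    for the same question (shape [G]).
    """
    rewards = rewards.float()
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```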
4.2.3. Metrics for LLM Reasoning Capacity Boundary
The paper emphasizes the pass@k metric as crucial for evaluating reasoning capacity boundaries.
Pass@k Metric Definition and Estimation:
Given a problem, $k$ outputs are sampled from the model. The pass@k for this question is 1 if at least one of the $k$ samples passes verification (is correct); otherwise, it is 0. The average pass@k over a dataset reflects the proportion of problems the model can solve within $k$ trials.
To accurately estimate pass@k with low variance, the paper adopts an unbiased estimator from Chen et al. (2021). For each problem $x_i$ from the evaluation dataset $\mathcal{D}$, $n$ samples are generated (where $n \geq k$), and $c_i$ is the number of correct samples.
The unbiased estimator of pass@k over the dataset is given by:
$
\operatorname{pass@}k := \mathbb{E}_{x_i \sim \mathcal{D}} \left[ 1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}} \right]
$
Where:
- $\operatorname{pass@}k$ is the estimated pass rate when sampling $k$ times.
- $\mathbb{E}_{x_i \sim \mathcal{D}}$ denotes the expectation (average) over all problems $x_i$ in the dataset $\mathcal{D}$.
- $c_i$ is the number of correct samples obtained for problem $x_i$ from $n$ total samples.
- $\binom{n}{k}$ is the binomial coefficient, representing the number of ways to choose $k$ items from a set of $n$ items, calculated as $\frac{n!}{k!(n-k)!}$.
- The term $1 - \binom{n - c_i}{k} / \binom{n}{k}$ calculates the probability that at least one of the $k$ chosen samples (from the $n$ total samples for problem $x_i$) is correct. This is equivalent to 1 minus the probability that all $k$ chosen samples are incorrect.

The pass@k metric is preferred over Best-of-N or Majority Voting because it focuses on the model's potential to solve a problem, regardless of whether a specific selection method identifies the correct answer.
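A minimal NumPy sketch of this unbiased estimator, using the standard numerically stable product form (the function names are ours, not from the paper's code):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for a single problem with n samples, c of them correct."""
    if n - c < k:
        # fewer than k incorrect samples: at least one correct sample is guaranteed
        return 1.0
    # 1 - C(n - c, k) / C(n, k), expanded as a product to avoid huge binomial terms
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def dataset_pass_at_k(num_correct: list[int], n: int, k: int) -> float:
    """Average pass@k over a dataset, following the estimator above."""
    return float(np.mean([pass_at_k(n, c, k) for c in num_correct]))
```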
4.2.4. Perplexity Analysis
To determine if reasoning paths generated by RLVR models already exist within the output distribution of their base models, the paper uses perplexity.
Given a model $m$, a problem $x$, and a response $\mathbf{Y} = (y_1, \ldots, y_T)$, the perplexity $\operatorname{PPL}_m(\mathbf{Y} \mid x)$ reflects the model's ability to predict the given response conditioned on the prompt $x$.
The perplexity is defined as: $ \operatorname{PPL}_m(\mathbf{Y} \mid x) = \exp \left( - \frac{1}{T} \sum_{t=1}^T \log P(y_t \mid x, y_1, \ldots, y_{t-1}) \right) $ Where:
- $\operatorname{PPL}_m(\mathbf{Y} \mid x)$ is the perplexity of the response $\mathbf{Y}$ given prompt $x$, according to model $m$.
- $\exp(\cdot)$ is the exponential function.
- $T$ is the total number of tokens in the response $\mathbf{Y}$.
- $\log P(y_t \mid x, y_1, \ldots, y_{t-1})$ is the log-likelihood of the $t$-th token given the prompt and all preceding tokens, as computed by model $m$.
- A lower perplexity indicates that the model assigns a higher probability to the response sequence, meaning it finds the sequence more "expected" or "natural".

By comparing $\operatorname{PPL}_{\mathrm{Base}}(\mathbf{Y}_{\mathrm{RL}} \mid x)$ (perplexity of RL model responses under the base model) with $\operatorname{PPL}_{\mathrm{Base}}(\mathbf{Y}_{\mathrm{Base}} \mid x)$ (perplexity of base model responses under the base model), the authors infer whether RL-generated responses are within the base model's high-likelihood distribution.
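A minimal sketch of this computation from per-token log-probabilities (how those log-probabilities are obtained, e.g. by scoring the response with the base model, is left abstract here):

```python
import math
from typing import Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL = exp(-(1/T) * sum_t log P(y_t | x, y_<t)).

    `token_logprobs` holds the log-probabilities the scoring model (e.g. the
    base model) assigns to each response token, conditioned on the prompt and
    all preceding tokens.
    """
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Scoring RL-model responses this way under the base model and comparing the resulting distribution with that of the base model's own responses is exactly the comparison shown later in Figure 6.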
4.2.5. Distillation Methodology
For distillation, the paper focuses on a representative model, DeepSeek-R1-Distill-Qwen-7B. This model distills knowledge from a stronger teacher model (DeepSeek-R1) into a student model (Qwen2.5-Math-7B). The training data for distillation consists of long Chain-of-Thought (CoT) reasoning traces generated by the teacher model. This is analogous to instruction-following fine-tuning, but with richer, more complex reasoning data. The effectiveness of distillation is then evaluated using the same pass@k metric to see if it genuinely expands the reasoning boundary.
4.2.6. Sampling Efficiency Gap (ΔSE)
To quantify the difference in sampling efficiency, the authors propose the Sampling Efficiency Gap ($\Delta_{\mathrm{SE}}$), defined as:
$
\Delta_{\mathrm{SE}} = \operatorname{pass@}k_{\mathrm{Base}} - \operatorname{pass@1}_{\mathrm{RL}}
$
Where:
- $\operatorname{pass@1}_{\mathrm{RL}}$ is the pass@1 score of the RL-trained model.
- $\operatorname{pass@}k_{\mathrm{Base}}$ is the pass@k score of the base model, with $k = 256$ used as a proxy for its upper-bound performance.
- A lower $\Delta_{\mathrm{SE}}$ indicates that the RL algorithm is closer to achieving optimal sampling efficiency, meaning its pass@1 is closer to the base model's full potential.
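As a concrete illustration of this definition (using values reported later in Table 3; the function name is ours):

```python
def sampling_efficiency_gap(pass_at_k_base: float, pass_at_1_rl: float) -> float:
    """Delta_SE = pass@k(Base) - pass@1(RL); lower means better sampling efficiency."""
    return pass_at_k_base - pass_at_1_rl

# Example with Table 3 values on Omni-MATH-Test: base pass@256 = 69.1, GRPO pass@1 = 25.1
print(sampling_efficiency_gap(69.1, 25.1))  # -> 44.0, in line with the ~44-point gaps in Section 6.3.1
```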
5. Experimental Setup
5.1. Datasets
The study evaluates models across three representative domains: mathematics, code generation, and visual reasoning.
-
Mathematics:
- GSM8K (Cobbe et al., 2021): A dataset of grade school math word problems.
- MATH500 (Hendrycks et al., 2021): A dataset of challenging mathematics problems.
- Minerva (Lewkowycz et al., 2022): A dataset containing diverse math and science problems, known for requiring advanced reasoning.
- Olympiad (He et al., 2024): Problems from international math olympiads, representing extremely challenging reasoning tasks.
- AIME24, AMC23: Problems from American Invitational Mathematics Examination and American Mathematics Competitions, respectively. These are challenging competitive math problems.
- Omni-MATH-Rule (Gao et al., 2025): A subset of the Omni-MATH dataset containing verifiable problems. Used for in-domain and out-of-domain generalization studies.
-
Code Generation:
- LiveCodeBench v5 (Jain et al., 2025): Comprises 279 coding problems spanning August 2024 to January 2025.
- HumanEval+ (Liu et al., 2023): A widely used benchmark for evaluating code generation models.
- MBPP+ (Liu et al., 2023): Another popular benchmark for code generation models, focusing on simpler Python programming problems.
-
Visual Reasoning:
-
Geometry3K (Lu et al., 2021): Used for training visual reasoning models, focusing on geometry problems.
-
MathVista-TestMini (Lu et al., 2024): A filtered version of MathVista, specifically for evaluating math in visual contexts, with multiple-choice questions removed.
-
MathVision-TestMini (Wang et al., 2024): Similar to MathVista-TestMini, focusing on visual math reasoning.
The datasets were chosen to cover a range of difficulty levels and reasoning types (numerical, logical, symbolic, multi-modal), ensuring a robust validation of the methods. The inclusion of competitive programming and math olympiad problems pushes the boundaries of LLM reasoning.
-
5.2. Evaluation Metrics
The primary evaluation metric used across all tasks is pass@k.
-
Pass@k:
- Conceptual Definition:
Pass@k measures the probability that at least one out of k sampled outputs from a model correctly solves a given problem. It reflects the model's potential to solve a problem if given multiple attempts, thereby revealing its reasoning capacity boundary. A higher pass@k indicates a greater ability to eventually find a correct solution.
- Mathematical Formula: For each problem $x_i$ from the evaluation dataset $\mathcal{D}$, we generate $n$ samples (where $n \geq k$) and count the number of correct samples as $c_i$. The unbiased estimator of pass@k over the dataset is given by: $ \operatorname{pass@}k := \mathbb{E}_{x_i \sim \mathcal{D}} \left[ 1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}} \right] $
- Symbol Explanation:
- $\operatorname{pass@}k$: The estimated pass rate for a model when $k$ samples are drawn per problem.
- $\mathbb{E}_{x_i \sim \mathcal{D}}$: The expectation (average) over all problems $x_i$ in the dataset $\mathcal{D}$.
- $n$: The total number of samples generated for each problem during evaluation (where $n \geq k$).
- $c_i$: The number of correct samples obtained for problem $x_i$ out of $n$ total samples.
- $\binom{n}{k}$: The binomial coefficient, read as "n choose k" and calculated as $\frac{n!}{k!(n-k)!}$. It represents the number of ways to choose $k$ items from a set of $n$ items without regard to the order of selection.
- The term $1 - \binom{n - c_i}{k} / \binom{n}{k}$ calculates the probability that at least one of the $k$ selected samples (from the $n$ available samples for problem $x_i$) is correct. This is equivalent to 1 minus the probability that all $k$ selected samples are incorrect (i.e., drawn only from the $n - c_i$ incorrect samples).

The paper typically uses $n$ of 128, 256, or 1024, corresponding to the largest $k$ value in the pass@k curves.
-
Perplexity (PPL):
- Conceptual Definition: Perplexity is a measure of how well a probability distribution or language model predicts a sample. It quantifies how "surprised" the model is by a given sequence of words or tokens. A lower perplexity indicates that the model assigns a higher probability to the sequence, meaning it predicts the sequence more confidently and accurately. In this study, it helps assess whether RL-generated responses are high-likelihood outcomes for the base model.
- Mathematical Formula: $ \operatorname{PPL}_m(\mathbf{Y} \mid x) = \exp \left( - \frac{1}{T} \sum_{t=1}^T \log P(y_t \mid x, y_1, \ldots, y_{t-1}) \right) $
- Symbol Explanation:
- $\operatorname{PPL}_m(\mathbf{Y} \mid x)$: The perplexity of the response $\mathbf{Y}$ given prompt $x$, as computed by model $m$.
- $\exp(\cdot)$: The exponential function (base $e$).
- $T$: The total number of tokens in the response $\mathbf{Y}$.
- $\log P(y_t \mid x, y_1, \ldots, y_{t-1})$: The natural logarithm of the probability assigned by model $m$ to the $t$-th token $y_t$, given the initial prompt $x$ and all preceding tokens in the sequence.
- The term $-\frac{1}{T} \sum_{t=1}^T \log P(y_t \mid x, y_1, \ldots, y_{t-1})$ is the average negative log-likelihood per token.
5.3. Baselines
The study primarily compares the RLVR-trained models against their corresponding base models.
- Base Models: These are the original pretrained LLMs (or instruction-tuned models for coding/visual tasks) from which the RLVR models are derived. Examples include
Qwen2.5-7B/14B/32B-Base, LLaMA-3.1-8B, Qwen2.5-Math-7B, Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Qwen-14B, and Qwen2.5-VL-7B.
- Other Baselines:
-
Instruction-tuned Models: For certain comparisons (e.g., in Figure 7),
Qwen2.5-Math-7B-Instructis included as a baseline to show the effect of supervised instruction fine-tuning. -
Distilled Models:
DeepSeek-R1-Distill-Qwen-7Bis used as a baseline to compare the effect of distillation against RLVR. -
Different RLVR Algorithms: For the deep analysis in Section 4.3, various RL algorithms (
PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO) are implemented and compared against each other, all starting from Qwen2.5-7B.
Magistral-Medium-2506: Compared against its base model,
Mistral-Medium-3-2505, for evaluating model size scaling effects.The base models are representative as they represent the state of the model before RLVR training, thus allowing direct assessment of RLVR's impact. Distilled and instruction-tuned models provide different paradigms for LLM capability enhancement, offering a broader comparative context.
-
5.4. Experimental Protocols
- Sampling: For both base and RLVR models, a temperature of 0.6 and a top-p value of 0.95 were used. A maximum generation length of 16,384 tokens was allowed (a decoding sketch follows this list).
- Prompting for Base Models: To ensure a fair comparison and avoid confounding effects, few-shot examples were not used for base models. The same zero-shot prompt (as used in RLVR training or the benchmark's default prompt) was used for both base and RLVR models.
- Zero-RL Setting: For math tasks, zero-RL (RL applied directly to pretrained models without SFT) was followed. For coding and visual reasoning, instruction-tuned models were used as starting points for RLVR, reflecting common practice.
- Manual CoT Inspection: For mathematical problems, where "guessing" correct answers is a concern, a subset of Chain-of-Thought (CoT) responses for challenging problems were manually inspected to confirm that correct final answers stemmed from genuinely correct reasoning paths. This validates the pass@k metric for math.
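The decoding sketch below mirrors the sampling parameters above (temperature 0.6, top-p 0.95, long generations, n samples per prompt). The model name and the use of Hugging Face `generate` are our assumptions for illustration; the paper does not specify its inference stack here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B"  # one of the base models studied (assumed checkpoint id)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def sample_responses(prompt: str, n: int = 8, max_new_tokens: int = 16384) -> list[str]:
    """Draw n independent samples for one prompt with the evaluation settings."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
    )
    # keep only the generated continuation, dropping the prompt tokens
    completions = out[:, inputs["input_ids"].shape[1]:]
    return tok.batch_decode(completions, skip_special_tokens=True)
```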
6. Results & Analysis
6.1. Core Results Analysis
The core results consistently demonstrate a surprising trend: while RLVR-trained models outperform their base models at small k (e.g., pass@1), base models achieve higher pass@k scores when k is large. This suggests that RLVR primarily improves the sampling efficiency of already solvable problems rather than expanding the reasoning capacity to solve fundamentally new problems.
6.1.1. RLVR for Mathematical Reasoning
The following figure (Figure 2 from the original paper) shows the pass@k curves of base models and their RLVR-trained counterparts across multiple mathematical benchmarks.

Figure 2: Pass@k curves of base models and their RLVR-trained counterparts across multiple mathematical benchmarks. When k is small, RL-trained models outperform their base versions. However, as k increases to the tens or hundreds, base models consistently catch up and surpass RL-trained models. More results on GSM8K and AMC23 can be found in Figure 10.
Analysis:
- Small k (e.g., k = 1): On all mathematical benchmarks (e.g., GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23), RLVR-trained models (red/orange lines) show higher pass@1 scores compared to their corresponding base models (green/blue lines). This indicates that RLVR makes the models more likely to sample a correct response on their first attempt, confirming its role in improving sampling efficiency.
- Large k: As the number of samples k increases, the pass@k curves for base models rise more steeply and consistently catch up to and eventually surpass the RLVR-trained models. For instance, on Minerva (Qwen2.5-32B), the base model outperforms the RL-trained model by approximately 9% at large k. This suggests that while RLVR models are more efficient, base models have a broader coverage of solvable problems and can solve more problems overall if given enough attempts.
- RLVR Narrows Coverage: This contrasting trend implies that current RLVR training does not expand the scope of problems an LLM can solve. Instead, it seems to make the model "forget" or deprioritize some reasoning paths, leading to a narrower reasoning boundary compared to its base model at high k.

The following figure (Figure 10 from the original paper) shows more results of SimpleRLZoo on GSM8K and AMC23.
Figure 10: More results of SimpleRLZoo on GSM8K and AMC23.
Analysis: This figure further supports the observation from Figure 2, showing consistent trends for various Qwen2.5 models and LLaMA-3.1-8B on GSM8K and AMC23. For all models, the RL-trained version initially outperforms the base model at small k, but the base model eventually surpasses it at larger k values.
The following figure (Figure 11 from the original paper) shows Oat-Zero-7B and DAPO-32B evaluated on AIME24 and compared against their respective base models.

Figure 11: Oat-Zero-7B and DAPO-32B are evaluated on AIME24 and compared against their respective base models.
Analysis: Even for strong RL models like Oat-Zero and DAPO, which demonstrate significant initial performance gains (e.g., nearly 30% higher than the base model at pass@1 for Oat-Zero), the base models eventually surpass them in pass@k as k increases. This reinforces the conclusion that RLVR improves initial sampling efficiency but does not fundamentally expand reasoning capacity.
6.1.2. RLVR for Code Generation
The following figure (Figure 4 (left) from the original paper) shows pass@k curves of base and RLVR models for code generation.

Figure 4: Pass @ k curves of base and RLVR models. (Left) Code Generation. (Right) Visual Reasoning.
Analysis: The trends for code generation (LiveCodeBench, HumanEval+, MBPP+) are highly consistent with those observed in mathematical reasoning. The RLVR models (Coder-R1-Zero-Qwen2.5-7B and Deepcoder-14B-Preview) initially show higher pass@k at small k values, but their base counterparts eventually outperform them at larger k. Since code problems are less susceptible to "guessing" (due to unit tests), this further strengthens the argument that RLVR's primary effect is on sampling efficiency.
The following figure (Figure 12 from the original paper) shows Coder-R1 on LiveCodeBench.

Figure 12: Coder-R1 on LiveCodeBench.
Analysis: This figure, including separate plots for 2023 and 2024 LiveCodeBench tasks, consistently shows the same pattern. The RL-trained Coder-R1-Zero (red line) starts higher but is eventually surpassed by the Qwen2.5-7B-Instruct-1M base model (blue line) as k increases.
6.1.3. RLVR for Visual Reasoning
The following figure (Figure 4 (right) from the original paper) shows pass@k curves of base and RLVR models for visual reasoning.

Figure 4: Pass @ k curves of base and RLVR models. (Left) Code Generation. (Right) Visual Reasoning.
Analysis: The results for visual reasoning (MathVista-TestMini and MathVision-TestMini) mirror the findings in math and coding. The RLVR-trained EasyR1 model shows an initial advantage at small k, but the base Qwen2.5-VL-7B model demonstrates broader coverage at higher k values. This suggests the observed phenomenon is generalizable across different modalities.
6.1.4. Accuracy Distribution Analysis
The following figure (Figure 5 from the original paper) shows the accuracy histogram of Qwen2.5-7B on Minerva.

Figure 5: Qwen2.5-7B Accuracy Histogram on Minerva.
Analysis: This histogram compares the distribution of problem accuracies for the base model (green) and the RL model (red).
- RLVR increases the frequency of problems solved with high accuracy (near 1.0) and reduces the frequency of problems with low accuracies (e.g., 0.1, 0.2). This confirms that RLVR improves sampling efficiency for problems the base model could already solve.
- Crucially, RLVR also leads to an increased frequency at accuracy 0, meaning more problems become completely unsolvable by the RL model compared to the base model. This directly supports the observation that RLVR narrows the model's overall coverage. The improvement in average scores (pass@1) comes from more efficient sampling on already solvable problems, not from solving new ones.
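A minimal sketch of how such an accuracy histogram can be built from the per-problem correct counts used for pass@k (illustrative names, not from the paper's code):

```python
import numpy as np

def accuracy_histogram(num_correct: list[int], n: int, bins: int = 11):
    """Bucket per-problem accuracies c_i / n as in the Figure 5 histogram.

    Problems falling in the 0.0 bucket remain unsolved even after n attempts;
    comparing the base-model and RL-model histograms shows how RLVR shifts
    mass toward accuracy 1.0 while also enlarging the accuracy-0 bucket.
    """
    accuracies = np.asarray(num_correct, dtype=float) / n
    counts, edges = np.histogram(accuracies, bins=bins, range=(0.0, 1.0))
    return counts, edges
```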
6.1.5. Perplexity Analysis
The following figure (Figure 6 from the original paper) shows the perplexity distribution of responses.

Figure 6: Perplexity distribution of responses. The conditioning problem is omitted in the figure.
Analysis: This figure shows the perplexity distribution of responses from different models, evaluated by the base model (PPL_Base).
- PPL_Base(Y_Base | x) (left-most distribution, green) shows the perplexity of responses generated by the base model, evaluated by the base model itself.
- PPL_Base(Y_RL | x) (middle distribution, red) shows the perplexity of responses generated by the RL model, evaluated by the base model. This distribution closely matches the lower portion of the PPL_Base(Y_Base | x) distribution. This means that the responses generated by the RL-trained models are highly likely to be generated by the base model; they are not "novel" or "out-of-distribution" for the base model.
- PPL_Base(Y_GT | x) (right-most distribution, blue) shows the perplexity of responses generated by a powerful teacher model (OpenAI-o1), evaluated by the base model. These responses have much higher perplexity, indicating that they contain reasoning patterns that are outside the base model's typical output distribution.

This analysis strongly suggests that RLVR primarily sharpens the distribution within the base model's prior, making it more likely to generate correct, but already "known," reasoning paths.
6.1.6. Distillation vs. RLVR
The following figure (Figure 7 from the original paper) shows pass@k of base, Instruct, RLVR, and distilled models.

Figure 7: pass @ k of base, Instruct, RLVR, and distilled models.
Analysis: This figure provides a crucial comparison.
- Base (Qwen2.5-Math-7B, green) and RL (Qwen2.5-Math-7B-Oat-Zero, red) show the familiar pattern: RL outperforms at small k but is surpassed by Base at large k.
- Instruct (Qwen2.5-Math-7B-Instruct, light blue) performs better than Base at small k but also falls below Base at large k.
- Distilled (DeepSeek-R1-Distill-Qwen-7B, dark blue) consistently and significantly outperforms the base model across all k values. Its pass@k curve is always above that of the base model. This indicates that distillation can genuinely introduce new reasoning patterns from a stronger teacher model and expand the model's reasoning capabilities beyond the base model's boundary.
6.1.7. Effects of Model Size Scaling
The following figure (Figure 9 from the original paper) shows pass@k curves of Magistral-Medium.

Figure 9: pass @ k curves of Magistral-Medium.
Analysis: Even for larger, near-frontier models like Magistral-Medium (RLVR-enhanced) and its base Mistral-Medium-3, the same trend holds. RLVR provides significant gains at low k (e.g., 7-8 more problems solved on AIME24/25 at small k), but this performance gap steadily narrows or even reverses at higher k values. This suggests that the conclusions are not limited to smaller models but extend to highly capable LLMs, indicating a systemic limitation of current RLVR.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Task | Start Model | RL Framework | RL Algorithm(s) | Benchmark(s) |
|---|---|---|---|---|
| Mathematics | LLaMA-3.1-8B Qwen2.5-7B/14B/32B-Base Qwen2.5-Math-7B | SimpleRLZoo Oat-Zero DAPO | GRPO | GSM8K, MATH500 Minerva, Olympiad AIME24, AMC23 |
| Code Generation | Qwen2.5-7B-Instruct DeepSeek-R1-Distill-Qwen-14B | Code-R1 DeepCoder | GRPO | LiveCodeBench HumanEval+ |
| Visual Reasoning | Qwen2.5-VL-7B | EasyR1 | GRPO | MathVista MathVision |
| Deep Analysis | Qwen2.5-7B-Base Qwen2.5-7B-Instruct DeepSeek-R1-Distill-Qwen-7B | VeRL | PPO, GRPO Reinforce++ RLOO, ReMax, DAPO | Omni-Math-Rule MATH500 |
The following are the results from Table 2 of the original paper:
| Solved by Base | Solved by SimpleRLZoo | AIME24 | MATH500 |
|---|---|---|---|
| ✓ | ✓ | 63.3% | 92.4% |
| ✓ | ✗ | 13.3% | 3.6% |
| ✗ | ✓ | 0.0% | 1.0% |
| ✗ | ✗ | 23.3% | 3.0% |
Analysis of Table 2 (Solvable Problem Coverage): This table quantifies the overlap in solvable problems between the base model and the RLVR-trained model.
-
Base ✓, SimpleRLZoo ✓: A large majority of problems (63.3% on AIME24, 92.4% on MATH500) are solved by both models. This reinforces that RLVR primarily optimizes problems that are already within the base model's capabilities.
-
Base ✓, SimpleRLZoo ×: A significant fraction of problems (13.3% on AIME24, 3.6% on MATH500) are solved by the base model but not by the RLVR model. This directly supports the finding that RLVR narrows the model's overall coverage.
-
Base ✗, SimpleRLZoo ✓: Crucially, this category shows the percentage of problems that only the RLVR model solves, but the base model cannot. On AIME24, this is 0.0%, and on MATH500, it is only 1.0%. This near-zero percentage is the strongest evidence that RLVR does not enable the model to solve fundamentally new problems. The paper further notes that even these rare cases are often solvable by the base model if given a sufficiently large number of samples (e.g., 1024).
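These four coverage categories can be computed directly from the per-problem solved sets; a minimal sketch (illustrative names, not the paper's code) is shown below. Applied to the AIME24 index lists in Table 5, for example, it reproduces the empty "RL only" category.

```python
def coverage_breakdown(base_solved: set[int], rl_solved: set[int], total: int) -> dict[str, float]:
    """Fractions of problems in each of the four categories of Table 2."""
    both = base_solved & rl_solved
    only_base = base_solved - rl_solved
    only_rl = rl_solved - base_solved
    neither = total - len(base_solved | rl_solved)
    return {
        "base and rl": len(both) / total,
        "base only": len(only_base) / total,
        "rl only": len(only_rl) / total,
        "neither": neither / total,
    }
```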
The following are the results from Table 3 of the original paper:
| Model | Omni-MATH-Train pass@1 | Omni-MATH-Train pass@256 | Omni-MATH-Test pass@1 | Omni-MATH-Test pass@256 | MATH500 pass@1 | MATH500 pass@256 |
|---|---|---|---|---|---|---|
| Qwen2.5-7B | 9.9 | 67.2 | 10.2 | 69.1 | 34.5 | 96.2 |
| GRPO | 26.1 | 66.3 | 25.1 | 68.3 | 74.4 | 97.2 |
| PPO | 27.2 | 65.8 | 26.8 | 69.2 | 75.2 | 97.2 |
| ReMax | 24.4 | 65.5 | 23.8 | 67.5 | 73.5 | 96.6 |
| RLOO | 28.6 | 66.4 | 28.1 | 69.2 | 75.0 | 97.4 |
| Reinforce++ | 28.2 | 67.7 | 28.0 | 69.7 | 75.4 | 96.8 |
| DAPO | 31.4 | 66.1 | 26.5 | 67.0 | 75.6 | 96.4 |
Analysis of Table 3 (Different RL Algorithms): This table provides detailed pass@1 and pass@256 values for various RL algorithms compared to the base Qwen2.5-7B model across multiple datasets.
-
- pass@1 Improvement: All RL algorithms significantly improve pass@1 compared to the base model (e.g., from 9.9 to 26.1-31.4 on Omni-MATH-Train). This confirms their effectiveness in improving sampling efficiency.
- pass@256 Stability/Slight Decrease: However, for pass@256 (representing the reasoning boundary), most RL algorithms show either similar or slightly lower performance compared to the base model. For example, on Omni-MATH-Train, the base model has pass@256 of 67.2, while RL algorithms range from 65.5 (ReMax) to 67.7 (Reinforce++). This indicates that even with different algorithmic approaches, the fundamental limitation of not expanding the reasoning boundary persists.
- Algorithm Performance: While DAPO achieves the highest pass@1, its pass@256 is not consistently the best, aligning with the general observation that pass@k tends to decrease as RLVR training progresses. RLOO and Reinforce++ maintain a good balance between pass@1 and pass@256.

The following are the results from Table 4 of the original paper:
| Model | Omni-MATH-Train pass@1 | Omni-MATH-Train pass@256 | Omni-MATH-Test pass@1 | Omni-MATH-Test pass@256 | MATH500 pass@1 | MATH500 pass@256 |
|---|---|---|---|---|---|---|
| Qwen2.5-7B | 9.9 | 67.2 | 10.2 | 69.1 | 34.5 | 96.2 |
| GRPO-step150 | 26.1 | 66.3 | 25.1 | 68.3 | 74.4 | 97.2 |
| GRPO-step300 | 33.6 | 65.3 | 27.1 | 66.6 | 76.3 | 96.0 |
| GRPO-step450 | 42.5 | 64.3 | 28.3 | 63.9 | 77.2 | 95.4 |
Analysis of Table 4 (RL Training Steps): This table shows the effect of increasing RL training steps on pass@1 and pass@256 using the GRPO algorithm.
-
- pass@1 Improvement: As training progresses (from step 150 to 450), pass@1 consistently improves (e.g., from 26.1 to 42.5 on Omni-MATH-Train). This indicates that the model becomes more efficient at generating correct solutions with more training.
- pass@256 Decrease: However, pass@256 (the reasoning boundary) progressively decreases as RLVR training progresses (e.g., from 66.3 to 64.3 on Omni-MATH-Train, and from 68.3 to 63.9 on Omni-MATH-Test). This is a critical finding, indicating that overtraining with RLVR can actually narrow the model's overall problem-solving capacity, making it less capable of solving a diverse set of problems even with many attempts.

The following are the results from Table 5 of the original paper:
| Models | Problem Indices |
|---|---|
| Qwen2.5-7B-Base | 0, 1, 4, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 22, 23, 24, 25, 26, 27, 28, 29 |
| SimpleRL-Qwen2.5-7B | 0, 1, 6, 7, 8, 9, 12, 14, 15, 16, 18, 22, 23, 24, 25, 26, 27, 28, 29 |
Analysis of Table 5 (Solvable Problem Indices in AIME24): This table lists the specific problem indices (out of 30) solved by the base model and the RLVR model on AIME24 at a large sampling budget k.
-
The set of problems solved by SimpleRL-Qwen2.5-7B (RL) is a subset of the problems solved by Qwen2.5-7B-Base. For example, problems 4, 11, 17, 19 are solved by the base model but not by the RL model, while there are no problems solved by the RL model that are not also solved by the base model. This provides concrete evidence for the "subset relationship" and the narrowing of reasoning coverage by RLVR.

The following are the results from Table 6 of the original paper:
| Model | Solvable Problem Indices |
|---|---|
| Qwen2.5-7B-Instruct-1M | 400, 402, 403, 407, 409, 412, 413, 417, 418, 419, 422, 423, 427, 432, 433, 436, 438, 439, 440, 444, 445, 448, 449 |
| Coder-R1 | 400, 402, 403, 407, 412, 413, 417, 418, 419, 422, 423, 427, 430, 433, 438, 439, 440, 444, 445, 449 |
Analysis of Table 6 (Solvable Problem Indices in LiveCodeBench): Similar to Table 5, this table shows problem indices solved by the base and RL models for LiveCodeBench.
- Again, Coder-R1 (RL) solves essentially a subset of the problems solved by Qwen2.5-7B-Instruct-1M (Base). Problems 409, 432, 436, 448 are solved by the base but not by the RL model, while problem 430 is solved by RL but not by the base (a rare exception that the authors address by noting base models can solve such problems with enough attempts). This further confirms the narrowing of reasoning coverage in coding tasks.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Effects of Different RL Algorithms
The following figure (Figure 8 (top) from the original paper) shows the ΔSE values for different RL algorithms.

Figure 8: (Top) Different RL algorithms. (Bottom) Different RL training steps. The detailed values for each point at pass@1 and pass@256 are provided in Table 3 and Table 4.
Analysis:
- The Sampling Efficiency Gap ($\Delta_{\mathrm{SE}}$), defined as $\Delta_{\mathrm{SE}} = \operatorname{pass@256}_{\mathrm{Base}} - \operatorname{pass@1}_{\mathrm{RL}}$, quantifies how much the pass@1 of an RL model falls short of the base model's potential maximum pass@256.
- As shown in Figure 8 (top) and detailed in Table 3, $\Delta_{\mathrm{SE}}$ values across all six popular RL algorithms (PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO) on Omni-MATH-Test show only minor variations (ranging from GRPO's 43.9 to RLOO's best 42.6).
- Crucially, $\Delta_{\mathrm{SE}}$ remains consistently above 40 points, highlighting that current RL methods are still far from achieving optimal sampling efficiency and fully leveraging the base model's potential. This indicates that the core issue is not specific to a particular RL algorithm but rather a broader limitation of the current RLVR paradigm.
6.3.2. Effects of RL Training Steps
The following figure (Figure 8 (bottom) from the original paper) shows the effects of different RL training steps.

Figure 8: (Top) Different RL algorithms. (Bottom) Different RL training steps. The detailed values for each point at pass@1 and pass@256 are provided in Table 3 and Table 4.
Analysis:
-
- As detailed in Table 4, pass@1 on the training set consistently improves as RL training progresses (e.g., from 26.1 at step 150 to 42.5 at step 450).
- However, pass@256 (the reasoning boundary) progressively decreases with more training steps (e.g., from 66.3 to 64.3 on Omni-MATH-Train, and from 69.1 (base) to 63.9 (GRPO-step450) on Omni-MATH-Test). This suggests an asymptotic effect where continued RL training, while boosting efficiency, actively narrows the model's overall reasoning coverage, leading to a reduced reasoning boundary.

The following figure (Figure 19 from the original paper) shows the curves of training reward, response length, and generation entropy during training, corresponding to experiments in Section 4.
Figure 19: The curves of training reward, response length, and generation entropy during training, corresponding to experiments in Section 4.
Analysis: This figure shows how various metrics evolve during RL training for different algorithms.
-
Actor Reward: The reward (
actor_reward) generally increases over training steps for all algorithms, confirming that RL training is successfully optimizing the policy to produce higher-reward (correct) responses. -
Response Length:
response_lengthtends to decrease or stabilize, indicating that RL might be optimizing for shorter, more direct paths to correct answers. -
Generation Entropy:
generation_entropyconsistently decreases for all algorithms as training progresses. This is a crucial observation, as lower entropy means the model's output distribution becomes sharper and less diverse. This reduction in diversity is a likely contributor to the observed narrowing of the reasoning boundary.
6.3.3. Effect of Number of Rollouts
The following figure (Figure 16 from the original paper) shows an ablation study on the KL loss and the rollout number n.

Figure 16: Ablation study on KL loss and rollout number n. For n increasing from 8 to 32, we keep the prompt batch size unchanged, which results in increased computation per training step. Due to resource constraints, we train for only 220 steps under this setting, leading to lower pass@1 as the model has not yet converged. Nevertheless, the model with n = 32 achieves a higher pass@k, highlighting the positive effect of larger rollout numbers in improving pass@k at higher values of k.
Analysis:
- Increasing the number of rollouts (n) from 8 to 32 (meaning more samples are generated per prompt during training) leads to a slight improvement in pass@k (specifically pass@128).
- However, even with more rollouts, the RL-trained model is still eventually outperformed by the base model at higher k values. This suggests that while more exploration during training can help, it doesn't fundamentally overcome the limitation of RLVR being bounded by the base model's capacity.
6.3.4. Effect of KL Loss
The following figure (Figure 16 from the original paper) also shows an ablation study on KL Loss.

Figure 16: Ablation study on KL loss and rollout number n. For n increasing from 8 to 32, we keep the prompt batch size unchanged, which results in increased computation per training step. Due to resource constraints, we train for only 220 steps under this setting, leading to lower pass@1 as the model has not yet converged. Nevertheless, the model with n = 32 achieves a higher pass@k, highlighting the positive effect of larger rollout numbers in improving pass@k at higher values of k.
Analysis:
- Adding a KL divergence term (with coefficient 0.001) to regularize model deviation results in similar pass@1 scores compared to GRPO without KL.
- However, the KL-regularized model exhibits a much lower pass@128. This indicates that constraining the model's deviation from the original policy (via KL loss) can further reduce its reasoning boundary, making it less capable of finding diverse solutions even with more attempts.
6.3.5. Effects of Entropy
The following figure (Figure 18 from the original paper) shows a comparison of Base and RLVR Models with Matched Output Entropy.

This figure compares pass@k scores and output entropy for the base model and the RLVR model at different sample counts k across several datasets (AIME24, AMC23, GSM8K, MATH500, Minerva, and Olympiad); each subplot reports the base model's output entropy in its title.
Figure 18: Comparison of Base and RLVR Models with Matched Output Entropy. We evaluate the base model (Qwen2.5-7B) on each dataset at the default temperature and report its output entropy in the title of each subplot. The RLVR model's sampling temperature is then raised so that its output entropy approximately matches that of the base model (for example, a higher temperature is used on AMC23 to match the base model's entropy). We also include RLVR results at the default temperature as an additional baseline, which has lower entropy—e.g., 0.22 on AMC23 and 0.33 on MATH500.
Analysis:
- As observed in Figure 19, RL training typically decreases the model's output entropy. This reduction in diversity is hypothesized to contribute to the narrowing of the reasoning boundary.
- To test this, the authors increased the generation temperature of the RLVR-trained model to match the output entropy of the base model.
- While the RLVR model performs slightly better (pass@k) at higher temperatures (matched entropy) compared to its own performance at the default sampling temperature (lower entropy), it still underperforms the base model across all pass@k values.
- This suggests that while reduced entropy contributes to the narrowing of the reasoning boundary, it is not the sole cause. Other factors, such as the inherent limitation of policy gradient in exploring fundamentally new reasoning patterns, are also at play.
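To illustrate how the output entropy used in this matching procedure can be measured, here is a minimal PyTorch sketch (our own assumption, not the paper's code) that computes the average per-token entropy of a generation from its logits.

```python
import torch

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Average per-token entropy (in nats) of one generation.

    `logits` has shape [num_tokens, vocab_size]; each row holds the model's
    pre-softmax scores for one generated token position.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy_per_token = -(probs * log_probs).sum(dim=-1)  # H_t = -sum_v p * log p
    return entropy_per_token.mean().item()
```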
6.4. CoT Case Analysis
The following figure (Figure 20 from the original paper) shows Qwen2.5-Base-7B Correct Response - Case 1.

Figure 20: Qwen2.5-Base-7B Correct Response - Case 1.
The following figure (Figure 21 from the original paper) shows Qwen2.5-Base-7B Correct Response - Case 2.
Analysis: These figures present examples of complex Chain-of-Thought (CoT) reasoning produced by the Qwen2.5-Base-7B model for challenging AIME24 questions.
- The responses are notably long and exhibit reflective behavior: the model attempts multiple approaches, identifies errors, and corrects itself.
- The base model successfully generates correct CoTs for these problems, even though they are among the hardest questions, where RLVR-trained models struggle even at high k. This manual inspection corroborates the quantitative findings: the strong reasoning ability is already inherent in the base model's sampling distribution; RLVR does not introduce it but rather makes it more accessible or efficient under certain conditions.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study systematically investigated whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely enhances the reasoning capacity of Large Language Models (LLMs) beyond their base models. The findings, consistent across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, reveal that current RLVR methods primarily improve the sampling efficiency of LLMs, making them more likely to generate correct answers on initial attempts (pass@1). However, RLVR-trained models consistently exhibit a narrower reasoning coverage compared to their base models when evaluated with a large number of samples (pass@k at large k). Detailed analyses using accuracy distributions and perplexity show that the reasoning paths generated by RLVR models are already present in and bounded by the base model's sampling distribution. Unlike RLVR, distillation is demonstrated to genuinely expand reasoning capabilities by introducing new patterns from a stronger teacher model. The paper concludes that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs.
7.2. Limitations & Future Work
Limitations pointed out by the authors:
- Proprietary Models: The analysis is constrained by the fact that many of the most capable models and training pipelines are proprietary, limiting the ability to conduct full evaluations on all state-of-the-art systems.
- Rapid Evolution of RL for LLMs: The field is rapidly evolving, and emerging techniques might mitigate some of the identified limitations. The conclusions should be interpreted with this dynamic context in mind.
Future Research Directions suggested by the authors:
- Efficient Exploration Strategies in High-Level Abstraction: Developing mechanisms like AlphaEvolve (Novikov et al., 2025) for self-evolution in a program-level abstraction space could enable the discovery of out-of-prior reasoning patterns.
- Data Scaling via Curriculum: A more deliberate, large-scale data-RL iteration pipeline with a curriculum-based approach (training on easier subproblems first) could help improve sampling efficiency, acquire meta-skills, and enable RLVR to obtain meaningful rewards on challenging tasks.
- Process Reward and Fine-Grained Credit Assignment: Incorporating intermediate signals to guide the reasoning trajectory, rather than only binary outcome rewards, could significantly improve exploration efficiency and direct the model toward more promising solution paths.
- Agentic RL: Moving beyond single-turn responses to multi-turn agentic RL, with richer interactions, iterative refinement, and the ability to actively collect new information (e.g., using search tools), could unlock the potential for truly novel experiences and learning.
7.3. Personal Insights & Critique
This paper provides a crucial and timely reality check on the current state of RLVR for LLMs. The widespread belief that RLVR automatically leads to emergent, novel reasoning capabilities, akin to AlphaGo, is a powerful narrative, but this study rigorously challenges it. The distinction between sampling efficiency and reasoning capacity expansion is paramount and often conflated in the LLM community's discussion of "self-improvement." The consistent finding that base models, given enough attempts, often surpass or match RLVR models in terms of total solvable problems is a profound insight.
The perplexity analysis is particularly elegant, offering a quantitative explanation for why RLVR doesn't introduce novelty: it's merely making the model more confident and precise within its existing knowledge distribution, not pushing it beyond. The observed narrowing of the reasoning boundary with continued RL training (asymptotic effect and decreased entropy) is a concerning side effect that warrants more attention, highlighting a potential overfitting to reward signals that prioritizes narrow high-reward paths over broad exploration.
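As an illustration of the kind of check involved, the following sketch scores an RLVR-generated response under the base model using Hugging Face Transformers; it is not the paper's exact pipeline, and the function name and the commented usage (including the checkpoint id) are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def response_perplexity(base_model, tokenizer, prompt: str, response: str) -> float:
    """Perplexity of a (possibly RLVR-generated) response under the base model.

    Prompt tokens are masked out of the loss so that only the response is scored;
    a low perplexity means the response already lies in a high-likelihood region
    of the base model's distribution.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt positions in the loss
    loss = base_model(full_ids, labels=labels).loss  # mean NLL over response tokens
    return torch.exp(loss).item()

# Hypothetical usage with a base checkpoint:
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
# ppl = response_perplexity(base, tok, prompt_text, rlvr_response_text)
```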
Applicability: The methods and conclusions are highly transferable. Researchers and practitioners employing RLVR for any LLM task (beyond just math/coding/visual) should consider using pass@k at large k to genuinely assess whether their RL training is achieving true capability expansion or just efficiency gains; the standard unbiased estimator is sketched below. The findings also strongly advocate for exploring distillation as a viable, and perhaps more effective, alternative for injecting new reasoning patterns.
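For reference, the standard unbiased pass@k estimator that such evaluations typically rely on can be computed as in the following sketch; the variable names and example numbers are illustrative assumptions.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per problem, c of them correct.

    Equals 1 - C(n - c, k) / C(n, k); returns 1.0 when fewer than k samples are incorrect.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average over problems, e.g. with n = 256 samples per problem (illustrative numbers).
results = [(256, 3), (256, 0), (256, 40)]  # (n, num_correct) per problem
k = 128
print(sum(pass_at_k(n, c, k) for n, c in results) / len(results))
```

Sampling many completions per problem (n much larger than k) keeps the estimate low-variance, which is what makes comparisons between base and RLVR models at large k meaningful.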
Potential Issues/Areas for Improvement:
- Definition of "Novel Reasoning Patterns": While the paper provides strong evidence that RLVR does not expand beyond the base model's sampling distribution, the definition of "novel reasoning patterns" could be more explicitly formalized. What constitutes a genuinely novel pattern versus a higher-probability existing one?
- Exploration of Failure Cases: A deeper qualitative analysis of the types of problems RLVR models fail on (but base models solve at high k) could yield more insight into the precise nature of the reasoning-boundary narrowing. What specific conceptual gaps or missing strategies emerge after RLVR?
- The "Astronomically Large k" Counterargument: The paper states that base models might solve "Type 3" problems (those solved only by RLVR models) given an astronomically large k. While this is theoretically true, in practice a base model that requires an excessively large k is not solving the problem in any useful sense. The focus should remain on realistic values of k.
- Teacher Model in Distillation: The success of distillation relies heavily on the strength of the teacher model. The paper hints at this, but a more explicit discussion of what makes a teacher "strong" (e.g., its own training data, architecture, or simply human supervision) would be beneficial.
- Future RL Paradigms: The proposed future directions (exploration, curriculum, process rewards, agentic RL) are compelling. Future work could implement these to see whether they genuinely overcome the limitations identified here.
Overall, this paper is a significant contribution, providing a sober and evidence-based perspective that should guide the future development of RL for LLMs, moving beyond optimistic assumptions towards more targeted and effective paradigms.