Paper status: completed

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Published: 04/19/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study investigates the effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) in enhancing the reasoning capabilities of large language models (LLMs). It finds that current setups fail to elicit new reasoning patterns, with base models performing better at large k in pass@k: RLVR mainly improves sampling efficiency within the base model's existing reasoning boundary, whereas distillation can genuinely expand it.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

1.2. Authors

Yang Yue*, Zhiqi Chen*, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. *Equal Contribution, †Project Lead, Corresponding Author. Affiliations: LeapLab, Tsinghua University; Shanghai Jiao Tong University. The authors are primarily affiliated with Tsinghua University, a prominent research institution known for its strong programs in computer science and artificial intelligence. Gao Huang is a well-known researcher in the deep learning community, particularly for his work on DenseNet.

1.3. Journal/Conference

This paper is published on arXiv, a preprint server. While not a peer-reviewed journal or conference in its current form, arXiv is a widely used platform for disseminating cutting-edge research quickly within the AI and machine learning communities. Papers on arXiv often undergo subsequent peer review and publication in prestigious conferences or journals.

1.4. Publication Year

2025

1.5. Abstract

The paper investigates the effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) in enhancing the reasoning capabilities of large language models (LLMs). While RLVR has shown success in math and programming tasks and is believed to enable LLMs to acquire novel reasoning abilities, this study critically examines this premise. Using pass@k at large k values as an evaluation metric across various model families, RL algorithms, and benchmarks (math, coding, visual reasoning), the authors find that current RLVR training does not elicit fundamentally new reasoning patterns. RLVR-trained models outperform base models at small k (e.g., k = 1), but base models achieve higher pass@k scores at large k. Coverage and perplexity analyses suggest that the observed reasoning abilities originate from and are bounded by the base model. Six popular RLVR algorithms perform similarly and suboptimally in leveraging the base model's potential. In contrast, distillation is found to genuinely expand reasoning capabilities by introducing new patterns from a teacher model. The findings suggest that current RLVR methods have not yet realized RL's potential for novel reasoning and highlight the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction.

https://arxiv.org/abs/2504.13837 Publication Status: Preprint.

https://arxiv.org/pdf/2504.13837v5.pdf

2. Executive Summary

2.1. Background & Motivation

The development of reasoning-centric Large Language Models (LLMs) has seen significant advancements, particularly in complex logical tasks like mathematics and programming. Reinforcement Learning with Verifiable Rewards (RLVR) is identified as a key driver behind this progress. RLVR optimizes LLMs using automatically computable rewards (e.g., matching ground-truth solutions or passing unit tests), enabling scalable training without extensive human labeling. Inspired by the success of traditional RL in game playing, where agents discover novel strategies through self-improvement, it is widely believed that RLVR similarly allows LLMs to develop new reasoning patterns beyond their base models, potentially leading to continuously self-evolving LLMs.

However, despite its empirical success, the fundamental effectiveness of current RLVR is largely underexamined. The core problem the paper aims to solve is to critically assess whether current RLVR genuinely enables LLMs to acquire novel reasoning abilities, or if it merely optimizes the utilization of reasoning patterns already present in the base model. This question is crucial because the promise of RL for LLMs lies in its potential to discover truly new, emergent capabilities, rather than just refining existing ones. The gap in prior research is a systematic, rigorous evaluation of the reasoning capability boundaries of RLVR-trained models compared to their base models, especially under conditions that probe the full potential of a model (e.g., large sampling budgets).

The paper's entry point is to use the pass@k metric at large k values, which provides a more robust measure of a model's potential reasoning capacity than average-case performance alone. This allows for a deeper probe into whether RLVR truly expands the problem-solving boundary of LLMs.

2.2. Main Contributions / Findings

The paper makes several primary contributions and presents key findings that challenge conventional assumptions about RLVR:

  • RLVR Narrows Reasoning Coverage: While RLVR models outperform base models at small k (e.g., pass@1, which reflects sampling efficiency), base models consistently surpass RLVR models as k increases across all benchmarks and LLM families. This suggests that current RLVR training often narrows, rather than expands, the scope of solvable problems.
  • Reasoning Paths are Bounded by the Base Model: Further analysis (accuracy distribution, perplexity) reveals that the reasoning paths generated by current RLVR models largely exist within the sampling distribution of their base models. RLVR improves performance by more efficiently sampling correct responses for problems already solvable by the base model, but it does not enable the model to solve new problems. This indicates that RLVR does not introduce fundamentally new reasoning capabilities.
  • Similar Performance Across RLVR Algorithms: Six popular RLVR algorithms (PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO) perform similarly in terms of the Sampling Efficiency Gap ($\Delta_{\mathrm{SE}}$), and all remain far from optimally leveraging the potential of the base model. This suggests that the core limitation is not algorithm-specific but perhaps inherent to the current RLVR paradigm.
  • Distillation Expands Reasoning Boundaries: In contrast to RLVR, distillation can genuinely expand a model's reasoning capabilities by transferring new reasoning patterns from a stronger teacher model. Distilled models demonstrate an expanded reasoning scope beyond that of the base model, indicating a different mechanism of capability acquisition.
  • Need for Improved RL Paradigms: The findings highlight a significant gap between existing RLVR methods and the ideal goals of RL (discovering genuinely new strategies). The paper suggests the need for improved RL paradigms, such as effective exploration mechanisms, more deliberate and large-scale data curation, fine-grained process signals, and multi-turn agent interaction, to unlock RL's full potential for novel reasoning in LLMs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Large Language Models (LLMs): These are neural networks, often based on the transformer architecture, trained on vast amounts of text data to generate human-like text, translate languages, write different kinds of creative content, and answer questions in an informative way. They learn to predict the next word in a sequence, thereby developing a deep understanding of language structure, facts, and reasoning patterns.
  • Reinforcement Learning (RL): A subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent is not explicitly told what actions to take but discovers which actions yield the most reward by trial and error.
    • Agent: The decision-making entity (e.g., an LLM).
    • Environment: The system with which the agent interacts (e.g., the problem-solving task).
    • Action: A decision or output generated by the agent (e.g., generating a token, a reasoning step, or a final answer).
    • Reward: A scalar feedback signal from the environment indicating the desirability of an action (e.g., 1 for a correct answer, 0 for an incorrect one).
    • Policy: The agent's strategy for choosing actions based on its current state.
  • Reinforcement Learning with Verifiable Rewards (RLVR): A specific application of RL to LLMs where the reward signal is automatically computed based on verifiable outcomes, such as a mathematical solution being correct or code passing unit tests. This contrasts with Reinforcement Learning from Human Feedback (RLHF), where human annotators provide preference-based rewards. The key advantage of RLVR is its scalability, as it does not require human labeling.
  • Chain-of-Thought (CoT) Reasoning: A prompting technique for LLMs where the model is explicitly asked to show its step-by-step reasoning process before arriving at a final answer. This often improves performance on complex reasoning tasks by allowing the model to break down problems into smaller, more manageable steps.
  • Pass@k Metric: An evaluation metric typically used for code generation and reasoning tasks. It measures the probability that at least one out of k independent samples generated by a model is correct. A higher k value allows the model more "attempts" to solve a problem, potentially revealing its full reasoning capacity, as opposed to pass@1, which measures the accuracy of the first (or greedy) attempt.
  • Perplexity (PPL): A measure of how well a probability distribution or language model predicts a sample. It is calculated as the exponentiated average negative log-likelihood of a sequence. Lower perplexity indicates that the model assigns a higher probability to the observed sequence, meaning it predicts the sequence more confidently and accurately. In this paper, it's used to assess how likely the base model is to generate the responses produced by the RL-trained model.
  • Distillation: A technique where a smaller "student" model is trained to mimic the behavior of a larger, more powerful "teacher" model. This is typically done by training the student model on outputs (e.g., logits, soft targets, or reasoning traces) generated by the teacher model. Distillation can transfer knowledge and capabilities from the teacher to the student, often resulting in a smaller, more efficient model with comparable performance.

3.2. Previous Works

The paper primarily discusses works related to RLVR and its application to LLMs, particularly for mathematical and programming reasoning tasks. It references several key developments:

  • Early Reasoning-Centric LLMs:

    • OpenAI-o1 (Jaech et al., 2024): This work is cited as an encouraging landmark, being among the first large-scale applications of RL for reasoning, achieving state-of-the-art results.
    • DeepSeek-R1 (Guo et al., 2025): The first open-weight model to match or surpass the performance of OpenAI-o1. It introduced the "zero" setting, applying RL directly to a base LLM without intermediate supervised fine-tuning (SFT).
    • Kimi-1.5 (Team et al., 2025): Another reasoning-centric LLM mentioned for its advancements.
  • Traditional RL and Self-Improvement:

    • DQN (Mnih et al., 2015): Deep Q-Networks demonstrated human-level control in Atari games, showcasing RL's ability to learn complex strategies from scratch. The core idea is to approximate the optimal action-value function $Q^*(s, a)$ with a deep neural network: $ Q(s, a; \theta) \approx Q^*(s, a) $. The network is trained using experience replay and a target network to stabilize learning, minimizing the loss: $ L(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] $ where $\theta$ are the current network weights, $\theta^-$ are the target network weights, $\gamma$ is the discount factor, and U(D) denotes uniform sampling from the experience replay buffer D.
    • AlphaGo (Silver et al., 2017): Demonstrated superhuman performance in Go, largely through self-play and Monte Carlo Tree Search (MCTS) guided by deep neural networks. This work highlighted RL's capacity for autonomous strategy discovery.
  • RL Algorithms for LLMs:

    • Proximal Policy Optimization (PPO) (Schulman et al., 2017): A widely used policy gradient algorithm for RL. PPO aims to keep the new policy close to the old policy while taking the largest possible improvement step. This is achieved through a clipped surrogate objective function: $ \mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t \left[ \min(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t) \right] $ where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}$ is the ratio of new to old probabilities, $A_t$ is the advantage estimate, and $\epsilon$ is a clipping hyperparameter. PPO is designed for stability and efficiency.
    • GRPO (Shao et al., 2024): A critic-free variant that estimates advantage using a normalized reward within a group of responses to the same question.
    • Reinforce++ (Hu, 2025): A simple and efficient policy gradient method for aligning LLMs.
    • RLOO (Ahmadian et al., 2024): Adopts a leave-one-out baseline for advantage estimation.
    • ReMax (Li et al., 2024): Uses the greedy response reward as the advantage baseline.
    • DAPO (Yu et al., 2025): An open-source LLM reinforcement learning system.
  • Instruction Tuning and Distillation:

    • Instruction-tuned approaches (Achiam et al., 2023; Grattafiori et al., 2024): Traditional methods that rely on human-curated annotations for fine-tuning LLMs.
    • Distillation (Guo et al., 2025): Training a student model on outputs (long CoT reasoning traces) generated by a stronger teacher model.

3.3. Technological Evolution

The evolution of LLMs has moved from initial pretraining on vast text corpora to various post-training phases aimed at enhancing specific capabilities:

  1. Pretraining: Initial large-scale training on diverse text data to learn language fundamentals.

  2. Supervised Fine-Tuning (SFT): Fine-tuning pretrained models on human-curated datasets for specific tasks or instruction following (e.g., instruction-tuned models).

  3. Reinforcement Learning from Human Feedback (RLHF): Training models with human preferences as rewards, often building on SFT.

  4. Reinforcement Learning with Verifiable Rewards (RLVR): A more scalable approach for reasoning tasks where objective, verifiable rewards (e.g., correctness in math, passing code tests) are automatically computed. This is where OpenAI-o1 and DeepSeek-R1 fit in, aiming for self-improvement and novel reasoning patterns.

  5. Distillation: A parallel approach to transfer knowledge from large, powerful models to smaller, more efficient ones, or to inject new reasoning patterns from highly capable "teacher" models.

    This paper's work fits within the RLVR and distillation phases, critically evaluating the former's ability to truly generate novel reasoning beyond the pretrained prior of the base model.

3.4. Differentiation Analysis

The core differentiation of this paper's approach lies in its systematic and rigorous investigation of the boundaries of reasoning capacity using pass@k at large k values, combined with perplexity and coverage analyses.

  • Traditional RLVR: Focuses on improving pass@1 (average performance) or overall accuracy, often implicitly assuming that improved performance implies the acquisition of novel reasoning abilities, similar to traditional RL agents exploring new strategies.
  • This Paper's Innovation:
    • Probing Reasoning Boundaries: Instead of just looking at pass@1, the paper uses pass@k at very large k (up to 1024 samples) to reveal the potential maximum number of problems a model can solve. This allows them to differentiate between increased sampling efficiency (making correct answers more likely) and increased reasoning capacity (solving problems previously unsolvable).

    • Systematic Comparison: It compares base models with RLVR-trained models across a wide array of LLM families, model sizes, RL algorithms, and benchmarks (math, coding, visual reasoning), providing a comprehensive view.

    • Deep Dive into Mechanisms: Through accuracy distribution and perplexity analysis, the paper investigates why RLVR behaves the way it does, concluding that it primarily sharpens the distribution within the base model's prior rather than expanding it.

    • Contrast with Distillation: By explicitly comparing RLVR with distillation, the paper highlights that distillation can introduce new reasoning patterns, unlike current RLVR, which remains bounded by the base model.

      Previous work might have observed a decline in pass@k post-RLVR (e.g., Dang et al., 2025) or similar trends limited to specific models (e.g., Deepseek-Math, Shao et al., 2024), but they did not conduct the comprehensive cross-model, cross-algorithm, and deep analytical investigation that this paper undertakes. This study provides a much more robust and generalized conclusion about the limitations of current RLVR.

4. Methodology

4.1. Principles

The core idea behind the methodology is to systematically evaluate and compare the reasoning capacity boundaries of base Large Language Models (LLMs) and their Reinforcement Learning with Verifiable Rewards (RLVR)-trained counterparts. The theoretical basis rests on the distinction between sampling efficiency and actual reasoning capability expansion. Sampling efficiency refers to how effectively a model can generate a correct solution on its first few attempts. Reasoning capability expansion refers to the model's ability to solve problems that it could not solve at all before, even with many attempts.

The intuition is that if RLVR truly helps LLMs acquire novel reasoning abilities, then an RLVR-trained model should be able to solve a wider range of problems than its base model, especially when given ample opportunity to sample multiple solutions (i.e., at large k in pass@k). If, however, RLVR merely makes the model better at finding solutions it already could find, then the base model, given enough attempts, should eventually match or even surpass the RLVR model in terms of the total number of solvable problems. The paper uses perplexity analysis to further investigate whether RLVR responses are truly "new" or just higher-likelihood versions of what the base model could produce.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Reinforcement Learning with Verifiable Rewards (RLVR) Fundamentals

The paper begins by outlining the general setup for RLVR. An LLM with parameters $\theta$, denoted $\pi_\theta$, generates a token sequence $\mathbf{y} = (y_1, \dots, y_T)$ conditioned on a natural-language prompt x. A deterministic verifier $\mathcal{V}$ provides a binary reward $r = \mathcal{V}(x, \mathbf{y}) \in \{0, 1\}$, where r = 1 signifies that the model's final answer is exactly correct. The objective of RL is to learn a policy (i.e., optimize $\theta$) that maximizes the expected reward.

The RL objective is: $ \mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{\mathbf{y} \sim \pi_\theta(\cdot|x)} [r] \right] $ Where:

  • $\mathcal{J}(\theta)$ is the expected reward to be maximized.
  • $\mathbb{E}_{x \sim \mathcal{D}}$ denotes the expectation over prompts x sampled from the distribution $\mathcal{D}$.
  • $\mathbb{E}_{\mathbf{y} \sim \pi_\theta(\cdot|x)}$ denotes the expectation over token sequences $\mathbf{y}$ generated by the policy $\pi_\theta$ conditioned on prompt x.
  • r is the binary reward received for the generated sequence $\mathbf{y}$ given prompt x.
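
To make the reward setup concrete, the following is a minimal sketch (not the authors' code) of a verifiable binary reward and a Monte Carlo estimate of the objective above. The `extract_final_answer` helper, the boxed-answer format, and the `sample_fn` callback are illustrative assumptions.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Hypothetical helper: pull the last \\boxed{...} answer from a CoT response."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else ""

def verifiable_reward(completion: str, ground_truth: str) -> int:
    """Binary reward r = V(x, y): 1 iff the final answer exactly matches the ground truth."""
    return int(extract_final_answer(completion) == ground_truth)

def estimate_objective(prompts, ground_truths, sample_fn, num_samples=8):
    """Monte Carlo estimate of J(theta) = E_x E_{y ~ pi_theta(.|x)} [r]."""
    total = 0.0
    for x, gt in zip(prompts, ground_truths):
        completions = sample_fn(x, num_samples)  # draws y ~ pi_theta(.|x)
        total += sum(verifiable_reward(y, gt) for y in completions) / num_samples
    return total / len(prompts)
```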

4.2.2. RLVR Algorithms

The paper compares several popular RL algorithms, primarily focusing on policy gradient methods.

Proximal Policy Optimization (PPO) PPO (Schulman et al., 2017) is a widely used algorithm that maximizes a clipped surrogate objective function to update the policy. This objective encourages improvements while preventing excessively large policy updates, which can lead to instability.

The clipped surrogate objective $\mathcal{L}_{\mathrm{CLIP}}$ is defined as: $ \mathcal{L}_{\mathrm{CLIP}} = \mathbb{E}_t \left[ \min(r_t(\theta) A_t, \ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t) \right] $ Where:

  • $\mathbb{E}_t$ denotes the empirical average over a batch of samples.

  • $r_t(\theta)$ is the ratio of the new policy's probability of generating token $y_t$ to the old policy's probability, given the prompt x and the previously generated tokens $\mathbf{y}_{<t}$. Specifically: $ r_t(\theta) = \frac{\pi_\theta(y_t \mid x, \mathbf{y}_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, \mathbf{y}_{<t})} $

  • $A_t$ is the advantage estimate for the action at time t, usually obtained with a value network $V_\phi$: $A_t = R_t - V_\phi(s_t)$, where $R_t$ is the discounted sum of future rewards and $V_\phi(s_t)$ is the estimated value of the state.

  • $\mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ clips the probability ratio $r_t(\theta)$ to the interval $[1-\epsilon, 1+\epsilon]$. This prevents large policy updates when the new policy deviates too much from the old one, promoting stable learning.

  • $\epsilon$ is a hyperparameter, typically a small value (e.g., 0.2), that controls the clipping range.

    Other RLVR algorithms mentioned and used in the study include:

  • GRPO (Shao et al., 2024): A critic-free variant where the advantage $A_i$ is estimated by normalizing rewards within a group of responses to the same question: $ A_i = \frac{r_i - \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})} $ where $\mathbf{r} = \{r_1, \dots, r_G\}$ is the set of rewards for the G sampled responses.

  • RLOO (Ahmadian et al., 2024): Adopts a leave-one-out baseline for advantage estimation within each batch.

  • Reinforce++ (Hu, 2025), ReMax (Li et al., 2024), DAPO (Yu et al., 2025): These are also policy gradient methods with specific choices of advantage estimation or other enhancements. For a fair comparison, the authors remove the KL divergence term (optionally applied in PPO) to avoid constraining model learning, following practices in DAPO and Oat-Zero. A minimal sketch of the clipped objective and the group-normalized advantage appears at the end of this subsection.

    Zero-RL Training: This refers to applying RL directly to the base model without any supervised fine-tuning (SFT). The paper primarily uses this setting for math tasks to isolate the effect of RLVR. For coding and visual reasoning, instruction-tuned models are used as starting points, consistent with open-source practices.
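
The sketch below illustrates the two ingredients discussed above: the PPO clipped surrogate loss and a GRPO-style group-normalized advantage. It is a simplified sketch, not the paper's training code: it works with sequence-level rather than per-token log-probabilities, and the small epsilon added to the standard deviation is an assumption for numerical stability.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (negated so that gradient descent maximizes it).
    logp_new / logp_old: log-probabilities of each sampled response under the
    current and old policies; advantages: one advantage per response."""
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

def grpo_advantages(rewards, eps=1e-8):
    """Critic-free GRPO advantage: normalize the rewards of the G responses
    sampled for the same prompt, A_i = (r_i - mean(r)) / std(r)."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of 8 sampled responses to one prompt, 3 verified correct.
adv = grpo_advantages([1, 0, 0, 1, 0, 0, 1, 0])
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
loss = ppo_clip_loss(logp_new, logp_old, adv)
```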

4.2.3. Metrics for LLM Reasoning Capacity Boundary

The paper emphasizes the pass@k metric as crucial for evaluating reasoning capacity boundaries.

Pass@k Metric Definition and Estimation: Given a problem, k outputs are sampled from the model. The pass@k for this question is 1 if at least one of the k samples passes verification (is correct); otherwise, it is 0. The average pass@k over a dataset reflects the proportion of problems the model can solve within k trials.

To accurately estimate pass@k with low variance, the paper adopts an unbiased estimator from Chen et al. (2021). For each problem $x_i$ in the evaluation dataset $\mathcal{D}$, n samples are generated (where n ≥ k), and $c_i$ is the number of correct samples.

The unbiased estimator of pass@k over the dataset is given by: $ \operatorname{pass@}k := \mathbb{E}_{x_i \sim \mathcal{D}} \left[ 1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}} \right] $ Where:

  • $\operatorname{pass@}k$ is the estimated pass rate when sampling k times.

  • $\mathbb{E}_{x_i \sim \mathcal{D}}$ denotes the expectation (average) over all problems $x_i$ in the dataset $\mathcal{D}$.

  • $c_i$ is the number of correct samples obtained for problem $x_i$ out of n total samples.

  • $\binom{N}{K}$ is the binomial coefficient, representing the number of ways to choose K items from a set of N items, calculated as $\frac{N!}{K!(N-K)!}$.

  • The term $1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}}$ is the probability that at least one of the k chosen samples (from the n total samples for problem $x_i$) is correct, i.e., 1 minus the probability that all k chosen samples are incorrect. (A minimal implementation sketch follows after this list.)

    The pass@k metric is preferred over Best-of-N or Majority Voting because it focuses on the model's potential to solve a problem, regardless of whether a specific selection method identifies the correct answer.
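
A minimal implementation sketch of the unbiased estimator above, using the numerically stable product form from the reference implementation accompanying Chen et al. (2021); the example numbers are illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate of pass@k given n samples, c of them correct:
    1 - C(n-c, k) / C(n, k), computed as a stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def dataset_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a dataset, given per-problem correct counts c_i."""
    return float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))

# e.g. three problems, n = 256 samples each, with 0, 5, and 200 correct samples
print(dataset_pass_at_k([0, 5, 200], n=256, k=8))
```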

4.2.4. Perplexity Analysis

To determine if reasoning paths generated by RLVR models already exist within the output distribution of their base models, the paper uses perplexity.

Given a model m, a problem x, and a response $\mathbf{Y} = (y_1, \dots, y_T)$, the perplexity $\mathrm{PPL}_m(\mathbf{Y} \mid x)$ reflects the model's ability to predict the given response $\mathbf{Y}$ conditioned on the prompt x.

The perplexity is defined as: $ \operatorname{PPL}_m(\mathbf{Y} \mid x) = \exp \left( - \frac{1}{T} \sum_{t=1}^T \log P(y_t \mid x, y_1, \ldots, y_{t-1}) \right) $ Where:

  • $\mathrm{PPL}_m(\mathbf{Y} \mid x)$ is the perplexity of the response $\mathbf{Y}$ given prompt x, according to model m.

  • $\exp(\cdot)$ is the exponential function.

  • T is the total number of tokens in the response $\mathbf{Y}$.

  • $\log P(y_t \mid x, y_1, \ldots, y_{t-1})$ is the log-likelihood of the t-th token $y_t$ given the prompt x and all preceding tokens $y_1, \ldots, y_{t-1}$, as computed by model m.

  • A lower perplexity indicates that the model assigns a higher probability to the response sequence, meaning it finds the sequence more "expected" or "natural".

    By comparing $\mathrm{PPL}_{\mathrm{Base}}(\mathbf{Y}_{\mathrm{RL}} \mid x)$ (perplexity of RL-model responses under the base model) with $\mathrm{PPL}_{\mathrm{Base}}(\mathbf{Y}_{\mathrm{Base}} \mid x)$ (perplexity of base-model responses under the base model), the authors infer whether RL-generated responses lie within the base model's high-likelihood distribution.
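
As a concrete illustration, here is a minimal sketch of computing the perplexity of a response under the base model (assumptions: a Hugging Face causal LM, sequences short enough to score in one forward pass, and that tokenizing the prompt and the concatenated prompt+response split at the same boundary; the model name in the comment is an example).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_perplexity(model, tokenizer, prompt: str, response: str) -> float:
    """PPL_m(Y | x): exp of the average negative log-likelihood that model m
    assigns to the response tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t-1 predicts token t
    targets = full_ids[:, 1:]
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    resp_logp = token_logp[:, prompt_len - 1:]               # keep only the response tokens
    return float(torch.exp(-resp_logp.mean()))

# e.g. score an RL-model response under the base model (model name is illustrative):
# base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
# ppl = response_perplexity(base, tok, problem_text, rl_response_text)
```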

4.2.5. Distillation Methodology

For distillation, the paper focuses on a representative model, DeepSeek-R1-Distill-Qwen-7B. This model distills knowledge from a stronger teacher model (DeepSeek-R1) into a student model (Qwen2.5-Math-7B). The training data for distillation consists of long Chain-of-Thought (CoT) reasoning traces generated by the teacher model. This is analogous to instruction-following fine-tuning, but with richer, more complex reasoning data. The effectiveness of distillation is then evaluated using the same pass@k metric to see if it genuinely expands the reasoning boundary.
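
The following is a minimal sketch of what such distillation training amounts to: supervised fine-tuning of the student on teacher-generated long-CoT traces with next-token cross-entropy, masking the prompt tokens. It is an illustrative sketch, not the actual DeepSeek-R1 distillation recipe; the model names and hyperparameters are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative names; the actual distillation recipe follows the DeepSeek-R1 report.
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def distill_step(problem: str, teacher_trace: str) -> float:
    """One SFT step on a teacher-generated long-CoT trace: standard next-token
    cross-entropy on the trace, with the prompt tokens masked out of the loss."""
    prompt_len = tok(problem, return_tensors="pt").input_ids.shape[1]
    ids = tok(problem + teacher_trace, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :prompt_len] = -100                  # ignore loss on the prompt
    loss = student(input_ids=ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return float(loss)
```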

4.2.6. Sampling Efficiency Gap ($\Delta_{\mathrm{SE}}$)

To quantify the difference in sampling efficiency, the authors propose the Sampling Efficiency Gap ($\Delta_{\mathrm{SE}}$), defined as: $ \Delta_{\mathrm{SE}} = \operatorname{pass@k}_{\mathrm{Base}} - \operatorname{pass@1}_{\mathrm{RL}} $ Where:

  • $\operatorname{pass@1}_{\mathrm{RL}}$ is the pass@1 score of the RL-trained model.
  • $\operatorname{pass@k}_{\mathrm{Base}}$ is the pass@k score of the base model, with k = 256 used as a proxy for its upper-bound performance.
  • A lower $\Delta_{\mathrm{SE}}$ indicates that the RL algorithm is closer to optimal sampling efficiency, meaning its pass@1 is closer to the base model's full potential.
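
As a sanity check on the definition, a one-line computation of the gap (a sketch; the example numbers are taken from Table 3 later in this analysis):

```python
def sampling_efficiency_gap(passk_base: float, pass1_rl: float) -> float:
    """Delta_SE: how far the RL model's pass@1 falls short of the base model's
    pass@k upper bound (k = 256 is used as the proxy in the paper)."""
    return passk_base - pass1_rl

# Example on Omni-MATH-Test: base pass@256 = 69.1, GRPO pass@1 = 25.1
print(sampling_efficiency_gap(69.1, 25.1))  # -> 44.0 with these rounded table values
```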

5. Experimental Setup

5.1. Datasets

The study evaluates models across three representative domains: mathematics, code generation, and visual reasoning.

  • Mathematics:

    • GSM8K (Cobbe et al., 2021): A dataset of grade school math word problems.
    • MATH500 (Hendrycks et al., 2021): A dataset of challenging mathematics problems.
    • Minerva (Lewkowycz et al., 2022): A dataset containing diverse math and science problems, known for requiring advanced reasoning.
    • Olympiad (He et al., 2024): Problems from international math olympiads, representing extremely challenging reasoning tasks.
    • AIME24, AMC23: Problems from American Invitational Mathematics Examination and American Mathematics Competitions, respectively. These are challenging competitive math problems.
    • Omni-MATH-Rule (Gao et al., 2025): A subset of the Omni-MATH dataset containing verifiable problems. Used for in-domain and out-of-domain generalization studies.
  • Code Generation:

    • LiveCodeBench v5 (Jain et al., 2025): Comprises 279 coding problems spanning August 2024 to January 2025.
    • HumanEval+ (Liu et al., 2023): A widely used benchmark for evaluating code generation models.
    • MBPP+ (Liu et al., 2023): Another popular benchmark for code generation models, focusing on simpler Python programming problems.
  • Visual Reasoning:

    • Geometry3K (Lu et al., 2021): Used for training visual reasoning models, focusing on geometry problems.

    • MathVista-TestMini (Lu et al., 2024): A filtered version of MathVista, specifically for evaluating math in visual contexts, with multiple-choice questions removed.

    • MathVision-TestMini (Wang et al., 2024): Similar to MathVista-TestMini, focusing on visual math reasoning.

      The datasets were chosen to cover a range of difficulty levels and reasoning types (numerical, logical, symbolic, multi-modal), ensuring a robust validation of the methods. The inclusion of competitive programming and math olympiad problems pushes the boundaries of LLM reasoning.

5.2. Evaluation Metrics

The primary evaluation metric used across all tasks is pass@k.

  • Pass@k:

    1. Conceptual Definition: Pass@k measures the probability that at least one out of k sampled outputs from a model correctly solves a given problem. It reflects the model's potential to solve a problem if given multiple attempts, thereby revealing its reasoning capacity boundary. A higher pass@k indicates a greater ability to eventually find a correct solution.
    2. Mathematical Formula: For each problem $x_i$ from the evaluation dataset $\mathcal{D}$, we generate n samples (where n ≥ k) and count the number of correct samples as $c_i$. The unbiased estimator of pass@k over the dataset is given by: $ \operatorname{pass@}k := \mathbb{E}_{x_i \sim \mathcal{D}} \left[ 1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}} \right] $
    3. Symbol Explanation:
      • $\operatorname{pass@}k$: The estimated pass rate for a model when k samples are drawn per problem.

      • $\mathbb{E}_{x_i \sim \mathcal{D}}$: The expectation (average) over all problems $x_i$ in the dataset $\mathcal{D}$.

      • n: The total number of samples generated for each problem during evaluation (where n ≥ k).

      • $c_i$: The number of correct samples obtained for problem $x_i$ out of n total samples.

      • $\binom{N}{K}$: The binomial coefficient, read as "N choose K" and calculated as $\frac{N!}{K!(N-K)!}$. It represents the number of ways to choose K items from a set of N items without regard to order.

      • The term $1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}}$ gives the probability that at least one of the k selected samples (from the n available samples for problem $x_i$) is correct, i.e., 1 minus the probability that all k selected samples are incorrect (chosen from the $n - c_i$ incorrect samples).

        The paper typically uses n = 128, 256, or 1024, corresponding to the largest k value in the pass@k curves.

  • Perplexity (PPL):

    1. Conceptual Definition: Perplexity is a measure of how well a probability distribution or language model predicts a sample. It quantifies how "surprised" the model is by a given sequence of words or tokens. A lower perplexity indicates that the model assigns a higher probability to the sequence, meaning it predicts the sequence more confidently and accurately. In this study, it helps assess whether RL-generated responses are high-likelihood outcomes for the base model.
    2. Mathematical Formula: $ \operatorname{PPL}_m(\mathbf{Y} \mid x) = \exp \left( - \frac{1}{T} \sum_{t=1}^T \log P(y_t \mid x, y_1, \ldots, y_{t-1}) \right) $
    3. Symbol Explanation:
      • $\mathrm{PPL}_m(\mathbf{Y} \mid x)$: The perplexity of the response $\mathbf{Y}$ given prompt x, as computed by model m.
      • $\exp(\cdot)$: The exponential function (base e).
      • T: The total number of tokens in the response $\mathbf{Y}$.
      • $\log P(y_t \mid x, y_1, \ldots, y_{t-1})$: The natural logarithm of the probability assigned by model m to the t-th token $y_t$, given the initial prompt x and all preceding tokens $y_1, \ldots, y_{t-1}$.
      • The term $-\frac{1}{T} \sum_{t=1}^T \log P(y_t \mid x, y_1, \ldots, y_{t-1})$ is the average negative log-likelihood per token.

5.3. Baselines

The study primarily compares the RLVR-trained models against their corresponding base models.

  • Base Models: These are the original pretrained LLMs (or instruction-tuned models for coding/visual tasks) from which the RLVR models are derived. Examples include Qwen2.5-7B/14B/32B-Base, LLaMA-3.1-8B, Qwen2.5-Math-7B, Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Qwen-14B, and Qwen2.5-VL-7B.
  • Other Baselines:
    • Instruction-tuned Models: For certain comparisons (e.g., in Figure 7), Qwen2.5-Math-7B-Instruct is included as a baseline to show the effect of supervised instruction fine-tuning.

    • Distilled Models: DeepSeek-R1-Distill-Qwen-7B is used as a baseline to compare the effect of distillation against RLVR.

    • Different RLVR Algorithms: For the deep analysis in Section 4.3, various RL algorithms (PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO) are implemented and compared against each other, all starting from Qwen2.5-7B.

    • Magistral-Medium-2506: Compared against its base model, Mistral-Medium-3-2505, for evaluating model size scaling effects.

      The base models are the natural reference point, since they represent the state of the model before RLVR training and thus allow a direct assessment of RLVR's impact. Distilled and instruction-tuned models provide different paradigms for LLM capability enhancement, offering a broader comparative context.

5.4. Experimental Protocols

  • Sampling: For both base and RLVR models, a temperature of 0.6 and a top-p value of 0.95 were used. A maximum generation length of 16,384 tokens was allowed.
  • Prompting for Base Models: To ensure a fair comparison and avoid confounding effects, few-shot examples were not used for base models. The same zero-shot prompt (as used in RLVR training or the benchmark's default prompt) was used for both base and RLVR models.
  • Zero-RL Setting: For math tasks, zero-RL (RL applied directly to pretrained models without SFT) was followed. For coding and visual reasoning, instruction-tuned models were used as starting points for RLVR, reflecting common practice.
  • Manual CoT Inspection: For mathematical problems, where "guessing" correct answers is a concern, a subset of Chain-of-Thought (CoT) responses for challenging problems were manually inspected to confirm that correct final answers stemmed from genuinely correct reasoning paths. This validates the pass@k metric for math.
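
A minimal sketch of the decoding configuration described above, using Hugging Face `generate`; this is not the authors' evaluation harness, and the model name, sample count, and batching are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; the paper evaluates many base/RLVR checkpoints with these settings.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

def sample_responses(prompt: str, n: int = 16):
    """Draw n samples with the paper's decoding settings:
    temperature 0.6, top-p 0.95, up to 16,384 new tokens, zero-shot prompt."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        max_new_tokens=16384,
        num_return_sequences=n,
    )
    prompt_len = inputs.input_ids.shape[1]
    return [tok.decode(seq[prompt_len:], skip_special_tokens=True) for seq in out]
```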

6. Results & Analysis

6.1. Core Results Analysis

The core results consistently demonstrate a surprising trend: while RLVR-trained models outperform their base models at small k (e.g., pass@1), base models achieve higher pass@k scores when k is large. This suggests that RLVR primarily improves the sampling efficiency on already solvable problems rather than expanding the reasoning capacity to solve fundamentally new problems.

6.1.1. RLVR for Mathematical Reasoning

The following figure (Figure 2 from the original paper) shows the pass@k curves of base models and their RLVR-trained counterparts across multiple mathematical benchmarks.

Figure 2: Pass@k curves of base models and their RLVR-trained counterparts across multiple mathematical benchmarks. When k is small, RL-trained models outperform their base versions. However, as k increases to the tens or hundreds, base models consistently catch up to and surpass the RL-trained models. More results on GSM8K and AMC23 can be found in Figure 10.

Analysis:

  • Small k (e.g., k = 1): On all mathematical benchmarks (e.g., GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23), RLVR-trained models (red/orange lines) show higher pass@1 scores than their corresponding base models (green/blue lines). This indicates that RLVR makes the models more likely to sample a correct response on their first attempt, confirming its role in improving sampling efficiency.

  • Large k (e.g., k = 128, k = 1024): As the number of samples k increases, the pass@k curves for base models rise more steeply and consistently catch up to and eventually surpass the RLVR-trained models. For instance, on Minerva (Qwen2.5-32B), the base model outperforms the RL-trained model by approximately 9% at k = 128. This suggests that while RLVR models are more efficient, base models have a broader coverage of solvable problems and can solve more problems overall if given enough attempts.

  • RLVR Narrows Coverage: This contrasting trend implies that current RLVR training does not expand the scope of problems an LLM can solve. Instead, it seems to make the model "forget" or deprioritize some reasoning paths, leading to a narrower reasoning boundary than the base model at high k. The following figure (Figure 10 from the original paper) shows more results of SimpleRLZoo on GSM8K and AMC23.

    Figure 10: More results of SimpleRLZoo on GSM8K and AMC23.

    Analysis: This figure further supports the observation from Figure 2, showing consistent trends for various Qwen2.5 models and LLaMA-3.1-8B on GSM8K and AMC23. For all models, the RL-trained version initially outperforms the base model at small k, but the base model eventually surpasses it at larger k values.

The following figure (Figure 11 from the original paper) shows Oat-Zero-7B and DAPO-32B evaluated on AIME24 and compared against their respective base models.

Figure 11: Oat-Zero-7B and DAPO-32B are evaluated on AIME24 and compared against their respective base models.

Analysis: Even for strong RL models like Oat-Zero and DAPO, which demonstrate significant initial performance gains (e.g., nearly 30% higher than the base model at pass@1 for Oat-Zero), the base models eventually surpass them in pass@k as k increases. This reinforces the conclusion that RLVR improves initial sampling efficiency but does not fundamentally expand reasoning capacity.

6.1.2. RLVR for Code Generation

The following figure (Figure 4 (left) from the original paper) shows pass@k curves of base and RLVR models for code generation.

Figure 4: Pass@k curves of base and RLVR models. (Left) Code Generation. (Right) Visual Reasoning.

Analysis: The trends for code generation (LiveCodeBench, HumanEval+, MBPP+) are highly consistent with those observed in mathematical reasoning. The RLVR models (Coder-R1-Zero-Qwen2.5-7B and DeepCoder-14B-Preview) initially show higher pass@k at small k values, but their base counterparts eventually outperform them at larger k. Since code problems are less susceptible to "guessing" (thanks to unit tests), this further strengthens the argument that RLVR's primary effect is on sampling efficiency.

The following figure (Figure 12 from the original paper) shows Coder-R1 on LiveCodeBench.

Figure 12: Coder-R1 on LiveCodeBench.

Analysis: This figure, including separate plots for 2023 and 2024 LiveCodeBench tasks, consistently shows the same pattern. The RL-trained Coder-R1-Zero (red line) starts higher but is eventually surpassed by the Qwen2.5-7B-Instruct-1M base model (blue line) as k increases.

6.1.3. RLVR for Visual Reasoning

The following figure (Figure 4 (right) from the original paper) shows pass@k curves of base and RLVR models for visual reasoning.

Figure 4: Pass@k curves of base and RLVR models. (Left) Code Generation. (Right) Visual Reasoning.

Analysis: The results for visual reasoning (MathVista-TestMini and MathVision-TestMini) mirror the findings in math and coding. The RLVR-trained EasyR1 model shows an initial advantage at small k, but the base Qwen2.5-VL-7B model demonstrates broader coverage at higher k values. This suggests the observed phenomenon generalizes across modalities.

6.1.4. Accuracy Distribution Analysis

The following figure (Figure 5 from the original paper) shows the accuracy histogram of Qwen2.5-7B on Minerva.

Figure 5: Qwen2.5-7B Accuracy Histogram on Minerva.

Analysis: This histogram compares the distribution of problem accuracies for the base model (green) and the RL model (red).

  • RLVR increases the frequency of problems solved with high accuracy (near 1.0) and reduces the frequency of problems with low accuracies (e.g., 0.1, 0.2). This confirms that RLVR improves sampling efficiency for problems the base model could already solve.
  • Crucially, RLVR also leads to an increased frequency at accuracy 0, meaning more problems become completely unsolvable by the RL model compared to the base model. This directly supports the observation that RLVR narrows the model's overall coverage. The improvement in average scores (pass@1) comes from more efficient sampling on already solvable problems, not from solving new ones.
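
To make the construction of such a histogram concrete, here is a minimal sketch (assuming per-problem correct counts c_i out of n samples are already available for both models; all variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_accuracy_histogram(correct_counts_base, correct_counts_rl, n: int):
    """Compare the per-problem accuracy distributions (c_i / n) of the base and
    RL models, in the spirit of Figure 5."""
    acc_base = np.asarray(correct_counts_base) / n
    acc_rl = np.asarray(correct_counts_rl) / n
    edges = np.linspace(0.0, 1.0, 11)                 # 10 accuracy bins
    plt.hist(acc_base, bins=edges, alpha=0.5, label="Base")
    plt.hist(acc_rl, bins=edges, alpha=0.5, label="RL")
    plt.xlabel("per-problem accuracy (c_i / n)")
    plt.ylabel("number of problems")
    plt.legend()
    plt.show()
```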

6.1.5. Perplexity Analysis

The following figure (Figure 6 from the original paper) shows the perplexity distribution of responses.

Figure 6: Perplexity distribution of responses. The conditioning problem x is omitted in the figure.

Analysis: This figure shows the perplexity distribution of responses from different models, evaluated by the base model (PPL_Base).

  • PPL_Base(Y_Base | x) (left-most distribution, green) shows the perplexity of responses generated by the base model, evaluated by the base model itself.
  • PPL_Base(Y_RL | x) (middle distribution, red) shows the perplexity of responses generated by the RL model, evaluated by the base model. This distribution closely matches the lower portion of the PPL_Base(Y_Base | x) distribution. This means that the responses generated by the RL-trained models are highly likely to be generated by the base model; they are not "novel" or "out-of-distribution" for the base model.
  • PPL_Base(Y_GT | x) (right-most distribution, blue) shows the perplexity of responses generated by a powerful teacher model (OpenAI-o1), evaluated by the base model. These responses have much higher perplexity, indicating that they contain reasoning patterns that are outside the base model's typical output distribution. This analysis strongly suggests that RLVR primarily sharpens the distribution within the base model's prior, making it more likely to generate correct, but already "known," reasoning paths.

6.1.6. Distillation vs. RLVR

The following figure (Figure 7 from the original paper) shows pass@k of base, Instruct, RLVR, and distilled models.

Figure 7: Pass@k of base, Instruct, RLVR, and distilled models.

Analysis: This figure provides a crucial comparison.

  • Base (Qwen2.5-Math-7B, green) and RL (Qwen2.5-Math-7B-Oat-Zero, red) show the familiar pattern: RL outperforms at small k but is surpassed by Base at large k.
  • Instruct (Qwen2.5-Math-7B-Instruct, light blue) performs better than Base at small k but also falls below Base at large k.
  • Distilled (DeepSeek-R1-Distill-Qwen-7B, dark blue) consistently and significantly outperforms the base model across all k values. Its pass@k curve is always above that of the base model. This indicates that distillation can genuinely introduce new reasoning patterns from a stronger teacher model and expand the model's reasoning capabilities beyond the base model's boundary.

6.1.7. Effects of Model Size Scaling

The following figure (Figure 9 from the original paper) shows pass@k curves of Magistral-Medium.

Figure 9: Pass@k curves of Magistral-Medium.

Analysis: Even for larger, near-frontier models like Magistral-Medium (RLVR-enhanced) and its base Mistral-Medium-3, the same trend holds. RLVR provides significant gains at low k (e.g., 7-8 more problems solved at k = 1 on AIME24/25), but this performance gap steadily narrows or even reverses at higher k values. This suggests that the conclusions are not limited to smaller models but extend to highly capable LLMs, indicating a systemic limitation of current RLVR.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Task | Start Model | RL Framework | RL Algorithm(s) | Benchmark(s)
Mathematics | LLaMA-3.1-8B, Qwen2.5-7B/14B/32B-Base, Qwen2.5-Math-7B | SimpleRLZoo, Oat-Zero, DAPO | GRPO | GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23
Code Generation | Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Qwen-14B | Code-R1, DeepCoder | GRPO | LiveCodeBench, HumanEval+
Visual Reasoning | Qwen2.5-VL-7B | EasyR1 | GRPO | MathVista, MathVision
Deep Analysis | Qwen2.5-7B-Base, Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Qwen-7B | VeRL | PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO | Omni-Math-Rule, MATH500

The following are the results from Table 2 of the original paper:

Base | SimpleRLZoo | AIME24 | MATH500
✓ | ✓ | 63.3% | 92.4%
✓ | ✗ | 13.3% | 3.6%
✗ | ✓ | 0.0% | 1.0%
✗ | ✗ | 23.3% | 3.0%

Analysis of Table 2 (Solvable Problem Coverage): This table quantifies the overlap in solvable problems between the base model and the RLVR-trained model.

  • Base ✓, SimpleRLZoo ✓: A large majority of problems (63.3% on AIME24, 92.4% on MATH500) are solved by both models. This reinforces that RLVR primarily optimizes problems that are already within the base model's capabilities.

  • Base ✓, SimpleRLZoo ×: A significant fraction of problems (13.3% on AIME24, 3.6% on MATH500) are solved by the base model but not by the RLVR model. This directly supports the finding that RLVR narrows the model's overall coverage.

  • Base X, SimpleRLZoo ✓: Crucially, this category shows the percentage of problems that only the RLVR model solves, but the base model cannot. On AIME24, this is 0.0%, and on MATH500, it is only 1.0%. This near-zero percentage is the strongest evidence that RLVR does not enable the model to solve fundamentally new problems. The paper further notes that even these rare cases are often solvable by the base model if given an astronomically large number of samples (e.g., 1024 times).

    The following are the results from Table 3 of the original paper:

    Model | Omni-MATH-Train (pass@1 / pass@256) | Omni-MATH-Test (pass@1 / pass@256) | MATH500 (pass@1 / pass@256)
    Qwen2.5-7B | 9.9 / 67.2 | 10.2 / 69.1 | 34.5 / 96.2
    GRPO | 26.1 / 66.3 | 25.1 / 68.3 | 74.4 / 97.2
    PPO | 27.2 / 65.8 | 26.8 / 69.2 | 75.2 / 97.2
    ReMax | 24.4 / 65.5 | 23.8 / 67.5 | 73.5 / 96.6
    RLOO | 28.6 / 66.4 | 28.1 / 69.2 | 75.0 / 97.4
    Reinforce++ | 28.2 / 67.7 | 28.0 / 69.7 | 75.4 / 96.8
    DAPO | 31.4 / 66.1 | 26.5 / 67.0 | 75.6 / 96.4

Analysis of Table 3 (Different RL Algorithms): This table provides detailed pass@1 and pass@256 values for various RL algorithms compared to the base Qwen2.5-7B model across multiple datasets.

  • pass@1 Improvement: All RL algorithms significantly improve pass@1 compared to the base model (e.g., from 9.9 to 26.1-31.4 on Omni-MATH-Train). This confirms their effectiveness in improving sampling efficiency.

  • pass@256 Stability/Slight Decrease: However, for pass@256 (representing the reasoning boundary), most RL algorithms show either similar or slightly lower performance compared to the base model. For example, on Omni-MATH-Train, the base model has pass@256 of 67.2, while RL algorithms range from 65.5 (ReMax) to 67.7 (Reinforce++). This indicates that even with different algorithmic approaches, the fundamental limitation of not expanding the reasoning boundary persists.

  • Algorithm Performance: While DAPO achieves the highest pass@1, its pass@256 is not consistently the best, aligning with the general observation that pass@k tends to decrease as RLVR training progresses. RLOO and Reinforce++ maintain a good balance between pass@1 and pass@256.

    The following are the results from Table 4 of the original paper:

    Model | Omni-MATH-Train (pass@1 / pass@256) | Omni-MATH-Test (pass@1 / pass@256) | MATH500 (pass@1 / pass@256)
    Qwen2.5-7B | 9.9 / 67.2 | 10.2 / 69.1 | 34.5 / 96.2
    GRPO-step150 | 26.1 / 66.3 | 25.1 / 68.3 | 74.4 / 97.2
    GRPO-step300 | 33.6 / 65.3 | 27.1 / 66.6 | 76.3 / 96.0
    GRPO-step450 | 42.5 / 64.3 | 28.3 / 63.9 | 77.2 / 95.4

Analysis of Table 4 (RL Training Steps): This table shows the effect of increasing RL training steps on pass@1 and pass@256 using the GRPO algorithm.

  • pass@1 Improvement: As training progresses (from step 150 to 450), pass@1 consistently improves (e.g., from 26.1 to 42.5 on Omni-MATH-Train). This indicates that the model becomes more efficient at generating correct solutions with more training.

  • pass@256 Decrease: However, pass@256 (the reasoning boundary) progressively decreases as RLVR training progresses (e.g., from 66.3 to 64.3 on Omni-MATH-Train, and from 68.3 to 63.9 on Omni-MATH-Test). This is a critical finding, indicating that overtraining with RLVR can actually narrow the model's overall problem-solving capacity, making it less capable of solving a diverse set of problems even with many attempts.

    The following are the results from Table 5 of the original paper:

    Models Problem Indices
    Qwen2.5-7B-Base 0, 1, 4, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 22, 23, 24, 25, 26, 27, 28, 29
    SimpleRL-Qwen2.5-7B 0, 1, 6, 7, 8, 9, 12, 14, 15, 16, 18, 22, 23, 24, 25, 26, 27, 28, 29

Analysis of Table 5 (Solvable Problem Indices in AIME24): This table lists the specific problem indices (out of 30) solved by the base model and the RLVR model on AIME24 (with k = 1024).

  • The set of problems solved by SimpleRL-Qwen2.5-7B (RL) is largely a subset of the problems solved by Qwen2.5-7B-Base. For example, problems 4, 11, 17, 19 are solved by the base model but not by the RL model, while there are no problems solved by the RL model that are not also solved by the base model. This provides concrete evidence for the "subset relationship" and the narrowing of reasoning coverage by RLVR.

    The following are the results from Table 6 of the original paper:

    Model Solvable Problem Indices
    Qwen2.5-7B-Instruct-1M 400, 402, 403, 407, 409, 412, 413, 417, 418, 419, 422, 423, 427, 432, 433, 436, 438, 439, 440, 444, 445, 448, 449
    Coder-R1 400, 402, 403, 407, 412, 413, 417, 418, 419, 422, 423, 427, 430, 433, 438, 439, 440, 444, 445, 449

Analysis of Table 6 (Solvable Problem Indices in LiveCodeBench): Similar to Table 5, this table shows problem indices solved by the base and RL models for LiveCodeBench.

  • Again, Coder-R1 (RL) solves a subset of the problems solved by Qwen2.5-7B-Instruct-1M (Base). Problems 409, 432, 436, 448 are solved by the base but not by the RL model, while problem 430 is solved by RL but not by the base (a rare exception that the authors address by noting base models can solve them with enough attempts). This further confirms the narrowing of reasoning coverage in coding tasks.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Effects of Different RL Algorithms

The following figure (Figure 8 (top) from the original paper) shows the ΔSE values for different RL algorithms.

Figure 8: (Top) Different RL algorithms. (Bottom) Different RL training steps. The detailed values for each point at pass@1 and pass@256 are provided in Table 3 and Table 4.

Analysis:

  • The Sampling Efficiency Gap ($\Delta_{\mathrm{SE}}$), defined as $\operatorname{pass@256}_{\mathrm{Base}} - \operatorname{pass@1}_{\mathrm{RL}}$, quantifies how far an RL model's pass@1 falls short of the base model's potential maximum pass@256.
  • As shown in Figure 8 (top) and detailed in Table 3, $\Delta_{\mathrm{SE}}$ values across all six popular RL algorithms (PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO) on Omni-MATH-Test show only minor variations (from 43.9 for GRPO down to 42.6 for RLOO, the best).
  • Crucially, $\Delta_{\mathrm{SE}}$ remains consistently above 40 points, highlighting that current RL methods are still far from achieving optimal sampling efficiency and fully leveraging the base model's potential. This indicates that the core issue is not specific to a particular RL algorithm but rather a broader limitation of the current RLVR paradigm.

6.3.2. Effects of RL Training Steps

The following figure (Figure 8 (bottom) from the original paper) shows the effects of different RL training steps.

Figure 8: (Top) Different RL algorithms. (Bottom) Different RL training steps. The detailed values for each point at pass@1 and pass@256 are provided in Table 3 and Table 4.

Analysis:

  • As detailed in Table 4, pass@1 on the training set consistently improves as RL training progresses (e.g., from 26.1 at step 150 to 42.5 at step 450).

  • However, pass@256 (the reasoning boundary) progressively decreases with more training steps (e.g., from 66.3 to 64.3 on Omni-MATH-Train, and 69.1 (base) to 63.9 (GRPO-step450) on Omni-MATH-Test). This suggests an asymptotic effect where continued RL training, while boosting efficiency, actively narrows the model's overall reasoning coverage, leading to a reduced reasoning boundary. The following figure (Figure 19 from the original paper) shows the curves of training reward, response length, and generation entropy during training, corresponding to experiments in Section 4.

Figure 19: The curves of training reward, response length, and generation entropy during training, corresponding to experiments in Section 4.

    Analysis: This figure shows how various metrics evolve during RL training for different algorithms.

  • Actor Reward: The reward (actor_reward) generally increases over training steps for all algorithms, confirming that RL training is successfully optimizing the policy to produce higher-reward (correct) responses.

  • Response Length: response_length tends to decrease or stabilize, indicating that RL might be optimizing for shorter, more direct paths to correct answers.

  • Generation Entropy: generation_entropy consistently decreases for all algorithms as training progresses. This is a crucial observation, as lower entropy means the model's output distribution becomes sharper and less diverse. This reduction in diversity is a likely contributor to the observed narrowing of the reasoning boundary.
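The generation entropy tracked here is essentially the average per-token entropy of the model's next-token distribution over its sampled responses. A minimal sketch of such a computation, assuming a Hugging Face causal LM (the checkpoint name is a placeholder), might look as follows.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works for this sketch.
name = "Qwen/Qwen2.5-7B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

@torch.no_grad()
def mean_token_entropy(prompt: str, response: str) -> float:
    """Average entropy (in nats) of the next-token distribution over the
    response positions, conditioned on the prompt."""
    ids = tok(prompt + response, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    logits = model(ids).logits[0, prompt_len - 1 : -1]  # positions predicting response tokens
    logp = F.log_softmax(logits.float(), dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)          # entropy at each position
    return entropy.mean().item()
```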

6.3.3. Effect of Number of Rollouts $n$

The following figure (Figure 16 from the original paper) shows an ablation study on KL Loss and Rollout Number $n$.

Figure 16: Ablation Study on KL Loss and Rollout Number $n$. For increasing $n$ from 8 to 32, we keep the prompt batch size unchanged, which results in increased computation per training step. Due to resource constraints, we train for only 220 steps under this setting, leading to lower pass@1 as the model has not yet converged. Nevertheless, the model with $n = 32$ achieves a higher pass@128, highlighting the positive effect of larger rollout numbers in improving pass@k at higher values of $k$.

Analysis:

  • Increasing the number of rollouts ($n$) from 8 to 32 (meaning more samples are generated per prompt during training) leads to a slight improvement in pass@k (specifically pass@128).
  • However, even with more rollouts, the RL-trained model is still eventually outperformed by the base model at higher $k$ values. This suggests that while more exploration during training can help, it doesn't fundamentally overcome the limitation of RLVR being bounded by the base model's capacity.
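For context, the rollout number $n$ is the number of responses sampled per prompt during RL training (the GRPO group size), with rewards normalized within each group. A minimal sketch of this group-sampling step is shown below; `policy.generate` and `verify` are assumed helpers, not a real API.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages for one prompt: n rollout rewards -> n advantages."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def collect_group(policy, prompt: str, n: int = 32):
    """Sample n rollouts for one prompt and score them with a verifiable 0/1 reward."""
    responses = [policy.generate(prompt) for _ in range(n)]                 # assumed helper
    rewards = torch.tensor([float(verify(prompt, r)) for r in responses])   # assumed helper
    return responses, grpo_advantages(rewards)
```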

6.3.4. Effect of KL Loss

The following figure (Figure 16 from the original paper) also shows an ablation study on KL Loss.

Figure 16: Ablation Study on KL Loss and Rollout Number $n$. (See the full caption in the previous subsection.)

Analysis:

  • Adding a KL divergence term (with coefficient 0.001) to regularize deviation from the reference policy yields pass@1 scores similar to GRPO without KL.
  • However, the KL-regularized model exhibits a much lower pass@128. This indicates that constraining the model's deviation from the original policy (via KL loss) can further reduce its reasoning boundary, making it less capable of finding diverse solutions even with more attempts.
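A minimal sketch of how such a per-token KL penalty toward the reference (pre-RL) policy is typically added to the objective is given below; the coefficient and the k3-style estimator follow common open-source RLVR implementations, not necessarily the exact setup used in the paper.

```python
import torch

def kl_regularized_pg_loss(policy_logprobs: torch.Tensor,
                           ref_logprobs: torch.Tensor,
                           advantages: torch.Tensor,
                           kl_coef: float = 0.001) -> torch.Tensor:
    """Policy-gradient loss plus a KL penalty that keeps the policy close to the
    reference model. All tensors are per-token with shape [batch, seq_len]."""
    pg_loss = -(advantages * policy_logprobs).mean()
    # Unbiased "k3" KL estimator: E[ r - 1 - log r ] with r = pi_ref / pi.
    log_ratio = ref_logprobs - policy_logprobs
    kl_penalty = (log_ratio.exp() - 1.0 - log_ratio).mean()
    return pg_loss + kl_coef * kl_penalty
```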

6.3.5. Effects of Entropy

The following figure (Figure 18 from the original paper) shows a comparison of Base and RLVR Models with Matched Output Entropy.

Figure 18: Comparison of Base and RLVR Models with Matched Output Entropy. The base model (Qwen2.5-7B) is evaluated on each dataset (AIME24, AMC23, GSM8K, MATH500, Minerva, and Olympiad) at temperature $T = 0.6$, and its output entropy $E_{\mathrm{base}}$ is reported in the title of each subplot. The RLVR model's sampling temperature is then raised so that its output entropy $E_{\mathrm{RL}}$ approximately matches $E_{\mathrm{base}}$; for example, on AMC23, $T = 0.9$ yields $E_{\mathrm{RL}} = 0.47$. RLVR results at $T = 0.6$ are also included as an additional baseline with lower entropy (e.g., 0.22 on AMC23 and 0.33 on MATH500).

Analysis:

  • As observed in Figure 19, RL training typically decreases the model's output entropy. This reduction in diversity is hypothesized to contribute to the narrowing of the reasoning boundary.
  • To test this, the authors increased the generation temperature of the RLVR-trained model so that its output entropy matches that of the base model (a minimal sketch of this temperature-matching procedure is given after this list).
  • While the RLVR model performs slightly better in pass@k at the higher, entropy-matched temperature than at its own $T = 0.6$ (lower-entropy) setting, it still underperforms the base model across all pass@k values.
  • This suggests that while reduced entropy contributes to the narrowing of the reasoning boundary, it is not the sole cause. Other factors, such as the inherent limitation of policy gradient in exploring fundamentally new reasoning patterns, are also at play.
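To make the entropy-matching procedure concrete, here is a minimal sketch of selecting a sampling temperature for the RLVR model whose measured output entropy is closest to the base model's; `entropy_at_temperature` is a hypothetical helper (it could be built on the entropy computation sketched earlier).

```python
def match_entropy(target_entropy: float,
                  entropy_at_temperature,                    # hypothetical: T -> measured entropy
                  temperatures=(0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2)) -> float:
    """Return the temperature whose measured output entropy is closest to the target
    (e.g., E_base = 0.47 on AMC23 is matched at roughly T = 0.9 for the RLVR model)."""
    return min(temperatures, key=lambda t: abs(entropy_at_temperature(t) - target_entropy))
```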

6.4. CoT Case Analysis

The following figure (Figure 20 from the original paper) shows Qwen2.5-Base-7B Correct Response - Case 1.

Figure 20: Qwen2.5-Base-7B Correct Response - Case 1.

The following figure (Figure 21 from the original paper) shows Qwen2.5-Base-7B Correct Response - Case 2.

Analysis: These figures present examples of complex Chain-of-Thought (CoT) reasoning produced by the Qwen2.5-Base-7B model for challenging AIME24 questions.

  • The responses are notably long and exhibit a reflective behavior, where the model attempts multiple approaches, identifies errors, and corrects itself.
  • The base model successfully generates correct CoTs for these problems, even though they are considered among the "hardest questions", on which RLVR-trained models struggle even at high $k$. This manual inspection corroborates the quantitative findings: the strong reasoning ability is already inherent in the base model's sampling distribution; RLVR does not necessarily introduce it but rather makes it more accessible or efficient under certain conditions.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study systematically investigated whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely enhances the reasoning capacity of Large Language Models (LLMs) beyond their base models. The findings, consistent across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, reveal that current RLVR methods primarily improve the sampling efficiency of LLMs, making them more likely to generate correct answers on initial attempts (pass@1). However, RLVR-trained models consistently exhibit a narrower reasoning coverage compared to their base models when evaluated with a large number of samples (pass@k at large $k$). Detailed analyses using accuracy distributions and perplexity show that the reasoning paths generated by RLVR models are already present and bounded by the base model's sampling distribution. Unlike RLVR, distillation is demonstrated to genuinely expand reasoning capabilities by introducing new patterns from a stronger teacher model. The paper concludes that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs.

7.2. Limitations & Future Work

Limitations pointed out by the authors:

  • Proprietary Models: The analysis is constrained by the fact that many of the most capable models and training pipelines are proprietary, limiting the ability to conduct full evaluations on all state-of-the-art systems.
  • Rapid Evolution of RL for LLMs: The field is rapidly evolving, and emerging techniques might mitigate some of the identified limitations. The conclusions should be interpreted with this dynamic context in mind.

Future Research Directions suggested by the authors:

  • Efficient Exploration Strategies in High-Level Abstraction: Developing mechanisms like AlphaEvolve (Novikov et al., 2025) for self-evolution in a program-level abstraction space could enable the discovery of out-of-prior reasoning patterns.
  • Data Scaling via Curriculum: A more deliberate, large-scale data-RL iteration pipeline with a curriculum-based approach (training on easier subproblems first) could help improve sampling efficiency, acquire meta-skills, and enable RLVR to obtain meaningful rewards on challenging tasks.
  • Process Reward and Fine-Grained Credit Assignment: Incorporating intermediate signals to guide the reasoning trajectory, rather than just binary outcome rewards, could significantly improve exploration efficiency and direct the model toward more promising solution paths.
  • Agentic RL: Moving beyond single-turn responses to multi-turn agentic RL, with richer interactions, iterative refinement, and the ability to actively collect new information (e.g., using search tools), could unlock the potential for truly novel experiences and learning.

7.3. Personal Insights & Critique

This paper provides a crucial and timely reality check on the current state of RLVR for LLMs. The widespread belief that RLVR automatically leads to emergent, novel reasoning capabilities, akin to AlphaGo, is a powerful narrative, but this study rigorously challenges it. The distinction between sampling efficiency and reasoning capacity expansion is paramount and often conflated in the LLM community's discussion of "self-improvement." The consistent finding that base models, given enough attempts, often surpass or match RLVR models in terms of total solvable problems is a profound insight.

The perplexity analysis is particularly elegant, offering a quantitative explanation for why RLVR doesn't introduce novelty: it's merely making the model more confident and precise within its existing knowledge distribution, not pushing it beyond. The observed narrowing of the reasoning boundary with continued RL training (asymptotic effect and decreased entropy) is a concerning side effect that warrants more attention, highlighting a potential overfitting to reward signals that prioritizes narrow high-reward paths over broad exploration.

Applicability: The methods and conclusions are highly transferable. Researchers and practitioners employing RLVR for any LLM task (beyond just math/coding/visual) should consider using pass@k at large $k$ to genuinely assess whether their RL training is achieving true capability expansion or just efficiency gains. The findings also strongly advocate for exploring distillation as a viable, and perhaps more effective, alternative for injecting new reasoning patterns.

Potential Issues/Areas for Improvement:

  • Definition of "Novel Reasoning Patterns": While the paper provides strong evidence that RLVR doesn't expand beyond the base model's sampling distribution, the definition of "novel reasoning patterns" could be more explicitly formalized. What constitutes a "novel pattern" vs. a higher-probability existing one?

  • Exploration of Failure Cases: A deeper qualitative analysis of the types of problems RLVR models fail on (that base models solve at high $k$) could yield more insights into the precise nature of the reasoning boundary narrowing. What specific conceptual gaps or missing strategies emerge after RLVR?

  • The "Astronomic kk" Counterargument: The paper states that base models might solve "Type 3" problems (only solved by RLVR) given "astronomically large kk." While this is theoretically true, practically, if kk has to be excessively large, the base model is effectively not solving the problem in a useful sense. The focus should remain on realistic kk values.

  • Teacher Model in Distillation: The success of distillation heavily relies on the "stronger teacher model." The paper hints at this, but a more explicit discussion on what makes a teacher "strong" (e.g., its own training data, architecture, or simply human supervision) would be beneficial.

  • Future RL Paradigms: The proposed future directions (exploration, curriculum, process rewards, agentic RL) are compelling. Future work could try to implement these to see if they genuinely overcome the limitations identified here.

    Overall, this paper is a significant contribution, providing a sober and evidence-based perspective that should guide the future development of RL for LLMs, moving beyond optimistic assumptions towards more targeted and effective paradigms.
