Learning to Reason without External Rewards

Published: 05/26/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Intuitor leverages a model’s self-certainty as an intrinsic reward in reinforcement learning, enabling unsupervised training of LLMs for complex reasoning with strong cross-domain generalization and no reliance on external labels or rewards.

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor

1. Bibliographic Information

1.1. Title

Learning to Reason without External Rewards

The title clearly states the paper's central theme: developing a method for Large Language Models (LLMs) to improve their reasoning abilities without relying on external feedback, such as human labels or pre-defined correct answers.

1.2. Authors

Xuandong Zhao* (UC Berkeley)
Zhewei Kang* (UC Berkeley)
Aosong Feng (Yale University)
Sergey Levine (UC Berkeley)
Dawn Song (UC Berkeley)

*Equal contribution

The authors are affiliated with top-tier academic institutions known for their leading research in artificial intelligence, machine learning, and security. Sergey Levine and Dawn Song are prominent figures in the fields of reinforcement learning, deep learning, and AI security, lending significant credibility to the work.

1.3. Journal/Conference

The paper is available as a preprint on arXiv. The publication date suggests it is intended for a future top-tier AI/ML conference (e.g., NeurIPS, ICML, ICLR). arXiv is a standard platform for disseminating cutting-edge research quickly, allowing the community to engage with new ideas before the formal peer-review process is complete.

1.4. Publication Year

The paper was submitted to arXiv with a publication timestamp of May 26, 2025 (as listed in the metadata).

1.5. Abstract

The abstract summarizes that training LLMs for complex reasoning using Reinforcement Learning with Verifiable Rewards (RLVR) is effective but costly and domain-specific. To address this, the paper explores Reinforcement Learning from Internal Feedback (RLIF), a framework where LLMs learn from their own intrinsic signals. The authors propose Intuitor, an RLIF method that uses the model's own confidence, termed self-certainty, as the sole reward signal. Intuitor is implemented by replacing the external rewards in Group Relative Policy Optimization (GRPO) with these self-certainty scores, enabling fully unsupervised learning. Experiments show Intuitor matches GRPO's performance on in-domain math tasks and achieves superior generalization on out-of-domain coding tasks, all without needing correct solutions or test cases. The findings suggest that intrinsic signals can effectively drive learning, offering a scalable alternative for creating autonomous AI systems where verifiable rewards are not available.

1.6. Original Source Link

Original Source Link: https://arxiv.org/abs/2505.19590v2
PDF Link: http://arxiv.org/pdf/2505.19590v2
Publication Status: This is a preprint version available on arXiv and has not yet undergone formal peer review for a conference or journal.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper addresses is the scalability bottleneck in training large language models (LLMs) for complex reasoning. Current state-of-the-art methods rely heavily on external supervision:

Reinforcement Learning from Human Feedback (RLHF): Requires massive amounts of expensive, time-consuming, and potentially biased human preference data to train a reward model.
Reinforcement Learning with Verifiable Rewards (RLVR): Replaces the human-trained reward model with an automated "verifier" (e.g., a unit test for code, an answer checker for math). While more scalable than RLHF, RLVR is still limited. It requires domain-specific verifiers and "gold-standard" solutions, which are costly to create and maintain. This confines its application to narrow, well-defined domains like competitive programming or elementary math and limits the transfer of learned skills to other areas.

These limitations raise a fundamental question: Can LLMs learn to improve their reasoning skills on their own, without any external rewards or ground-truth labels? This is the entry point for the paper. The innovative idea is to leverage the model's own intrinsic signals as a form of self-supervision. The authors hypothesize that a model's confidence in its own output can serve as a powerful reward signal. A model that is more "certain" about its reasoning process is likely producing a higher-quality response.

2.2. Main Contributions / Findings

The paper's primary contributions are:

Introducing the RLIF Framework: The paper formalizes the concept of Reinforcement Learning from Internal Feedback (RLIF), a new paradigm where LLMs are optimized using rewards generated from their own internal states, eliminating the need for external data or verifiers.
Proposing the Intuitor Method: The authors introduce Intuitor, a concrete and practical implementation of RLIF. Intuitor uses a specific metric called self-certainty—a measure of the model's confidence in its next-token predictions—as its sole intrinsic reward signal.
Demonstrating Unsupervised Learning Efficacy: The key finding is that Intuitor can achieve performance comparable to a fully supervised RLVR method (GRPO) on in-domain mathematical reasoning tasks. This demonstrates that learning from intrinsic signals alone is a viable strategy.
Superior Generalization and Emergent Abilities: Intuitor shows superior generalization to out-of-domain tasks. When trained only on math problems, the Intuitor-trained model shows significant improvements on code generation tasks, whereas the RLVR-trained model shows little to no improvement. Furthermore, the paper observes that Intuitor fosters emergent structured reasoning, where models spontaneously generate step-by-step reasoning before providing an answer, even when not explicitly instructed to do so. This process-oriented improvement, driven by the model's desire for self-conviction, contrasts with outcome-focused RLVR.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, one must be familiar with the following concepts:

Large Language Models (LLMs): These are massive neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. They function by predicting the next word (or token) in a sequence. Models like GPT, Llama, and Qwen are examples. Their ability to generate coherent and contextually relevant text allows them to perform a wide range of tasks, from translation to reasoning.
Reinforcement Learning (RL): A paradigm in machine learning where an agent (e.g., an LLM) learns to make decisions by interacting with an environment. The agent takes actions (e.g., generating a token), receives rewards (signals indicating whether the action was good or bad), and updates its internal policy (strategy for choosing actions) to maximize its cumulative reward over time.
Policy Gradient Methods: A class of RL algorithms used to optimize the agent's policy directly. Instead of learning the value of states or actions, these methods adjust the policy's parameters in the direction that increases the probability of taking actions that lead to higher rewards. REINFORCE and PPO are classic examples.
KL Divergence (Kullback-Leibler Divergence): A measure from information theory that quantifies how one probability distribution differs from a second, reference probability distribution. In the context of LLMs, it is often used as a regularization term during fine-tuning to prevent the updated model from deviating too much from its original, pre-trained version. A low KL divergence means the two distributions are similar.

3.2. Previous Works

The paper builds upon and differentiates itself from several key lines of research.

3.2.1. Reinforcement Learning from Human Feedback (RLHF)

RLHF is a technique to align LLMs with human values and preferences. The process typically involves:

Collecting Human Preference Data: Humans are shown multiple model-generated responses to a prompt and asked to rank them from best to worst.
Training a Reward Model: This preference data is used to train a separate neural network, the reward model ( $r_{\phi}$ ), which learns to predict a scalar "reward" score that reflects human preference for a given input-output pair.
Fine-tuning the LLM with RL: The LLM (the policy, $\pi_{\theta}$ ) is then fine-tuned using an RL algorithm like Proximal Policy Optimization (PPO). The reward model provides the reward signal. The LLM is encouraged to generate outputs that the reward model scores highly.

The optimization objective in RLHF is to maximize the expected reward while penalizing deviation from a reference policy ( $\pi_{\text{ref}}$ ) to maintain model stability and diversity: $ \max_{\pi_{\theta}} \mathbb{E}_{o \sim \pi_{\theta}(q)}\left[r_{\phi}(q, o)-\beta \operatorname{KL}\left[\pi_{\theta}(o \mid q) \| \pi_{\text{ref}}(o \mid q)\right]\right] $

$q$ : The input prompt (query).
$o$ : The generated output.
$\pi_{\theta}$ : The LLM policy being trained.
$r_{\phi}(q, o)$ : The reward model's score for the output.
$\pi_{\text{ref}}$ : A reference policy (often the initial model before RL tuning).
$\beta$ : A hyperparameter controlling the strength of the KL divergence penalty.

Limitation: RLHF is expensive and slow due to its reliance on human annotators.

3.2.2. Reinforcement Learning with Verifiable Rewards (RLVR)

RLVR is an advancement that replaces the subjective, learned reward model of RLHF with an objective, automated verifier. This is suitable for tasks where correctness can be programmatically checked.

For Math Problems: The verifier checks if the model's final numerical answer matches the known correct answer.
For Code Generation: The verifier runs the generated code against a set of unit tests.

The objective is similar to RLHF, but the reward model $r_{\phi}$ is replaced by a verifiable reward function $v(q, o)$ : $ \max_{\pi_{\theta}} \mathbb{E}_{o \sim \pi_{\theta}(q)}\left[v(q, o)-\beta \operatorname{KL}\left[\pi_{\theta}(o \mid q) \| \pi_{\text{ref}}(o \mid q)\right]\right] $
$v(q, o)$ : The verifiable reward function, which returns a high value (e.g., 1) if the output is correct and a low value (e.g., 0) otherwise.

Limitation: RLVR is restricted to domains with available ground-truth solutions and automated verifiers. It is also "outcome-oriented," rewarding only the final answer, which may not effectively teach the model the process of reasoning.
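As a rough illustration of what such a verifier can look like for math problems, here is a minimal Python sketch. The function name `verifiable_reward` and the "Answer: ..." output convention are assumptions made for this example; they are not taken from the paper or its code.

```python
def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Toy outcome-based verifier: 1.0 if the final answer matches, else 0.0.

    Assumption (hypothetical convention): the last non-empty line of the
    output holds the final answer, optionally prefixed by "Answer:".
    """
    lines = [line.strip() for line in model_output.splitlines() if line.strip()]
    if not lines:
        return 0.0  # nothing was generated, so nothing can be correct
    final = lines[-1]
    prefix = "Answer:"
    if final.startswith(prefix):
        final = final[len(prefix):].strip()
    # Only the outcome is checked; the reasoning before it is ignored.
    return 1.0 if final == gold_answer.strip() else 0.0


print(verifiable_reward("2 + 2 is four.\nAnswer: 4", "4"))
print(verifiable_reward("I think it is five.\nAnswer: 5", "4"))
```

Note that the reward depends only on the final line, which is exactly the "outcome-oriented" property described above: two outputs with very different reasoning quality receive the same score if their final answers agree.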

3.2.3. Group Relative Policy Optimization (GRPO)

GRPO is a policy gradient algorithm that has become popular for RLVR. Instead of evaluating a single output, GRPO works with a group of outputs generated for the same prompt. Its key idea is to use the relative performance within the group to estimate the advantage of each action. For a group of $G$ outputs, it normalizes the rewards, using the mean and standard deviation of rewards within the group to calculate an "advantage" for each output. This makes the learning signal more stable than using raw reward scores. Intuitor builds directly on this framework by substituting the external reward with its intrinsic self-certainty score.
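The group normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name, the use of the population standard deviation, and the small `eps` guard against a zero standard deviation are all assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    rewards: one scalar reward per sampled output for the same prompt.
    eps: assumed safeguard so identical rewards do not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G outputs (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four outputs for one prompt: two judged correct (1.0), two incorrect (0.0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Outputs that score above the group mean receive a positive advantage and those below receive a negative one, so the signal depends only on relative standing within the group, not on the absolute reward scale.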

3.2.4. Intrinsic Signals and Self-Play

The idea of using intrinsic signals is not entirely new. The paper situates itself within a growing body of work on autonomous model improvement:

SPIN and Self-Rewarding LMs: These methods use the model itself to generate feedback for its own training, creating a self-improvement loop.
Genius, TTRL, Absolute Zero: These are concurrent works that also explore RL on unlabeled data, but the paper notes they are often constrained to specific tasks like math.
Entropy Minimization (EMPO, EM-RL): These methods use the model's predictive entropy as a reward signal, encouraging it to be "less uncertain." The paper contrasts its self-certainty metric with entropy, arguing self-certainty is less prone to certain biases.

3.3. Technological Evolution

The field has evolved from purely supervised learning to more dynamic, feedback-driven approaches:

Pre-training & Supervised Fine-Tuning (SFT): Models learn general knowledge from web-scale text and are then fine-tuned on task-specific examples. This is static and limited by the available labeled data.
Reinforcement Learning from Human Feedback (RLHF): Introduced a dynamic loop, aligning models with human preferences, but at a high cost. This powered models like ChatGPT.
Reinforcement Learning with Verifiable Rewards (RLVR): Increased scalability by automating the reward signal for certain domains, leading to powerful reasoning models like DeepSeek-R1.
Reinforcement Learning from Internal Feedback (RLIF): The paradigm proposed in this paper, which aims for full autonomy by removing the need for any external verifier or data. It represents a shift from extrinsic to intrinsic motivation.

3.4. Differentiation Analysis

The core innovations of Intuitor compared to prior work are:

vs. RLHF/RLVR: Intuitor is fully unsupervised. It requires no human feedback, no reward model, and no ground-truth answers or verifiers. It only needs a collection of prompts (e.g., math questions without solutions). This makes it more scalable and, in principle, applicable to domains where no verifier exists.
Process-aware vs. Outcome-based Rewards: RLVR typically uses outcome-based rewards (correct/incorrect answer). Intuitor uses self-certainty, which is calculated over the entire generation (token by token). This makes it a process-aware reward, encouraging the model to generate a coherent and confident reasoning path, not just a correct final answer. The analysis presents this as a key reason for its superior generalization.
vs. Other Intrinsic Methods: While other methods use intrinsic signals like entropy, Intuitor's use of self-certainty (a specific form of KL divergence from a uniform distribution) is claimed to be more robust against biases like a preference for longer or shorter sequences. It is presented as a lightweight, simple, and general-purpose signal.

4. Methodology

4.1. Principles

The core principle behind Intuitor is that a language model can improve its reasoning abilities by learning to become more "confident" in its own outputs. The intuition is that when an LLM generates a well-reasoned, logically consistent response, its internal probability distributions for each generated token will be sharper and more decisive. Conversely, when it is "confused" or generating flawed logic, its predictions will be more uncertain (i.e., the probability will be spread out over many possible next tokens).

By rewarding the model for generating responses that it is internally confident about, Intuitor creates a self-reinforcing loop. The model "practices" generating responses, and through reinforcement learning, it learns to favor generation trajectories that it finds more convincing. This process, in turn, is hypothesized to lead to objectively better and more structured reasoning.

4.2. Core Methodology In-depth

The methodology can be broken down into three main parts: the formalization of RLIF, the definition of the self-certainty reward signal, and its integration into the GRPO policy optimization algorithm.

4.2.1. Reinforcement Learning from Internal Feedback (RLIF)

The paper first defines the general RLIF paradigm, which reframes the standard RL objective for LLMs. Instead of an external reward from a human-trained model ( $r_\phi$ ) or a verifier ( $v(q, o)$ ), RLIF uses an intrinsic signal $u(q, o)$ derived from the model's own computation.

The optimization objective for RLIF is given by: $ \max_{\pi_{\theta}} \mathbb{E}_{o \sim \pi_{\theta}(q)}\left[u(q, o)-\beta \operatorname{KL}\left[\pi_{\theta}(o \mid q) \| \pi_{\text{ref}}(o \mid q)\right]\right] $

$u(q, o)$ : An intrinsic reward function that scores the quality of an output $o$ for a query $q$ based on the model's internal state.
All other symbols ( $\pi_{\theta}$ , $\pi_{\text{ref}}$ , $\beta$ , etc.) retain their standard meanings from RLHF/RLVR.

The central challenge of RLIF is to find a suitable intrinsic signal $u(q, o)$ that correlates with desired behaviors like correct reasoning.
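To see how the two terms of this objective trade off, here is a minimal Python sketch of the quantity inside the expectation for a single sampled output. It relies on the standard fact that, for outputs sampled from $\pi_{\theta}$ , the log-ratio $\log \pi_{\theta}(o \mid q) - \log \pi_{\text{ref}}(o \mid q)$ averages to the KL term. The function name, the per-token log-probability inputs, and the default `beta=0.1` are illustrative assumptions, not values from the paper.

```python
def regularized_score(reward, logp_policy, logp_ref, beta=0.1):
    """Single-sample estimate of reward minus beta times the KL penalty.

    reward: scalar score for the output (r_phi, v, or u, depending on the setup).
    logp_policy / logp_ref: per-token log-probabilities of the sampled output
        under the trained policy and under the reference policy.
    beta: illustrative penalty strength (hypothetical default).
    """
    # Summing per-token differences gives log pi_theta(o|q) - log pi_ref(o|q).
    log_ratio = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * log_ratio


# An output the policy now prefers more strongly than the reference did.
print(regularized_score(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1))
```

The same function covers RLHF, RLVR, and RLIF: only the source of `reward` changes, which is the sense in which RLIF "reframes" the standard objective.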

4.2.2. Intuitor's Intrinsic Reward: Self-Certainty

Intuitor proposes using self-certainty as the intrinsic reward signal. Self-certainty is defined as the KL divergence between a uniform distribution over the entire vocabulary and the model's predicted next-token probability distribution, averaged over all tokens in the generated output.

The formula for self-certainty is: $ \text{Self-certainty}(o \mid q) := \frac{1}{|o|} \sum_{i=1}^{|o|} \mathrm{KL}\left(U \| p_{\pi_{\theta}}\left(\cdot \mid q, o_{<i}\right)\right) $

Let's break this down:

$o$ : The sequence of generated tokens, $o = (o_1, o_2, \dots, o_{|o|})$ .
$|o|$ : The length of the generated output.
$o_{<i}$ : The sequence of tokens generated before step $i$ , i.e., $(o_1, \dots, o_{i-1})$ .
$p_{\pi_{\theta}}\left(\cdot \mid q, o_{<i}\right)$ : This is the model's predicted probability distribution over the entire vocabulary for the next token at step $i$ , given the query $q$ and the preceding tokens $o_{<i}$ .
$U$ : A uniform probability distribution over the vocabulary $\mathcal{V}$ . If the vocabulary has $|\mathcal{V}|$ tokens, then $U(j) = 1/|\mathcal{V}|$ for every token $j$ .
$\mathrm{KL}(U \| p_{\pi_{\theta}})$ : The KL divergence between the uniform distribution $U$ and the model's predictive distribution $p_{\pi_{\theta}}$ .

A higher self-certainty score indicates that the model's predictive distribution $p_{\pi_{\theta}}$ is far from uniform, meaning it is highly peaked or "spiky" on a few tokens. This reflects high confidence. A low score means the distribution is closer to uniform, indicating uncertainty.
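To make the behavior of the score concrete, here is a minimal Python sketch. It assumes each next-token distribution is available as a plain list of strictly positive probabilities (as a softmax would produce) and uses natural logarithms. The function name `self_certainty` and the toy four-token vocabulary are illustrative and not taken from the paper's code.

```python
import math


def self_certainty(token_distributions):
    """Average KL(U || p_i) over the generated positions.

    token_distributions: one probability list per generated token, each
        covering the whole (toy) vocabulary and containing no zeros.
    """
    total = 0.0
    for probs in token_distributions:
        vocab_size = len(probs)
        # KL(U || p) with U(j) = 1/|V| equals -(1/|V|) * sum_j log(|V| * p_j).
        total += -sum(math.log(vocab_size * p) for p in probs) / vocab_size
    return total / len(token_distributions)


flat = [[0.25, 0.25, 0.25, 0.25]]   # maximally uncertain prediction
peaked = [[0.7, 0.1, 0.1, 0.1]]     # confident prediction
print(self_certainty(flat))    # 0.0 for a uniform distribution
print(self_certainty(peaked))  # about 0.43, higher because the distribution is peaked
```

A uniform prediction scores exactly zero, and the score grows as probability mass concentrates on fewer tokens, which matches the interpretation above.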

The paper also provides an expanded form of the formula which clarifies the calculation: $ \text{Self-certainty}(o \mid q) = -\frac{1}{|o| \cdot|\mathcal{V}|} \sum_{i=1}^{|o|} \sum_{j=1}^{|\mathcal{V}|} \log \left(|\mathcal{V}| \cdot p_{\pi_{\theta}}\left(j \mid q, o_{<i}\right)\right) $ Here the inner sum runs over all vocabulary tokens $j$ and the outer sum over generation steps $i$ .
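The expanded form follows from the definition by writing out the KL divergence at a single position with $U(j) = 1/|\mathcal{V}|$ . The short derivation below is a sketch of that step; the shorthand $p(j)$ for $p_{\pi_{\theta}}(j \mid q, o_{<i})$ and the four-token numerical example (with natural logarithms) are added here for illustration and do not come from the paper.

```latex
\begin{aligned}
\mathrm{KL}\left(U \,\|\, p\right)
  &= \sum_{j=1}^{|\mathcal{V}|} U(j) \log \frac{U(j)}{p(j)}
   = \sum_{j=1}^{|\mathcal{V}|} \frac{1}{|\mathcal{V}|} \log \frac{1}{|\mathcal{V}| \cdot p(j)} \\
  &= -\frac{1}{|\mathcal{V}|} \sum_{j=1}^{|\mathcal{V}|} \log\left(|\mathcal{V}| \cdot p(j)\right).
\end{aligned}

% Worked example with |V| = 4 and p = (0.7, 0.1, 0.1, 0.1):
% KL(U || p) = -(1/4) [ ln(2.8) + 3 ln(0.4) ]
%            = -(1/4) [ 1.0296 - 2.7489 ]
%            \approx 0.430
```

Averaging this per-position quantity over the $|o|$ generated tokens gives the expanded formula.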

4.2.3. Implementation via Group Relative Policy Optimization (GRPO)

Intuitor is implemented by plugging the self-certainty reward into the GRPO algorithm. The overall process is illustrated in Figure 2 of the paper.

The Intuitor training pipeline for a single update step is as follows:

Sample Generation: For a given prompt $q$ , generate a group of $G$ different candidate outputs, $\{o_1, o_2, \ldots, o_G\}$ , using the current policy (or a slightly older one).
Intrinsic Reward Calculation: For each generated output $o_i$ in the group, calculate its self-certainty score. This score becomes the intrinsic reward, $u_i = \text{Self-certainty}(o_i \mid q)$ .
Advantage Estimation: The key innovation of Intuitor happens here. The external, verifiable rewards of GRPO are replaced by the intrinsic self-certainty scores. The advantage for each output $o_i$ is calculated by normalizing its reward relative to the other rewards in the group. This stabilizes the learning signal. The advantage $\hat{A}_{i,t}$ for each token in output $o_i$ is computed as: $ \hat{A}_{i, t}=\frac{u_{i}-\operatorname{mean}\left(\left\{u_{1}, u_{2}, \cdots, u_{G}\right\}\right)}{\operatorname{std}\left(\left\{u_{1}, u_{2}, \cdots, u_{G}\right\}\right)} $
- $u_i$ : The self-certainty score of the $i$ -th output.
- $\operatorname{mean}(\cdot)$ and $\operatorname{std}(\cdot)$ : The mean and standard deviation of the self-certainty scores across the entire group of $G$ outputs.
- Note: In this formulation, the advantage $\hat{A}_{i,t}$ is constant for all tokens $t$ within a given output $o_i$ . It is a sequence-level advantage.
Policy Update: The model's policy $\pi_{\theta}$ is then updated using the GRPO objective function, which aims to increase the likelihood of generating outputs with higher relative self-certainty. The full GRPO objective being optimized is: $ \begin{aligned} & \mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q \sim P(Q),\left\{o_{i}\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \\ & \quad \frac{1}{G} \sum_{i=1}^{G} \frac{1}{\left|o_{i}\right|} \sum_{t=1}^{\left|o_{i}\right|} \left(\min \left[c_{i, t}(\theta) \hat{A}_{i, t}, \operatorname{clip}\left(c_{i, t}(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_{i, t}\right]-\beta \mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta} \| \pi_{\mathrm{ref}}\right)\right) \end{aligned} $
- $c_{i, t}(\theta) = \frac{\pi_{\theta}\left(o_{i, t} \mid q, o_{i,<t}\right)}{\pi_{\theta_{\text{old}}}\left(o_{i, t} \mid q, o_{i,<t}\right)}$ : This is the importance sampling ratio, comparing the probability of generating token $o_{i,t}$ under the current policy $\pi_{\theta}$ versus the policy that generated the samples, $\pi_{\theta_{\text{old}}}$ .
- $\hat{A}_{i, t}$ : The advantage calculated in the previous step using self-certainty.
- $\operatorname{clip}(\cdot)$ : A clipping function (from PPO) that limits how much the policy can change in one update, preventing instability. $\epsilon$ is a small hyperparameter (e.g., 0.2).
- $\mathbb{D}_{\mathrm{KL}}(\pi_{\theta} \| \pi_{\mathrm{ref}})$ : The KL divergence penalty to keep the policy from straying too far from the reference model.
  
  The following diagram illustrates the Intuitor workflow:
  
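  Figure (Intuitor workflow): Starting from the input query $q$ , the policy model generates multiple outputs $o_1, o_2, \ldots, o_G$ ; a reference model computes self-certainty scores, which are converted into rewards $u_1, u_2, \ldots, u_G$ and normalized to give the advantages $A_1, A_2, \ldots, A_G$ used to train the model.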

In essence, Intuitor reuses the GRPO machinery unchanged but drives it with a completely internal, self-generated reward signal, making the entire learning process unsupervised.
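The clipped term at the heart of the policy update can be sketched as follows. This is a simplified illustration under stated assumptions: it works on plain per-token log-probabilities, omits the KL penalty term, and computes objective values only (no gradients). The function names and the name `logp_sampling` for the log-probability under the policy that generated the samples are hypothetical.

```python
import math


def clipped_token_objective(logp_new, logp_sampling, advantage, epsilon=0.2):
    """PPO-style clipped surrogate for one token (KL penalty omitted)."""
    ratio = math.exp(logp_new - logp_sampling)  # importance sampling ratio
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum keeps the update pessimistic: large ratio changes
    # cannot increase the objective beyond the clipped value.
    return min(ratio * advantage, clipped * advantage)


def sequence_objective(logps_new, logps_sampling, advantage, epsilon=0.2):
    """Average the per-token terms; the advantage is shared by all tokens."""
    terms = [
        clipped_token_objective(new, old, advantage, epsilon)
        for new, old in zip(logps_new, logps_sampling)
    ]
    return sum(terms) / len(terms)


# Unchanged policy: the ratio is 1, so the objective equals the advantage.
print(sequence_objective([-1.0, -2.0], [-1.0, -2.0], advantage=0.5))
# Token probability doubled with a positive advantage: capped at 1 + epsilon.
print(clipped_token_objective(math.log(2.0), 0.0, advantage=1.0))
```

In an Intuitor-style setup the `advantage` passed in would be the group-normalized self-certainty score of the whole output, so every token of that output shares the same value.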

5. Experimental Setup

5.1. Datasets

The experiments use a combination of mathematical reasoning, code generation, and general instruction-following datasets for training and evaluation.

Training Datasets:
- MATH dataset: A challenging dataset of 12,500 math problems from high school competitions. The training split, containing 7,500 problems, was used for the main experiments. Crucially, Intuitor only used the problem statements (the questions), not the solutions.
- Codeforces dataset: A dataset of competitive programming problems used to train the Intuitor-Code variant to assess generalization to a different domain.
Evaluation Datasets:
- MATH500 & GSM8K: Standard benchmarks for mathematical reasoning. MATH500 is a test subset of the MATH dataset. GSM8K consists of grade-school math word problems. These are in-domain benchmarks for models trained on MATH.
- LiveCodeBench (LCB) & CRUXEval-O: Benchmarks for code generation and code reasoning. These are out-of-domain benchmarks to test generalization.
- MMLU-Pro: A more robust and challenging version of the popular MMLU benchmark, measuring multitask language understanding across various subjects.
- AlpacaEval 2.0: A benchmark for evaluating an LLM's ability to follow general instructions. It compares model outputs to a reference model's outputs using a powerful LLM-as-a-judge (GPT-4.1).
  
  To provide a concrete example, here is a problem from the LiveCodeBench dataset, as shown in the paper's appendix. This helps to understand the nature of the coding tasks.

Problem Example (LiveCodeBench):

Question: You are given a 0-indexed array of strings details. Each element provides information about a passenger... The first ten characters consist of the phone number... The next character denotes the gender... The following two characters are used to indicate the age... Return the number of passengers who are strictly more than 60 years old.

5.2. Evaluation Metrics

The paper uses standard metrics for each task domain.

Accuracy: Used for math and code benchmarks (GSM8K, MATH500, LCB, CRUXEval-O). This metric measures the percentage of problems the model solves correctly. For coding tasks, this is often a pass@1 score, where the model's first generated solution must pass all hidden unit tests.
- Conceptual Definition: Measures the fraction of correct predictions out of the total predictions.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation:
  - Number of Correct Predictions: The count of test samples for which the model's output was deemed correct by the verifier.
  - Total Number of Predictions: The total number of samples in the test set.
AlpacaEval Win Rate: Used to evaluate general instruction-following ability.
- Conceptual Definition: This metric is based on pairwise comparison. The model's response to a prompt is compared against the response from a baseline model (e.g., text-davinci-003). An automated LLM judge (GPT-4.1 in this paper's setup, as noted in Section 5.1) determines which response is better. The win rate is the percentage of comparisons in which the evaluated model's response was judged better. The paper specifically uses a "length-controlled" version to mitigate the tendency of LLM judges to prefer longer answers.
- Mathematical Formula: $ \text{Win Rate} = \frac{\text{Number of Wins}}{\text{Total Number of Comparisons}} $
- Symbol Explanation:
  - Number of Wins: The count of prompts for which the model's output was preferred over the baseline's output.
  - Total Number of Comparisons: The total number of prompts in the evaluation set.

5.3. Baselines

The primary models compared are:

Base Model: The original, pre-trained LLM (Qwen2.5-1.5B, Qwen2.5-3B, etc.) without any reinforcement learning fine-tuning. This shows the starting performance.
GRPO: The same base model fine-tuned with the GRPO algorithm using verifiable rewards. This is the main supervised baseline. For the MATH dataset, the reward is based on whether the final answer matches the gold solution. This represents the standard RLVR approach.
GRPO-PV: A GRPO variant using plurality voting as a reward signal, as a proxy for ground truth. For a given prompt, multiple answers are generated, and the most frequent answer is assumed to be correct. This is a weaker, pseudo-supervised baseline that doesn't require pre-annotated gold solutions but still relies on an external consensus mechanism.
Intuitor: The proposed method, which fine-tunes the base model with GRPO using only the intrinsic self-certainty reward. This is the fully unsupervised approach.

The comparison between Intuitor and GRPO is the most critical, as it directly pits a fully unsupervised method against a supervised one using the same RL algorithm.
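For the GRPO-PV baseline, the plurality-voting reward can be illustrated with a short Python sketch. This is only a plausible reading of the description above, not the paper's implementation: the function name, the assumption that final answers have already been extracted as strings, and the tie-breaking rule (the earliest answer among the most frequent wins) are choices made for this example.

```python
from collections import Counter


def plurality_vote_rewards(final_answers):
    """Reward 1.0 for outputs whose answer matches the most frequent one.

    final_answers: the extracted final answer of each sampled output for
        one prompt (answer extraction itself is assumed to happen elsewhere).
    """
    counts = Counter(final_answers)
    # most_common(1) returns the top (answer, count) pair; among ties the
    # answer that appeared first is returned.
    pseudo_label, _ = counts.most_common(1)[0]
    return [1.0 if answer == pseudo_label else 0.0 for answer in final_answers]


print(plurality_vote_rewards(["4", "5", "4", "7"]))  # [1.0, 0.0, 1.0, 0.0]
```

Unlike self-certainty, this signal needs an extractable final answer to compare across samples, which ties it to tasks with short, canonical answers.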

6. Results & Analysis

6.1. Core Results Analysis

The experimental results strongly support the paper's hypotheses. The key findings are presented in Table 1, which compares different methods across multiple benchmarks.

The following are the results from Table 1 of the original paper:

Model	Training Data	GSM8K	MATH500	LCB	CRUX	MMLU-Pro	AlpacaEval
Qwen2.5-1.5B Results
Base	-	0.002	0.090	0.000	0.000	0.297	2.10
+ GRPO	MATH	0.747	0.560	0.056	0.328	0.315	4.03
+ Intuitor	MATH	0.711	0.530	0.099	0.296	0.310	4.28
Qwen2.5-3B Results
Base	-	0.673	0.544	0.093	0.236	0.377	3.72
+ GRPO	MATH	0.826	0.636	0.085	0.341	0.403	6.91
+ GRPO-PV	MATH	0.820	0.636	0.086	0.299	0.398	6.17
+ Intuitor	MATH	0.792	0.612	0.153	0.416	0.379	7.10
+ Intuitor-Code	Codeforces	0.743	0.572	0.153	0.411	0.386	4.16

Analysis:

Comparable In-Domain Performance: On the mathematical reasoning benchmarks GSM8K and MATH500 (in-domain for models trained on MATH), Intuitor performs nearly as well as the supervised GRPO method. For the Qwen2.5-3B model, Intuitor scores 79.2% on GSM8K and 61.2% on MATH500, very close to GRPO's 82.6% and 63.6%. This is a remarkable result, as Intuitor achieves this without access to any correct answers.
Superior Out-of-Domain Generalization: The most striking result is on the out-of-domain coding tasks. When trained only on math problems, the Qwen2.5-3B model fine-tuned with Intuitor sees its LiveCodeBench (LCB) score jump from 9.3% to 15.3% (a 65% relative improvement). In contrast, the GRPO-tuned model's score decreases from 9.3% to 8.5%. A similar trend is observed on CRUXEval-O, where Intuitor achieves a 76% relative gain compared to 44% for GRPO. This suggests that by optimizing for an internal, process-aware signal (self-certainty), Intuitor learns a more generalizable "structured reasoning" ability, whereas GRPO, by focusing only on the final math answer, overfits to the specific domain.
Enhanced Instruction Following: On AlpacaEval, Intuitor scores slightly higher than GRPO at both model sizes: 7.10 vs. 6.91 for the 3B model and 4.28 vs. 4.03 for the 1.5B model. Both methods improve substantially over the base models (3.72 and 2.10, respectively), which suggests that reinforcing self-certainty also improves the model's general ability to follow instructions.
Fostering Structured Reasoning: The paper observes that Intuitor-trained models learn to produce more structured, long-form reasoning. As shown in the line charts below, Intuitor leads to significantly longer responses during training, which correlates with the emergence of step-by-step reasoning chains.

Figure: Two line charts showing how completion length changes with training steps for GRPO, Intuitor, and GRPO-PV on Qwen models of different parameter scales.

This emergent behavior is further analyzed in Figure 5 of the paper, which shows an Intuitor-trained model spontaneously producing free-form reasoning before providing the final answer in the required JSON format, a behavior not seen in the GRPO-tuned model.

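Figure: A schematic from the paper comparing the input-output structure of the dominant GRPO output format with that of the dominant Intuitor output format, showing how the two differ in expressing the JSON data structure from the problem description.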
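The relative gains quoted in the analysis can be re-derived from the Qwen2.5-3B rows of Table 1. The short Python check below uses only values from that table; the helper name `relative_gain` is illustrative.

```python
def relative_gain(base, tuned):
    """Relative change of a tuned score with respect to the base score."""
    return (tuned - base) / base


# Qwen2.5-3B scores from Table 1 (base, then fine-tuned).
lcb_intuitor = relative_gain(0.093, 0.153)   # LiveCodeBench, Intuitor
lcb_grpo = relative_gain(0.093, 0.085)       # LiveCodeBench, GRPO
crux_intuitor = relative_gain(0.236, 0.416)  # CRUXEval-O, Intuitor
crux_grpo = relative_gain(0.236, 0.341)      # CRUXEval-O, GRPO

print(f"LCB:  Intuitor {lcb_intuitor:+.1%}, GRPO {lcb_grpo:+.1%}")
print(f"CRUX: Intuitor {crux_intuitor:+.1%}, GRPO {crux_grpo:+.1%}")
```

The results round to roughly +65% and -9% on LiveCodeBench and +76% and +44% on CRUXEval-O, consistent with the figures cited in the analysis.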

6.2. Ablation Studies / Parameter Analysis

6.2.1. Online vs. Offline Self-Certainty

A critical experiment in Section 5.4 investigates the stability of the self-certainty reward. The authors compare using an online annotator (where self-certainty is calculated by the constantly updating policy model) versus an offline annotator (where self-certainty is calculated by the fixed base model from
