Learning to Reason without External Rewards
TL;DR Summary
Intuitor leverages a model’s self-certainty as an intrinsic reward in reinforcement learning, enabling unsupervised training of LLMs for complex reasoning with strong cross-domain generalization and no reliance on external labels or rewards.
Abstract
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
In-depth Reading
1. Bibliographic Information
1.1. Title
Learning to Reason without External Rewards
The title clearly states the paper's central theme: developing a method for Large Language Models (LLMs) to improve their reasoning abilities without relying on external feedback, such as human labels or pre-defined correct answers.
1.2. Authors
- Xuandong Zhao* (UC Berkeley)
- Zhewei Kang* (UC Berkeley)
- Aosong Feng (Yale University)
- Sergey Levine (UC Berkeley)
- Dawn Song (UC Berkeley) (*Equal contribution)
The authors are affiliated with top-tier academic institutions known for their leading research in artificial intelligence, machine learning, and security. Sergey Levine and Dawn Song are prominent figures in the fields of reinforcement learning, deep learning, and AI security, lending significant credibility to the work.
1.3. Journal/Conference
The paper is available as a preprint on arXiv. The publication date suggests it is intended for a future top-tier AI/ML conference (e.g., NeurIPS, ICML, ICLR). arXiv is a standard platform for disseminating cutting-edge research quickly, allowing the community to engage with new ideas before the formal peer-review process is complete.
1.4. Publication Year
The paper was submitted to arXiv with a publication timestamp of May 26, 2025 (as listed in the metadata).
1.5. Abstract
The abstract summarizes that training LLMs for complex reasoning using Reinforcement Learning with Verifiable Rewards (RLVR) is effective but costly and domain-specific. To address this, the paper explores Reinforcement Learning from Internal Feedback (RLIF), a framework where LLMs learn from their own intrinsic signals. The authors propose Intuitor, an RLIF method that uses the model's own confidence, termed self-certainty, as the sole reward signal. Intuitor is implemented by replacing the external rewards in Group Relative Policy Optimization (GRPO) with these self-certainty scores, enabling fully unsupervised learning. Experiments show Intuitor matches GRPO's performance on in-domain math tasks and achieves superior generalization on out-of-domain coding tasks, all without needing correct solutions or test cases. The findings suggest that intrinsic signals can effectively drive learning, offering a scalable alternative for creating autonomous AI systems where verifiable rewards are not available.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2505.19590v2
- PDF Link: http://arxiv.org/pdf/2505.19590v2
- Publication Status: This is a preprint version available on arXiv and has not yet undergone formal peer review for a conference or journal.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper addresses is the scalability bottleneck in training large language models (LLMs) for complex reasoning. Current state-of-the-art methods rely heavily on external supervision:
- Reinforcement Learning from Human Feedback (RLHF): Requires massive amounts of expensive, time-consuming, and potentially biased human preference data to train a reward model.
- Reinforcement Learning with Verifiable Rewards (RLVR): Replaces the human-trained reward model with an automated "verifier" (e.g., a unit test for code, an answer checker for math). While more scalable than RLHF, RLVR is still limited: it requires domain-specific verifiers and gold-standard solutions, which are costly to create and maintain. This confines its application to narrow, well-defined domains such as competitive programming or elementary math and limits the transfer of learned skills to other areas.
These limitations raise a fundamental question: Can LLMs learn to improve their reasoning skills on their own, without any external rewards or ground-truth labels? This is the entry point for the paper. The innovative idea is to leverage the model's own intrinsic signals as a form of self-supervision. The authors hypothesize that a model's confidence in its own output can serve as a powerful reward signal. A model that is more "certain" about its reasoning process is likely producing a higher-quality response.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Introducing the RLIF Framework: The paper formalizes Reinforcement Learning from Internal Feedback (RLIF), a new paradigm in which LLMs are optimized using rewards generated from their own internal states, eliminating the need for external data or verifiers.
- Proposing the Intuitor Method: The authors introduce Intuitor, a concrete and practical implementation of RLIF. Intuitor uses a specific metric called self-certainty, a measure of the model's confidence in its next-token predictions, as its sole intrinsic reward signal.
- Demonstrating Unsupervised Learning Efficacy: The key finding is that Intuitor achieves performance comparable to a fully supervised RLVR method (GRPO) on in-domain mathematical reasoning tasks, demonstrating that learning from intrinsic signals alone is a viable strategy.
- Superior Generalization and Emergent Abilities: Intuitor shows superior generalization to out-of-domain tasks. When trained only on math problems, the Intuitor-trained model improves significantly on code generation, whereas the RLVR-trained model shows little to no improvement. Furthermore, the paper observes that Intuitor fosters emergent structured reasoning: models spontaneously generate step-by-step reasoning before providing an answer, even when not explicitly instructed to do so. This process-oriented improvement, driven by the model's push toward self-conviction, contrasts with outcome-focused RLVR.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, one must be familiar with the following concepts:
- Large Language Models (LLMs): Massive neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. They function by predicting the next word (or token) in a sequence. Models like GPT, Llama, and Qwen are examples. Their ability to generate coherent and contextually relevant text allows them to perform a wide range of tasks, from translation to reasoning.
- Reinforcement Learning (RL): A machine learning paradigm in which an agent (e.g., an LLM) learns to make decisions by interacting with an environment. The agent takes actions (e.g., generating a token), receives rewards (signals indicating whether the action was good or bad), and updates its policy (its strategy for choosing actions) to maximize cumulative reward over time.
- Policy Gradient Methods: A class of RL algorithms that optimize the agent's policy directly. Instead of learning the value of states or actions, these methods adjust the policy's parameters in the direction that increases the probability of actions that lead to higher rewards. REINFORCE and PPO are classic examples.
- KL Divergence (Kullback-Leibler Divergence): An information-theoretic measure of how one probability distribution differs from a second, reference distribution. For LLMs, it is often used as a regularization term during fine-tuning to prevent the updated model from deviating too much from its original, pre-trained version. A low KL divergence means the two distributions are similar; a small numeric example follows this list.
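To make this concrete, here is a minimal, self-contained illustration (not from the paper) that computes the KL divergence between toy categorical distributions; the distributions and values are purely illustrative.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# A policy that stays close to its reference has a small KL divergence.
reference = np.array([0.25, 0.25, 0.25, 0.25])
similar   = np.array([0.30, 0.25, 0.25, 0.20])
shifted   = np.array([0.85, 0.05, 0.05, 0.05])

print(kl_divergence(similar, reference))  # ~0.01 nats (small drift)
print(kl_divergence(shifted, reference))  # ~0.80 nats (large drift)
```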
3.2. Previous Works
The paper builds upon and differentiates itself from several key lines of research.
3.2.1. Reinforcement Learning from Human Feedback (RLHF)
RLHF is a technique to align LLMs with human values and preferences. The process typically involves:
- Collecting Human Preference Data: Humans are shown multiple model-generated responses to a prompt and asked to rank them from best to worst.
- Training a Reward Model: This preference data is used to train a separate neural network, the reward model ($r_{\phi}$), which learns to predict a scalar "reward" score reflecting human preference for a given input-output pair.
- Fine-tuning the LLM with RL: The LLM (the policy, $\pi_{\theta}$) is then fine-tuned using an RL algorithm such as Proximal Policy Optimization (PPO), with the reward model providing the reward signal. The LLM is encouraged to generate outputs that the reward model scores highly.

The optimization objective in RLHF is to maximize the expected reward while penalizing deviation from a reference policy ($\pi_{\text{ref}}$) to maintain model stability and diversity:
$ \max_{\pi_{\theta}} \; \mathbb{E}_{o \sim \pi_{\theta}(\cdot \mid q)}\left[ r_{\phi}(q, o) - \beta \, \mathrm{KL}\left[ \pi_{\theta}(o \mid q) \,\|\, \pi_{\text{ref}}(o \mid q) \right] \right] $
- $q$: The input prompt (query).
- $o$: The generated output.
- $\pi_{\theta}$: The LLM policy being trained.
- $r_{\phi}(q, o)$: The reward model's score for the output.
- $\pi_{\text{ref}}$: A reference policy (often the initial model before RL tuning).
- $\beta$: A hyperparameter controlling the strength of the KL divergence penalty.
Limitation: RLHF is expensive and slow due to its reliance on human annotators.
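In practice, PPO-style RLHF pipelines often realize this objective by subtracting a sequence-level KL estimate from the reward model's score. The function below is a rough sketch of that idea, not code from the paper; the tensor shapes and the single-sample KL estimator are assumptions.

```python
import torch

def kl_penalized_reward(
    reward_model_score: torch.Tensor,   # r_phi(q, o), shape: (batch,)
    logp_policy: torch.Tensor,          # log pi_theta(o_t | q, o_<t), shape: (batch, seq_len)
    logp_ref: torch.Tensor,             # log pi_ref(o_t | q, o_<t),   shape: (batch, seq_len)
    beta: float = 0.05,
) -> torch.Tensor:
    """Sequence-level objective value: reward minus a KL penalty toward the reference policy."""
    # Single-sample KL estimate between policy and reference, summed over the sequence.
    kl_estimate = (logp_policy - logp_ref).sum(dim=-1)
    return reward_model_score - beta * kl_estimate
```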
3.2.2. Reinforcement Learning with Verifiable Rewards (RLVR)
RLVR is an advancement that replaces the subjective, learned reward model of RLHF with an objective, automated verifier. This is suitable for tasks where correctness can be programmatically checked.
- For Math Problems: The verifier checks whether the model's final numerical answer matches the known correct answer.
- For Code Generation: The verifier runs the generated code against a set of unit tests.

The objective is similar to RLHF, but the learned reward model is replaced by a verifiable reward function $v(q, o)$:
$ \max_{\pi_{\theta}} \; \mathbb{E}_{o \sim \pi_{\theta}(\cdot \mid q)}\left[ v(q, o) - \beta \, \mathrm{KL}\left[ \pi_{\theta}(o \mid q) \,\|\, \pi_{\text{ref}}(o \mid q) \right] \right] $
- $v(q, o)$: The verifiable reward function, which returns a high value (e.g., 1) if the output is correct and a low value (e.g., 0) otherwise; a toy example is sketched below.

Limitation: RLVR is restricted to domains with available ground-truth solutions and automated verifiers. It is also outcome-oriented, rewarding only the final answer, which may not effectively teach the model the process of reasoning.
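For concreteness, a toy math verifier in the spirit of RLVR might look like the following. This is a hypothetical sketch, not the verifier used in the paper, and it assumes final answers are wrapped in \boxed{...}.

```python
import re

def math_reward(model_output: str, gold_answer: str) -> float:
    """Toy RLVR-style verifier: 1.0 if the final \\boxed{...} answer matches the gold answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    if not matches:
        return 0.0
    predicted = matches[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

# Reward is 1.0 only when the extracted final answer equals the reference answer.
print(math_reward(r"Step 1 ... so the answer is \boxed{42}", "42"))  # 1.0
print(math_reward(r"I think the result is \boxed{41}", "42"))        # 0.0
```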
3.2.3. Group Relative Policy Optimization (GRPO)
GRPO is a policy gradient algorithm that has become popular for RLVR. Instead of evaluating a single output, GRPO works with a group of outputs generated for the same prompt. Its key idea is to use the relative performance within the group to estimate the advantage of each action.
For a group of outputs, it normalizes the rewards, using the mean and standard deviation of rewards within the group to calculate an "advantage" for each output. This makes the learning signal more stable than using raw reward scores. Intuitor builds directly on this framework by substituting the external reward with its intrinsic self-certainty score.
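The group-relative advantage at the core of GRPO can be sketched in a few lines. This illustrative snippet (not the paper's code) assumes `rewards` holds one scalar score per sampled output for the same prompt; under Intuitor those scalars would be self-certainty scores rather than verifier outputs.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize a group of rewards (one per sampled output) to zero mean and unit variance."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four outputs sampled for the same prompt, scored by any reward function.
rewards = np.array([0.9, 0.4, 0.6, 0.1])
print(group_relative_advantages(rewards))  # outputs above the group mean get positive advantages
```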
3.2.4. Intrinsic Signals and Self-Play
The idea of using intrinsic signals is not entirely new. The paper situates itself within a growing body of work on autonomous model improvement:
- SPIN and Self-Rewarding LMs: These methods use the model itself to generate feedback for its own training, creating a self-improvement loop.
- Genius, TTRL, Absolute Zero: Concurrent works that also explore RL on unlabeled data, though the paper notes they are often constrained to specific tasks such as math.
- Entropy Minimization (EMPO, EM-RL): These methods use the model's predictive entropy as a reward signal, encouraging it to be "less uncertain." The paper contrasts its self-certainty metric with entropy, arguing that self-certainty is less prone to certain biases.
3.3. Technological Evolution
The field has evolved from purely supervised learning to more dynamic, feedback-driven approaches:
- Pre-training & Supervised Fine-Tuning (SFT): Models learn general knowledge from web-scale text and are then fine-tuned on task-specific examples. This is static and limited by the available labeled data.
- Reinforcement Learning from Human Feedback (RLHF): Introduced a dynamic loop, aligning models with human preferences, but at a high cost. This powered models like ChatGPT.
- Reinforcement Learning with Verifiable Rewards (RLVR): Increased scalability by automating the reward signal for certain domains, leading to powerful reasoning models like DeepSeek-R1.
- Reinforcement Learning from Internal Feedback (RLIF): The paradigm proposed in this paper, which aims for full autonomy by removing the need for any external verifier or data. It represents a shift from extrinsic to intrinsic motivation.
3.4. Differentiation Analysis
The core innovations of Intuitor compared to prior work are:
- vs. RLHF/RLVR: Intuitor is fully unsupervised. It requires no human feedback, no reward model, and no ground-truth answers or verifiers; it only needs a collection of prompts (e.g., math questions without solutions). This makes it vastly more scalable and applicable to any domain.
- vs. Outcome-based Rewards: RLVR typically uses outcome-based rewards (correct/incorrect final answer). Intuitor uses self-certainty, which is computed over the entire generation, token by token. This makes it a process-aware reward that encourages the model to produce a coherent, confident reasoning path rather than just a correct final answer, and it is a key reason for Intuitor's superior generalization.
- vs. Other Intrinsic Methods: While other methods use intrinsic signals such as entropy, Intuitor's self-certainty (the KL divergence from a uniform distribution to the model's predictive distribution) is claimed to be more robust against biases such as a preference for longer or shorter sequences. It is presented as a lightweight, simple, general-purpose signal.
4. Methodology
4.1. Principles
The core principle behind Intuitor is that a language model can improve its reasoning abilities by learning to become more "confident" in its own outputs. The intuition is that when an LLM generates a well-reasoned, logically consistent response, its internal probability distributions for each generated token will be sharper and more decisive. Conversely, when it is "confused" or generating flawed logic, its predictions will be more uncertain (i.e., the probability will be spread out over many possible next tokens).
By rewarding the model for generating responses that it is internally confident about, Intuitor creates a self-reinforcing loop. The model "practices" generating responses, and through reinforcement learning, it learns to favor generation trajectories that it finds more convincing. This process, in turn, is hypothesized to lead to objectively better and more structured reasoning.
4.2. Core Methodology In-depth
The methodology can be broken down into three main parts: the formalization of RLIF, the definition of the self-certainty reward signal, and its integration into the GRPO policy optimization algorithm.
4.2.1. Reinforcement Learning from Internal Feedback (RLIF)
The paper first defines the general RLIF paradigm, which reframes the standard RL objective for LLMs. Instead of an external reward from a learned reward model ($r_{\phi}(q, o)$) or a verifier ($v(q, o)$), RLIF uses an intrinsic signal $u(q, o)$ derived from the model's own computation.
The optimization objective for RLIF is given by:
$
\max_{\pi_{\theta}} \; \mathbb{E}_{o \sim \pi_{\theta}(\cdot \mid q)}\left[ u(q, o) - \beta \, \mathrm{KL}\left[ \pi_{\theta}(o \mid q) \,\|\, \pi_{\text{ref}}(o \mid q) \right] \right]
$
- $u(q, o)$: An intrinsic reward function that scores the quality of an output $o$ for a query $q$ based on the model's internal state.
- All other symbols ($\pi_{\theta}$, $\pi_{\text{ref}}$, $\beta$) retain their standard meanings from RLHF/RLVR.

The central challenge of RLIF is to find a suitable intrinsic signal $u(q, o)$ that correlates with desired behaviors such as correct reasoning.
4.2.2. Intuitor's Intrinsic Reward: Self-Certainty
Intuitor proposes using self-certainty as the intrinsic reward signal. Self-certainty is defined as the average KL divergence between a uniform distribution over the entire vocabulary and the model's predicted next-token probability distribution, for each token in the generated output.
The formula for self-certainty is:
$
\text{Self-certainty}(o \mid q) := \frac{1}{|o|} \sum_{i=1}^{|o|} \mathrm{KL}\left( U \,\|\, p_{\pi_{\theta}}\left(\cdot \mid q, o_{<i}\right) \right)
$
Let's break this down:
- $o$: The sequence of generated tokens, $o = (o_1, o_2, \ldots, o_{|o|})$.
- $|o|$: The length of the generated output.
- $o_{<i}$: The sequence of tokens generated before step $i$, i.e., $(o_1, \ldots, o_{i-1})$.
- $p_{\pi_{\theta}}(\cdot \mid q, o_{<i})$: The model's predicted probability distribution over the entire vocabulary for the next token at step $i$, given the query $q$ and the preceding tokens $o_{<i}$.
- $U$: A uniform probability distribution over the vocabulary $\mathcal{V}$. If the vocabulary has $|\mathcal{V}|$ tokens, then $U(j) = 1/|\mathcal{V}|$ for every token $j$.
- $\mathrm{KL}(U \,\|\, p_{\pi_{\theta}})$: The KL divergence between the uniform distribution and the model's predictive distribution.

A higher self-certainty score indicates that the model's predictive distribution is far from uniform, meaning it is highly peaked on a few tokens; this reflects high confidence. A low score means the distribution is closer to uniform, indicating uncertainty.

The paper also provides an expanded form of the formula which clarifies the calculation:
$
\text{Self-certainty}(o \mid q) = -\frac{1}{|o| \cdot |\mathcal{V}|} \sum_{i=1}^{|o|} \sum_{j=1}^{|\mathcal{V}|} \log \left( |\mathcal{V}| \cdot p_{\pi_{\theta}}\left(j \mid q, o_{<i}\right) \right)
$
where the inner sum runs over all tokens $j$ in the vocabulary $\mathcal{V}$. This form highlights that the metric considers the entire probability distribution at each step, not just the probability of the token that was actually chosen.
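Working directly from this definition, self-certainty can be computed from the per-position next-token logits. The snippet below is an illustrative sketch, not the authors' implementation; it assumes the logits for the generated positions are already available as a single tensor.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """
    Average KL(U || p) over the generated positions.

    logits: (seq_len, vocab_size) next-token logits at each generated position.
    Returns a scalar; higher means the predictive distributions are more peaked.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)                # log p(j | q, o_<i)
    # KL(U || p) = -log|V| - (1/|V|) * sum_j log p(j), computed per position.
    kl_per_step = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_step.mean()

# Toy check: a peaked distribution yields higher self-certainty than a flat one.
peaked = torch.tensor([[10.0, 0.0, 0.0, 0.0]])  # nearly all mass on one token
flat   = torch.tensor([[0.0, 0.0, 0.0, 0.0]])   # uniform over 4 tokens
print(self_certainty(peaked) > self_certainty(flat))  # tensor(True)
```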
4.2.3. Implementation via Group Relative Policy Optimization (GRPO)
Intuitor is implemented by plugging the self-certainty reward into the GRPO algorithm. The overall process is illustrated in Figure 2 of the paper.
The Intuitor training pipeline for a single update step is as follows:
- Sample Generation: For a given prompt $q$, generate a group of $G$ candidate outputs, $\{o_1, o_2, \ldots, o_G\}$, using the current policy (or a slightly older snapshot of it).
- Intrinsic Reward Calculation: For each generated output $o_i$ in the group, calculate its self-certainty score. This score becomes the intrinsic reward $u_i$.
- Advantage Estimation: The key change in Intuitor happens here: the external, verifiable rewards of GRPO are replaced by the intrinsic self-certainty scores. The advantage for each output is calculated by normalizing its reward relative to the other rewards in the group, which stabilizes the learning signal. The advantage for each token $t$ in output $o_i$ is computed as:
$ \hat{A}_{i, t} = \frac{u_i - \operatorname{mean}\left(\{u_1, u_2, \cdots, u_G\}\right)}{\operatorname{std}\left(\{u_1, u_2, \cdots, u_G\}\right)} $
  - $u_i$: The self-certainty score of the $i$-th output.
  - $\operatorname{mean}$ and $\operatorname{std}$: The mean and standard deviation of the self-certainty scores across the entire group of outputs.
  - Note: In this formulation, the advantage is constant for all tokens within a given output $o_i$; it is a sequence-level advantage.
- Policy Update: The model's policy is then updated using the GRPO objective, which increases the likelihood of generating outputs with higher relative self-certainty. The full GRPO objective being optimized is:
$ \mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min \left[ c_{i,t}(\theta) \hat{A}_{i,t},\; \operatorname{clip}\left(c_{i,t}(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_{i,t} \right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right) \right) \right] $
  - $c_{i,t}(\theta)$: The importance sampling ratio, comparing the probability of generating token $o_{i,t}$ under the current policy $\pi_{\theta}$ versus the policy that generated the samples, $\pi_{\theta_{\text{old}}}$.
  - $\hat{A}_{i,t}$: The advantage calculated in the previous step using self-certainty.
  - $\operatorname{clip}(\cdot, 1-\epsilon, 1+\epsilon)$: A clipping function (from PPO) that limits how much the policy can change in one update, preventing instability; $\epsilon$ is a small hyperparameter (e.g., 0.2).
  - $\beta\, \mathbb{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}})$: The KL divergence penalty that keeps the policy from straying too far from the reference model.
The following diagram illustrates the Intuitor workflow:

(Figure: a schematic of the Intuitor pipeline. Starting from an input query q, the policy model generates multiple outputs, self-certainty scores are computed for each, converted into rewards and normalized, and the resulting advantages are used to train the model.)

In essence, Intuitor reuses the full GRPO machinery but fuels it with a completely internal, self-generated reward signal, making the entire learning process unsupervised. For intuition, the clipped surrogate term at the heart of this update is sketched below.
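This sketch is illustrative rather than the paper's training code: it computes the PPO-style clipped objective from per-token log-probabilities and sequence-level advantages, omitting the KL penalty and any padding masks a full implementation would need; the tensor shapes are assumptions.

```python
import torch

def grpo_clipped_surrogate(
    logp_new: torch.Tensor,    # log pi_theta(o_t),     shape: (G, T)
    logp_old: torch.Tensor,    # log pi_theta_old(o_t), shape: (G, T)
    advantages: torch.Tensor,  # sequence-level advantages, shape: (G,)
    eps: float = 0.2,
) -> torch.Tensor:
    """PPO-style clipped objective averaged over tokens and the group (to be maximized)."""
    adv = advantages.unsqueeze(-1)                 # broadcast to every token in the sequence
    ratio = torch.exp(logp_new - logp_old)         # importance sampling ratio c_{i,t}
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return torch.minimum(unclipped, clipped).mean()
```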
5. Experimental Setup
5.1. Datasets
The experiments use a combination of mathematical reasoning, code generation, and general instruction-following datasets for training and evaluation.
- Training Datasets:
  - MATH dataset: A challenging dataset of 12,500 math problems from high school competitions. The training split, containing 7,500 problems, was used for the main experiments. Crucially, Intuitor only used the problem statements (the questions), not the solutions.
  - Codeforces dataset: A dataset of competitive programming problems used to train the Intuitor-Code variant to assess generalization to a different domain.
- Evaluation Datasets:
  - MATH500 & GSM8K: Standard benchmarks for mathematical reasoning. MATH500 is a test subset of the MATH dataset; GSM8K consists of grade-school math word problems. These are in-domain benchmarks for models trained on MATH.
  - LiveCodeBench (LCB) & CRUXEval-O: Benchmarks for code generation and code reasoning. These are out-of-domain benchmarks used to test generalization.
  - MMLU-Pro: A more robust and challenging version of the popular MMLU benchmark, measuring multitask language understanding across various subjects.
  - AlpacaEval 2.0: A benchmark for evaluating an LLM's ability to follow general instructions. It compares model outputs to a reference model's outputs using a powerful LLM-as-a-judge (GPT-4.1).

To illustrate the nature of the coding tasks, here is a problem from the LiveCodeBench dataset, as shown in the paper's appendix.

- Problem Example (LiveCodeBench):
  Question: You are given a 0-indexed array of strings details. Each element provides information about a passenger... The first ten characters consist of the phone number... The next character denotes the gender... The following two characters are used to indicate the age... Return the number of passengers who are strictly more than 60 years old.
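For reference, here is a minimal solution sketch for this example task (an illustration, not taken from the paper or the benchmark's reference solution), assuming the two age characters sit at string positions 11 and 12:

```python
from typing import List

def count_seniors(details: List[str]) -> int:
    """Count passengers whose age (characters at indices 11-12) is strictly greater than 60."""
    return sum(1 for d in details if int(d[11:13]) > 60)

# Example usage with made-up records: phone (10 chars) + gender (1 char) + age (2 chars) + seat.
print(count_seniors(["7868190130M7522", "5303914400F9211", "9273338290F4010"]))  # 2
```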
5.2. Evaluation Metrics
The paper uses standard metrics for each task domain.
- Accuracy: Used for the math and code benchmarks (GSM8K, MATH500, LCB, CRUXEval-O). This metric measures the percentage of problems the model solves correctly. For coding tasks, this is often a pass@1 score, where the model's first generated solution must pass all hidden unit tests.
  - Conceptual Definition: Measures the fraction of correct predictions out of the total predictions.
  - Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: The count of test samples for which the model's output was deemed correct by the verifier.
    - Total Number of Predictions: The total number of samples in the test set.
- AlpacaEval Win Rate: Used to evaluate general instruction-following ability.
  - Conceptual Definition: This metric is based on pairwise comparison. The model's response to a prompt is compared against the response from a baseline model (e.g., text-davinci-003). A powerful, automated judge (such as GPT-4) determines which response is better, and the win rate is the percentage of times the evaluated model's response was judged better. The paper specifically uses a "length-controlled" version to mitigate the bias of LLM judges toward longer answers.
  - Mathematical Formula: $ \text{Win Rate} = \frac{\text{Number of Wins}}{\text{Total Number of Comparisons}} $
  - Symbol Explanation:
    - Number of Wins: The count of prompts for which the model's output was preferred over the baseline's output.
    - Total Number of Comparisons: The total number of prompts in the evaluation set.
5.3. Baselines
The primary models compared are:
- Base Model: The original, pre-trained LLM (Qwen2.5-1.5B, Qwen2.5-3B, etc.) without any reinforcement learning fine-tuning. This shows the starting performance.
- GRPO: The same base model fine-tuned with the GRPO algorithm using verifiable rewards. This is the main supervised baseline. For the MATH dataset, the reward is based on whether the final answer matches the gold solution. This represents the standard RLVR approach.
- GRPO-PV: A GRPO variant using plurality voting as a reward signal, as a proxy for ground truth. For a given prompt, multiple answers are generated and the most frequent answer is assumed to be correct. This is a weaker, pseudo-supervised baseline that does not require pre-annotated gold solutions but still relies on an external consensus mechanism.
- Intuitor: The proposed method, which fine-tunes the base model with GRPO using only the intrinsic self-certainty reward. This is the fully unsupervised approach.

The comparison between Intuitor and GRPO is the most critical, as it directly pits a fully unsupervised method against a supervised one using the same RL algorithm.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results strongly support the paper's hypotheses. The key findings are presented in Table 1, which compares different methods across multiple benchmarks.
The following are the results from Table 1 of the original paper:
| Model | Training Data | GSM8K | MATH500 | LCB | CRUX | MMLU-Pro | AlpacaEval |
|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B Results | | | | | | | |
| Base | - | 0.002 | 0.090 | 0.000 | 0.000 | 0.297 | 2.10 |
| + GRPO | MATH | 0.747 | 0.560 | 0.056 | 0.328 | 0.315 | 4.03 |
| + Intuitor | MATH | 0.711 | 0.530 | 0.099 | 0.296 | 0.310 | 4.28 |
| Qwen2.5-3B Results | | | | | | | |
| Base | - | 0.673 | 0.544 | 0.093 | 0.236 | 0.377 | 3.72 |
| + GRPO | MATH | 0.826 | 0.636 | 0.085 | 0.341 | 0.403 | 6.91 |
| + GRPO-PV | MATH | 0.820 | 0.636 | 0.086 | 0.299 | 0.398 | 6.17 |
| + Intuitor | MATH | 0.792 | 0.612 | 0.153 | 0.416 | 0.379 | 7.10 |
| + Intuitor-Code | Codeforces | 0.743 | 0.572 | 0.153 | 0.411 | 0.386 | 4.16 |
Analysis:
- Comparable In-Domain Performance: On the mathematical reasoning benchmarks GSM8K and MATH500 (in-domain for models trained on MATH), Intuitor performs nearly as well as the supervised GRPO method. For the Qwen2.5-3B model, Intuitor scores 79.2% on GSM8K and 61.2% on MATH500, very close to GRPO's 82.6% and 63.6%. This is a remarkable result, as Intuitor achieves it without access to any correct answers.
- Superior Out-of-Domain Generalization: The most striking result is on the out-of-domain coding tasks. When trained only on math problems, the Qwen2.5-3B model fine-tuned with Intuitor sees its LiveCodeBench (LCB) score jump from 9.3% to 15.3% (a 65% relative improvement). In contrast, the GRPO-tuned model's score decreases from 9.3% to 8.5%. A similar trend is observed on CRUXEval-O, where Intuitor achieves a 76% relative gain compared to 44% for GRPO. This suggests that by optimizing for an internal, process-aware signal (self-certainty), Intuitor learns a more generalizable structured-reasoning ability, whereas GRPO, by focusing only on the final math answer, overfits to the training domain.
- Enhanced Instruction Following: On AlpacaEval, Intuitor consistently outperforms GRPO. For the 3B model, it achieves a win rate of 7.10, higher than GRPO's 6.91; for the 1.5B model, the gap is slightly larger (4.28 vs. 4.03). This shows that self-reinforcement of confidence also improves the model's general ability to follow instructions.
- Fostering Structured Reasoning: The paper observes that Intuitor-trained models learn to produce more structured, long-form reasoning. As shown in the line charts below, Intuitor leads to significantly longer responses during training, which correlates with the emergence of step-by-step reasoning chains.
(Figure: two line charts showing how completion length changes with training steps for GRPO, Intuitor, and GRPO-PV on Qwen models of different parameter scales.)
This emergent behavior is further analyzed in Figure 5 of the paper, which shows an Intuitor-trained model spontaneously producing free-form reasoning before providing the final answer in the required JSON format, a behavior not seen in the GRPO-tuned model.
(Figure: a schematic from the paper comparing the dominant output format under GRPO with that under Intuitor, showing how the two differ in expressing the JSON structure required by the problem description.)
6.2. Ablation Studies / Parameter Analysis
6.2.1. Online vs. Offline Self-Certainty
A critical experiment in Section 5.4 investigates the stability of the self-certainty reward. The authors compare using an online annotator (where self-certainty is calculated by the continually updated policy model) against an offline annotator (where self-certainty is calculated by the fixed base model from before RL training).