Process Reinforcement through Implicit Rewards
TL;DR Summary
The paper introduces PRIME, which enhances reinforcement learning for large language models by using implicit rewards for online process reward model updates. PRIME significantly improves performance while efficiently addressing the costs of label collection and the risk of reward hacking.
Abstract
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Process Reinforcement Through Implicit Rewards," introducing a novel framework called PRIME.
1.2. Authors
The paper lists numerous authors from various institutions, including:
- Shanghai AI Lab
- Peking University
- Shanghai Jiaotong University
- CUHK

The corresponding authors are Ganqu Cui (cuiganqu@pjlab.org.cn) and Ning Ding.
1.3. Journal/Conference
The paper is a preprint published on arXiv, with a listed submission date of 2025-02-03. As a preprint, it has not yet undergone formal peer review or been accepted by a specific conference or journal, although arXiv is a highly respected platform for disseminating cutting-edge research in fields like AI.
1.4. Publication Year
2025
1.5. Abstract
This paper addresses the challenge of scaling reinforcement learning (RL) for large language models (LLMs) using dense process rewards. While dense rewards offer advantages over sparse outcome-level rewards—such as improved training efficiency and credit assignment in complex multi-step reasoning tasks—their adoption has been limited due to the high cost of collecting high-quality process labels for online process reward model (PRM) updates, making PRMs vulnerable to reward hacking.
To overcome these issues, the authors propose PRIME (Process Reinforcement through IMplicit rEwards). PRIME enables online PRM updates using only policy rollouts and outcome labels by leveraging implicit process rewards. This framework is compatible with various advantage functions and eliminates the need for a dedicated reward model training phase, significantly reducing development overhead.
The authors demonstrate PRIME's effectiveness on competitive math and coding tasks. Starting from Qwen2.5-Math-7B-Base, PRIME achieves an average improvement of 15.1% across several key reasoning benchmarks compared to the supervised fine-tuning (SFT) model. Notably, their resulting model, Eurus-2-7B-PRIME, outperforms Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks using only 10% of its training data.
1.6. Original Source Link
https://arxiv.org/abs/2502.01456v2 PDF Link: https://arxiv.org/pdf/2502.01456v2.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the difficulty in effectively applying dense process rewards to the reinforcement learning (RL) of large language models (LLMs) for complex reasoning tasks.
This problem is important because:
- Limitations of Sparse Outcome Rewards: Current state-of-the-art LLMs primarily rely on sparse outcome-level rewards, which only provide feedback at the end of an entire generated sequence. This approach suffers from several issues:
- Training Inefficiency: Learning is slow as feedback is infrequent.
- Credit Assignment Problem: It's hard to determine which intermediate steps contributed positively or negatively to the final outcome, especially in multi-step reasoning.
- Encouraging Spurious Solutions: Models might learn incorrect reasoning processes that coincidentally lead to correct answers.
- Potential of Dense Process Rewards: Dense process rewards, which offer feedback at each intermediate step (e.g., token-level), have shown promise in improving inference-time performance for LLMs on reasoning tasks. In principle, they should address the issues of sparse rewards by providing fine-grained feedback, leading to better training efficiency and credit assignment in RL.
- Challenges in Incorporating Dense Rewards for RL: Despite their potential, successful applications of dense rewards in RL for LLMs are limited due to three main challenges:
  - Difficulty in Defining Process Rewards (C1): It is hard to collect step-level labels for complex reasoning, and annotating rewards for every token is prohibitively expensive and ambiguous.
  - Scalability of PRM Online Updates (C2): To prevent reward hacking (over-optimization of a static reward model), process reward models (PRMs) need to be updated online. However, this typically requires extensive, nuanced step-level annotations on new policy rollouts, which is infeasible at scale.
  - Extra Cost of Explicit Reward Modeling (C3): Training PRMs usually involves a costly, dedicated data collection and training phase to ensure generalization, adding significant overhead.

The paper's entry point and innovative idea revolve around making dense process rewards scalable and practical for online RL. It leverages the concept of implicit process reward modeling to derive token-level rewards using only outcome labels, thereby circumventing the annotation bottleneck and enabling online PRM updates.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Proposing PRIME (Process Reinforcement through IMplicit rEwards): A novel and scalable framework that enhances LLM reasoning capabilities through efficient online reinforcement learning with dense token-level rewards.
- Enabling Online PRM Updates with Outcome Labels: PRIME utilizes implicit process reward modeling to train dense reward models using only outcome-level labels. This fundamentally addresses C1 and C2 by enabling online updates of the PRM with policy rollouts and outcome supervision, mitigating reward hacking without requiring expensive step-level annotations.
- Eliminating Dedicated Reward Model Training: PRIME simplifies the development process by initializing the Implicit PRM directly from the supervised fine-tuning (SFT) model or even the base model, removing the need for a separate, costly reward model training phase (C3).
- General Framework for Reward Fusion: PRIME provides a general method to combine token-level dense rewards and sparse outcome rewards by calculating their returns separately and summing them. This design makes it compatible with various RL algorithms (e.g., REINFORCE, RLOO, GRPO, PPO).
- Demonstrated Effectiveness and Efficiency:
  - PRIME achieved a 15.1% average improvement over the SFT model on several key reasoning benchmarks (competitive math and coding).
  - The resulting Eurus-2-7B-PRIME model surpassed Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with only 10% of its training data, showcasing significant sample efficiency.
  - Compared to RLOO with outcome rewards only, PRIME demonstrated a 2.5x sample efficiency gain and a 6.9% performance improvement on challenging math problems.
  - PRIME consistently boosted the performance and efficiency of other RL algorithms (REINFORCE, GRPO, PPO).
  - The analysis showed that online PRM updates are crucial for success, preventing over-optimization and reward hacking.

These findings address reward sparsity, training inefficiency, and the credit assignment problem in LLM RL by providing a scalable and effective method for incorporating dense, fine-grained rewards.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand PRIME, a reader needs a grasp of core concepts in large language models and reinforcement learning.
- Large Language Models (LLMs): Neural networks, typically based on the Transformer architecture, trained on vast amounts of text to generate and reason over natural language. They operate in an autoregressive manner, predicting the next token (word or sub-word unit) based on the preceding tokens.
- Reinforcement Learning (RL): A paradigm where an agent learns to make optimal decisions by interacting with an environment. The agent performs actions in states, receives rewards or penalties, and aims to maximize its cumulative reward over time.
  - Policy ($\pi_\theta$): The agent's strategy, mapping states to actions. In LLMs, the policy $\pi_\theta$ is typically the language model itself, parameterized by $\theta$, which outputs probabilities for the next token given the previous sequence.
  - Environment: For LLMs, the environment is the task (e.g., math problem, coding task) and the feedback mechanism (e.g., a verifier checking the final answer).
  - State: The current context, i.e., the input prompt plus the sequence of tokens generated so far.
  - Action: Generating the next token $y_t$.
  - Reward ($r$): A scalar feedback signal indicating the desirability of an action or sequence of actions.
  - Return ($G$): The total cumulative discounted reward from a time step to the end of an episode, $G_t = \sum_{s=t}^T \gamma^{s-t} r(y_s)$, where $\gamma$ is a discount factor that weighs immediate rewards more heavily than future ones.
- Policy Gradient: A family of RL algorithms that directly optimize the policy by estimating the gradient of the expected return with respect to the policy parameters. The general policy gradient theorem states that:
  $ \nabla_{\theta} J(\theta) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}, \mathbf{y} \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(y_t | \mathbf{y}_{<t}) A_t \right] $
  Here, $J(\theta)$ is the objective function (expected cumulative reward), $\mathcal{D}$ is the data distribution (prompts), $\pi_{\theta}(y_t | \mathbf{y}_{<t})$ is the probability of choosing token $y_t$ given the previous tokens under the policy $\pi_{\theta}$, and $A_t$ is the advantage function. (A minimal code sketch of this estimator appears at the end of this list.)
- Advantage Function ($A_t$): Quantifies how much better a specific action in a state is compared to the average expected outcome from that state. It helps reduce variance in policy gradient estimation. A common form is the Monte-Carlo (MC) advantage estimate:
  $ A_t = \sum_{s=t}^{T} \gamma^{s-t} r(y_s) - b $
  Here, $\sum_{s=t}^T \gamma^{s-t} r(y_s)$ is the actual return received from step $t$, and $b$ is a baseline (e.g., a value estimate) subtracted to reduce variance without changing the expected gradient.
- Value Models ($V$): Neural networks trained to predict the expected future return from a given state, $V(\mathbf{y}_{<t})$. They are used as baselines to reduce the variance of Monte-Carlo advantage estimates.
  - Generalized Advantage Estimation (GAE): A method that combines Monte-Carlo estimates (low bias, high variance) with Temporal Difference (TD) estimates (high bias, low variance) to achieve a good bias-variance trade-off. The TD error $\delta_t = r(y_t) + \gamma V(\mathbf{y}_{<t+1}) - V(\mathbf{y}_{<t})$ measures the discrepancy between the actual reward plus the predicted value of the next state and the predicted value of the current state.
- Proximal Policy Optimization (PPO): A widely used actor-critic RL algorithm. The "actor" is the policy network ($\pi_\theta$) that generates actions, and the "critic" is a value network ($V$) that estimates state values. PPO updates the policy in multiple small steps, clipping the objective to prevent large policy updates that could destabilize training, and uses GAE for advantage estimation.
- Reward Sparsity: A common issue in RL where rewards are only provided infrequently (e.g., at the end of a long sequence). This makes learning difficult, as the agent receives little feedback during intermediate steps, exacerbating the credit assignment problem.
- Reward Hacking / Overoptimization: Occurs when an RL agent finds loopholes in the reward function, maximizing the reward signal without actually achieving the intended goal. This is especially problematic with static reward models, as the policy can drift into regions where the reward model assigns high scores to unintended behavior, creating a distribution shift between the policy's generated data and the reward model's training data.
- Supervised Fine-tuning (SFT): A common step before RL, where a base model is fine-tuned on a dataset of high-quality instruction-following examples (input-output pairs). This teaches the model basic reasoning abilities and desired output formats before RL.
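To make the policy-gradient and Monte-Carlo advantage definitions above concrete, here is a minimal, self-contained PyTorch sketch (not the paper's code; the toy vocabulary size, sequence length, and reward placement are illustrative assumptions):

```python
import torch

def discounted_returns(rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    # G_t = sum_{s >= t} gamma^(s-t) * r(y_s), computed right-to-left for one trajectory.
    returns = torch.zeros_like(rewards)
    running = torch.tensor(0.0)
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_loss(token_logprobs: torch.Tensor, rewards: torch.Tensor,
                   gamma: float = 1.0, baseline: float = 0.0) -> torch.Tensor:
    # Monte-Carlo policy-gradient surrogate: -sum_t log pi_theta(y_t | y_<t) * A_t,
    # with A_t = G_t - b. Detaching A_t keeps gradients on the log-probabilities only.
    advantages = discounted_returns(rewards, gamma) - baseline
    return -(token_logprobs * advantages.detach()).sum()

# Toy usage: 5 generated tokens over a 10-token vocabulary, with a sparse
# outcome reward of 1.0 assigned only at the final token.
logits = torch.randn(5, 10, requires_grad=True)
tokens = torch.randint(0, 10, (5,))
token_logprobs = torch.log_softmax(logits, dim=-1)[torch.arange(5), tokens]
rewards = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0])
reinforce_loss(token_logprobs, rewards).backward()
```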
3.2. Previous Works
The paper frames its contribution by contrasting with existing approaches to handling rewards in LLM RL:
- Sparse Outcome Rewards: Most industry-leading LLMs rely on outcome reward models (ORMs) that provide a single scalar reward at the end of a generated sequence (e.g., Rafailov et al., 2023; Shao et al., 2024; DeepSeek-AI et al., 2025). While simpler, this approach suffers from the reward sparsity, training inefficiency, and credit assignment problems described above. Some attempts to mitigate sparsity with value models in PPO have shown limited effectiveness due to training challenges (Shao et al., 2024; Ahmadian et al., 2024).
- Dense Process Rewards (Traditional): The concept of dense process rewards (feedback at intermediate steps) is not new and has proven effective in inference-time scaling for LLMs on reasoning tasks (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023; Yuan et al., 2024b). However, their application in training LLMs with RL faces significant challenges:
  - Human Annotation Pipelines: Lightman et al. (2023) utilized complex human annotation pipelines to gather step-level labels for training PRMs. This is expensive and not scalable for online RL updates.
  - Estimation-based Methods: Other methods rely on estimating process rewards, requiring a large number of rollouts (e.g., 10x more for each step) compared to response-level trajectories (Wang et al., 2023; Kazemnejad et al., 2024), making them computationally intensive and less scalable for online PRM updates.
- Implicit Rewards: This line of work, particularly in LLM alignment, has shown that reward functions can be implicitly learned. Rafailov et al. (2024) demonstrated that optimizing the Direct Preference Optimization (DPO) objective implicitly learns a Q-function. Zhou et al. (2024) utilized implicit rewards within PPO and highlighted the effectiveness of dense implicit rewards.
  - Implicit Process Reward Modeling (Yuan et al., 2024b): This work is a direct precursor to PRIME. It proposes training an ORM (with outcome labels) that can be repurposed as a PRM at inference time. The core idea is that the reward can be represented as a log-ratio of probabilities between a reward model and a reference model: $ r(\mathbf{y}) = \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_{\mathrm{ref}}(\mathbf{y})} $, where $\pi_\phi(\mathbf{y})$ is the probability of the sequence under the reward model and $\pi_{\mathrm{ref}}(\mathbf{y})$ is that under a reference model. This formulation allows deriving token-level rewards from a model trained on sequence-level preferences. PRIME builds directly on this formulation for its Implicit PRM.
3.3. Technological Evolution
The field of LLM reinforcement learning has evolved significantly:
- Early Alignment (RLHF): Initial applications of RL for LLMs focused on human alignment, primarily using Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022; Cui et al., 2024). This typically involved training an ORM from human preferences over complete responses, and then fine-tuning the LLM with PPO using these sparse rewards.
- Imitation Learning for Reasoning: For LLM reasoning, many open-source efforts initially relied on imitation learning (Yuan et al., 2024a; Yue et al., 2024; Wei et al., 2024; Liu et al., 2024), where models learn by mimicking expert-demonstrated reasoning steps.
- Large-scale RL for Reasoning: More recently, the paradigm has shifted towards large-scale RL for reasoning LLMs, with works like OpenAI o1 (Jaech et al., 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025) demonstrating the immense potential of RL with outcome rewards. These works highlight the scaling effects but also acknowledge the limitations of outcome-only feedback.
- Emergence of Dense Rewards in Training: PRIME represents a crucial step in this evolution by directly tackling the challenge of incorporating dense process rewards into the training of LLMs for reasoning, moving beyond inference-time applications. It builds on the implicit reward paradigm to make this scalable.
3.4. Differentiation Analysis
Compared to main methods in related work, PRIME introduces several core differences and innovations:
- Online PRM Updates without Step-level Labels: The most significant differentiation is PRIME's ability to update process reward models (PRMs) online during RL training using only outcome-level labels. This directly contrasts with traditional PRMs that require expensive and difficult-to-collect step-level annotations (e.g., Lightman et al., 2023) or computationally intensive estimation methods (Wang et al., 2023; Kazemnejad et al., 2024). This solves the scalability issue (C2) and the difficulty of defining process rewards (C1).
- Elimination of Dedicated RM Training Phase: Unlike existing RLHF pipelines that require a separate, costly phase to train a reward model (RM) or PRM (C3), PRIME initializes its Implicit PRM directly from the SFT model or even the base model. This substantially reduces development overhead and time.
- Token-level Fine-grained Rewards from Outcome Labels: PRIME leverages implicit process reward modeling (Yuan et al., 2024b) to derive dense, token-level rewards from models trained solely on outcome labels. This provides finer granularity than typical step-level PRMs without additional annotation cost or ambiguity.
- General Compatibility with RL Algorithms: PRIME integrates dense and sparse rewards in a flexible manner by computing their returns separately before summing, making it a general plug-in for various Monte Carlo (MC) advantage estimators and RL algorithms (REINFORCE, RLOO, GRPO, PPO), unlike methods tied to specific actor-critic architectures.
- Mitigation of Reward Hacking: By enabling online updates of the PRM with on-policy rollouts, PRIME inherently mitigates the reward hacking and over-optimization issues that plague static reward models due to distribution shift.

In essence, PRIME bridges the gap between the theoretical benefits of dense rewards and the practical challenges of their implementation in large-scale LLM RL, offering a scalable, efficient, and general solution.
4. Methodology
4.1. Principles
The core idea behind PRIME is to leverage implicit process reward modeling to generate dense, token-level rewards that can be updated online using only outcome-level supervision. This overcomes the major hurdles of conventional dense PRMs, such as the prohibitive cost of collecting step-level labels and the vulnerability to reward hacking from static reward models. By treating the Implicit PRM as a causal language model, PRIME can derive token-level rewards from a model trained on sequence-level outcomes, making it scalable for online RL and compatible with various advantage functions. The framework integrates these dense process rewards with traditional sparse outcome rewards to provide comprehensive feedback for policy optimization.
4.2. Core Methodology In-depth (Layer by Layer)
The PRIME framework is designed as a scalable online RL method with dense rewards. It integrates the concept of implicit process rewards with a flexible advantage estimation and policy update mechanism. The overall workflow is illustrated in Figure 1 and detailed in Algorithm 1.
The process flows through several stages, iteratively refining the policy model and the Implicit PRM:
4.2.1. Initialization
The first step involves initializing the key components:
- The policy model ($\pi_\theta$) and its old version ($\pi_{\theta_{\mathrm{old}}}$) are initialized from a pre-trained language model, typically a supervised fine-tuning (SFT) model.
- The Implicit PRM ($\pi_\phi$) and the reference model ($\pi_{\mathrm{ref}}$) are also initialized from the same SFT model or even a base model. This is a key PRIME innovation, eliminating a dedicated PRM training phase.

The following figure (Figure 1 from the original paper) shows the illustration of PRIME:

[Figure 1 is a schematic of the PRIME workflow, showing the policy model, the Implicit PRM, and the outcome verifier. Starting from an input prompt, responses are generated and evaluated by the outcome verifier; based on the process rewards and outcome accuracy, both the policy model and the Implicit PRM are updated, with the feedback loops between these steps clearly depicted.]
Figure 1: Illustration of PRIME. PRIME follows that (1) initialize policy model and the Implicit PRM both with the reference model; (2) sample multiple responses for each prompt and filter with output accuracy; (3) obtain implicit process rewards by the Implicit PRM and update it using cross-entropy (CE) loss; (4) compute advantage and policy loss then update the policy model.
4.2.2. Policy Rollouts
For each RL iteration (Step 2 of Algorithm 1):
- Sample Prompts (Algorithm 1, Step 3): A batch of prompts is sampled from the dataset $\mathcal{D}$.
- Generate Responses (Algorithm 1, Step 4): For each prompt $\mathbf{x}$, the current policy model generates $K$ responses $\{\mathbf{y}^1, \dots, \mathbf{y}^K\}$. These are complete trajectories (sequences of tokens).
- Compute Outcome Rewards (Algorithm 1, Step 5): An outcome verifier (a rule-based function or a reward model that scores the entire generated response) computes the outcome reward $r_o(\mathbf{y}^i)$ for each of the $K$ responses. As discussed in Section 5.2, for math this is 1 for an exact match to the ground truth and 0 otherwise; for coding, it is the proportion of passing test cases.
- Apply Accuracy Filter (Algorithm 1, Step 6): An online prompt filtering technique is applied, retaining only prompts whose response accuracy falls within a certain range (i.e., median-level difficulty). This helps balance the data distribution for Implicit PRM training and stabilizes RL training by focusing on useful examples (as shown in Figure 2; a minimal sketch of this filter follows the figure below). The filtered set of (prompt, response, outcome reward) triplets is denoted as $\mathcal{T}$.

The following figure (Figure 2 from the original paper) shows the effect of online prompt filtering:
[Figure 2 is a line chart of outcome training reward versus steps, with (blue) and without (orange) the prompt filter; the filtered run maintains consistently higher rewards as training progresses.]
Figure 2: Effect of online prompt filtering.
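A minimal sketch of this accuracy-based filter is shown below (the (0.2, 0.8) accuracy range and the data layout are illustrative assumptions, not the paper's exact implementation):

```python
from typing import Dict, List, Tuple

def online_prompt_filter(
    rollouts: Dict[str, List[Tuple[str, float]]],   # prompt -> [(response, outcome_reward), ...]
    low: float = 0.2,
    high: float = 0.8,
) -> List[Tuple[str, str, float]]:
    # Keep only prompts whose rollout accuracy falls inside (low, high): problems the
    # current policy neither always solves nor always fails, so the retained batch
    # contains both correct and incorrect responses for Implicit PRM training.
    kept = []
    for prompt, samples in rollouts.items():
        accuracy = sum(reward for _, reward in samples) / len(samples)
        if low < accuracy < high:
            kept.extend((prompt, response, reward) for response, reward in samples)
    return kept
```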
4.2.3. Implicit Process Reward Modeling and Update
This is where PRIME addresses the scalability challenges of dense rewards.
- Obtain Implicit Process Rewards (Algorithm 1, Step 7): For each $(\mathbf{x}, \mathbf{y})$, the Implicit PRM ($\pi_\phi$) and the reference model ($\pi_{\mathrm{ref}}$) are used to calculate token-level dense rewards for each token $y_t$ in the sequence $\mathbf{y}$. The calculation is based on the log-ratio of token probabilities:
  $ r_{\phi}(y_t) := \beta \log \frac{\pi_{\phi}(y_t | \mathbf{y}_{<t})}{\pi_{\mathrm{ref}}(y_t | \mathbf{y}_{<t})} $
  - $r_{\phi}(y_t)$: The implicit process reward for generating token $y_t$ at time step $t$.
  - $\beta$: A hyperparameter that controls the magnitude of the implicit rewards.
  - $\pi_{\phi}(y_t | \mathbf{y}_{<t})$: The probability of generating token $y_t$ given the preceding sequence under the Implicit PRM.
  - $\pi_{\mathrm{ref}}(y_t | \mathbf{y}_{<t})$: The probability of generating token $y_t$ given the preceding sequence under the reference model.
  - $\log$: The natural logarithm.

  This formulation allows the Implicit PRM to be trained with outcome labels, yet provide fine-grained token-level feedback by interpreting the ratio of likelihoods from the PRM and the reference model as rewards (a minimal code sketch appears at the end of this subsection).
- Update Implicit PRM (Algorithm 1, Step 8): The Implicit PRM is updated online using the collected rollouts from $\mathcal{T}$ and their outcome rewards. The update uses a cross-entropy (CE) loss:
  $ \mathcal{L}_{\mathrm{CE}}(\phi) = - \mathbb{E}_{(\mathbf{x}, \mathbf{y}, r_o(\mathbf{y})) \sim \mathcal{T}} \left[ r_o(\mathbf{y}) \cdot \log \sigma\left(r_\phi(\mathbf{y})\right) + \left(1 - r_o(\mathbf{y})\right) \cdot \log \left(1 - \sigma\left(r_\phi(\mathbf{y})\right)\right) \right] $
  - $\mathcal{L}_{\mathrm{CE}}(\phi)$: The cross-entropy loss for updating the Implicit PRM parameters $\phi$.
  - $\mathbb{E}_{(\mathbf{x}, \mathbf{y}, r_o(\mathbf{y})) \sim \mathcal{T}}$: Expected value over the filtered samples $\mathcal{T}$, i.e., triplets of prompt, generated response, and its outcome reward.
  - $r_o(\mathbf{y})$: The binary outcome reward for the entire response (typically 1 for correct, 0 for incorrect).
  - $\sigma$: The sigmoid function, which squashes the Implicit PRM's predicted total reward into a probability-like score between 0 and 1. The total implicit reward for a sequence is the sum of its token-level rewards, $r_\phi(\mathbf{y}) = \sum_{t=1}^{|\mathbf{y}|} r_\phi(y_t)$.
  - This CE loss effectively trains the Implicit PRM to predict the outcome reward of a full trajectory, even though it provides token-level rewards at inference. This online update is crucial for mitigating reward hacking, as the PRM adapts to the evolving policy distribution (as shown in Figure 5).
4.2.4. Advantage Estimation and Policy Update
Once dense process rewards and outcome rewards are available, PRIME calculates advantages to update the policy.
- Monte Carlo Advantage with Leave-One-Out Baseline: The paper uses Monte-Carlo (MC) estimators for advantage calculation, finding them stable and effective. Specifically, it employs a leave-one-out (LOO) baseline, which reduces variance by subtracting the average reward of the other samples for the same prompt. The LOO advantage for the outcome reward of the $i$-th response among $K$ samples is:
  $ A^{i} = r_o(\mathbf{y}^{i}) - \frac{1}{K-1} \sum_{j \neq i} r_o(\mathbf{y}^{j}) $
  - $A^{i}$: The advantage for the $i$-th response.
  - $r_o(\mathbf{y}^{i})$: The outcome reward for the $i$-th response.
  - $K$: The total number of responses sampled for a given prompt.
  - $\sum_{j \neq i} r_o(\mathbf{y}^{j})$: The sum of outcome rewards for all responses except the $i$-th one. The formula thus computes the advantage of the $i$-th response as its own reward minus the average reward of all other responses for the same prompt.
- Combined Advantage Function (Algorithm 1, Step 9): PRIME combines implicit process rewards and outcome rewards by calculating their returns separately and then summing them, which avoids the numerical instability that might arise from directly mixing their values. The final advantage for token $y_t$ in the $i$-th response is:
  $ A_{t}^{i} = \sum_{s=t}^{|\mathbf{y}^{i}|} \gamma^{s-t} \cdot \left[ r_{\phi}(y_{s}^{i}) - \frac{1}{K-1} \sum_{j \neq i} r_{\phi}(\mathbf{y}^{j}) \right] + r_{o}(\mathbf{y}^{i}) - \frac{1}{K-1} \sum_{j \neq i} r_{o}(\mathbf{y}^{j}) $
  - $A_{t}^{i}$: The advantage for generating token $y_t^i$ at step $t$ in the $i$-th response.
  - The first summation is the discounted return from implicit process rewards.
  - $\gamma$: The discount factor, weighing future rewards less.
  - $|\mathbf{y}^{i}|$: The total length of the $i$-th response.
  - $r_{\phi}(y_{s}^{i})$: The implicit process reward for token $y_s^i$ at step $s$.
  - $\frac{1}{K-1} \sum_{j \neq i} r_{\phi}(\mathbf{y}^{j})$: The leave-one-out baseline for the process reward component. The paper states "Use the averaged implicit process rewards to calculate the leave-one-out baseline" and "Normalize the process reward at step $t$ by subtracting the baseline", so $r_{\phi}(\mathbf{y}^{j})$ is best read as the averaged implicit process reward of response $j$ (rather than the sequence-level sum $\sum_{t} r_\phi(y_t^j)$), subtracted from each step's process reward before the discounted return is taken; either way, the intent is a leave-one-out baseline over the other $K-1$ responses.
  - $r_{o}(\mathbf{y}^{i}) - \frac{1}{K-1} \sum_{j \neq i} r_{o}(\mathbf{y}^{j})$: The leave-one-out advantage for the sparse outcome reward.
  (See the code sketch at the end of this subsection for one concrete reading of this formula.)
- Update Policy with PPO Loss (Algorithm 1, Step 10): The policy model is updated using the PPO clip surrogate loss, which provides stable updates by preventing the new policy from deviating too far from the old one:
  $ L_{\mathrm{CLIP}}(\theta) = \mathbb{E}_{t} \left[ \min \left( \frac{\pi_{\theta}(y_t | \mathbf{y}_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t | \mathbf{y}_{<t})} A_t, \ \mathrm{clip}\left( \frac{\pi_{\theta}(y_t | \mathbf{y}_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t | \mathbf{y}_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_t \right) \right] $
  - $L_{\mathrm{CLIP}}(\theta)$: The PPO clip surrogate loss for updating policy parameters $\theta$.
  - $\mathbb{E}_{t}$: Expected value over timesteps $t$.
  - $\pi_{\theta}(y_t | \mathbf{y}_{<t})$: Probability of token $y_t$ under the current policy.
  - $\pi_{\theta_{\mathrm{old}}}(y_t | \mathbf{y}_{<t})$: Probability of token $y_t$ under the previous policy (before the update).
  - $A_t$: The advantage calculated in the previous step (for token $y_t$).
  - $\mathrm{clip}(\cdot, 1-\epsilon, 1+\epsilon)$: A clipping function that constrains the probability ratio to lie within $[1-\epsilon, 1+\epsilon]$.
  - $\epsilon$: A small clipping parameter (e.g., 0.2) that limits how much the policy can change in one update step, ensuring stability. The loss encourages actions with positive advantage to become more likely and discourages those with negative advantage, with clipping preventing excessive changes.
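The two sketches below illustrate the advantage computation and the policy update. The first follows the "averaged implicit process rewards" reading of the combined-advantage formula discussed above; the tensor layout (K responses padded to a common length T) is an assumption for illustration, not the authors' exact implementation:

```python
import torch

def prime_advantages(token_rewards: torch.Tensor,    # (K, T) implicit process rewards r_phi(y_t^i)
                     outcome_rewards: torch.Tensor,  # (K,)   verifier rewards r_o(y^i)
                     gamma: float = 1.0) -> torch.Tensor:
    # Leave-one-out (LOO) advantages for K responses to the same prompt.
    K, T = token_rewards.shape

    def loo_baseline(x: torch.Tensor) -> torch.Tensor:
        # For each i: mean of x over the other K-1 responses.
        return (x.sum() - x) / (K - 1)

    # Sparse part: r_o(y^i) minus the LOO mean of the other responses' outcome rewards.
    outcome_adv = outcome_rewards - loo_baseline(outcome_rewards)               # (K,)

    # Dense part: subtract a LOO baseline built from the other responses' averaged
    # implicit process rewards, then take the discounted return over remaining steps.
    mean_token_reward = token_rewards.mean(dim=-1)                              # (K,)
    centered = token_rewards - loo_baseline(mean_token_reward).unsqueeze(-1)    # (K, T)
    dense_return = torch.zeros_like(centered)
    running = torch.zeros(K)
    for t in reversed(range(T)):
        running = centered[:, t] + gamma * running
        dense_return[:, t] = running

    # Final token-level advantage: dense return plus the sparse LOO advantage.
    return dense_return + outcome_adv.unsqueeze(-1)                             # (K, T)
```

And a minimal per-token version of the PPO clipped surrogate (function and argument names are illustrative):

```python
import torch

def ppo_clip_loss(new_logprobs: torch.Tensor,   # log pi_theta(y_t | y_<t)
                  old_logprobs: torch.Tensor,   # log pi_theta_old(y_t | y_<t)
                  advantages: torch.Tensor,     # A_t from the previous step
                  eps: float = 0.2) -> torch.Tensor:
    # Clipped surrogate: take the pessimistic minimum of the unclipped and clipped
    # ratio terms, then negate to obtain a loss to minimize.
    ratio = torch.exp(new_logprobs - old_logprobs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```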
4.2.5. Update Old Parameters
Finally, the old policy parameters $\theta_{\mathrm{old}}$ are updated to the current policy parameters $\theta$ (Algorithm 1, Step 11) for the next iteration, and the loop continues until the total number of iterations is reached.
4.2.6. Other Techniques
- Initializing PRM with SFT/base model: As mentioned, the Implicit PRM is directly initialized from the SFT model or even the base model, bypassing a dedicated PRM training stage. This is shown to be effective and even to outperform PRMs trained on extra data (see Section 5.1 and Figure 4).
- Online Prompt Filtering: This technique, described in Section 4.2.2, filters prompts to keep those with accuracy within a certain range. This focuses training on "median-level difficulty" problems, stabilizing RL and balancing Implicit PRM training.
4.2.7. PRIME's Solution to the Challenges (from Section 2.2)
- C1. Process rewards are hard to define: PRIME addresses this by deriving token-level implicit process rewards from an Implicit PRM trained only with outcome labels, removing the need for ambiguous and costly step-level annotations.
- C2. PRM online updates are not scalable: PRIME enables online updating of the Implicit PRM using on-policy rollouts and outcome labels (which are already collected for policy updates). This is scalable because it requires no new, expensive human annotations for each update, preventing reward hacking due to distribution shift.
- C3. Explicit reward modeling brings extra cost: PRIME eliminates the dedicated reward modeling stage by directly initializing the Implicit PRM from the SFT or base model.
5. Experimental Setup
5.1. Datasets
The experiments primarily focus on competition-level mathematics and programming tasks.
5.1.1. Supervised Fine-tuning (SFT) Dataset
- Purpose: To provide a strong starter model for RL by teaching specific reasoning patterns.
- Content: Reasoning instructions collected from several open-source datasets, completed by LLaMA-3.1-70B-Instruct using an action-centric chain-of-thought reasoning framework. The framework involves 7 actions (ASSESS, ADVANCE, VERIFY, SIMPLIFY, SYNTHESIZE, PIVOT, OUTPUT) that the model chooses at each step of its multi-step reasoning.
- Sources (Table 10):
  - Math: MathInstruct-MATH (Yue et al., 2023), OpenMathIns-2-Aug_Math (Toshniwal et al., 2024), Numina (Li et al., 2024), Reasoning-001 (SkunkworksAI, 2024).
  - Coding: Code-Feedback (Zheng et al., 2024), Magicoder (Wei et al., 2024), Magicoder-OSS (Wei et al., 2024).
  - Biomedicine: UltraMedical_mc (Zhang et al., 2024).
- Scale: 230K SFT data pairs, with an average response length of 1390 tokens.
- Reason for Selection: The authors explicitly did not include many datasets with ground-truth answers in SFT, reserving them for RL to diversify exploration and because ground truth is considered more essential in RL.

The following are the results from Table 10 of the original paper:

| Task | Dataset | Size | Avg. Response Length | Source |
| Math | MathInstruct-MATH (Yue et al., 2023) | 12715 | 964.01 | https://huggingface.co/datasets/TIGER-Lab/MathInstruct |
| | OpenMathIns-2-Aug_Math (Toshniwal et al., 2024) | 15086 | 1202.25 | https://huggingface.co/datasets/nvidia/OpenMathInstruct-2 |
| | Numina (Li et al., 2024) | 55845 | 1331.61 | https://huggingface.co/datasets/AI-MO/NuminaMath-CoT |
| | Reasoning-001 (SkunkworksAI, 2024) | 29831 | 1316.49 | https://huggingface.co/datasets/SkunkworksAI/reasoning-0.01 |
| Coding | Code-Feedback (Zheng et al., 2024) | 27663 | 1805.16 | https://huggingface.co/datasets/m-a-p/Code-Feedback |
| | Magicoder (Wei et al., 2024) | 24480 | 1828.72 | https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K |
| | Magicoder-OSS (Wei et al., 2024) | 28980 | 1850.05 | https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K |
| Biomedicine | UltraMedical_mc (Zhang et al., 2024) | 35163 | 891.06 | https://huggingface.co/datasets/TsinghuaC3I/UltraMedical |
| Total / Avg. | | 229763 | 1390.75 | |
5.1.2. Reinforcement Learning (RL) Datasets
- Purpose: To train LLMs using RL with outcome verifiers.
- Content: High-quality mathematics and coding problems with outcome verifiers (LaTeX answers for math, test cases for coding).
- Sources:
  - Math: NuminaMathCoT (Li et al., 2024), containing about 860K math problems ranging from Chinese high school exercises to International Mathematical Olympiad problems. After cleaning and filtering, 457K math problems were retained.
  - Coding: APPS (Hendrycks et al., 2021a), CodeContests (Li et al., 2022), TACO (Li et al., 2023), and Codeforces. After cleaning and filtering, 27K coding problems were retained.
- Data Preprocessing: A systematic rule-based approach was used to filter and classify math problems. Problems with figures/diagrams or proofs were excluded. Remaining problems were classified into question-answering, multiple-choice, or fill-in-the-blank, with a focus on multiple-choice; these were then converted to a direct question-answer format using a two-phase reformatting process (rule-based filtering, LLM-based filtering, LLM-based formatting). Finally, LLM-based comprehensive validation (using QwQ-32B-Preview and Qwen2.5-Math-72B-Instruct with self-consistency) was performed to ensure solvability and correctness.
5.1.3. EurusPRM Training Dataset (for ablation)
- Purpose: Used for an ablation study to compare PRM initialization strategies. This dataset trains a dedicated PRM to compare against PRIME's approach of initializing the PRM from the SFT model.
- Sources (Table 11): UltraInteract, Numina-SynMath, Numina-Olympiads.
- Generators: Llama-3.1-8B-Inst, Llama-3.1-8B-Base, Qwen2.5-72B-Inst, Qwen2.5-Math-7B-Base.
- Scale: 500K samples generated by the Llama-3.1 and Qwen2.5 series, with 8 responses per instruction, all labeled at the response level.

The following are the results from Table 11 of the original paper:

| Dataset | Generator Model | Num. Inst | Resp/Inst | Step-level/Response-level |
| UltraInteract | Llama-3.1-8B-Inst | 20177 | 8 | Response-level |
| UltraInteract | Llama-3.1-8B-Base | 13570 | 8 | Response-level |
| UltraInteract | Qwen2.5-72B-Inst | 4758 | 8 | Response-level |
| UltraInteract | Qwen2.5-Math-7B-Base | 25713 | 8 | Response-level |
| Numina-SynMath | Llama-3.1-8B-Inst | 4783 | 8 | Response-level |
| Numina-SynMath | Qwen2.5-Math-7B-Base | 5806 | 8 | Response-level |
| Numina-Olympiads | Llama-3.1-8B-Inst | 2909 | 8 | Response-level |
| Numina-Olympiads | Qwen2.5-Math-7B-Base | 4739 | 8 | Response-level |
5.2. Evaluation Metrics
The primary evaluation metrics are focused on accuracy on reasoning benchmarks.
- Rule-based Outcome Verifier ($r_o$): This serves as the ground-truth reward for both training and evaluation, consistent with recent research on unhackable rewards (see the code sketch at the end of this subsection).
  - For Math ($r_o^{\mathrm{math}}$):
    - Conceptual Definition: For mathematical problems, the outcome reward is a binary value indicating whether the generated answer exactly matches the ground truth. It focuses on the correctness of the final numerical or symbolic result.
    - Mathematical Formula:
      $ r_o^{\mathrm{math}}(\mathbf{y}) = \begin{cases} 1, & \mathrm{matched} \\ 0, & \mathrm{otherwise} \end{cases} $
    - Symbol Explanation:
      - $r_o^{\mathrm{math}}(\mathbf{y})$: The outcome reward for a generated mathematical response $\mathbf{y}$.
      - 1: The generated answer is an exact match to the ground truth; 0: it is not.
      - matched: A boolean condition where the answer extracted from $\mathbf{y}$ is identical to the canonical ground-truth answer.
  - For Coding ($r_o^{\mathrm{code}}$):
    - Conceptual Definition: For coding problems, the outcome reward measures the proportion of test cases that the generated code successfully passes. It assesses the functional correctness of the code.
    - Mathematical Formula:
      $ r_o^{\mathrm{code}}(\mathbf{y}) = \frac{\sum \#\mathrm{passes}}{\sum \#\mathrm{test\ cases}} $
    - Symbol Explanation:
      - $r_o^{\mathrm{code}}(\mathbf{y})$: The outcome reward for a generated code response $\mathbf{y}$.
      - $\sum \#\mathrm{passes}$: The total number of test cases that the generated code successfully passes.
      - $\sum \#\mathrm{test\ cases}$: The total number of available test cases for the problem.
- Pass@1 (Accuracy):
  - Conceptual Definition: Pass@1 (or accuracy) is a metric commonly used in code generation and reasoning tasks. It measures the percentage of problems for which at least one generated solution (when sampling multiple) is correct. In this paper, Pass@1 likely refers to the single best generation for each problem, as is typical for competitive benchmarks.
  - Mathematical Formula:
    $ \text{Pass@1} = \frac{\text{Number of problems with at least one correct solution}}{\text{Total number of problems}} \times 100\% $
  - Symbol Explanation:
    - Number of problems with at least one correct solution: The count of problems where the model produced at least one output that passed the outcome verifier.
    - Total number of problems: The total number of problems in the evaluation set.
- Average Improvement: The percentage increase in Pass@1 (or a similar accuracy metric) over a baseline model, averaged across multiple benchmarks.
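A minimal sketch of these verifier rewards and the Pass@1 metric is shown below (string matching only; the paper's actual math verifier additionally normalizes LaTeX answers before comparison):

```python
def math_outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    # Binary exact-match reward: 1.0 if the extracted answer equals the ground truth.
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

def code_outcome_reward(num_passed: int, num_test_cases: int) -> float:
    # Fraction of test cases passed by the generated program.
    return num_passed / num_test_cases if num_test_cases > 0 else 0.0

def pass_at_1(per_problem_correct: list) -> float:
    # Percentage of problems with at least one correct (verifier-passing) solution.
    return 100.0 * sum(bool(c) for c in per_problem_correct) / len(per_problem_correct)
```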
5.3. Baselines
The paper compares PRIME against several strong baselines:
- Eurus-2-7B-SFT: The supervised fine-tuning (SFT) version of their model, serving as the starting point for RL and a direct comparison to measure RL's impact.
- RLOO w/ OV Only: RL using a leave-one-out (LOO) baseline with only the outcome verifier (OV) rewards (i.e., sparse rewards, no dense process rewards). This is the direct sparse-reward RL baseline for PRIME.
- State-of-the-art LLMs:
  - GPT-4o: A powerful proprietary LLM from OpenAI.
  - Llama-3.1-70B-Instruct: A leading open-source LLM from Meta.
  - Qwen2.5-Math-7B-Instruct: A specialized mathematical reasoning LLM from Qwen, used as a strong domain-specific competitor.
- Other RL Algorithms (for ablation):
  - REINFORCE (Williams, 1992): A basic Monte-Carlo policy gradient algorithm.
  - GRPO (Shao et al., 2024): Group Relative Policy Optimization, which uses the group average of rewards as a baseline.
  - PPO (Schulman et al., 2017): A popular actor-critic algorithm that uses Generalized Advantage Estimation (GAE) and a clipped objective.
- Specialized RL Baselines (for comparison in the Appendix):
  - VinePPO (Kazemnejad et al., 2024): An RL method for LLM reasoning that uses the average return across trajectories for value estimation.
  - DeepScaleR (Luo et al., 2025): A three-stage RL training pipeline that iteratively increases context length for performance gains.

These baselines are chosen to demonstrate PRIME's superiority over: (1) its SFT precursor, (2) RL with sparse rewards only, (3) other RL algorithms (to show generality), and (4) leading commercial and open-source models, especially those specialized in mathematical reasoning.
5.4. Hyperparameters
The experiments were conducted on GPUs using the veRL (Sheng et al., 2024) framework.
- Policy Model Learning Rate: Constant.
- Optimizer: AdamW for the policy model.
- Learning Rate Schedule: Cosine annealing with a warmup ratio of 0.1 for the SFT phase.
- PRM Learning Rate:
- Batch Size: 256 for both the policy and the PRM.
- Micro Batch Size: 8.
- Rollout Stage: Collects 256 prompts and samples 4 responses for each prompt.
- Beta ($\beta$) for PRM: 0.05. This parameter controls the magnitude of the implicit rewards.
- KL Coefficient: Set to 0 in all experiments, i.e., no KL divergence penalty is applied between the updated policy and the reference policy (such a penalty is commonly used in PPO and similar algorithms to prevent excessive policy shifts).
- SFT Training Details: Full-parameter fine-tuning with the AdamW optimizer, a cosine annealing learning rate schedule (warmup ratio 0.1), batch size 96, and random seed 42. Trained on the 230K SFT dataset for 3 epochs.
- Implicit PRM/Reference Model Initialization: By default, the Implicit PRM is initialized with the SFT model, and the SFT model is retained to provide the reference log-probabilities.
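For readability, the stated settings can be collected into a single configuration object; this is only a reading aid, and values not listed above (e.g., the learning rates) are omitted rather than guessed:

```python
# Partial summary of the training settings stated above (names are illustrative).
prime_training_config = {
    "rollout": {"prompts_per_batch": 256, "responses_per_prompt": 4},
    "policy_update": {"optimizer": "AdamW", "batch_size": 256, "micro_batch_size": 8},
    "prm_update": {"batch_size": 256, "beta": 0.05, "init_from": "SFT model"},
    "kl_coefficient": 0.0,  # no KL penalty against the reference policy
    "sft": {"batch_size": 96, "epochs": 3, "seed": 42,
            "schedule": "cosine annealing", "warmup_ratio": 0.1},
}
```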
5.5. Evaluation Benchmarks
The models were evaluated on 7 reasoning benchmarks, focusing on competition-level mathematics and programming tasks:
- AIME 2024 (Li et al., 2024): American Invitational Mathematics Examination, a challenging competition math contest.
- AMC (Li et al., 2024): American Mathematics Competitions, another series of math contests.
- MATH-500 (Hendrycks et al., 2021b): A dataset of 500 challenging math problems designed to test advanced mathematical reasoning.
- Minerva Math (Lewkowycz et al., 2022): A dataset of math problems from various sources, used to evaluate mathematical problem-solving.
- OlympiadBench (He et al., 2024): A benchmark of olympiad-level bilingual multimodal scientific problems.
- LeetCode (Guo et al., 2024): A platform providing competitive programming problems.
- LiveCodeBench (v2) (Jain et al., 2024): A holistic and contamination-free code-generation benchmark for large language models.
6. Results & Analysis
6.1. Core Results Analysis
The main results demonstrate PRIME's substantial improvements in reasoning benchmarks, particularly in competitive math and coding.
The following are the results from Table 1 of the original paper:
| Method | Step | AIME 2024 | AMC | MATH-500 | MinervaMath | OlympiadBench | LeetCode | LiveCodeBench | Avg. |
| GPT-4o | - | 9.3 | 45.8 | 76.4 | 36.8 | 43.3 | 58.9 | 48.8 | 45.6 |
| Llama-3.1-70B-Inst. | - | 20.0 | 37.3 | 65.0 | 37.1 | 30.5 | 35.0 | 34.4 | 37.0 |
| Qwen2.5-Math-7B-Inst. | - | 13.3 | 50.6 | 79.8 | 34.6 | 40.7 | 11.7 | 11.3 | 34.6 |
| Eurus-2-7B-SFT | 0 | 3.3 | 30.1 | 66.2 | 32.7 | 29.8 | 21.7 | 17.8 | 28.8 |
| RLOO w/OV Only | 240 | 20.0 | 47.0 | 73.2 | 36.4 | 35.4 | 28.3 | 26.7 | 36.9 |
| Eurus-2-7B-PRIME | 80 | 20.0 | 41.0 | 68.2 | 38.2 | 37.0 | 26.7 | 26.6 | 36.8 |
| Eurus-2-7B-PRIME | 160 | 13.3 | 42.2 | 72.0 | 37.1 | 38.7 | 26.7 | 25.6 | 36.5 |
| Eurus-2-7B-PRIME | 240 | 20.0 | 50.6 | 78.2 | 39.3 | 40.3 | 31.1 | 27.5 | 41.0 |
| Eurus-2-7B-PRIME | 320 | 16.7 | 51.8 | 77.8 | 39.7 | 41.5 | 36.1 | 28.5 | 41.7 |
| Eurus-2-7B-PRIME | 592 | 26.7 | 57.8 | 79.2 | 38.6 | 42.1 | 33.3 | 28.6 | 43.9 |
- Significant Improvement over SFT Baseline: The Eurus-2-7B-PRIME model, starting from Eurus-2-7B-SFT (Avg. 28.8%), achieves an average score of 43.9% after 592 steps, a substantial 15.1% average improvement. This demonstrates the clear effectiveness of the PRIME framework in enhancing reasoning capabilities through RL.
- Strong Performance in Math Competitions: PRIME shows remarkable gains on AMC (from 30.1% for SFT to 57.8% for PRIME at 592 steps) and AIME 2024 (from 3.3% to 26.7%), with over 20% improvement in these areas. This highlights its capability for complex mathematical reasoning.
- Outperforming Qwen2.5-Math-7B-Instruct with Less Data: The final Eurus-2-7B-PRIME model surpasses Qwen2.5-Math-7B-Instruct on five mathematical benchmarks (AIME 2024, AMC, MATH-500, MinervaMath, OlympiadBench; see Appendix B.3 for the full list including coding). Notably, as shown in Table 3, this is accomplished with only 10% of the training data used by Qwen2.5-Math-7B-Instruct (230K SFT + 150K RL queries vs. 2.5M SFT + 618K RM + 66K RL queries). This underscores PRIME's exceptional sample efficiency.

The following figure (Figure 12 from the original paper) shows the overall math performance:
[Figure 12 is a bar chart of accuracy on several reasoning benchmarks for Eurus-2-7B-PRIME, Eurus-2-7B-SFT, Qwen-2.5-Math-7B-Instruct, Llama-3.1-70B-Instruct, and GPT-4o-2024-08-06; on MATH-500, Eurus-2-7B-PRIME reaches 79.2%.]
Figure 12: Overall math performance. Eurus-2-7B-PRIME excels at competition-level mathematics benchmarks, outperforming advanced math models and larger models. Notably, PRIME brings substantial performance gains over Eurus-2-7B-SFT.
The following are the results from Table 3 of the original paper:
| Model | Eurus-2-7B-PRIME | Qwen2.5-Math-7B-Instruct |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
| SFT Data | 230K (open-source) | 2.5M (open-source & in-house) |
| RM Data | 0 | 618K (in-house) |
| RM | Eurus-2-7B-SFT | Qwen2.5-Math-RM (72B) |
| RL Data | 150K queries × 4 samples | 66K queries × 32 samples |
6.2. Dense Rewards vs. Sparse Rewards
The paper directly compares PRIME (with dense rewards) against RLOO w/OV Only (with sparse outcome rewards).
The following figure (Figure 3 from the original paper) shows the effect of dense reward:
[Figure 3 compares PRIME and RLOO in outcome training reward and test accuracy. Panel (a): PRIME's training reward (blue) is 6.9% higher than RLOO's around step 200, with 2.5x sample efficiency. Panel (b): test accuracy across gradient steps, with PRIME consistently above RLOO.]
Figure 3: The effect of dense reward. We compare PRIME and RLOO with outcome verifier (OV). PRIME leads to sample efficiency (wall clock as axis can be found in Figure 17) and performance improvement. PRIME also substantially outperforms RLOO on downstream tasks.
The following figure (Figure 17 from the original paper) shows the effect of dense reward, with wall clock as the X-axis:
[Figure 17 shows the training reward (10-iteration moving average) of PRIME versus RLOO with the outcome verifier (OV) only, plotted over wall-clock time; PRIME shows better sample efficiency.]
Figure 17: The effect of dense reward. We compared PRIME and RLOO with outcome verifier (OV). The figure depicts training reward curves across wall clock, revealing better sample efficiency of PRIME.
- Performance: Figure 3 shows that PRIME achieves a 6.9% higher final training reward and lower variance compared to RLOO w/ OV Only after 240 steps. On downstream tasks (Table 1), PRIME at 240 steps (Avg. 41.0%) significantly outperforms RLOO w/ OV Only at 240 steps (Avg. 36.9%), demonstrating consistent improvements.
- Training Efficiency:
  - Sample Efficiency: Figure 3 indicates that PRIME reaches similar training reward levels in significantly fewer steps, a 2.5x sample efficiency gain compared to RLOO w/ OV Only. Figure 17 (plotting against wall-clock time) further supports this.
  - Time Cost: While PRIME has a slightly higher per-step time cost due to PRM updates, its superior sample efficiency still translates to overall faster training.

The following are the results from Table 2 of the original paper:

| Time (s) | Rollout | Policy update | PRM update | Others | Sum |
| PRIME | 281.7 | 156.6 | 150.9 | 91.1 | 680.3 |
| RLOO | 282.4 | 157.9 | 0 | 90.4 | 530.7 |
PRIME requires 24% more time per step (680.3s vs. 530.7s) primarily due to the PRM update phase. However, given its 2.5x sample efficiency, PRIME is still about 2x more efficient in total training time. This shows that the benefits of dense rewards outweigh the minor overhead.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Design Choices for the Implicit PRM
- SFT Model Initializes a Good PRM: The paper investigates Implicit PRM initialization strategies. Surprisingly, directly initializing the PRM from the SFT model (Eurus-2-7B-SFT) performs significantly better than initializing from a specially trained EurusPRM (which was trained on an additional 500K samples from the Llama-3.1 and Qwen2.5 series).
  The following figure (Figure 4 from the original paper) shows the comparison of different PRMs:
  [Figure 4 compares process rewards and test accuracy for different PRMs: panel (a) shows that PRIME with the online SFT-initialized PRM performs best; panel (b) shows test accuracy over gradient steps. Training on extra offline EurusPRM data hurts performance.]
  Figure 4: Comparison of different PRMs. The online PRM initialized from the SFT model achieved the best results, whereas using PRMs trained on extra rollouts hurts performance.
  This suggests that aligning the PRM and the policy model by initializing them from the same base model (the SFT model in this case) helps alleviate distribution shift, and that the PRM's effectiveness largely comes from being trained on online rollouts from the policy model itself rather than from extensive pre-training.
- Online PRM Update is Essential: The paper demonstrates that online PRM updates are crucial for preventing overoptimization and reward hacking.
  The following figure (Figure 5 from the original paper) shows the impact of the PRM online update:
  [Figure 5 shows that the online PRMs (online SFT PRM and online EurusPRM) achieve higher accuracy during training, while the offline PRM is gradually over-optimized.]
  Figure 5: Impact of PRM online update. The offline PRM is gradually over-optimized, while online PRMs achieve higher accuracy during training.
  Figure 5 clearly shows that an offline PRM (trained once and kept static) initially has high accuracy but gradually degrades during RL training due to distribution shift as the policy evolves. In contrast, online PRMs (trained on policy rollouts) show increasing accuracy, effectively adapting to the current policy distribution. This confirms that PRIME's online PRM update mechanism is vital for its success.
6.3.2. Scaling PRIME with More Compute
PRIME's scalability is tested by extending training steps and increasing rollout numbers.
The following figure (Figure 6 from the original paper) shows RL training with more training steps (Left) and larger rollout numbers (Right):
[Figure 6 shows test performance during training. Left: PRIME trained up to 800 steps; right: training with 16 rollouts per prompt. The blue line is PRIME and the orange line is RLOO with outcome labels only; PRIME outperforms RLOO throughout training.]
Figure 6: RL training with more training steps (Left) and larger rollout numbers (Right).
- Extended Training (800 steps): PRIME consistently shows stable growth and outperforms the RLOO baseline by 3.7% over an extended training period (Figure 6, Left).
- Larger Rollout Numbers (16 samples/prompt): Increasing the number of sampled responses per prompt from 4 to 16 yields a non-trivial improvement of approximately 4.4% for PRIME over RLOO (Figure 6, Right).

These results confirm PRIME's scalability and robust performance with increased computational resources.
6.3.3. PRIME with Other RL Algorithms
PRIME is shown to be a general method, compatible with various RL algorithms.
The following figure (Figure 7 from the original paper) shows that PRIME also generally benefits REINFORCE, GRPO, and PPO:
[Figure 7 shows outcome training rewards of REINFORCE, GRPO, and PPO with and without PRIME; as steps increase, the PRIME-augmented variants achieve higher rewards overall.]
Figure 7: PRIME also generally benefits REINFORCE, GRPO, and PPO.
The following are the results from Table 4 of the original paper:
| Method | Step | AIME 2024 | AMC | MATH-500 | MinervaMath | OlympiadBench | LeetCode | LiveCodeBench | Avg. |
| RLOO | 240 | 20.0 | 47.0 | 73.2 | 36.4 | 35.4 | 28.3 | 26.7 | 36.9 |
| RLOO w/PRIME | 240 | 20.0 | 50.6 | 78.2 | 39.3 | 40.3 | 31.1 | 27.5 | 41.0 |
| REINFORCE | 240 | 6.7 | 47.0 | 72.6 | 36.0 | 37.2 | 27.2 | 25.0 | 36.0 |
| REINFORCE w/PRIME | 240 | 6.7 | 50.0 | 76.4 | 36.8 | 39.1 | 27.8 | 27.5 | 37.8 |
| GRPO | 240 | 10.0 | 44.6 | 73.2 | 37.5 | 36.6 | 25.0 | 25.8 | 36.1 |
| GRPO w/PRIME | 240 | 16.7 | 47.0 | 75.0 | 34.9 | 38.2 | 28.9 | 23.9 | 37.8 |
| PPO | 240 | 10.0 | 41.0 | 73.6 | 36.0 | 36.3 | 28.3 | 25.7 | 35.8 |
| PRIME as Value Model | 240 | 16.7 | 44.6 | 72.6 | 34.6 | 35.7 | 27.8 | 24.6 | 36.6 |
| PPO w/ PRIME | 240 | 13.3 | 50.6 | 77.4 | 37.1 | 40.6 | 30.0 | 26.7 | 39.4 |
Figure 7 and Table 4 show that PRIME consistently boosts the performance of REINFORCE, GRPO, and PPO in terms of both efficiency (implied by higher rewards at similar steps) and final performance. For example, RLOO w/PRIME achieves an average of 41.0% compared to RLOO's 36.9%. This highlights PRIME as a generic, plug-in component for almost any RL algorithm for LLMs. Notably, the PPO variant of PRIME (39.4%) performs better than PPO alone (35.8%), but RLOO w/PRIME (41.0%) still achieves the best performance.
6.3.4. Value or Reward: How to Use the Implicit PRM?
This section explores whether the Implicit PRM is better utilized as a reward model (providing $r_\phi$) or as a value model (predicting $v_\phi$).
The following figure (Figure 8 from the original paper) shows the comparison of value models and process reward models:
[Figure 8 plots outcome training reward versus steps for REINFORCE and its variants with a linear-head value model and with the Implicit PRM used as a value model or as a reward model.]
Figure 8: Comparison of value models and process reward models. Figure 8 compares four variants:
- REINFORCE ($A_t = r_o(\mathbf{y})$): Basic sparse reward.
- PPO ($A_t = r_o(\mathbf{y}) - V(\mathbf{y}_{<t})$): Uses a linear-head value model as the baseline.
- Implicit PRM as Value Model ($A_t = r_o(\mathbf{y}) - v_\phi(\mathbf{y}_{<t})$): Uses values from the Implicit PRM as the baseline.
- REINFORCE w/ PRIME ($A_t = r_o(\mathbf{y}) + \sum_{s=t}^T r_\phi(y_s)$): Uses process rewards from the Implicit PRM to calculate the return.

The results clearly indicate that using the Implicit PRM to calculate process rewards and incorporate them into the return (variant 4) significantly outperforms all other baselines, including those using value models (variants 2 and 3). This suggests that PRMs are more effective than value models for RL in LLMs, and that explicitly leveraging the dense reward signal is key.
6.3.5. "Zero" Experiments (RL from Base Model)
The paper also explores "Zero" RL, where training starts directly from a base model (e.g., Qwen2.5-Math-7B-Base) without an SFT phase.
The following figure (Figure 13 from the original paper) shows "Zero" RL from Qwen2.5-Math-7B:
[Figure 13 contains two plots comparing PRIME and PRIME-Zero during training: outcome training reward over steps (PRIME below PRIME-Zero) and math test accuracy over gradient steps (PRIME likewise below Qwen2.5-Math-7B-Instruct).]
Figure 13: "Zero" RL from Qwen2.5-Math-7B. RL from the base model converges way faster than the SFT model, surpassing the instruct version within 32 steps.
The following figure (Figure 14 from the original paper) shows "Zero" RL from Qwen2.5-32B-Base:
[Figure 14 shows PRIME-Zero's training reward versus steps (panel a) and math test accuracy versus gradient steps (panel b), exceeding 52 points by step 80.]
Figure 14: "Zero" RL from Qwen2.5-32B-Base. RL from a 32B base model shows more promising gain, surpassing the instruct version within 16 steps.
- Efficiency: RL from a base model (PRIME-Zero) converges much faster than from an SFT model. For Qwen2.5-Math-7B-Base, it surpasses the instruct version within 32 steps (Figure 13).
- Larger Models Benefit More: The 32B base model shows even more promising gains, surpassing the instruct version within 16 steps (Figure 14). This aligns with findings in DeepSeek-AI et al. (2025).
- Saturation Issue: Despite impressive initial gains, PRIME-Zero models quickly saturate at an early stage (around 50 steps), which hinders further improvement. This is potentially due to a decrease in response diversity and is highlighted as future work.
6.3.6. Effect of Reward Model Size
The paper investigates the impact of reward model (RM) capacity by fixing the policy model (Qwen2.5-7B-Base) and varying the RM size.
The following are the results from Table 5 of the original paper:
| Reward Model | AIME 24 | AIME 25 | AMC | MATH | Minerva | OlympiadBench | Average |
| Qwen2.5-3B | 10.7 | 4.8 | 44.0 | 73.2 | 26.1 | 33.0 | 32.0 |
| Qwen2.5-7B | 13.2 | 6.4 | 42.9 | 73.4 | 26.5 | 33.1 | 32.6 |
| Qwen2.5-14B | 10.8 | 4.8 | 44.1 | 73.2 | 25.4 | 32.7 | 31.8 |
The results (Table 5) show that reward model size has a limited influence, with the 7B reward model achieving the best average performance (32.6%). Larger (14B) or smaller (3B) RMs do not yield clear advantages, suggesting that a PRM of similar size to the policy model is sufficient.
6.3.7. Comparison with VinePPO
PRIME is compared against VinePPO (Kazemnejad et al., 2024) on the MATH dataset using RhoMath 1.1B as the base model.
The following figure (Figure 15 from the original paper) shows validation accuracy curves of PRIME and VinePPO:
The figure shows validation accuracy curves of PRIME and VinePPO on MATH500; PRIME's validation accuracy is notably higher than VinePPO's throughout training, with a clear gap from the early stage.
Figure 15: Validation accuracy curves of PRIME and VinePPO.
The following are the results from Table 6 of the original paper:
| Steps | 16 | 32 | 48 | 64 | 80 | 96 |
| --- | --- | --- | --- | --- | --- | --- |
| VinePPO Val Acc (%) | 15.7 | 16.3 | 17.2 | 17.6 | 17.7 | 18.4 |
| VinePPO Clock Time (Hours) | 2.23 | 4.57 | 7.23 | 9.86 | 11.96 | 13.94 |
| PRIME Val Acc (%) | 16.4 | 16.8 | 17.5 | 18.1 | 18.7 | 18.8 |
| PRIME Clock Time (Hours) | 0.22 | 0.41 | 0.60 | 0.80 | 1.01 | 1.22 |
- Efficiency: PRIME is significantly faster, completing 96 steps in 1.22 hours compared to VinePPO's 13.94 hours, roughly an 11x speedup.
- Performance: PRIME consistently achieves higher validation accuracy than VinePPO at every step, demonstrating superior performance (Table 6, Figure 15).
6.3.8. Comparison with DeepScaleR
PRIME is also compared against DeepScaleR (Luo et al., 2025), a three-stage RL pipeline, under a similar setting.
The following figure (Figure 16 from the original paper) shows training reward curves of PRIME and DeepScaleR:
The figure shows training reward curves of PRIME (blue) and DeepScaleR (orange) over training steps; PRIME's training reward rises steadily while DeepScaleR's stays relatively flat.
Figure 16: Training reward curves of PRIME and DeepScaleR.
The following are the results from Table 7 of the original paper:
| Model | Step | GPU Hours | AIME 2024 | MATH-500 | AMC | MinervaMath | OlympiadBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepScaleR-1.5B-Preview | 1750 | 3800 | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| DeepScaleR-1.5B-Stage1 | 1040 | ~600 | 33.9 | - | - | - | - | - |
| DeepSeek-R1-Distill-Qwen-1.5B | - | - | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| PRIME-DeepScaleR-1.5B-Stage1 | 330 | 446.7 | 32.1 | 85.1 | 68.1 | 30.1 | 44.6 | 52.0 |
- Performance: PRIME achieves comparable training accuracy to DeepScaleR's Stage 1 within 330 steps, roughly 1/3 of DeepScaleR's 1040 Stage 1 steps (Figure 16). On test sets (Table 7), PRIME-DeepScaleR-1.5B-Stage1 improves the base model (DeepSeek-R1-Distill-Qwen-1.5B) by 3.1 points (52.0% vs. 48.9% average), validating its effectiveness even on highly capable base models.
- Efficiency: PRIME consumes 446.7 A800 GPU hours for this experiment, compared to roughly 600 A100 GPU hours for DeepScaleR's Stage 1. This makes PRIME about 25% faster, and potentially more given the hardware differences. The overhead of PRIME is also noted to be smaller for long reasoning models.
6.4. Reference Model Choice
The paper explores two variants for the reference model $\pi_\text{ref}$ in the Implicit PRM's reward calculation:

- SFT ref: retains the initial SFT model as $\pi_\text{ref}$.
- Policy ref: uses the running policy's old log-probabilities as $\pi_\text{ref}$.

The following figure (Figure 10 from the original paper) shows the comparison of different reference policy implementations:
The figure compares the two reference policy implementations: the blue curve uses the running policy's old log-probabilities as the reference (policy ref), the orange curve uses the initial SFT model (SFT ref), and their rewards over training steps are similar.
Figure 10: Different reference models for the PRM. We compare two reference model selection strategies for PRIME: using the policy model as the reference and using the initial SFT model as the reference. Their rewards are similar.
Figure 10 shows that both strategies yield similar training rewards. The choice is flexible, with policy ref naturally serving as the reference for the Q-value expectation, while SFT ref is necessary if KL divergence calculations against the initial policy are also desired.
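For reference, here is a minimal sketch (hypothetical log-probability tensors, not the authors' code) of how the implicit process reward, of the form $r_\phi(y_t) = \beta \left( \log \pi_\phi(y_t \mid \mathbf{y}_{<t}) - \log \pi_\text{ref}(y_t \mid \mathbf{y}_{<t}) \right)$, differs under the two reference choices; only the source of the reference log-probabilities changes.

```python
# A minimal sketch of the implicit process reward under the two reference choices.
# All tensors are hypothetical placeholders, not outputs of the actual models.
import torch

beta = 0.05                       # reward scaling coefficient (the value reported in the paper)
T = 8                             # response length in tokens

logp_prm = torch.randn(T)         # log-probs of sampled tokens under the Implicit PRM pi_phi
logp_sft_ref = torch.randn(T)     # log-probs under the frozen initial SFT model ("SFT ref")
logp_policy_old = torch.randn(T)  # old log-probs of the running policy ("policy ref"),
                                  # already cached from the rollout, so no extra forward pass

# Per-token implicit process rewards under each reference choice.
r_with_sft_ref = beta * (logp_prm - logp_sft_ref)
r_with_policy_ref = beta * (logp_prm - logp_policy_old)
```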
6.5. Single-Forward vs. Double-Forward
The paper investigates if updating the PRM before the policy model in the same rollout stage (double-forward) affects performance compared to using the old PRM (single-forward).
The following figure (Figure 11 from the original paper) shows single and double forward:
The figure shows (a) PRM classification accuracy on training samples under each strategy, where double-forward is more accurate after the online update, and (b) training rewards for the two strategies over the course of training.
Figure 11: Single and double forward. While double forward methods obtain higher accuracy after online update, the two variants achieve similar rewards during training.
Figure 11 shows that while double-forward can increase PRM accuracy after online updates, the training rewards between single-forward and double-forward methods remain similar. This implies that the additional computational cost of double-forward might not be justified for the marginal gain in policy performance.
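The distinction is purely an ordering of updates within one rollout stage. A minimal sketch, assuming hypothetical helper callables (`rollout`, `score_with_prm`, `update_prm`, `update_policy`) rather than PRIME's actual API:

```python
# A pseudocode-level sketch of the two update orderings; the callables passed in
# (rollout, score_with_prm, update_prm, update_policy) are hypothetical placeholders.
def train_step(policy, prm, prompts, rollout, score_with_prm, update_prm, update_policy,
               double_forward=False):
    """Run one RL step with either the single-forward or double-forward schedule."""
    responses, outcome_labels = rollout(policy, prompts)

    if double_forward:
        # Double-forward: refresh the PRM on this batch first, then re-score the
        # same rollouts with the updated PRM (one extra forward pass over them).
        update_prm(prm, responses, outcome_labels)
        process_rewards = score_with_prm(prm, responses)
    else:
        # Single-forward: score rollouts with the current (old) PRM, then update it.
        process_rewards = score_with_prm(prm, responses)
        update_prm(prm, responses, outcome_labels)

    update_policy(policy, responses, process_rewards, outcome_labels)
```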
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces PRIME (Process Reinforcement through IMplicit rEwards), a novel and scalable framework for enhancing the reasoning capabilities of large language models through online reinforcement learning with dense token-level rewards. PRIME successfully addresses the long-standing challenges of incorporating dense rewards by leveraging implicit process reward modeling, which allows process reward models (PRMs) to be updated online using only outcome labels, thus circumventing expensive step-level annotations and mitigating reward hacking. The framework also eliminates the need for a dedicated reward model training phase, initializing PRMs directly from SFT or base models, and is broadly compatible with various RL algorithms. Experiments on competitive math and coding benchmarks demonstrate that PRIME significantly boosts sample efficiency (2.5x gain) and policy performance (15.1% average improvement over SFT), even enabling a 7B model to surpass larger, specialized instruct models with substantially less training data.
7.2. Limitations & Future Work
The authors explicitly mention one limitation:
- Resource Constraints: Experiments were conducted only on models up to 32B. Ablation experiments also ran for fewer steps than the main experiments, though comparisons were made fairly under the same step counts.
Implicitly, from the "Zero" experiments, a potential area for future work is identified:
- Saturation in "Zero" RL: While starting RL directly from a base model (PRIME-Zero) shows impressive initial gains and efficiency, it quickly saturates at an early stage. This could be attributed to a decrease in response diversity, and addressing this saturation is left as future work.
7.3. Personal Insights & Critique
The PRIME paper offers a highly insightful and practical solution to a critical bottleneck in LLM RL: the scalable integration of dense process rewards. Its core innovation of using implicit process rewards with online PRM updates from outcome labels is a clever way to bridge the gap between fine-grained feedback and annotation feasibility.
Inspirations and Transferability:
- Efficiency for Complex Tasks: The demonstrated efficiency gains (2.5x sample efficiency, 10% data usage) make PRIME highly appealing for training LLMs on complex multi-step reasoning tasks where reward sparsity is a major issue. This could be particularly impactful in domains like scientific discovery, advanced code generation, or medical diagnosis, where detailed reasoning steps are crucial but hard to label manually.
- Reduced Development Overhead: Eliminating the dedicated reward model training phase is a significant practical advantage. For research labs and developers, this means faster iteration cycles and lower computational costs, democratizing access to RL methods for LLMs.
- Generalizability: The plug-in nature of PRIME with various RL algorithms suggests its methods could be broadly adopted across different RLHF or RL pipelines, enhancing existing approaches rather than requiring a complete overhaul.
- "Zero" RL Potential: The preliminary "Zero" RL experiments, despite their saturation issue, open up an exciting avenue. If the diversity problem can be solved, training directly from base models could drastically simplify the LLM development lifecycle, potentially replacing the costly SFT stage entirely for certain applications.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Dependence on Outcome Verifiers: While PRIME avoids process labels, it still relies on high-quality, unhackable outcome verifiers. For many real-world tasks, defining such a verifiable outcome (e.g., for creative writing or open-ended dialogue) can itself be challenging. The effectiveness of PRIME is therefore limited by the quality and hackability of the final outcome reward.
- Interpretability of Implicit Rewards: While mathematically defined, the implicit process rewards are a log-ratio of probabilities. Their direct interpretability for human understanding or debugging purposes might be less intuitive than explicitly designed, human-feedback-based process rewards.
- Hyperparameter Sensitivity of $\beta$: The parameter $\beta$ plays a crucial role in scaling the implicit rewards. Its optimal value might be task-dependent, and the paper does not deeply explore its sensitivity or provide a principled way to choose it, beyond setting it to 0.05.
- Credit Assignment Ambiguity within Tokens: While PRIME offers token-level rewards, a single token might still represent part of a larger, ambiguous "step." The credit assignment problem might merely be shifted to a finer granularity without being fully resolved, especially if the Implicit PRM itself is not perfectly aligned with human notions of good reasoning steps.
- Scaling to Larger Models (Beyond 32B): While PRIME shows promise, the explicit limitation to models up to 32B leaves open questions about its behavior and efficiency for models in the 70B+ or even trillion-parameter range, where memory and computational constraints are even more severe.

Overall, PRIME presents a strong case for the practical viability of dense rewards in LLM RL, offering a robust and efficient framework that addresses key challenges. Its impact could be significant in pushing the boundaries of LLM reasoning capabilities.