
Process Reinforcement through Implicit Rewards

Published: 02/03/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces PRIME, which enhances reinforcement learning for large language models by using implicit rewards for online process reward model updates. PRIME significantly improves performance by efficiently addressing the label-collection costs and reward-hacking issues that limit dense process rewards.

Abstract

Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level mathematics and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is "Process Reinforcement Through Implicit Rewards," introducing a novel framework called PRIME.

1.2. Authors

The paper lists numerous authors from various institutions, including:

  • Shanghai AI Lab

  • Peking University

  • Shanghai Jiao Tong University

  • CUHK

    The corresponding authors are Ganqu Cui (cuiganqu@pjlab.org.cn) and Ning Ding.

1.3. Journal/Conference

The paper is a preprint, published on arXiv on 2025-02-03. As a preprint, it has not yet undergone formal peer review or been accepted by a specific conference or journal, although arXiv is a highly respected platform for disseminating cutting-edge research in fields like AI.

1.4. Publication Year

2025

1.5. Abstract

This paper addresses the challenge of scaling reinforcement learning (RL) for large language models (LLMs) using dense process rewards. While dense rewards offer advantages over sparse outcome-level rewards—such as improved training efficiency and credit assignment in complex multi-step reasoning tasks—their adoption has been limited due to the high cost of collecting high-quality process labels for online process reward model (PRM) updates, making PRMs vulnerable to reward hacking.

To overcome these issues, the authors propose PRIME (Process Reinforcement through IMplicit rEwards). PRIME enables online PRM updates using only policy rollouts and outcome labels by leveraging implicit process rewards. This framework is compatible with various advantage functions and eliminates the need for a dedicated reward model training phase, significantly reducing development overhead.

The authors demonstrate PRIME's effectiveness on competitive math and coding tasks. Starting from Qwen2.5-Math-7B-Base, PRIME achieves an average improvement of 15.1% across several key reasoning benchmarks compared to the supervised fine-tuning (SFT) model. Notably, their resulting model, Eurus-2-7B-PRIME, outperforms Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks using only 10% of its training data.

Link: https://arxiv.org/abs/2502.01456v2
PDF Link: https://arxiv.org/pdf/2502.01456v2.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the difficulty in effectively applying dense process rewards to the reinforcement learning (RL) of large language models (LLMs) for complex reasoning tasks.

This problem is important because:

  • Limitations of Sparse Outcome Rewards: Current state-of-the-art LLMs primarily rely on sparse outcome-level rewards, which only provide feedback at the end of an entire generated sequence. This approach suffers from several issues:
    • Training Inefficiency: Learning is slow as feedback is infrequent.
    • Credit Assignment Problem: It's hard to determine which intermediate steps contributed positively or negatively to the final outcome, especially in multi-step reasoning.
    • Encouraging Spurious Solutions: Models might learn incorrect reasoning processes that coincidentally lead to correct answers.
  • Potential of Dense Process Rewards: Dense process rewards, which offer feedback at each intermediate step (e.g., token-level), have shown promise in improving inference-time performance for LLMs on reasoning tasks. In principle, they should address the issues of sparse rewards by providing fine-grained feedback, leading to better training efficiency and credit assignment in RL.
  • Challenges in Incorporating Dense Rewards for RL: Despite their potential, successful applications of dense rewards in RL for LLMs are limited due to three main challenges:
    1. Difficulty in Defining Process Rewards (C1): It's hard to collect step-level labels for complex reasoning, and annotating rewards for every token is prohibitively expensive and ambiguous.

    2. Scalability of PRM Online Updates (C2): To prevent reward hacking (over-optimization of a static reward model), process reward models (PRMs) need to be updated online. However, this typically requires extensive, nuanced step-level annotations on new policy rollouts, which is infeasible at scale.

    3. Extra Cost of Explicit Reward Modeling (C3): Training PRMs usually involves a costly, dedicated data collection and training phase to ensure generalization, adding significant overhead.

      The paper's entry point and innovative idea revolve around making dense process rewards scalable and practical for online RL. It leverages the concept of implicit process reward modeling to derive token-level rewards using only outcome labels, thereby circumventing the annotation bottleneck and enabling online PRM updates.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Proposing PRIME (Process Reinforcement through IMplicit rEwards): A novel and scalable framework that enhances LLM reasoning capabilities through efficient online reinforcement learning with dense token-level rewards.
  • Enabling Online PRM Updates with Outcome Labels: PRIME utilizes implicit process reward modeling to train dense reward models using only outcome-level labels. This fundamentally addresses C1 and C2 by enabling online updates of the PRM with policy rollouts and outcome supervision, mitigating reward hacking without requiring expensive step-level annotations.
  • Eliminating Dedicated Reward Model Training: PRIME simplifies the development process by initializing the Implicit PRM directly from the supervised fine-tuning (SFT) model or even the base model, removing the need for a separate, costly reward model training phase (C3).
  • General Framework for Reward Fusion: PRIME provides a general method to combine token-level dense rewards and sparse outcome rewards by calculating their returns separately and summing them. This design makes it compatible with various RL algorithms (e.g., REINFORCE, RLOO, GRPO, PPO).
  • Demonstrated Effectiveness and Efficiency:
    • PRIME achieved a 15.1% average improvement over the SFT model on several key reasoning benchmarks (competitive math and coding).

    • The resulting Eurus-2-7B-PRIME model surpassed Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with only 10% of its training data, showcasing significant sample efficiency.

    • Compared to RLOO with outcome rewards only, PRIME demonstrated a 2.5x sample efficiency gain and a 6.9% performance improvement on challenging math problems.

    • PRIME consistently boosted the performance and efficiency of other RL algorithms (REINFORCE, GRPO, PPO).

    • The analysis showed that online PRM updates are crucial for success, preventing over-optimization and reward hacking.

      These findings solve the specific problems of reward sparsity, training inefficiency, and the credit assignment problem in LLM RL by providing a scalable and effective method for incorporating dense, fine-grained rewards.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand PRIME, a reader needs a grasp of core concepts in large language models and reinforcement learning.

  • Large Language Models (LLMs): Neural networks, typically based on the Transformer architecture, trained on vast amounts of text data to generate and understand natural language (e.g., answering questions, translating, writing code). They operate in an autoregressive manner, meaning they predict the next token (a word or sub-word unit) in a sequence based on the preceding tokens.
  • Reinforcement Learning (RL): A paradigm where an agent learns to make optimal decisions by interacting with an environment. The agent performs actions in states, receives rewards or penalties, and aims to maximize its cumulative reward over time.
    • Policy ($\pi$): The agent's strategy, mapping states to actions. In LLMs, the policy ($\pi_\theta$) is typically the language model itself, parameterized by $\theta$, which outputs probabilities for the next token given the preceding sequence.
    • Environment: For LLMs, the environment is the task (e.g., math problem, coding task) and the feedback mechanism (e.g., a verifier checking the final answer).
    • State: The current context or sequence of tokens generated so far, plus the input prompt.
    • Action: Generating the next token $y_t$.
    • Reward ($r$): A scalar feedback signal indicating the desirability of an action or sequence of actions.
    • Return ($G$): The total cumulative discounted reward from a time step $t$ to the end of an episode, often calculated as $G_t = \sum_{s=t}^T \gamma^{s-t} r(y_s)$, where $\gamma \in [0, 1]$ is a discount factor that weighs immediate rewards more heavily than future ones.
  • Policy Gradient: A family of RL algorithms that directly optimize the policy by estimating the gradient of the expected return with respect to the policy parameters. The policy gradient theorem states that: $ \nabla_\theta J(\theta) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}, \mathbf{y} \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(y_t | \mathbf{y}_{<t}) A_t \right] $ Here, $J(\theta)$ is the objective function (expected cumulative reward), $\mathcal{D}$ is the data distribution (prompts), $\pi_\theta(y_t | \mathbf{y}_{<t})$ is the probability of choosing token $y_t$ given previous tokens $\mathbf{y}_{<t}$ under the policy $\pi_\theta$, and $A_t$ is the advantage function.
  • Advantage Function ($A_t$): Quantifies how much better a specific action $y_t$ in a state $\mathbf{y}_{<t}$ is compared to the average expected outcome from that state. It helps reduce variance in policy gradient estimation. A common form is the Monte-Carlo (MC) advantage estimate: $ A_t = \sum_{s=t}^T \gamma^{s-t} r(y_s) - b $ Here, $\sum_{s=t}^T \gamma^{s-t} r(y_s)$ is the actual return received from step $t$, and $b$ is a baseline (e.g., a value estimate) subtracted to reduce variance without changing the expected gradient (a minimal sketch of this computation appears after this list).
  • Value Models ($V$): Neural networks trained to predict the expected future return from a given state, $V(\mathbf{y}_{<t})$. They are used as baselines to reduce the variance of Monte-Carlo advantage estimates.
    • Generalized Advantage Estimation (GAE): A method that combines Monte-Carlo estimates (low bias, high variance) with Temporal Difference (TD) estimates (high bias, low variance) to achieve a good bias-variance trade-off. The TD error $\delta_t = r(y_t) + \gamma V(\mathbf{y}_{<t+1}) - V(\mathbf{y}_{<t})$ measures the discrepancy between the actual reward plus the predicted value of the next state and the predicted value of the current state.
  • Proximal Policy Optimization (PPO): A widely used actor-critic RL algorithm. The "actor" is the policy network ($\pi_\theta$) that generates actions, and the "critic" is a value network ($V$) that estimates state values. PPO updates the policy in multiple small steps, clipping the objective function to prevent large policy updates that could destabilize training, and uses GAE for advantage estimation.
  • Reward Sparsity: A common issue in RL where rewards are only provided infrequently (e.g., at the end of a long sequence). This makes learning difficult, as the agent receives little feedback during intermediate steps, exacerbating the credit assignment problem.
  • Reward Hacking / Overoptimization: Occurs when an RL agent finds loopholes in the reward function, maximizing the reward signal without actually achieving the intended goal. This is especially problematic with static reward models, as the policy can drift into regions where the reward model provides high scores for unintended behavior, leading to a distribution shift between the policy's generated data and the reward model's training data.
  • Supervised Fine-tuning (SFT): A common pre-training step for LLMs in RL, where a base model is fine-tuned on a dataset of high-quality instruction-following examples (input-output pairs). This teaches the model basic reasoning abilities and desired output formats before RL.
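A minimal sketch of the Monte-Carlo return and advantage computation described above, in plain Python; it is illustrative only and not taken from the paper's code.

```python
# Minimal sketch of the Monte-Carlo return G_t and advantage A_t = G_t - b
# described above. Illustrative only; not the paper's implementation.
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """G_t = sum_{s>=t} gamma^(s-t) * r(y_s) for every step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def mc_advantages(rewards: List[float], baseline: float, gamma: float = 1.0) -> List[float]:
    """Monte-Carlo advantages: discounted returns minus a scalar baseline b."""
    return [g - baseline for g in discounted_returns(rewards, gamma)]

# Example: a sparse outcome reward of 1.0 at the final token with baseline 0.5.
print(mc_advantages([0.0, 0.0, 0.0, 1.0], baseline=0.5))  # [0.5, 0.5, 0.5, 0.5]
```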

3.2. Previous Works

The paper frames its contribution by contrasting with existing approaches to handling rewards in LLM RL:

  • Sparse Outcome Rewards: Most industry-leading LLMs rely on outcome reward models (ORMs) that provide a single scalar reward at the end of a generated sequence (e.g., Rafailov et al., 2023; Shao et al., 2024; DeepSeek-AI et al., 2025). While simpler, this approach suffers from the reward sparsity issues, training inefficiency, and credit assignment problem described above. Some attempts to mitigate sparsity with value models in PPO have shown limited effectiveness due to training challenges (Shao et al., 2024; Ahmadian et al., 2024).
  • Dense Process Rewards (Traditional): The concept of dense process rewards (feedback at intermediate steps) is not new and has proven effective in inference-time scaling for LLMs on reasoning tasks (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023; Yuan et al., 2024b). However, their application in training LLMs with RL faces significant challenges:
    • Human Annotation Pipelines: Lightman et al. (2023) utilized complex human annotation pipelines to gather step-level labels for training PRMs. This is expensive and not scalable for online RL updates.
    • Estimation-based Methods: Other methods rely on estimating process rewards, requiring a large number of rollouts (e.g., 10x more for each step) compared to response-level trajectories (Wang et al., 2023; Kazemnejad et al., 2024), making them computationally intensive and less scalable for online PRM updates.
  • Implicit Rewards: This line of work, particularly in LLM alignment, has shown that reward functions can be implicitly learned.
    • Rafailov et al. (2024) demonstrated that optimizing the Direct Preference Optimization (DPO) objective implicitly learns a Q-function.
    • Zhou et al. (2024) utilized implicit rewards within PPO and highlighted the effectiveness of dense implicit rewards.
    • Implicit Process Reward Modeling (Yuan et al., 2024b): This work is a direct precursor to PRIME. It proposes a method to train an ORM (with outcome labels) that can be repurposed as a PRM at inference time. The core idea is that the reward can be represented as a log-ratio of probabilities between a reward model and a reference model: $ r(\mathbf{y}) = \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_{\mathrm{ref}}(\mathbf{y})} $ where $\pi_\phi(\mathbf{y})$ is the probability of the sequence $\mathbf{y}$ under the reward model, and $\pi_{\mathrm{ref}}(\mathbf{y})$ is its probability under a reference model. This formulation allows deriving token-level rewards from a model trained on sequence-level preferences, as made explicit in the derivation below. PRIME builds directly on this formulation for its Implicit PRM.
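Because both $\pi_\phi$ and $\pi_{\mathrm{ref}}$ factorize autoregressively over tokens, the sequence-level implicit reward decomposes into a sum of token-level log-ratios, which is exactly the per-token reward PRIME uses later (Section 4.2.3): $ r_\phi(\mathbf{y}) = \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_{\mathrm{ref}}(\mathbf{y})} = \beta \sum_{t=1}^{|\mathbf{y}|} \log \frac{\pi_\phi(y_t \mid \mathbf{y}_{<t})}{\pi_{\mathrm{ref}}(y_t \mid \mathbf{y}_{<t})} = \sum_{t=1}^{|\mathbf{y}|} r_\phi(y_t) $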

3.3. Technological Evolution

The field of LLM reinforcement learning has evolved significantly:

  1. Early Alignment (RLHF): Initial applications of RL for LLMs focused on human alignment, primarily using Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022; Cui et al., 2024). This typically involved training an ORM from human preferences over complete responses, and then fine-tuning the LLM with PPO using these sparse rewards.
  2. Imitation Learning for Reasoning: For LLM reasoning, many open-source efforts initially relied on imitation learning (Yuan et al., 2024a; Yue et al., 2024; Wei et al., 2024; Liu et al., 2024), where models learn by mimicking expert-demonstrated reasoning steps.
  3. Large-scale RL for Reasoning: More recently, the paradigm has shifted towards large-scale RL for reasoning LLMs, with works like OpenAI o1 (Jaech et al., 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025) demonstrating the immense potential of RL with outcome rewards. These works highlight the scaling effects but also acknowledge the limitations of outcome-only feedback.
  4. Emergence of Dense Rewards in Training: PRIME represents a crucial step in this evolution by directly tackling the challenge of incorporating dense process rewards into the training of LLMs for reasoning, moving beyond inference-time applications. It builds on the implicit reward paradigm to make this scalable.

3.4. Differentiation Analysis

Compared to main methods in related work, PRIME introduces several core differences and innovations:

  • Online PRM Updates without Step-level Labels: The most significant differentiation is PRIME's ability to update process reward models (PRMs) online during RL training using only outcome-level labels. This directly contrasts with traditional PRMs that require expensive and difficult-to-collect step-level annotations (e.g., Lightman et al., 2023) or computationally intensive estimation methods (Wang et al., 2023; Kazemnejad et al., 2024). This solves the scalability issue (C2) and the difficulty of defining process rewards (C1).

  • Elimination of Dedicated RM Training Phase: Unlike existing RLHF pipelines that require a separate, costly phase to train a reward model (RM) or PRM (C3), PRIME initializes its Implicit PRM directly from the SFT model or even the base model. This substantially reduces development overhead and time.

  • Token-level Fine-grained Rewards from Outcome Labels: PRIME leverages implicit process reward modeling (Yuan et al., 2024b) to derive dense, token-level rewards from models trained solely on outcome labels. This provides finer granularity than typical step-level PRMs without additional annotation cost or ambiguity.

  • General Compatibility with RL Algorithms: PRIME integrates dense and sparse rewards in a flexible manner by computing their returns separately before summing, making it a general plug-in for various Monte Carlo (MC) advantage estimators and RL algorithms (REINFORCE, RLOO, GRPO, PPO), unlike methods tied to specific actor-critic architectures.

  • Mitigation of Reward Hacking: By enabling online updates of the PRM with on-policy rollouts, PRIME inherently mitigates reward hacking and over-optimization issues that plague static reward models due to distribution shift.

    In essence, PRIME bridges the gap between the theoretical benefits of dense rewards and the practical challenges of their implementation in large-scale LLM RL, offering a scalable, efficient, and general solution.

4. Methodology

4.1. Principles

The core idea behind PRIME is to leverage implicit process reward modeling to generate dense, token-level rewards that can be updated online using only outcome-level supervision. This overcomes the major hurdles of conventional dense PRMs, such as the prohibitive cost of collecting step-level labels and the vulnerability to reward hacking from static reward models. By treating the Implicit PRM as a causal language model, PRIME can derive token-level rewards from a model trained on sequence-level outcomes, making it scalable for online RL and compatible with various advantage functions. The framework integrates these dense process rewards with traditional sparse outcome rewards to provide comprehensive feedback for policy optimization.

4.2. Core Methodology In-depth (Layer by Layer)

The PRIME framework is designed as a scalable online RL method with dense rewards. It integrates the concept of implicit process rewards with a flexible advantage estimation and policy update mechanism. The overall workflow is illustrated in Figure 1 and detailed in Algorithm 1.

The process flows through several stages, iteratively refining the policy model and the Implicit PRM:

4.2.1. Initialization

The first step involves initializing the key components:

  • The policy model ($\pi_\theta$) and its old version ($\pi_{\theta_{old}}$) are initialized from a pre-trained language model, typically a supervised fine-tuning (SFT) model.

  • The Implicit PRM ($\pi_\phi$) and the reference model ($\pi_{\mathrm{ref}}$) are also initialized from the same SFT model or even a base model. This is a key PRIME innovation, eliminating a dedicated PRM training phase.

    The following figure (Figure 1 from the original paper) shows the illustration of PRIME:


Figure 1: Illustration of PRIME. PRIME follows that (1) initialize policy model and the Implicit PRM both with the reference model; (2) sample multiple responses for each prompt and filter with output accuracy; (3) obtain implicit process rewards by the Implicit PRM and update it using cross-entropy (CE) loss; (4) compute advantage and policy loss then update the policy model.

4.2.2. Policy Rollouts

For each RL iteration (Step 2 of Algorithm 1):

  1. Sample Prompts (Algorithm 1, Step 3): A batch of prompts $\mathbf{B}$ is sampled from the dataset $\mathcal{D}$.

  2. Generate Responses (Algorithm 1, Step 4): For each prompt $\mathbf{x} \in \mathbf{B}$, the current policy model $\pi_\theta$ generates $K$ responses: $\{\mathbf{y}^1, \dots, \mathbf{y}^K\}$. These are complete trajectories (sequences of tokens).

  3. Compute Outcome Rewards (Algorithm 1, Step 5): An outcome verifier $r_o$ (a rule-based function or a reward model that scores the entire generated response) computes the outcome reward $r_o(\mathbf{y}^{1:K})$ for each of the $K$ responses. As discussed in Section 5.2, for math this is usually 1 for an exact match to the ground truth and 0 otherwise; for coding, it is the proportion of passing test cases.

  4. Apply Accuracy Filter (Algorithm 1, Step 6): An online prompt filtering technique is applied. It retains only prompts whose sampled responses achieve an overall accuracy within a certain range (i.e., median-level difficulty). This balances the data distribution for Implicit PRM training and stabilizes RL training by focusing on useful examples (as shown in Figure 2); a minimal sketch of this filter appears after the figure below. The filtered set of (prompt, response, outcome reward) triplets is denoted as $\mathcal{T}$.

    The following figure (Figure 2 from the original paper) shows the effect of online prompt filtering:


Figure 2: Effect of online prompt filtering.
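A minimal sketch of the accuracy-based prompt filter; the (0.2, 0.8) accuracy window is an assumption for illustration, since the text above only specifies "a certain accuracy range".

```python
# Minimal sketch of online prompt filtering. The (0.2, 0.8) accuracy window is an
# assumption for illustration, not a value taken from the text above.
from typing import Dict, List, Tuple

def filter_prompts(
    rollouts: Dict[str, List[Tuple[str, float]]],  # prompt -> [(response, outcome_reward), ...]
    low: float = 0.2,
    high: float = 0.8,
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep prompts whose mean outcome accuracy over the K sampled responses lies in
    (low, high), i.e. medium-difficulty prompts that give a balanced training signal."""
    kept = {}
    for prompt, samples in rollouts.items():
        accuracy = sum(reward for _, reward in samples) / len(samples)
        if low < accuracy < high:
            kept[prompt] = samples
    return kept
```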

4.2.3. Implicit Process Reward Modeling and Update

This is where PRIME addresses the scalability challenges of dense rewards.

  1. Obtain Implicit Process Rewards (Algorithm 1, Step 7): For each $(\mathbf{x}, \mathbf{y}) \in \mathcal{T}$, the Implicit PRM ($\pi_\phi$) and the reference model ($\pi_{\mathrm{ref}}$) are used to calculate token-level dense rewards $r_\phi(y_t)$ for each token $y_t$ in the sequence $\mathbf{y}$. The calculation is based on the log-ratio of token probabilities: $ r_\phi(y_t) := \beta \log \frac{\pi_\phi(y_t | \mathbf{y}_{<t})}{\pi_{\mathrm{ref}}(y_t | \mathbf{y}_{<t})} $

    • $r_\phi(y_t)$: The implicit process reward for generating token $y_t$ at time step $t$.
    • $\beta$: A hyperparameter that controls the magnitude of the implicit rewards.
    • $\pi_\phi(y_t | \mathbf{y}_{<t})$: The probability of generating token $y_t$ given the preceding sequence $\mathbf{y}_{<t}$ under the Implicit PRM $\pi_\phi$.
    • $\pi_{\mathrm{ref}}(y_t | \mathbf{y}_{<t})$: The probability of generating token $y_t$ given the preceding sequence $\mathbf{y}_{<t}$ under the reference model $\pi_{\mathrm{ref}}$.
    • $\log$: The natural logarithm. This formulation allows the Implicit PRM to be trained with outcome labels yet provide fine-grained token-level feedback, by interpreting the ratio of likelihoods from the PRM and the reference model as rewards.
  2. Update Implicit PRM (Algorithm 1, Step 8): The Implicit PRM $\pi_\phi$ is updated online using the collected rollouts from $\mathcal{T}$ and their outcome rewards. The update uses a cross-entropy (CE) loss: $ \mathcal{L}_{\mathrm{CE}}(\phi) = -\mathbb{E}_{(\mathbf{x}, \mathbf{y}, r_o(\mathbf{y})) \sim \mathcal{T}} \left[ r_o(\mathbf{y}) \cdot \log \sigma\left(r_\phi(\mathbf{y})\right) + \left(1 - r_o(\mathbf{y})\right) \cdot \log\left(1 - \sigma\left(r_\phi(\mathbf{y})\right)\right) \right] $

    • $\mathcal{L}_{\mathrm{CE}}(\phi)$: The cross-entropy loss for updating the Implicit PRM parameters $\phi$.
    • $\mathbb{E}$: Expected value over the filtered samples $\mathcal{T}$.
    • $(\mathbf{x}, \mathbf{y}, r_o(\mathbf{y}))$: A triplet of prompt, generated response, and its outcome reward from the filtered set $\mathcal{T}$.
    • $r_o(\mathbf{y})$: The binary outcome reward for the entire response $\mathbf{y}$ (typically 1 for correct, 0 for incorrect).
    • $\sigma(\cdot)$: The sigmoid function, which squashes the Implicit PRM's sequence-level reward $r_\phi(\mathbf{y}) = \sum_{t=1}^{|\mathbf{y}|} r_\phi(y_t)$, i.e., the sum of its token-level rewards, into a probability-like score between 0 and 1.
    • This CE loss effectively trains the Implicit PRM to predict the outcome reward of a full trajectory, even though it provides token-level rewards when used for advantage estimation. This online update is crucial for mitigating reward hacking as the PRM adapts to the evolving policy distribution (as shown in Figure 5). A minimal sketch of the reward computation and CE update appears after this list.
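The sketch below illustrates these two steps under simple assumptions: `logp_prm` and `logp_ref` stand in for per-token log-probabilities of the sampled tokens under the Implicit PRM and the frozen reference model; it is not the authors' implementation.

```python
# Minimal sketch of the implicit token rewards and the outcome-supervised CE update
# defined above. `logp_prm` / `logp_ref` are assumed per-token log-probabilities of
# the sampled tokens under the Implicit PRM and the reference model (not a real API).
import torch
import torch.nn.functional as F

def implicit_token_rewards(logp_prm: torch.Tensor,   # shape (T,)
                           logp_ref: torch.Tensor,   # shape (T,)
                           beta: float = 0.05) -> torch.Tensor:
    """r_phi(y_t) = beta * log( pi_phi(y_t | y_<t) / pi_ref(y_t | y_<t) )."""
    return beta * (logp_prm - logp_ref)

def prm_ce_loss(logp_prm: torch.Tensor, logp_ref: torch.Tensor,
                outcome: float, beta: float = 0.05) -> torch.Tensor:
    """Cross-entropy of the sequence-level implicit reward against the outcome label."""
    r_seq = implicit_token_rewards(logp_prm, logp_ref, beta).sum()
    # log sigma(r) and log(1 - sigma(r)) written stably via logsigmoid.
    return -(outcome * F.logsigmoid(r_seq) + (1.0 - outcome) * F.logsigmoid(-r_seq))

# Toy usage: a 5-token response that the outcome verifier judged correct.
logp_prm = torch.randn(5, requires_grad=True)
logp_ref = torch.randn(5)
loss = prm_ce_loss(logp_prm, logp_ref, outcome=1.0)
loss.backward()  # gradients flow back into the Implicit PRM's log-probabilities here
```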

4.2.4. Advantage Estimation and Policy Update

Once dense process rewards and outcome rewards are available, PRIME calculates advantages to update the policy.

  1. Monte Carlo Advantage with Leave-One-Out Baseline: The paper uses Monte-Carlo (MC) estimators for advantage calculation, finding them stable and effective. Specifically, it employs a leave-one-out (LOO) baseline, which reduces variance by subtracting the average reward of the other samples for the same prompt. The LOO advantage for the outcome reward $r_o(\mathbf{y}^i)$ of the $i$-th response among $K$ samples is: $ A^i = r_o(\mathbf{y}^i) - \frac{1}{K-1} \sum_{j \neq i} r_o(\mathbf{y}^j) $

    • $A^i$: The advantage for the $i$-th response.
    • $r_o(\mathbf{y}^i)$: The outcome reward for the $i$-th response.
    • $K$: The total number of responses sampled for a given prompt.
    • $\sum_{j \neq i} r_o(\mathbf{y}^j)$: The sum of outcome rewards over all other responses $j$. The formula thus computes the advantage of the $i$-th response as its own reward minus the average reward of the other responses for the same prompt.
  2. Combined Advantage Function (Algorithm 1, Step 9): PRIME combines implicit process rewards and outcome rewards by calculating their returns separately and then summing them, which avoids the numerical instability that can arise from directly mixing their values. The final advantage $A_t^i$ for token $y_t^i$ in the $i$-th response is: $ A_t^i = \sum_{s=t}^{|\mathbf{y}^i|} \gamma^{s-t} \cdot \left[ r_\phi(y_s^i) - \frac{1}{K-1} \sum_{j \neq i} r_\phi(\mathbf{y}^j) \right] + r_o(\mathbf{y}^i) - \frac{1}{K-1} \sum_{j \neq i} r_o(\mathbf{y}^j) $

    • $A_t^i$: The advantage for generating token $y_t^i$ at step $t$ in the $i$-th response.
    • $\sum_{s=t}^{|\mathbf{y}^i|} \gamma^{s-t} \cdot [\dots]$: The discounted return computed from the implicit process rewards.
      • $\gamma$: The discount factor, weighing future rewards less.
      • $|\mathbf{y}^i|$: The total length of the $i$-th response.
      • $r_\phi(y_s^i)$: The implicit process reward for token $y_s^i$ at step $s$.
      • $\frac{1}{K-1} \sum_{j \neq i} r_\phi(\mathbf{y}^j)$: The leave-one-out baseline built from the implicit rewards of the other responses, with $r_\phi(\mathbf{y}^j) = \sum_{t=1}^{|\mathbf{y}^j|} r_\phi(y_t^j)$. The paper states "Use the averaged implicit process rewards to calculate the leave-one-out baseline", and the formula applies this sequence-level baseline inside the discounted sum; a token-level alternative would be $\frac{1}{K-1} \sum_{j \neq i} r_\phi(y_s^j)$, but as written the baseline normalizes the process-reward component at the sequence level. The overall intent is a leave-one-out baseline for the dense-reward return.
    • $r_o(\mathbf{y}^i) - \frac{1}{K-1} \sum_{j \neq i} r_o(\mathbf{y}^j)$: The leave-one-out advantage for the sparse outcome reward.
  3. Update Policy with PPO Loss (Algorithm 1, Step 10): The policy model $\pi_\theta$ is updated using the PPO clipped surrogate loss, which provides stable updates by preventing the new policy from deviating too far from the old one (see the sketch after this list): $ L_{\mathrm{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( \frac{\pi_\theta(y_t | \mathbf{y}_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t | \mathbf{y}_{<t})} A_t, \ \mathrm{clip}\left( \frac{\pi_\theta(y_t | \mathbf{y}_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t | \mathbf{y}_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_t \right) \right] $

    • $L_{\mathrm{CLIP}}(\theta)$: The PPO clipped surrogate loss for updating the policy parameters $\theta$.
    • $\mathbb{E}_t[\dots]$: Expected value over timesteps $t$.
    • $\pi_\theta(y_t | \mathbf{y}_{<t})$: Probability of token $y_t$ under the current policy.
    • $\pi_{\theta_{old}}(y_t | \mathbf{y}_{<t})$: Probability of token $y_t$ under the previous policy (before the update).
    • $A_t$: The advantage calculated in the previous step (for token $y_t$).
    • $\mathrm{clip}(x, x_{low}, x_{high})$: A clipping function that constrains $x$ to lie within $[x_{low}, x_{high}]$.
    • $\epsilon$: A small clipping parameter (e.g., 0.2) that limits how much the policy can change in one update step, ensuring stability. The loss makes actions with positive advantage more likely and those with negative advantage less likely, clipped to prevent excessive changes.
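The sketch below puts the combined advantage and the clipped update together; it follows the formulas above literally (sequence-level LOO baseline subtracted inside the discounted sum) and is illustrative only, not the released code.

```python
# Minimal sketch of PRIME's combined advantage (dense LOO return + sparse LOO advantage)
# and the PPO clipped surrogate, following the formulas above literally. Illustrative
# only; tensor shapes and the sequence-level baseline placement are assumptions.
import torch

def discounted_suffix_sums(x: torch.Tensor, gamma: float) -> torch.Tensor:
    """g[t] = sum_{s>=t} gamma^(s-t) * x[s]."""
    g = torch.zeros_like(x)
    running = torch.tensor(0.0)
    for t in range(len(x) - 1, -1, -1):
        running = x[t] + gamma * running
        g[t] = running
    return g

def prime_advantages(proc_rewards, outcomes, gamma: float = 1.0):
    """proc_rewards: list of K 1-D tensors of token-level implicit rewards r_phi(y_t^i);
    outcomes: list of K scalar outcome rewards r_o(y^i) for the same prompt."""
    K = len(proc_rewards)
    seq_rewards = torch.stack([r.sum() for r in proc_rewards])
    outs = torch.tensor(outcomes, dtype=torch.float32)
    advantages = []
    for i in range(K):
        proc_baseline = (seq_rewards.sum() - seq_rewards[i]) / (K - 1)  # LOO over others
        outcome_adv = outs[i] - (outs.sum() - outs[i]) / (K - 1)
        dense_return = discounted_suffix_sums(proc_rewards[i] - proc_baseline, gamma)
        advantages.append(dense_return + outcome_adv)  # token-level A_t^i
    return advantages

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  adv: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate over one response with token-level advantages."""
    ratio = torch.exp(logp_new - logp_old)
    return -torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
```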

4.2.5. Update Old Parameters

Finally, the old policy parameters $\theta_{old}$ are updated to the current policy parameters $\theta$ (Algorithm 1, Step 11) for the next iteration, and the loop continues until the total number of iterations $N$ is reached.

4.2.6. Other Techniques

  • Initializing PRM with SFT/base model: As mentioned, the Implicit PRM is directly initialized from the SFT model or even the base model, bypassing a dedicated PRM training stage. This is shown to be effective and to even outperform PRMs trained on extra data (see Section 6.3.1 and Figure 4).
  • Online Prompt Filtering: This technique, described in Section 4.2.2, filters prompts to keep those with accuracy within a certain range. This focuses training on "median-level difficulty" problems, stabilizing RL and balancing Implicit PRM training.

4.2.7. PRIME's Solution to the Challenges (C1-C3 from Section 2.1)

  • C1. Process rewards are hard to define: PRIME addresses this by deriving token-level implicit process rewards from an Implicit PRM trained only with outcome labels, removing the need for ambiguous and costly step-level annotations.
  • C2. PRM online updates are not scalable: PRIME enables online updating of the Implicit PRM using on-policy rollouts and outcome labels (which are already collected for policy updates). This is scalable because it doesn't require new, expensive human annotations for each update, preventing reward hacking due to distribution shift.
  • C3. Explicit reward modeling brings extra cost: PRIME eliminates the dedicated reward modeling stage by directly initializing the Implicit PRM from the SFT or base model.

5. Experimental Setup

5.1. Datasets

The experiments primarily focus on competition-level mathematics and programming tasks.

5.1.1. Supervised Fine-tuning (SFT) Dataset

  • Purpose: To provide a strong starter model for RL by teaching specific reasoning patterns.

  • Content: Reasoning instructions collected from several open-source datasets, completed by LLaMA-3.1-70B-Instruct using an action-centric chain-of-thought reasoning framework. The framework involves 7 actions (ASSESS, ADVANCE, VERIFY, SIMPLIFY, SYNTHESIZE, PIVOT, OUTPUT) that the model chooses at each step of its multi-step reasoning.

  • Sources (Table 10):

    • Math: MathInstruct-MATH (Yue et al., 2023), OpenMathIns-2-Aug_Math (Toshniwal et al., 2024), Numina (Li et al., 2024), Reasoning-001 (SkunkworksAI, 2024).
    • Coding: Code-Feedback (Zheng et al., 2024), Magicoder (Wei et al., 2024), Magicoder-OSS (Wei et al., 2024).
    • Biomedicine: UltraMedical_mc (Zhang et al., 2024).
  • Scale: 230K SFT data pairs, with an average response length of 1390 tokens.

  • Reason for Selection: Authors explicitly did not include many datasets with ground-truth answers in SFT, reserving them for RL to diversify exploration and because ground-truth is considered more essential in RL.

    The following are the results from Table 10 of the original paper:

    Task Dataset Size Avg. Response Length Source
    Math MathInstruct-MATH (Yue et al., 2023) 12715 964.01 https://huggingface.co/datasets/TIGER-Lab/MathInstruct
    OpenMathIns-2-Aug_Math (Toshniwal et al., 2024) 15086 1202.25 https://huggingface.co/datasets/nvidia/OpenMathInstruct-2
    Numina (Li et al., 2024) 55845 1331.61 https://huggingface.co/datasets/AI-MO/NuminaMath-CoT
    Reasoning-001 (SkunkworksAI, 2024) 29831 1316.49 https://huggingface.co/datasets/SkunkworksAI/reasoning-0.01
    Coding Code-Feedback (Zheng et al., 2024) 27663 1805.16 https://huggingface.co/datasets/m-a-p/Code-Feedback
    Magicoder (Wei et al., 2024) 24480 1828.72 https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K
    Magicoder-OSS (Wei et al., 2024) 28980 1850.05 https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K
    Biomedicine UltraMedical_mc (Zhang et al., 2024) 35163 891.06 https://huggingface.co/datasets/TsinghuaC3I/UltraMedical
    Total / Avg. 229763 1390.75

5.1.2. Reinforcement Learning (RL) Datasets

  • Purpose: To train LLMs using RL with outcome verifiers.
  • Content: High-quality mathematics and coding problems with outcome verifiers (LaTeX answers for math, test cases for coding).
  • Sources:
    • Math: NuminaMathCoT (Li et al., 2024), containing about 860K math problems (Chinese high school to International Mathematical Olympiad). After cleaning and filtering, 457k math problems were retained.
    • Coding: APPS (Hendrycks et al., 2021a), CodeContests (Li et al., 2022), TACO (Li et al., 2023), and Codeforces2. After cleaning and filtering, 27k coding problems were retained.
  • Data Preprocessing: A systematic rule-based approach was used to filter and classify math problems. Problems with figures/diagrams or proofs were excluded. Remaining problems were classified into question-answering, multiple-choice, or fill-in-the-blank, with a focus on multiple-choice. These were then converted to a direct question-answer format using a two-phase reformatting process (rule-based filtering, LLM-based filtering, LLM-based formatting). Finally, LLM-based comprehensive validation (using QwQ-32B-Preview and Qwen2.5-Math-72B-Instruct with self-consistency) was performed to ensure solvability and correctness.

5.1.3. EurusPRM Training Dataset (for ablation)

  • Purpose: Used for an ablation study to compare PRM initialization strategies. This dataset is specifically for training a dedicated PRM to compare against PRIME's approach of initializing PRM from SFT model.

  • Sources (Table 11): UltraInteract, Numina-SynMath, Numina-Olympiads.

  • Generators: Llama-3.1-8B-Inst, Llama-3.1-8B-Base, Qwen2.5-72B-Inst, Qwen2.5-Math-7B-Base.

  • Scale: 500K data generated by Llama3.1 and Qwen2.5 series, with 8 responses per instruction, all at the response-level.

    The following are the results from Table 11 of the original paper:

    Dataset Generator Model Num. Inst Resp/Inst Step-level/Response-level
    UltraInteract Llama-3.1-8B-Inst 20177 8 Response-level
    Llama-3.1-8B-Base 13570 8 Response-level
    Qwen2.5-72B-Inst 4758 8 Response-level
    Qwen2.5-Math-7B-Base 25713 8 Response-level
    Numina-SynMath Llama-3.1-8B-Inst 4783 8 Response-level
    Qwen2.5-Math-7B-Base 5806 8 Response-level
    Numina-Olympiads Llama-3.1-8B-Inst 2909 8 Response-level
    Qwen2.5-Math-7B-Base 4739 8 Response-level

5.2. Evaluation Metrics

The primary evaluation metrics are focused on accuracy on reasoning benchmarks.

  • Rule-based Outcome Verifier ($r_o$): This serves as the ground-truth reward for both training and evaluation, consistent with recent research on unhackable rewards (a minimal sketch of these metrics follows this list).

    • For Math ($r_o^{\mathrm{math}}(\mathbf{y})$):
      • Conceptual Definition: For mathematical problems, the outcome reward is a binary value indicating whether the generated answer exactly matches the ground truth. It focuses on the correctness of the final numerical or symbolic result.
      • Mathematical Formula: $ r_o^{\mathrm{math}}(\mathbf{y}) = \begin{cases} 1, & \text{matched} \\ 0, & \text{otherwise} \end{cases} $
      • Symbol Explanation:
        • $r_o^{\mathrm{math}}(\mathbf{y})$: The outcome reward for a generated mathematical response $\mathbf{y}$.
        • 1: Indicates the generated answer is an exact match to the ground truth.
        • 0: Indicates the generated answer does not match the ground truth.
        • matched: A boolean condition where the answer extracted from $\mathbf{y}$ is identical to the canonical ground-truth answer.
    • For Coding ($r_o^{\mathrm{code}}(\mathbf{y})$):
      • Conceptual Definition: For coding problems, the outcome reward measures the proportion of test cases that the generated code passes. It assesses the functional correctness of the code.
      • Mathematical Formula: $ r_o^{\mathrm{code}}(\mathbf{y}) = \frac{\sum \#\text{passes}}{\sum \#\text{test cases}} $
      • Symbol Explanation:
        • $r_o^{\mathrm{code}}(\mathbf{y})$: The outcome reward for a generated code response $\mathbf{y}$.
        • $\sum \#\text{passes}$: The total number of test cases that the generated code passes.
        • $\sum \#\text{test cases}$: The total number of test cases available for the problem.
  • Pass@1 (Accuracy):

    • Conceptual Definition: Pass@1 (or accuracy) is a metric commonly used in code generation and reasoning tasks. It measures the percentage of problems for which at least one generated solution (when sampling multiple) is correct. In this paper, Pass@1 likely refers to the single best generation for each problem, as is typical for competitive benchmarks.
    • Mathematical Formula: $ \text{Pass@1} = \frac{\text{Number of problems with at least one correct solution}}{\text{Total number of problems}} \times 100\% $
    • Symbol Explanation:
      • Number of problems with at least one correct solution: The count of problems where the model produced at least one output that passed the outcome verifier.
      • Total number of problems: The total number of problems in the evaluation set.
  • Average Improvement: This refers to the percentage increase in Pass@1 (or similar accuracy metric) over a baseline model across multiple benchmarks.
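A minimal sketch of these evaluation signals; the exact-match normalization and the (passed, total) test-runner interface are assumptions for illustration.

```python
# Minimal sketch of the rule-based outcome rewards and Pass@1 defined above.
# Exact-match normalization (strip only) and the (passed, total) interface are assumptions.
from typing import List

def math_outcome_reward(pred_answer: str, gold_answer: str) -> float:
    """r_o^math: 1 if the extracted answer exactly matches the ground truth, else 0."""
    return 1.0 if pred_answer.strip() == gold_answer.strip() else 0.0

def code_outcome_reward(passed: int, total: int) -> float:
    """r_o^code: fraction of test cases the generated program passes."""
    return passed / total if total > 0 else 0.0

def pass_at_1(per_problem_correct: List[bool]) -> float:
    """Percentage of problems with at least one correct solution."""
    return 100.0 * sum(per_problem_correct) / len(per_problem_correct)

# Example: 3 of 4 problems solved -> 75.0
print(pass_at_1([True, True, False, True]))
```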

5.3. Baselines

The paper compares PRIME against several strong baselines:

  • Eurus-2-7B-SFT: The supervised fine-tuning (SFT) version of their model, serving as the starting point for RL and a direct comparison to measure RL's impact.
  • RLOO w/OV Only: RL using leave-one-out (LOO) baseline with only the outcome verifier (OV) rewards (i.e., sparse rewards, no dense process rewards). This is the direct sparse reward RL baseline for PRIME.
  • State-of-the-art LLMs:
    • GPT-4o: A powerful proprietary LLM from OpenAI.
    • Llama-3.1-70B-Instruct: A leading open-source LLM from Meta.
    • Qwen2.5-Math-7B-Instruct: A specialized mathematical reasoning LLM from Qwen, used as a strong domain-specific competitor.
  • Other RL Algorithms (for ablation):
    • REINFORCE (Williams, 1992): A basic Monte-Carlo policy gradient algorithm.
    • GRPO (Shao et al., 2024): Group Relative Policy Optimization, which uses the group average of rewards as a baseline.
    • PPO (Schulman et al., 2017): A popular actor-critic algorithm that uses Generalized Advantage Estimation (GAE) and a clipped objective.
  • Specialized RL Baselines (for comparison in Appendix):
    • VinePPO (Kazemnejad et al., 2024): An RL method for LLM reasoning that uses average return across trajectories for value estimation.

    • DeepScaleR (Luo et al., 2025): A three-stage RL training pipeline that iteratively increases context length for performance gain.

      These baselines are chosen to demonstrate PRIME's superiority over: (1) its SFT precursor, (2) RL with sparse rewards only, (3) other RL algorithms (to show generality), and (4) leading commercial and open-source models, especially those specialized in mathematical reasoning.

5.4. Hyperparameters

The experiments were conducted on $8 \times$ A800 GPUs using the veRL (Sheng et al., 2024) framework. The key settings are listed below and gathered into a config sketch after the list.

  • Policy Model Learning Rate: Constant $5 \times 10^{-7}$.
  • Optimizer: AdamW for the policy model.
  • Learning Rate Schedule: Cosine annealing with a warmup ratio of 0.1 for the SFT phase.
  • PRM Learning Rate: $1 \times 10^{-6}$.
  • Batch Size: 256 for both policy and PRM updates.
  • Micro Batch Size: 8.
  • Rollout Stage: Collects 256 prompts and samples 4 responses for each prompt.
  • Beta ($\beta$) for PRM: 0.05. This parameter controls the magnitude of the implicit rewards.
  • KL Coefficient: Set to 0 in all experiments, meaning no KL-divergence penalty is applied between the updated policy and the reference policy; such a penalty is commonly used in PPO and similar algorithms to prevent excessive policy shifts.
  • SFT Training Details: Full-parameter fine-tuning with a learning rate of $1 \times 10^{-5}$, AdamW optimizer, cosine annealing learning rate schedule (warmup ratio 0.1), batch size 96, random seed 42. Trained on the 230K SFT dataset for 3 epochs.
  • Implicit PRM/Reference Model Initialization: By default, the Implicit PRM is initialized from the SFT model, and the SFT model is retained to provide reference log-probabilities.
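For reference, the settings above collected into a single Python dict; the key names are illustrative assumptions, not the actual veRL configuration schema.

```python
# The hyperparameters listed above, gathered for reference. Key names are
# illustrative assumptions, not the actual veRL configuration schema.
prime_config = {
    "policy_lr": 5e-7,          # constant, AdamW
    "prm_lr": 1e-6,
    "batch_size": 256,          # policy and PRM updates
    "micro_batch_size": 8,
    "prompts_per_rollout": 256,
    "samples_per_prompt": 4,    # K responses per prompt
    "beta": 0.05,               # implicit-reward scale
    "kl_coef": 0.0,             # no KL penalty
    "sft_lr": 1e-5,
    "sft_batch_size": 96,
    "sft_epochs": 3,
}
```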

5.5. Evaluation Benchmarks

The models were evaluated on 7 reasoning benchmarks, focusing on competition-level mathematics and programming tasks:

  • AIME 2024 (Li et al., 2024): American Invitational Mathematics Examination, a challenging competition math contest.
  • AMC (Li et al., 2024): American Mathematics Competitions, another series of math contests.
  • MATH-500 (Hendrycks et al., 2021b): A dataset of 500 challenging math problems designed to test advanced mathematical reasoning.
  • Minerva Math (Lewkowycz et al., 2022): A dataset of math problems from various sources, used to evaluate mathematical problem-solving.
  • OlympiadBench (He et al., 2024): A benchmark for olympiad-level bilingual multimodal scientific problems.
  • LeetCode (Guo et al., 2024): A platform providing competitive programming problems.
  • LiveCodeBench (v2) (Jain et al., 2024): A holistic and contamination-free evaluation benchmark for large language models for code.

6. Results & Analysis

6.1. Core Results Analysis

The main results demonstrate PRIME's substantial improvements in reasoning benchmarks, particularly in competitive math and coding.

The following are the results from Table 1 of the original paper:

Method Step AIME 2024 AMC MATH-500 MinervaMath OlympiadBench LeetCode LiveCodeBench Avg.
GPT-4o - 9.3 45.8 76.4 36.8 43.3 58.9 48.8 45.6
Llama-3.1-70B-Inst. - 20.0 37.3 65.0 37.1 30.5 35.0 34.4 37.0
Qwen2.5-Math-7B-Inst. - 13.3 50.6 79.8 34.6 40.7 11.7 11.3 34.6
Eurus-2-7B-SFT 0 3.3 30.1 66.2 32.7 29.8 21.7 17.8 28.8
RLOO w/OV Only 240 20.0 47.0 73.2 36.4 35.4 28.3 26.7 36.9
Eurus-2-7B-PRIME 80 20.0 41.0 68.2 38.2 37.0 26.7 26.6 36.8
160 13.3 42.2 72.0 37.1 38.7 26.7 25.6 36.5
240 20.0 50.6 78.2 39.3 40.3 31.1 27.5 41.0
320 16.7 51.8 77.8 39.7 41.5 36.1 28.5 41.7
592 26.7 57.8 79.2 38.6 42.1 33.3 28.6 43.9
  • Significant Improvement over SFT Baseline: The Eurus-2-7B-PRIME model, starting from Eurus-2-7B-SFT (Avg. 28.8%), achieves an average score of 43.9% after 592 steps, representing a substantial 15.1% average improvement. This demonstrates the clear effectiveness of the PRIME framework in enhancing reasoning capabilities through RL.

  • Strong Performance in Math Competitions: PRIME shows remarkable gains on AMC (from 30.1% for SFT to 57.8% for PRIME at 592 steps) and AIME 2024 (from 3.3% to 26.7%), with over 20% improvement in these areas. This highlights its capability for complex mathematical reasoning.

  • Outperforming Qwen2.5-Math-7B-Instruct with Less Data: The final Eurus-2-7B-PRIME model surpasses Qwen2.5-Math-7B-Instruct on five mathematical benchmarks (AIME 2024, AMC, MATH-500, MinervaMath, OlympiadBench, see Appendix B.3 for full list including coding). Notably, as shown in Table 3, this achievement is accomplished with only 10% of the training data used by Qwen2.5-Math-7B-Instruct (230K SFT + 150K RL queries vs. 2.5M SFT + 618K RM + 66K RL queries). This underscores PRIME's exceptional sample efficiency.

    The following figure (Figure 12 from the original paper) shows the overall math performance:


Figure 12: Overall math performance. Eurus-2-7B-PRIME excels at competition-level mathematics benchmarks, outperforming advanced math models and larger models. Notably, PRIME brings a substantial performance gain (+16.7%) over Eurus-2-7B-SFT.

The following are the results from Table 3 of the original paper:

Model Eurus-2-7B-PRIME Qwen2.5-Math-7B-Instruct
Base Model Qwen2.5-Math-7B Qwen2.5-Math-7B
SFT Data 230K (open-source) 2.5M (open-source & in-house)
RM Data 0 618K (in-house)
RM Eurus-2-7B-SFT Qwen2.5-Math-RM (72B)
RL Data 150K queries × 4 samples 66K queries × 32 samples

6.2. Dense Rewards vs. Sparse Rewards

The paper directly compares PRIME (with dense rewards) against RLOO w/OV Only (with sparse outcome rewards).

The following figure (Figure 3 from the original paper) shows the effect of dense reward:


Figure 3: The effect of dense reward. We compare PRIME and RLOO with outcome verifier (OV). PRIME leads to 2.5x sample efficiency (wall clock as the X axis can be found in Figure 17) and a 6.9% performance improvement. PRIME also substantially outperforms RLOO on downstream tasks.

The following figure (Figure 17 from the original paper) shows the effect of dense reward, with wall clock as the X-axis:


Figure 17: The effect of dense reward. We compared PRIME and RLOO with outcome verifier (OV). The figure depicts training reward curves across wall clock, revealing better sample efficiency of PRIME.

  • Performance: Figure 3 shows that PRIME achieves a 6.9% higher final training reward and lower variance compared to RLOO w/OV Only after 240 steps. On downstream tasks (Table 1), PRIME at 240 steps (Avg. 41.0%) significantly outperforms RLOO w/OV Only at 240 steps (Avg. 36.9%), demonstrating consistent improvements.
  • Training Efficiency:
    • Sample Efficiency: Figure 3 indicates that PRIME reaches similar training reward levels in significantly fewer steps, leading to a 2.5x sample efficiency gain compared to RLOO w/OV Only. Figure 17 (plotting against wall clock time) further supports this.

    • Time Cost: While PRIME has a slightly higher per-step time cost due to PRM updates, its superior sample efficiency still translates to overall faster training.

      The following are the results from Table 2 of the original paper:

      Time(s) Rollout Policy update PRM update Others Sum
      PRIME 281.7 156.6 150.9 91.1 680.3
      RLOO 282.4 157.9 0 90.4 530.7

PRIME requires about 28% more time per step (680.3 s vs. 530.7 s), primarily due to the PRM update phase. However, given its 2.5x sample efficiency, PRIME is still about 2x more efficient in total training time. This shows that the benefits of dense rewards outweigh the minor overhead.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Design Choices for the Implicit PRM

  • SFT Model Initializes a Good PRM: The paper investigates Implicit PRM initialization strategies. Surprisingly, directly initializing the PRM from the SFT model (Eurus-2-7B-SFT) performs significantly better than initializing from a specially trained EurusPRM (which was trained on an additional 500K samples from Llama3.1 and Qwen2.5 series). The following figure (Figure 4 from the original paper) shows the comparison of different PRMs:


    Figure 4: Comparison of different PRMs. Online PRM initialized from SFT model achieved the best results. However, using PRMs trained on extra rollouts hurts the performance. This suggests that aligning the PRM and policy model by initializing them from the same base model (the SFT model in this case) helps alleviate distribution shift issues, and that the PRM's effectiveness largely comes from being trained on online rollouts from the policy model itself, rather than from extensive pre-training.

  • Online PRM Update is Essential: The paper demonstrates that online PRM updates are crucial for preventing overoptimization and reward hacking. The following figure (Figure 5 from the original paper) shows the impact of PRM online update:


    Figure 5: Impact of PRM online update. The offline PRM is gradually overoptimized, while online PRMs achieve higher accuracy during training. Figure 5 clearly shows that an offline PRM (trained once and kept static) initially has high accuracy but gradually drops during RL training due to distribution shift as the policy evolves. In contrast, online PRMs (trained on policy rollouts) show increasing accuracy, effectively adapting to the current policy distribution. This confirms that PRIME's online PRM update mechanism is vital for its success.

6.3.2. Scaling PRIME with More Compute

PRIME's scalability is tested by extending training steps and increasing rollout numbers. The following figure (Figure 6 from the original paper) shows RL training with more training steps (Left) and larger rollout numbers (Right):


Figure 6: RL training with more training steps (Left) and larger rollout numbers (Right).

  • Extended Training (800 steps): PRIME consistently shows stable growth and outperforms the RLOO baseline by 3.7% over an extended training period (Figure 6, Left).
  • Larger Rollout Numbers (16 samples/prompt): Increasing the number of sampled responses per prompt from 4 to 16 yields a non-trivial improvement of approximately 4.4% for PRIME over RLOO (Figure 6, Right). These results confirm PRIME's scalability and robust performance with increased computational resources.

6.3.3. PRIME with Other RL Algorithms

PRIME is shown to be a general method, compatible with various RL algorithms. The following figure (Figure 7 from the original paper) shows that PRIME also generally benefits REINFORCE, GRPO, and PPO:


Figure 7: PRIME also generally benefits REINFORCE, GRPO, and PPO.

The following are the results from Table 4 of the original paper:

| Method | Step | AIME 2024 | AMC | MATH-500 | MinervaMath | OlympiadBench | LeetCode | LiveCodeBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RLOO | 240 | 20.0 | 47.0 | 73.2 | 36.4 | 35.4 | 28.3 | 26.7 | 36.9 |
| RLOO w/ PRIME | 240 | 20.0 | 50.6 | 78.2 | 39.3 | 40.3 | 31.1 | 27.5 | 41.0 |
| REINFORCE | 240 | 6.7 | 47.0 | 72.6 | 36.0 | 37.2 | 27.2 | 25.0 | 36.0 |
| REINFORCE w/ PRIME | 240 | 6.7 | 50.0 | 76.4 | 36.8 | 39.1 | 27.8 | 27.5 | 37.8 |
| GRPO | 240 | 10.0 | 44.6 | 73.2 | 37.5 | 36.6 | 25.0 | 25.8 | 36.1 |
| GRPO w/ PRIME | 240 | 16.7 | 47.0 | 75.0 | 34.9 | 38.2 | 28.9 | 23.9 | 37.8 |
| PPO | 240 | 10.0 | 41.0 | 73.6 | 36.0 | 36.3 | 28.3 | 25.7 | 35.8 |
| PRIME as Value Model | 240 | 16.7 | 44.6 | 72.6 | 34.6 | 35.7 | 27.8 | 24.6 | 36.6 |
| PPO w/ PRIME | 240 | 13.3 | 50.6 | 77.4 | 37.1 | 40.6 | 30.0 | 26.7 | 39.4 |

Figure 7 and Table 4 show that PRIME consistently boosts the performance of REINFORCE, GRPO, and PPO in terms of both efficiency (implied by higher rewards at similar steps) and final performance. For example, RLOO w/PRIME achieves an average of 41.0% compared to RLOO's 36.9%. This highlights PRIME as a generic, plug-in component for almost any RL algorithm for LLMs. Notably, the PPO variant of PRIME (39.4%) performs better than PPO alone (35.8%), but RLOO w/PRIME (41.0%) still achieves the best performance.
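To illustrate this plug-in nature, the sketch below shows one way the dense implicit rewards could be folded into an RLOO-style advantage; the grouping convention, the discount of 1, and the lack of extra normalization are assumptions and may differ from the paper's exact implementation.

```python
import torch

def prime_rloo_advantages(outcome_rewards, process_rewards, group_size):
    """Combine sparse outcome rewards with dense implicit process rewards.

    outcome_rewards: (N,) verifier rewards, with rollouts of the same prompt
        stored contiguously in groups of `group_size`.
    process_rewards: (N, T) token-level implicit rewards from the online PRM.
    """
    r_o = outcome_rewards.reshape(-1, group_size)                    # (G, K)
    # Leave-one-out baseline over the other K-1 rollouts of the same prompt.
    baseline = (r_o.sum(dim=1, keepdim=True) - r_o) / (group_size - 1)
    adv_outcome = (r_o - baseline).reshape(-1, 1)                    # (N, 1)
    # Dense part: reward-to-go of the implicit process rewards (gamma = 1).
    reward_to_go = process_rewards.flip(-1).cumsum(dim=-1).flip(-1)  # (N, T)
    return adv_outcome + reward_to_go                                # (N, T)
```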

6.3.4. Value or Reward: How to Use the Implicit PRM?

This section explores whether the Implicit PRM is better utilized as a reward model (providing $r_\phi(y_t)$) or a value model (predicting $V(\mathbf{y}_{<t})$). The following figure (Figure 8 from the original paper) shows the comparison of value models and process reward models:


    Figure 8: Comparison of value models and process reward models. The figure compares four variants:

  1. REINFORCE ($A_t = r_o(\mathbf{y})$): basic sparse outcome reward.

  2. PPO ($A_t = r_o(\mathbf{y}) - V(\mathbf{y}_{<t})$): uses a linear-head value model as the baseline.

  3. Implicit PRM as Value Model ($A_t = r_o(\mathbf{y}) - v_\phi(\mathbf{y}_{<t})$): uses values from the Implicit PRM as the baseline.

  4. REINFORCE w/ PRIME ($A_t = r_o(\mathbf{y}) + \sum_{s=t}^{T} r_\phi(y_s)$): uses process rewards from the Implicit PRM to compute the return.

    The results clearly indicate that using the Implicit PRM to calculate process rewards and incorporating them into the return (variant 4) significantly outperforms all other baselines, including those that use value models (variants 2 and 3). This suggests that PRMs are more effective than value models for RL in LLMs, and that explicitly leveraging the dense reward signal is key.
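The contrast between the two usages can be summarized in a small sketch of variants 3 and 4 above (illustrative shapes, discount of 1, not the authors' code):

```python
import torch

def advantage_prm_as_value(outcome_reward: torch.Tensor, prm_values: torch.Tensor):
    """Variant 3: use the Implicit PRM's prefix scores only as a baseline,
    A_t = r_o(y) - v_phi(y_<t). Shapes: (N,) and (N, T)."""
    return outcome_reward.unsqueeze(-1) - prm_values

def advantage_prm_as_reward(outcome_reward: torch.Tensor, process_rewards: torch.Tensor):
    """Variant 4 (REINFORCE w/ PRIME): add the reward-to-go of the implicit
    process rewards to the outcome reward, A_t = r_o(y) + sum_{s>=t} r_phi(y_s)."""
    reward_to_go = process_rewards.flip(-1).cumsum(dim=-1).flip(-1)
    return outcome_reward.unsqueeze(-1) + reward_to_go
```

Per Figure 8, it is the second form, which injects the dense signal directly into the return, that outperforms the value-model alternatives.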

6.3.5. "Zero" Experiments (RL from Base Model)

The paper also explores "Zero" RL, where training starts directly from a base model (e.g., Qwen2.5-Math-7B-Base) without an SFT phase. The following figure (Figure 13 from the original paper) shows "Zero" RL from Qwen2.5-Math-7B:


Figure 13: "Zero" RL from Qwen2.5-Math-7B. RL from the base model converges way faster than the SFT model, surpassing the instruct version within 32 steps.

The following figure (Figure 14 from the original paper) shows "Zero" RL from Qwen2.5-32B-Base:


Figure 14: "Zero" RL from Qwen2.5-32B-Base. RL from a 32B base model shows more promising gain, surpassing the instruct version within 16 steps.

  • Efficiency: RL from a base model (PRIME-Zero) converges much faster than from an SFT model. For Qwen2.5-Math-7B-Base, it surpasses the instruct version within 32 steps (Figure 13).
  • Larger Models Benefit More: The 32B base model shows even more promising gains, surpassing the instruct version within 16 steps (Figure 14). This aligns with findings in DeepSeek-AI et al. (2025).
  • Saturation Issue: Despite impressive initial gains, PRIME-Zero models quickly saturate at an early stage (around 50 steps), which hinders further improvement. This is potentially due to a decrease in response diversity and is highlighted as future work.

6.3.6. Effect of Reward Model Size

The paper investigates the impact of reward model (RM) capacity by fixing the policy model (Qwen2.5-7B-Base) and varying the RM size. The following are the results from Table 5 of the original paper:

| Reward Model | AIME 24 | AIME 25 | AMC | MATH | Minerva | OlympiadBench | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B | 10.7 | 4.8 | 44.0 | 73.2 | 26.1 | 33.0 | 32.0 |
| Qwen2.5-7B | 13.2 | 6.4 | 42.9 | 73.4 | 26.5 | 33.1 | 32.6 |
| Qwen2.5-14B | 10.8 | 4.8 | 44.1 | 73.2 | 25.4 | 32.7 | 31.8 |

The results (Table 5) show that reward model size has a limited influence, with the 7B reward model achieving the best average performance (32.6%). Larger (14B) or smaller (3B) RMs do not yield clear advantages, suggesting that a PRM of similar size to the policy model is sufficient.

6.3.7. Comparison with VinePPO

PRIME is compared against VinePPO (Kazemnejad et al., 2024) on the MATH dataset using RhoMath 1.1B as the base model. The following figure (Figure 15 from the original paper) shows validation accuracy curves of PRIME and VinePPO:


Figure 15: Validation accuracy curves of PRIME and VinePPO.

The following are the results from Table 6 of the original paper:

| Steps | 16 | 32 | 48 | 64 | 80 | 96 |
| --- | --- | --- | --- | --- | --- | --- |
| VinePPO Val Acc (%) | 15.7 | 16.3 | 17.2 | 17.6 | 17.7 | 18.4 |
| VinePPO Clock Time (Hours) | 2.23 | 4.57 | 7.23 | 9.86 | 11.96 | 13.94 |
| PRIME Val Acc (%) | 16.4 | 16.8 | 17.5 | 18.1 | 18.7 | 18.8 |
| PRIME Clock Time (Hours) | 0.22 | 0.41 | 0.60 | 0.80 | 1.01 | 1.22 |

  • Efficiency: PRIME is significantly faster, completing 96 steps in 1.22 hours versus VinePPO's 13.94 hours, roughly an 11x wall-clock speedup (13.94 / 1.22 ≈ 11.4).
  • Performance: PRIME consistently achieves higher validation accuracy than VinePPO at every step, demonstrating superior performance (Table 6, Figure 15).

6.3.8. Comparison with DeepScaleR

PRIME is also compared against DeepScaleR (Luo et al., 2025), a three-stage RL pipeline, under a similar setting. The following figure (Figure 16 from the original paper) shows training reward curves of PRIME and DeepScaleR:


Figure 16: Training reward curves of PRIME and DeepScaleR.

The following are the results from Table 7 of the original paper:

| Model | Step | GPU Hour | AIME 2024 | MATH-500 | AMC | MinervaMath | OlympiadBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepScaleR-1.5B-Preview | 1750 | 3800 | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| DeepScaleR-1.5B-Stage1 | 1040 | ~600 | 33.9 | - | - | - | - | - |
| DeepSeek-R1-Distill-Qwen-1.5B | - | - | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| PRIME-DeepScaleR-1.5B-Stage1 | 330 | 446.7 | 32.1 | 85.1 | 68.1 | 30.1 | 44.6 | 52.0 |

  • Performance: PRIME achieves comparable training accuracy to DeepScaleR's Stage 1 within 330 steps, which is only 1/3 of DeepScaleR's 1040 steps for Stage 1 (Figure 16). On test sets (Table 7), PRIME-DeepScaleR-1.5B-Stage1 improves the base model (DeepSeek-R1-Distill-Qwen-1.5B) by 3.1 points (52.0% vs. 48.9% average), validating its effectiveness even on highly capable base models.
  • Efficiency: PRIME consumes 446.7 A800 GPU hours for this experiment, compared to roughly 600 A100 GPU hours for DeepScaleR's Stage 1, i.e., about 25% fewer GPU hours, and potentially a larger saving once hardware differences are accounted for. The overhead of PRIME is also noted to be smaller for long reasoning models.

6.4. Reference Model Choice

The paper explores two variants for the reference model ($\pi_{ref}$) in the Implicit PRM's reward calculation:

  1. SFT ref: Retains the initial SFT model as $\pi_{ref}$.

  2. Policy ref: Uses the running policy's old log-probabilities as $\pi_{ref}$. The following figure (Figure 10 from the original paper) shows the comparison of different reference policy implementations:


Figure 10: Different reference models for the PRM. We compare two reference model selection strategies for PRIME: using the running policy as the reference and using the initial SFT model as the reference; their rewards are similar. Figure 10 shows that both strategies yield similar training rewards, so the choice is flexible: policy ref naturally serves as the reference for the Q-value expectation, while SFT ref is necessary if KL divergence against the initial policy is also desired.
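A minimal sketch of the token-level implicit reward under either reference choice (names are illustrative, not the authors' API); the only difference between the two strategies is where `ref_logprobs` comes from:

```python
def implicit_process_rewards(prm_logprobs, ref_logprobs, beta=0.05):
    """Token-level implicit rewards,
    r_phi(y_t) = beta * (log pi_phi(y_t | y_<t, x) - log pi_ref(y_t | y_<t, x)).

    ref_logprobs may be either
      * SFT ref:    log-probs from the frozen initial SFT model, or
      * policy ref: the running policy's old log-probs saved at rollout time.
    """
    return beta * (prm_logprobs - ref_logprobs)
```

With policy ref, the old log-probabilities are typically already cached for the importance ratio, so no extra reference forward pass is needed; with SFT ref, an additional forward pass through the frozen SFT model supplies them, but it also enables the KL term against the initial policy.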

6.5. Single-Forward vs. Double-Forward

The paper investigates if updating the PRM before the policy model in the same rollout stage (double-forward) affects performance compared to using the old PRM (single-forward). The following figure (Figure 11 from the original paper) shows single and double forward:


Figure 11: Single vs. double forward. While the double-forward variant obtains higher PRM accuracy after the online update, the two variants achieve similar rewards during training. Figure 11 shows that although double-forward increases PRM classification accuracy after online updates, the training rewards of the single-forward and double-forward methods remain similar. This implies that the additional computational cost of double-forward may not be justified by the marginal gain in policy performance.
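The scheduling difference can be summarized with a small sketch; `score`, `update_prm`, and `update_policy` are hypothetical placeholders rather than the paper's API:

```python
def rl_step_single_forward(batch, prm, update_prm, update_policy):
    """Single forward: score rollouts with the current (old) PRM, update the
    policy on those rewards, then let the PRM catch up for the next batch."""
    rewards = prm.score(batch)        # hypothetical scoring helper, old PRM
    update_policy(batch, rewards)
    update_prm(batch)

def rl_step_double_forward(batch, prm, update_prm, update_policy):
    """Double forward: update the PRM on this batch first, re-score the same
    rollouts with the refreshed PRM (one extra forward pass), then update the
    policy."""
    update_prm(batch)
    rewards = prm.score(batch)        # hypothetical helper, refreshed PRM
    update_policy(batch, rewards)
```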

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces PRIME (Process Reinforcement through IMplicit rEwards), a novel and scalable framework for enhancing the reasoning capabilities of large language models through online reinforcement learning with dense token-level rewards. PRIME successfully addresses the long-standing challenges of incorporating dense rewards by leveraging implicit process reward modeling, which allows process reward models (PRMs) to be updated online using only outcome labels, thus circumventing expensive step-level annotations and mitigating reward hacking. The framework also eliminates the need for a dedicated reward model training phase, initializing PRMs directly from SFT or base models, and is broadly compatible with various RL algorithms. Experiments on competitive math and coding benchmarks demonstrate that PRIME significantly boosts sample efficiency (2.5x gain) and policy performance (15.1% average improvement over SFT), even enabling a 7B model to surpass larger, specialized instruct models with substantially less training data.

7.2. Limitations & Future Work

The authors explicitly mention one limitation:

  • Resource Constraints: Experiments were conducted only on models up to 32B, and the ablation experiments were run for fewer steps than the main experiments, though comparisons were made fairly at matched step counts.

    Implicitly, from the "Zero" experiments, a potential area for future work is identified:

  • Saturation in "Zero" RL: While starting RL directly from a base model (PRIME-Zero) shows impressive initial gains and efficiency, it quickly saturates at an early stage. This could be attributed to a decrease in response diversity, and addressing this saturation is left as future work.

7.3. Personal Insights & Critique

The PRIME paper offers a highly insightful and practical solution to a critical bottleneck in LLM RL: the scalable integration of dense process rewards. Its core innovation of using implicit process rewards with online PRM updates from outcome labels is a clever way to bridge the gap between fine-grained feedback and annotation feasibility.

Inspirations and Transferability:

  • Efficiency for Complex Tasks: The demonstrated efficiency gains (2.5x sample efficiency, 10% data usage) make PRIME highly appealing for training LLMs on complex multi-step reasoning tasks where reward sparsity is a major issue. This could be particularly impactful in domains like scientific discovery, advanced code generation, or medical diagnosis, where detailed reasoning steps are crucial but hard to manually label.
  • Reduced Development Overhead: Eliminating the dedicated reward model training phase is a significant practical advantage. For research labs and developers, this means faster iteration cycles and lower computational costs, democratizing access to RL methods for LLMs.
  • Generalizability: The plug-in nature of PRIME with various RL algorithms suggests its methods could be broadly adopted across different RLHF or RL pipelines, enhancing existing approaches rather than requiring a complete overhaul.
  • "Zero" RL Potential: The preliminary "Zero" RL experiments, despite their saturation issue, open up an exciting avenue. If the diversity problem can be solved, training directly from base models could drastically simplify the LLM development lifecycle, potentially replacing the costly SFT stage entirely for certain applications.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Dependence on Outcome Verifiers: While PRIME avoids process labels, it still relies on high-quality, unhackable outcome verifiers. For many real-world tasks, defining such a verifiable outcome (e.g., for creative writing or open-ended dialogue) can itself be challenging. The effectiveness of PRIME would be limited by the quality and hackability of the final outcome reward.

  • Interpretability of Implicit Rewards: While mathematically defined, the implicit process rewards are a log-ratio of probabilities. Their direct interpretability for human understanding or debugging purposes might be less intuitive than explicitly designed human feedback-based process rewards.

  • Hyperparameter Sensitivity of $\beta$: The $\beta$ parameter plays a crucial role in scaling the implicit rewards. Its optimal value might be task-dependent, and the paper does not deeply explore its sensitivity or provide a principled way to choose it, beyond setting it to 0.05.

  • Credit Assignment Ambiguity within Tokens: While PRIME offers token-level rewards, a single token might still represent a part of a larger, ambiguous "step." The credit assignment problem might merely be shifted to a finer granularity without being fully resolved, especially if the Implicit PRM itself is not perfectly aligned with human notions of good reasoning steps.

  • Scaling to Larger Models (Beyond 32B): While PRIME shows promise, the explicit limitation on models up to 32B leaves open questions about its behavior and efficiency for models in the 70B+ or even trillion-parameter range, where memory and computational constraints are even more severe.

    Overall, PRIME presents a strong case for the practical viability of dense rewards in LLM RL, offering a robust and efficient framework that addresses key challenges. Its impact could be significant in pushing the boundaries of LLM reasoning capabilities.
