Paper status: completed

Preference-Based Process Reward Model for Robust Mathematical Reasoning

Published: 10/08/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work presents a preference-based process reward model trained on MCTS-derived data to reduce search bias. Enhanced GRPO enables stable RL training, improving intermediate step accuracy by 2-3% in mathematical reasoning tasks.

Abstract

Under review as a conference paper at ICLR 2026

Preference-Based Process Reward Model for Robust Mathematical Reasoning

Anonymous authors. Paper under double-blind review.

Abstract: Process reward models (PRMs) have emerged as a promising approach to guide LLMs by providing step-wise supervision, but traditional methods often rely on heuristic search strategies like Monte Carlo Tree Search (MCTS), which introduce bias and limit generalization. In this work, we propose a reinforcement learning framework guided by a Preference-Based Process Reward Model (PPRM). We first employ MCTS to estimate and select chosen and rejected rollouts, thereby constructing a high-quality step-level dataset. Our PPRM is trained with a Bradley-Terry loss function, which mitigates the bias introduced by the heuristic search strategies of MCTS by leveraging preference-based learning and offers a more robust and theoretically grounded approach to reward modeling. To enable...

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is the Preference-Based Process Reward Model for Robust Mathematical Reasoning.

1.2. Authors

The authors are anonymous, as indicated by "Anonymous authors" and "Paper under double-blind review". This is common for submissions to conferences that employ a double-blind peer-review process, where author identities are concealed from reviewers to ensure impartial evaluation.

1.3. Journal/Conference

The paper is under double-blind review on OpenReview.net as a submission to ICLR 2026. ICLR (International Conference on Learning Representations) is a highly reputable and influential conference in the field of deep learning and artificial intelligence, recognized for publishing cutting-edge research. Its double-blind review process emphasizes the quality and novelty of the work.

1.4. Publication Year

2025 (posted on OpenReview on 2025-10-08, UTC).

1.5. Abstract

This paper introduces a Preference-Based Process Reward Model (PPRM) within a reinforcement learning (RL) framework to enhance the robustness of mathematical reasoning in large language models (LLMs). Traditional process reward models (PRMs) often rely on heuristic search strategies like Monte Carlo Tree Search (MCTS), which can introduce bias and limit generalization. The proposed PPRM addresses this by leveraging preference-based learning. First, MCTS is used to generate chosen and rejected reasoning rollouts to create a high-quality step-level dataset. The PPRM is then trained using a Bradley-Terry loss function, which helps mitigate the bias from heuristic search by learning from pairwise comparisons. To facilitate effective RL training, the authors enhance Group Relative Policy Optimization (GRPO) by incorporating a robust advantage estimator designed to better capture the structure of preference-based process reward models, leading to stable and efficient policy optimization. Experimental results on ProcessBench and with a best-of-n strategy demonstrate that this approach achieves a 2-3% improvement in intermediate step accuracy compared to existing methods for complex reasoning tasks, consequently improving the overall reasoning accuracy of the policy model across several key reasoning benchmarks.

Official Source: https://openreview.net/forum?id=09Nj40ScvC
PDF Link: https://openreview.net/pdf?id=09Nj40ScvC
Publication Status: Currently under double-blind review for ICLR 2026, as indicated by "Paper under double-blind review" and the OpenReview platform.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the unreliability of Large Language Models (LLMs) in complex mathematical reasoning. Despite their impressive capabilities in decomposing problems into logical steps, LLMs frequently suffer from issues like calculation errors, flawed logic, and hallucinated (fabricated) intermediate steps. These shortcomings severely undermine their utility in domains requiring high precision, such as mathematics.

This problem is crucial because mathematical reasoning represents a significant benchmark for AI's intelligence and reliability. If LLMs cannot consistently produce accurate and logically sound reasoning, their application in critical areas like scientific discovery, engineering, or finance remains limited.

Prior research has attempted to address these issues using Reinforcement Learning (RL) and reward models. Reward models are typically categorized into:

  1. Outcome Reward Models (ORMs): These models only evaluate the final answer. Their limitation is that they cannot identify or rectify errors in intermediate steps, so a correct final answer may be derived from incorrect reasoning, which leads to suboptimal performance.
  2. Process Reward Models (PRMs): These models provide step-wise feedback, offering a more granular approach to reinforcement learning. While PRMs have shown promise, outperforming ORMs in best-of-N sampling and RL, they face significant limitations:
    • Annotation Issues: Training high-quality PRMs traditionally requires step-level annotations, which are expensive and time-consuming when done by human experts. Automated annotation methods, such as Monte Carlo (MC) estimation, have been adopted but introduce new challenges.

    • Inadequacy of MCTS in Automated Annotation: Monte Carlo Tree Search (MCTS), a heuristic-driven algorithm widely used in MC-based methods, introduces significant bias. MCTS prioritizes certain reasoning paths, potentially reinforcing suboptimal or unjustified steps and compromising the generalization ability of the trained PRM. It also suffers from noisy and inaccurate verification due to its reliance on the completion model, which may produce correct answers from incorrect steps or vice versa.

      The paper's innovative idea or entry point is to leverage preference learning to debias the Process Reward Model. By framing the reward modeling as a preference task, the authors aim to create a more robust and theoretically grounded approach that can overcome the limitations of MCTS-based rewards and traditional PRM annotation.

2.2. Main Contributions / Findings

The paper makes several key contributions:

  1. Introduction of Preference-Based Process Reward Model (PPRM) with Theoretical Guarantees: The authors introduce PPRM, which incorporates preference learning into process reward modeling for reasoning tasks. They provide a theoretical analysis demonstrating the capability of the Bradley-Terry (BT) model to mitigate bias in MC-value estimation by using pairwise comparisons of reasoning trajectories, thereby reducing the risk of overfitting to heuristic search strategies.

  2. Creation of High-Quality Expert-Annotated Dataset and PPRM Training: A high-quality, expert-annotated dataset focused on step-level correctness in mathematical derivations is constructed. Using this dataset, PPRM is developed, which is shown to outperform existing approaches in identifying and scoring logical errors while reducing reliance on heuristic search strategies like MCTS.

  3. Enhanced GRPO with a Robust Advantage Estimator: A modified advantage estimator is introduced for Group Relative Policy Optimization (GRPO). This estimator aligns with the BT model's pairwise comparison framework, enabling more stable and efficient policy optimization. By incorporating step-wise preference signals from PPRM, the estimator improves reasoning accuracy across a diverse range of mathematical problems, from elementary to olympiad-level tasks.

    The key findings demonstrate that the proposed approach achieves a 2-3% improvement in intermediate step accuracy compared to existing methods for complex reasoning processes. This leads to an overall improvement in the reasoning accuracy of the policy model across several key reasoning benchmarks, including ProcessBench and best-of-n strategy evaluations. PPRM exhibits superior performance in identifying and scoring logical errors and shows robust generalization, especially in challenging datasets like MATH and OlympiadBench. When combined with the enhanced GRPO, it delivers strong performance in complex reasoning scenarios.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of several AI and machine learning concepts is essential:

  • Large Language Models (LLMs): These are advanced neural networks (typically transformer-based) trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like text completion, translation, summarization, and increasingly, complex reasoning. In this paper, LLMs are the agents whose mathematical reasoning capabilities the authors aim to improve.

  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent receives a reward signal after each action, and its goal is to learn a policy (a strategy) that maps states to actions to achieve the highest possible long-term reward.

    • Agent: The LLM in this context, learning to solve mathematical problems.
    • Environment: The problem-solving task, where mathematical questions are presented.
    • State ($\mathcal{S}$): The current problem context, including the question and any reasoning steps taken so far.
    • Action ($\mathcal{A}$): Generating the next reasoning step or calculation.
    • Reward ($r$): A feedback signal indicating the correctness or quality of an action.
    • Policy ($\pi$): The LLM's strategy for generating the next step given the current state.
  • Reward Models (RMs): In RL from Human Feedback (RLHF), reward models are trained to predict human preferences or scores for LLM outputs. Instead of humans providing a scalar reward for every generated output, RMs automate this process after being trained on a smaller set of human-labeled data (often pairwise comparisons).

    • Outcome Reward Model (ORM): A type of reward model that evaluates only the final answer or outcome of a task. It gives a single reward for the entire solution.
    • Process Reward Model (PRM): A type of reward model that provides step-wise supervision. It evaluates the correctness or quality of each intermediate step in a multi-step reasoning process, rather than just the final outcome. This allows for more granular feedback and can guide the LLM to produce better reasoning paths.
  • Monte Carlo Tree Search (MCTS): A heuristic search algorithm widely used in artificial intelligence for decision-making processes, particularly in games like Go. It works by building a search tree through repeated simulations (rollouts) of possible moves (actions).

    • Rollout: A simulation from a given state to the end of a game or task, where actions are chosen according to a simple default policy.
    • Heuristic Search: A search strategy that employs a heuristic function (an educated guess or rule of thumb) to guide the search towards promising solutions, often sacrificing completeness for efficiency.
    • Bias in MCTS: While efficient, MCTS can introduce bias because its exploration-exploitation strategy might favor certain paths, leading to suboptimal or unjustified steps if the heuristic is flawed or the completion model is imperfect.
  • Preference Learning: A machine learning paradigm where models are trained to learn a preference function (or ranking function) by observing pairwise comparisons (e.g., "A is better than B") rather than absolute scores. This is particularly useful when it's easier for humans to compare two options than to assign an absolute score to a single option.

  • Bradley-Terry (BT) Model: A probabilistic model for pairwise comparisons. Given a set of items, it estimates a "strength" or "skill" parameter for each item such that the probability of item A beating item B is a function of their respective strengths. In this paper, it is used to model preferences between reasoning trajectories. If $r(s_1)$ and $r(s_2)$ are the scores (strengths) of two reasoning steps $s_1$ and $s_2$, the probability that $s_1$ is preferred over $s_2$ is often given by a sigmoid of their score difference: $P(s_1 \succ s_2) = \sigma(r(s_1) - r(s_2))$. A small numerical sketch of this relationship follows this list.

  • Policy Optimization: A class of Reinforcement Learning algorithms that directly optimize the policy of an agent to maximize expected rewards.

    • Group Relative Policy Optimization (GRPO): A specific policy optimization algorithm mentioned in the related work, which focuses on group-wise comparisons of reasoning trajectories to prioritize logically consistent solutions. The paper enhances this algorithm.
    • Advantage Estimator: In RL, the advantage function $A(s, a)$ measures how much better an action $a$ is than the average action at a given state $s$. It is defined as $A(s,a) = Q(s,a) - V(s)$, where $Q(s,a)$ is the action-value function (expected return from taking action $a$ in state $s$) and $V(s)$ is the state-value function (expected return from state $s$ following the policy). A robust advantage estimator is crucial for stable and efficient RL training.
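As a concrete illustration of the Bradley-Terry preference probability mentioned above, the following minimal Python sketch (the function name and example scores are ours, not from the paper) maps a score difference through a sigmoid:

```python
import math

def bt_preference_prob(r1: float, r2: float) -> float:
    """Probability that trajectory 1 is preferred over trajectory 2
    under the Bradley-Terry model, given reward-model scores r1 and r2."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

# Example: a step scored 1.2 vs. a step scored 0.4 is preferred ~69% of the time.
print(bt_preference_prob(1.2, 0.4))  # ~0.690
```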

3.2. Previous Works

The paper contextualizes its contributions by discussing three main areas of related work:

3.2.1. Synthetic Data Generation

This area focuses on creating high-quality process supervision data for training LLMs in mathematical reasoning, highlighting a trade-off between annotation quality, scalability, and bias mitigation.

  • Expert-Annotated Data: Lightman et al. (2023) pioneered using human expert annotators to label intermediate reasoning steps. This approach ensures high fidelity and quality of supervision for PRM training but comes at a significant cost and lacks scalability.

  • Scalable MC Sampling: Wang et al. (2024b) proposed Monte Carlo (MC) sampling to approximate step-wise correctness probabilities. This method offers broader coverage and scalability by sampling multiple reasoning trajectories to empirically estimate the correctness likelihood of each step, but it often trades off some precision.

  • Refined MC Approaches: Luo et al. (2024) improved MC methods by incorporating binary tree search to dynamically prune incorrect reasoning paths during sampling, aiming to reduce noise in the generated data.

  • Hybrid Approaches: Zhang et al. (2025) introduced a hybrid method combining LLM-based judger models with MC estimation. The LLM judger is used to filter or reweight sampled trajectories, further refining the quality of process supervision data.

    The paper positions its work within this context by addressing the challenges of generating reliable process supervision data through the introduction of the Bradley-Terry (BT) model and robust advantage estimation, which aims to offer a more theoretically grounded and scalable solution compared to existing methods.

3.2.2. Preference Learning

This body of work explores the use of preference models to address reward bias and align LLMs with human preferences, especially when direct scoring is difficult.

  • Ouyang et al. (2022) and Bai et al. (2022) are cited as seminal works in Reinforcement Learning from Human Feedback (RLHF), where preference models are trained on human comparisons of LLM outputs. This approach allows for more flexible and interpretable reward modeling by comparing alternatives rather than assigning absolute scores.

  • Sun et al. (2025) further explored preference learning in LLM alignment, demonstrating its effectiveness in reducing bias in human feedback systems.

    The current paper builds on this by applying preference learning to process reward models specifically for mathematical reasoning, adapting the benefits of pairwise comparisons to step-wise supervision.

3.2.3. RL Algorithms in Mathematical Reasoning

This section covers various RL algorithms applied to enhance the mathematical reasoning capabilities of LLMs.

  • Proximal Policy Optimization (PPO): Schulman et al. (2017) introduced PPO, a widely used RL algorithm that optimizes policy updates with a clipped objective to ensure gradual and stable learning. In mathematical reasoning, it is used to optimize for both final-answer correctness and intermediate reasoning quality. The core PPO objective is:

    $$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

    where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio, $\hat{A}_t$ is the advantage estimator, and $\epsilon$ is a small clipping hyperparameter.

  • RLOO (REINFORCE Leave-One-Out): Ahmadian et al. (2024) proposed RLOO, which reduces variance in policy-gradient training by using a leave-one-out baseline computed from multiple sampled rollouts, helping to limit error propagation in multi-step derivations.

  • ReMax: Li et al. (2023) introduced ReMax, another method designed to address error propagation in complex reasoning tasks.

  • Direct Preference Optimization (DPO): Rafailov et al. (2023) and its variants (Azar et al., 2024; Ethayarajh et al., 2024; Chen et al., 2024) offer an alternative by directly optimizing policy outputs to align with human preferences without explicit reward modeling, which simplifies the RL pipeline. The DPO loss is commonly written as:

    $$L_{DPO}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]$$

    where $y_w$ is the preferred (chosen) response, $y_l$ is the dispreferred (rejected) response, $\pi_\theta$ is the policy being optimized, $\pi_{ref}$ is a reference policy, and $\beta$ is a hyperparameter.

  • Group Relative Policy Optimization (GRPO): Shao et al. (2024) presented GRPO, which focuses on group-wise comparisons of reasoning trajectories to prioritize logically consistent solutions over superficially correct but flawed answers.

    The paper builds upon GRPO, enhancing it with a robust advantage estimator specifically designed for preference-based rewards, allowing for more stable and efficient policy optimization in the context of mathematical reasoning.

3.3. Technological Evolution

The evolution of LLMs for mathematical reasoning has progressed from basic fine-tuning to sophisticated RL techniques, with a focus on improving both correctness and explainability. Initially, LLMs would generate final answers, often with ORMs providing feedback. The realization that intermediate steps are crucial led to the development of Process Reward Models (PRMs).

Early PRMs relied on expensive human annotations or noisy Monte Carlo (MC) estimates. The core challenge became mitigating the bias introduced by heuristic search strategies like MCTS and the inherent noise in automated annotations. This paper fits into the timeline by introducing preference learning (specifically the Bradley-Terry model) into PRMs to tackle this bias. By learning from pairwise comparisons rather than absolute scores, PPRM aims to provide a more robust and generalized step-wise reward signal.

Concurrently, RL algorithms have evolved from general-purpose PPO to specialized methods like RLOO, Remax, and GRPO tailored for multi-step reasoning. The paper's enhancement of GRPO with a preference-based advantage estimator represents a further refinement in RL training for LLMs, pushing the boundaries of stable and efficient policy optimization in complex reasoning domains.

3.4. Differentiation Analysis

Compared to the main methods in related work, the PPRM approach offers several core differences and innovations:

  • Bias Mitigation in Reward Modeling: Traditional PRMs using MC estimation with MCTS are prone to bias (e.g., reinforcing suboptimal paths due to heuristics) and noise from imperfect completion models. PPRM directly addresses this by employing a Bradley-Terry (BT) model for preference learning. Instead of trying to estimate an absolute correctness probability for each step (hard estimation), PPRM learns from pairwise comparisons of chosen and rejected reasoning trajectories. This preference-based approach is theoretically shown to be more robust to noise and bias from the MC annotation process.

  • Robust Reward Signal for RL Training: The BT model's foundation in pairwise comparisons provides a more stable and generalized reward signal compared to hard-estimated rewards. This is crucial for RL training, as noisy or biased rewards can destabilize policy optimization.

  • Enhanced RL Algorithm for Preference Rewards: Standard RL algorithms or even existing group-wise policy optimization methods like GRPO are not inherently designed for preference-based process rewards. The paper introduces a novel robust advantage estimator for GRPO that specifically captures the structure of preference-based rewards using a sigmoid function to aggregate pairwise comparisons. This adaptation allows for more stable and efficient policy updates when guided by PPRM, mitigating the non-stationarity issues that might arise with conventional advantage estimators.

  • Focus on Step-Level Accuracy and Generalization: While other methods like DPO directly optimize policies based on preferences, PPRM focuses on creating an explicit process reward model that provides step-wise supervision. This granular feedback is critical for mathematical reasoning, where accuracy at each intermediate step is vital. The empirical results demonstrate that PPRM achieves better intermediate step accuracy and stronger generalization across diverse mathematical benchmarks, especially challenging ones.

    In essence, PPRM innovates by integrating preference learning into the process reward modeling pipeline, theoretically grounding its bias mitigation strategy, and then adapting the RL optimization algorithm to effectively leverage this new preference-based reward signal, leading to improved reasoning performance.

4. Methodology

This section details the Preference-Based Process Reward Model (PPRM) workflow and its integration into Reinforcement Learning (RL) training. The methodology focuses on generating high-quality step-level supervision data through preference learning and then using this data to effectively train LLMs for mathematical reasoning.

4.1. Principles

The core idea of the PPRM methodology is to address the limitations of traditional Process Reward Models (PRMs) that rely on heuristic search strategies like Monte Carlo Tree Search (MCTS). MCTS can introduce bias and limit generalization due to its exploration-exploitation strategy and reliance on a completion model that might generate correct answers from incorrect steps.

The principle is to leverage preference learning, specifically the Bradley-Terry (BT) model, to debias the reward model. Instead of assigning a hard, absolute score to each reasoning step (which is prone to noise from MC estimation), the PPRM learns to rank reasoning trajectories based on pairwise comparisons. This approach inherently mitigates bias by focusing on relative quality, providing a more robust and theoretically grounded reward signal.

Furthermore, to effectively use this preference-based reward signal in RL training, the methodology enhances Group Relative Policy Optimization (GRPO) by introducing a robust advantage estimator. This estimator is designed to specifically capture the structure of preference-based rewards, enabling stable and efficient policy optimization by focusing on the relative advantages of actions based on these pairwise preferences.

4.2. Core Methodology In-depth (Layer by Layer)

The PPRM workflow consists of three main phases: preference pair generation, PPRM annotation and training, and RL training with the enhanced GRPO.

4.2.1. Preference Pair Generating with Monte Carlo Method

The process begins by generating a dataset of high-quality problem-solving data pairs for training the PPRM in a preference-based format. Although MCTS is known for its heuristic bias, it is still employed here as a mechanism to generate diverse reasoning trajectories for comparison, rather than directly using its raw value estimates as hard labels.

  1. Completer Policy: A completer policy (an LLM) is used. Given a question $q$ and a partial solution up to step $t$, $x_{1:t}$, it generates multiple possible completions (i.e., further reasoning steps).

  2. Monte Carlo Tree Construction: For each problem, multiple completions are sampled and organized into a Monte Carlo tree.

    • Each node in the tree represents a state in the problem-solving process (a partial reasoning path).
    • Edges represent possible actions or steps generated by the completer.
    • This tree structure allows for the evaluation and comparison of different reasoning trajectories (sequences of steps).
  3. Scoring Mechanism for Rollout Selection: To identify informative chosen and rejected data pairs, a scoring mechanism based on the Q-value of each rollout is defined. This Q-value balances the estimated quality of a solution (via Monte Carlo estimation) with its complexity (length), ensuring selected pairs are both high-quality and concise.

    The scores used to select a chosen rollout $r_{\text{chosen}}$ and a rejected rollout $r_{\text{reject}}$ are computed as:

    $$Q_{\text{chosen}}(s, r)=\alpha^{1-MC(s)} \cdot \beta^{\operatorname{len}(r)}$$

    $$Q_{\text{reject}}(s, r)=\alpha^{MC(s)} \cdot \beta^{\operatorname{len}(r)}$$

    Where:

    • $Q_{\text{chosen}}(s, r)$: The score used to select a rollout $r$ starting from state $s$ as a chosen example.

    • $Q_{\text{reject}}(s, r)$: The score used to select a rollout $r$ starting from state $s$ as a rejected example.

    • $\alpha$: A hyperparameter that adjusts the weight of the Monte Carlo estimation. With $\alpha < 1$, the exponent $1-MC(s)$ (the error probability) is penalized, so chosen rollouts with lower error probabilities (higher correctness) receive higher scores.

    • $MC(s)$: The Monte Carlo estimation score for the state $s$. This is the empirical correctness probability of the solution path starting from $s$, estimated by running multiple rollouts from $s$. A higher $MC(s)$ indicates a more correct path.

    • The term $\alpha^{1-MC(s)}$ ensures that higher-quality rollouts (with a higher $MC(s)$, meaning a lower $1-MC(s)$) are more likely to be chosen.

    • The term $\alpha^{MC(s)}$ prioritizes lower-quality rollouts (with a lower $MC(s)$) for rejection.

    • $\beta$: A hyperparameter accounting for the weight of the length of the rollout $r$. A value of $\beta < 1$ penalizes longer rollouts, favoring concise reasoning paths.

    • $\operatorname{len}(r)$: The length of the rollout (number of steps).

    • The overall effect is to select chosen rollouts that are high quality (high $MC(s)$) and concise (small $\operatorname{len}(r)$), and rejected rollouts that are lower quality (low $MC(s)$) and concise. A minimal code sketch of this scoring rule appears after the figure below.

      The process of generating preference pairs is visually represented in the figure below.

      img-0.jpeg — Figure: a two-part schematic of key steps in training the preference-based process reward model. (a) Data pairs are selected via MCTS and labeled "chosen" and "reject"; (b) the LLM completer searches for and corrects erroneous reasoning steps.

The following figure (Figure 2 from the original paper) illustrates the preference pair generating workflow with the Monte Carlo method. (a) Shows the process of generating multiple completions, constructing an MC tree, and assessing rollouts using MC estimation, LLM judger, and implicit Q function. (b) Shows the application of the scoring formula Q(s,r) to identify optimal chosen-reject pairs, which are then compiled into a structured dataset.
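To make the scoring rule above concrete, here is a minimal Python sketch (the rollout values are illustrative, not from the paper) that scores candidate rollouts and picks a chosen/rejected pair:

```python
# Minimal sketch of the Q_chosen / Q_reject scoring rule (illustrative values).
ALPHA, BETA = 0.5, 0.9  # hyperparameters reported in the paper's experimental setup

def q_chosen(mc: float, length: int) -> float:
    # Higher MC(s) (more correct) and shorter rollouts score higher.
    return ALPHA ** (1.0 - mc) * BETA ** length

def q_reject(mc: float, length: int) -> float:
    # Lower MC(s) (less correct) and shorter rollouts score higher.
    return ALPHA ** mc * BETA ** length

# rollouts: (MC estimate of the starting state, rollout length)
rollouts = [(0.9, 4), (0.8, 7), (0.2, 5), (0.1, 9)]

chosen = max(rollouts, key=lambda r: q_chosen(*r))
rejected = max(rollouts, key=lambda r: q_reject(*r))
print("chosen:", chosen, "rejected:", rejected)  # chosen: (0.9, 4), rejected: (0.2, 5)
```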

4.2.2. The Formulation of Annotation

This section contrasts two annotation paradigms: Hard MC Estimation and Preference MC Estimation, highlighting the advantages of the latter in mitigating bias.

4.2.2.1. Hard MC Estimation

Traditional Process Reward Models (PRMs) are often trained under a next-token-prediction framework, aiming to predict the likelihood of the next token in a sequence. The PRM (a mapping $\mathcal{P} \times \mathcal{S} \rightarrow \mathbb{R}^{+}$ from prompt and step to a real-valued score) assigns a score $\hat{r}$ to each reasoning step $s_i$. It is typically trained with a cross-entropy loss:

$$L_{\mathrm{CE}}=\sum_{i=1}^{K} y_{p, s_{i}} \log \hat{r}_{s_{i}}+\left(1-y_{p, s_{i}}\right) \log \left(1-\hat{r}_{s_{i}}\right)$$

Where:

  • $L_{\mathrm{CE}}$: The cross-entropy loss.

  • $K$: The total number of reasoning steps in a trajectory $s$.

  • $y_{p, s_i}$: The ground-truth label (0 or 1) for the $i$-th step $s_i$ given problem $p$, indicating whether the step is correct.

  • $\hat{r}_{s_i}$: The output score of the PRM for the $i$-th step $s_i$, typically interpreted as the probability of the step being correct.

    Unlike standard data annotations, the hard MC-estimated annotation $y(p, s_i)$ for the $i$-th step is derived from a Monte Carlo estimation and is a function of the ratio of correct rollouts to total rollouts from that step:

    $$y\left(p, s_{i}\right)= \begin{cases}1, & c_{i}>\lambda \\ 0, & \text{else}\end{cases}$$

    Where:

  • $y(p, s_i)$: The binary label (1 for correct, 0 for incorrect) assigned to the $i$-th step of problem $p$.

  • $c_i = c(p, s_i)$: The ratio of correct rollouts to total rollouts from the $i$-th step, derived from Monte Carlo simulations starting at step $s_i$.

  • $\lambda$: A predefined threshold (e.g., 0.5) used to distinguish between positive (correct) and negative (incorrect) labels based on the MC estimation. If the proportion of correct continuations from a step exceeds $\lambda$, the step is labeled correct.

    A key issue is that the estimated ratio $\hat{c}_i = c_i + b(p, s_i, A)$ contains a bias $b(p, s_i, A)$, which is a random variable dependent on the annotator (here, the MCTS process and completion model). For hard MC estimation to be reliable, the estimated label must preserve the same ordering relationship as the true criterion, i.e., $(\hat{c}(p, s_i, A) - \lambda)(c(p, s_i) - \lambda) > 0$.

This mapping to a binary value can be expressed through an increasing function $h_{\text{hard}}(c_i) = \mathbb{I}(c_i > \lambda)$, where $\mathbb{I}(\cdot)$ is the indicator function. The label for $h_{\text{hard}}$ contains noise following a Bernoulli distribution $\eta \sim \operatorname{Bernoulli}(p_{\text{hard}})$, where the noise probability $p_{\text{hard}}$ is given by:

$$p_{\text{hard}}=p\left(\left\{b\left(p, s_{i}, A\right):\left(c_{i}-\lambda\right)^{2}<\left(\lambda-c_{i}\right) \cdot b\left(p, s_{i}, A\right)\right\}\right)=1-\xi_{\text{hard}}\left(\Delta c_{i}\right)$$

Where:

  • $p_{\text{hard}}$: The probability that the hard MC-estimated label is incorrect due to bias.
  • $b(p, s_i, A)$: The bias introduced by the MC estimation process (annotator $A$) for step $s_i$ of problem $p$.
  • $\Delta c_i = c_i - \lambda$: The difference between the true correctness ratio $c_i$ and the threshold $\lambda$.
  • $\xi_{\text{hard}}(\Delta c_i)$: An approximating true-score function of $\Delta c_i$. The condition $(c_i-\lambda)^2 < (\lambda-c_i) \cdot b(p, s_i, A)$ describes when the bias $b$ is large enough to flip the label relative to the threshold $\lambda$: if $c_i > \lambda$ (true positive), a negative bias that pushes $\hat{c}_i$ below $\lambda$ causes an error; conversely, if $c_i < \lambda$ (true negative), a positive bias that pushes $\hat{c}_i$ above $\lambda$ causes an error. This loss often suffers from high variance, leading to suboptimal performance, particularly in generative models where output quality is highly dependent on annotation quality. A short sketch of hard MC labeling appears below.
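As a concrete (hypothetical) illustration of hard MC estimation, the sketch below labels a step by thresholding the fraction of correct completions sampled from it; the rollout data and threshold are ours, not the paper's:

```python
# Sketch of hard MC annotation: label a step 1 if the fraction of correct
# rollouts from that step exceeds a threshold lambda, else 0.
LAMBDA = 0.5

def hard_mc_label(rollout_outcomes: list[bool]) -> int:
    """rollout_outcomes[i] is True if the i-th completion sampled from this
    step reached a correct final answer."""
    c = sum(rollout_outcomes) / len(rollout_outcomes)  # MC estimate c_i
    return 1 if c > LAMBDA else 0

# 16 rollouts from one step, 10 of which reach the correct answer -> c = 0.625 > 0.5
print(hard_mc_label([True] * 10 + [False] * 6))  # 1
```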

4.2.2.2. Preference MC Estimation

An alternative is to use the Bradley-Terry (BT) model, which is well-suited for learning process rewards from pairwise comparisons. In this framework, chosen-reject pairs $(p, s_1)$ and $(p, s_2)$ are selected from the dataset, where both responses share the same prompt $p$ but differ in their reasoning trajectories $s_1$ and $s_2$.

The loss function for the BT model is defined as follows:

$$\mathcal{L}_{\mathrm{BT}}=\mathbb{E}\left[\mathbb{I}_{h=1}\, \sigma\left(\hat{r}_{\mathrm{BT}}\left(p, s_{1}\right)-\hat{r}_{\mathrm{BT}}\left(p, s_{2}\right)\right)+\mathbb{I}_{h=-1}\left(1-\sigma\left(\hat{r}_{\mathrm{BT}}\left(p, s_{1}\right)-\hat{r}_{\mathrm{BT}}\left(p, s_{2}\right)\right)\right)\right]$$

Where:

  • $\mathcal{L}_{\mathrm{BT}}$: The Bradley-Terry loss function.

  • $\mathbb{E}[\cdot]$: The expectation over the sampled preference pairs.

  • $h_i = \mathbb{I}(c(p, s_1^i) > c(p, s_2^i))$: The ground-truth ordering for the $i$-th pair, indicating whether $s_1$ is truly better than $s_2$. Here, $h=1$ means $s_1$ is preferred and $h=-1$ means $s_2$ is preferred; the indicators $\mathbb{I}_{h=1}$ and $\mathbb{I}_{h=-1}$ in the formula imply a binary $h \in \{1, -1\}$.

  • $\hat{r}_{\mathrm{BT}}(p, s)$: The output score of the PPRM for a reasoning trajectory $s$ given problem $p$, representing the estimated quality of the trajectory.

  • $\sigma(\cdot)$: The sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. It maps the difference in scores between two trajectories to a probability between 0 and 1, representing the likelihood that the first trajectory is preferred over the second.

  • The loss encourages the model to assign a higher score $\hat{r}_{\mathrm{BT}}(p, s_1)$ than $\hat{r}_{\mathrm{BT}}(p, s_2)$ when $s_1$ is preferred (i.e., $h=1$), and vice versa.

    Similar to hard estimation, the estimated ratio $\hat{c}_i = c_i + b(p, s_i, A)$ introduces bias, leading to noisy labels $\hat{h}_i = h_i + \eta$, where $\hat{h}_i = \mathbb{I}(\hat{c}(p, s_1^i, A) > \hat{c}(p, s_2^i, A))$. The noise $\eta$ occurs with probability $p_{\text{pref}}$ given by:

    $$p_{\text{pref}}=p\left(\left\{\Delta b\left(p, s_{1}^{i}, s_{2}^{i}, A\right): \Delta c_{i}<-\Delta b\left(p, s_{1}^{i}, s_{2}^{i}, A\right)\right\}\right)=1-\xi_{\operatorname{pref}}\left(\Delta c_{i}\right)$$

    Where:

  • $p_{\text{pref}}$: The probability that the preference-based label is incorrect due to bias.

  • $\Delta b(p, s_1^i, s_2^i, A) = b(p, s_1^i, A) - b(p, s_2^i, A)$: The difference between the biases in the MC estimations of $s_1$ and $s_2$.

  • $\Delta c_i = c(p, s_1^i) - c(p, s_2^i)$: The difference between the true correctness levels of $s_1$ and $s_2$.

  • $\xi_{\text{pref}}(\Delta c_i)$: A strictly increasing function of $\Delta c_i$. The formula captures the likelihood that the bias difference ($\Delta b$) outweighs the true reward difference ($\Delta c$), causing an incorrect preference label. A minimal training-loss sketch follows.
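The following is a minimal PyTorch sketch (our own, not the authors' code) of a Bradley-Terry-style preference loss over chosen/rejected trajectory scores, written in the common negative-log-likelihood form rather than the indicator form of the equation above; in practice the scores would come from the PPRM head of an LLM:

```python
import torch
import torch.nn.functional as F

def bt_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen trajectory beats the rejected one
    under the Bradley-Terry model: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of PPRM scores for 3 preference pairs.
r_chosen = torch.tensor([1.3, 0.7, 2.1])
r_rejected = torch.tensor([0.2, 0.9, 1.0])
print(bt_preference_loss(r_chosen, r_rejected))  # lower when chosen scores exceed rejected
```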

4.2.3. Rethinking Preference Reward Model Trained on MC Annotations

To ensure high-quality training data for the PPRM, it is crucial to filter preference pairs with the largest ratio differences $\Delta c_i$. The paper makes several assumptions to theoretically compare hard and preference annotation:

  • Assumption 1: The data pair $(p, s_1^i), (p, s_2^i)$ selected using the MCTS method satisfies $\hat{c}(p, s_1^i, A) > \lambda$ and $\hat{c}(p, s_2^i, A) \leq \lambda$. That is, one trajectory is considered "chosen" (above threshold) and the other "rejected" (below threshold) by the MC estimator.

  • Assumption 2: The distribution of estimated correctness $\hat{c}_{\text{pref}}^i$ in preference annotation is consistent with the distribution of $\hat{c}_{\text{hard}}^i$ in hard MC-estimated annotation, i.e., $\hat{c}_i = \hat{c}_{\text{pref}}^i = \hat{c}_{\text{hard}}^i$. This allows a fair comparison of the two annotation types, assuming the underlying MC estimation is the same.

  • Assumption 3: For preference-annotated data pairs, the bias introduced by MC estimation with the same annotator $A$ can be offset. Specifically, the distribution of the bias difference $\Delta b = b_1(p, s_1^i, A) - b_2(p, s_2^i, A)$ is concentrated around 0. Moreover, it is assumed that the probability density of $\Delta b$ at $\Delta b < \hat{c}_1^i - \hat{c}_2^i$ is always greater than the density at $\Delta b > \hat{c}_1^i - \hat{c}_2^i$, which implies that a small bias difference (and hence an accurate preference) is more likely.

    Based on these assumptions, the paper states two key propositions:

  • Proposition 1: This proposition provides a probabilistic guarantee for the correctness of the estimated preference $\hat{H}$ (where $\hat{H}=\mathbb{I}(\hat{r}_{\mathrm{BT}}(p, s_1)>\hat{r}_{\mathrm{BT}}(p, s_2))$ is the ordering output by the BT-trained PPRM). If the expected agreement between the noisy estimated preference $\hat{h}$ and the order model $\hat{H}$ is high (up to $1-\epsilon$ error), then with high probability ($1-\delta$) the order model's preference aligns with the true reward difference:

    $$\mathbb{E}_{p, s_{1}, s_{2}}\left[\mathbb{I}\left(\hat{H} \cdot\left[c\left(p, s_{1}^{i}\right)-c\left(p, s_{2}^{i}\right)\right] \geq 0\right) \mid \Delta c_{i}\right] \geq(1-2 \epsilon) \cdot \xi\left(\Delta c_{i}\right)+\epsilon$$

    Where:

    • $\mathbb{I}(\cdot)$: The indicator function, which is 1 if the condition is true and 0 otherwise.
    • $\hat{H}$: The order model's prediction of preference (e.g., $+1$ if $s_1$ is preferred, $-1$ if $s_2$ is preferred).
    • $c(p, s_1^i) - c(p, s_2^i)$: The true difference in correctness levels between the two trajectories.
    • The condition $\hat{H} \cdot [c(p, s_1^i) - c(p, s_2^i)] \geq 0$ means the order model correctly predicts the direction of the true preference.
    • $\Delta c_i = c(p, s_1^i) - c(p, s_2^i)$: The true difference in correctness levels.
    • $\xi(\Delta c_i)$: A function (related to $1-p_{\text{pref}}$) representing the probability that the true preference is correctly observed despite noise. This result indicates that the PPRM (represented by $\hat{H}$) has high confidence in its preference correctness, bounded by a function of $\xi(\Delta c_i)$.
  • Lemma 1: This lemma formally states that, under the given assumptions, a model trained on the noisy preference-annotated dataset $\mathcal{D}_{\text{pref}}$ achieves higher overall accuracy than a model trained on the hard-annotated dataset $\mathcal{D}_{\text{hard}}$:

    $$\mathbb{E}_{\mathcal{D}_{\text{pref}}}\left[\mathbb{I}\left(\hat{H} \cdot\left[c\left(p, s_{1}^{i}\right)-c\left(p, s_{2}^{i}\right)\right] \geq 0\right)\right]>\mathbb{E}_{\mathcal{D}_{\text{hard}}}\left[\mathbb{I}\left(\hat{H} \cdot\left[c\left(p, s^{i}\right)-\lambda\right] \geq 0\right)\right]$$

    Where:

    • The left side is the expected accuracy of the preference-trained model in aligning with true preferences.
    • The right side is the expected accuracy of the hard-trained model in aligning with true labels. The proof (detailed in Appendix A) demonstrates that pairwise comparisons provide more informative learning signals under noisy labels, better capturing the relative quality of solutions across the dataset.

4.2.4. The Framework of RL Training

The PPRM is integrated into a Reinforcement Learning (RL) framework to enhance mathematical reasoning models. The problem of an LLM (math agent) solving a problem $q$ is modeled as a Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$:

  • $\mathcal{S}$: The state space (e.g., current problem, partial solution).

  • $\mathcal{A}$: The action space (e.g., generating the next mathematical step).

  • $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$: The transition dynamics, defining the probability of moving to a new state after taking an action.

  • $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$: The reward function, provided by the PPRM.

  • $\gamma \in [0,1)$: The discount factor for future rewards.

    The agent's behavior is governed by a policy $\pi_{\theta}(a \mid s)$, parameterized by $\theta$, which defines a probability distribution over actions given a state. The goal is to optimize this policy $\pi_{\theta}$ by maximizing the expected discounted cumulative reward. The paper uses an enhanced version of Group Relative Policy Optimization (GRPO):

    $$J_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q \sim P(Q),\,\{o_{i}\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{\left|o_{i}\right|} \sum_{t=1}^{\left|o_{i}\right|} \frac{\pi_{\theta}\left(o_{i, t} \mid q, o_{i,<t}\right)}{\pi_{\theta_{\text{old}}}\left(o_{i, t} \mid q, o_{i,<t}\right)} \hat{A}_{i, t}-\beta\, \mathbb{D}_{KL}\left[\pi_{\theta} \,\|\, \pi_{\text{ref}}\right]\right]$$

    Where:

  • $J_{\mathrm{GRPO}}(\theta)$: The objective function to be maximized for GRPO.

  • $\mathbb{E}[\cdot]$: Expectation over problems $q$ and sampled trajectories $\{o_i\}_{i=1}^{G}$.

  • $P(Q)$: The distribution of questions.

  • $G$: The number of trajectories (rollouts) generated for a given problem.

  • $o_i$: The $i$-th trajectory generated by the old policy $\pi_{\theta_{\text{old}}}$.

  • $|o_i|$: The length (number of steps) of the $i$-th trajectory.

  • $o_{i,t}$: The action taken at step $t$ in the $i$-th trajectory.

  • $o_{i,<t}$: The sequence of actions taken before step $t$ in the $i$-th trajectory.

  • $\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})$: The probability of selecting action $o_{i,t}$ under the new policy $\pi_{\theta}$.

  • $\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$: The probability of selecting action $o_{i,t}$ under the old policy $\pi_{\theta_{\text{old}}}$. The ratio $\frac{\pi_{\theta}}{\pi_{\theta_{\text{old}}}}$ is the importance sampling ratio.

  • $\hat{A}_{i,t}$: The advantage estimator for the action $o_{i,t}$ in the $i$-th trajectory. This is the critical component enhanced in this paper.

  • $\beta$: A hyperparameter controlling the strength of the KL divergence regularization.

  • $\mathbb{D}_{KL}[\pi_{\theta} \| \pi_{\text{ref}}]$: The Kullback-Leibler (KL) divergence between the new policy $\pi_{\theta}$ and a reference policy $\pi_{\text{ref}}$ (often the initial LLM policy or old policy). This term penalizes large changes in the policy, ensuring stability and preventing catastrophic forgetting. A minimal sketch of this objective follows.
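Below is a minimal PyTorch sketch of the per-token GRPO surrogate above (our own simplification, not the authors' implementation; it omits clipping and assumes precomputed log-probabilities, advantages, and a per-token KL estimate):

```python
import torch

def grpo_objective(logp_new: torch.Tensor,    # (G, T) log pi_theta(o_{i,t} | q, o_{i,<t})
                   logp_old: torch.Tensor,    # (G, T) log pi_theta_old(o_{i,t} | q, o_{i,<t})
                   advantages: torch.Tensor,  # (G, T) advantage estimates A_hat_{i,t}
                   kl_to_ref: torch.Tensor,   # (G, T) per-token KL(pi_theta || pi_ref) estimate
                   beta: float = 0.001) -> torch.Tensor:
    """Mean over trajectories of the length-normalized, importance-weighted
    advantage minus the KL penalty (to be maximized)."""
    ratio = torch.exp(logp_new - logp_old)                           # importance sampling ratio
    per_traj = (ratio * advantages - beta * kl_to_ref).mean(dim=1)   # (1/|o_i|) sum over t
    return per_traj.mean()                                           # (1/G) sum over i

# During training one would maximize this, e.g. loss = -grpo_objective(...); loss.backward()
```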

4.2.4.1. Robust Advantage Estimator

A common (but problematic for PPRM) estimation method for the advantage function is the normalized reward:

$$\hat{A}_{i, t}=\tilde{r}_{t}=\frac{r_{i, t}-\operatorname{mean}\left(r_{t}\right)}{\operatorname{std}\left(r_{t}\right)}$$

Where:

  • $\hat{A}_{i,t}$: The advantage estimate for step $t$ in trajectory $i$.

  • $\tilde{r}_t$: The normalized reward at step $t$.

  • $r_{i,t}$: The reward for step $t$ in trajectory $i$.

  • $\operatorname{mean}(r_t)$: The mean reward across all trajectories at step $t$.

  • $\operatorname{std}(r_t)$: The standard deviation of rewards across all trajectories at step $t$.

    The paper notes that this normalized-reward estimate of $\hat{A}_{i,t}$ is problematic because the standard advantage function $A(x,y) = Q(x,y) - V(x)$ (where $Q(x,y)$ is the action-value and $V(x)$ is the state-value) does not align well with normalized rewards, especially when the reward $r_{i,t}$ itself contains a bias $b(q, o_{i,<t})$. This leads to high variance, particularly with a limited group size $G$.

To address this, the paper proposes a robust advantage estimator that leverages the Bradley-Terry model's pairwise comparison structure:

$$\hat{A}_{i, t}=\frac{1}{G-1} \sum_{j \neq i} \sigma\left(r_{i, t}-r_{j, t}\right)-\frac{1}{G(G-1)} \sum_{i} \sum_{j \neq i} \sigma\left(r_{i, t}-r_{j, t}\right)$$

Where:

  • $\hat{A}_{i,t}$: The preference-based advantage estimate for trajectory $i$ at time step $t$.

  • $G$: The number of trajectories (or actions) in the group.

  • $r_{i,t}$: The PPRM score (reward) for trajectory $i$ at step $t$.

  • $r_{j,t}$: The PPRM score (reward) for trajectory $j$ at step $t$.

  • $\sigma(\cdot)$: The sigmoid function. The term $\sigma(r_{i,t} - r_{j,t})$ represents the probability that trajectory $i$ is preferred over trajectory $j$ based on their PPRM scores.

  • The first term, $\frac{1}{G-1} \sum_{j \neq i} \sigma(r_{i,t} - r_{j,t})$, is the average preference for trajectory $i$ over all other trajectories $j$ in the group, indicating how good trajectory $i$ is relative to the others.

  • The second term, $\frac{1}{G(G-1)} \sum_{i} \sum_{j \neq i} \sigma(r_{i,t} - r_{j,t})$, is the average preference across all pairwise comparisons within the group, acting as a group-level baseline.

    This advantage estimator explicitly focuses on relative quality rather than absolute rewards. By aggregating pairwise comparisons and using the sigmoid function for smoothing, it produces more stable advantage estimates by mitigating the impact of outliers and high variance commonly found in traditional methods. This leads to more stable and efficient policy updates during RL training.
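This estimator can be computed directly from a group of PPRM scores; here is a minimal PyTorch sketch (our own illustration, not the paper's code):

```python
import torch

def preference_advantage(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (G,), PPRM scores for one step across G trajectories.
    Returns shape (G,) preference-based advantages."""
    G = rewards.shape[0]
    diff = rewards.unsqueeze(1) - rewards.unsqueeze(0)   # diff[i, j] = r_i - r_j
    pref = torch.sigmoid(diff)                           # pairwise preference probabilities
    off_diag = ~torch.eye(G, dtype=torch.bool)           # exclude self-comparisons
    avg_pref_i = (pref * off_diag).sum(dim=1) / (G - 1)  # mean_j sigma(r_i - r_j)
    baseline = (pref * off_diag).sum() / (G * (G - 1))   # group-wide mean preference
    return avg_pref_i - baseline

print(preference_advantage(torch.tensor([2.0, 1.0, 0.0, -1.0])))
# Higher-scoring trajectories get positive advantages, lower-scoring ones negative.
```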

5. Experimental Setup

This section details the experimental framework used to evaluate the PPRM and its integration with RL training for mathematical reasoning in LLMs.

5.1. Datasets

The experiments primarily utilize several benchmark datasets for mathematical reasoning to ensure a comprehensive evaluation across different difficulty levels and problem types.

  • MATH Dataset (Hendrycks et al.): This is a challenging dataset designed for evaluating advanced mathematical reasoning, encompassing problems from various competition levels (e.g., AMC, AIME, Olympiad). It's used for generating the training data for PPRM and for evaluating the policy model.

    • Data Generation for PPRM: The Qwen2.5-Math-7B-Instruct model serves as the completer model on the MATH dataset. For each state in the reasoning process, 16 rollouts are generated to explore diverse reasoning trajectories, with a search limit of 50 per problem. To refine the training data, problems that are either too simple or too difficult for the completer are filtered out, focusing on informative and challenging examples. The Q-value for Monte Carlo estimation uses hyperparameters $\alpha=0.5$ and $\beta=0.9$.
    • RL Training Data: The RL training data consists of chain-of-thought format questions from the MATH dataset.
  • ProcessBench (Zheng et al., 2024): This framework comprehensively assesses models' ability to predict step-by-step reasoning correctness. It includes:

    • GSM8K (Cobbe et al., 2021): A dataset of elementary school math word problems.
    • MATH (Hendrycks et al.): Advanced competition-level mathematical problems.
    • OlympiadBench (He et al., 2024): Problems styled after mathematical olympiads, representing highly complex reasoning tasks.
    • Omni-MATH: A collection of diverse mathematical reasoning tasks.
  • Additional RL Training Evaluation Datasets:

    • AMC (Li et al., 2024): American Mathematics Competitions problems, typically high school level.

    • AIME: American Invitational Mathematics Examination problems, a step up in difficulty from AMC.

      These datasets were chosen because they represent a spectrum of mathematical reasoning challenges, from basic arithmetic to advanced problem-solving, allowing for a robust measure of model performance across different cognitive demands and ensuring that the improvements are not limited to a narrow domain.

5.2. Evaluation Metrics

The performance of the PPRM and the RL-trained policy models is evaluated using accuracy and F1 score. For every metric, a conceptual definition, mathematical formula, and symbol explanation are provided.

5.2.1. Accuracy

Accuracy measures the proportion of correctly predicted instances out of the total number of instances. In the context of reasoning steps, it indicates the percentage of steps for which the model correctly predicts their correctness (or incorrectness). For final answer evaluation, it's the percentage of problems where the LLM produced the correct numerical answer.

The standard formula for accuracy is:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

Where:

  • Number of Correct Predictions: The count of instances where the model's output matches the ground truth.
  • Total Number of Predictions: The total count of instances being evaluated.

5.2.2. F1 Score

The F1 score is the harmonic mean of precision and recall. It is a useful metric, especially when dealing with imbalanced datasets or when both precision and recall are important. Precision measures the proportion of positive identifications that were actually correct, while recall (also known as sensitivity) measures the proportion of actual positives that were identified correctly.

The standard formulas for Precision, Recall, and F1 Score are:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Where:

  • True Positives (TP): Instances correctly identified as positive.
  • False Positives (FP): Instances incorrectly identified as positive.
  • True Negatives (TN): Instances correctly identified as negative.
  • False Negatives (FN): Instances incorrectly identified as negative.
  • Precision: The proportion of positive predictions that are truly positive.
  • Recall: The proportion of actual positive cases that are correctly identified.
  • F1 Score: A balanced measure of precision and recall (a short computation sketch follows).
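As a concrete illustration of these metrics (toy labels, not from the paper), a minimal Python sketch:

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Binary precision, recall, and F1 from 0/1 ground-truth and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Step-level error detection: 1 = step labeled erroneous, 0 = step labeled correct.
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # (0.667, 0.667, 0.667)
```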

5.3. Baselines

The paper compares its PPRM and RL training approach against several state-of-the-art PRMs and RL algorithms in mathematical reasoning.

5.3.1. PPRM Training Baselines (for ProcessBench and Best-of-N Evaluation)

These models are 7B parameter PRMs trained on automated annotation data, evaluated on their ability to identify step-wise errors.

  • Math-Shepherd-PRM-7B (Wang et al., 2024b): A PRM that uses scalable MC sampling for step-wise correctness probability estimation.
  • Qwen2.5-Math-7B-Math-Shepherd (Zhang et al., 2025): A variant or extension incorporating Math-Shepherd principles, potentially with LLM-based judger models for filtering.
  • MATH-PSA (Wang et al., 2024a): Employs Omega PRM (Luo et al., 2024), which refines MC approaches with binary tree search to prune incorrect paths.
  • Skywork-PRM-7B (Liu et al., 2024): Another competitive PRM.
  • EurusPRM-Stage2 (Cui et al., 2025): Trained using Implicit PRM (Yuan et al., 2024), which aims for free process rewards without process labels.

5.3.2. RL Training Baselines (for Policy Model Performance)

These are RL algorithms or PRMs used to guide RL training of the policy model. The policy model is initialized by Qwen2.5-Math-7B or Qwen2.5-Math-1.5B.

  • ORM: Outcome Reward Model provides rewards only for the final answer.
  • PRMs (as reward models for GRPO):
    • Math-Shepherd-PRM-7B
    • Math-PSA
    • Skywork-PRM-7B
    • EurusPRM-Stage2
  • RL Algorithms:
    • RLOO (Ahmadian et al., 2024): REINFORCE Leave-One-Out, which uses a leave-one-out baseline over multiple sampled rollouts.

    • ReMax (Li et al., 2023): An RL method for aligning LLMs that aims to reduce error propagation.

    • GRPO (Shao et al., 2024): Group Relative Policy Optimization with the standard normalized rewards as the advantage estimator.

    • GRPO-P: This refers to the proposed GRPO with the enhanced preference-based advantage estimator from this paper.

      These baselines are chosen because they represent the current state-of-the-art in process reward modeling and RL training for mathematical reasoning, allowing for a direct comparison of the PPRM's effectiveness and the benefits of its enhanced RL framework.

6. Results & Analysis

This section presents the experimental results, analyzing the performance of PPRM during its training phase (evaluating the reward model itself) and its impact on the RL-trained policy model.

6.1. Core Results Analysis

The experiments are structured into two main parts: PPRM training evaluation (on ProcessBench and Best-of-N) and RL training evaluation (on various mathematical benchmarks).

6.1.1. PPRM Training Evaluation

The PPRM (a 7B-parameter model initialized with Qwen2.5-Math-7B-Instruct) is trained using the Bradley-Terry loss function on the preference annotated dataset. Its performance is assessed on ProcessBench and through a Best-of-N (BoN) strategy.

6.1.1.1. ProcessBench Performance

ProcessBench evaluates the PPRM's ability to predict step-by-step reasoning correctness across four datasets: GSM8K, MATH, OlympiadBench, and Omni-MATH.

The following are the results from Table 1 of the original paper:

| Model | GSM8K acc | GSM8K F1 | MATH acc | MATH F1 | OlympiadBench acc | OlympiadBench F1 | Omni-MATH acc | Omni-MATH F1 |
|---|---|---|---|---|---|---|---|---|
| Math-Shepherd-PRM-7B | 0.786 | 0.582 | 0.721 | 0.594 | 0.693 | 0.372 | 0.662 | 0.554 |
| Qwen2.5-Math-7B-Math-Shepherd | 0.785 | 0.585 | 0.715 | 0.588 | 0.691 | 0.413 | 0.674 | 0.546 |
| Math-PSA | 0.763 | 0.576 | 0.711 | 0.582 | 0.681 | 0.422 | 0.672 | 0.543 |
| Skywork-PRM-7B | 0.795 | 0.533 | 0.722 | 0.583 | 0.697 | 0.486 | 0.684 | 0.576 |
| EurusPRM-Stage2 | 0.784 | 0.521 | 0.708 | 0.502 | 0.701 | 0.417 | 0.664 | 0.556 |
| PPRM-7B | 0.776 | 0.512 | 0.733 | 0.612 | 0.734 | 0.577 | 0.712 | 0.645 |

The results in Table 1 demonstrate that PPRM-7B generally achieves superior overall performance on ProcessBench. While Skywork-PRM-7B shows slightly higher accuracy on GSM8K (0.795 vs 0.776), PPRM-7B significantly outperforms all baselines in MATH, OlympiadBench, and Omni-MATH across both accuracy and F1 scores.

  • MATH: PPRM-7B achieves 0.733 accuracy and 0.612 F1, which are the highest among all models.

  • OlympiadBench: PPRM-7B shows a substantial lead with 0.734 accuracy and 0.577 F1, indicating its strong capability in highly complex, olympiad-style problems.

  • Omni-MATH: Similarly, PPRM-7B leads with 0.712 accuracy and 0.645 F1.

    These findings suggest that PPRM provides a better balance between precision and recall in identifying errors across reasoning steps. Its strong performance on more challenging datasets like OlympiadBench and Omni-MATH highlights the benefits of preference annotation in refining LLM reasoning, particularly in complex scenarios where traditional PRMs might struggle.

6.1.1.2. Best-of-N Strategy Evaluation

The Best-of-N (BoN) strategy evaluates the utility of reward models in straightforwardly improving downstream task performance. This involves sampling N reasoning paths and selecting the one with the highest final-answer confidence as scored by the reward model. The BoN evaluation uses Qwen2.5-Math-7B-Instruct as the generator.
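A minimal sketch of the BoN procedure is shown below; `generate_solutions` and `score_solution` are hypothetical helpers standing in for the generator and the reward model, not APIs from the paper.

```python
# Sketch of Best-of-N selection: sample N reasoning paths, keep the one the
# reward model scores highest. Both helper callables are hypothetical.
from typing import Callable, List

def best_of_n(question: str,
              generate_solutions: Callable[[str, int], List[str]],
              score_solution: Callable[[str, str], float],
              n: int = 64) -> str:
    candidates = generate_solutions(question, n)            # N sampled reasoning paths
    scores = [score_solution(question, c) for c in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]                              # highest-scored path
```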

The following figure (Figure 3 from the original paper) shows the Best-of-N evaluation results on GSM8K and MATH datasets with Qwen2.5-Math-7B-Instruct as the generator.

[Figure 3: line charts with two panels showing Best-of-N accuracy versus the number of responses per question on GSM8K and MATH; the PPRM curve lies above those of the other models on both datasets.]

The line charts in Figure 3 illustrate the consistent performance improvements of PPRM with increasing sample sizes (N) from 4 to 64 on both GSM8K and MATH datasets. PPRM exhibits a clear upward trend in accuracy as N increases, suggesting that its robust preference learning framework effectively leverages larger candidate pools.

  • MATH Dataset: The accuracy gap between PPRM and the baseline reward models is especially pronounced on MATH. This implies that for highly challenging datasets, PPRM delivers more robust, lower-variance reward signals, which translates into better selection capability.
  • Generalization: The consistent improvements across varying N and datasets emphasize PPRM's robust generalization, positioning it as a promising approach for reliable mathematical reasoning.

6.1.2. RL Training Evaluation

The RL training phase evaluates the policy model's performance when guided by different PRMs (including PPRM) and RL algorithms (GRPO, RLOO, ReMax) on mathematical benchmarks. Experiments were conducted with Qwen2.5-Math-7B and Qwen2.5-Math-1.5B as initial policy models. The GRPO implementation uses a policy model learning rate of 1e-6 and a KL coefficient of 0.001. During exploration, 8 outputs are generated per question with a maximum sequence length of 1024 tokens and a batch size of 128.
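For readability, the reported hyperparameters are collected in the configuration sketch below; the field names are illustrative and do not correspond to any specific training library.

```python
# Reported GRPO training hyperparameters, gathered into one illustrative config.
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    policy_lr: float = 1e-6   # policy model learning rate
    kl_coef: float = 0.001    # KL penalty toward the reference policy
    group_size: int = 8       # outputs generated per question during exploration
    max_seq_len: int = 1024   # maximum generated sequence length in tokens
    batch_size: int = 128     # questions per training batch

config = GRPOConfig()
```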

6.1.2.1. Performance of Policy Model Initialized by Qwen2.5-Math-7B

The following are the results from Table 2 of the original paper, showing the performance of the policy model initialized by Qwen2.5-Math-7B trained with various PRMs on GRPO.

| Reward model | GSM8K | AMC | MATH | Olympiad Bench | AIME |
| --- | --- | --- | --- | --- | --- |
| ORM | 93.24 ± 0.25 | 38.84 ± 0.55 | 70.78 ± 0.44 | 49.87 ± 0.83 | 10.31 ± 0.12 |
| Math-Shepherd-PRM-7B | 95.22 ± 0.11 | 44.47 ± 0.42 | 74.03 ± 0.27 | 52.46 ± 0.54 | 16.71 ± 0.26 |
| Math-PSA | 94.02 ± 0.07 | 21.49 ± 0.45 | 73.88 ± 0.29 | 52.55 ± 0.47 | 13.33 ± 0.21 |
| Skywork-PRM-7B | 94.36 ± 0.05 | 45.73 ± 0.47 | 74.47 ± 0.31 | 53.04 ± 0.19 | 15.82 ± 0.14 |
| EurusPRM-Stage2 | 94.52 ± 0.08 | 44.49 ± 0.64 | 73.80 ± 0.21 | 51.15 ± 0.15 | 16.24 ± 0.21 |
| PPRM | 95.83 ± 0.11 | 47.97 ± 0.42 | 70.44 ± 0.25 | 56.01 ± 0.34 | 18.87 ± 0.23 |

Note: Table 2 lists PPRM at 70.44 on MATH, which is lower than several baselines. This appears to be a typo or anomaly, given PPRM's otherwise strong performance and the abstract's claim of a 2-3% improvement. Table 3, which presents a condensed (rounded) version of the same setting, reports 76.3 for PPRM on MATH, consistent with the abstract. The analysis below therefore treats 76.3 as the intended MATH value for PPRM and transcribes Table 3 separately, as it appears to correct this entry.

Table 2 highlights the superior performance of PPRM when used as the reward model for GRPO training.

  • PPRM achieves the highest scores in GSM8K (95.83%), AMC (47.97%), Olympiad Bench (56.01%), and AIME (18.87%). This indicates that the preference-based reward signal from PPRM leads to more effective RL training for the policy model across a range of difficulties.

  • Notably, ORM (Outcome Reward Model) performs markedly worse than the process-supervised PRMs on most benchmarks, with the largest gaps on AMC and AIME (only Math-PSA's anomalous AMC score is lower), underscoring the importance of process-level supervision.

  • The improvements are particularly pronounced in more challenging domains like Olympiad Bench and AIME, where PPRM achieves substantially higher accuracy compared to other PRMs. For instance, on Olympiad Bench, PPRM reaches 56.01%, a notable gain over Skywork-PRM-7B (53.04%) and Math-Shepherd-PRM-7B (52.46%).

    The following are the results from Table 3 of the original paper, which seems to present a more consistent set of results for the policy model initialized by Qwen2.5-Math-7B trained with PRMs on GRPO.

| Reward model | GSM8K | AMC | MATH | Olympiad Bench |
| --- | --- | --- | --- | --- |
| Math-Shepherd-PRM-7B | 95.1 | 45.2 | 74.4 | 52.6 |
| EurusPRM-Stage2 | 94.7 | 44.7 | 73.6 | 51.4 |
| Skywork-PRM-7B | 94.4 | 46.1 | 74.2 | 53.1 |
| Math-PSA | 94.1 | 21.7 | 73.5 | 52.3 |
| PPRM | 95.8 | 47.9 | 76.3 | 55.8 |

Table 3 reaffirms PPRM's leading performance. It consistently achieves the highest scores across all four presented benchmarks: GSM8K (95.8%), AMC (47.9%), MATH (76.3%), and Olympiad Bench (55.8%). The result for MATH (76.3%) in Table 3 specifically addresses the anomaly in Table 2, confirming the substantial improvement claimed in the abstract. PPRM's accuracy on MATH is almost 2 percentage points higher than the next best (Math-Shepherd-PRM-7B at 74.4%) and significantly higher than EurusPRM-Stage2 (73.6%) and Math-PSA (73.5%). The results on Olympiad Bench are also markedly higher (55.8% vs. 53.1% for Skywork-PRM-7B), further emphasizing its strength in complex reasoning.

6.1.2.2. Ablation Study: Impact of Advantage Estimator

The paper also presents an ablation study on the advantage estimator within GRPO, comparing standard GRPO (presumably with normalized rewards), RLOO, ReMax, and GRPO with the proposed preference-based advantage estimator (GRPO-P). This directly validates the contribution of the novel advantage estimator.
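A minimal sketch of the two estimators being compared is given below. The standard GRPO advantage is the usual within-group z-score of scalar rewards. The GRPO-P form shown here is an assumption: step-level PPRM scores are squashed through a sigmoid and averaged before group normalization, reflecting the "sigmoid-based aggregation" noted in Section 6.2; the paper's exact formula may differ.

```python
# Illustrative comparison of advantage estimators over one group of rollouts
# sampled for the same question. The GRPO-P variant is an assumed form only.
from typing import List
import torch

def grpo_advantage(group_rewards: torch.Tensor) -> torch.Tensor:
    """Standard GRPO: z-score of scalar rewards within one rollout group."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

def grpo_p_advantage(step_rewards: List[torch.Tensor]) -> torch.Tensor:
    """Assumed GRPO-P form: sigmoid-squash per-step preference scores,
    average them per rollout, then normalize across the group."""
    per_rollout = torch.stack([torch.sigmoid(r).mean() for r in step_rewards])
    return (per_rollout - per_rollout.mean()) / (per_rollout.std() + 1e-8)

# Example: a group of 4 rollouts with varying numbers of reasoning steps.
group = [torch.randn(5), torch.randn(3), torch.randn(7), torch.randn(4)]
adv_p = grpo_p_advantage(group)              # one advantage per rollout
adv = grpo_advantage(torch.randn(4))         # scalar-reward counterpart
```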

The following are the results from Table 4 of the original paper, showing the performance of the policy model initialized by Qwen2.5-Math-7B trained with PRMs on RLOO and GRPO with various advantage estimators.

| Method | GSM8K | AMC | MATH | Olympiad Bench |
| --- | --- | --- | --- | --- |
| RLOO | 95.4 | 48.3 | 76.8 | 54.5 |
| ReMax | 94.5 | 45.4 | 75.6 | 54.9 |
| GRPO | 95.8 | 47.9 | 76.3 | 55.2 |
| GRPO-P | 96.0 | 49.7 | 78.2 | 56.8 |

Table 4 clearly shows that GRPO-P, which incorporates the improved preference-based advantage estimator within GRPO, achieves the strongest performance across all benchmarks.

  • Overall Superiority: GRPO-P leads with 96.0% on GSM8K, 49.7% on AMC, 78.2% on MATH, and 56.8% on Olympiad Bench.
  • Significant Gains: The improvements are particularly notable on challenging datasets. For example, on MATH, GRPO-P (78.2%) beats standard GRPO (76.3%) by almost 2 percentage points and also edges out RLOO (76.8%). On AMC, GRPO-P (49.7%) surpasses RLOO (48.3%) and standard GRPO (47.9%).
  • Stability and Efficiency: These results highlight that while baseline GRPO is competitive, the proposed robust advantage estimator in GRPO-P enables more stable and efficient policy optimization, leading to superior performance, especially in complex reasoning scenarios where capturing preference-based reward structures is critical.

6.1.3. Performance with Smaller Policy Model (Qwen2.5-Math-1.5B)

The paper also includes results for a smaller policy model, Qwen2.5-Math-1.5B, demonstrating the scalability and effectiveness of the approach across different model sizes. These results are provided in Appendix B of the original paper, but are crucial for a comprehensive analysis.

The following are the results from Table 5 of the original paper, showing the performance of the policy model initialized by Qwen2.5-Math-1.5B trained with PRMs on GRPO.

| Reward model | GSM8K | AMC | MATH | Olympiad Bench |
| --- | --- | --- | --- | --- |
| Math-Shepherd-PRM-7B | 88.4 | 23.6 | 50.2 | 25.1 |
| EurusPRM-Stage2 | 87.7 | 22.2 | 49.6 | 23.8 |
| Skywork-PRM-7B | 88.2 | 23.8 | 50.2 | 25.3 |
| Math-PSA | 88.0 | 21.7 | 50.6 | 24.3 |
| PPRM | 88.6 | 24.7 | 51.0 | 25.7 |

Table 5, showing results for the smaller Qwen2.5-Math-1.5B model, mirrors the trends observed with the 7B model. PPRM still achieves the highest scores across all benchmarks: GSM8K (88.6%), AMC (24.7%), MATH (51.0%), and Olympiad Bench (25.7%). This indicates that the benefits of PPRM's preference-based reward signal are consistent even for LLMs with fewer parameters.

The following are the results from Table 6 of the original paper, showing the performance of the policy model initialized by Qwen2.5-Math-1.5B trained with PRMs on RLOO and GRPO with various advantage estimators.

| Method | GSM8K | AMC | MATH | Olympiad Bench |
| --- | --- | --- | --- | --- |
| RLOO | 87.8 | 25.8 | 49.6 | 24.5 |
| ReMax | 87.5 | 25.2 | 50.4 | 24.9 |
| GRPO | 88.6 | 24.7 | 51.0 | 25.7 |
| GRPO-P | 88.8 | 26.0 | 53.2 | 26.2 |

Table 6 further confirms the effectiveness of the preference-based advantage estimator (GRPO-P) for the smaller Qwen2.5-Math-1.5B model. GRPO-P again outperforms all other RL algorithms and advantage estimators, with top scores on GSM8K (88.8%), AMC (26.0%), MATH (53.2%), and Olympiad Bench (26.2%). The improvement on MATH is particularly significant (53.2% vs. 51.0% for standard GRPO), demonstrating that the enhanced GRPO is crucial for optimizing policies trained with PPRM, especially for complex problems and even with smaller LLM backbones.

6.2. Ablation Studies / Parameter Analysis

The comparison of RL algorithms and advantage estimators (Tables 4 and 6) serves as a key ablation study. It directly demonstrates the impact of the proposed robust advantage estimator within the GRPO framework.

  • By comparing GRPO (with standard normalized rewards) against GRPO-P (with the preference-based advantage estimator), the authors show that the novel estimator consistently yields superior performance across all benchmarks for both 7B and 1.5B models. This confirms that explicitly accounting for the structure of preference-based process reward models during RL training is crucial for robust and efficient policy optimization.
  • The results validate the paper's theoretical claim that standard advantage estimators struggle with the non-stationarity induced by preference-based rewards, and that the proposed sigmoid-based aggregation helps stabilize and improve RL performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces a reinforcement learning framework guided by a Preference-Based Process Reward Model (PPRM) for robust mathematical reasoning in LLMs. The core innovation lies in mitigating the inherent bias of heuristic search strategies (like MCTS) in process reward modeling by leveraging preference-based learning through a Bradley-Terry loss function. The authors theoretically demonstrate that this approach offers a more stable and generalizable reward signal. Furthermore, to enable effective RL training with PPRM, they enhance Group Relative Policy Optimization (GRPO) with a novel robust advantage estimator that is specifically designed to capture the structure of preference-based process rewards. Experimental results on ProcessBench and with a best-of-n strategy, across various mathematical benchmarks and different LLM sizes, consistently show PPRM achieving a 2-3% improvement in intermediate step accuracy and overall reasoning performance compared to existing methods.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation: the computational demands of the MCTS process. While less expensive than extensive human annotation, its computational overhead remains substantial. This could potentially limit the scalability of the approach to more complex or longer-horizon reasoning tasks.

As future work, the authors propose prioritizing the exploration of more efficient MCTS variants or alternative simulation-based methods. This suggests a continued effort to reduce the computational cost of data generation while maintaining the quality of preference pairs.

7.3. Personal Insights & Critique

This paper presents a significant step forward in making LLMs more reliable for complex mathematical reasoning. The application of preference learning to process reward models is a theoretically sound and empirically effective approach to debias automated feedback.

  • Innovation: The key innovation lies in recognizing and addressing the bias in MC-based PRMs not just by refining MC estimation but by fundamentally shifting the reward modeling paradigm to preference learning. This leverages the strengths of Bradley-Terry models to derive robust relative quality signals, which are inherently less susceptible to absolute noise. The subsequent adaptation of GRPO with a specialized advantage estimator is a thoughtful and necessary step to integrate this novel reward signal effectively.
  • Transferability: The methodology of using preference learning to debias step-wise reward models could be highly transferable to other multi-step reasoning tasks beyond mathematics, such as scientific problem-solving, code generation, or even complex planning. Any domain where step-wise correctness is crucial but hard to define absolutely, and where MCTS or similar search strategies are used for data generation, could benefit from this framework.
  • Potential Issues/Areas for Improvement:
    1. Computational Cost of MCTS: While the paper acknowledges this as a limitation, the reliance on MCTS for generating the initial chosen and rejected rollouts (even if filtered) still represents a significant computational bottleneck. Exploring more data-efficient preference learning techniques or synthetic data generation methods that don't rely as heavily on extensive MCTS simulations would be valuable.

    2. Sensitivity to Hyperparameters: The Q-value scoring mechanism for chosen and rejected rollouts (Eq. 1) relies on hyperparameters α\alpha and β\beta. The paper states α=0.5\alpha=0.5 and β=0.9\beta=0.9 were used, but a deeper analysis of their sensitivity and how they affect the quality and diversity of generated preference pairs could strengthen the methodology.

    3. Generalizability of "Bias Offset" Assumption: Assumption 3, which states that bias can be offset for preference annotated data pairs (i.e., Δb\Delta b is concentrated around 0), is critical for Lemma 1. While plausible, a more robust empirical validation of this assumption under various MCTS conditions and completion models would be beneficial.

    4. Interpretability of Preference Signal: While PPRM provides a better reward signal, the reasoning behind why one step is preferred over another (beyond its PPRM score) might still be opaque. Future work could explore methods to make the PPRM's preferences more transparent and interpretable to human users.

    5. Scaling to Even More Complex Problems: Olympiad-level math is highly complex, but real-world scientific discovery or advanced engineering design might involve even longer reasoning chains and more abstract concepts. The scalability of PPRM and its enhanced GRPO to truly longer-horizon, open-ended reasoning tasks would be an interesting future direction.

      Overall, this paper provides a robust and theoretically grounded improvement to process reward modeling and RL training for LLMs in mathematical reasoning, offering valuable insights that could extend to other complex AI tasks.
