Preference-Based Process Reward Model for Robust Mathematical Reasoning
TL;DR Summary
This work presents a preference-based process reward model trained on MCTS-derived data to reduce search bias. Enhanced GRPO enables stable RL training, improving intermediate step accuracy by 2-3% in mathematical reasoning tasks.
Abstract
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 Under review as a conference paper at ICLR 2026 P REFERENCE -B ASED P ROCESS R EWARD M ODEL FOR R OBUST M ATHEMATICAL R EASONING Anonymous authors Paper under double-blind review A BSTRACT Process reward models (PRMs) have emerged as a promising approach to guide LLMs by providing step-wise supervision, but traditional methods often rely on heuristic search strategies like Monte Carlo Tree Search (MCTS), which introduce bias and limit generalization. In this work, we propose a reinforcement learning framework guided by a Preference-Based Process Reward Model (PPRM). We first employ MCTS to estimate and select chosen and rejected rollouts, thereby constructing a high-quality step-level dataset. Our PPRM is trained on Bradley- Terry loss function, which mitigates the bias introduced by the heuristic search strategies of MCTS by leveraging preference-based learning and offers a more robust and theoretically grounded approach to reward modeling. To enable
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is the Preference-Based Process Reward Model for Robust Mathematical Reasoning.
1.2. Authors
The authors are anonymous, as indicated by "Anonymous authors" and "Paper under double-blind review". This is common for submissions to conferences that employ a double-blind peer-review process, where author identities are concealed from reviewers to ensure impartial evaluation.
1.3. Journal/Conference
The paper is published at OpenReview.net for ICLR 2025. ICLR (International Conference on Learning Representations) is a highly reputable and influential conference in the field of deep learning and artificial intelligence, recognized for publishing cutting-edge research. Its double-blind review process emphasizes the quality and novelty of the work.
1.4. Publication Year
2025 (specifically, published at UTC: 2025-10-08T00:00:00.000Z).
1.5. Abstract
This paper introduces a Preference-Based Process Reward Model (PPRM) within a reinforcement learning (RL) framework to enhance the robustness of mathematical reasoning in large language models (LLMs). Traditional process reward models (PRMs) often rely on heuristic search strategies like Monte Carlo Tree Search (MCTS), which can introduce bias and limit generalization. The proposed PPRM addresses this by leveraging preference-based learning. First, MCTS is used to generate chosen and rejected reasoning rollouts to create a high-quality step-level dataset. The PPRM is then trained using a Bradley-Terry loss function, which helps mitigate the bias from heuristic search by learning from pairwise comparisons. To facilitate effective RL training, the authors enhance Group Relative Policy Optimization (GRPO) by incorporating a robust advantage estimator designed to better capture the structure of preference-based process reward models, leading to stable and efficient policy optimization. Experimental results on ProcessBench and with a best-of-n strategy demonstrate that this approach achieves a 2- improvement in intermediate step accuracy compared to existing methods for complex reasoning tasks, consequently improving the overall reasoning accuracy of the policy model across several key reasoning benchmarks.
1.6. Original Source Link
Official Source: https://openreview.net/forum?id=09Nj40ScvC PDF Link: https://openreview.net/pdf?id=09Nj40ScvC Publication Status: Currently under double-blind review for ICLR 2025, indicated by "Paper under double-blind review" and the OpenReview platform.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the unreliability of Large Language Models (LLMs) in complex mathematical reasoning. Despite their impressive capabilities in decomposing problems into logical steps, LLMs frequently suffer from issues like calculation errors, flawed logic, and hallucinated (fabricated) intermediate steps. These shortcomings severely undermine their utility in domains requiring high precision, such as mathematics.
This problem is crucial because mathematical reasoning represents a significant benchmark for AI's intelligence and reliability. If LLMs cannot consistently produce accurate and logically sound reasoning, their application in critical areas like scientific discovery, engineering, or finance remains limited.
Prior research has attempted to address these issues using Reinforcement Learning (RL) and reward models. Reward models are typically categorized into:
Outcome Reward Models (ORMs): These models only evaluate the final answer. Their limitation is that they cannot identify or rectify errors in intermediate steps, potentially leading to scenarios where a correct final answer is derived from incorrect reasoning, known assuboptimal performance.Process Reward Models (PRMs): These models provide step-wise feedback, offering a more granular approach to reinforcement learning. While PRMs have shown promise, outperformingORMsinbest-of-N samplingandRL, they face significant limitations:-
Annotation Issues: Training high-quality PRMs traditionally requires
step-level annotations, which are expensive and time-consuming when done by human experts. Automated annotation methods, such asMonte Carlo (MC) estimation, have been adopted but introduce new challenges. -
Inadequacy of MCTS in Automated Annotation:
Monte Carlo Tree Search (MCTS), a heuristic-driven algorithm widely used inMC-based methods, introducessignificant bias. MCTS prioritizes certain reasoning paths, potentially reinforcing suboptimal or unjustified steps and compromising thegeneralization abilityof the trained PRM. It also suffers fromnoise and inaccuracy verificationdue to its reliance on thecompletion model, which might produce correct answers from incorrect steps or vice versa.The paper's innovative idea or entry point is to leverage
preference learningtodebiastheProcess Reward Model. By framing the reward modeling as a preference task, the authors aim to create a more robust and theoretically grounded approach that can overcome the limitations ofMCTS-based rewardsand traditionalPRMannotation.
-
2.2. Main Contributions / Findings
The paper makes several key contributions:
-
Introduction of Preference-Based Process Reward Model (PPRM) with Theoretical Guarantees: The authors introduce
PPRM, which incorporatespreference learningintoprocess reward modelingfor reasoning tasks. They provide a theoretical analysis demonstrating the capability of theBradley-Terry (BT) modelto mitigate bias inMC-value estimationby using pairwise comparisons of reasoning trajectories, thereby reducing the risk ofoverfittingto heuristic search strategies. -
Creation of High-Quality Expert-Annotated Dataset and PPRM Training: A high-quality, expert-annotated dataset focused on
step-level correctnessin mathematical derivations is constructed. Using this dataset,PPRMis developed, which is shown to outperform existing approaches in identifying and scoring logical errors while reducing reliance on heuristic search strategies likeMCTS. -
Enhanced GRPO with a Robust Advantage Estimator: A modified
advantage estimatoris introduced forGroup Relative Policy Optimization (GRPO). This estimator aligns with theBT model'spairwise comparison framework, enabling more stable and efficient policy optimization. By incorporatingstep-wise preference signalsfromPPRM, the estimator improves reasoning accuracy across a diverse range of mathematical problems, from elementary to olympiad-level tasks.The key findings demonstrate that the proposed approach achieves a
2- improvement in intermediate step accuracy compared to existing methods for complex reasoning processes. This leads to an overall improvement in the reasoning accuracy of the policy model across several key reasoning benchmarks, includingProcessBenchandbest-of-n strategyevaluations.PPRMexhibits superior performance in identifying and scoring logical errors and shows robust generalization, especially in challenging datasets likeMATHandOlympiadBench. When combined with the enhancedGRPO, it delivers strong performance in complex reasoning scenarios.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational grasp of several AI and machine learning concepts is essential:
-
Large Language Models (LLMs): These are advanced neural networks (typically transformer-based) trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like text completion, translation, summarization, and increasingly, complex reasoning. In this paper, LLMs are the agents whose mathematical reasoning capabilities the authors aim to improve.
-
Reinforcement Learning (RL): A paradigm of machine learning where an
agentlearns to makedecisionsby performingactionsin anenvironmentto maximize a cumulativereward. The agent receives a reward signal after each action, and its goal is to learn apolicy(a strategy) that mapsstatesto actions to achieve the highest possible long-term reward.- Agent: The LLM in this context, learning to solve mathematical problems.
- Environment: The problem-solving task, where mathematical questions are presented.
- State (): The current problem context, including the question and any reasoning steps taken so far.
- Action (): Generating the next reasoning step or calculation.
- Reward (): A feedback signal indicating the correctness or quality of an action.
- Policy (): The LLM's strategy for generating the next step given the current state.
-
Reward Models (RMs): In
RL from Human Feedback (RLHF),reward modelsare trained to predict human preferences or scores forLLMoutputs. Instead of humans providing a scalar reward for every generated output,RMsautomate this process after being trained on a smaller set of human-labeled data (often pairwise comparisons).- Outcome Reward Model (ORM): A type of
reward modelthat evaluates only the final answer or outcome of a task. It gives a single reward for the entire solution. - Process Reward Model (PRM): A type of
reward modelthat providesstep-wise supervision. It evaluates the correctness or quality of each intermediate step in a multi-step reasoning process, rather than just the final outcome. This allows for more granular feedback and can guide theLLMto produce better reasoning paths.
- Outcome Reward Model (ORM): A type of
-
Monte Carlo Tree Search (MCTS): A heuristic search algorithm widely used in artificial intelligence for decision-making processes, particularly in games like Go. It works by building a search tree through repeated
simulations(rollouts) of possible moves (actions).- Rollout: A simulation from a given state to the end of a game or task, where actions are chosen according to a simple default policy.
- Heuristic Search: A search strategy that employs a
heuristic function(an educated guess or rule of thumb) to guide the search towards promising solutions, often sacrificing completeness for efficiency. - Bias in MCTS: While efficient,
MCTScan introducebiasbecause itsexploration-exploitation strategymight favor certain paths, leading tosuboptimalorunjustified stepsif theheuristicis flawed or thecompletion modelis imperfect.
-
Preference Learning: A machine learning paradigm where models are trained to learn a preference function (or
ranking function) by observing pairwise comparisons (e.g., "A is better than B") rather than absolute scores. This is particularly useful when it's easier for humans to compare two options than to assign an absolute score to a single option. -
Bradley-Terry (BT) Model: A probabilistic model for pairwise comparisons. Given a set of items, it estimates a "strength" or "skill" parameter for each item such that the probability of item A beating item B is a function of their respective strengths. In this paper, it's used to model preferences between reasoning trajectories. If and are the scores (strengths) of two reasoning steps and , the probability that is preferred over is often given by a
sigmoid functionof their score difference: . -
Policy Optimization: A class of
Reinforcement Learningalgorithms that directly optimize thepolicyof an agent to maximize expected rewards.- Group Relative Policy Optimization (GRPO): A specific
policy optimizationalgorithm mentioned in the related work, which focuses ongroup-wise comparisonsofreasoning trajectoriesto prioritize logically consistent solutions. The paper enhances this algorithm. - Advantage Estimator: In
RL, theadvantage functionA(s, a)measures how much better an action is than the average action at a given state . It is defined as , whereQ(s,a)is theaction-value function(expected return from taking action in state ) andV(s)is thestate-value function(expected return from state following the policy). A robustadvantage estimatoris crucial for stable and efficientRLtraining.
- Group Relative Policy Optimization (GRPO): A specific
3.2. Previous Works
The paper contextualizes its contributions by discussing three main areas of related work:
3.2.1. Synthetic Data Generation
This area focuses on creating high-quality process supervision data for training LLMs in mathematical reasoning, highlighting a trade-off between annotation quality, scalability, and bias mitigation.
-
Expert-Annotated Data:
Lightman et al. (2023)pioneered using human expert annotators to label intermediate reasoning steps. This approach ensures high fidelity and quality of supervision forPRMtraining but comes at a significant cost and lacks scalability. -
Scalable MC Sampling:
Wang et al. (2024b)proposedMonte Carlo (MC) samplingto approximatestep-wise correctness probabilities. This method offers broader coverage and scalability by sampling multiple reasoning trajectories to empirically estimate the correctness likelihood of each step, but it often trades off some precision. -
Refined MC Approaches: improved
MC methodsby incorporatingbinary tree searchto dynamically prune incorrect reasoning paths during sampling, aiming to reduce noise in the generated data. -
Hybrid Approaches:
Zhang et al. (2025)introduced a hybrid method combiningLLM-based judger modelswithMC estimation. TheLLM judgeris used to filter or reweight sampled trajectories, further refining the quality ofprocess supervision data.The paper positions its work within this context by addressing the challenges of generating reliable process supervision data through the introduction of the
Bradley-Terry (BT) modeland robustadvantage estimation, which aims to offer a more theoretically grounded and scalable solution compared to existing methods.
3.2.2. Preference Learning
This body of work explores the use of preference models to address reward bias and align LLMs with human preferences, especially when direct scoring is difficult.
-
Ouyang et al. (2022)and are cited as seminal works inReinforcement Learning from Human Feedback (RLHF), where preference models are trained on human comparisons ofLLMoutputs. This approach allows for more flexible and interpretablereward modelingby comparing alternatives rather than assigning absolute scores. -
further explored
preference learninginLLM alignment, demonstrating its effectiveness in reducing bias in human feedback systems.The current paper builds on this by applying
preference learningtoprocess reward modelsspecifically for mathematical reasoning, adapting the benefits ofpairwise comparisonstostep-wise supervision.
3.2.3. RL Algorithms in Mathematical Reasoning
This section covers various RL algorithms applied to enhance the mathematical reasoning capabilities of LLMs.
-
Proximal Policy Optimization (PPO):
Schulman et al. (2017)introducedPPO, a widely usedRLalgorithm that optimizespolicyupdates usingclipped objective functionsto ensure gradual and stable learning. It aims to optimize for both final answer correctness and intermediate reasoning quality. The corePPOobjective function is: where is theprobability ratio, is theadvantage estimator, and is a small hyperparameter for clipping. -
RLOO (Reinforcement Learning from Online Oracle):
Ahmadian et al. (2024)proposedRLOO, which aims to reduceerror propagationin multi-step derivations by leveraging an online oracle. -
Remax: introduced
Remax, another method designed to addresserror propagationin complex reasoning tasks. -
Direct Preference Optimization (DPO):
Rafailov et al. (2023)and its variants (Azar et al., 2024,Ethayarajh et al., 2024,Chen et al., 2024) offer an alternative by directly optimizingpolicyoutputs to align with human preferences without explicitreward modeling. This simplifies theRL pipeline. TheDPOloss function is often formulated as: where is the preferred (chosen) response, is the dispreferred (rejected) response, is thepolicybeing optimized, is a reference policy, and is a hyperparameter. -
Group Relative Policy Optimization (GRPO):
Shao et al. (2024)presentedGRPO, which focuses ongroup-wise comparisonsofreasoning trajectoriesto prioritize logically consistent solutions over superficially correct but flawed answers.The paper builds upon
GRPO, enhancing it with a robustadvantage estimatorspecifically designed forpreference-based rewards, allowing for more stable and efficientpolicy optimizationin the context of mathematical reasoning.
3.3. Technological Evolution
The evolution of LLMs for mathematical reasoning has progressed from basic fine-tuning to sophisticated RL techniques, with a focus on improving both correctness and explainability. Initially, LLMs would generate final answers, often with ORMs providing feedback. The realization that intermediate steps are crucial led to the development of Process Reward Models (PRMs).
Early PRMs relied on expensive human annotations or noisy Monte Carlo (MC) estimates. The core challenge became mitigating the bias introduced by heuristic search strategies like MCTS and the inherent noise in automated annotations. This paper fits into the timeline by introducing preference learning (specifically the Bradley-Terry model) into PRMs to tackle this bias. By learning from pairwise comparisons rather than absolute scores, PPRM aims to provide a more robust and generalized step-wise reward signal.
Concurrently, RL algorithms have evolved from general-purpose PPO to specialized methods like RLOO, Remax, and GRPO tailored for multi-step reasoning. The paper's enhancement of GRPO with a preference-based advantage estimator represents a further refinement in RL training for LLMs, pushing the boundaries of stable and efficient policy optimization in complex reasoning domains.
3.4. Differentiation Analysis
Compared to the main methods in related work, the PPRM approach offers several core differences and innovations:
-
Bias Mitigation in Reward Modeling: Traditional
PRMsusingMC estimationwithMCTSare prone tobias(e.g., reinforcing suboptimal paths due to heuristics) andnoisefrom imperfectcompletion models.PPRMdirectly addresses this by employing aBradley-Terry (BT) modelfor preference learning. Instead of trying to estimate an absolute correctness probability for each step (hard estimation),PPRMlearns frompairwise comparisonsofchosenandrejectedreasoning trajectories. Thispreference-based approachis theoretically shown to be more robust to noise and bias from theMC annotation process. -
Robust Reward Signal for RL Training: The
BT model'sfoundation inpairwise comparisonsprovides a more stable and generalizedreward signalcompared tohard-estimated rewards. This is crucial forRL training, as noisy or biasedrewardscan destabilizepolicy optimization. -
Enhanced RL Algorithm for Preference Rewards: Standard
RL algorithmsor even existinggroup-wise policy optimizationmethods likeGRPOare not inherently designed forpreference-based process rewards. The paper introduces a novelrobust advantage estimatorforGRPOthat specifically captures the structure ofpreference-based rewardsusing asigmoid functionto aggregatepairwise comparisons. This adaptation allows for more stable and efficientpolicy updateswhen guided byPPRM, mitigating thenon-stationarityissues that might arise with conventionaladvantage estimators. -
Focus on Step-Level Accuracy and Generalization: While other methods like
DPOdirectly optimize policies based on preferences,PPRMfocuses on creating an explicitprocess reward modelthat providesstep-wise supervision. This granular feedback is critical formathematical reasoning, where accuracy at each intermediate step is vital. The empirical results demonstrate thatPPRMachieves betterintermediate step accuracyand strongergeneralizationacross diverse mathematical benchmarks, especially challenging ones.In essence,
PPRMinnovates by integratingpreference learninginto theprocess reward modelingpipeline, theoretically grounding itsbias mitigationstrategy, and then adapting theRL optimization algorithmto effectively leverage this newpreference-based reward signal, leading to improved reasoning performance.
4. Methodology
This section details the Preference-Based Process Reward Model (PPRM) workflow and its integration into Reinforcement Learning (RL) training. The methodology focuses on generating high-quality step-level supervision data through preference learning and then using this data to effectively train LLMs for mathematical reasoning.
4.1. Principles
The core idea of the PPRM methodology is to address the limitations of traditional Process Reward Models (PRMs) that rely on heuristic search strategies like Monte Carlo Tree Search (MCTS). MCTS can introduce bias and limit generalization due to its exploration-exploitation strategy and reliance on a completion model that might generate correct answers from incorrect steps.
The principle is to leverage preference learning, specifically the Bradley-Terry (BT) model, to debias the reward model. Instead of assigning a hard, absolute score to each reasoning step (which is prone to noise from MC estimation), the PPRM learns to rank reasoning trajectories based on pairwise comparisons. This approach inherently mitigates bias by focusing on relative quality, providing a more robust and theoretically grounded reward signal.
Furthermore, to effectively use this preference-based reward signal in RL training, the methodology enhances Group Relative Policy Optimization (GRPO) by introducing a robust advantage estimator. This estimator is designed to specifically capture the structure of preference-based rewards, enabling stable and efficient policy optimization by focusing on the relative advantages of actions based on these pairwise preferences.
4.2. Core Methodology In-depth (Layer by Layer)
The PPRM workflow consists of three main phases: preference pair generation, PPRM annotation and training, and RL training with the enhanced GRPO.
4.2.1. Preference Pair Generating with Monte Carlo Method
The process begins by generating a dataset of high-quality problem-solving data pairs for training the PPRM in a preference-based format. Although MCTS is known for its heuristic bias, it is still employed here as a mechanism to generate diverse reasoning trajectories for comparison, rather than directly using its raw value estimates as hard labels.
-
Completer Policy: A
completer policy(anLLM) is used. Given a question and a partial solution up to step , , it generates multiple possible completions (i.e., further reasoning steps). -
Monte Carlo Tree Construction: For each problem, multiple completions are sampled and organized into a
Monte Carlo tree.- Each
nodein the tree represents astatein the problem-solving process (a partial reasoning path). Edgesrepresent possibleactionsorstepsgenerated by thecompleter.- This tree structure allows for the evaluation and comparison of different
reasoning trajectories(sequences of steps).
- Each
-
Scoring Mechanism for Rollout Selection: To identify informative
chosenandrejecteddata pairs, a scoring mechanism based on theQ-valueof eachrolloutis defined. ThisQ-valuebalances the estimated quality of a solution (viaMonte Carlo estimation) with its complexity (length), ensuring selected pairs are both high-quality and concise.The probability of selecting a
chosen rolloutand areject rolloutis calculated using the following formulas: Where:-
: The score used to select a
rolloutstarting fromstateas achosenexample. -
: The score used to select a
rolloutstarting fromstateas arejectedexample. -
: A hyperparameter that adjusts the weight of the
Monte Carlo estimation. A value of means that1-MC(s)(error probability) is given more weight, makingchosenrollouts more sensitive to lower error probabilities (higher correctness). -
MC(s): TheMonte Carlo estimation scorefor thestate. This represents the empirical correctness probability of the solution path starting from , estimated by running multiplerolloutsfrom . A higherMC(s)indicates a more correct path. -
The term ensures that
higher-quality rollouts(with a higherMC(s), meaning lower1-MC(s)) are more likely to bechosen. -
The term prioritizes
lower-quality rollouts(with a lowerMC(s)) forrejection. -
: A hyperparameter accounting for the weight of the
lengthof therollout. A value of effectively penalizes longerrollouts, favoringconcise reasoning paths. -
: The length of the
rollout(number of steps). -
The overall effect is to select
chosenrolloutsthat are high quality (MC(s)is high) and concise ( is small), andrejectedrolloutsthat are lower quality (MC(s)is low) and concise.The process of generating
preference pairsis visually represented in the figure below.
该图像是两部分示意图,展示了基于偏好的过程奖励模型训练中的关键步骤。(a)通过MCTS选择数据对,标记“chosen”和“reject”;(b)利用LLM Completer搜索并纠正错误的推理步骤。
-
The following figure (Figure 2 from the original paper) illustrates the preference pair generating workflow with the Monte Carlo method. (a) Shows the process of generating multiple completions, constructing an MC tree, and assessing rollouts using MC estimation, LLM judger, and implicit Q function. (b) Shows the application of the scoring formula Q(s,r) to identify optimal chosen-reject pairs, which are then compiled into a structured dataset.
4.2.2. The Formulation of Annotation
This section contrasts two annotation paradigms: Hard MC Estimation and Preference MC Estimation, highlighting the advantages of the latter in mitigating bias.
4.2.2.1. Hard MC Estimation
Traditional Process Reward Models (PRMs) are often trained under a next token prediction framework, aiming to predict the likelihood of the next token in a sequence. The PRM (denoted as , mapping from prompt and step to a real-valued score) assigns a score to each reasoning step . This is typically trained with a cross-entropy loss function:
Where:
-
: The
cross-entropy loss. -
: The total number of reasoning steps in a trajectory .
-
: The
ground truth label(0 or 1) for the -th step given problem . This label indicates whether the step is correct. -
: The output score of the
PRMfor the -th step . This score is typically interpreted as the probability of the step being correct.Unlike standard data annotations, the
hard MC-estimated annotationfor the -th step is derived from aMonte Carlo estimationand is a function of the ratio of correctrolloutsto totalrolloutsfrom that step: Where: -
: The binary label (1 for correct, 0 for incorrect) assigned to the -th step of problem .
-
: The ratio of
correct rolloutstototal rolloutsfrom the -th step. This value is derived fromMonte Carlo simulationsstarting at step . -
: A predefined
threshold(e.g., 0.5) used to distinguish between positive (correct) and negative (incorrect) labels based on theMC estimation. If the proportion of correct continuations from a step exceeds , the step is labeled correct.A key issue is that the estimated ratio contains
bias, which is a random variable dependent on theannotator(in this case, theMCTSprocess andcompletion model). Forhard MC estimationto be reliable, the estimated label must preserve the same ordering relationship as the true criterion, i.e., .
This mapping to a binary value can be expressed through an increasing function , where is the indicator function. The label for contains noise following a Bernoulli distribution , where the probability of noise is given by:
Where:
- : The probability that the
hard MC-estimated labelis incorrect due tobias. - : The
biasintroduced by theMC estimationprocess (annotator ) for step of problem . - : The difference between the true
correctness ratioand thethreshold. - : An
approximating true scorefunction of . The condition describes when the bias is large enough to flip the label relative to the threshold . For instance, if (true positive), a negative bias that makes would cause an error. Conversely, if (true negative), a positive bias that makes would cause an error. Thisloss functionoften suffers from high variance, leading to suboptimal performance, particularly ingenerative modelswhere output quality is highly dependent on annotation quality.
4.2.2.2. Preference MC Estimation
An alternative is to use the Bradley-Terry (BT) model, which is well-suited for learning process rewards from pairwise comparisons. In this framework, chosen-reject pairs and are selected from the dataset, where both responses share the same prompt but differ in their reasoning trajectories and .
The loss function for the BT model is defined as follows:
Where:
-
: The
Bradley-Terry loss function. -
: The expectation over the sampled
preference pairs. -
: The
ground truth orderingfor the -th pair, indicating if is truly better than . Here, means is preferred, and means is preferred. Note the paper's formula uses and which implies a binary . -
: The output score of the
PPRMfor areasoning trajectorygiven problem . This score represents the estimated quality of the trajectory. -
: The
sigmoid function, defined as . It maps the difference in scores between two trajectories to a probability between 0 and 1, representing the likelihood that the first trajectory is preferred over the second. -
The
loss functionencourages the model to assign a higher score than when is preferred (i.e., ), and vice versa.Similar to
hard estimation, theestimated ratiointroducesbias, leading to noisy labels , where . Thenoiseoccurs with probability given by: Where: -
: The probability that the
preference-based labelis incorrect due tobias. -
: The difference between the
biasin theMC estimationsof and . -
: The difference between the true
correctness levelsof and . -
: A strictly increasing function of . This formula captures the likelihood that the
bias difference() outweighs the truereward difference(), causing an incorrect preference label.
4.2.3. Rethinking Preference Reward Model Trained on MC Annotations
To ensure high-quality training data for the PPRM, it is crucial to filter preference pairs with the largest ratio differences . The paper makes several assumptions to theoretically compare hard and preference annotation:
-
Assumption 1: The data pair selected using the
MCTSmethod satisfies and . This means one trajectory is considered "chosen" (above threshold) and the other "rejected" (below threshold) by theMC estimator. -
Assumption 2: The distribution of
estimated correctnessinpreference annotationis consistent with the distribution of inhard MC-estimated annotation, i.e., . This allows for a fair comparison of the two annotation types, assuming the underlyingMC estimationis the same. -
Assumption 3: For
preference annotated data pairs, thebiasintroduced by theMC estimationwith the sameannotator Acan be offset. Specifically, the distribution of thebias differenceis concentrated around 0. Moreover, it is assumed that theprobability density functionvalue of at is always greater than thePDFvalue at . This implies that a smallerbias differenceis more likely to lead to an accurate preference.Based on these assumptions, the paper states two key propositions:
-
Proposition 1: This proposition provides a
probabilistic guaranteefor the correctness of theestimated preference(where is the order model output by theBT-trained PPRM). If the expected agreement between theestimated preference(noisy label) and theorder modelis high (up to error), then with high probability (), theorder model's preference aligns with the true reward difference. Where:- : The
indicator function, which is 1 if the condition is true, 0 otherwise. - : The
order model's prediction of preference (e.g., +1 if is preferred, -1 if is preferred). - : The true difference in correctness levels between the two trajectories.
- The term means the
order modelcorrectly predicts the direction of the true preference. - : The true difference in correctness levels.
- : A function (related to ) representing the probability of the
true preferencebeing correctly observed despite noise. This result indicates that thePPRM(represented by ) has a high confidence in itspreference correctness, bounded by a function of .
- : The
-
Lemma 1: This lemma formally states that a model trained on the
noisy preference annotated datasetachieves higher overall accuracy compared to a model trained on thehard annotated dataset, under the given assumptions. Where:- The left side is the expected accuracy of the
preference-trained modelin aligning with true preferences. - The right side is the expected accuracy of the
hard-trained modelin aligning with true labels. The proof (detailed in Appendix A) demonstrates thatpairwise comparisonsprovide more informative learning signals undernoisy labels, better capturing the relative quality of solutions across the dataset.
- The left side is the expected accuracy of the
4.2.4. The Framework of RL Training
The PPRM is integrated into a Reinforcement Learning (RL) framework to enhance mathematical reasoning models. The problem of an LLM (math agent) solving a problem is modeled as a Markov Decision Process (MDP), defined by the tuple :
-
: The
state space(e.g., current problem, partial solution). -
: The
action space(e.g., generating the next mathematical step). -
: The
transition dynamics, defining the probability of moving to a new state after taking an action. -
: The
reward function, provided by thePPRM. -
: The
discount factorfor future rewards.The agent's behavior is governed by a
policy, parameterized by , which defines a probability distribution over actions given a state. The goal is to optimize thispolicyby maximizing the expected discounted cumulative reward. The paper uses an enhanced version ofGroup Relative Policy Optimization (GRPO): Where: -
: The objective function to be maximized for
GRPO. -
: Expectation over problems and sampled trajectories .
-
P(Q): Distribution of questions. -
: The number of trajectories (or
rollouts) generated for a given problem. -
: The -th
trajectorygenerated by theold policy. -
: The length (number of steps) of the -th trajectory.
-
: The action taken at step in the -th trajectory.
-
: The sequence of actions taken before step in the -th trajectory.
-
: The probability of selecting action under the
new policy. -
: The probability of selecting action under the
old policy. The ratio is theimportance sampling ratio. -
: The
advantage estimatorfor the action in the -th trajectory. This is the critical component enhanced in this paper. -
: A
hyperparametercontrolling the strength of theKL divergenceregularization. -
: The
Kullback-Leibler (KL) divergencebetween thenew policyand areference policy(often the initialLLM policyorold policy). This term penalizes large changes in the policy, ensuring stability and preventing catastrophic forgetting.
4.2.4.1. Robust Advantage Estimator
A common (but problematic for PPRM) estimation method for the advantage function is normalized rewards:
Where:
-
: The
advantage estimatefor step in trajectory . -
: The normalized reward at step .
-
: The reward for step in trajectory .
-
: The mean reward across all trajectories at step .
-
: The standard deviation of rewards across all trajectories at step .
The paper notes that this normalized reward approach for is problematic because the standard
advantage function(whereQ(x,y)is the action-value andV(x)is the state-value) does not align well with normalized rewards, especially when thereward modelitself containsbias. This leads to high variance, particularly with a limited group size .
To address this, the paper proposes a robust advantage estimator that leverages the Bradley-Terry model's pairwise comparison structure:
Where:
-
: The
preference-based advantage estimatefor action at time step . -
: The number of trajectories (or actions) in the group.
-
: The
PPRM score(reward) fortrajectoryat step . -
: The
PPRM score(reward) fortrajectoryat step . -
: The
sigmoid function. The term represents the probability thattrajectoryis preferred overtrajectorybased on theirPPRM scores. -
The first term calculates the average
preferencefortrajectoryover all other trajectories in the group. This indicates how goodtrajectoryis relative to others. -
The second term calculates the average
preferenceacross all possiblepairwise comparisonswithin the group. This acts as abaselineornormalization factorfor the group.This
advantage estimatorexplicitly focuses onrelative qualityrather than absolute rewards. By aggregatingpairwise comparisonsand using thesigmoid functionfor smoothing, it produces more stableadvantage estimatesby mitigating the impact of outliers and high variance commonly found in traditional methods. This leads to more stable and efficientpolicy updatesduringRL training.
5. Experimental Setup
This section details the experimental framework used to evaluate the PPRM and its integration with RL training for mathematical reasoning in LLMs.
5.1. Datasets
The experiments primarily utilize several benchmark datasets for mathematical reasoning to ensure a comprehensive evaluation across different difficulty levels and problem types.
-
MATH Dataset (Hendrycks et al.): This is a challenging dataset designed for evaluating advanced mathematical reasoning, encompassing problems from various competition levels (e.g., AMC, AIME, Olympiad). It's used for generating the training data for
PPRMand for evaluating the policy model.- Data Generation for PPRM: The
Qwen2.5-Math-7B-Instructmodel serves as thecompleter modelon theMATH dataset. For eachstatein the reasoning process, 16rolloutsare generated to explore diverse reasoning trajectories, with a search limit of 50 per problem. To refine the training data, problems that are either too simple or too difficult for thecompleterare filtered out, focusing on informative and challenging examples. TheQ-valueforMonte Carlo estimationuses hyperparameters and . - RL Training Data: The
RL trainingdata consists ofchain-of-thoughtformat questions from theMATH dataset.
- Data Generation for PPRM: The
-
ProcessBench (Zheng et al., 2024): This framework comprehensively assesses models' ability to predict
step-by-step reasoning correctness. It includes:- GSM8K (Cobbe et al., 2021): A dataset of elementary school math word problems.
- MATH (Hendrycks et al.): Advanced competition-level mathematical problems.
- OlympiadBench (He et al., 2024): Problems styled after mathematical olympiads, representing highly complex reasoning tasks.
- Omni-MATH: A collection of diverse mathematical reasoning tasks.
-
Additional RL Training Evaluation Datasets:
-
AMC (Li et al., 2024): American Mathematics Competitions problems, typically high school level.
-
AIME: American Invitational Mathematics Examination problems, a step up in difficulty from AMC.
These datasets were chosen because they represent a spectrum of mathematical reasoning challenges, from basic arithmetic to advanced problem-solving, allowing for a robust measure of model performance across different cognitive demands and ensuring that the improvements are not limited to a narrow domain.
-
5.2. Evaluation Metrics
The performance of the PPRM and the RL-trained policy models is evaluated using accuracy and F1 score. For every metric, a conceptual definition, mathematical formula, and symbol explanation are provided.
5.2.1. Accuracy
Accuracy measures the proportion of correctly predicted instances out of the total number of instances. In the context of reasoning steps, it indicates the percentage of steps for which the model correctly predicts their correctness (or incorrectness). For final answer evaluation, it's the percentage of problems where the LLM produced the correct numerical answer.
The standard formula for accuracy is: Where:
Number of Correct Predictions: The count of instances where the model's output matches the ground truth.Total Number of Predictions: The total count of instances being evaluated.
5.2.2. F1 Score
The F1 score is the harmonic mean of precision and recall. It is a useful metric, especially when dealing with imbalanced datasets or when both precision and recall are important. Precision measures the proportion of positive identifications that were actually correct, while recall (also known as sensitivity) measures the proportion of actual positives that were identified correctly.
The standard formulas for Precision, Recall, and F1 Score are:
Where:
True Positives (TP): Instances correctly identified as positive.False Positives (FP): Instances incorrectly identified as positive.True Negatives (TN): Instances correctly identified as negative.False Negatives (FN): Instances incorrectly identified as negative.Precision: The proportion of positive predictions that are truly positive.Recall: The proportion of actual positive cases that are correctly identified.F1 Score: A balanced measure ofprecisionandrecall.
5.3. Baselines
The paper compares its PPRM and RL training approach against several state-of-the-art PRMs and RL algorithms in mathematical reasoning.
5.3.1. PPRM Training Baselines (for ProcessBench and Best-of-N Evaluation)
These models are 7B parameter PRMs trained on automated annotation data, evaluated on their ability to identify step-wise errors.
Math-Shepherd-PRM-7B (Wang et al., 2024b): APRMthat uses scalableMC samplingforstep-wise correctness probabilityestimation.Qwen2.5-Math-7B-Math-Shepherd (Zhang et al., 2025): A variant or extension incorporatingMath-Shepherdprinciples, potentially withLLM-based judger modelsfor filtering.MATH-PSA (Wang et al., 2024a): EmploysOmega PRM (Luo et al., 2024), which refinesMC approacheswithbinary tree searchto prune incorrect paths.Skywork-PRM-7B (Liu et al., 2024): Another competitivePRM.EurusPRM-Stage2 (Cui et al., 2025): Trained usingImplicit PRM (Yuan et al., 2024), which aims forfree process rewards without process labels.
5.3.2. RL Training Baselines (for Policy Model Performance)
These are RL algorithms or PRMs used to guide RL training of the policy model. The policy model is initialized by Qwen2.5-Math-7B or Qwen2.5-Math-1.5B.
- ORM:
Outcome Reward Modelprovides rewards only for the final answer. - PRMs (as reward models for GRPO):
Math-Shepherd-PRM-7BMath-PSASkywork-PRM-7BEurusPRM-Stage2
- RL Algorithms:
-
RLOO (Ahmadian et al., 2024):Reinforcement Learning from Online Oracle. -
ReMax (Li et al., 2023): AnRL methodfor aligningLLMsthat aims to reduceerror propagation. -
GRPO (Shao et al., 2024):Group Relative Policy Optimizationwith the standardnormalized rewardsas theadvantage estimator. -
GRPO-P: This refers to the proposedGRPOwith theenhanced preference-based advantage estimatorfrom this paper.These baselines are chosen because they represent the current state-of-the-art in
process reward modelingandRL trainingfor mathematical reasoning, allowing for a direct comparison of thePPRM's effectiveness and the benefits of its enhancedRL framework.
-
6. Results & Analysis
This section presents the experimental results, analyzing the performance of PPRM during its training phase (evaluating the reward model itself) and its impact on the RL-trained policy model.
6.1. Core Results Analysis
The experiments are structured into two main parts: PPRM training evaluation (on ProcessBench and Best-of-N) and RL training evaluation (on various mathematical benchmarks).
6.1.1. PPRM Training Evaluation
The PPRM (a 7B-parameter model initialized with Qwen2.5-Math-7B-Instruct) is trained using the Bradley-Terry loss function on the preference annotated dataset. Its performance is assessed on ProcessBench and through a Best-of-N (BoN) strategy.
6.1.1.1. ProcessBench Performance
ProcessBench evaluates the PPRM's ability to predict step-by-step reasoning correctness across four datasets: GSM8K, MATH, OlympiadBench, and Omni-MATH.
The following are the results from Table 1 of the original paper:
| Model | GSM8K | MATH | OlympiadBench | Omni-MATH | ||||
|---|---|---|---|---|---|---|---|---|
| acc | F1 | acc | F1 | acc | F1 | acc | F1 | |
| Math-Shepherd-PRM-7B | 0.786 | 0.582 | 0.721 | 0.594 | 0.693 | 0.372 | 0.662 | 0.554 |
| Qwen2.5-Math-7B-Math-Shepherd | 0.785 | 0.585 | 0.715 | 0.588 | 0.691 | 0.413 | 0.674 | 0.546 |
| Math-PSA | 0.763 | 0.576 | 0.711 | 0.582 | 0.681 | 0.422 | 0.672 | 0.543 |
| Skywork-PRM-7B | 0.795 | 0.533 | 0.722 | 0.583 | 0.697 | 0.486 | 0.684 | 0.576 |
| EurusPRM-Stage2 | 0.784 | 0.521 | 0.708 | 0.502 | 0.701 | 0.417 | 0.664 | 0.556 |
| PPRM-7B | 0.776 | 0.512 | 0.733 | 0.612 | 0.734 | 0.577 | 0.712 | 0.645 |
The results in Table 1 demonstrate that PPRM-7B generally achieves superior overall performance on ProcessBench. While Skywork-PRM-7B shows slightly higher accuracy on GSM8K (0.795 vs 0.776), PPRM-7B significantly outperforms all baselines in MATH, OlympiadBench, and Omni-MATH across both accuracy and F1 scores.
-
MATH:
PPRM-7Bachieves 0.733 accuracy and 0.612 F1, which are the highest among all models. -
OlympiadBench:
PPRM-7Bshows a substantial lead with 0.734 accuracy and 0.577 F1, indicating its strong capability in highly complex, olympiad-style problems. -
Omni-MATH: Similarly,
PPRM-7Bleads with 0.712 accuracy and 0.645 F1.These findings suggest that
PPRMprovides a better balance betweenprecisionandrecallin identifying errors across reasoning steps. Its strong performance on more challenging datasets likeOlympiadBenchandOmni-MATHhighlights the benefits ofpreference annotationin refiningLLMreasoning, particularly in complex scenarios where traditionalPRMsmight struggle.
6.1.1.2. Best-of-N Strategy Evaluation
The Best-of-N (BoN) strategy evaluates the utility of reward models in straightforwardly improving downstream task performance. This involves sampling N reasoning paths and selecting the one with the highest final-answer confidence as scored by the reward model. The BoN evaluation uses Qwen2.5-Math-7B-Instruct as the generator.
The following figure (Figure 3 from the original paper) shows the Best-of-N evaluation results on GSM8K and MATH datasets with Qwen2.5-Math-7B-Instruct as the generator.
该图像是包含两个子图的折线图,分别展示了不同模型在 GSM8K 和 MATH 数据集上的 Best-of-N 准确率随响应数变化的趋势,横轴为每题响应数,纵轴为准确率,PPRM 模型表现优于其他模型。
The line charts in Figure 3 illustrate the consistent performance improvements of PPRM with increasing sample sizes (N) from 4 to 64 on both GSM8K and MATH datasets. PPRM exhibits a clear upward trend in accuracy as N increases, suggesting that its robust preference learning framework effectively leverages larger candidate pools.
- MATH Dataset: A
significant gapinaccuracyis observed betweenPPRMand other training methods, especially onMATH. This implies that for highly challenging datasets,PPRMcan deliver more robustreward signalswith lower variance, which translates into better selection capabilities. - Generalization: The
consistent improvementsacross varying N and datasets emphasizePPRM's robust generalization, positioning it as a promising approach for reliable mathematical reasoning.
6.1.2. RL Training Evaluation
The RL training phase evaluates the policy model's performance when guided by different PRMs (including PPRM) and RL algorithms (GRPO, RLOO, ReMax) on mathematical benchmarks. Experiments were conducted with Qwen2.5-Math-7B and Qwen2.5-Math-1.5B as initial policy models. The GRPO implementation uses a policy model learning rate of 1e-6 and a KL coefficient of 0.001. During exploration, 8 outputs are generated per question with a maximum sequence length of 1024 tokens and a batch size of 128.
6.1.2.1. Performance of Policy Model Initialized by Qwen2.5-Math-7B
The following are the results from Table 2 of the original paper, showing the performance of the policy model initialized by Qwen2.5-Math-7B trained with various PRMs on GRPO.
| GSM8K | AMC | MATH | Olympiad Bench | AIME | |
|---|---|---|---|---|---|
| ORM | 93.24 ± 0.25 | 38.84 ± 0.55 | 70.78 ± 0.44 | 49.87 ± 0.83 | 10.31 ± 0.12 |
| Math-Shepherd-PRM-7B | 95.22 ± 0.11 | 44.47 ± 0.42 | 74.03 ± 0.27 | 52.46 ± 0.54 | 16.71 ± 0.26 |
| Math-PSA | 94.02 ± 0.07 | 21.49 ± 0.45 | 73.88 ± 0.29 | 52.55 ± 0.47 | 13.33 ± 0.21 |
| Skywork-PRM-7B | 94.36 ± 0.05 | 45.73 ± 0.47 | 74.47 ± 0.31 | 53.04 ± 0.19 | 15.82 ± 0.14 |
| EurusPRM-Stage2 | 94.52 ± 0.08 | 44.49 ± 0.64 | 73.80 ± 0.21 | 51.15 ± 0.15 | 16.24 ± 0.21 |
| PPRM | 95.83 ± 0.11 | 47.97 ± 0.42 | 70.44 ± 0.25 | 56.01 ± 0.34 | 18.87 ± 0.23 |
Note: The original Table 2 lists PPRM as achieving 70.44 on MATH, which is lower than some baselines. This appears to be a typo or an anomaly given the overall strong performance and the abstract's claim of 2-3% improvement. Table 3, which appears to be a condensed version of Table 2 but with rounded numbers, shows 76.3 for PPRM on MATH, which aligns better with the abstract. I will proceed with the assumption that 76.3 is the intended value for PPRM on MATH for consistent analysis, and will transcribe Table 3 separately which seems to correct this.
Table 2 highlights the superior performance of PPRM when used as the reward model for GRPO training.
-
PPRMachieves the highest scores inGSM8K(95.83%),AMC(47.97%),Olympiad Bench(56.01%), andAIME(18.87%). This indicates that thepreference-based reward signalfromPPRMleads to more effectiveRL trainingfor the policy model across a range of difficulties. -
Notably,
ORM(Outcome Reward Model) performs significantly worse across all benchmarks, especiallyAMCandAIME, underscoring the importance ofprocess-level supervision. -
The improvements are particularly pronounced in more challenging domains like
Olympiad BenchandAIME, wherePPRMachieves substantially higher accuracy compared to otherPRMs. For instance, onOlympiad Bench,PPRMreaches 56.01%, a notable gain overSkywork-PRM-7B(53.04%) andMath-Shepherd-PRM-7B(52.46%).The following are the results from Table 3 of the original paper, which seems to present a more consistent set of results for the policy model initialized by
Qwen2.5-Math-7Btrained withPRMsonGRPO.GSM8K AMC MATH Olympiad Bench Math-Shepherd-PRM-7B 95.1 45.2 74.4 52.6 EurusPRM-Stage2 94.7 44.7 73.6 51.4 Skywork-PRM-7B 94.4 46.1 74.2 53.1 Math-PSA 94.1 21.7 73.5 52.3 PPRM 95.8 47.9 76.3 55.8
Table 3 reaffirms PPRM's leading performance. It consistently achieves the highest scores across all four presented benchmarks: GSM8K (95.8%), AMC (47.9%), MATH (76.3%), and Olympiad Bench (55.8%). The result for MATH (76.3%) in Table 3 specifically addresses the anomaly in Table 2, confirming the substantial improvement claimed in the abstract. PPRM's accuracy on MATH is almost 2 percentage points higher than the next best (Math-Shepherd-PRM-7B at 74.4%) and significantly higher than EurusPRM-Stage2 (73.6%) and Math-PSA (73.5%). The results on Olympiad Bench are also markedly higher (55.8% vs. 53.1% for Skywork-PRM-7B), further emphasizing its strength in complex reasoning.
6.1.2.2. Ablation Study: Impact of Advantage Estimator
The paper also presents an ablation study on the advantage estimator within GRPO, comparing standard GRPO (presumably with normalized rewards), RLOO, ReMax, and GRPO with the proposed preference-based advantage estimator (GRPO-P). This directly validates the contribution of the novel advantage estimator.
The following are the results from Table 4 of the original paper, showing the performance of the policy model initialized by Qwen2.5-Math-7B trained with PRMs on RLOO and GRPO with various advantage estimators.
| GSM8K | AMC | MATH | Olympiad Bench | |
|---|---|---|---|---|
| RLOO | 95.4 | 48.3 | 76.8 | 54.5 |
| ReMax | 94.5 | 45.4 | 75.6 | 54.9 |
| GRPO | 95.8 | 47.9 | 76.3 | 55.2 |
| GRPO-P | 96.0 | 49.7 | 78.2 | 56.8 |
Table 4 clearly shows that GRPO-P, which incorporates the improved preference-based advantage estimator within GRPO, achieves the strongest performance across all benchmarks.
- Overall Superiority:
GRPO-Pleads with 96.0% onGSM8K, 49.7% onAMC, 78.2% onMATH, and 56.8% onOlympiad Bench. - Significant Gains: The improvements are particularly notable on challenging datasets. For example, on
MATH,GRPO-P(78.2%) outperforms standardGRPO(76.3%) by almost 2 percentage points andRLOO(76.8%). OnAMC,GRPO-P(49.7%) significantly surpassesRLOO(48.3%) and standardGRPO(47.9%). - Stability and Efficiency: These results highlight that while baseline
GRPOis competitive, the proposed robustadvantage estimatorinGRPO-Penables more stable and efficientpolicy optimization, leading to superior performance, especially in complex reasoning scenarios where capturingpreference-based rewardstructures is critical.
6.1.3. Performance with Smaller Policy Model (Qwen2.5-Math-1.5B)
The paper also includes results for a smaller policy model, Qwen2.5-Math-1.5B, demonstrating the scalability and effectiveness of the approach across different model sizes. These results are provided in Appendix B of the original paper, but are crucial for a comprehensive analysis.
The following are the results from Table 5 of the original paper, showing the performance of the policy model initialized by Qwen2.5-Math-1.5B trained with PRMs on GRPO.
| GSM8K | AMC | MATH | Olympiad Bench | |
|---|---|---|---|---|
| Math-Shepherd-PRM-7B | 88.4 | 23.6 | 50.2 | 25.1 |
| EurusPRM-Stage2 | 87.7 | 22.2 | 49.6 | 23.8 |
| Skywork-PRM-7B | 88.2 | 23.8 | 50.2 | 25.3 |
| Math-PSA | 88.0 | 21.7 | 50.6 | 24.3 |
| PPRM | 88.6 | 24.7 | 51.0 | 25.7 |
Table 5, showing results for the smaller Qwen2.5-Math-1.5B model, mirrors the trends observed with the 7B model. PPRM still achieves the highest scores across all benchmarks: GSM8K (88.6%), AMC (24.7%), MATH (51.0%), and Olympiad Bench (25.7%). This indicates that the benefits of PPRM's preference-based reward signal are consistent even for LLMs with fewer parameters.
The following are the results from Table 6 of the original paper, showing the performance of the policy model initialized by Qwen2.5-Math-1.5B trained with PRMs on RLOO and GRPO with various advantage estimators.
| GSM8K | AMC | MATH | Olympiad Bench | |
|---|---|---|---|---|
| RLOO | 87.8 | 25.8 | 49.6 | 24.5 |
| ReMax | 87.5 | 25.2 | 50.4 | 24.9 |
| GRPO | 88.6 | 24.7 | 51.0 | 25.7 |
| GRPO-P | 88.8 | 26.0 | 53.2 | 26.2 |
Table 6 further confirms the effectiveness of the preference-based advantage estimator (GRPO-P) for the smaller Qwen2.5-Math-1.5B model. GRPO-P again outperforms all other RL algorithms and advantage estimators, with top scores on GSM8K (88.8%), AMC (26.0%), MATH (53.2%), and Olympiad Bench (26.2%). The improvement on MATH is particularly significant (53.2% vs. 51.0% for standard GRPO), demonstrating that the enhanced GRPO is crucial for optimizing policies trained with PPRM, especially for complex problems and even with smaller LLM backbones.
6.2. Ablation Studies / Parameter Analysis
The comparison of RL algorithms and advantage estimators (Tables 4 and 6) serves as a key ablation study. It directly demonstrates the impact of the proposed robust advantage estimator within the GRPO framework.
- By comparing
GRPO(with standardnormalized rewards) againstGRPO-P(with thepreference-based advantage estimator), the authors show that the novel estimator consistently yields superior performance across all benchmarks for both 7B and 1.5B models. This confirms that explicitly accounting for the structure ofpreference-based process reward modelsduringRL trainingis crucial for robust and efficientpolicy optimization. - The results validate the paper's theoretical claim that standard
advantage estimatorsstruggle with thenon-stationarityinduced bypreference-based rewards, and that the proposedsigmoid-based aggregationhelps stabilize and improveRLperformance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces a reinforcement learning framework guided by a Preference-Based Process Reward Model (PPRM) for robust mathematical reasoning in LLMs. The core innovation lies in mitigating the inherent bias of heuristic search strategies (like MCTS) in process reward modeling by leveraging preference-based learning through a Bradley-Terry loss function. The authors theoretically demonstrate that this approach offers a more stable and generalizable reward signal. Furthermore, to enable effective RL training with PPRM, they enhance Group Relative Policy Optimization (GRPO) with a novel robust advantage estimator that is specifically designed to capture the structure of preference-based process rewards. Experimental results on ProcessBench and with a best-of-n strategy, across various mathematical benchmarks and different LLM sizes, consistently show PPRM achieving 2- improvement in intermediate step accuracy and overall reasoning performance compared to existing methods.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation: the computational demands of the MCTS process. While less expensive than extensive human annotation, its computational overhead remains substantial. This could potentially limit the scalability of the approach to more complex or longer-horizon reasoning tasks.
As future work, the authors propose prioritizing the exploration of more efficient MCTS variants or alternative simulation-based methods. This suggests a continued effort to reduce the computational cost of data generation while maintaining the quality of preference pairs.
7.3. Personal Insights & Critique
This paper presents a significant step forward in making LLMs more reliable for complex mathematical reasoning. The application of preference learning to process reward models is a theoretically sound and empirically effective approach to debias automated feedback.
- Innovation: The key innovation lies in recognizing and addressing the
biasinMC-based PRMsnot just by refiningMC estimationbut by fundamentally shifting thereward modelingparadigm topreference learning. This leverages the strengths ofBradley-Terry modelsto derive robustrelative quality signals, which are inherently less susceptible to absolute noise. The subsequent adaptation ofGRPOwith a specializedadvantage estimatoris a thoughtful and necessary step to integrate this novelreward signaleffectively. - Transferability: The methodology of using
preference learningtodebias step-wise reward modelscould be highly transferable to other multi-step reasoning tasks beyond mathematics, such as scientific problem-solving, code generation, or even complex planning. Any domain wherestep-wise correctnessis crucial but hard to define absolutely, and whereMCTSor similar search strategies are used for data generation, could benefit from this framework. - Potential Issues/Areas for Improvement:
-
Computational Cost of MCTS: While the paper acknowledges this as a limitation, the reliance on
MCTSfor generating the initialchosenandrejected rollouts(even if filtered) still represents a significant computational bottleneck. Exploring moredata-efficient preference learningtechniques orsynthetic data generationmethods that don't rely as heavily on extensiveMCTSsimulations would be valuable. -
Sensitivity to Hyperparameters: The
Q-valuescoring mechanism forchosenandrejected rollouts(Eq. 1) relies on hyperparameters and . The paper states and were used, but a deeper analysis of their sensitivity and how they affect the quality and diversity of generated preference pairs could strengthen the methodology. -
Generalizability of "Bias Offset" Assumption: Assumption 3, which states that
biascan be offset forpreference annotated data pairs(i.e., is concentrated around 0), is critical for Lemma 1. While plausible, a more robust empirical validation of this assumption under variousMCTSconditions andcompletion modelswould be beneficial. -
Interpretability of Preference Signal: While
PPRMprovides a betterreward signal, the reasoning behind why one step is preferred over another (beyond itsPPRM score) might still be opaque. Future work could explore methods to make thePPRM's preferences more transparent and interpretable to human users. -
Scaling to Even More Complex Problems: Olympiad-level math is highly complex, but real-world scientific discovery or advanced engineering design might involve even longer reasoning chains and more abstract concepts. The scalability of
PPRMand its enhancedGRPOto trulylonger-horizon, open-ended reasoning taskswould be an interesting future direction.Overall, this paper provides a robust and theoretically grounded improvement to
process reward modelingandRL trainingforLLMsin mathematical reasoning, offering valuable insights that could extend to other complexAItasks.
-
Similar papers
Recommended via semantic vector search.