
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Published: 10/09/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The HERO framework integrates verifiable rewards with reward models to address the limitations of sparse feedback in large language model reasoning tasks. Using stratified normalization and variance-aware weighting, HERO significantly improves performance on mathematical reasoning benchmarks, outperforming both verifier-only and RM-only baselines.

Abstract

Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

The title concisely summarizes the paper's core thesis: in scenarios where feedback signals are infrequent and binary (sparse), supplementing them with more continuous and graded feedback (dense) leads to better learning outcomes. It points to a hybrid approach in the context of reinforcement learning.

1.2. Authors

Leitian Tao, Sharon Li, Jason E Weston, Ping Yu.

The authors are affiliated with FAIR at Meta and the University of Wisconsin-Madison. Jason Weston is a prominent researcher at Meta AI Research (FAIR), known for his extensive work in natural language processing, dialogue systems, and reinforcement learning. The affiliations with a top-tier industrial research lab and a major research university indicate a strong background in both theoretical and applied machine learning.

1.3. Journal/Conference

The paper is available on arXiv, a preprint server for academic papers, with a submission date in October 2025. Given the topic and quality, it would be a suitable submission for top-tier AI/ML conferences such as NeurIPS, ICML, or ICLR. arXiv is a standard platform for disseminating research quickly, but papers posted there have not yet undergone peer review.

1.4. Publication Year

The paper is listed with a publication date of October 8, 2025. The version analyzed is v3, uploaded at 2025-10-08T17:09:41Z.

1.5. Abstract

The abstract introduces the problem of post-training Large Language Models (LLMs) for reasoning tasks. It notes the increasing reliance on verifiable rewards—deterministic 0-1 signals (correct/incorrect)—which are reliable but brittle and sparse. This "all-or-nothing" feedback limits learning, as it fails to credit partially correct answers. The paper proposes that reward models (RMs), which provide continuous (dense) feedback, can be a complementary signal. The core contribution is HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals and RM scores. HERO uses two key techniques: stratified normalization to rescale RM scores within verifier-defined correctness groups (preserving correctness while adding nuance) and variance-aware weighting to focus on challenging prompts. The abstract concludes that HERO outperforms both RM-only and verifier-only methods on mathematical reasoning benchmarks, demonstrating that a hybrid approach combines the stability of verifiers with the nuance of RMs.

2. Executive Summary

2.1. Background & Motivation

The central problem this paper addresses is the suboptimal training of large language models (LLMs) for complex reasoning tasks like mathematics. The dominant method, Reinforcement Learning from Verifiable Rewards (RLVR), uses deterministic checkers (verifiers) to provide a binary 0 (incorrect) or 1 (correct) reward.

Why this is a problem:

  1. Brittleness: Verifiers are often too strict. They may incorrectly penalize a solution that is semantically correct but formatted differently (a "false negative"). This is common in math problems with multiple equivalent answer formats.

  2. Sparsity: The binary reward is sparse. For a given problem, if all generated solutions are incorrect, they all receive a reward of 0. The model gets no gradient signal to differentiate a nearly-correct solution from a completely nonsensical one. This stalls learning, especially on difficult problems where initial attempts are likely to be wrong.

    On the other hand, Reward Models (RMs), trained on human preferences, provide a continuous or "dense" score. This captures nuances like partial correctness, clarity, and proximity to the correct answer. However, RMs can be noisy, misaligned with objective correctness, and prone to "reward hacking," where the LLM learns to generate text that gets a high score from the RM without actually being correct.

This creates a clear gap: one signal is reliable but sparse (verifier), while the other is dense but unreliable (RM). The paper's entry point is the insight that these two signals are complementary. The core challenge is how to combine them effectively to get the best of both worlds: the reliability of the verifier and the nuanced feedback of the RM.

2.2. Main Contributions / Findings

The paper's main contribution is HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework designed to structurally integrate sparse verifier rewards and dense reward model scores.

HERO's key components are:

  1. Stratified Normalization: This is the core mechanism for combining the two reward types. Instead of simply adding or averaging the scores, HERO first uses the verifier's binary label (0 or 1) to partition generated answers into two groups: "incorrect" and "correct." Then, within each group, it normalizes the RM scores to a small, predefined range. This ensures that a verifiably correct answer always receives a higher reward than any incorrect answer, preserving the verifier's ground truth. The RM's role is demoted to a "tie-breaker," providing a fine-grained quality signal within the verifier-defined categories.
  2. Variance-Aware Weighting: This technique dynamically adjusts the learning signal's strength based on prompt difficulty. For a given prompt, if the RM scores for different generated answers have high variance, it implies the prompt is challenging and the RM is providing useful, discriminative feedback. HERO up-weights the reward for such prompts. Conversely, if all answers get similar RM scores (low variance), the prompt is likely too easy or too hard, providing little learning signal, and is down-weighted. This focuses training effort on the most informative samples.

Key Findings:

  • Consistent Outperformance: Across various mathematical reasoning benchmarks (both easy-to-verify and hard-to-verify), HERO consistently outperforms baselines that use only a verifier or only a reward model.
  • Robustness Across Models: The benefits of HERO generalize across different LLM backbones, including a strong model (Qwen3-4B-Base) and a weaker one (OctoThinker-8B-Hybrid-Base), demonstrating its wide applicability.
  • Solving Sparsity and Instability: HERO effectively addresses the sparsity problem of verifiers by providing dense gradients even when all solutions are incorrect. It also prevents the instability of RM-only training by anchoring the reward to the verifier's objective correctness. The gains are particularly significant on "hard-to-verify" tasks where verifiers are most brittle.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Large Language Models (LLMs) for Reasoning

LLMs are deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data. While they excel at generating fluent text, complex reasoning (e.g., solving multi-step math problems) remains a frontier. Post-training, often called alignment, is used to refine a base LLM's capabilities for specific tasks like following instructions or performing logical reasoning. This paper focuses on a post-training method using reinforcement learning.

3.1.2. Reinforcement Learning (RL)

RL is a machine learning paradigm where an agent (here, the LLM) learns to make decisions by interacting with an environment. The agent takes actions (generating tokens of text) to reach a certain state, and receives a reward signal in return. The agent's goal is to learn a policy—a strategy for choosing actions—that maximizes the cumulative reward over time. In the context of LLMs, the policy is the model's probability distribution over the next token to generate.

3.1.3. Reward Models (RMs)

A Reward Model is a separate model trained to predict the quality of an LLM's output. It takes a prompt and a generated response as input and outputs a scalar score.

  • Training: RMs are typically trained on a dataset of human preferences. For a given prompt, humans are shown two or more responses and asked to rank them. The RM is trained to assign a higher score to the preferred response.
  • Loss Function: The training often uses a pairwise loss based on the Bradley-Terry model, which models the probability that one item is preferred over another. The loss function used in the paper is: $\mathcal{L}_R = -\mathbb{E}_{(x, y_c, y_r) \in \mathcal{D}} \left[ \log \sigma \big( r(x, y_c) - r(x, y_r) \big) \right]$
    • $x$: The input prompt.
    • $y_c$: The chosen (preferred) response.
    • $y_r$: The rejected (less preferred) response.
    • $r(x, y)$: The scalar score from the reward model for response $y$ to prompt $x$.
    • $\sigma$: The sigmoid function, which squashes the difference in scores to a probability between 0 and 1. The goal of minimizing this loss is to make the score difference $r(x, y_c) - r(x, y_r)$ as large and positive as possible, thus aligning the RM with human preferences. This provides a dense reward signal because it can assign any value on a continuum, capturing fine-grained quality differences.
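To make the pairwise objective concrete, here is a minimal PyTorch sketch of this Bradley-Terry loss. It is an illustrative reconstruction, not code from the paper; the tensor names and toy scores are invented for the example.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r(x, y_c) - r(x, y_r)).

    chosen_scores / rejected_scores: shape (batch,), scalar RM scores for the
    preferred and rejected responses to the same prompts.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with made-up scores: the loss shrinks as the preference margin grows.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, -1.0])
print(bradley_terry_loss(chosen, rejected))
```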

3.1.4. Rule-Based Verifiers

A verifier is a deterministic program or function that checks if an LLM's output meets an objective criterion. For math problems, a verifier might:

  1. Parse the final numerical answer from the LLM's text.
  2. Normalize it (e.g., remove commas, convert to a standard format).
  3. Compare it to the ground-truth answer. The output is a binary signal, often called a sparse reward, because it offers only two states (correct/incorrect) with no intermediate values. The paper defines this verifier function $\psi$ as: $\psi(x, y_i, y_{\mathrm{ref}}) = \begin{cases} 1, & \text{if } y_i \text{ is equivalent to } y_{\mathrm{ref}} \text{ given } x, \\ 0, & \text{otherwise.} \end{cases}$
  • $x$: The input prompt.
  • $y_i$: The model's generated response.
  • $y_{\mathrm{ref}}$: The ground-truth reference solution.
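As an illustration of how such a checker works (and why it can be brittle), here is a deliberately simplified Python sketch. It is not the math_verify implementation, which additionally handles LaTeX, fractions, sets, and symbolic equivalence.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Grab the last number-like token from the model's text (simplistic)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return matches[-1] if matches else None

def verify(response: str, reference: str) -> int:
    """Binary verifier: 1 if the parsed answer matches the reference, else 0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0
    try:
        return int(abs(float(answer) - float(reference)) < 1e-9)
    except ValueError:
        return int(answer.strip() == reference.strip())

print(verify("... so the total is 1,234.", "1234"))   # 1: exact match after normalization
print(verify("The answer is x = 7/2.", "3.5"))         # 0: format mismatch (a false negative)
```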

3.2. Previous Works

3.2.1. Reinforcement Learning from Verifiable Rewards (RLVR)

RLVR is a specific application of RL where the reward signal comes from a verifier instead of a human-preference-trained RM. This approach has been successful in domains with objective correctness, like code generation (where unit tests act as verifiers) and mathematical reasoning. It provides a stable and reliable learning signal, but as the paper argues, it is often too sparse and brittle.

3.2.2. Group Relative Policy Optimization (GRPO)

GRPO is an advanced RL algorithm that this paper builds upon. Standard RL algorithms like PPO (Proximal Policy Optimization) typically learn from a single response (trajectory) per prompt. GRPO improves upon this by generating multiple candidate responses for the same prompt and comparing them as a group.

  • How it Works: Instead of calculating an absolute reward for each response, GRPO focuses on the relative advantage of each response within the group. It normalizes the rewards and uses them to identify which responses are better than the group average. This stabilizes learning and makes it more sample-efficient.
  • Relevance to the Paper: The authors use GRPO as their underlying RL algorithm. However, they note a key weakness: when a verifier gives all responses for a prompt the same binary reward (e.g., all are incorrect and get a 0), the relative advantages become zero for everyone. In this case, GRPO provides no useful policy gradient, and learning stalls. This is a primary motivation for HERO, which introduces a dense signal to provide gradients even in these situations.
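This failure mode is easy to see in code. The sketch below assumes a GRPO-style group-normalized advantage (standardizing rewards within the group of candidates, a common formulation rather than the authors' exact code): when every response receives the same binary reward, all advantages collapse to zero and no policy gradient flows.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: standardize rewards within one prompt's group of candidates."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Sparse verifier rewards where every candidate is wrong: no learning signal at all.
print(group_relative_advantages(np.array([0.0, 0.0, 0.0, 0.0])))   # all zeros

# A dense reward differentiates the failures and restores a gradient direction.
print(group_relative_advantages(np.array([-0.05, -0.02, 0.01, 0.05])))
```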

3.3. Technological Evolution

  1. Supervised Fine-Tuning (SFT): Early LLM alignment involved fine-tuning the model on a high-quality dataset of prompt-response pairs. This is effective but limited by the quality and diversity of the dataset.
  2. Reinforcement Learning from Human Feedback (RLHF): To go beyond static datasets, RLHF was introduced. It uses a reward model trained on human preferences to guide the LLM's learning via RL. This allows the model to explore and learn from its own outputs.
  3. Reinforcement Learning from Verifiable Rewards (RLVR): For domains with objective truth (math, code), RLVR replaced the subjective RM with a deterministic verifier. This brought more stability and reliability but introduced the problems of sparsity and brittleness.
  4. Hybrid Approaches (This Paper): The current paper represents the next logical step. It recognizes the complementary strengths of RMs and verifiers and proposes a structured way to combine them. Instead of choosing one or the other, HERO uses the verifier as a "ground truth anchor" and the RM as a "fine-grained refiner."

3.4. Differentiation Analysis

  • vs. Verifier-Only (RLVR): RLVR provides a sparse, all-or-nothing signal. HERO is superior because its dense component from the RM allows it to distinguish between different "bad" answers or different "good" answers, providing a learning gradient where RLVR would provide none.
  • vs. RM-Only (RLHF): RM-only training can be unstable and prone to reward hacking. HERO is superior because its stratified normalization strictly subordinates the RM score to the verifier's judgment. A response deemed incorrect by the verifier can never get a higher reward than a correct one, no matter how high its RM score is. This anchors the training to objective correctness.
  • vs. Naive Combination: The paper shows that simply adding or averaging the verifier and RM scores is ineffective. This naive approach leads to a noisy and misaligned reward signal. HERO's stratified normalization is a principled, structured combination that avoids this instability.
  • vs. GRPO: HERO is an enhancement of GRPO, not a replacement. It solves a key failure case of GRPO (when all rewards in a group are identical) by injecting a dense, intra-group reward differential. The variance-aware weighting is another novel addition not present in the original GRPO.

4. Methodology

4.1. Principles

The core design principle of HERO is to use the reliable but sparse signal from a rule-based verifier as the primary guide for correctness, while leveraging the dense but potentially noisy signal from a reward model as a supplementary refiner to provide more granular feedback. The methodology is built on two key innovations designed to integrate these signals in a stable and effective manner: stratified normalization and variance-aware advantage reweighting.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Motivation: The Limits of Verifiers and RMs

The authors first motivate their hybrid approach by analyzing the trade-offs of existing methods on a challenging math benchmark (HardVerify_Math). The following table, transcribed from Table 1 of the paper, summarizes their findings.

The following are the results from Table 1 of the original paper:

| Type | Verifier | Recall ↑ | Precision ↑ | FPR ↓ | Acc. ↑ |
| --- | --- | --- | --- | --- | --- |
| Rule-based | math_reward (verl) | 10.1 | 97.5 | 0.3 | 53.6 |
| Rule-based | math_verify (verl) | 68.4 | 100.0 | 0.0 | 83.7 |
| Rule-based | math_verify (library) | 38.6 | 96.1 | 1.6 | 67.6 |
| Generative model-based | TIGER-Lab/general-verifier | 49.5 | 89.3 | 6.3 | 70.9 |
| RM-based | AceMath-7B-RM w/ threshold 1 | 91.7 | 67.7 | 46.4 | 73.2 |
| RM-based | AceMath-7B-RM w/ threshold 3 | 84.2 | 72.7 | 33.5 | 75.6 |
| RM-based | AceMath-7B-RM w/ threshold 5 | 73.8 | 76.6 | 23.9 | 74.9 |
| RM-based | AceMath-7B-RM w/ threshold 7 | 62.4 | 78.5 | 18.1 | 71.9 |

Analysis:

  • Rule-based verifiers like math_verify (verl) achieve perfect or near-perfect precision (they are rarely wrong when they label an answer as correct) but suffer from low recall (they miss many genuinely correct answers that are formatted differently). This confirms their reliability but also their brittleness.

  • RM-based verifiers (using a score threshold) show the opposite trend. They can achieve very high recall (identifying most correct answers) but at the cost of much lower precision and a high false-positive rate (FPR), meaning they often label incorrect answers as correct. This confirms their ability to provide broad coverage but also their unreliability.

    This empirical analysis highlights the fundamental tension and motivates a hybrid solution that can harness the precision of verifiers and the recall/coverage of RMs.

4.2.2. HERO: Hybrid Ensemble Reward Optimization

HERO addresses this tension through a two-stage reward shaping process.

Step 1: Stratified Normalization

This is the core of HERO's hybrid reward calculation. Instead of naively mixing scores, it first uses the verifier's binary reward, $r_{\mathrm{rule}} \in \{0, 1\}$, to stratify the generated responses for a given prompt into two groups: incorrect ($r_{\mathrm{rule}} = 0$) and correct ($r_{\mathrm{rule}} = 1$). Then, it normalizes the continuous reward model scores, $r_{\mathrm{RM}}$, within each group and rescales them to a small, non-overlapping range.

The formula for the stratified reward, $\hat{r}(x, y)$, is:

$$\hat{r}(x, y) = \begin{cases} -\alpha + 2\alpha \cdot \dfrac{r_{\mathrm{RM}} - \min r_{\mathrm{RM}}}{\max r_{\mathrm{RM}} - \min r_{\mathrm{RM}} + \epsilon}, & r_{\mathrm{rule}} = 0, \\[2ex] (1 - \beta) + 2\beta \cdot \dfrac{r_{\mathrm{RM}} - \min r_{\mathrm{RM}}}{\max r_{\mathrm{RM}} - \min r_{\mathrm{RM}} + \epsilon}, & r_{\mathrm{rule}} = 1. \end{cases}$$

Symbol Explanation:

  • $\hat{r}(x, y)$: The final stratified reward for response $y$ to prompt $x$.
  • $r_{\mathrm{rule}}$: The binary reward from the rule-based verifier (0 for incorrect, 1 for correct).
  • $r_{\mathrm{RM}}$: The continuous score from the reward model.
  • $\min r_{\mathrm{RM}}$ and $\max r_{\mathrm{RM}}$: The minimum and maximum RM scores within the same stratum (i.e., among all responses that also received $r_{\mathrm{rule}} = 0$, or all that received $r_{\mathrm{rule}} = 1$).
  • $\alpha, \beta$: Small positive hyperparameters (e.g., 0.05 or 0.1) that control the size of the reward range for the incorrect and correct groups, respectively.
  • $\epsilon$: A small constant to prevent division by zero if all RM scores in a group are identical.

How it works (integrated explanation):

  1. If a response is incorrect ($r_{\mathrm{rule}} = 0$):
    • The term $\frac{r_{\mathrm{RM}} - \min r_{\mathrm{RM}}}{\max r_{\mathrm{RM}} - \min r_{\mathrm{RM}} + \epsilon}$ performs min-max normalization on the RM score, scaling it to a value between 0 and 1 relative to the other incorrect responses.
    • This normalized value is then mapped to the range $[-\alpha, \alpha]$. For instance, the worst incorrect answer (with $\min r_{\mathrm{RM}}$) gets a reward of $-\alpha$, and the "best" incorrect answer (with $\max r_{\mathrm{RM}}$) gets a reward of $\alpha$.
  2. If a response is correct ($r_{\mathrm{rule}} = 1$):
    • The same min-max normalization is performed, but this time relative to the other correct responses.

    • This normalized value is then mapped to the range $[1-\beta, 1+\beta]$. The "worst" correct answer gets a reward of $1-\beta$, and the best correct answer gets $1+\beta$.

      By choosing small $\alpha$ and $\beta$ (e.g., $\alpha + \beta < 1$), the framework guarantees that the highest possible reward for an incorrect answer ($\alpha$) is always less than the lowest possible reward for a correct answer ($1-\beta$). This preserves the verifier's correctness guarantee while allowing the RM to provide a fine-grained quality ranking within each group. This process is illustrated in the figure below from the paper, and a short code sketch of the computation follows the figure.

      (Figure, translated caption) The figure is a schematic of three reward schemes: (a) the reward model, which exhibits false-negative and false-positive samples; (b) the rule-based reward, which gives a strict binary signal; and (c) hybrid reinforcement learning (HERO), which combines the strengths of both and improves the handling of low-quality samples. High-quality and low-quality response samples are marked in the figure.
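A minimal NumPy sketch of the stratified normalization step, following the formula above; the function and variable names and the toy scores are illustrative, not the authors' implementation.

```python
import numpy as np

def stratified_rewards(rule: np.ndarray, rm: np.ndarray,
                       alpha: float = 0.1, beta: float = 0.1,
                       eps: float = 1e-6) -> np.ndarray:
    """Map RM scores into [-alpha, alpha] for incorrect answers and
    [1 - beta, 1 + beta] for correct answers, per the stratified formula."""
    out = np.empty_like(rm, dtype=float)
    for label, lo, span in ((0, -alpha, 2 * alpha), (1, 1 - beta, 2 * beta)):
        mask = rule == label
        if not mask.any():
            continue
        scores = rm[mask]
        norm = (scores - scores.min()) / (scores.max() - scores.min() + eps)
        out[mask] = lo + span * norm
    return out

rule = np.array([0, 0, 0, 1, 1])             # verifier labels for 5 candidate answers
rm = np.array([-2.0, 0.5, 1.0, 0.2, 3.0])    # reward-model scores for the same candidates
print(stratified_rewards(rule, rm))
# Incorrect answers land in [-0.1, 0.1]; correct answers land in [0.9, 1.1],
# so every correct answer still outranks every incorrect one.
```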

Step 2: Variance-Aware Advantage Reweighting

The second innovation addresses the fact that not all prompts are equally useful for training. Some are too easy (all generated responses are correct) or too hard (all are incorrect), providing little information. HERO prioritizes prompts where the model shows high uncertainty or variance in its responses, as these are the most informative for learning.

The difficulty weight, $w_{\mathrm{difficulty}}$, is calculated using a logistic function:

$$w_{\mathrm{difficulty}}(\sigma_u) = w_{\min} + (w_{\max} - w_{\min}) \cdot \frac{1}{1 + \exp\big(-k(\sigma_u - \bar{\sigma})\big)}$$

Symbol Explanation:

  • $w_{\mathrm{difficulty}}(\sigma_u)$: The weight assigned to the current prompt.
  • $\sigma_u$: The standard deviation of the reward model scores ($r_{\mathrm{RM}}$) across all candidate responses for the current prompt. This measures the "difficulty" or "ambiguity" of the prompt.
  • $\bar{\sigma}$: A running mean of the standard deviations across recent prompts.
  • $w_{\min}, w_{\max}$: Hyperparameters defining the minimum and maximum possible weights (e.g., 0.5 and 2.0). This bounds the reweighting effect.
  • $k$: A hyperparameter controlling the steepness of the logistic curve, determining how quickly the weight transitions from minimum to maximum.

How it works (integrated explanation):

  • If the variance for the current prompt, $\sigma_u$, is much higher than the average variance $\bar{\sigma}$, the term $(\sigma_u - \bar{\sigma})$ is large and positive. The exponential term $\exp(-k(\cdot))$ becomes very small, and the weight $w_{\mathrm{difficulty}}$ approaches $w_{\max}$. The prompt is up-weighted.

  • If $\sigma_u$ is much lower than $\bar{\sigma}$, the term $(\sigma_u - \bar{\sigma})$ is large and negative. The exponential term becomes very large, and the weight $w_{\mathrm{difficulty}}$ approaches $w_{\min}$. The prompt is down-weighted.

    Finally, this weight is multiplied by the stratified reward to get the final reward used for the policy update:
    $$r_{\mathrm{final}}(x, y) = w_{\mathrm{difficulty}}(\sigma_u) \cdot \hat{r}(x, y)$$
    This ensures that the model learns more from challenging prompts where it struggles and the RM provides a strong discriminative signal.
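The sketch below shows how the variance-aware weight could be computed and applied to the stratified rewards from the previous step. It is illustrative only, with assumed hyperparameter values (w_min = 0.5, w_max = 2.0, k = 5) and a toy running mean; in practice the running mean of the standard deviations would be maintained across training batches.

```python
import numpy as np

def difficulty_weight(rm_scores: np.ndarray, sigma_bar: float,
                      w_min: float = 0.5, w_max: float = 2.0, k: float = 5.0) -> float:
    """Logistic weight in [w_min, w_max]: up-weight prompts whose RM-score
    std-dev sigma_u exceeds the running mean sigma_bar, down-weight the rest."""
    sigma_u = float(np.std(rm_scores))
    return w_min + (w_max - w_min) / (1.0 + np.exp(-k * (sigma_u - sigma_bar)))

rm_scores = np.array([-2.0, 0.5, 1.0, 0.2, 3.0])      # RM scores for one prompt's candidates
stratified = np.array([-0.1, 0.07, 0.1, 0.9, 1.1])    # hat{r} values from the previous sketch (toy numbers)

w = difficulty_weight(rm_scores, sigma_bar=1.0)        # > 1 here: this prompt is unusually ambiguous
r_final = w * stratified                               # r_final = w_difficulty * hat{r}
print(round(w, 3), r_final)
```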

5. Experimental Setup

5.1. Datasets

  • Training Datasets: The training data was derived from the OPENMATHREASONING benchmark. The authors created three distinct training regimes to test generalization:
    1. Easy-to-verify: 2,000 problems where the final answer can be checked deterministically by a rule-based verifier (math_verifier).
    2. Hard-to-verify: 2,000 problems where the answers have flexible formats (e.g., lists vs. sets, different orderings) that make rule-based verification difficult.
    3. Mixed: A combination of 1,000 easy-to-verify and 1,000 hard-to-verify problems.
  • Evaluation Datasets: A diverse set of benchmarks was used to evaluate performance thoroughly.
    1. Easy-to-verify Test Sets:
      • MATH500: A subset of the MATH dataset of challenging high school competition math problems.
      • AMC: Problems from the American Mathematics Competitions.
      • Minerva: A dataset of quantitative reasoning problems.
      • Olympiad: Problems from math olympiads, known for their difficulty.
    2. Hard-to-verify Test Sets:
      • HardVerify-Math: A curated benchmark of 250 problems known to be difficult for verifiers, including Olympiad questions and problems with complex answer formats.
      • TextBookReasoning: An additional dataset of hard-to-verify problems curated by the authors to be particularly challenging.

5.2. Evaluation Metrics

The paper uses different metrics depending on the verifiability of the task.

5.2.1. For Easy-to-verify Tasks: pass@1

  • Conceptual Definition: pass@1 measures the percentage of problems for which the model generates a correct solution on its first attempt. For each problem, the model generates one solution ($k=1$), and if that solution passes the verifier, it is counted as a success. It is a strict metric for generative performance.
  • Mathematical Formula: In the paper's setup, $N=8$ candidates are generated but only the first one is evaluated. So, pass@1 is calculated as: $\text{pass@1} = \frac{\text{Number of problems solved correctly in the first attempt}}{\text{Total number of problems}}$
  • Symbol Explanation: The numerator is the count of problems where the first generated solution is deemed correct by the math_verifier.

5.2.2. For Hard-to-verify Tasks: LLM-as-a-judge

  • Conceptual Definition: Since rule-based verifiers are unreliable for these tasks, the authors use a powerful, advanced LLM (GPT-4o) as a "judge" to evaluate correctness. The judge is given the problem, the ground-truth answer, and the model's generated answer, and is prompted to decide if the model's answer is equivalent to the ground truth. The reported score is the percentage of answers deemed correct by the LLM judge.
  • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of answers judged correct by GPT-4o}}{\text{Total number of problems}} $
  • Symbol Explanation: The numerator is the count of problems where GPT-4o outputs "Final Decision: Yes".

5.2.3. For Verifier Analysis (Table 1): Precision, Recall, FPR, Accuracy

  • Conceptual Definition: These are standard binary classification metrics used to evaluate the performance of the verifiers themselves, where "positive" means labeling an answer as correct.
    • Precision: Of all the answers the verifier labeled as correct, what fraction was actually correct? (Measures reliability).
    • Recall: Of all the truly correct answers, what fraction did the verifier successfully identify? (Measures coverage).
    • False Positive Rate (FPR): Of all the truly incorrect answers, what fraction did the verifier mistakenly label as correct?
    • Accuracy: What fraction of all answers (both correct and incorrect) did the verifier label correctly?
  • Mathematical Formulas: $\text{Precision} = \frac{TP}{TP + FP}$, $\text{Recall} = \frac{TP}{TP + FN}$, $\text{FPR} = \frac{FP}{FP + TN}$, $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (a short code helper follows the symbol list below).
  • Symbol Explanation:
    • TP (True Positives): Number of correct answers correctly identified as correct.
    • FP (False Positives): Number of incorrect answers incorrectly identified as correct.
    • TN (True Negatives): Number of incorrect answers correctly identified as incorrect.
    • FN (False Negatives): Number of correct answers incorrectly identified as incorrect.
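These metrics can be computed directly from the four counts. The short Python helper below is illustrative and not tied to the paper's evaluation code; the toy counts mimic a strict verifier with high precision but low recall.

```python
def verifier_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary-classification metrics, where 'positive' = labeled correct."""
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall":    tp / (tp + fn) if tp + fn else 0.0,
        "fpr":       fp / (fp + tn) if fp + tn else 0.0,
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
    }

# Toy counts: many correct answers are missed (fn is large), but flagged answers are reliable.
print(verifier_metrics(tp=40, fp=1, tn=50, fn=60))
```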

5.3. Baselines

The paper compares HERO against several important baselines:

  • Base Model: The original, pre-trained LLM without any fine-tuning (Qwen3-4B-Base, OctoThinker-8B-Hybrid-Base).
  • SFT Cold Start Model: The base model after supervised fine-tuning (SFT) on a small set of correct solutions. This is the starting point for all RL experiments.
  • RM-only: The SFT model further trained using RL (GRPO) where the reward comes only from a reward model (AceMath-RM-7B).
  • Verifier-only: The SFT model further trained using RL (GRPO) where the reward comes only from a binary rule-based verifier (math_verify (verl)).
  • Model-based Verifiers: Other models used as verifiers, including TIGER-Lab/general-verifier and Qwen2.5-7B-Instruct, for a more comprehensive comparison.

6. Results & Analysis

6.1. Core Results Analysis

The main results demonstrate HERO's effectiveness across different LLM backbones and training data regimes.

6.1.1. Performance on Qwen3-4B-Base

The following are the results from Table 2 of the original paper:

(MATH500 through Easy Avg. are easy-to-verify tasks; HVM, TBR, and Hard Avg. are hard-to-verify tasks.)

| Method | MATH500 | AMC | Minerva | Olympiad | Easy Avg. ↑ | HVM | TBR | Hard Avg. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-4B-Base | 67.5 | 44.1 | 29.4 | 32.1 | 43.3 | 45.2 | 40.2 | 42.7 |
| SFT Cold Start Model | 69.1 | 50.3 | 39.1 | 34.3 | 48.2 | 50.8 | 43.3 | 47.1 |
| Training with easy-to-verify samples | | | | | | | | |
| AceMath-7B-RM | 80.2 | 61.6 | 40.6 | 43.3 | 56.4 | 57.2 | 52.0 | 54.6 |
| math_verify (verl) | 82.3 | 61.3 | 44.0 | 45.5 | 58.3 | 61.0 | 53.1 | 57.1 |
| HERO (Ours) | 85.4 | 69.4 | 44.5 | 48.9 | 62.0 | 73.2 | 59.3 | 66.3 |
| Training with hard-to-verify samples | | | | | | | | |
| AceMath-7B-RM | 79.6 | 58.8 | 39.9 | 42.1 | 55.1 | 59.2 | 48.2 | 53.7 |
| math_verify (verl) | 76.2 | 46.6 | 28.7 | 38.2 | 47.4 | 58.4 | 50.0 | 54.2 |
| HERO (Ours) | 80.0 | 63.4 | 40.7 | 43.1 | 56.8 | 59.0 | 54.0 | 56.5 |
| Training with mixed samples | | | | | | | | |
| AceMath-7B-RM | 79.6 | 58.8 | 39.9 | 42.1 | 55.1 | 58.4 | 49.6 | 54.0 |
| math_verify (verl) | 81.3 | 61.3 | 38.0 | 43.9 | 56.1 | 62.4 | 55.3 | 58.9 |
| HERO (Ours) | 81.6 | 64.4 | 42.1 | 47.0 | 58.8 | 71.4 | 56.7 | 64.1 |

Analysis:

  • Consistent Superiority: In all three training settings (easy, hard, mixed), HERO achieves the highest average score on both easy-to-verify and hard-to-verify tasks.
  • Largest Gains on Hard-to-verify Tasks: The advantage of HERO is most dramatic when evaluating on hard-to-verify tasks. When trained on easy-to-verify data, HERO achieves an average score of 66.3, a massive improvement of +9.2 points over the verifier-only baseline (57.1) and +11.7 points over the RM-only baseline (54.6). This shows that HERO's structured reward helps the model generalize its reasoning skills to problems with formats that stymie simple verifiers.
  • Stability: When training on hard-to-verify samples, the verifier-only baseline (math_verify) performs poorly on verifiable tasks (average of 47.4), even worse than the SFT model. This is because the verifier is brittle and provides misleading or sparse signals on this data. HERO, by contrast, remains stable and achieves the best score (56.8), showing its robustness to noisy training data.

6.1.2. Performance on OctoThinker-8B-Hybrid-Base

The following are the results from Table 3 of the original paper:

(MATH500 through Easy Avg. are easy-to-verify tasks; HVM, TBR, and Hard Avg. are hard-to-verify tasks.)

| Method | MATH500 | AMC | Minerva | Olympiad | Easy Avg. ↑ | HVM | TBR | Hard Avg. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OctoThinker-8B-Hybrid-Base | 32.0 | 15.3 | 9.1 | 11.0 | 16.9 | 26.0 | 21.1 | 23.6 |
| SFT Cold Start Model | 56.0 | 35.9 | 19.7 | 21.6 | 33.3 | 27.6 | 26.4 | 27.0 |
| Training with easy-to-verify samples | | | | | | | | |
| AceMath-7B-RM | 62.3 | 38.4 | 26.2 | 25.5 | 38.1 | 29.6 | 27.8 | 28.7 |
| math_verify (verl) | 60.1 | 39.4 | 26.7 | 24.1 | 37.6 | 31.6 | 28.9 | 30.3 |
| HERO (Ours) | 63.0 | 40.6 | 30.1 | 26.7 | 40.1 | 28.4 | 36.7 | 32.6 |
| Training with hard-to-verify samples | | | | | | | | |
| AceMath-7B-RM | 60.7 | 33.8 | 22.4 | 24.9 | 35.4 | 32.0 | 29.8 | 30.9 |
| math_verify (verl) | 60.0 | 29.7 | 23.9 | 24.8 | 34.6 | 28.8 | 26.7 | 27.8 |
| HERO (Ours) | 64.9 | 41.6 | 27.9 | 29.6 | 41.0 | 32.4 | 36.7 | 34.6 |
| Training with mixed samples | | | | | | | | |
| AceMath-7B-RM | 60.2 | 34.4 | 24.0 | 23.8 | 35.6 | 30.8 | 29.3 | 30.1 |
| math_verify (verl) | 59.3 | 33.7 | 24.7 | 24.0 | 35.4 | 27.6 | 28.7 | 28.2 |
| HERO (Ours) | 65.2 | 38.1 | 28.1 | 29.3 | 40.2 | 34.8 | 31.6 | 33.2 |

Analysis: This table shows that HERO's benefits are not limited to strong base models. OctoThinker starts from a much lower baseline performance. HERO provides substantial absolute and relative gains, lifting the model's performance significantly across all settings. For example, when training on hard-to-verify samples, HERO achieves a 41.0 average on verifiable tasks, compared to 35.4 for RM-only and 34.6 for verifier-only. This confirms that the hybrid reward structure is a generally effective method for improving reasoning.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Importance of Negative vs. Positive Dense Rewards

The following figure from the paper (Figure 2, left panel) analyzes the contribution of dense rewards in the correct (positive) versus incorrect (negative) groups.

(Figure 2, translated caption) The chart shows, on the left, the effect of positive versus negative dense feedback on accuracy for easy-to-verify and hard-to-verify tasks, and on the right, the results of the reward-range ablation. Across tasks, HERO improves accuracy over the variants without dense feedback; in the left panel the easy-to-verify and hard-to-verify scores reach 62.2 and 73.2, respectively, while in the right panel accuracy differs between easy-to-verify and mixed training samples.

Analysis: The ablation shows that providing a dense reward signal for the negative group (incorrect answers) is particularly critical. When only negative dense rewards are used (and positive rewards are sparse), performance increases significantly, especially on hard-to-verify tasks (from 62.2 to 68.4). This suggests that much of the learning comes from being able to differentiate between various types of errors. The ability to distinguish a "nearly correct" attempt from a "wildly incorrect" one provides a much richer learning signal than simply labeling all of them as "wrong."

6.2.2. Impact of Reward Range (α)

The right panel of Figure 2, shown above, studies the impact of the hyperparameter α, which controls the reward range for incorrect answers.

  • For verifiable tasks, a smaller range (α = 0.05) works best. This keeps the RM's influence minimal, preserving the stability of the high-precision verifier signal.
  • For mixed tasks, a larger range (α = 0.1 or α = 0.2) is better. In this setting, the verifier fails more often, so a stronger signal from the RM is needed to guide learning on hard-to-verify samples.

6.2.3. Impact of Variance-Aware Reweighting

The following are the results from Table 4 of the original paper:

| Method | Easy-to-verify | Hard-to-verify |
| --- | --- | --- |
| w/o reweighting | 60.8 | 69.4 |
| w/ reweighting | 62.0 | 73.2 |

Analysis: This ablation clearly demonstrates the effectiveness of the variance-aware reweighting mechanism. Adding this component improves performance on both task types, with a particularly large gain of +3.8 points on hard-to-verify tasks. This confirms the intuition that focusing training capacity on more ambiguous, high-variance prompts leads to more robust and generalizable learning.

6.2.4. Impact of Reward Model Size

The following are the results from Table 5 of the original paper:

| Reward model | Easy-to-verify | Hard-to-verify |
| --- | --- | --- |
| AceMath-RM-7B | 62.0 | 73.2 |
| AceMath-RM-72B | 62.8 | 71.4 |

Analysis: This experiment shows that replacing the 7B parameter reward model with a much larger 72B model provides almost no benefit. Performance on verifiable tasks sees a negligible increase, while performance on hard-to-verify tasks actually decreases slightly. This is a powerful finding: the success of HERO comes from its structured reward formulation, not from the raw power or scale of the reward model. This makes HERO an efficient approach, as it does not require massive RMs to be effective.

6.2.5. Naive Combination vs. HERO

The paper also demonstrates that simply mixing the two reward signals is not enough. The following are the results from Table 9 of the original paper:

| Method | Easy-to-verify | Hard-to-verify |
| --- | --- | --- |
| Reward combine (α = 0.1) | 57.6 | 60.2 |
| Reward combine (α = 0.5) | 58.7 | |
| Reward combine (α = 0.9) | 55.9 | 60.4 |
| HERO (Ours) | 62.0 | 73.2 |

Analysis: A naive weighted sum of the verifier and RM rewards performs far worse than HERO and is not consistently better than the individual baselines. This confirms that the stratified normalization is the key innovation that enables stable and effective integration, preventing the noisy RM signal from interfering with the verifier's reliable correctness signal.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully identifies a critical weakness in current methods for training LLMs on reasoning tasks: the trade-off between reliable but sparse verifier rewards and dense but noisy reward model signals. The proposed solution, HERO (Hybrid Ensemble Reward Optimization), provides a principled and effective framework for combining these two signal types.

Through two key innovations—stratified normalization to anchor RM scores to verifier-defined correctness groups and variance-aware weighting to focus learning on the most informative prompts—HERO achieves the best of both worlds. It retains the stability and objectivity of verifiers while leveraging the nuanced, fine-grained feedback of reward models to escape learning plateaus caused by sparse rewards. Empirical results on diverse mathematical reasoning benchmarks and across different base models consistently show that HERO outperforms both verifier-only and RM-only approaches, with particularly strong gains in generalizing to hard-to-verify problems.

7.2. Limitations & Future Work

The authors acknowledge several limitations and areas for future research:

  • Verifier Dependency: HERO's effectiveness still relies on the existence of a reasonably good rule-based verifier. In domains where no such verifier exists or is highly unreliable, the benefits of stratification would be diminished.

  • RM Calibration: The reward model is trained on verifiable data, so it may be miscalibrated for hard-to-verify formats. While HERO constrains the RM, residual biases could still be exploited by the policy.

  • Hyperparameter Sensitivity: The framework introduces new hyperparameters (α, β, k, etc.) that require tuning for optimal performance, which can add complexity to the training process.

  • Evaluation on Hard Tasks: The evaluation of hard-to-verify tasks relies on an LLM-as-a-judge protocol, which can be noisy and subject to its own biases.

    Future work could explore:

  • Improving verifier coverage with hybrid symbolic-learned methods.

  • Incorporating process-level rewards (which evaluate the reasoning steps) instead of only outcome-based rewards.

  • Developing adaptive methods to set the reward ranges and weights online during training.

7.3. Personal Insights & Critique

This paper offers a highly pragmatic and well-executed solution to a tangible problem in LLM alignment. Its core strength lies in its simplicity and intuitive design.

  • The Power of Structured Hybrids: The central idea of not just mixing but structurally integrating different feedback signals is powerful. Stratified normalization is an elegant way to enforce a hierarchy of trust: the objective verifier is the ultimate arbiter of correctness, while the subjective RM is a subordinate advisor. This principle could be widely applied to other domains beyond math, such as:
    • Code Generation: Combining unit test results (verifier) with a reward model for code style, efficiency, or readability.
    • Factual Generation: Combining a fact-checking API (verifier) with a reward model for fluency and coherence.
  • Efficiency and Practicality: The finding that a massive reward model is not necessary is very encouraging. It suggests that intelligence can be baked into the reward shaping process itself, rather than just scaling up the components. This makes the approach more accessible and computationally efficient.
  • Critique and Potential Issues:
    • While effective, the method feels like a sophisticated patch on a more fundamental problem: our inability to create a single, perfect reward signal. It's an engineering solution, not a theoretical breakthrough in understanding reasoning itself.

    • The variance-aware weighting, while clever, assumes that high variance in RM scores correlates with informativeness. This is generally true but could fail. For example, a poorly calibrated RM might produce high variance on a trivial prompt, leading the model to over-focus on it.

    • The paper focuses on mathematical reasoning. While the principles seem general, its effectiveness in more subjective domains like creative writing or summarization, where objective verifiers are scarce, remains an open question.

      Overall, "Hybrid Reinforcement" is a strong piece of research that provides a clear, effective, and generalizable method for improving RL-based training of LLMs. It represents a mature step in the evolution of alignment techniques, moving from monolithic reward sources to intelligent, structured ensembles.
