Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
TL;DR Summary
The HERO framework integrates verifiable rewards with reward models to address the limitations of sparse feedback in large language model reasoning tasks. Using stratified normalization and variance-aware weighting, HERO significantly improves performance on mathematical reasoning benchmarks, outperforming both verifier-only and RM-only baselines on verifiable as well as hard-to-verify tasks.
Abstract
Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
In-depth Reading
1. Bibliographic Information
1.1. Title
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
The title concisely summarizes the paper's core thesis: in scenarios where feedback signals are infrequent and binary (sparse), supplementing them with more continuous and graded feedback (dense) leads to better learning outcomes. It points to a hybrid approach in the context of reinforcement learning.
1.2. Authors
Leitian Tao, Sharon Li, Jason E Weston, Ping Yu.
The authors are affiliated with FAIR at Meta and the University of Wisconsin-Madison. Jason Weston is a prominent researcher at Meta AI Research (FAIR), known for his extensive work in natural language processing, dialogue systems, and reinforcement learning. The affiliations with a top-tier industrial research lab and a major research university indicate a strong background in both theoretical and applied machine learning.
1.3. Journal/Conference
The paper is available on arXiv, a preprint server for academic papers, and was posted in October 2025. Given the topic and quality, it would be a suitable submission for top-tier AI/ML conferences such as NeurIPS, ICML, or ICLR. arXiv is a standard platform for disseminating research quickly, but papers on it are not yet peer-reviewed.
1.4. Publication Year
The paper carries an arXiv submission date of October 8, 2025. The version analyzed here was uploaded at 2025-10-08T17:09:41.000Z; the v3 links are given below.
1.5. Abstract
The abstract introduces the problem of post-training Large Language Models (LLMs) for reasoning tasks. It notes the increasing reliance on verifiable rewards—deterministic 0-1 signals (correct/incorrect)—which are reliable but brittle and sparse. This "all-or-nothing" feedback limits learning, as it fails to credit partially correct answers. The paper proposes that reward models (RMs), which provide continuous (dense) feedback, can be a complementary signal. The core contribution is HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals and RM scores. HERO uses two key techniques: stratified normalization to rescale RM scores within verifier-defined correctness groups (preserving correctness while adding nuance) and variance-aware weighting to focus on challenging prompts. The abstract concludes that HERO outperforms both RM-only and verifier-only methods on mathematical reasoning benchmarks, demonstrating that a hybrid approach combines the stability of verifiers with the nuance of RMs.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.07242v3
- PDF Link: https://arxiv.org/pdf/2510.07242v3
- Publication Status: This is a preprint on arXiv and has not yet undergone formal peer review for a conference or journal publication.
2. Executive Summary
2.1. Background & Motivation
The central problem this paper addresses is the suboptimal training of large language models (LLMs) for complex reasoning tasks like mathematics. The dominant method, Reinforcement Learning from Verifiable Rewards (RLVR), uses deterministic checkers (verifiers) to provide a binary 0 (incorrect) or 1 (correct) reward.
Why this is a problem:
- Brittleness: Verifiers are often too strict. They may incorrectly penalize a solution that is semantically correct but formatted differently (a "false negative"). This is common in math problems with multiple equivalent answer formats.
- Sparsity: The binary reward is sparse. For a given problem, if all generated solutions are incorrect, they all receive a reward of 0. The model gets no gradient signal to differentiate a nearly-correct solution from a completely nonsensical one. This stalls learning, especially on difficult problems where initial attempts are likely to be wrong.

On the other hand, Reward Models (RMs), trained on human preferences, provide a continuous or "dense" score. This captures nuances like partial correctness, clarity, and proximity to the correct answer. However, RMs can be noisy, misaligned with objective correctness, and prone to "reward hacking," where the LLM learns to generate text that gets a high score from the RM without actually being correct.

This creates a clear gap: one signal is reliable but sparse (verifier), while the other is dense but unreliable (RM). The paper's entry point is the insight that these two signals are complementary. The core challenge is how to combine them effectively to get the best of both worlds: the reliability of the verifier and the nuanced feedback of the RM.
2.2. Main Contributions / Findings
The paper's main contribution is HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework designed to structurally integrate sparse verifier rewards and dense reward model scores.
HERO's key components are:
- Stratified Normalization: This is the core mechanism for combining the two reward types. Instead of simply adding or averaging the scores, HERO first uses the verifier's binary label (0 or 1) to partition generated answers into two groups: "incorrect" and "correct." Then, within each group, it normalizes the RM scores to a small, predefined range. This ensures that a verifiably correct answer always receives a higher reward than any incorrect answer, preserving the verifier's ground truth. The RM's role is demoted to a "tie-breaker," providing a fine-grained quality signal within the verifier-defined categories.
- Variance-Aware Weighting: This technique dynamically adjusts the learning signal's strength based on prompt difficulty. For a given prompt, if the RM scores for different generated answers have high variance, it implies the prompt is challenging and the RM is providing useful, discriminative feedback. HERO up-weights the reward for such prompts. Conversely, if all answers get similar RM scores (low variance), the prompt is likely too easy or too hard, providing little learning signal, and is down-weighted. This focuses training effort on the most informative samples.
Key Findings:
- Consistent Outperformance: Across various mathematical reasoning benchmarks (both easy-to-verify and hard-to-verify), HERO consistently outperforms baselines that use only a verifier or only a reward model.
- Robustness Across Models: The benefits of HERO generalize across different LLM backbones, including a strong model (Qwen3-4B-Base) and a weaker one (OctoThinker-8B-Hybrid-Base), demonstrating its wide applicability.
- Solving Sparsity and Instability: HERO effectively addresses the sparsity problem of verifiers by providing dense gradients even when all solutions are incorrect. It also prevents the instability of RM-only training by anchoring the reward to the verifier's objective correctness. The gains are particularly significant on "hard-to-verify" tasks where verifiers are most brittle.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Large Language Models (LLMs) for Reasoning
LLMs are deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data. While they excel at generating fluent text, complex reasoning (e.g., solving multi-step math problems) remains a frontier. Post-training, often called alignment, is used to refine a base LLM's capabilities for specific tasks like following instructions or performing logical reasoning. This paper focuses on a post-training method using reinforcement learning.
3.1.2. Reinforcement Learning (RL)
RL is a machine learning paradigm where an agent (here, the LLM) learns to make decisions by interacting with an environment. The agent takes actions (generating tokens of text) to reach a certain state, and receives a reward signal in return. The agent's goal is to learn a policy—a strategy for choosing actions—that maximizes the cumulative reward over time. In the context of LLMs, the policy is the model's probability distribution over the next token to generate.
3.1.3. Reward Models (RMs)
A Reward Model is a separate model trained to predict the quality of an LLM's output. It takes a prompt and a generated response as input and outputs a scalar score.
- Training: RMs are typically trained on a dataset of human preferences. For a given prompt, humans are shown two or more responses and asked to rank them. The RM is trained to assign a higher score to the preferred response.
- Loss Function: The training often uses a pairwise loss based on the Bradley-Terry model, which models the probability that one item is preferred over another. The loss function used in the paper is:
$ \mathcal{L}_R = - \mathbb{E}_{(x, y_c, y_r) \in \mathcal{D}} \left[ \log \sigma\big( r(x, y_c) - r(x, y_r) \big) \right] $
- $x$: The input prompt.
- $y_c$: The chosen (preferred) response.
- $y_r$: The rejected (less preferred) response.
- $r(x, y)$: The scalar score from the reward model for response $y$ to prompt $x$.
- $\sigma$: The sigmoid function, which squashes the difference in scores to a probability between 0 and 1.

Minimizing this loss pushes the score difference between chosen and rejected responses to be as large and positive as possible, aligning the RM with human preferences. Because the RM can assign any value on a continuum, it provides a dense reward signal that captures fine-grained quality differences.
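For concreteness, here is a minimal PyTorch-style sketch of this pairwise loss; the function name and dummy scores are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_c) - r(x, y_r)).

    Both inputs are scalar reward-model scores of shape (batch,).
    Minimizing this pushes chosen scores above rejected scores.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Usage with dummy scores (in practice these come from the reward model's scalar head):
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.5, 1.0])
print(pairwise_rm_loss(chosen, rejected))  # a single scalar loss value
```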
3.1.4. Rule-Based Verifiers
A verifier is a deterministic program or function that checks if an LLM's output meets an objective criterion. For math problems, a verifier might:
- Parse the final numerical answer from the LLM's text.
- Normalize it (e.g., remove commas, convert to a standard format).
- Compare it to the ground-truth answer.

The output is a binary signal, often called a sparse reward, because it offers only two states (correct/incorrect) with no intermediate values. The paper defines this verifier function as: $ \psi(x, y_i, y_{\mathrm{ref}}) = \begin{cases} 1, & \text{if } y_i \text{ is equivalent to } y_{\mathrm{ref}} \text{ given } x, \\ 0, & \text{otherwise.} \end{cases} $
- $x$: The input prompt.
- $y_i$: The model's generated response.
- $y_{\mathrm{ref}}$: The ground-truth reference solution.
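The following toy Python sketch illustrates what such a rule-based verifier might look like; the parsing and normalization here are deliberately simplistic assumptions (real verifiers such as `math_verify` handle many more formats), but they make the brittleness problem tangible.

```python
import re
from fractions import Fraction

def extract_final_answer(text: str) -> str:
    """Toy parser: prefer the last \\boxed{...} expression, else the last bare number."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:/\d+)?(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""

def normalize(ans: str):
    """Toy normalization: strip commas/spaces and convert to an exact Fraction when possible."""
    ans = ans.replace(",", "").replace(" ", "")
    try:
        return Fraction(ans)
    except (ValueError, ZeroDivisionError):
        return ans  # fall back to raw string comparison

def verifier(response: str, reference: str) -> int:
    """Binary sparse reward: 1 if the parsed answers match after normalization, else 0."""
    return int(normalize(extract_final_answer(response)) == normalize(reference))

print(verifier("The answer is \\boxed{1/2}.", "0.5"))     # 1: both normalize to Fraction(1, 2)
print(verifier("The roots are x = 2 or x = 3.", "3, 2"))  # 0: multi-part answers break the naive check
```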
3.2. Previous Works
3.2.1. Reinforcement Learning from Verifiable Rewards (RLVR)
RLVR is a specific application of RL where the reward signal comes from a verifier instead of a human-preference-trained RM. This approach has been successful in domains with objective correctness, like code generation (where unit tests act as verifiers) and mathematical reasoning. It provides a stable and reliable learning signal, but as the paper argues, it is often too sparse and brittle.
3.2.2. Group Relative Policy Optimization (GRPO)
GRPO is an advanced RL algorithm that this paper builds upon. Standard RL algorithms like PPO (Proximal Policy Optimization) typically learn from a single response (trajectory) per prompt. GRPO improves upon this by generating multiple candidate responses for the same prompt and comparing them as a group.
- How it Works: Instead of calculating an absolute reward for each response, GRPO focuses on the relative advantage of each response within the group. It normalizes the rewards and uses them to identify which responses are better than the group average. This stabilizes learning and makes it more sample-efficient.
- Relevance to the Paper: The authors use GRPO as their underlying RL algorithm. However, they note a key weakness: when a verifier gives all responses for a prompt the same binary reward (e.g., all are incorrect and receive 0), the relative advantages become zero for every response. In this case, GRPO provides no useful policy gradient, and learning stalls. This is a primary motivation for HERO, which introduces a dense signal to provide gradients even in these situations.
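A minimal sketch of the group-relative advantage computation described above (the normalization details are simplified assumptions, not the exact GRPO implementation) makes this failure case explicit:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within a group of responses to the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# With a binary verifier, an all-incorrect group yields zero advantages -> no gradient:
print(group_relative_advantages(np.array([0.0, 0.0, 0.0, 0.0])))     # [0. 0. 0. 0.]
# A dense, stratified reward still differentiates the same group of failures:
print(group_relative_advantages(np.array([0.02, 0.07, 0.01, 0.05]))) # non-zero advantages
```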
3.3. Technological Evolution
- Supervised Fine-Tuning (SFT): Early LLM alignment involved fine-tuning the model on a high-quality dataset of prompt-response pairs. This is effective but limited by the quality and diversity of the dataset.
- Reinforcement Learning from Human Feedback (RLHF): To go beyond static datasets, RLHF was introduced. It uses a reward model trained on human preferences to guide the LLM's learning via RL. This allows the model to explore and learn from its own outputs.
- Reinforcement Learning from Verifiable Rewards (RLVR): For domains with objective truth (math, code), RLVR replaced the subjective RM with a deterministic verifier. This brought more stability and reliability but introduced the problems of sparsity and brittleness.
- Hybrid Approaches (This Paper): The current paper represents the next logical step. It recognizes the complementary strengths of RMs and verifiers and proposes a structured way to combine them. Instead of choosing one or the other, HERO uses the verifier as a "ground truth anchor" and the RM as a "fine-grained refiner."
3.4. Differentiation Analysis
- vs. Verifier-Only (RLVR): RLVR provides a sparse, all-or-nothing signal. HERO is superior because its dense component from the RM allows it to distinguish between different "bad" answers or different "good" answers, providing a learning gradient where RLVR would provide none.
- vs. RM-Only (RLHF): RM-only training can be unstable and prone to reward hacking. HERO is superior because its `stratified normalization` strictly subordinates the RM score to the verifier's judgment. A response deemed incorrect by the verifier can never get a higher reward than a correct one, no matter how high its RM score is. This anchors the training to objective correctness.
- vs. Naive Combination: The paper shows that simply adding or averaging the verifier and RM scores is ineffective. This naive approach leads to a noisy and misaligned reward signal. HERO's `stratified normalization` is a principled, structured combination that avoids this instability.
- vs. GRPO: HERO is an enhancement of GRPO, not a replacement. It solves a key failure case of GRPO (when all rewards in a group are identical) by injecting a dense, intra-group reward differential. The `variance-aware weighting` is another novel addition not present in the original GRPO.
4. Methodology
4.1. Principles
The core design principle of HERO is to use the reliable but sparse signal from a rule-based verifier as the primary guide for correctness, while leveraging the dense but potentially noisy signal from a reward model as a supplementary refiner to provide more granular feedback. The methodology is built on two key innovations designed to integrate these signals in a stable and effective manner: stratified normalization and variance-aware advantage reweighting.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Motivation: The Limits of Verifiers and RMs
The authors first motivate their hybrid approach by analyzing the trade-offs of existing methods on a challenging math benchmark (HardVerify_Math). The following table, transcribed from Table 1 of the paper, summarizes their findings.
The following are the results from Table 1 of the original paper:
| Type | Verifier | Recall ↑ | Precision ↑ | FPR ↓ | Acc. ↑ |
|---|---|---|---|---|---|
| Rule-based | math_reward (verl) | 10.1 | 97.5 | 0.3 | 53.6 |
| Rule-based | math_verify (verl) | 68.4 | 100.0 | 0.0 | 83.7 |
| Rule-based | math_verify (library) | 38.6 | 96.1 | 1.6 | 67.6 |
| Generative Model-based | TIGER-Lab/general-verifier | 49.5 | 89.3 | 6.3 | 70.9 |
| RM-based | AceMath-7B-RM w/ threshold 1 | 91.7 | 67.7 | 46.4 | 73.2 |
| RM-based | AceMath-7B-RM w/ threshold 3 | 84.2 | 72.7 | 33.5 | 75.6 |
| RM-based | AceMath-7B-RM w/ threshold 5 | 73.8 | 76.6 | 23.9 | 74.9 |
| RM-based | AceMath-7B-RM w/ threshold 7 | 62.4 | 78.5 | 18.1 | 71.9 |
Analysis:
- Rule-based verifiers like `math_verify (verl)` achieve perfect or near-perfect precision (they are rarely wrong when they label an answer as correct) but suffer from low recall (they miss many genuinely correct answers that are formatted differently). This confirms their reliability but also their brittleness.
- RM-based verifiers (using a score threshold) show the opposite trend. They can achieve very high recall (identifying most correct answers) but at the cost of much lower precision and a high false-positive rate (FPR), meaning they often label incorrect answers as correct. This confirms their ability to provide broad coverage but also their unreliability.
This empirical analysis highlights the fundamental tension and motivates a hybrid solution that can harness the precision of verifiers and the recall/coverage of RMs.
4.2.2. HERO: Hybrid Ensemble Reward Optimization
HERO addresses this tension through a two-stage reward shaping process.
Step 1: Stratified Normalization

This is the core of HERO's hybrid reward calculation. Instead of naively mixing scores, it first uses the verifier's binary reward to stratify the generated responses for a given prompt into two groups: incorrect (verifier reward 0) and correct (verifier reward 1). Then, it normalizes the continuous reward model scores within each group and rescales them to a small, non-overlapping range.
The stratified reward is computed by min-max normalizing the RM score within each verifier-defined stratum and mapping it to a small, non-overlapping range per stratum.
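A plausible form of this reward, written in assumed notation ($\tilde{r}_i$ for the stratified reward of response $y_i$, $r^{\text{RM}}_i$ for its RM score, $\tau_{\text{neg}}, \tau_{\text{pos}}$ for the range sizes, $\epsilon$ for a small stabilizer) and assuming the incorrect stratum is mapped to $[0, \tau_{\text{neg}}]$ and the correct stratum to $[1 - \tau_{\text{pos}}, 1]$, is sketched below; the paper's exact ranges and notation may differ.

$ \tilde{r}_i = \begin{cases} \tau_{\text{neg}} \cdot \dfrac{r^{\text{RM}}_i - r^{\text{RM}}_{\min}}{r^{\text{RM}}_{\max} - r^{\text{RM}}_{\min} + \epsilon}, & \text{if the verifier labels } y_i \text{ incorrect}, \\ (1 - \tau_{\text{pos}}) + \tau_{\text{pos}} \cdot \dfrac{r^{\text{RM}}_i - r^{\text{RM}}_{\min}}{r^{\text{RM}}_{\max} - r^{\text{RM}}_{\min} + \epsilon}, & \text{if the verifier labels } y_i \text{ correct}, \end{cases} $

where the minima and maxima are taken over responses in the same stratum.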
Symbol Explanation:
- The final stratified reward assigned to each response for the given prompt.
- The binary reward from the rule-based verifier (0 for incorrect, 1 for correct).
- The continuous score from the reward model.
- The minimum and maximum RM scores within the same stratum (i.e., among all responses that received the same verifier label).
- Two small positive hyperparameters (e.g., 0.05 or 0.1) that control the size of the reward range for the incorrect and correct groups, respectively.
- A small constant that prevents division by zero when all RM scores in a group are identical.
How it works (integrated explanation):
- If a response is incorrect according to the verifier:
  - Its RM score is min-max normalized relative to the other incorrect responses, yielding a value between 0 and 1.
  - This normalized value is then mapped into the small reward range reserved for the incorrect group: the worst incorrect answer (lowest RM score) lands at the bottom of that range, and the "best" incorrect answer (highest RM score) at the top of it.
- If a response is correct according to the verifier:
  - The same min-max normalization is performed, but relative to the other correct responses.
  - The result is mapped into the strictly higher reward range reserved for the correct group: the "worst" correct answer lands at the bottom of that range and the best correct answer at the top.

Because the two range-size hyperparameters are kept small, the framework guarantees that the highest possible reward for an incorrect answer is always less than the lowest possible reward for a correct answer. This preserves the verifier's correctness guarantee while allowing the RM to provide a fine-grained quality ranking within each group. This process is illustrated in the figure below from the paper.

Step 2: Variance-Aware Advantage Reweighting

The second innovation addresses the fact that not all prompts are equally useful for training. Some are too easy (all generated responses are correct) or too hard (all are incorrect), providing little information. HERO prioritizes prompts where the model shows high uncertainty or variance in its responses, as these are the most informative for learning.

The difficulty weight for each prompt is computed with a logistic function of how the spread of the RM scores on that prompt compares to a running average.
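One plausible form of such a logistic weight, written in assumed notation ($w$ for the prompt weight, $\sigma_{\text{RM}}$ for the per-prompt standard deviation of RM scores, $\bar{\sigma}$ for its running mean, $w_{\min}$ and $w_{\max}$ for the weight bounds, $k$ for the steepness) and consistent with the behavior described below, is the following; the paper's exact parameterization may differ.

$ w = w_{\min} + \dfrac{w_{\max} - w_{\min}}{1 + \exp\big(-k \, (\sigma_{\text{RM}} - \bar{\sigma})\big)} $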
Symbol Explanation:
- The weight assigned to the current prompt, applied multiplicatively to its stratified rewards.
- The standard deviation of the reward model scores across all candidate responses for the current prompt. This measures the "difficulty" or "ambiguity" of the prompt.
- A running mean of these standard deviations across recent prompts.
- Hyperparameters defining the minimum and maximum possible weights (e.g., 0.5 and 2.0). These bound the reweighting effect.
- A hyperparameter controlling the steepness of the logistic curve, determining how quickly the weight transitions from minimum to maximum.
How it works (integrated explanation):
- If the RM-score variance for the current prompt is much higher than the running average, the logistic term saturates toward its upper bound and the weight approaches its maximum. The prompt is up-weighted.
- If the variance is much lower than the running average, the weight approaches its minimum. The prompt is down-weighted.

Finally, this per-prompt weight is multiplied by the stratified reward to obtain the final reward used for the policy update. This ensures that the model learns more from challenging prompts where it struggles and the RM provides a strong discriminative signal.
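The sketch below implements a logistic weighting of this form, consistent with the reconstruction above; the parameter values and function signature are illustrative assumptions.

```python
import numpy as np

def variance_aware_weight(rm_scores, running_mean_std, w_min=0.5, w_max=2.0, k=8.0):
    """Logistic weight for one prompt based on the spread of its RM scores.

    rm_scores: RM scores of all candidate responses for this prompt.
    running_mean_std: running mean of per-prompt standard deviations seen so far.
    Returns a weight in (w_min, w_max) that multiplies the prompt's stratified rewards.
    """
    sigma = float(np.std(rm_scores))
    gap = sigma - running_mean_std  # positive when the prompt is unusually ambiguous
    return w_min + (w_max - w_min) / (1.0 + np.exp(-k * gap))

# A high-variance (informative) prompt is up-weighted, a low-variance prompt is down-weighted:
print(variance_aware_weight([0.1, 2.5, -1.0, 1.8], running_mean_std=0.5))    # close to w_max
print(variance_aware_weight([1.0, 1.02, 0.98, 1.01], running_mean_std=0.5))  # close to w_min
```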
5. Experimental Setup
5.1. Datasets
- Training Datasets: The training data was derived from the OPENMATHREASONING benchmark. The authors created three distinct training regimes to test generalization:
  - Easy-to-verify: 2,000 problems where the final answer can be checked deterministically by a rule-based verifier (`math_verifier`).
  - Hard-to-verify: 2,000 problems where the answers have flexible formats (e.g., lists vs. sets, different orderings) that make rule-based verification difficult.
  - Mixed: A combination of 1,000 easy-to-verify and 1,000 hard-to-verify problems.
- Evaluation Datasets: A diverse set of benchmarks was used to evaluate performance thoroughly.
  - Easy-to-verify Test Sets:
    - MATH500: A subset of the MATH dataset of challenging high school competition math problems.
    - AMC: Problems from the American Mathematics Competitions.
    - Minerva: A dataset of quantitative reasoning problems.
    - Olympiad: Problems from math olympiads, known for their difficulty.
  - Hard-to-verify Test Sets:
    - HardVerify-Math: A curated benchmark of 250 problems known to be difficult for verifiers, including Olympiad questions and problems with complex answer formats.
    - TextBookReasoning: An additional dataset of hard-to-verify problems curated by the authors to be particularly challenging.
5.2. Evaluation Metrics
The paper uses different metrics depending on the verifiability of the task.
5.2.1. For Easy-to-verify Tasks: pass@1
- Conceptual Definition: `pass@1` measures the percentage of problems for which the model generates a correct solution on its first attempt. For each problem, the model generates one solution, and if that solution passes the verifier, it is counted as a success. It is a strict metric for generative performance.
- Mathematical Formula: In the paper's setup, multiple candidates are generated but only the first one is evaluated, so `pass@1` is calculated as: $ \text{pass@1} = \frac{\text{Number of problems solved correctly on the first attempt}}{\text{Total number of problems}} $
- Symbol Explanation: The numerator is the count of problems where the first generated solution is deemed correct by the `math_verifier`.
5.2.2. For Hard-to-verify Tasks: LLM-as-a-judge
- Conceptual Definition: Since rule-based verifiers are unreliable for these tasks, the authors use a powerful, advanced LLM (GPT-4o) as a "judge" to evaluate correctness. The judge is given the problem, the ground-truth answer, and the model's generated answer, and is prompted to decide if the model's answer is equivalent to the ground truth. The reported score is the percentage of answers deemed correct by the LLM judge.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of answers judged correct by GPT-4o}}{\text{Total number of problems}} $
- Symbol Explanation: The numerator is the count of problems where GPT-4o outputs "Final Decision: Yes".
5.2.3. For Verifier Analysis (Table 1): Precision, Recall, FPR, Accuracy
- Conceptual Definition: These are standard binary classification metrics used to evaluate the performance of the verifiers themselves, where "positive" means labeling an answer as correct.
- Precision: Of all the answers the verifier labeled as correct, what fraction was actually correct? (Measures reliability).
- Recall: Of all the truly correct answers, what fraction did the verifier successfully identify? (Measures coverage).
- False Positive Rate (FPR): Of all the truly incorrect answers, what fraction did the verifier mistakenly label as correct?
- Accuracy: What fraction of all answers (both correct and incorrect) did the verifier label correctly?
- Mathematical Formulas: $ \text{Precision} = \frac{TP}{TP + FP} $ $ \text{Recall} = \frac{TP}{TP + FN} $ $ \text{FPR} = \frac{FP}{FP + TN} $ $ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $
- Symbol Explanation:
  - `TP` (True Positives): Number of correct answers correctly identified as correct.
  - `FP` (False Positives): Number of incorrect answers incorrectly identified as correct.
  - `TN` (True Negatives): Number of incorrect answers correctly identified as incorrect.
  - `FN` (False Negatives): Number of correct answers incorrectly identified as incorrect.
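As a quick illustration of how these metrics follow from a verifier's binary decisions, here is a small self-contained Python helper; the function and example data are illustrative, not taken from the paper.

```python
def verifier_metrics(predictions, ground_truth):
    """Compute precision, recall, FPR, and accuracy for a binary verifier.

    predictions: 0/1 labels produced by the verifier for each answer.
    ground_truth: true 0/1 correctness of each answer.
    """
    tp = sum(p == 1 and g == 1 for p, g in zip(predictions, ground_truth))
    fp = sum(p == 1 and g == 0 for p, g in zip(predictions, ground_truth))
    tn = sum(p == 0 and g == 0 for p, g in zip(predictions, ground_truth))
    fn = sum(p == 0 and g == 1 for p, g in zip(predictions, ground_truth))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / len(predictions),
    }

# A strict verifier that misses many correct answers has high precision but low recall,
# mirroring the rule-based rows of Table 1:
print(verifier_metrics(predictions=[1, 0, 0, 0], ground_truth=[1, 1, 1, 0]))
```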
5.3. Baselines
The paper compares HERO against several important baselines:
- Base Model: The original, pre-trained LLM without any fine-tuning (`Qwen3-4B-Base`, `OctoThinker-8B-Hybrid-Base`).
- SFT Cold Start Model: The base model after supervised fine-tuning (SFT) on a small set of correct solutions. This is the starting point for all RL experiments.
- RM-only: The SFT model further trained using RL (GRPO) where the reward comes only from a reward model (`AceMath-RM-7B`).
- Verifier-only: The SFT model further trained using RL (GRPO) where the reward comes only from a binary rule-based verifier (`math_verify (verl)`).
- Model-based Verifiers: Other models used as verifiers, including `TIGER-Lab/general-verifier` and `Qwen2.5-7B-Instruct`, for a more comprehensive comparison.
6. Results & Analysis
6.1. Core Results Analysis
The main results demonstrate HERO's effectiveness across different LLM backbones and training data regimes.
6.1.1. Performance on Qwen3-4B-Base
The following are the results from Table 2 of the original paper:
Columns MATH500 through the first Avg. are easy-to-verify tasks; HVM (HardVerify-Math) and TBR (TextBookReasoning) are hard-to-verify tasks.

| Method | MATH500 | AMC | Minerva | Olympiad | Avg. ↑ | HVM | TBR | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B-Base | 67.5 | 44.1 | 29.4 | 32.1 | 43.3 | 45.2 | 40.2 | 42.7 |
| SFT Cold Start Model | 69.1 | 50.3 | 39.1 | 34.3 | 48.2 | 50.8 | 43.3 | 47.1 |
| **Training with easy-to-verify samples** | | | | | | | | |
| AceMath-7B-RM | 80.2 | 61.6 | 40.6 | 43.3 | 56.4 | 57.2 | 52.0 | 54.6 |
| math_verify (verl) | 82.3 | 61.3 | 44.0 | 45.5 | 58.3 | 61.0 | 53.1 | 57.1 |
| HERO (Ours) | 85.4 | 69.4 | 44.5 | 48.9 | 62.0 | 73.2 | 59.3 | 66.3 |
| **Training with hard-to-verify samples** | | | | | | | | |
| AceMath-7B-RM | 79.6 | 58.8 | 39.9 | 42.1 | 55.1 | 59.2 | 48.2 | 53.7 |
| math_verify (verl) | 76.2 | 46.6 | 28.7 | 38.2 | 47.4 | 58.4 | 50.0 | 54.2 |
| HERO (Ours) | 80.0 | 63.4 | 40.7 | 43.1 | 56.8 | 59.0 | 54.0 | 56.5 |
| **Training with mixed samples** | | | | | | | | |
| AceMath-7B-RM | 79.6 | 58.8 | 39.9 | 42.1 | 55.1 | 58.4 | 49.6 | 54.0 |
| math_verify (verl) | 81.3 | 61.3 | 38.0 | 43.9 | 56.1 | 62.4 | 55.3 | 58.9 |
| HERO (Ours) | 81.6 | 64.4 | 42.1 | 47.0 | 58.8 | 71.4 | 56.7 | 64.1 |
Analysis:
- Consistent Superiority: In all three training settings (easy, hard, mixed), HERO achieves the highest average score on both easy-to-verify and hard-to-verify tasks.
- Largest Gains on Hard-to-verify Tasks: The advantage of HERO is most dramatic when evaluating on hard-to-verify tasks. When trained on easy-to-verify data, HERO achieves an average score of 66.3, a massive improvement of +9.2 points over the verifier-only baseline (57.1) and +11.7 points over the RM-only baseline (54.6). This shows that HERO's structured reward helps the model generalize its reasoning skills to problems with formats that stymie simple verifiers.
- Stability: When training on hard-to-verify samples, the verifier-only baseline (`math_verify`) performs poorly on verifiable tasks (average of 47.4), even worse than the SFT model. This is because the verifier is brittle and provides misleading or sparse signals on this data. HERO, by contrast, remains stable and achieves the best score (56.8), showing its robustness to noisy training data.
6.1.2. Performance on OctoThinker-8B-Hybrid-Base
The following are the results from Table 3 of the original paper:
Columns MATH500 through the first Avg. are easy-to-verify tasks; HVM (HardVerify-Math) and TBR (TextBookReasoning) are hard-to-verify tasks.

| Method | MATH500 | AMC | Minerva | Olympiad | Avg. ↑ | HVM | TBR | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|
| OctoThinker-8B-Hybrid-Base | 32.0 | 15.3 | 9.1 | 11.0 | 16.9 | 26.0 | 21.1 | 23.6 |
| SFT Cold Start Model | 56.0 | 35.9 | 19.7 | 21.6 | 33.3 | 27.6 | 26.4 | 27.0 |
| **Training with easy-to-verify samples** | | | | | | | | |
| AceMath-7B-RM | 62.3 | 38.4 | 26.2 | 25.5 | 38.1 | 29.6 | 27.8 | 28.7 |
| math_verify (verl) | 60.1 | 39.4 | 26.7 | 24.1 | 37.6 | 31.6 | 28.9 | 30.3 |
| HERO (Ours) | 63.0 | 40.6 | 30.1 | 26.7 | 40.1 | 28.4 | 36.7 | 32.6 |
| **Training with hard-to-verify samples** | | | | | | | | |
| AceMath-7B-RM | 60.7 | 33.8 | 22.4 | 24.9 | 35.4 | 32.0 | 29.8 | 30.9 |
| math_verify (verl) | 60.0 | 29.7 | 23.9 | 24.8 | 34.6 | 28.8 | 26.7 | 27.8 |
| HERO (Ours) | 64.9 | 41.6 | 27.9 | 29.6 | 41.0 | 32.4 | 36.7 | 34.6 |
| **Training with mixed samples** | | | | | | | | |
| AceMath-7B-RM | 60.2 | 34.4 | 24.0 | 23.8 | 35.6 | 30.8 | 29.3 | 30.1 |
| math_verify (verl) | 59.3 | 33.7 | 24.7 | 24.0 | 35.4 | 27.6 | 28.7 | 28.2 |
| HERO (Ours) | 65.2 | 38.1 | 28.1 | 29.3 | 40.2 | 34.8 | 31.6 | 33.2 |
Analysis:
This table shows that HERO's benefits are not limited to strong base models. OctoThinker starts from a much lower baseline performance. HERO provides substantial absolute and relative gains, lifting the model's performance significantly across all settings. For example, when training on hard-to-verify samples, HERO achieves a 41.0 average on verifiable tasks, compared to 35.4 for RM-only and 34.6 for verifier-only. This confirms that the hybrid reward structure is a generally effective method for improving reasoning.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Importance of Negative vs. Positive Dense Rewards
The following figure from the paper (Figure 2, left panel) analyzes the contribution of dense rewards in the correct (positive) versus incorrect (negative) groups.

Analysis: The ablation shows that providing a dense reward signal for the negative group (incorrect answers) is particularly critical. When only negative dense rewards are used (and positive rewards are sparse), performance increases significantly, especially on hard-to-verify tasks (from 62.2 to 68.4). This suggests that much of the learning comes from being able to differentiate between various types of errors. The ability to distinguish a "nearly correct" attempt from a "wildly incorrect" one provides a much richer learning signal than simply labeling all of them as "wrong."
6.2.2. Impact of the Reward Range for Incorrect Answers
The right panel of Figure 2, shown above, studies the impact of the hyperparameter that controls the size of the reward range assigned to incorrect answers.
- For verifiable tasks, a smaller range works best. This keeps the RM's influence minimal, preserving the stability of the high-precision verifier signal.
- For mixed tasks, a larger range is better. In this setting, the verifier fails more often, so a stronger signal from the RM is needed to guide learning on hard-to-verify samples.
6.2.3. Impact of Variance-Aware Reweighting
The following are the results from Table 4 of the original paper:
| Methods | Easy-to-verify | Hard-to-verify |
|---|---|---|
| w/o reweighting | 60.8 | 69.4 |
| w reweighting | 62.0 | 73.2 |
Analysis:
This ablation clearly demonstrates the effectiveness of the variance-aware reweighting mechanism. Adding this component improves performance on both task types, with a particularly large gain of +3.8 points on hard-to-verify tasks. This confirms the intuition that focusing training capacity on more ambiguous, high-variance prompts leads to more robust and generalizable learning.
6.2.4. Impact of Reward Model Size
The following are the results from Table 5 of the original paper:
| Reward model | Easy-to-verify | Hard-to-verify |
|---|---|---|
| AceMath-RM-7B | 62.0 | 73.2 |
| AceMath-RM-72B | 62.8 | 71.4 |
Analysis: This experiment shows that replacing the 7B parameter reward model with a much larger 72B model provides almost no benefit. Performance on verifiable tasks sees a negligible increase, while performance on hard-to-verify tasks actually decreases slightly. This is a powerful finding: the success of HERO comes from its structured reward formulation, not from the raw power or scale of the reward model. This makes HERO an efficient approach, as it does not require massive RMs to be effective.
6.2.5. Naive Combination vs. HERO
The paper also demonstrates that simply mixing the two reward signals is not enough. The following are the results from Table 9 of the original paper:
| Methods | Easy-to-verify | Hard-to-verify |
|---|---|---|
| Reward combine (α=0.1) | 57.6 | 60.2 |
| Reward combine (α=0.5) | 58.7 | |
| Reward combine (α=0.9) | 55.9 | 60.4 |
| HERO (Ours) | 62.0 | 73.2 |
Analysis: A naive weighted sum of the verifier and RM rewards performs far worse than HERO and is not consistently better than the individual baselines. This confirms that the stratified normalization is the key innovation that enables stable and effective integration, preventing the noisy RM signal from interfering with the verifier's reliable correctness signal.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully identifies a critical weakness in current methods for training LLMs on reasoning tasks: the trade-off between reliable but sparse verifier rewards and dense but noisy reward model signals. The proposed solution, HERO (Hybrid Ensemble Reward Optimization), provides a principled and effective framework for combining these two signal types.
Through two key innovations—stratified normalization to anchor RM scores to verifier-defined correctness groups and variance-aware weighting to focus learning on the most informative prompts—HERO achieves the best of both worlds. It retains the stability and objectivity of verifiers while leveraging the nuanced, fine-grained feedback of reward models to escape learning plateaus caused by sparse rewards. Empirical results on diverse mathematical reasoning benchmarks and across different base models consistently show that HERO outperforms both verifier-only and RM-only approaches, with particularly strong gains in generalizing to hard-to-verify problems.
7.2. Limitations & Future Work
The authors acknowledge several limitations and areas for future research:
- Verifier Dependency: HERO's effectiveness still relies on the existence of a reasonably good rule-based verifier. In domains where no such verifier exists or where it is highly unreliable, the benefits of stratification would be diminished.
- RM Calibration: The reward model is trained on verifiable data, so it may be miscalibrated for hard-to-verify formats. While HERO constrains the RM, residual biases could still be exploited by the policy.
- Hyperparameter Sensitivity: The framework introduces new hyperparameters (the reward-range sizes, the weight bounds, the logistic steepness, etc.) that require tuning for optimal performance, which can add complexity to the training process.
- Evaluation on Hard Tasks: The evaluation of hard-to-verify tasks relies on an LLM-as-a-judge protocol, which can be noisy and subject to its own biases.

Future work could explore:
- Improving verifier coverage with hybrid symbolic-learned methods.
- Incorporating process-level rewards (which evaluate the reasoning steps) instead of only outcome-based rewards.
- Developing adaptive methods to set the reward ranges and weights online during training.
7.3. Personal Insights & Critique
This paper offers a highly pragmatic and well-executed solution to a tangible problem in LLM alignment. Its core strength lies in its simplicity and intuitive design.
- The Power of Structured Hybrids: The central idea of not just mixing but structurally integrating different feedback signals is powerful. `Stratified normalization` is an elegant way to enforce a hierarchy of trust: the objective verifier is the ultimate arbiter of correctness, while the subjective RM is a subordinate advisor. This principle could be widely applied to other domains beyond math, such as:
  - Code Generation: Combining unit test results (verifier) with a reward model for code style, efficiency, or readability.
  - Factual Generation: Combining a fact-checking API (verifier) with a reward model for fluency and coherence.
- Efficiency and Practicality: The finding that a massive reward model is not necessary is very encouraging. It suggests that intelligence can be baked into the reward shaping process itself, rather than just scaling up the components. This makes the approach more accessible and computationally efficient.
- Critique and Potential Issues:
  - While effective, the method feels like a sophisticated patch on a more fundamental problem: our inability to create a single, perfect reward signal. It is an engineering solution, not a theoretical breakthrough in understanding reasoning itself.
  - The variance-aware weighting, while clever, assumes that high variance in RM scores correlates with informativeness. This is generally true but could fail. For example, a poorly calibrated RM might produce high variance on a trivial prompt, leading the model to over-focus on it.
  - The paper focuses on mathematical reasoning. While the principles seem general, its effectiveness in more subjective domains like creative writing or summarization, where objective verifiers are scarce, remains an open question.

Overall, "Hybrid Reinforcement" is a strong piece of research that provides a clear, effective, and generalizable method for improving RL-based training of LLMs. It represents a mature step in the evolution of alignment techniques, moving from monolithic reward sources to intelligent, structured ensembles.