JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
TL;DR Summary
JURY-RL separates answer proposal (majority voting over model rollouts) from reward disposal (formal theorem proving), and falls back to a ResZero reward when verification is inconclusive. This stabilizes RL training, outperforms label-free baselines on reasoning and code tasks, and often rivals supervised training with ground-truth rewards.
Abstract
Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but its scalability is hampered by the high cost of human-annotated labels. Label-free alternatives, such as majority voting or LLM-as-a-judge, are susceptible to false positives that lead to reward hacking and training collapse. We introduce JURY-RL, a label-free RLVR framework that separates answer proposal from reward disposal: votes from model rollouts propose a consensus answer, while a formal theorem prover disposes the final reward. Specifically, a rollout is rewarded only if the majority-voted answer is formally verified by a Lean prover. When verification is inconclusive, we activate our proposed ResZero (Residual-Zero) reward: it drops the unverifiable majority proposal and assigns a zero-mean, variance-preserving reward to the remaining (residual) answers.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
1.2. Authors
Anonymous authors (Paper under double-blind review)
1.3. Journal/Conference
The paper is published at OpenReview, indicated by the provided link https://openreview.net/forum?id=tnfvv9Wsw9. OpenReview is a platform for conference submissions, often used for machine learning and artificial intelligence conferences, particularly those employing double-blind peer review. Publication on OpenReview implies it is undergoing or has undergone peer review for a reputable conference, though the specific conference is not named in the provided text.
1.4. Publication Year
2025 (Published at UTC: 2025-10-08T00:00:00.000Z)
1.5. Abstract
The paper introduces JURY-RL, a novel framework designed to overcome the scalability challenges of Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models (LLMs), which typically suffer from the high cost of human-annotated labels. Existing label-free methods, such as majority voting or LLM-as-a-judge, are prone to false positives, leading to reward hacking and training collapse. JURY-RL addresses this by separating answer proposal (via majority voting from model rollouts) from reward disposal (via a formal theorem prover). A rollout receives a positive reward only if its majority-voted answer is formally verified by a Lean prover. If verification is inconclusive, JURY-RL activates a ResZero (Residual-Zero) reward mechanism. This mechanism drops the unverifiable majority proposal and assigns a zero-mean, variance-preserving reward to the remaining (residual) answers. This design ensures a stable optimization gradient for RL algorithms without reinforcing spurious consensus. Experiments across mathematical reasoning, code generation, and multi-task benchmarks demonstrate that JURY-RL achieves more stable training, consistently outperforms label-free baselines, and often matches or surpasses supervised training with ground-truth rewards across pass@1 and pass@k metrics.
1.6. Original Source Link
Official Source: https://openreview.net/forum?id=tnfvv9Wsw9
PDF Link: https://openreview.net/pdf?id=tnfvv9Wsw9
Publication Status: The paper is listed as "Paper under double-blind review" at OpenReview, suggesting it is a submission to a peer-reviewed conference.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the challenge of reliably enhancing the reasoning capabilities of Large Language Models (LLMs) using Reinforcement Learning with Verifiable Rewards (RLVR). While RLVR offers a principled way to align LLMs with provably correct outputs (rather than merely plausible ones), its widespread adoption is severely hindered by the high cost and limited coverage of human-annotated labels or carefully curated specifications.
Existing label-free alternatives, which aim to reduce this cost, introduce their own set of significant problems:
- Majority Voting: Approaches like majority voting (where a consensus answer from multiple model rollouts is rewarded) are susceptible to false positives. Models can learn to satisfy the surrogate reward signal by converging on a popular but incorrect answer, a phenomenon known as reward hacking. This often leads to training collapse, where the model's performance degrades over time.
- LLM-as-a-Judge: Using another LLM to evaluate responses can reduce self-confirmation bias but suffers from high false positives, sensitivity to prompting and temperature, nontrivial compute overhead, and the risk of inheriting shared biases from the judge LLM. Judges are also vulnerable to instruction and format manipulation.

The fundamental challenge, as argued by the authors, is that a robust reward signal for label-free RLVR must satisfy three properties simultaneously:

- Scalability: Must work without costly human supervision.
- Truth-alignment: Must reward verifiable correctness, not error-prone consensus.
- Optimization-stability: Must allow learning even when verification is inconclusive.

Previous methods have failed to meet all three, often sacrificing truth-alignment for scalability or suffering instabilities. The paper's innovative idea is to resolve this conflict by decoupling answer proposal from reward disposal, introducing a "Votes Propose, Proofs Dispose" paradigm.
2.2. Main Contributions / Findings
The paper introduces JURY-RL and makes the following primary contributions:
- Novel Label-Free RLVR Framework (JURY-RL): JURY-RL operationalizes the "votes propose, proofs dispose" paradigm. It strategically uses votes from model rollouts to propose a single candidate answer (for scalability) and then employs a formal Lean verifier to dispose the final reward (for truth alignment). This design aligns rewards with provable correctness without human annotations, suppressing the false positives common in other label-free methods.
- Principled Fallback Mechanism (ResZero Reward): For situations where formal verification is inconclusive, JURY-RL introduces the ResZero (Residual-Zero) reward. This mechanism discards the unverifiable majority proposal and assigns a zero-mean, variance-preserving reward to the remaining (residual) answers. This ensures stable optimization and prevents training collapse from spurious consensus, addressing the optimization-stability requirement.
- State-of-the-Art Performance and Stability: Through extensive experiments across mathematical reasoning, code generation, and multi-task benchmarks, JURY-RL demonstrates:
  - More stable training compared to label-free baselines, avoiding mode collapse.
  - Consistently better results than other label-free methods.
  - Performance that matches or even surpasses supervised training with ground-truth (GT) rewards across pass@1 and pass@k metrics, especially in out-of-domain generalization, suggesting that the proof-gated reward can provide a superior learning signal.
  - Enhanced solution diversity, as evidenced by improvements in pass@k over pass@1 and a sustained higher average number of unique answers.

These findings collectively demonstrate that grounding RL in sparse but formally verified signals is a promising strategy for building robust and generalizable reasoning models without relying on human labels.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Large Language Models (LLMs): Advanced neural network models, typically based on the transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language, performing tasks like translation, summarization, question answering, and code generation. Examples include GPT-4, Qwen, and Llama.
- Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, receiving feedback in the form of rewards or penalties for its actions.
- Reinforcement Learning with Verifiable Rewards (RLVR): An extension of RL where the reward signal for an LLM's output is derived from an external, objective verification process rather than human feedback or simple textual comparison. This verification often involves execution (e.g., running code, evaluating mathematical expressions) or formal proof. The goal is to align the LLM's reasoning with provable correctness.
- Reward Hacking: A phenomenon in RL where an agent learns to exploit flaws or unintended loopholes in the reward function to maximize its reward without actually achieving the intended goal. This can lead to models that appear to perform well according to the reward signal but are fundamentally misaligned with the true objective.
- Training Collapse / Mode Collapse: In generative models or RL, a situation where the model loses its ability to generate diverse or correct outputs. Instead, it converges to producing a limited set of (spurious consensus) outputs, often due to an unstable or misleading reward signal that reinforces a narrow, potentially incorrect, subset of behaviors.
- Formal Theorem Prover: A software tool used to assist in the creation and verification of mathematical proofs. It provides a formal language to state mathematical theorems and a logical engine to check the correctness of proofs step-by-step against a set of axioms and inference rules.
  - Lean Prover: An interactive theorem prover and programming language (Lean 4 is the latest version) developed at Microsoft Research. It combines a programming language with a powerful proof assistant, allowing users to write machine-verifiable formal proofs. It is known for its strong type-theoretic foundation and active community in formalizing mathematics.
- Optimization Gradient: In machine learning, the gradient is a vector that points in the direction of the steepest ascent of a function (e.g., a loss or reward function). Optimization algorithms use this gradient to adjust model parameters iteratively, moving towards a minimum (for loss) or maximum (for reward). A stable optimization gradient means that updates are consistent and do not lead to erratic or divergent training behavior.
- Zero-mean, variance-preserving reward: A reward signal that, within a group of samples, has an average value of zero. This property matters for RL algorithms that normalize rewards, because it prevents the learning signal from constantly pushing the policy in one direction (e.g., always increasing or decreasing token probabilities). Variance-preserving means that even with a zero mean there is still enough variability in the rewards to provide informative gradients, allowing the RL agent to differentiate between better and worse actions within the group.
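The following is a minimal sketch, not from the paper's code, illustrating why a zero-mean, variance-preserving group reward still yields informative normalized advantages while an all-zero fallback does not; the reward values are invented for illustration.

```python
# Minimal sketch: group-normalized advantages under two fallback rewards.
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# All-zero fallback: mean and variance are both zero -> every advantage is 0.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))      # [0.0, 0.0, 0.0, 0.0]

# Zero-mean, variance-preserving fallback: the group mean is exactly zero, yet
# the advantages still differentiate rollouts (roughly [0.97, 0.97, -0.65, -1.30]).
print(group_advantages([0.3, 0.3, -0.2, -0.4]))
```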
3.2. Previous Works
The paper categorizes previous label-free self-reward methods based on the source of the reward signal:
- Model-Internal Self-Reward Signals:
  - Output-side proxies:
    - Entropy Minimization: Rewards outputs that exhibit high certainty (low entropy) in the model's predictions. The reward is inversely related to the policy's output entropy for the final-answer tokens.
    - Self-Certainty: Rewards hypotheses with high confidence scores, often derived from the log-probabilities of the tokens forming the final answer.
    - Critique: Both entropy minimization and self-certainty are fragile, as they can amplify errors when the model becomes confidently wrong, potentially leading to reward hacking.
  - Agreement-based methods:
    - Single-view agreement (Majority Voting): For multiple responses ($G$ rollouts) to the same input, a majority-voted answer is identified, and responses matching this consensus receive a positive reward.
    - Critique: Prone to reinforcing an erroneous consensus, where the model converges on a popular but incorrect answer by exploiting superficial heuristics (e.g., formatting). This can lead to entropy collapse (the model generates very similar, often incorrect, outputs) and policy degradation. The paper provides a theoretical analysis in Appendix A.1 explaining why naive majority voting is brittle. The group-normalized advantages under a Majority Voting (MV) reward are:
$
\hat{A}_{i} = \begin{cases} \sqrt{\dfrac{1-\hat{v}}{\hat{v}}}, & a_{i} = \hat{a} \\[4pt] -\sqrt{\dfrac{\hat{v}}{1-\hat{v}}}, & a_{i} \neq \hat{a} \end{cases}
$
      where $\hat{v}$ is the vote share of the majority answer $\hat{a}$. As $\hat{v} \to 1$, supporters get a vanishing signal ($\hat{A}_{i} \to 0^{+}$) and dissenters get diverging penalties ($\hat{A}_{i} \to -\infty$), suppressing exploration and shrinking token-level entropy.
    - Multi-view agreement (e.g., CoReward): Attempts to improve robustness by enforcing consistency across multiple, semantically equivalent prompts. A majority answer from one prompt variant is used as a pseudo-label for responses from the original prompt.
    - Critique: While often improving training stability initially, it can delay rather than eliminate hacking, as spurious shortcuts can eventually propagate across multiple views.
- External-Judge Signals:
  - LLM-as-a-Judge: Uses a powerful, external LLM to automatically score the model's outputs.
  - Critique: Mitigates self-confirmation bias but introduces sensitivity to prompt design, high computational cost, and the risk of transferring the judge's intrinsic biases. It is vulnerable to prompt/format gaming and false positives.
  - LLM-based Knowledge Distillation (LLM-KD): A teacher model (often a more capable LLM) generates a reference answer to guide the student model.
  - Critique: Offers a potentially more granular signal but is constrained by the teacher's capabilities and biases.
- Verification-Based Training: This paradigm uses externally checkable signals to determine correctness.
  - Program execution and unit tests: Common for code generation tasks.
  - SMT (Satisfiability Modulo Theories) solvers: Used for logical verification.
  - Formal proof assistants (e.g., Lean/Coq): Used to mathematically verify proofs.
  - Critique: Previous generate-then-verify pipelines often provide no learning signal when verification fails, limiting stability and sample efficiency.

The paper adopts Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as its RL algorithm. GRPO is designed to handle grouped rewards and estimate group-normalized advantages. The GRPO objective is:
$
\max_{\pi_{\theta}} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\left[ r\left(x, y; \mathcal{G}_{x}\right) - \beta\, \mathrm{D}_{\mathrm{KL}}\left(\pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right) \right]
$
where:
- $\pi_{\theta}$ is the policy LLM with parameters $\theta$.
- $x$ is a problem sampled from dataset $\mathcal{D}$.
- $y$ is a token sequence (response) sampled from $\pi_{\theta}(\cdot \mid x)$.
- $r(x, y; \mathcal{G}_{x})$ is a grouped reward computed from the $G$ rollouts $\mathcal{G}_{x}$ for the same input $x$.
- $\beta$ is the KL penalty coefficient.
- $\mathrm{D}_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence between the current policy and a reference policy $\pi_{\mathrm{ref}}$. This term regularizes training to prevent the policy from drifting too far from the reference (e.g., the base LLM).

The scalar group advantage in GRPO is calculated as:
$
\hat{A}_{i} = \frac{r_{i} - \bar{r}}{\operatorname{std}\left(\{r_{j}\}_{j=1}^{G}\right) + \varepsilon}, \qquad \bar{r} = \frac{1}{G} \sum_{j=1}^{G} r_{j}
$
where:
- $r_{i}$ is the reward for the $i$-th rollout.
- $\bar{r}$ is the mean reward across the group of $G$ rollouts.
- $\operatorname{std}(\{r_{j}\}_{j=1}^{G})$ is the standard deviation of rewards across the group.
- $\varepsilon$ is a small constant added for numerical stability, preventing division by zero.

GRPO then maximizes the clipped objective:
$
J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_{i}|} \sum_{t=1}^{|y_{i}|} \min\left(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\; \operatorname{clip}\left(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t}\right) - \beta\, D_{\mathrm{KL}}\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right)\right]
$
where:
- $\rho_{i,t}(\theta)$ is the per-token ratio of the new policy's probability to the old policy's probability for token $t$ of rollout $i$.
- $\hat{A}_{i,t}$ is a broadcast of $\hat{A}_{i}$ (the group advantage) to each token, or a per-token variant.
- $\operatorname{clip}(\cdot,\, 1-\epsilon,\, 1+\epsilon)$ clips the policy ratio to a specified range to prevent excessively large policy updates, ensuring training stability.
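A minimal PyTorch-style sketch of the GRPO pieces described above (group-normalized advantages broadcast token-wise, clipped-ratio objective). This is an illustrative reconstruction under the stated formulas, not the paper's implementation; the KL-to-reference penalty is omitted and all tensor shapes are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) grouped rewards for one prompt -> (G,) scalar advantages."""
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

def grpo_token_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """logp_new/logp_old: (G, T) per-token log-probs; advantages: (G,) broadcast to tokens."""
    ratio = torch.exp(logp_new - logp_old)               # rho_{i,t}
    adv = advantages.unsqueeze(-1)                       # broadcast A_i to every token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    # Negative sign: minimizing this loss maximizes the clipped objective.
    # (Averaging over all tokens assumes equal-length rollouts for simplicity.)
    return -torch.min(unclipped, clipped).mean()

# Example: 4 rollouts, 3 tokens each.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = grpo_advantages(rewards)
logp_old = torch.randn(4, 3)
logp_new = logp_old + 0.01 * torch.randn(4, 3)
print(grpo_token_loss(logp_new, logp_old, adv).item())
```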
3.3. Technological Evolution
The field has evolved from basic LLMs to sophisticated models that require alignment with human values and factual correctness. Initially, LLMs were pre-trained on vast text corpora, excelling at language generation but often lacking reliable reasoning or factual grounding. Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) emerged to align models with human preferences. However, RLHF is costly and subjective. RLVR represents a step further, aiming for objective and provable correctness, particularly in domains like mathematics and code. The evolution has been towards label-free methods to scale this alignment, but these often struggle with truth-alignment and optimization stability. JURY-RL fits into this timeline by attempting to provide a scalable, truth-aligned, and stable label-free RLVR solution, addressing the shortcomings of previous label-free attempts and improving upon supervised RLVR by reducing reliance on costly ground truth.
3.4. Differentiation Analysis
Compared to existing methods, JURY-RL differentiates itself primarily through its "Votes Propose, Proofs Dispose" paradigm:
- Compared to Model-Internal Self-Reward (Majority Voting, Self-Certainty, Entropy):
  - Core Difference: JURY-RL does not reward mere popularity or model certainty. Instead of directly rewarding the majority-voted answer (which can be incorrect), it only proposes the majority answer for verification; the actual reward is proof-gated.
  - Innovation: This prevents the reward hacking and training collapse common in internal methods, which are vulnerable to spurious consensus or confidently wrong predictions. JURY-RL aligns rewards with provable correctness.
- Compared to External-Judge Signals (LLM-as-a-Judge, LLM-KD):
  - Core Difference: JURY-RL replaces the fallible LLM-as-a-Judge with a formal theorem prover (Lean). While LLM judges are prone to false positives, prompt sensitivity, computational cost, and bias transfer, a formal verifier offers a much higher degree of truth-alignment and precision.
  - Innovation: The Lean verifier, under standard soundness assumptions, provides a near-zero false-positive rate, directly binding the learning signal to hard evidence. This significantly reduces the attack surface for prompt and format hacking.
- Compared to Supervised GRPO (Ground-Truth rewards):
  - Core Difference: JURY-RL is label-free, eliminating the need for expensive human-annotated ground truth.
  - Innovation: Despite being label-free, JURY-RL's proof-gated rewards can sometimes outperform supervised GRPO with ground-truth rewards, particularly in out-of-domain generalization. This suggests that the signal from formal verification can be a more robust and generalizable learning objective than direct supervision, as it encourages learning underlying principles rather than overfitting to stylistic patterns.
- Unique Fallback Mechanism (ResZero):
  - Core Difference: When formal verification is inconclusive, JURY-RL does not simply assign zero reward (which stalls learning) or reward the unverified majority (which leads to collapse).
  - Innovation: ResZero penalizes the unverifiable majority and assigns a zero-mean, variance-preserving reward to the residual answers. This maintains a stable optimization gradient and encourages exploration among minority opinions without amplifying false consensus.

In essence, JURY-RL uniquely combines the scalability of label-free voting with the rigor of formal verification, while providing a robust fallback mechanism to ensure optimization stability, a crucial combination that previous methods failed to achieve.
4. Methodology
4.1. Principles
The core idea of JURY-RL is the "Votes Propose, Proofs Dispose" paradigm. This means that the process of generating a candidate answer and the process of evaluating its correctness (and thus assigning a reward) are decoupled.
- Votes Propose: The policy LLM generates multiple candidate solutions (rollouts). A simple, computationally cheap majority vote among these rollouts identifies a single consensus candidate (the proposal). This step ensures scalability by limiting the number of expensive verification calls.
- Proofs Dispose: A formal theorem prover (specifically, a Lean verifier) acts as an objective "judge" to verify the correctness of the single proposed candidate. The reward for all rollouts is then "disposed" (assigned) based on the outcome of this formal verification. This step ensures truth-alignment by grounding rewards in provable correctness.

This dual mechanism is designed to satisfy three core principles simultaneously:

- Scalability: Achieved by relying on label-free rollouts and verifying only a single majority-voted candidate, avoiding the prohibitive cost of verifying every unique answer.
- Truth Alignment: Ensured by using a formal verifier that provides provable correctness with near-zero false positives.
- Optimization Stability: Maintained by the ResZero reward mechanism, which provides a meaningful learning signal even when formal verification is inconclusive, preventing training stalls or collapse.
4.2. Core Methodology In-depth (Layer by Layer)
The JURY-RL framework operates as follows:
4.2.1. Proposal Stage: Majority Voting
For a given problem $x$, the policy LLM $\pi_{\theta}$ generates $G$ trajectories (rollouts) $\{y_{i}\}_{i=1}^{G}$ through sampling. From each trajectory $y_{i}$, a deterministic extractor function $\operatorname{ans}(\cdot)$ parses a candidate answer $a_{i} = \operatorname{ans}(y_{i})$.
The majority vote then determines the most frequent answer among these candidates. This most frequent answer becomes the candidate proposal, denoted $\hat{a}$:
$
\hat{a} = \arg\max_{a} \sum_{i=1}^{G} \mathbb{I}\left[a_{i} = a\right]
$
Here:
- $\hat{a}$ is the majority-voted answer.
- $a_{i}$ is the answer parsed from the $i$-th rollout $y_{i}$.
- $\mathbb{I}[\cdot]$ is the indicator function, which returns 1 if the condition inside the brackets is true and 0 otherwise; the sum counts how many rollouts produced a specific answer $a$.
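A minimal sketch of the proposal stage: parse an answer from each rollout and take the most frequent one as the single candidate to verify. The boxed-answer regex extractor is an assumption for illustration; the paper does not specify the parsing rule.

```python
import re
from collections import Counter

def ans(rollout_text: str) -> str:
    """Deterministic answer extractor; assumes a '\\boxed{...}' final answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", rollout_text)
    return matches[-1].strip() if matches else ""

def propose_majority(rollouts: list[str]) -> tuple[str, list[str]]:
    answers = [ans(y) for y in rollouts]
    majority, _count = Counter(answers).most_common(1)[0]   # a_hat = argmax_a sum_i 1[a_i = a]
    return majority, answers

rollouts = [r"... \boxed{42}", r"... \boxed{42}", r"... \boxed{41}"]
print(propose_majority(rollouts))   # ('42', ['42', '42', '41'])
```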
4.2.2. Disposal Stage: Proof-Gated Reward Framework
The candidate proposal $\hat{a}$ is then passed to an external Lean verifier. The verifier evaluates the correctness of $\hat{a}$ against a formal specification of the problem $x$. The outcome is a binary proof-gate signal $\delta$:
$
\delta = \operatorname{verify}(x, \hat{a}) \in \{0, 1\}
$
Here:
- $\operatorname{verify}(x, \hat{a})$ is the call to the Lean verifier.
- $\delta = 1$ if $\hat{a}$ is formally certified correct for problem $x$.
- $\delta = 0$ if verification is inconclusive.

The final reward for each trajectory is determined by a conditional function gated by $\delta$:
$
r_{i} = \underbrace{\delta \cdot \mathbb{I}\left[a_{i} = \hat{a}\right]}_{\text{Verified Correctness}} + \underbrace{(1-\delta) \cdot r_{i}^{\text{ResZero}}}_{\text{Inconclusive Fallback}}
$
This formula describes the proof-gated reward mechanism:

- Case 1: Verification succeeds ($\delta = 1$). The fallback term $(1-\delta) \cdot r_{i}^{\text{ResZero}}$ vanishes and the reward simplifies to $r_{i} = \mathbb{I}[a_{i} = \hat{a}]$. This means:
  - Rollouts that produced the proven-correct majority answer $\hat{a}$ receive a positive reward of 1.
  - Rollouts that produced any other answer (not $\hat{a}$) receive a reward of 0.
  This directly binds the learning signal to hard evidence, preventing false positives.
- Case 2: Verification is inconclusive ($\delta = 0$). The verified-correctness term $\delta \cdot \mathbb{I}[a_{i} = \hat{a}]$ vanishes, and the reward is determined by the ResZero mechanism, $r_{i} = r_{i}^{\text{ResZero}}$.
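A minimal sketch of the disposal stage: the per-rollout reward is gated by the verifier's verdict on the single majority candidate. The `verify` and `reszero_rewards` callables are stand-ins (not real APIs); a ResZero sketch appears in the next subsection.

```python
from typing import Callable, Sequence

def proof_gated_rewards(
    problem: str,
    answers: Sequence[str],
    majority: str,
    verify: Callable[[str, str], bool],
    reszero_rewards: Callable[[Sequence[str], str], list],
) -> list:
    delta = 1 if verify(problem, majority) else 0        # one verifier call per group
    if delta == 1:
        # Verified: reward 1 only for rollouts matching the proven majority answer.
        return [1.0 if a == majority else 0.0 for a in answers]
    # Inconclusive: fall back to the zero-mean ResZero reward.
    return reszero_rewards(answers, majority)
```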
4.2.3. ResZero (Residual-Zero) Reward
The ResZero reward is a fallback mechanism for when verification is inconclusive ($\delta = 0$). Its purpose is to provide a stable learning signal without reinforcing a potentially incorrect majority or stalling learning with a zero reward. It works by penalizing the unverifiable majority proposal and constructing a zero-mean reward from the remaining (residual) answers.
Define the groups of rollouts:
- $M = \{i : a_{i} = \hat{a}\}$: the set of rollouts supporting the majority answer.
- $R = \{i : a_{i} \neq \hat{a}\}$: the set of residual rollouts (those not supporting the majority).
The majority share is $\hat{v} = |M| / G$, where $|M|$ is the number of rollouts in the majority group and $G$ is the total number of rollouts.
The leave-one-out residual share for an answer $b$ within the residual group is defined as:
$
u^{(-i)}(b) = \frac{1}{|R| - 1} \sum_{\substack{j \in R \\ j \neq i}} \mathbb{I}\left[a_{j} = b\right]
$
Here:
- $u^{(-i)}(b)$ is the proportion of the other residual rollouts (excluding rollout $i$ itself) that also produced answer $b$; it measures the relative support for an answer among the minority opinions.
- Note: if $|R| \leq 1$, the denominator $|R| - 1$ vanishes, an edge case requiring special handling; the definition above covers the general case $|R| > 1$.
A temporary variable $z_{i}$ is then defined:
- $z_{i} = u^{(-i)}(a_{i})$ if $i \in R$ (the relative support for the residual answer within its peer group).
- $z_{i} = 0$ if $i \in M$.
The ResZero reward is then assigned as:
$
r_{i}^{\text{ResZero}} = \underbrace{\alpha \cdot \mathbb{I}[i \in R] \cdot \left(z_{i} - \bar{u}\right)}_{\text{Amplify residual signals}} \;-\; \underbrace{c\,\alpha \cdot \mathbb{I}[i \in M]}_{\text{Penalize majority}} \;+\; \underbrace{\gamma}_{\text{Global re-centering}}
$
where:
- $\alpha$ is a positive hyperparameter that scales the signal, and $c$ controls the penalty strength applied to the unverified majority.
- $\bar{u}$ is the average support across the residual group, i.e., the mean of $z_{i}$ over $i \in R$. The term $(z_{i} - \bar{u})$ creates a zero-mean signal within the residual group, rewarding residual answers that are more popular among the minorities and penalizing less popular ones.
- $\gamma$ is a global re-centering term designed to ensure the total reward sums to zero. The derivation in Appendix A.2 shows that setting $\gamma = c\,\alpha\,\hat{v}$ achieves this.

Derivation of $\gamma$ (from Appendix A.2), to ensure $\sum_{i=1}^{G} r_{i}^{\text{ResZero}} = 0$:
- Sum of rewards for the residual group $R$: $\sum_{i \in R} r_{i}^{\text{ResZero}} = \alpha \sum_{i \in R} (z_{i} - \bar{u}) + |R|\,\gamma = |R|\,\gamma$, since deviations from the group mean sum to zero.
- Sum of rewards for the majority group $M$: $\sum_{i \in M} r_{i}^{\text{ResZero}} = -c\,\alpha\,|M| + |M|\,\gamma$.
- Combine the sums: since $|M| + |R| = G$ (the total number of rollouts), the total is $-c\,\alpha\,|M| + G\,\gamma$.
- Set the total to zero: $\gamma = c\,\alpha\,|M| / G = c\,\alpha\,\hat{v}$.
Therefore, the ResZero reward is strictly zero-centered. This design also ensures:
- Variance preservation: It maintains non-zero variance among differing answers (as long as residual answers are not all identical), providing an informative gradient for GRPO.
- Adaptive economy: The corrective signal dynamically scales with the majority share $\hat{v}$, applying stronger pressure when the model is confidently wrong.
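A minimal sketch of the ResZero fallback as reconstructed above: residual rollouts get a leave-one-out support signal centered at zero, majority rollouts get a penalty, and the re-centering term keeps the group mean at exactly zero. The hyperparameter values `alpha=1.0` and `c=1.0` are illustrative assumptions, not the paper's settings.

```python
def reszero_rewards(answers, majority, alpha=1.0, c=1.0):
    G = len(answers)
    residual_idx = [i for i, a in enumerate(answers) if a != majority]
    v_hat = (G - len(residual_idx)) / G                       # majority share
    gamma = c * alpha * v_hat                                 # global re-centering term

    # Leave-one-out support z_i = u^{(-i)}(a_i) within the residual group.
    z = {}
    for i in residual_idx:
        others = [answers[j] for j in residual_idx if j != i]
        z[i] = others.count(answers[i]) / max(len(others), 1)
    u_bar = sum(z.values()) / max(len(z), 1)                  # mean residual support

    rewards = []
    for i, a in enumerate(answers):
        if a == majority:
            rewards.append(-c * alpha + gamma)                # penalize unverified majority
        else:
            rewards.append(alpha * (z[i] - u_bar) + gamma)    # amplify residual signal
    return rewards

# 8 rollouts: 4 on the (unverified) majority, residuals split 3-to-1.
r = reszero_rewards(["A"] * 4 + ["B"] * 3 + ["C"], majority="A")
print([round(x, 3) for x in r], "sum =", round(sum(r), 6))    # sum = 0.0
```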
4.2.4. Lean Verifier Pipeline
The Lean verifier itself is a multi-component system designed for reliable and precise reward signals, as illustrated in Image 4. It comprises:
- Autoformalizer: Translates natural-language mathematical problem statements (question-answer pairs) into precise, executable Lean formal specifications. It uses the Goedel-Formalizer-V2-32B model.
- Consistency-Checker: Performs bidirectional semantic alignment between the original natural-language input and its formalized Lean counterpart. It uses QwQ-32B to validate the Autoformalizer's outputs.
- Prover: Synthesizes formal proof scripts in Lean and submits them to the theorem prover for mechanized validation, certifying logical correctness. It uses the Goedel-Prover-V2-32B model.

The verification pipeline proceeds as follows:
- The Autoformalizer generates 8 candidate formalizations.
- The Consistency-Checker evaluates all candidates and selects the top checked candidate.
- The Prover conducts up to 16 independent sampling trials (pass@16) to find a valid proof for the selected formalization.
The formal-verification models use a temperature of 0.7 and a maximum of 32,768 tokens.
The following figure (Image 4 from the original paper) shows the system architecture: a schematic of the Lean Verifier comprising three GPU modules (Autoformalizer, Consistency-Checker, and Prover), a Lean Server (CPU module), and an underlying cache.
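A schematic sketch of the three-stage verification flow described above. The `autoformalize`, `check_consistency`, and `prove` callables are stubs standing in for the Goedel-Formalizer-V2-32B, QwQ-32B, and Goedel-Prover-V2-32B models; none of these function signatures come from the paper's code.

```python
def verify_with_lean(question: str, answer: str,
                     autoformalize, check_consistency, prove,
                     n_formalizations: int = 8, n_proof_attempts: int = 16) -> bool:
    # 1) Autoformalizer: sample several candidate Lean statements for (question, answer).
    candidates = [autoformalize(question, answer) for _ in range(n_formalizations)]

    # 2) Consistency-Checker: keep candidates judged faithful to the informal statement,
    #    then pick the top-ranked one.
    checked = [c for c in candidates if check_consistency(question, answer, c)]
    if not checked:
        return False                      # nothing faithful to prove -> inconclusive
    formal_statement = checked[0]

    # 3) Prover: up to n_proof_attempts independent samples (Prover@16 by default);
    #    any machine-checked proof certifies the answer.
    return any(prove(formal_statement) for _ in range(n_proof_attempts))
```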
4.2.5. JURY-RL Algorithm
The full algorithm for one grouped update for a prompt $x$ is summarized below.
Algorithm 1 JURY-RL (one grouped update for a prompt $x$)
1: Sample $G$ rollouts $\{y_{i}\}_{i=1}^{G} \sim \pi_{\theta}(\cdot \mid x)$; parse $a_{i} = \operatorname{ans}(y_{i})$.
2: Compute vote shares $v(a)$ and majority $\hat{a}$.
3: Query the verifier once: $\delta = \operatorname{verify}(x, \hat{a})$.
4: if $\delta = 1$ then
5: Set $r_{i} = \mathbb{I}[a_{i} = \hat{a}]$.
6: else
7: Form $M$, $R$; compute $z_{i}$ and $\bar{u}$; set $r_{i} = r_{i}^{\text{ResZero}}$ via Eq. (5) (ResZero reward).
8: end if
9: Compute group-normalized advantages $\hat{A}_{i}$ from $\{r_{i}\}$; broadcast token-wise.
10: Update $\pi_{\theta}$ with GRPO (clipped ratios, KL to reference).
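A condensed sketch of Algorithm 1 for a single prompt, wiring together the helper functions sketched in Sections 4.2.1 through 4.2.3 (`propose_majority`, `proof_gated_rewards`, `reszero_rewards`, `grpo_advantages`). The policy sampling and the GRPO update itself are left abstract; only the reward path is shown, and all names are illustrative.

```python
import torch  # grpo_advantages from the earlier sketch expects a torch tensor

def jury_rl_group_step(prompt, sample_rollouts, verify, grpo_update, G=8):
    rollouts = sample_rollouts(prompt, G)                        # step 1: G rollouts
    majority, answers = propose_majority(rollouts)               # step 2: vote shares and a_hat
    rewards = proof_gated_rewards(                               # steps 3-8: one verifier call,
        prompt, answers, majority, verify,                       # proof-gated or ResZero reward
        reszero_rewards=reszero_rewards,
    )
    advantages = grpo_advantages(torch.tensor(rewards))          # step 9: group-normalized
    grpo_update(prompt, rollouts, advantages)                    # step 10: clipped GRPO + KL
```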
4.2.6. Illustrative Example of ResZero Reward
Consider a group of $G$ rollouts in which the residual (non-majority) group $R$ contains 4 rollouts: 3 support answer $b_{1}$ and 1 supports answer $b_{2}$. Assume verification is inconclusive ($\delta = 0$).
- Residual group ($R$) analysis:
  - For each of the 3 trajectories with answer $b_{1}$: $u^{(-i)}(b_{1}) = 2/3$ (two of the other three residual rollouts agree).
  - For the single trajectory with answer $b_{2}$: $u^{(-i)}(b_{2}) = 0$.
  - The average support across the residual group is $\bar{u} = (3 \cdot \tfrac{2}{3} + 1 \cdot 0)/4 = 0.5$.
- Reward calculation: with the penalty hyperparameters $\alpha$ and $c$ fixed, the global re-centering term is $\gamma = c\,\alpha\,\hat{v}$.
  - Rollouts in $M$ (supporting $\hat{a}$) receive $r^{\text{ResZero}} = -c\,\alpha + \gamma$.
  - Rollouts with $b_{1}$ receive $r^{\text{ResZero}} = \alpha\,(\tfrac{2}{3} - 0.5) + \gamma$. For this entry the paper's worked example lists 0.0667, while recomputing from the formula with the paper's stated values gives 0.1083; the zero-sum check below is only satisfied with the recomputed value, so 0.0667 appears to be a typo in the original text.
  - The rollout with $b_{2}$ receives $r^{\text{ResZero}} = \alpha\,(0 - 0.5) + \gamma$.
  - Total reward sum: $\sum_{i} r_{i}^{\text{ResZero}} = 0$, which demonstrates the zero-mean property.
4.2.7. Fallback Rewards for Inconclusive Verification (Theoretical Analysis from Appendix A.3)
This section analyzes the group-normalized advantage under different fallback mechanisms when $\delta = 0$.
- Case 1: Majority Voting (MV) reward. If verification fails and we naively reward the majority consensus, $r_{i} = \mathbb{I}[a_{i} = \hat{a}]$. Let $\hat{v}$ be the vote share of $\hat{a}$. The mean reward is $\bar{r} = \hat{v}$ and the standard deviation is $\sqrt{\hat{v}(1-\hat{v})}$. The advantage is:
$
\hat{A}_{i} = \begin{cases} \sqrt{\dfrac{1-\hat{v}}{\hat{v}}}, & \text{if } a_{i} = \hat{a} \text{ (Supporter)} \\[4pt] -\sqrt{\dfrac{\hat{v}}{1-\hat{v}}}, & \text{if } a_{i} \neq \hat{a} \text{ (Dissenter)} \end{cases}
$
  - Implication: As a spurious consensus strengthens ($\hat{v} \to 1$), the advantage for supporters vanishes ($\hat{A}_{i} \to 0^{+}$), while the penalty for dissenters diverges ($\hat{A}_{i} \to -\infty$). This suppresses exploration and leads to entropy collapse and policy degradation.
- Case 2: Zero reward. Assign $r_{i} = 0$ for all $i$ when verification is inconclusive. The mean reward is $\bar{r} = 0$ and the standard deviation is 0. The advantage is:
$
\hat{A}_{i} = \frac{0 - 0}{0 + \varepsilon} = 0
$
  - Implication: The advantage signal is nullified, leading to a vanishing gradient and effectively stalling the learning process for that batch. It avoids collapse, but at the cost of learning efficiency.
- Case 3: ResZero reward (the paper's proposal). ResZero by construction has a zero-mean property ($\bar{r} = 0$) and non-zero variance (unless all residual answers are identical). The advantage is:
$
\hat{A}_{i} = \frac{r_{i}^{\text{ResZero}} - 0}{\operatorname{std}\left(\{r_{j}\}\right) + \varepsilon} = \frac{r_{i}^{\text{ResZero}}}{\operatorname{std}\left(\{r_{j}\}\right) + \varepsilon}
$
  - Implication:
    - Prevents stagnation: provides a meaningful gradient, since $\operatorname{std}(\{r_{j}\})$ is non-zero.
    - Corrects spurious consensus: for $i \in M$, $r_{i}^{\text{ResZero}}$ is negative, resulting in a negative $\hat{A}_{i}$ that discourages convergence on the unverified consensus.
    - Guides exploration: for $i \in R$, the sign of $\hat{A}_{i}$ depends on the relative popularity of the answer within the minority, rewarding promising alternative paths and penalizing less popular ones, thus guiding principled exploration.
5. Experimental Setup
5.1. Datasets
The experiments use a diverse range of datasets covering mathematical reasoning, code generation, and general abilities.
- Training Dataset:
  - MATH: 7,500 problems from the MATH dataset's training split (Hendrycks et al., 2021). This dataset contains competition mathematics problems designed to be challenging for LLMs.
    - Sample data: The paper does not provide a specific example of a MATH data sample, but items typically consist of mathematical questions requiring multi-step reasoning and a final numerical or symbolic answer.
- Evaluation Datasets:
  - Mathematical Reasoning:
    - AIME24/25: Problems from the American Invitational Mathematics Examination (Hugging Face H4, 2024; OpenCompass, 2025). These are competition-level math problems.
    - MATH500: A 500-problem evaluation subset derived from the MATH dataset (Lightman et al., 2024). This is considered the in-domain test set.
    - GSM8K: Grade School Math 8K (Cobbe et al., 2021), a dataset of roughly 8,500 grade-school math word problems.
    - AMC: American Mathematics Competitions dataset (math-ai Team, 2024), another set of competition-level math problems.
  - Code Generation:
    - LiveCodeBench: (Jain et al., 2024) A benchmark for code reasoning, understanding, and execution.
    - CRUX: (Gu et al., 2024a) Another benchmark for code reasoning, understanding, and execution.
  - Instruction-Following and Multi-Task:
    - IFEval: (Zhou et al., 2023) Used for evaluating instruction-following capabilities.
    - MMLU-Pro: (Wang et al., 2024) Used for multi-task understanding and general knowledge.

These datasets were chosen to comprehensively assess model capabilities across various domains and difficulty levels, covering both in-domain and out-of-domain generalization.
5.2. Evaluation Metrics
The paper uses the following evaluation metrics:
- Pass@1 Accuracy:
  - Conceptual Definition: Pass@1 measures the proportion of problems for which the model's first sampled solution is correct. It is the standard metric for the direct success rate of a model's primary output.
  - Mathematical Formula:
$
\text{Pass@1} = \frac{\text{Number of problems whose first attempt is correct}}{\text{Total number of problems}}
$
- Pass@k Accuracy:
  - Conceptual Definition: Pass@k estimates the probability that at least one of $k$ independently sampled solutions for a problem is correct. It reflects the model's ability to generate diverse and correct solutions, indicating its exploration capability and overall problem-solving capacity when multiple attempts are allowed.
  - Mathematical Formula: With $n \geq k$ samples generated per problem, of which $c_{i}$ are correct for problem $i$, the standard unbiased estimator is:
$
\text{Pass@}k = \frac{1}{N} \sum_{i=1}^{N} \left(1 - \frac{\binom{n - c_{i}}{k}}{\binom{n}{k}}\right)
$
  When exactly $k$ samples are generated and checked per problem, this reduces to the fraction of problems with at least one correct solution among the $k$ attempts.
  - Symbol Explanation: $k$ is the number of independent solution attempts allowed per problem; $n$ is the number of samples generated; $c_{i}$ is the number of correct samples for problem $i$; $N$ is the total number of problems.
- Precision:
  - Conceptual Definition: In the context of verifier signal quality, precision measures the proportion of solutions identified as "correct" by the verifier that are actually correct (true positives). It quantifies the verifier's ability to avoid false positives.
  - Mathematical Formula:
$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$
- Recall:
  - Conceptual Definition: Recall measures the proportion of all truly correct solutions that the verifier successfully identifies. It quantifies the verifier's ability to find all relevant correct solutions and avoid false negatives.
  - Mathematical Formula:
$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$
- F1-Score:
  - Conceptual Definition: The F1-score is the harmonic mean of precision and recall. It provides a single score balancing both, which is useful when both false positives and false negatives matter.
  - Mathematical Formula:
$
\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$
- Average: The mean score across multiple benchmarks or tasks; the paper uses it to summarize performance across benchmark categories.
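A minimal sketch of the standard unbiased pass@k estimator written out above, assuming $n$ samples per problem with $c$ of them correct; the helper names are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement from n
    generated samples, c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(per_problem, k):
    """per_problem: list of (n_i, c_i) pairs, one per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)

print(pass_at_k(n=16, c=2, k=4))   # ~0.45
```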
5.3. Baselines
The paper compares JURY-RL against a comprehensive set of baselines, including various label-free methods and a supervised oracle:
- Self-supervised label-free baselines:
  - Majority-Voting (MV): (Shafayat et al., 2025) A consensus reward in which a rollout receives a reward of 1 if its extracted answer matches the most frequent answer among the rollouts, and 0 otherwise. This reinforces popular answers irrespective of correctness.
  - Self-Certainty: (Zhao et al., 2025b) A confidence-derived signal. The reward is based on the log-probabilities of the tokens forming the final answer; higher cumulative log-probability indicates greater model certainty and yields a higher reward.
  - Entropy Minimization: (Prabhudesai et al., 2025) A low-entropy proxy. The reward is inversely related to the policy's output entropy for the final-answer tokens, encouraging more deterministic (and presumably confident) predictions.
  - CoReward: (Zhang et al., 2025) A multi-view agreement method that attempts to improve robustness by enforcing consistency across multiple, semantically equivalent prompts.
- Judge-based label-free baselines:
  - LLM-as-a-Judge: (Pang et al., 2023; Zhao et al., 2025a) An external LLM (specifically, qwen-2.5-72b-instruct) assesses the reasoning process and final answer of each rollout. Its evaluation (numeric score or binary correct/incorrect) serves as the reward.
  - LLM-KD (Knowledge Distillation): (Gu et al., 2024b) A teacher model (qwen-2.5-72b-instruct) generates a reference answer for each problem, which is treated as a pseudo-label. The policy model is trained to align its outputs with these machine-generated references.
- Supervised Baseline:
  - Ground-Truth (GT) Reward: (Shao et al., 2024) A supervised oracle baseline where human-annotated labels are available. A rollout receives a reward of 1 if its extracted answer matches the ground truth, and 0 otherwise. It is trained with the same GRPO objective for fair comparison.
5.4. Backbone Models
The experiments are conducted on a diverse range of open-source Large Language Models to ensure broad applicability:
- Qwen3-1.7B-Base (Yang et al., 2025a)
- Llama-3.2-3B-Instruct (Dubey et al., 2024)
- Qwen2.5-7B (Yang et al., 2025b)

The judge model used for the LLM-as-a-Judge and LLM-KD baselines is qwen-2.5-72b-instruct.
5.5. Implementation Details and Hyperparameters
All methods are implemented using the VeRL framework (Sheng et al., 2025).
The following are the results from Table 6 of the original paper:
| Hyperparameter | Value |
|---|---|
| Training Configuration | |
| Train Batch size (Number of Sampled Questions) | 128 |
| Rollouts per problem (G) | 8 |
| Max prompt length | 512 |
| Max new tokens | 3072 |
| Training epoch | 6 |
| Optimizer Parameters (AdamW) | |
| Learning rate | |
| AdamW $\beta_1$ | 0.9 |
| AdamW $\beta_2$ | 0.999 |
| Warmup style | Cosine |
| Warmup steps ratio | 0.1 |
| GRPO Algorithm Parameters | |
| KL loss coefficient ($\beta$) | 0.005 |
| Clip ratio ($\epsilon$) | 0.2 |
| Generation Parameters | |
| Training temperature | 1.0 |
| Evaluation temperature | 0.8 |
| top-p | 0.95 |
5.5.1. Lean Verifier Configuration
- Autoformalizer: Goedel-Formalizer-V2-32B (Lin et al., 2025)
- Consistency-Checker: QwQ-32B (QwenTeam, 2025)
- Prover: Goedel-Prover-V2-32B (Lin et al., 2025)
- The prover conducts up to 16 independent sampling trials (Prover@16) per verification; this setting was chosen as a trade-off between proof success rate and compute cost.
Prover@16) to find a valid proof for efficiency, after trading off performance for cost. - Hyperparameters for formal verification: Temperature: 0.7, Maximum Tokens: 32,768.
5.5.2. Prompt Formats (Appendix C.4)
- Autoformalizer Prompt:
Please autoformalize the following natural language problem statement in Lean 4. Use the following theorem name: test_problem The natural language statement is: {informal_statement_content}. Think before you provide the lean statement. - Consistency-Checker System Prompt:
Your role is a Lean4 expert, please help me check consistency between natural language expression and its Lean4 proof statement. Guidelines for Consistency Checking: Goal: Determine if the Lean theorem statement is an exact and faithful formalization of the mathematical problem. Do not evaluate or consider the answer or the proof. Your sole task is to verify the correctness of the formalization. Evaluation Stages (All required): 1. Math Assertion Analysis Identify all structurally and semantically relevant components of the mathematical problem, including variables, types, quantifiers, constraints, logic structure, conclusion, and so on. The analysis should be based on the actual content of the text. 2. Lean Statement Analysis (ignore proof part) Extract all structurally and semantically relevant components from the Lean statement, including variables, types, conditions, quantifiers, constraints, the final claim, and so on. The analysis should reflect the actual content present in the Lean code. 3. Comparative Verification Check for exact correspondence between the math and Lean statements; you may refer to aspects like: - Semantic alignment, logic structure, and quantifier correctness. - Preservation of constraints and boundary assumptions. - Accurate typing and use of variables. - Syntactic validity and proper Lean usage (free from errors). - Use of symbols and constructs without semantic drift. - No missing elements, no unjustified additions, and no automatic corrections or completions. 4. Final Judgement Based solely on the above analysis, judge whether the Lean statement is a correct and exact formalization of the mathematical problem. 5. Accuracy Confirmation If correct: clearly confirm why all elements match. If incorrect: list all mismatches and explain how each one affects correctness. Note: While the analysis may be broad and open to interpreting all relevant features, the final judgment must be based only on what is explicitly and formally expressed in the Lean statement. **Do not consider or assess any part of the proof. Your judgment should be entirely about the accuracy of the statement formalization.** You should present the results following the format: Input: The Natural Language Statement: A math problem and its answer (no proof). The Formal Statement in Lean4: '''lean A Lean 4 theorem statement formalizing the problem. Proof is intentionally omitted (e.g., sorry). ''' Output Format: Return exactly one xml object <comments> Your brief analysis: Math Assertion Analysis: [...] Lean Statement Analysis (Proof Ignored): [...] Comparative Verification: [...] Conclusion: [...] Accuracy Confirmation: [... match confirmation or list of discrepancies...] </comments> <consistency> [Correct/Incorrect] </consistency> - Consistency-Checker User Prompt:
Input Data: The Natural Language Statement: {informal_prefix} The Formal Statement in Lean4: '''lean {formal_statement} ''' - Prover Prompt:
Complete the following Lean 4 code: ''' lean4 {formal_statement} ''' Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies. The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
5.5.3. LLM-as-a-Judge Prompt (Appendix D)
You are a professional math QA pair evaluator. Your sole task is to determine whether a given mathematical question and answer are correctly matched. First explain your reasoning, then end your response with your final judgment: True or False.
5.6. Cost Analysis of Lean Verifier (Appendix C.3)
The computational overhead of the Lean Verifier is crucial.
- Cold-Start Overhead: In the worst case (processing a batch of entirely unseen QA pairs):
  - Ground-Truth (GT) baseline training: 100 seconds per step (core training workload).
  - JURY-RL (Lean Verifier): 200 seconds additional overhead per step, for a total of 300 seconds.
  - LLM-as-a-Judge (Qwen-2.5-72B): 80 seconds additional overhead per step, for a total of 180 seconds.
- Cost Amortization: Thanks to a caching mechanism for verification results, as training proceeds and the policy converges, more QA pairs are found in the cache. This amortizes the cost, and the per-step time of JURY-RL progressively converges toward that of the GT baseline in the steady-state phase.
- Performance vs. Token Cost (Table 5): The following are the results from Table 5 of the original paper:

| Models | MATH500 ACC | MATH500 TPR | MATH500 F1-Score | Max Tokens per Verify | Avg Tokens per Verify | Avg Tokens per Response |
|---|---|---|---|---|---|---|
| Prover@16 | 87.0 | 87.0 | 93.0 | 351,799 | 33,346 | 5,858 |
| Prover@32 | 89.0 | 89.0 | 94.0 | 658,061 | 52,064 | 6,607 |
| Prover@64 | 91.0 | 91.0 | 95.0 | 1,369,038 | 67,429 | 6,435 |

The Prover@16 configuration (up to 16 proof attempts) is chosen as the default, balancing acceptable performance (87.0% ACC) with a significantly lower token cost (33,346 average tokens per verification) compared to Prover@64.
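A minimal sketch of the caching idea described above: verification verdicts are memoized per (question, proposed answer), so repeated proposals in later training steps skip the expensive Lean pipeline. All names are illustrative; the actual pipeline call is a placeholder.

```python
from functools import lru_cache

def verify_with_lean_pipeline(question: str, answer: str) -> bool:
    # Placeholder for the Autoformalizer -> Consistency-Checker -> Prover pipeline.
    raise NotImplementedError

@lru_cache(maxsize=None)
def cached_verify(question: str, answer: str) -> bool:
    # Cache hit: return the stored verdict. Cache miss: run the full verifier once.
    return verify_with_lean_pipeline(question, answer)
```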
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results demonstrate that JURY-RL consistently outperforms label-free baselines and often matches or surpasses supervised training with ground-truth rewards (GT).
The following are the results from Table 1 of the original paper:
| Methods | Mathematics | Code | Instruction | Multi-Task | Average | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| AIME24 | AIME25 | MATH500 | GSM8K | AMC | LiveCode | CRUX | IFEval | MMLU-Pro | ||
| Qwen3-1.7B-Base | ||||||||||
| Before RL | ||||||||||
| GT-Reward | ||||||||||
| Self-Certainty | ||||||||||
| Entropy | ||||||||||
| Majority-Voting | ||||||||||
| CoReward | ||||||||||
| LLM-KD | ||||||||||
| LLM-as-a-Judge | ||||||||||
| JURY-RL (Ours) | ||||||||||
| Llama-3.2-3B-Instruct | ||||||||||
| Before RL | ||||||||||
| GT-Reward | ||||||||||
| Entropy | ||||||||||
| Self-Certainty | ||||||||||
| Majority-Voting | ||||||||||
| CoReward | ||||||||||
| LLM-KD | ||||||||||
| LLM-as-a-Judge | ||||||||||
| JURY-RL (Ours) | ||||||||||
| Qwen2.5-7B | ||||||||||
| Before RL | ||||||||||
| GT-Reward | ||||||||||
| Self-Certainty | ||||||||||
| Entropy | ||||||||||
| Majority-Voting | ||||||||||
| CoReward | ||||||||||
| LLM-KD | ||||||||||
| LLM-as-a-Judge | ||||||||||
| JURY-RL (Ours) |
6.1.1. Mathematical Reasoning and In-Domain Generalization
- JURY-RL's advantage over baselines is modest on the in-distribution MATH500 test set.
- Its superiority becomes substantially more pronounced on out-of-distribution math benchmarks such as GSM8K and AMC.
- This pattern suggests that competing methods, including the supervised GT baseline, tend to overfit to stylistic patterns of the MATH dataset, whereas JURY-RL, relying on formal verification, learns underlying mathematical principles that are robust to distributional shifts.
- JURY-RL achieves significant overall performance gains over the GT baseline on all three backbones:
  - Qwen3-1.7B-Base: +2.32 pts (+5.91% rel.)
  - Llama-3.2-3B-Instruct: +1.91 pts (+5.93% rel.)
  - Qwen2.5-7B: +1.53 pts (+3.28% rel.), reaching an average of 48.13% and exceeding GT (46.60%) and LLM-KD (46.54%).
- This indicates that formal verification provides a superior learning signal for generalizable reasoning.
6.1.2. Out-of-Domain Generalization
- The strong in-domain performance translates to robust out-of-domain generalization across code generation, instruction following, and multi-task knowledge tests.
- On Qwen2.5-7B, JURY-RL surpasses the GT baseline with an average of 44.72% (+2.08 pts, +4.88% rel.) compared to GT's 42.64%.
- On the other models, its performance is statistically on par with GT while being the best-performing label-free method.
- This confirms that optimizing for verifiable correctness encourages fundamental, transferable skills.
6.1.3. Overall Gains
- JURY-RL shows comprehensive improvements over the GT baseline across both in-domain and out-of-domain tasks:
  - Qwen3-1.7B-Base: +1.02 pts (+2.84% rel.)
  - Qwen2.5-7B: +2.08 pts (+4.88% rel.)
  - Llama-3.2-3B-Instruct: +0.99 pts (+3.13% rel.)
- This indicates stable and scalable performance gains across different model scales, addressing the training collapse seen in label-free methods.
- The benefits extend beyond pass@1 (single-answer accuracy) to pass@k (multi-attempt success rates), suggesting that the model learns diverse, effective solution paths and averts mode collapse.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Ablation Studies of ResZero
This study compares ResZero with alternative fallback designs for the inconclusive case ($\delta = 0$): Zero Reward (no signal) and MV Reward (rewarding the majority vote). GT-Reward and Majority-Voting are included as references.
The following are the results from Table 2 of the original paper:
| Methods | Mathematics | Code | Instruction | Multi-Task | Average | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| AIME24 | AIME25 | MATH500 | GSM8K | AMC | LiveCode | CRUX | IFEval | MMLU-Pro | ||
| Qwen3-1.7B-Base | ||||||||||
| GT-Reward | ||||||||||
| Majority-Voting | ||||||||||
| Proof-Gate + Zero Reward | ||||||||||
| Proof-Gate + MV Reward | ||||||||||
| JURY-RL (Proof-Gate + ResZero) | ||||||||||
| Llama-3.2-3B-Instruct | ||||||||||
| GT-Reward | ||||||||||
| Majority-Voting | ||||||||||
| Proof-Gate + Zero Reward | ||||||||||
| Proof-Gate + MV Reward | ||||||||||
| JURY-RL (Proof-Gate + ResZero) | ||||||||||
| Qwen2.5-7B | ||||||||||
| GT-Reward | ||||||||||
| Majority-Voting | ||||||||||
| Proof-Gate + Zero Reward | ||||||||||
| Proof-Gate + MV Reward | ||||||||||
| JURY-RL (Proof-Gate + ResZero) | ||||||||||
- ResZero consistently achieves the highest Average score across all backbones.
- On average, ResZero outperforms Zero Reward by +1.8 pts, MV Reward by +6.9 pts, and even GT by +1.4 pts.
- MV Reward (Proof-Gate + MV Reward) shows the worst performance, indicating that rewarding an unverified majority is detrimental: the model reinforces its own errors and collapses. For example, on Qwen3-1.7B-Base, Proof-Gate + MV Reward scores an average of 28.12%, significantly lower than Proof-Gate + Zero Reward (35.42%) and JURY-RL (36.90%).
- Zero Reward (Proof-Gate + Zero Reward) is suboptimal, as it stalls the learning process for inconclusive samples by providing no gradient.
- ResZero provides a robust middle ground between dangerous reinforcement and inefficient stagnation.
6.2.2. JURY-RL Achieves Stable Training and Avoids Collapse
The following figure (Image 2 from the original paper) shows accuracy on the MATH500 validation set over training steps. It contains three subplots, one per backbone (Qwen3-1.7B-Base, Llama-3.2-3B-Instruct, and Qwen2.5-7B), comparing the accuracy trajectories of the different reward strategies (JURY-RL, GT-Reward, LLM-KD, and the remaining baselines) during training.
- The validation-accuracy trajectories for Entropy and Self-Certainty collapse after an initial performance gain, as these methods reinforce spurious consensus.
- LLM-as-a-Judge, LLM-KD, and Majority-Voting exhibit noisy and suboptimal convergence.
- In contrast, JURY-RL demonstrates stable, monotonic improvement, confirming that its proof-gated reward mechanism effectively prevents the mode collapse common in self-supervised methods.
6.2.3. JURY-RL Enhances Diversity
The paper investigates whether JURY-RL enhances solution diversity.
The following are the results from Table 3 of the original paper:
| Model | GT-Reward (pass@k) | JURY-RL (pass@k) | Δ (pass@k) | GT-Reward (pass@1) | JURY-RL (pass@1) | Δ (pass@1) |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | 55.36 | 59.41 | +4.05 | 39.25 | 41.57 | +2.32 |
| Llama-3.2-3B-Instruct | 45.46 | 48.48 | +3.02 | 32.19 | 34.10 | +1.91 |
| Qwen2.5-7B | 62.48 | 64.04 | +1.56 | 46.60 | 48.13 | +1.53 |
- JURY-RL achieves substantially larger improvements in pass@k than in pass@1 (e.g., +4.05 pp vs. +2.32 pp for Qwen3-1.7B-Base).
- This indicates an increase in solution diversity. The ResZero reward penalizes flawed consensus and redistributes reward toward alternative reasoning paths, actively countering mode collapse and incentivizing exploration.

The following figure (Image 3 from the original paper) visually confirms this: it plots the average number of unique answers per problem against training step for JURY-RL and three other methods.
- The chart shows the average number of unique answers generated per problem during training on Qwen3-1.7B-Base.
- While baselines like Majority-Voting quickly suffer from mode collapse (converging to a single answer), JURY-RL sustains a high level of diversity.

The following are the pass@k results for all comparisons, from Table 7 of the original paper:

| Methods | AIME24 pass@16 | AIME25 pass@16 | MATH500 pass@4 | GSM8K pass@4 | AMC pass@8 | Average |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base: Before RL, GT-Reward, Entropy, Self-Certainty, Majority-Voting, CoReward, LLM-KD, LLM-as-a-Judge, JURY-RL (Ours) | | | | | | |
| Llama-3.2-3B-Instruct: Before RL, GT-Reward, Entropy, Self-Certainty, Majority-Voting, CoReward, LLM-KD, LLM-as-a-Judge, JURY-RL (Ours) | | | | | | |
| Qwen2.5-7B: Before RL, GT-Reward, Entropy, Self-Certainty, Majority-Voting, CoReward, LLM-KD, LLM-as-a-Judge, JURY-RL (Ours) | | | | | | |

- JURY-RL not only surpasses every label-free baseline but also consistently outperforms the strong supervised GT-Reward baseline across all three backbone models, further confirming its effectiveness in improving solution diversity and multi-attempt success rates.
6.2.4. Analysis of Verifier Signal Quality
The paper compares the signal quality of the Lean verifier against an LLM-as-a-Judge.
The following are the results from Table 4 of the original paper:
| Verifier | Prec. | Rec. | F1 |
|---|---|---|---|
| LLM-as-a-Judge | 75.9 | 96.1 | 84.8 |
| Lean Verifier (Ours) | 84.5 | 88.0 | 86.2 |
- The Lean verifier provides a superior reward signal with substantially higher precision (84.5% vs. 75.9%) at the cost of moderately lower recall (88.0% vs. 96.1%).
- This trade-off is critical: high precision drastically reduces false positives, preventing reward hacking and tightly aligning training with verifiable correctness.
- The LLM-judge's noisy signal (lower precision) risks reinforcing errors despite its higher recall.
- The Lean verifier's higher F1-score (86.2% vs. 84.8%) confirms its better overall balance, validating the "Proofs Dispose" principle by prioritizing signal fidelity.
- The imperfections in the Lean verifier (not achieving 100% precision/recall) stem from upstream components (auto-formalization, consistency checks) rather than the prover's core logic.

The following figure (Image 5 from the original paper) shows the training dynamics of precision, recall, validation accuracy, and training entropy under different Lean pass@k verification settings:
Figure (Image 5): four line-chart panels showing how Precision, Recall, Accuracy, and Entropy evolve with training step under different pass@k settings (k = 1, 2, 4, 8, 16).
- The precision (top-left panel) of the Lean verifier remains consistently high (around 85%), confirming its reliability in avoiding false positives.
- The recall (top-right panel) shows a clear dependency on k (the number of verification attempts), increasing from roughly 30% at pass@1 to nearly 90% at pass@16. This means multiple attempts significantly increase the chance of proving a correct answer (see the sketch after this list).
- Despite this variance in recall, the final validation accuracy (bottom-left) and training entropy (bottom-right) converge to similar stable states across different values of k. This suggests that the high precision of the verifier's signal is the dominant factor for successful and stable training.
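The k-dependence of recall follows from a simple "any attempt counts" aggregation over proof attempts. The sketch below is illustrative only; `try_lean_proof` is a hypothetical stand-in for the paper's auto-formalization plus Lean proving pipeline, stubbed here so the snippet runs.

```python
import random

def try_lean_proof(problem: str, answer: str) -> bool:
    """Stand-in for one auto-formalization + Lean proof attempt.
    Stubbed as a coin flip purely so the sketch is runnable; a real
    system would invoke the prover here."""
    return random.random() < 0.3

def verify_with_attempts(problem: str, answer: str, k: int) -> bool:
    """pass@k-style verification: the answer counts as verified if any of
    k proof attempts succeeds, which is why recall grows with k."""
    return any(try_lean_proof(problem, answer) for _ in range(k))

# If a single attempt succeeds with probability p, k independent attempts
# succeed with probability 1 - (1 - p)**k. In practice, attempts on the same
# hard problem are correlated, so the observed recall (~90% at pass@16)
# sits below this independent-attempt bound.
print(verify_with_attempts("some problem", "some answer", k=16))
```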
6.2.5. Impact of Hyperparameter c
The hyperparameter c in Eq. 5 controls the penalty strength on the unverified majority proposal in the ResZero reward (an illustrative sketch of this reward shape follows the list below).
The following figure (Figure 4 from the original paper) shows the training dynamics under different values of the hyperparameter c:
Figure 4: training dynamics under different values of c; the left and center panels show accuracy and reward, and the right panel shows the average number of unique answers.
- Non-zero c for diversity: The right panel shows that a non-zero c is essential for maintaining solution diversity. When c = 0, the ResZero framework degenerates to a zero-reward fallback (as the penalty term vanishes), leading to a sharp decline in the average number of unique answers – a symptom of mode collapse. Any positive value successfully sustains a high level of diversity.
- Trade-off in value: The left and center panels (accuracy and reward) reveal that an overly aggressive penalty (i.e., a very large c) can slightly suppress the final reward and accuracy. This implies that while the penalty is crucial for exploration, an excessive value may overly restrict the policy from exploiting a potentially correct, high-consensus answer.
- Optimal balance: A moderately tuned c (the value used in the experiments) strikes an optimal balance, ensuring robust training stability and solution diversity without compromising convergence on primary task objectives.
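Since Eq. 5 is not reproduced in this summary, the following is a minimal illustrative sketch of the reward shape the bullets describe, under the assumption that unverified majority rollouts receive a penalty of strength c and the residual rollouts receive a compensating positive reward so the group mean is zero; the paper's exact normalization, including its variance-preserving scaling, is omitted here.

```python
import numpy as np

def reszero_rewards(answers: list[str], majority: str, c: float) -> np.ndarray:
    """Illustrative ResZero-style reward for one group of rollouts whose
    majority-voted answer could NOT be verified. Sketch only, not Eq. 5:
    majority rollouts are penalised with strength c, and residual rollouts
    get a compensating reward so the group reward is zero-mean."""
    is_majority = np.array([a == majority for a in answers])
    n_major = int(is_majority.sum())
    n_resid = len(answers) - n_major
    if c == 0.0 or n_resid == 0:
        # With c = 0 (or no residual answers) this degenerates to the
        # zero-reward fallback discussed above.
        return np.zeros(len(answers))
    return np.where(is_majority, -c, c * n_major / n_resid)

# Example: 8 rollouts, 5 of which vote for the unverifiable answer "42".
rollout_answers = ["42"] * 5 + ["40", "41", "7"]
print(reszero_rewards(rollout_answers, "42", c=0.5))
# -> [-0.5 -0.5 -0.5 -0.5 -0.5  0.833  0.833  0.833], group mean 0
```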
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces JURY-RL, a novel label-free Reinforcement Learning with Verifiable Rewards (RLVR) framework. Its core innovation lies in the "Votes Propose, Proofs Dispose" paradigm, where majority voting from model rollouts proposes a candidate answer, and a formal Lean verifier disposes the final reward. This design ensures truth-alignment through provable correctness and scalability by eliminating reliance on expensive human-annotated labels. A key component, the ResZero reward, addresses the challenge of inconclusive verification by providing a zero-mean, variance-preserving signal that prevents training collapse and fosters stable optimization without reinforcing spurious consensus. Empirical results across diverse benchmarks (mathematical reasoning, code generation, multi-task) confirm that JURY-RL achieves superior training stability, outperforms existing label-free baselines, and even matches or surpasses supervised training with ground-truth rewards, particularly in out-of-domain generalization and solution diversity.
7.2. Limitations & Future Work
The authors highlight the following limitations and potential future work:
- Verifier Imperfections: While the Lean verifier is theoretically sound, its practical pipeline (involving auto-formalization and consistency checks) can introduce errors, leading to a precision gap from 100%. Improving these upstream components is a clear area for future work to enhance the verifier's overall signal quality.
- Computational Cost: The Lean verifier introduces significant computational overhead during the initial "cold-start" phase of training compared to LLM-as-a-Judge. Although this cost is amortized through caching as training progresses (a minimal caching sketch follows this list), further research could focus on optimizing the efficiency of formal verification or developing more lightweight, yet still robust, verifiers.
- Generalizability of Formal Verification: While the current work focuses on mathematical and code domains where formal verification is well-defined, extending RLVR to more subjective or ambiguous reasoning tasks (e.g., natural language understanding, common-sense reasoning) would require developing new types of verifiable rewards or hybrid verification mechanisms.
- Hyperparameter Tuning: The parameter c in ResZero requires careful tuning to balance exploration and exploitation, an area that could benefit from adaptive or automated tuning strategies.
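To make the amortization point concrete, one way such caching could work is to memoise verification results per (problem, candidate answer) pair. The sketch below is an assumption about the mechanism rather than the paper's implementation, and `lean_verify` is a hypothetical stand-in for the real prover call, stubbed so the snippet runs.

```python
from functools import lru_cache

def lean_verify(problem: str, answer: str) -> bool:
    """Hypothetical stand-in for the expensive auto-formalization +
    Lean proving pipeline; stubbed to False so the sketch is runnable."""
    return False

@lru_cache(maxsize=None)
def cached_verify(problem: str, answer: str) -> bool:
    """Memoised proof check keyed on (problem, answer). Because rollouts for
    the same problem keep proposing the same majority answers, later training
    steps mostly hit the cache, so the prover's cost is paid only once per
    distinct pair."""
    return lean_verify(problem, answer)
```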
7.3. Personal Insights & Critique
This paper presents a compelling approach to label-free RLVR that addresses critical shortcomings of previous methods. The "Votes Propose, Proofs Dispose" paradigm is particularly insightful, elegantly solving the scalability-truth-alignment dilemma. By focusing the expensive, high-precision formal verification on a single, majority-voted candidate, it makes label-free verification practical.
The ResZero reward is a significant contribution to RL stability. Many RL methods struggle when rewards are sparse or when negative signals are difficult to formulate constructively. ResZero's ability to penalize unverified consensus while maintaining variance and zero-mean properties offers a principled way to guide exploration in ambiguous situations, which is highly valuable beyond just RLVR. This technique could potentially be adapted to other RL settings where reward signals are noisy or intermittent.
One area for further exploration could be the interaction between the KL regularization term in GRPO and the ResZero reward. While ResZero encourages diversity by penalizing spurious consensus, KL regularization pulls the policy back towards the reference policy, which might implicitly limit exploration or diversity. Understanding this interplay, and perhaps dynamically adjusting the KL coefficient based on ResZero's signal, could yield further improvements.
Another interesting direction could be to explore softer forms of "proof" for domains beyond formal math and code. While formal verifiers provide binary (0/1) correctness, could a similar "propose and dispose" framework be adapted using, for example, ensemble agreement or chain-of-thought consistency from multiple specialized LLMs acting as "weak verifiers" in less formal domains, with sophisticated conflict resolution mechanisms? This would push the boundaries of RLVR into broader LLM reasoning tasks where formal proofs are currently infeasible.
Finally, the discussion of the Lean verifier's precision versus recall and its computational overhead is commendably transparent. The observation that high precision is paramount for stable training (even with moderate recall) offers a valuable lesson for designing reward functions in RL. It underscores that reward fidelity is more crucial than reward coverage when dealing with self-supervised learning where false positives can be catastrophic. The amortization strategy for verifier cost also demonstrates a practical approach to integrating expensive external tools into LLM training pipelines.