JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
TL;DR Summary
JURY-RL separates answer proposal (majority voting over model rollouts) from reward disposal (formal theorem proving), and falls back to a ResZero reward when verification is inconclusive. This stabilizes RL training, outperforms label-free baselines on reasoning and code tasks, and often rivals supervised training with ground-truth rewards.
Abstract
Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but its scalability is hampered by the high cost of human-annotated labels. Label-free alternatives, such as majority voting or LLM-as-a-judge, are susceptible to false positives that lead to reward hacking and training collapse. We introduce JURY-RL, a label-free RLVR framework that separates answer proposal from reward disposal: votes from model rollouts propose a consensus answer, while a formal theorem prover disposes the final reward. Specifically, a rollout is rewarded only if the majority-voted answer is formally verified by a Lean prover. When verification is inconclusive, we activate our proposed ResZero (Residual-Zero) reward: it drops the unverifiable majority proposal and assigns a zero-mean, variance-preserving reward to the remaining (residual) answers.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
1.2. Authors
Anonymous authors (Paper under double-blind review)
1.3. Journal/Conference
The paper is published at OpenReview, indicated by the provided link https://openreview.net/forum?id=tnfvv9Wsw9. OpenReview is a platform for conference submissions, often used for machine learning and artificial intelligence conferences, particularly those employing double-blind peer review. Publication on OpenReview implies it is undergoing or has undergone peer review for a reputable conference, though the specific conference is not named in the provided text.
1.4. Publication Year
2025 (Published at UTC: 2025-10-08T00:00:00.000Z)
1.5. Abstract
The paper introduces JURY-RL, a novel framework designed to overcome the scalability challenges of Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models (LLMs), which typically suffer from the high cost of human-annotated labels. Existing label-free methods, such as majority voting or LLM-as-a-judge, are prone to false positives, leading to reward hacking and training collapse. JURY-RL addresses this by separating answer proposal (via majority voting from model rollouts) from reward disposal (via a formal theorem prover). A rollout receives a positive reward only if its majority-voted answer is formally verified by a Lean prover. If verification is inconclusive, JURY-RL activates a ResZero (Residual-Zero) reward mechanism. This mechanism drops the unverifiable majority proposal and assigns a zero-mean, variance-preserving reward to the remaining (residual) answers. This design ensures a stable optimization gradient for RL algorithms without reinforcing spurious consensus. Experiments across mathematical reasoning, code generation, and multi-task benchmarks demonstrate that JURY-RL achieves more stable training, consistently outperforms label-free baselines, and often matches or surpasses supervised training with ground-truth rewards across pass@1 and pass@k metrics.
1.6. Original Source Link
Official Source: https://openreview.net/forum?id=tnfvv9Wsw9
PDF Link: https://openreview.net/pdf?id=tnfvv9Wsw9
Publication Status: The paper is listed as "Paper under double-blind review" at OpenReview, suggesting it is a submission to a peer-reviewed conference.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the challenge of reliably enhancing the reasoning capabilities of Large Language Models (LLMs) using Reinforcement Learning with Verifiable Rewards (RLVR). While RLVR offers a principled way to align LLMs with provably correct outputs (rather than merely plausible ones), its widespread adoption is severely hindered by the high cost and limited coverage of human-annotated labels or carefully curated specifications.
Existing label-free alternatives, which aim to reduce this cost, introduce their own set of significant problems:
- Majority Voting: Approaches like majority voting (where a consensus answer from multiple model rollouts is rewarded) are susceptible to false positives. Models can learn to satisfy the surrogate reward signal by converging on a popular but incorrect answer, a phenomenon known as reward hacking. This often leads to training collapse, where the model's performance degrades over time.
- LLM-as-a-Judge: Using another LLM to evaluate responses can reduce self-confirmation bias but suffers from high false positives, sensitivity to prompting and temperature, nontrivial compute overhead, and the risk of inheriting shared biases from the judge LLM. Judges are also vulnerable to instruction and format manipulation.

The fundamental challenge, as argued by the authors, is that a robust reward signal for label-free RLVR must satisfy three properties simultaneously:

- Scalability: Must work without costly human supervision.
- Truth-alignment: Must reward verifiable correctness, not error-prone consensus.
- Optimization-stability: Must allow learning even when verification is inconclusive.

Previous methods have failed to meet all three, often sacrificing truth-alignment for scalability or suffering instabilities. The paper's innovative idea is to resolve this conflict by decoupling answer proposal from reward disposal, introducing a "Votes Propose, Proofs Dispose" paradigm.
2.2. Main Contributions / Findings
The paper introduces JURY-RL and makes the following primary contributions:
- Novel Label-Free RLVR Framework (JURY-RL): JURY-RL operationalizes the "votes propose, proofs dispose" paradigm. It strategically uses votes from model rollouts to propose a single candidate answer (for scalability) and then employs a formal Lean verifier to dispose the final reward (for truth alignment). This design aligns rewards with provable correctness without human annotations, suppressing the false positives common in other label-free methods.
- Principled Fallback Mechanism (ResZero Reward): For situations where formal verification is inconclusive, JURY-RL introduces the ResZero (Residual-Zero) reward. This mechanism discards the unverifiable majority proposal and assigns a zero-mean, variance-preserving reward to the remaining (residual) answers. This ensures stable optimization and prevents training collapse from spurious consensus, addressing the optimization-stability requirement.
- State-of-the-Art Performance and Stability: Through extensive experiments across mathematical reasoning, code generation, and multi-task benchmarks, JURY-RL demonstrates:
  - More stable training compared to label-free baselines, avoiding mode collapse.
  - Consistently better results than other label-free methods.
  - Performance that matches or even surpasses supervised training with ground-truth (GT) rewards across pass@1 and pass@k metrics, especially in out-of-domain generalization, suggesting that the proof-gated reward can provide a superior learning signal.
  - Enhanced solution diversity, as evidenced by improvements in pass@k over pass@1 and a sustained higher average number of unique answers.

These findings collectively demonstrate that grounding RL in sparse but formally verified signals is a promising strategy for building robust and generalizable reasoning models without relying on human labels.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Large Language Models (LLMs): Advanced neural network models, typically based on the transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language, performing tasks like translation, summarization, question answering, and code generation. Examples include GPT-4, Qwen, and Llama.
- Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, receiving feedback in the form of rewards or penalties for its actions.
- Reinforcement Learning with Verifiable Rewards (RLVR): An extension of RL where the reward signal for an LLM's output is derived from an external, objective verification process rather than human feedback or simple textual comparison. This verification often involves execution (e.g., running code, evaluating mathematical expressions) or formal proof. The goal is to align the LLM's reasoning with provable correctness.
- Reward Hacking: A phenomenon in RL where an agent learns to exploit flaws or unintended loopholes in the reward function to maximize its reward without actually achieving the intended goal. This can lead to models that appear to perform well according to the reward signal but are fundamentally misaligned with the true objective.
- Training Collapse / Mode Collapse: In generative models or RL, a situation where the model loses its ability to generate diverse or correct outputs. Instead, it converges to producing a limited set of (spurious consensus) outputs, often due to an unstable or misleading reward signal that reinforces a narrow, potentially incorrect, subset of behaviors.
- Formal Theorem Prover: A software tool used to assist in the creation and verification of mathematical proofs. It provides a formal language to state mathematical theorems and a logical engine to check the correctness of proofs step-by-step against a set of axioms and inference rules.
  - Lean Prover: An interactive theorem prover and programming language (Lean 4 is the latest version) developed at Microsoft Research. It combines a programming language with a powerful proof assistant, allowing users to write machine-verifiable formal proofs. It is known for its strong type-theoretic foundation and active community in formalizing mathematics.
- Optimization Gradient: In machine learning, the gradient is a vector that points in the direction of the steepest ascent of a function (e.g., a loss or reward function). Optimization algorithms use this gradient to adjust model parameters iteratively, moving towards a minimum (for loss) or maximum (for reward). A stable optimization gradient means that updates are consistent and do not lead to erratic or divergent training behavior.
- Zero-mean, variance-preserving reward: A reward signal that, within a group of samples, has an average value of zero. This property matters for RL algorithms that normalize rewards, because it prevents the learning signal from constantly pushing the policy in one direction (e.g., always increasing or decreasing token probabilities). Variance-preserving means that even with a zero mean there is still enough variability in the rewards to provide informative gradients, allowing the RL agent to differentiate between better and worse actions within the group.
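The following is a minimal sketch, not from the paper's code, illustrating why a zero-mean, variance-preserving group reward still yields informative normalized advantages while an all-zero fallback does not; the reward values are invented for illustration.

```python
# Minimal sketch: group-normalized advantages under two fallback rewards.
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# All-zero fallback: mean and variance are both zero -> every advantage is 0.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))      # [0.0, 0.0, 0.0, 0.0]

# Zero-mean, variance-preserving fallback: the group mean is exactly zero, yet
# the advantages still differentiate rollouts (roughly [0.97, 0.97, -0.65, -1.30]).
print(group_advantages([0.3, 0.3, -0.2, -0.4]))
```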
3.2. Previous Works
The paper categorizes previous label-free self-reward methods based on the source of the reward signal:
- Model-Internal Self-Reward Signals:
  - Output-side proxies:
    - Entropy Minimization: Rewards outputs that exhibit high certainty (low entropy) in the model's predictions. The reward is inversely related to the policy's output entropy for the final-answer tokens.
    - Self-Certainty: Rewards hypotheses with high confidence scores, often derived from the log-probabilities of the tokens forming the final answer.
    - Critique: Both entropy minimization and self-certainty are fragile, as they can amplify errors when the model becomes confidently wrong, potentially leading to reward hacking.
  - Agreement-based methods:
    - Single-view agreement (Majority Voting): For multiple responses ($G$ rollouts) to the same input, a majority-voted answer is identified, and responses matching this consensus receive a positive reward.
    - Critique: Prone to reinforcing an erroneous consensus, where the model converges on a popular but incorrect answer by exploiting superficial heuristics (e.g., formatting). This can lead to entropy collapse (the model generates very similar, often incorrect, outputs) and policy degradation. The paper provides a theoretical analysis in Appendix A.1 explaining why naive majority voting is brittle. The group-normalized advantages under a Majority Voting (MV) reward are:
$
\hat{A}_{i} = \begin{cases} \sqrt{\dfrac{1-\hat{v}}{\hat{v}}}, & a_{i} = \hat{a} \\[4pt] -\sqrt{\dfrac{\hat{v}}{1-\hat{v}}}, & a_{i} \neq \hat{a} \end{cases}
$
      where $\hat{v}$ is the vote share of the majority answer $\hat{a}$. As $\hat{v} \to 1$, supporters get a vanishing signal ($\hat{A}_{i} \to 0^{+}$) and dissenters get diverging penalties ($\hat{A}_{i} \to -\infty$), suppressing exploration and shrinking token-level entropy.
    - Multi-view agreement (e.g., CoReward): Attempts to improve robustness by enforcing consistency across multiple, semantically equivalent prompts. A majority answer from one prompt variant is used as a pseudo-label for responses from the original prompt.
    - Critique: While often improving training stability initially, it can delay rather than eliminate hacking, as spurious shortcuts can eventually propagate across multiple views.
- External-Judge Signals:
  - LLM-as-a-Judge: Uses a powerful, external LLM to automatically score the model's outputs.
  - Critique: Mitigates self-confirmation bias but introduces sensitivity to prompt design, high computational cost, and the risk of transferring the judge's intrinsic biases. It is vulnerable to prompt/format gaming and false positives.
  - LLM-based Knowledge Distillation (LLM-KD): A teacher model (often a more capable LLM) generates a reference answer to guide the student model.
  - Critique: Offers a potentially more granular signal but is constrained by the teacher's capabilities and biases.
- Verification-Based Training: This paradigm uses externally checkable signals to determine correctness.
  - Program execution and unit tests: Common for code generation tasks.
  - SMT (Satisfiability Modulo Theories) solvers: Used for logical verification.
  - Formal proof assistants (e.g., Lean/Coq): Used to mathematically verify proofs.
  - Critique: Previous generate-then-verify pipelines often provide no learning signal when verification fails, limiting stability and sample efficiency.

The paper adopts Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as its RL algorithm. GRPO is designed to handle grouped rewards and estimate group-normalized advantages. The GRPO objective is:
$
\max_{\pi_{\theta}} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\left[ r\left(x, y; \mathcal{G}_{x}\right) - \beta\, \mathrm{D}_{\mathrm{KL}}\left(\pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right) \right]
$
where:
- $\pi_{\theta}$ is the policy LLM with parameters $\theta$.
- $x$ is a problem sampled from dataset $\mathcal{D}$.
- $y$ is a token sequence (response) sampled from $\pi_{\theta}(\cdot \mid x)$.
- $r(x, y; \mathcal{G}_{x})$ is a grouped reward computed from the $G$ rollouts $\mathcal{G}_{x}$ for the same input $x$.
- $\beta$ is the KL penalty coefficient.
- $\mathrm{D}_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence between the current policy and a reference policy $\pi_{\mathrm{ref}}$. This term regularizes training to prevent the policy from drifting too far from the reference (e.g., the base LLM).

The scalar group advantage in GRPO is calculated as:
$
\hat{A}_{i} = \frac{r_{i} - \bar{r}}{\operatorname{std}\left(\{r_{j}\}_{j=1}^{G}\right) + \varepsilon}, \qquad \bar{r} = \frac{1}{G} \sum_{j=1}^{G} r_{j}
$
where:
- $r_{i}$ is the reward for the $i$-th rollout.
- $\bar{r}$ is the mean reward across the group of $G$ rollouts.
- $\operatorname{std}(\{r_{j}\}_{j=1}^{G})$ is the standard deviation of rewards across the group.
- $\varepsilon$ is a small constant added for numerical stability, preventing division by zero.

GRPO then maximizes the clipped objective:
$
J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_{i}|} \sum_{t=1}^{|y_{i}|} \min\left(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\; \operatorname{clip}\left(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t}\right) - \beta\, D_{\mathrm{KL}}\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right)\right]
$
where:
- $\rho_{i,t}(\theta)$ is the per-token ratio of the new policy's probability to the old policy's probability for token $t$ of rollout $i$.
- $\hat{A}_{i,t}$ is a broadcast of $\hat{A}_{i}$ (the group advantage) to each token, or a per-token variant.
- $\operatorname{clip}(\cdot,\, 1-\epsilon,\, 1+\epsilon)$ clips the policy ratio to a specified range to prevent excessively large policy updates, ensuring training stability.
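A minimal PyTorch-style sketch of the GRPO pieces described above (group-normalized advantages broadcast token-wise, clipped-ratio objective). This is an illustrative reconstruction under the stated formulas, not the paper's implementation; the KL-to-reference penalty is omitted and all tensor shapes are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) grouped rewards for one prompt -> (G,) scalar advantages."""
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

def grpo_token_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """logp_new/logp_old: (G, T) per-token log-probs; advantages: (G,) broadcast to tokens."""
    ratio = torch.exp(logp_new - logp_old)               # rho_{i,t}
    adv = advantages.unsqueeze(-1)                       # broadcast A_i to every token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    # Negative sign: minimizing this loss maximizes the clipped objective.
    # (Averaging over all tokens assumes equal-length rollouts for simplicity.)
    return -torch.min(unclipped, clipped).mean()

# Example: 4 rollouts, 3 tokens each.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = grpo_advantages(rewards)
logp_old = torch.randn(4, 3)
logp_new = logp_old + 0.01 * torch.randn(4, 3)
print(grpo_token_loss(logp_new, logp_old, adv).item())
```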
3.3. Technological Evolution
The field has evolved from basic LLMs to sophisticated models that require alignment with human values and factual correctness. Initially, LLMs were pre-trained on vast text corpora, excelling at language generation but often lacking reliable reasoning or factual grounding. Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) emerged to align models with human preferences. However, RLHF is costly and subjective. RLVR represents a step further, aiming for objective and provable correctness, particularly in domains like mathematics and code. The evolution has been towards label-free methods to scale this alignment, but these often struggle with truth-alignment and optimization stability. JURY-RL fits into this timeline by attempting to provide a scalable, truth-aligned, and stable label-free RLVR solution, addressing the shortcomings of previous label-free attempts and improving upon supervised RLVR by reducing reliance on costly ground truth.
3.4. Differentiation Analysis
Compared to existing methods, JURY-RL differentiates itself primarily through its "Votes Propose, Proofs Dispose" paradigm:
- Compared to Model-Internal Self-Reward (Majority Voting, Self-Certainty, Entropy):
  - Core Difference: JURY-RL does not reward mere popularity or model certainty. Instead of directly rewarding the majority-voted answer (which can be incorrect), it only proposes the majority answer for verification; the actual reward is proof-gated.
  - Innovation: This prevents the reward hacking and training collapse common in internal methods, which are vulnerable to spurious consensus or confidently wrong predictions. JURY-RL aligns rewards with provable correctness.
- Compared to External-Judge Signals (LLM-as-a-Judge, LLM-KD):
  - Core Difference: JURY-RL replaces the fallible LLM-as-a-Judge with a formal theorem prover (Lean). While LLM judges are prone to false positives, prompt sensitivity, computational cost, and bias transfer, a formal verifier offers a much higher degree of truth-alignment and precision.
  - Innovation: The Lean verifier, under standard soundness assumptions, provides a near-zero false-positive rate, directly binding the learning signal to hard evidence. This significantly reduces the attack surface for prompt and format hacking.
- Compared to Supervised GRPO (Ground-Truth rewards):
  - Core Difference: JURY-RL is label-free, eliminating the need for expensive human-annotated ground truth.
  - Innovation: Despite being label-free, JURY-RL's proof-gated rewards can sometimes outperform supervised GRPO with ground-truth rewards, particularly in out-of-domain generalization. This suggests that the signal from formal verification can be a more robust and generalizable learning objective than direct supervision, as it encourages learning underlying principles rather than overfitting to stylistic patterns.
- Unique Fallback Mechanism (ResZero):
  - Core Difference: When formal verification is inconclusive, JURY-RL does not simply assign zero reward (which stalls learning) or reward the unverified majority (which leads to collapse).
  - Innovation: ResZero penalizes the unverifiable majority and assigns a zero-mean, variance-preserving reward to the residual answers. This maintains a stable optimization gradient and encourages exploration among minority opinions without amplifying false consensus.

In essence, JURY-RL uniquely combines the scalability of label-free voting with the rigor of formal verification, while providing a robust fallback mechanism to ensure optimization stability, a crucial combination that previous methods failed to achieve.
4. Methodology
4.1. Principles
The core idea of JURY-RL is the "Votes Propose, Proofs Dispose" paradigm. This means that the process of generating a candidate answer and the process of evaluating its correctness (and thus assigning a reward) are decoupled.
- Votes Propose: The policy LLM generates multiple candidate solutions (rollouts). A simple, computationally cheap majority vote among these rollouts identifies a single consensus candidate (the proposal). This step ensures scalability by limiting the number of expensive verification calls.
- Proofs Dispose: A formal theorem prover (specifically, a Lean verifier) acts as an objective "judge" to verify the correctness of the single proposed candidate. The reward for all rollouts is then "disposed" (assigned) based on the outcome of this formal verification. This step ensures truth-alignment by grounding rewards in provable correctness.

This dual mechanism is designed to satisfy three core principles simultaneously:

- Scalability: Achieved by relying on label-free rollouts and verifying only a single majority-voted candidate, avoiding the prohibitive cost of verifying every unique answer.
- Truth Alignment: Ensured by using a formal verifier that provides provable correctness with near-zero false positives.
- Optimization Stability: Maintained by the ResZero reward mechanism, which provides a meaningful learning signal even when formal verification is inconclusive, preventing training stalls or collapse.
4.2. Core Methodology In-depth (Layer by Layer)
The JURY-RL framework operates as follows:
4.2.1. Proposal Stage: Majority Voting
For a given problem $x$, the policy LLM $\pi_{\theta}$ generates $G$ trajectories (rollouts) $\{y_{i}\}_{i=1}^{G}$ through sampling. From each trajectory $y_{i}$, a deterministic extractor function $\operatorname{ans}(\cdot)$ parses a candidate answer $a_{i} = \operatorname{ans}(y_{i})$.
The majority vote then determines the most frequent answer among these candidates. This most frequent answer becomes the candidate proposal, denoted $\hat{a}$:
$
\hat{a} = \arg\max_{a} \sum_{i=1}^{G} \mathbb{I}\left[a_{i} = a\right]
$
Here:
- $\hat{a}$ is the majority-voted answer.
- $a_{i}$ is the answer parsed from the $i$-th rollout $y_{i}$.
- $\mathbb{I}[\cdot]$ is the indicator function, which returns 1 if the condition inside the brackets is true and 0 otherwise; the sum counts how many rollouts produced a specific answer $a$.
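A minimal sketch of the proposal stage: parse an answer from each rollout and take the most frequent one as the single candidate to verify. The boxed-answer regex extractor is an assumption for illustration; the paper does not specify the parsing rule.

```python
import re
from collections import Counter

def ans(rollout_text: str) -> str:
    """Deterministic answer extractor; assumes a '\\boxed{...}' final answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", rollout_text)
    return matches[-1].strip() if matches else ""

def propose_majority(rollouts: list[str]) -> tuple[str, list[str]]:
    answers = [ans(y) for y in rollouts]
    majority, _count = Counter(answers).most_common(1)[0]   # a_hat = argmax_a sum_i 1[a_i = a]
    return majority, answers

rollouts = [r"... \boxed{42}", r"... \boxed{42}", r"... \boxed{41}"]
print(propose_majority(rollouts))   # ('42', ['42', '42', '41'])
```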
4.2.2. Disposal Stage: Proof-Gated Reward Framework
The candidate proposal $\hat{a}$ is then passed to an external Lean verifier. The verifier evaluates the correctness of $\hat{a}$ against a formal specification of the problem $x$. The outcome is a binary proof-gate signal $\delta$:
$
\delta = \operatorname{verify}(x, \hat{a}) \in \{0, 1\}
$
Here:
- $\operatorname{verify}(x, \hat{a})$ is the call to the Lean verifier.
- $\delta = 1$ if $\hat{a}$ is formally certified correct for problem $x$.
- $\delta = 0$ if verification is inconclusive.

The final reward for each trajectory is determined by a conditional function gated by $\delta$:
$
r_{i} = \underbrace{\delta \cdot \mathbb{I}\left[a_{i} = \hat{a}\right]}_{\text{Verified Correctness}} + \underbrace{(1-\delta) \cdot r_{i}^{\text{ResZero}}}_{\text{Inconclusive Fallback}}
$
This formula describes the proof-gated reward mechanism:

- Case 1: Verification succeeds ($\delta = 1$). The fallback term $(1-\delta) \cdot r_{i}^{\text{ResZero}}$ vanishes and the reward simplifies to $r_{i} = \mathbb{I}[a_{i} = \hat{a}]$. This means:
  - Rollouts that produced the proven-correct majority answer $\hat{a}$ receive a positive reward of 1.
  - Rollouts that produced any other answer (not $\hat{a}$) receive a reward of 0.
  This directly binds the learning signal to hard evidence, preventing false positives.
- Case 2: Verification is inconclusive ($\delta = 0$). The verified-correctness term $\delta \cdot \mathbb{I}[a_{i} = \hat{a}]$ vanishes, and the reward is determined by the ResZero mechanism, $r_{i} = r_{i}^{\text{ResZero}}$.
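A minimal sketch of the disposal stage: the per-rollout reward is gated by the verifier's verdict on the single majority candidate. The `verify` and `reszero_rewards` callables are stand-ins (not real APIs); a ResZero sketch appears in the next subsection.

```python
from typing import Callable, Sequence

def proof_gated_rewards(
    problem: str,
    answers: Sequence[str],
    majority: str,
    verify: Callable[[str, str], bool],
    reszero_rewards: Callable[[Sequence[str], str], list],
) -> list:
    delta = 1 if verify(problem, majority) else 0        # one verifier call per group
    if delta == 1:
        # Verified: reward 1 only for rollouts matching the proven majority answer.
        return [1.0 if a == majority else 0.0 for a in answers]
    # Inconclusive: fall back to the zero-mean ResZero reward.
    return reszero_rewards(answers, majority)
```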
4.2.3. ResZero (Residual-Zero) Reward
The ResZero reward is a fallback mechanism for when verification is inconclusive ($\delta = 0$). Its purpose is to provide a stable learning signal without reinforcing a potentially incorrect majority or stalling learning with a zero reward. It works by penalizing the unverifiable majority proposal and constructing a zero-mean reward from the remaining (residual) answers.
Define the groups of rollouts:
- $M = \{i : a_{i} = \hat{a}\}$: the set of rollouts supporting the majority answer.
- $R = \{i : a_{i} \neq \hat{a}\}$: the set of residual rollouts (those not supporting the majority).
The majority share is $\hat{v} = |M| / G$, where $|M|$ is the number of rollouts in the majority group and $G$ is the total number of rollouts.
The leave-one-out residual share for an answer $b$ within the residual group is defined as:
$
u^{(-i)}(b) = \frac{1}{|R| - 1} \sum_{\substack{j \in R \\ j \neq i}} \mathbb{I}\left[a_{j} = b\right]
$
Here:
- $u^{(-i)}(b)$ is the proportion of the other residual rollouts (excluding rollout $i$ itself) that also produced answer $b$; it measures the relative support for an answer among the minority opinions.
- Note: if $|R| \leq 1$, the denominator $|R| - 1$ vanishes, an edge case requiring special handling; the definition above covers the general case $|R| > 1$.
A temporary variable $z_{i}$ is then defined:
- $z_{i} = u^{(-i)}(a_{i})$ if $i \in R$ (the relative support for the residual answer within its peer group).
- $z_{i} = 0$ if $i \in M$.
The ResZero reward is then assigned as:
$
r_{i}^{\text{ResZero}} = \underbrace{\alpha \cdot \mathbb{I}[i \in R] \cdot \left(z_{i} - \bar{u}\right)}_{\text{Amplify residual signals}} \;-\; \underbrace{c\,\alpha \cdot \mathbb{I}[i \in M]}_{\text{Penalize majority}} \;+\; \underbrace{\gamma}_{\text{Global re-centering}}
$
where:
- $\alpha$ is a positive hyperparameter that scales the signal, and $c$ controls the penalty strength applied to the unverified majority.
- $\bar{u}$ is the average support across the residual group, i.e., the mean of $z_{i}$ over $i \in R$. The term $(z_{i} - \bar{u})$ creates a zero-mean signal within the residual group, rewarding residual answers that are more popular among the minorities and penalizing less popular ones.
- $\gamma$ is a global re-centering term designed to ensure the total reward sums to zero. The derivation in Appendix A.2 shows that setting $\gamma = c\,\alpha\,\hat{v}$ achieves this.

Derivation of $\gamma$ (from Appendix A.2), to ensure $\sum_{i=1}^{G} r_{i}^{\text{ResZero}} = 0$:
- Sum of rewards for the residual group $R$: $\sum_{i \in R} r_{i}^{\text{ResZero}} = \alpha \sum_{i \in R} (z_{i} - \bar{u}) + |R|\,\gamma = |R|\,\gamma$, since deviations from the group mean sum to zero.
- Sum of rewards for the majority group $M$: $\sum_{i \in M} r_{i}^{\text{ResZero}} = -c\,\alpha\,|M| + |M|\,\gamma$.
- Combine the sums: since $|M| + |R| = G$ (the total number of rollouts), the total is $-c\,\alpha\,|M| + G\,\gamma$.
- Set the total to zero: $\gamma = c\,\alpha\,|M| / G = c\,\alpha\,\hat{v}$.
Therefore, the ResZero reward is strictly zero-centered. This design also ensures:
- Variance preservation: It maintains non-zero variance among differing answers (as long as residual answers are not all identical), providing an informative gradient for GRPO.
- Adaptive economy: The corrective signal dynamically scales with the majority share $\hat{v}$, applying stronger pressure when the model is confidently wrong.
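A minimal sketch of the ResZero fallback as reconstructed above: residual rollouts get a leave-one-out support signal centered at zero, majority rollouts get a penalty, and the re-centering term keeps the group mean at exactly zero. The hyperparameter values `alpha=1.0` and `c=1.0` are illustrative assumptions, not the paper's settings.

```python
def reszero_rewards(answers, majority, alpha=1.0, c=1.0):
    G = len(answers)
    residual_idx = [i for i, a in enumerate(answers) if a != majority]
    v_hat = (G - len(residual_idx)) / G                       # majority share
    gamma = c * alpha * v_hat                                 # global re-centering term

    # Leave-one-out support z_i = u^{(-i)}(a_i) within the residual group.
    z = {}
    for i in residual_idx:
        others = [answers[j] for j in residual_idx if j != i]
        z[i] = others.count(answers[i]) / max(len(others), 1)
    u_bar = sum(z.values()) / max(len(z), 1)                  # mean residual support

    rewards = []
    for i, a in enumerate(answers):
        if a == majority:
            rewards.append(-c * alpha + gamma)                # penalize unverified majority
        else:
            rewards.append(alpha * (z[i] - u_bar) + gamma)    # amplify residual signal
    return rewards

# 8 rollouts: 4 on the (unverified) majority, residuals split 3-to-1.
r = reszero_rewards(["A"] * 4 + ["B"] * 3 + ["C"], majority="A")
print([round(x, 3) for x in r], "sum =", round(sum(r), 6))    # sum = 0.0
```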
4.2.4. Lean Verifier Pipeline
The Lean verifier itself is a multi-component system designed for reliable and precise reward signals, as illustrated in Image 4. It comprises:
- Autoformalizer: Translates natural-language mathematical problem statements (question-answer pairs) into precise, executable Lean formal specifications. It uses the Goedel-Formalizer-V2-32B model.
- Consistency-Checker: Performs bidirectional semantic alignment between the original natural-language input and its formalized Lean counterpart. It uses QwQ-32B to validate the Autoformalizer's outputs.
- Prover: Synthesizes formal proof scripts in Lean and submits them to the theorem prover for mechanized validation, certifying logical correctness. It uses the Goedel-Prover-V2-32B model.

The verification pipeline proceeds as follows:
- The Autoformalizer generates 8 candidate formalizations.
- The Consistency-Checker evaluates all candidates and selects the top checked candidate.
- The Prover conducts up to 16 independent sampling trials (pass@16) to find a valid proof for the selected formalization.
The formal-verification models use a temperature of 0.7 and a maximum of 32,768 tokens.
The following figure (Image 4 from the original paper) shows the system architecture: a schematic of the Lean Verifier comprising three GPU modules (Autoformalizer, Consistency-Checker, and Prover), a Lean Server (CPU module), and an underlying cache.
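A schematic sketch of the three-stage verification flow described above. The `autoformalize`, `check_consistency`, and `prove` callables are stubs standing in for the Goedel-Formalizer-V2-32B, QwQ-32B, and Goedel-Prover-V2-32B models; none of these function signatures come from the paper's code.

```python
def verify_with_lean(question: str, answer: str,
                     autoformalize, check_consistency, prove,
                     n_formalizations: int = 8, n_proof_attempts: int = 16) -> bool:
    # 1) Autoformalizer: sample several candidate Lean statements for (question, answer).
    candidates = [autoformalize(question, answer) for _ in range(n_formalizations)]

    # 2) Consistency-Checker: keep candidates judged faithful to the informal statement,
    #    then pick the top-ranked one.
    checked = [c for c in candidates if check_consistency(question, answer, c)]
    if not checked:
        return False                      # nothing faithful to prove -> inconclusive
    formal_statement = checked[0]

    # 3) Prover: up to n_proof_attempts independent samples (Prover@16 by default);
    #    any machine-checked proof certifies the answer.
    return any(prove(formal_statement) for _ in range(n_proof_attempts))
```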
4.2.5. JURY-RL Algorithm
The full algorithm for one grouped update for a prompt $x$ is summarized below.
Algorithm 1 JURY-RL (one grouped update for a prompt $x$)
1: Sample $G$ rollouts $\{y_{i}\}_{i=1}^{G} \sim \pi_{\theta}(\cdot \mid x)$; parse $a_{i} = \operatorname{ans}(y_{i})$.
2: Compute vote shares $v(a)$ and majority $\hat{a}$.
3: Query the verifier once: $\delta = \operatorname{verify}(x, \hat{a})$.
4: if $\delta = 1$ then
5: Set $r_{i} = \mathbb{I}[a_{i} = \hat{a}]$.
6: else
7: Form $M$, $R$; compute $z_{i}$ and $\bar{u}$; set $r_{i} = r_{i}^{\text{ResZero}}$ via Eq. (5) (ResZero reward).
8: end if
9: Compute group-normalized advantages $\hat{A}_{i}$ from $\{r_{i}\}$; broadcast token-wise.
10: Update $\pi_{\theta}$ with GRPO (clipped ratios, KL to reference).
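A condensed sketch of Algorithm 1 for a single prompt, wiring together the helper functions sketched in Sections 4.2.1 through 4.2.3 (`propose_majority`, `proof_gated_rewards`, `reszero_rewards`, `grpo_advantages`). The policy sampling and the GRPO update itself are left abstract; only the reward path is shown, and all names are illustrative.

```python
import torch  # grpo_advantages from the earlier sketch expects a torch tensor

def jury_rl_group_step(prompt, sample_rollouts, verify, grpo_update, G=8):
    rollouts = sample_rollouts(prompt, G)                        # step 1: G rollouts
    majority, answers = propose_majority(rollouts)               # step 2: vote shares and a_hat
    rewards = proof_gated_rewards(                               # steps 3-8: one verifier call,
        prompt, answers, majority, verify,                       # proof-gated or ResZero reward
        reszero_rewards=reszero_rewards,
    )
    advantages = grpo_advantages(torch.tensor(rewards))          # step 9: group-normalized
    grpo_update(prompt, rollouts, advantages)                    # step 10: clipped GRPO + KL
```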
4.2.6. Illustrative Example of ResZero Reward
Consider a group of $G$ rollouts in which the residual (non-majority) group $R$ contains 4 rollouts: 3 support answer $b_{1}$ and 1 supports answer $b_{2}$. Assume verification is inconclusive ($\delta = 0$).
- Residual group ($R$) analysis:
  - For each of the 3 trajectories with answer $b_{1}$: $u^{(-i)}(b_{1}) = 2/3$ (two of the other three residual rollouts agree).
  - For the single trajectory with answer $b_{2}$: $u^{(-i)}(b_{2}) = 0$.
  - The average support across the residual group is $\bar{u} = (3 \cdot \tfrac{2}{3} + 1 \cdot 0)/4 = 0.5$.
- Reward calculation: with the penalty hyperparameters $\alpha$ and $c$ fixed, the global re-centering term is $\gamma = c\,\alpha\,\hat{v}$.
  - Rollouts in $M$ (supporting $\hat{a}$) receive $r^{\text{ResZero}} = -c\,\alpha + \gamma$.
  - Rollouts with $b_{1}$ receive $r^{\text{ResZero}} = \alpha\,(\tfrac{2}{3} - 0.5) + \gamma$. For this entry the paper's worked example lists 0.0667, while recomputing from the formula with the paper's stated values gives 0.1083; the zero-sum check below is only satisfied with the recomputed value, so 0.0667 appears to be a typo in the original text.
  - The rollout with $b_{2}$ receives $r^{\text{ResZero}} = \alpha\,(0 - 0.5) + \gamma$.
  - Total reward sum: $\sum_{i} r_{i}^{\text{ResZero}} = 0$, which demonstrates the zero-mean property.
4.2.7. Fallback Rewards for Inconclusive Verification (Theoretical Analysis from Appendix A.3)
This section analyzes the group-normalized advantage under different fallback mechanisms when $\delta = 0$.
- Case 1: Majority Voting (MV) reward. If verification fails and we naively reward the majority consensus, $r_{i} = \mathbb{I}[a_{i} = \hat{a}]$. Let $\hat{v}$ be the vote share of $\hat{a}$. The mean reward is $\bar{r} = \hat{v}$ and the standard deviation is $\sqrt{\hat{v}(1-\hat{v})}$. The advantage is:
$
\hat{A}_{i} = \begin{cases} \sqrt{\dfrac{1-\hat{v}}{\hat{v}}}, & \text{if } a_{i} = \hat{a} \text{ (Supporter)} \\[4pt] -\sqrt{\dfrac{\hat{v}}{1-\hat{v}}}, & \text{if } a_{i} \neq \hat{a} \text{ (Dissenter)} \end{cases}
$
  - Implication: As a spurious consensus strengthens ($\hat{v} \to 1$), the advantage for supporters vanishes ($\hat{A}_{i} \to 0^{+}$), while the penalty for dissenters diverges ($\hat{A}_{i} \to -\infty$). This suppresses exploration and leads to entropy collapse and policy degradation.
- Case 2: Zero reward. Assign $r_{i} = 0$ for all $i$ when verification is inconclusive. The mean reward is $\bar{r} = 0$ and the standard deviation is 0. The advantage is:
$
\hat{A}_{i} = \frac{0 - 0}{0 + \varepsilon} = 0
$
  - Implication: The advantage signal is nullified, leading to a vanishing gradient and effectively stalling the learning process for that batch. It avoids collapse, but at the cost of learning efficiency.
- Case 3: ResZero reward (the paper's proposal). ResZero by construction has a zero-mean property ($\bar{r} = 0$) and non-zero variance (unless all residual answers are identical). The advantage is:
$
\hat{A}_{i} = \frac{r_{i}^{\text{ResZero}} - 0}{\operatorname{std}\left(\{r_{j}\}\right) + \varepsilon} = \frac{r_{i}^{\text{ResZero}}}{\operatorname{std}\left(\{r_{j}\}\right) + \varepsilon}
$
  - Implication:
    - Prevents stagnation: provides a meaningful gradient, since $\operatorname{std}(\{r_{j}\})$ is non-zero.
    - Corrects spurious consensus: for $i \in M$, $r_{i}^{\text{ResZero}}$ is negative, resulting in a negative $\hat{A}_{i}$ that discourages convergence on the unverified consensus.
    - Guides exploration: for $i \in R$, the sign of $\hat{A}_{i}$ depends on the relative popularity of the answer within the minority, rewarding promising alternative paths and penalizing less popular ones, thus guiding principled exploration.
5. Experimental Setup
5.1. Datasets
The experiments use a diverse range of datasets covering mathematical reasoning, code generation, and general abilities.
- Training Dataset:
  - MATH: 7,500 problems from the MATH dataset's training split (Hendrycks et al., 2021). This dataset contains competition mathematics problems designed to be challenging for LLMs.
    - Sample data: The paper does not provide a specific example of a MATH data sample, but items typically consist of mathematical questions requiring multi-step reasoning and a final numerical or symbolic answer.
- Evaluation Datasets:
  - Mathematical Reasoning:
    - AIME24/25: Problems from the American Invitational Mathematics Examination (Hugging Face H4, 2024; OpenCompass, 2025). These are competition-level math problems.
    - MATH500: A 500-problem evaluation subset derived from the MATH dataset (Lightman et al., 2024). This is considered the in-domain test set.
    - GSM8K: Grade School Math 8K (Cobbe et al., 2021), a dataset of roughly 8,500 grade-school math word problems.
    - AMC: American Mathematics Competitions dataset (math-ai Team, 2024), another set of competition-level math problems.
  - Code Generation:
    - LiveCodeBench: (Jain et al., 2024) A benchmark for code reasoning, understanding, and execution.
    - CRUX: (Gu et al., 2024a) Another benchmark for code reasoning, understanding, and execution.
  - Instruction-Following and Multi-Task:
    - IFEval: (Zhou et al., 2023) Used for evaluating instruction-following capabilities.
    - MMLU-Pro: (Wang et al., 2024) Used for multi-task understanding and general knowledge.

These datasets were chosen to comprehensively assess model capabilities across various domains and difficulty levels, covering both in-domain and out-of-domain generalization.
5.2. Evaluation Metrics
The paper uses the following evaluation metrics:
- Pass@1 Accuracy:
  - Conceptual Definition: Pass@1 measures the proportion of problems for which the model's first sampled solution is correct. It is the standard metric for the direct success rate of a model's primary output.
  - Mathematical Formula:
$
\text{Pass@1} = \frac{\text{Number of problems whose first attempt is correct}}{\text{Total number of problems}}
$
- Pass@k Accuracy:
  - Conceptual Definition: Pass@k estimates the probability that at least one of $k$ independently sampled solutions for a problem is correct. It reflects the model's ability to generate diverse and correct solutions, indicating its exploration capability and overall problem-solving capacity when multiple attempts are allowed.
  - Mathematical Formula: With $n \geq k$ samples generated per problem, of which $c_{i}$ are correct for problem $i$, the standard unbiased estimator is:
$
\text{Pass@}k = \frac{1}{N} \sum_{i=1}^{N} \left(1 - \frac{\binom{n - c_{i}}{k}}{\binom{n}{k}}\right)
$
  When exactly $k$ samples are generated and checked per problem, this reduces to the fraction of problems with at least one correct solution among the $k$ attempts.
  - Symbol Explanation: $k$ is the number of independent solution attempts allowed per problem; $n$ is the number of samples generated; $c_{i}$ is the number of correct samples for problem $i$; $N$ is the total number of problems.
- Precision:
  - Conceptual Definition: In the context of verifier signal quality, precision measures the proportion of solutions identified as "correct" by the verifier that are actually correct (true positives). It quantifies the verifier's ability to avoid false positives.
  - Mathematical Formula:
$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$
- Recall:
  - Conceptual Definition: Recall measures the proportion of all truly correct solutions that the verifier successfully identifies. It quantifies the verifier's ability to find all relevant correct solutions and avoid false negatives.
  - Mathematical Formula:
$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$
- F1-Score:
  - Conceptual Definition: The F1-score is the harmonic mean of precision and recall. It provides a single score balancing both, which is useful when both false positives and false negatives matter.
  - Mathematical Formula:
$
\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$
- Average: The mean score across multiple benchmarks or tasks; the paper uses it to summarize performance across benchmark categories.
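A minimal sketch of the standard unbiased pass@k estimator written out above, assuming $n$ samples per problem with $c$ of them correct; the helper names are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement from n
    generated samples, c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(per_problem, k):
    """per_problem: list of (n_i, c_i) pairs, one per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)

print(pass_at_k(n=16, c=2, k=4))   # ~0.45
```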
5.3. Baselines
The paper compares JURY-RL against a comprehensive set of baselines, including various label-free methods and a supervised oracle:
- Self-supervised label-free baselines:
  - Majority-Voting (MV): (Shafayat et al., 2025) A consensus reward in which a rollout receives a reward of 1 if its extracted answer matches the most frequent answer among the rollouts, and 0 otherwise. This reinforces popular answers irrespective of correctness.
  - Self-Certainty: (Zhao et al., 2025b) A confidence-derived signal. The reward is based on the log-probabilities of the tokens forming the final answer; higher cumulative log-probability indicates greater model certainty and yields a higher reward.
  - Entropy Minimization: (Prabhudesai et al., 2025) A low-entropy proxy. The reward is inversely related to the policy's output entropy for the final-answer tokens, encouraging more deterministic (and presumably confident) predictions.
  - CoReward: (Zhang et al., 2025) A multi-view agreement method that attempts to improve robustness by enforcing consistency across multiple, semantically equivalent prompts.
- Judge-based label-free baselines:
  - LLM-as-a-Judge: (Pang et al., 2023; Zhao et al., 2025a) An external LLM (specifically, qwen-2.5-72b-instruct) assesses the reasoning process and final answer of each rollout. Its evaluation (numeric score or binary correct/incorrect) serves as the reward.
  - LLM-KD (Knowledge Distillation): (Gu et al., 2024b) A teacher model (qwen-2.5-72b-instruct) generates a reference answer for each problem, which is treated as a pseudo-label. The policy model is trained to align its outputs with these machine-generated references.
- Supervised Baseline:
  - Ground-Truth (GT) Reward: (Shao et al., 2024) A supervised oracle baseline where human-annotated labels are available. A rollout receives a reward of 1 if its extracted answer matches the ground truth, and 0 otherwise. It is trained with the same GRPO objective for fair comparison.
5.4. Backbone Models
The experiments are conducted on a diverse range of open-source Large Language Models to ensure broad applicability:
- Qwen3-1.7B-Base (Yang et al., 2025a)
- Llama-3.2-3B-Instruct (Dubey et al., 2024)
- Qwen2.5-7B (Yang et al., 2025b)

The judge model used for the LLM-as-a-Judge and LLM-KD baselines is qwen-2.5-72b-instruct.
5.5. Implementation Details and Hyperparameters
All methods are implemented using the VeRL framework (Sheng et al., 2025).
The following are the results from Table 6 of the original paper:
| Hyperparameter | Value |
|---|---|
| Training Configuration | |
| Train Batch size (Number of Sampled Questions) | 128 |
| Rollouts per problem (G) | 8 |
| Max prompt length | 512 |
| Max new tokens | 3072 |
| Training epoch | 6 |
| Optimizer Parameters (AdamW) | |
| Learning rate | |
| AdamW $\beta_1$ | 0.9 |
| AdamW $\beta_2$ | 0.999 |
| Warmup style | Cosine |
| Warmup steps ratio | 0.1 |
| GRPO Algorithm Parameters | |
| KL loss coefficient ($\beta$) | 0.005 |
| Clip ratio ($\epsilon$) | 0.2 |
| Generation Parameters | |
| Training temperature | 1.0 |
| Evaluation temperature | 0.8 |
| top-p | 0.95 |
5.5.1. Lean Verifier Configuration
- Autoformalizer: Goedel-Formalizer-V2-32B (Lin et al., 2025)
- Consistency-Checker: QwQ-32B (QwenTeam, 2025)
- Prover: Goedel-Prover-V2-32B (Lin et al., 2025)
- The prover conducts up to 16 independent sampling trials (Prover@16) per verification; this setting was chosen as a trade-off between proof success rate and compute cost.
Prover@16) to find a valid proof for efficiency, after trading off performance for cost. - Hyperparameters for formal verification: Temperature: 0.7, Maximum Tokens: 32,768.
5.5.2. Prompt Formats (Appendix C.4)
- Autoformalizer Prompt:
Please autoformalize the following natural language problem statement in Lean 4. Use the following theorem name: test_problem The natural language statement is: {informal_statement_content}. Think before you provide the lean statement. - Consistency-Checker System Prompt:
Your role is a Lean4 expert, please help me check consistency between natural language expression and its Lean4 proof statement. Guidelines for Consistency Checking: Goal: Determine if the Lean theorem statement is an exact and faithful formalization of the mathematical problem. Do not evaluate or consider the answer or the proof. Your sole task is to verify the correctness of the formalization. Evaluation Stages (All required): 1. Math Assertion Analysis Identify all structurally and semantically relevant components of the mathematical problem, including variables, types, quantifiers, constraints, logic structure, conclusion, and so on. The analysis should be based on the actual content of the text. 2. Lean Statement Analysis (ignore proof part) Extract all structurally and semantically relevant components from the Lean statement, including variables, types, conditions, quantifiers, constraints, the final claim, and so on. The analysis should reflect the actual content present in the Lean code. 3. Comparative Verification Check for exact correspondence between the math and Lean statements; you may refer to aspects like: - Semantic alignment, logic structure, and quantifier correctness. - Preservation of constraints and boundary assumptions. - Accurate typing and use of variables. - Syntactic validity and proper Lean usage (free from errors). - Use of symbols and constructs without semantic drift. - No missing elements, no unjustified additions, and no automatic corrections or completions. 4. Final Judgement Based solely on the above analysis, judge whether the Lean statement is a correct and exact formalization of the mathematical problem. 5. Accuracy Confirmation If correct: clearly confirm why all elements match. If incorrect: list all mismatches and explain how each one affects correctness. Note: While the analysis may be broad and open to interpreting all relevant features, the final judgment must be based only on what is explicitly and formally expressed in the Lean statement. **Do not consider or assess any part of the proof. Your judgment should be entirely about the accuracy of the statement formalization.** You should present the results following the format: Input: The Natural Language Statement: A math problem and its answer (no proof). The Formal Statement in Lean4: '''lean A Lean 4 theorem statement formalizing the problem. Proof is intentionally omitted (e.g., sorry). ''' Output Format: Return exactly one xml object <comments> Your brief analysis: Math Assertion Analysis: [...] Lean Statement Analysis (Proof Ignored): [...] Comparative Verification: [...] Conclusion: [...] Accuracy Confirmation: [... match confirmation or list of discrepancies...] </comments> <consistency> [Correct/Incorrect] </consistency> - Consistency-Checker User Prompt:
Input Data: The Natural Language Statement: {informal_prefix} The Formal Statement in Lean4: '''lean {formal_statement} ''' - Prover Prompt:
Complete the following Lean 4 code: ''' lean4 {formal_statement} ''' Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies. The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
5.5.3. LLM-as-a-Judge Prompt (Appendix D)
You are a professional math QA pair evaluator. Your sole task is to determine whether a given mathematical question and answer are correctly matched. First explain your reasoning, then end your response with your final judgment: True or False.
5.6. Cost Analysis of Lean Verifier (Appendix C.3)
The computational overhead of the Lean Verifier is crucial.
- Cold-Start Overhead: In the worst case (processing a batch of entirely unseen QA pairs):
  - Ground-Truth (GT) baseline training: 100 seconds per step (core training workload).
  - JURY-RL (Lean Verifier): 200 seconds additional overhead per step, for a total of 300 seconds.
  - LLM-as-a-Judge (Qwen-2.5-72B): 80 seconds additional overhead per step, for a total of 180 seconds.
- Cost Amortization: Thanks to a caching mechanism for verification results, as training proceeds and the policy converges, more QA pairs are found in the cache. This amortizes the cost, and the per-step time of JURY-RL progressively converges toward that of the GT baseline in the steady-state phase.
- Performance vs. Token Cost (Table 5): The following are the results from Table 5 of the original paper:

| Models | MATH500 ACC | MATH500 TPR | MATH500 F1-Score | Max Tokens per Verify | Avg Tokens per Verify | Avg Tokens per Response |
|---|---|---|---|---|---|---|
| Prover@16 | 87.0 | 87.0 | 93.0 | 351,799 | 33,346 | 5,858 |
| Prover@32 | 89.0 | 89.0 | 94.0 | 658,061 | 52,064 | 6,607 |
| Prover@64 | 91.0 | 91.0 | 95.0 | 1,369,038 | 67,429 | 6,435 |

The Prover@16 configuration (up to 16 proof attempts) is chosen as the default, balancing acceptable performance (87.0% ACC) with a significantly lower token cost (33,346 average tokens per verification) compared to Prover@64.
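A minimal sketch of the caching idea described above: verification verdicts are memoized per (question, proposed answer), so repeated proposals in later training steps skip the expensive Lean pipeline. All names are illustrative; the actual pipeline call is a placeholder.

```python
from functools import lru_cache

def verify_with_lean_pipeline(question: str, answer: str) -> bool:
    # Placeholder for the Autoformalizer -> Consistency-Checker -> Prover pipeline.
    raise NotImplementedError

@lru_cache(maxsize=None)
def cached_verify(question: str, answer: str) -> bool:
    # Cache hit: return the stored verdict. Cache miss: run the full verifier once.
    return verify_with_lean_pipeline(question, answer)
```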
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results demonstrate that JURY-RL consistently outperforms label-free baselines and often matches or surpasses supervised training with ground-truth rewards (GT).
The following are the results from Table 1 of the original paper:
| Methods | Mathematics | Code | Instruction | Multi-Task | Average | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| AIME24 | AIME25 | MATH500 | GSM8K | AMC | LiveCode | CRUX | IFEval | MMLU-Pro | ||
| Qwen3-1.7B-Base | ||||||||||
| Before RL | ||||||||||
| GT-Reward | ||||||||||
| Self-Certainty | ||||||||||
| Entropy | ||||||||||
| Majority-Voting | ||||||||||
| CoReward | ||||||||||
| LLM-KD | ||||||||||
| LLM-as-a-Judge | ||||||||||
| JURY-RL (Ours) | ||||||||||
| Llama-3.2-3B-Instruct | ||||||||||
| Before RL | ||||||||||
| GT-Reward | ||||||||||
| Entropy | ||||||||||
| Self-Certainty | ||||||||||
| Majority-Voting | ||||||||||
| CoReward | ||||||||||
| LLM-KD | ||||||||||
| LLM-as-a-Judge | ||||||||||
| JURY-RL (Ours) | ||||||||||
| Qwen2.5-7B | ||||||||||
| Before RL | ||||||||||
| GT-Reward | ||||||||||
| Self-Certainty | ||||||||||
| Entropy | ||||||||||
| Majority-Voting | ||||||||||
| CoReward | ||||||||||
| LLM-KD | ||||||||||
| LLM-as-a-Judge | ||||||||||
| JURY-RL (Ours) |
6.1.1. Mathematical Reasoning and In-Domain Generalization
- JURY-RL's advantage over baselines is modest on the in-distribution MATH500 test set.
- Its superiority becomes substantially more pronounced on out-of-distribution math benchmarks such as GSM8K and AMC.
- This pattern suggests that competing methods, including the supervised GT baseline, tend to overfit to stylistic patterns of the MATH dataset, whereas JURY-RL, relying on formal verification, learns underlying mathematical principles that are robust to distributional shifts.
- JURY-RL achieves significant overall performance gains over the GT baseline on all three backbones:
  - Qwen3-1.7B-Base: +2.32 pts (+5.91% rel.)
  - Llama-3.2-3B-Instruct: +1.91 pts (+5.93% rel.)
  - Qwen2.5-7B: +1.53 pts (+3.28% rel.), reaching an average of 48.13% and exceeding GT (46.60%) and LLM-KD (46.54%).
- This indicates that formal verification provides a superior learning signal for generalizable reasoning.
6.1.2. Out-of-Domain Generalization
- The strong in-domain performance translates to robust out-of-domain generalization across code generation, instruction following, and multi-task knowledge tests.
- On Qwen2.5-7B, JURY-RL surpasses the GT baseline with an average of 44.72% (+2.08 pts, +4.88% rel.) compared to GT's 42.64%.
- On the other models, its performance is statistically on par with GT while being the best-performing label-free method.
- This confirms that optimizing for verifiable correctness encourages fundamental, transferable skills.
6.1.3. Overall Gains
- JURY-RL shows comprehensive improvements over the GT baseline across both in-domain and out-of-domain tasks:
  - Qwen3-1.7B-Base: +1.02 pts (+2.84% rel.)
  - Qwen2.5-7B: +2.08 pts (+4.88% rel.)
  - Llama-3.2-3B-Instruct: +0.99 pts (+3.13% rel.)
- This indicates stable and scalable performance gains across different model scales, addressing the training collapse seen in label-free methods.
- The benefits extend beyond pass@1 (single-answer accuracy) to pass@k (multi-attempt success rates), suggesting that the model learns diverse, effective solution paths and averts mode collapse.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Ablation Studies of ResZero
This study compares ResZero with alternative fallback designs for the inconclusive case ($\delta = 0$): Zero Reward (no signal) and MV Reward (rewarding the majority vote). GT-Reward and Majority-Voting are included as references.
The following are the results from Table 2 of the original paper:
| Methods | Mathematics | Code | Instruction | Multi-Task | Average | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| AIME24 | AIME25 | MATH500 | GSM8K | AMC | LiveCode | CRUX | IFEval | MMLU-Pro | ||
| Qwen3-1.7B-Base | ||||||||||
| GT-Reward | ||||||||||
| Majority-Voting | ||||||||||
| Proof-Gate + Zero Reward | ||||||||||
| Proof-Gate + MV Reward | ||||||||||
| JURY-RL (Proof-Gate + ResZero) | ||||||||||
| Llama-3.2-3B-Instruct | ||||||||||
| GT-Reward | ||||||||||
| Majority-Voting | ||||||||||
| Proof-Gate + Zero Reward | ||||||||||
| Proof-Gate + MV Reward | ||||||||||
| JURY-RL (Proof-Gate + ResZero) | ||||||||||
| Qwen2.5-7B | ||||||||||
| GT-Reward | ||||||||||
| Majority-Voting | ||||||||||
| Proof-Gate + Zero Reward | ||||||||||
| Proof-Gate + MV Reward | ||||||||||
| JURY-RL (Proof-Gate + ResZero) | ||||||||||
- ResZero consistently achieves the highest Average score across all backbones.
- On average, ResZero outperforms Zero Reward by +1.8 pts, MV Reward by +6.9 pts, and even GT by +1.4 pts.
- MV Reward (Proof-Gate + MV Reward) shows the worst performance, indicating that rewarding an unverified majority is detrimental: the model reinforces its own errors and collapses. For example, on Qwen3-1.7B-Base, Proof-Gate + MV Reward scores an average of 28.12%, significantly lower than Proof-Gate + Zero Reward (35.42%) and JURY-RL (36.90%).
- Zero Reward (Proof-Gate + Zero Reward) is suboptimal, as it stalls the learning process for inconclusive samples by providing no gradient.
- ResZero provides a robust middle ground between dangerous reinforcement and inefficient stagnation.
6.2.2. JURY-RL Achieves Stable Training and Avoids Collapse
The following figure (Image 2 from the original paper) shows accuracy on the MATH500 validation set over training steps. It contains three subplots, one per backbone (Qwen3-1.7B-Base, Llama-3.2-3B-Instruct, and Qwen2.5-7B), comparing the accuracy trajectories of the different reward strategies (JURY-RL, GT-Reward, LLM-KD, and the remaining baselines) during training.
- The validation-accuracy trajectories for Entropy and Self-Certainty collapse after an initial performance gain, as these methods reinforce spurious consensus.
- LLM-as-a-Judge, LLM-KD, and Majority-Voting exhibit noisy and suboptimal convergence.
- In contrast, JURY-RL demonstrates stable, monotonic improvement, confirming that its proof-gated reward mechanism effectively prevents the mode collapse common in self-supervised methods.
6.2.3. JURY-RL Enhances Diversity
The paper investigates whether JURY-RL enhances solution diversity.
The following are the results from Table 3 of the original paper:
| Model | GT-Reward (pass@k) | JURY-RL (pass@k) | Δ (pass@k) | GT-Reward (pass@1) | JURY-RL (pass@1) | Δ (pass@1) |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | 55.36 | 59.41 | +4.05 | 39.25 | 41.57 | +2.32 |
| Llama-3.2-3B-Instruct | 45.46 | 48.48 | +3.02 | 32.19 | 34.10 | +1.91 |
| Qwen2.5-7B | 62.48 | 64.04 | +1.56 | 46.60 | 48.13 | +1.53 |
- JURY-RL achieves substantially larger improvements in pass@k than in pass@1 (e.g., +4.05 pp vs. +2.32 pp for Qwen3-1.7B-Base).
- This indicates an increase in solution diversity. The ResZero reward penalizes flawed consensus and redistributes reward toward alternative reasoning paths, actively countering mode collapse and incentivizing exploration.

The following figure (Image 3 from the original paper) visually confirms this: it plots the average number of unique answers per problem against training step for JURY-RL and three other methods.
- The chart shows the average number of unique answers generated per problem during training on Qwen3-1.7B-Base.
- While baselines like Majority-Voting quickly suffer from mode collapse (converging to a single answer), JURY-RL sustains a high level of diversity.

The following are the pass@k results for all comparisons, from Table 7 of the original paper:

| Methods | AIME24 pass@16 | AIME25 pass@16 | MATH500 pass@4 | GSM8K pass@4 | AMC pass@8 | Average |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base: Before RL, GT-Reward, Entropy, Self-Certainty, Majority-Voting, CoReward, LLM-KD, LLM-as-a-Judge, JURY-RL (Ours) | | | | | | |
| Llama-3.2-3B-Instruct: Before RL, GT-Reward, Entropy, Self-Certainty, Majority-Voting, CoReward, LLM-KD, LLM-as-a-Judge, JURY-RL (Ours) | | | | | | |
| Qwen2.5-7B: Before RL, GT-Reward, Entropy, Self-Certainty, Majority-Voting, CoReward, LLM-KD, LLM-as-a-Judge, JURY-RL (Ours) | | | | | | |

- JURY-RL not only surpasses every label-free baseline but also consistently outperforms the strong supervised GT-Reward baseline across all three backbone models, further confirming its effectiveness in improving solution diversity and multi-attempt success rates.
6.2.4. Analysis of Verifier Signal Quality
The paper compares the signal quality of the Lean verifier against an LLM-as-a-Judge.
The following are the results from Table 4 of the original paper:
| Verifier | Prec. | Rec. | F1 |
|---|---|---|---|
| LLM-as-a-Judge | 75.9 | 96.1 | 84.8 |
| Lean Verifier (Ours) | 84.5 | 88.0 | 86.2 |
- The Lean verifier provides a superior reward signal with substantially higher precision (84.5% vs. 75.9%) at the cost of moderately lower recall (88.0% vs. 96.1%).
- This trade-off is critical: high precision drastically reduces false positives, preventing reward hacking and tightly aligning training with verifiable correctness.
- The LLM-judge's noisy signal (lower precision) risks reinforcing errors despite its higher recall.
- The Lean verifier's higher F1-score (86.2% vs. 84.8%) confirms its better overall balance, validating the "Proofs Dispose" principle by prioritizing signal fidelity.
- The imperfections in the Lean verifier (not achieving 100% precision/recall) stem from upstream components (auto-formalization, consistency checks) rather than the prover's core logic.

The following figure (Image 5 from the original paper) shows the training dynamics of precision, recall, validation accuracy, and training entropy under different Lean pass@k verification settings:
Figure (Image 5): four line-chart panels showing how Precision, Recall, Accuracy, and Entropy evolve with training step under different pass@k settings (k = 1, 2, 4, 8, 16).
- The precision (top-left panel) of the Lean verifier remains consistently high (around 85%), confirming its reliability in avoiding false positives.
- The recall (top-right panel) shows a clear dependency on k (the number of verification attempts), increasing from roughly 30% at pass@1 to nearly 90% at pass@16. This means multiple attempts significantly increase the chance of proving a correct answer (see the sketch after this list).
- Despite this variance in recall, the final validation accuracy (bottom-left) and training entropy (bottom-right) converge to similar stable states across different values of k. This suggests that the high precision of the verifier's signal is the dominant factor for successful and stable training.
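The k-dependence of recall follows from a simple "any attempt counts" aggregation over proof attempts. The sketch below is illustrative only; `try_lean_proof` is a hypothetical stand-in for the paper's auto-formalization plus Lean proving pipeline, stubbed here so the snippet runs.

```python
import random

def try_lean_proof(problem: str, answer: str) -> bool:
    """Stand-in for one auto-formalization + Lean proof attempt.
    Stubbed as a coin flip purely so the sketch is runnable; a real
    system would invoke the prover here."""
    return random.random() < 0.3

def verify_with_attempts(problem: str, answer: str, k: int) -> bool:
    """pass@k-style verification: the answer counts as verified if any of
    k proof attempts succeeds, which is why recall grows with k."""
    return any(try_lean_proof(problem, answer) for _ in range(k))

# If a single attempt succeeds with probability p, k independent attempts
# succeed with probability 1 - (1 - p)**k. In practice, attempts on the same
# hard problem are correlated, so the observed recall (~90% at pass@16)
# sits below this independent-attempt bound.
print(verify_with_attempts("some problem", "some answer", k=16))
```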
6.2.5. Impact of Hyperparameter c
The hyperparameter c in Eq. 5 controls the penalty strength on the unverified majority proposal in the ResZero reward (an illustrative sketch of this reward shape follows the list below).
The following figure (Figure 4 from the original paper) shows the training dynamics under different values of the hyperparameter c:
Figure 4: training dynamics under different values of c; the left and center panels show accuracy and reward, and the right panel shows the average number of unique answers.
- Non-zero c for diversity: The right panel shows that a non-zero c is essential for maintaining solution diversity. When c = 0, the ResZero framework degenerates to a zero-reward fallback (as the penalty term vanishes), leading to a sharp decline in the average number of unique answers – a symptom of mode collapse. Any positive value successfully sustains a high level of diversity.
- Trade-off in value: The left and center panels (accuracy and reward) reveal that an overly aggressive penalty (i.e., a very large c) can slightly suppress the final reward and accuracy. This implies that while the penalty is crucial for exploration, an excessive value may overly restrict the policy from exploiting a potentially correct, high-consensus answer.
- Optimal balance: A moderately tuned c (the value used in the experiments) strikes an optimal balance, ensuring robust training stability and solution diversity without compromising convergence on primary task objectives.
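Since Eq. 5 is not reproduced in this summary, the following is a minimal illustrative sketch of the reward shape the bullets describe, under the assumption that unverified majority rollouts receive a penalty of strength c and the residual rollouts receive a compensating positive reward so the group mean is zero; the paper's exact normalization, including its variance-preserving scaling, is omitted here.

```python
import numpy as np

def reszero_rewards(answers: list[str], majority: str, c: float) -> np.ndarray:
    """Illustrative ResZero-style reward for one group of rollouts whose
    majority-voted answer could NOT be verified. Sketch only, not Eq. 5:
    majority rollouts are penalised with strength c, and residual rollouts
    get a compensating reward so the group reward is zero-mean."""
    is_majority = np.array([a == majority for a in answers])
    n_major = int(is_majority.sum())
    n_resid = len(answers) - n_major
    if c == 0.0 or n_resid == 0:
        # With c = 0 (or no residual answers) this degenerates to the
        # zero-reward fallback discussed above.
        return np.zeros(len(answers))
    return np.where(is_majority, -c, c * n_major / n_resid)

# Example: 8 rollouts, 5 of which vote for the unverifiable answer "42".
rollout_answers = ["42"] * 5 + ["40", "41", "7"]
print(reszero_rewards(rollout_answers, "42", c=0.5))
# -> [-0.5 -0.5 -0.5 -0.5 -0.5  0.833  0.833  0.833], group mean 0
```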
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces JURY-RL, a novel label-free Reinforcement Learning with Verifiable Rewards (RLVR) framework. Its core innovation lies in the "Votes Propose, Proofs Dispose" paradigm, where majority voting from model rollouts proposes a candidate answer, and a formal Lean verifier disposes the final reward. This design ensures truth-alignment through provable correctness and scalability by eliminating reliance on expensive human-annotated labels. A key component, the ResZero reward, addresses the challenge of inconclusive verification by providing a zero-mean, variance-preserving signal that prevents training collapse and fosters stable optimization without reinforcing spurious consensus. Empirical results across diverse benchmarks (mathematical reasoning, code generation, multi-task) confirm that JURY-RL achieves superior training stability, outperforms existing label-free baselines, and even matches or surpasses supervised training with ground-truth rewards, particularly in out-of-domain generalization and solution diversity.
7.2. Limitations & Future Work
The authors highlight the following limitations and potential future work:
- Verifier Imperfections: While the Lean verifier is theoretically sound, its practical pipeline (involving auto-formalization and consistency checks) can introduce errors, leading to a precision gap from 100%. Improving these upstream components is a clear area for future work to enhance the verifier's overall signal quality.
- Computational Cost: The Lean verifier introduces significant computational overhead during the initial "cold-start" phase of training compared to LLM-as-a-Judge. Although this cost is amortized through caching as training progresses (a minimal caching sketch follows this list), further research could focus on optimizing the efficiency of formal verification or developing more lightweight, yet still robust, verifiers.
- Generalizability of Formal Verification: While the current work focuses on mathematical and code domains where formal verification is well-defined, extending RLVR to more subjective or ambiguous reasoning tasks (e.g., natural language understanding, common-sense reasoning) would require developing new types of verifiable rewards or hybrid verification mechanisms.
- Hyperparameter Tuning: The parameter c in ResZero requires careful tuning to balance exploration and exploitation, an area that could benefit from adaptive or automated tuning strategies.
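To make the amortization point concrete, one way such caching could work is to memoise verification results per (problem, candidate answer) pair. The sketch below is an assumption about the mechanism rather than the paper's implementation, and `lean_verify` is a hypothetical stand-in for the real prover call, stubbed so the snippet runs.

```python
from functools import lru_cache

def lean_verify(problem: str, answer: str) -> bool:
    """Hypothetical stand-in for the expensive auto-formalization +
    Lean proving pipeline; stubbed to False so the sketch is runnable."""
    return False

@lru_cache(maxsize=None)
def cached_verify(problem: str, answer: str) -> bool:
    """Memoised proof check keyed on (problem, answer). Because rollouts for
    the same problem keep proposing the same majority answers, later training
    steps mostly hit the cache, so the prover's cost is paid only once per
    distinct pair."""
    return lean_verify(problem, answer)
```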
7.3. Personal Insights & Critique
This paper presents a compelling approach to label-free RLVR that addresses critical shortcomings of previous methods. The "Votes Propose, Proofs Dispose" paradigm is particularly insightful, elegantly solving the scalability-truth-alignment dilemma. By focusing the expensive, high-precision formal verification on a single, majority-voted candidate, it makes label-free verification practical.
The ResZero reward is a significant contribution to RL stability. Many RL methods struggle when rewards are sparse or when negative signals are difficult to formulate constructively. ResZero's ability to penalize unverified consensus while maintaining variance and zero-mean properties offers a principled way to guide exploration in ambiguous situations, which is highly valuable beyond just RLVR. This technique could potentially be adapted to other RL settings where reward signals are noisy or intermittent.
One area for further exploration could be the interaction between the KL regularization term in GRPO and the ResZero reward. While ResZero encourages diversity by penalizing spurious consensus, KL regularization pulls the policy back towards the reference policy, which might implicitly limit exploration or diversity. Understanding this interplay, and perhaps dynamically adjusting the KL coefficient based on ResZero's signal, could yield further improvements.
Another interesting direction could be to explore softer forms of "proof" for domains beyond formal math and code. While formal verifiers provide binary (0/1) correctness, could a similar "propose and dispose" framework be adapted using, for example, ensemble agreement or chain-of-thought consistency from multiple specialized LLMs acting as "weak verifiers" in less formal domains, with sophisticated conflict resolution mechanisms? This would push the boundaries of RLVR into broader LLM reasoning tasks where formal proofs are currently infeasible.
Finally, the discussion of the Lean verifier's precision versus recall and its computational overhead is commendably transparent. The observation that high precision is paramount for stable training (even with moderate recall) offers a valuable lesson for designing reward functions in RL. It underscores that reward fidelity is more crucial than reward coverage when dealing with self-supervised learning where false positives can be catastrophic. The amortization strategy for verifier cost also demonstrates a practical approach to integrating expensive external tools into LLM training pipelines.