Tags: Reinforcement Learning for Math Reasoning - Paper Library

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Published: 12/9/2025

RL Training for Large Language Models, LLM Reasoning Capacity Enhancement, Sequence Policy Optimization, Long-Context Modeling, Reinforcement Learning for Math Reasoning

This study examines whether reinforcement learning (RL) truly enhances reasoning capabilities in language models, offering a transparent framework. Key findings include RL's effectiveness at the model's competence edge, with minimal pretraining seed needed for transfer, while mi

ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute

Published: 8/30/2025

LLM Reasoning Capacity Enhancement, Reinforcement Learning for Math Reasoning

The paper introduces 'ParaThinker,' a paradigm for scaling LLM test-time compute that uses native thought parallelism to overcome the 'Tunnel Vision' bottleneck, significantly enhancing reasoning capabilities by synthesizing multiple diverse reasoning paths.

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

Self-Verifiable Mathematical Reasoning, LLM-based Theorem Proving, Reinforcement Learning for Math Reasoning, Proof Generator and Verifier, Quantitative Reasoning Capability Enhancement

The DeepSeekMath-V2 model addresses the effectiveness of large language models in mathematical reasoning. By training a theorem prover verifier, it enables self-verification, producing more accurate proofs and achieving excellent results in competitions, demonstrating the potenti

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Published: 9/12/2025

Vision-Language-Action Model, Reinforcement Learning for Math Reasoning, RL Training for Large Language Models, Multi-Environment Rendering, Efficient Reinforcement Learning Framework

SimpleVLA-RL is introduced to enhance the training of Vision-Language-Action models using reinforcement learning, addressing data scarcity and generalization issues. Results show state-of-the-art performance on OpenVLA-OFT, reducing reliance on labeled data.

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Published: 10/8/2025

RL Training for Large Language Models, Training-Free Acceleration Methods, Reinforcement Learning for Math Reasoning, Sequence Policy Optimization

JURY-RL separates answer proposal via voting from reward disposal via theorem proving and uses ResZero for unverifiable cases, stabilizing RL training and outperforming label-free baselines in reasoning and code tasks, rivaling supervised training.

Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

Published: 10/8/2025

RL Training for Large Language Models, Sequence Policy Optimization, Reinforcement Learning for Math Reasoning

The ROVER algorithm leverages the special MDP structure of RLVR in math reasoning, recovering optimal actions from the valuation of a fixed random policy and bypassing complex policy iteration. It improves LLM reasoning quality and diversity simply and efficiently.

Learning to Reason without External Rewards

Published: 5/26/2025

RL Training for Large Language Models, Sequence Policy Optimization, Training-Free Acceleration Methods, Reinforcement Learning for Math Reasoning

Intuitor leverages a model’s self-certainty as an intrinsic reward in reinforcement learning, enabling unsupervised training of LLMs for complex reasoning with strong cross-domain generalization and no reliance on external labels or rewards.

Preference-Based Process Reward Model for Robust Mathematical Reasoning

Published: 10/8/2025

Preference-Based Process Reward Model, Reinforcement Learning for Math Reasoning, MCTS-based Data Construction, Step-Wise Supervision Mechanism, Sequence Policy Optimization

This work presents a preference-based process reward model trained on MCTS-derived data to reduce search bias. Enhanced GRPO enables stable RL training, improving intermediate step accuracy by 23% in mathematical reasoning tasks.

Papers