Papers

Reinforcement Learning for Math Reasoning
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Published: 12/9/2025
RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · Sequence Policy Optimization · Long-Context Modeling · Reinforcement Learning for Math Reasoning
This study examines whether reinforcement learning (RL) truly enhances reasoning capabilities in language models, offering a transparent experimental framework. Key findings include RL's effectiveness at the model's competence edge, with only a minimal pre-training seed needed for transfer, while mid-training…
ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
Published: 8/30/2025
LLM Reasoning Capacity Enhancement · Reinforcement Learning for Math Reasoning
The paper introduces ParaThinker, a novel paradigm for scaling LLM test-time compute that uses native thought parallelism to overcome the 'Tunnel Vision' bottleneck of sequential test-time computation, significantly enhancing reasoning by synthesizing multiple diverse reasoning paths.
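As a rough illustration of the parallel-then-synthesize idea (a sketch of the principle, not the paper's implementation), the snippet below samples several reasoning paths concurrently and then asks the model to merge them; `generate` is a hypothetical stand-in for any LLM sampling API:

```python
import concurrent.futures

def generate(prompt: str, seed: int) -> str:
    """Hypothetical LLM call that samples one completion."""
    return f"[reasoning path seeded with {seed} for: {prompt[:40]}...]"

def parathink(question: str, n_paths: int = 4) -> str:
    # 1) Sample several diverse reasoning paths in parallel, instead of
    #    extending one long chain of thought ("Tunnel Vision").
    with concurrent.futures.ThreadPoolExecutor() as pool:
        paths = list(pool.map(lambda s: generate(question, s), range(n_paths)))
    # 2) Synthesize: let the model read all paths and produce one final answer.
    synthesis = question + "\n\nCandidate reasoning paths:\n" + "\n---\n".join(paths)
    return generate(synthesis + "\n\nFinal answer:", seed=0)

print(parathink("What is 17 * 24?"))
```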
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Self-Verifiable Mathematical Reasoning · LLM-based Theorem Proving · Reinforcement Learning for Math Reasoning · Proof Generator and Verifier · Quantitative Reasoning Capability Enhancement
The DeepSeekMath-V2 model addresses the effectiveness of large language models in mathematical reasoning. By training a theorem-prover verifier, it enables self-verification, producing more accurate proofs and achieving excellent competition results, demonstrating the potential…
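The self-verification loop the abstract describes can be pictured roughly as below; `prove` and `verify_score` are hypothetical placeholders for the paper's proof generator and trained verifier, and the loop structure is an assumption, not the paper's algorithm:

```python
import random

def prove(problem: str, attempt: int) -> str:
    """Hypothetical proof generator: one sampled proof per call."""
    return f"[proof attempt {attempt} for: {problem}]"

def verify_score(problem: str, proof: str) -> float:
    """Hypothetical trained verifier: confidence that the proof is sound."""
    return random.random()

def self_verified_proof(problem: str, threshold: float = 0.9, budget: int = 8):
    """Keep sampling until the verifier is confident; return the best proof."""
    best_proof, best_score = None, -1.0
    for attempt in range(budget):
        proof = prove(problem, attempt)
        score = verify_score(problem, proof)   # the model checks its own work
        if score > best_score:
            best_proof, best_score = proof, score
        if best_score >= threshold:            # accept once confident enough
            break
    return best_proof, best_score
```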
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Published: 9/12/2025
Vision-Language-Action Model · Reinforcement Learning for Math Reasoning · RL Training for Large Language Models · Multi-Environment Rendering · Efficient Reinforcement Learning Framework
SimpleVLA-RL is introduced to enhance the training of Vision-Language-Action (VLA) models using reinforcement learning, addressing data scarcity and generalization issues. Results show state-of-the-art performance when applied to OpenVLA-OFT, reducing reliance on labeled data.
JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
Published: 10/8/2025
RL Training for Large Language Models · Training-Free Acceleration Methods · Reinforcement Learning for Math Reasoning · Sequence Policy Optimization
JURY-RL separates answer proposal (via voting) from reward disposal (via theorem proving), and uses ResZero for unverifiable cases. This stabilizes RL training and outperforms label-free baselines on reasoning and code tasks, rivaling supervised training.
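A minimal sketch of the "votes propose, proofs dispose" reward split might look as follows; `check` stands in for a theorem-proving verifier, and the neutral fallback reward stands in for the paper's ResZero handling of unverifiable cases (both are assumptions, not the paper's API):

```python
from collections import Counter

def check(question: str, answer: str) -> bool | None:
    """Hypothetical verifier: True/False if (dis)provable, None if unverifiable."""
    ...

def jury_reward(question: str, sampled_answers: list[str]) -> list[float]:
    # Propose: a majority vote over the policy's own samples picks a candidate.
    proposed, _ = Counter(sampled_answers).most_common(1)[0]
    # Dispose: the prover, not the vote, decides whether reward is granted.
    verdict = check(question, proposed)
    if verdict is None:                      # unverifiable -> neutral fallback
        return [0.0] * len(sampled_answers)  # (stand-in for ResZero)
    return [1.0 if verdict and a == proposed else 0.0 for a in sampled_answers]
```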
Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
Published: 10/8/2025
RL Training for Large Language Models · Sequence Policy Optimization · Reinforcement Learning for Math Reasoning
The ROVER algorithm leverages the special MDP structure of RLVR in math reasoning, recovering optimal actions from the valuation of a fixed random policy and bypassing complex policy iteration. It enhances both the quality and diversity of LLM reasoning, simply and efficiently.
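To illustrate the principle on a toy problem (this is not the paper's algorithm): in a tree-structured MDP with deterministic transitions and terminal rewards, the Q-values of the uniform random policy already rank the reward-reaching action highest, so acting (soft-)greedily on them suffices, with no policy-improvement loop:

```python
import math, random

# Toy two-step MDP: pick a branch, then a leaf with a 0/1 terminal reward.
TREE = {"a": {"a1": 1.0, "a2": 0.0}, "b": {"b1": 0.0, "b2": 0.0}}

def q_random(branch: str, n_rollouts: int = 2000) -> float:
    """Monte-Carlo Q-value of `branch` under the uniform random policy."""
    return sum(random.choice(list(TREE[branch].values()))
               for _ in range(n_rollouts)) / n_rollouts

def softmax_policy(tau: float = 0.1) -> dict[str, float]:
    """Act softly on random-policy values; no policy iteration needed."""
    qs = {a: q_random(a) for a in TREE}
    z = sum(math.exp(q / tau) for q in qs.values())
    return {a: math.exp(q / tau) / z for a, q in qs.items()}

print(softmax_policy())  # mass concentrates on "a", the only branch reaching reward 1
```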
Learning to Reason without External Rewards
Published: 5/26/2025
RL Training for Large Language Models · Sequence Policy Optimization · Training-Free Acceleration Methods · Reinforcement Learning for Math Reasoning
Intuitor leverages a model's self-certainty as an intrinsic reward in reinforcement learning, enabling unsupervised training of LLMs for complex reasoning with strong cross-domain generalization and no reliance on external labels or rewards.
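Assuming the common KL-from-uniform formulation of self-certainty (the paper's exact definition should be treated as authoritative), the intrinsic reward can be sketched as:

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """
    logits: (seq_len, vocab) next-token logits over a generated response.
    Returns the mean KL(uniform || p) over positions: large when the model's
    token distributions are peaked, i.e., when it is "sure of itself".
    """
    log_probs = F.log_softmax(logits, dim=-1)
    vocab = logits.size(-1)
    # KL(U || p) per position = -log V - (1/V) * sum_j log p_j
    kl_per_pos = -math.log(vocab) - log_probs.mean(dim=-1)
    return kl_per_pos.mean()

# Used as an intrinsic reward signal, e.g. reward = self_certainty(policy_logits),
# in place of an external verifier or labeled answer.
```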
Preference-Based Process Reward Model for Robust Mathematical Reasoning
Published: 10/8/2025
Preference-Based Process Reward Model · Reinforcement Learning for Math Reasoning · MCTS-based Data Construction · Step-Wise Supervision Mechanism · Sequence Policy Optimization
This work presents a preference-based process reward model trained on MCTS-derived data to reduce search bias. An enhanced GRPO enables stable RL training, improving intermediate-step accuracy by 23% on mathematical reasoning tasks.
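A hedged sketch of the preference-based objective such a PRM might use is a Bradley-Terry-style loss over pairs of intermediate steps, with pair labels drawn from MCTS-derived data; the names and data below are illustrative, not the paper's API:

```python
import torch
import torch.nn.functional as F

def preference_loss(score_preferred: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    """
    score_*: (batch,) PRM scores for the preferred / rejected intermediate
    step in each pair. Minimizing this teaches the model to *rank* steps,
    rather than regress absolute (possibly search-biased) step values.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Example with three step pairs (scores here are made up):
loss = preference_loss(torch.tensor([2.0, 0.5, 1.2]),
                       torch.tensor([1.0, 0.7, -0.3]))
```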