AiPaper

Tag: Reinforcement Learning for Math Reasoning
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Published: 9/12/2025
Tags: Vision-Language-Action Model, Reinforcement Learning for Math Reasoning, RL Training for Large Language Models, Multi-Environment Rendering, Efficient Reinforcement Learning Framework
SimpleVLA-RL is introduced to enhance the training of Vision-Language-Action models using reinforcement learning, addressing data scarcity and generalization issues. Results show state-of-the-art performance on OpenVLA-OFT, reducing reliance on labeled data.
JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
Published: 10/8/2025
Tags: RL Training for Large Language Models, Training-Free Acceleration Methods, Reinforcement Learning for Math Reasoning, Sequence Policy Optimization
JURY-RL separates answer proposal via voting from reward disposal via theorem proving, and uses ResZero for unverifiable cases. This stabilizes RL training and outperforms label-free baselines on reasoning and code tasks, rivaling supervised training.
Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
Published: 10/8/2025
Tags: RL Training for Large Language Models, Sequence Policy Optimization, Reinforcement Learning for Math Reasoning
The ROVER algorithm leverages the special MDP structure of RLVR in math reasoning, recovering optimal actions from the valuation of a fixed random policy and bypassing complex policy iteration. It improves LLM reasoning quality and diversity simply and efficiently.
Learning to Reason without External Rewards
Published: 5/26/2025
Tags: RL Training for Large Language Models, Sequence Policy Optimization, Training-Free Acceleration Methods, Reinforcement Learning for Math Reasoning
Intuitor leverages a model’s self-certainty as an intrinsic reward in reinforcement learning, enabling unsupervised training of LLMs for complex reasoning with strong cross-domain generalization and no reliance on external labels or rewards.
Preference-Based Process Reward Model for Robust Mathematical Reasoning
Published: 10/8/2025
Tags: Preference-Based Process Reward Model, Reinforcement Learning for Math Reasoning, MCTS-based Data Construction, Step-Wise Supervision Mechanism, Sequence Policy Optimization
This work presents a preference-based process reward model trained on MCTS-derived data to reduce search bias. An enhanced GRPO enables stable RL training, improving intermediate-step accuracy by 23% in mathematical reasoning tasks.