Papers
RL Training for Large Language Models
HybridFlow: A Flexible and Efficient RLHF Framework
Published:9/28/2024
RL Training for Large Language Models · Hybrid Controller RL Framework · Reinforcement Learning from Human Feedback · Dataflow Computation Model · Distributed Computation Optimization
HybridFlow is a hybrid framework that integrates single-controller and multi-controller paradigms to enhance the efficiency and flexibility of RLHF systems. It features hierarchical APIs and a 3D-HybridEngine for efficient model weight repartitioning, achieving a 1.53 to 20.57 times throughput improvement over existing frameworks.
A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation
Published:12/12/2025
RL Training for Large Language Models · Markov Decision Process Modeling · Automated Policy Generation · Verifiable Stage-wise Modeling · Advanced Reinforcement Learning Applications
The A-LAMP framework automates the transition from natural language task descriptions to MDP modeling and policy generation. By decomposing modeling, coding, and training into verifiable stages, A-LAMP enhances policy generation capabilities, outpacing traditional large language model baselines.
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Published:6/2/2025
RL Training for Large Language Models · Negative Reinforcement Mechanism · Mathematical Reasoning Dataset · Positive and Negative Sample Reinforcement · Strategies for Enhancing Reasoning Capability
This study examines reinforcement learning with verifiable rewards (RLVR), breaking down the learning signal into Positive Sample Reinforcement (PSR) and Negative Sample Reinforcement (NSR). It finds that training solely with negative samples enhances model performance and diversity.
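To make the PSR/NSR split concrete, here is a minimal sketch of how a verifiable-reward policy-gradient loss can be decomposed into a positive-sample term and a negative-sample term; the weighting knobs (psr_weight, nsr_weight) are illustrative assumptions rather than the paper's exact formulation, and NSR-only training corresponds to setting psr_weight to zero.

```python
# Illustrative decomposition of an RLVR-style loss into PSR and NSR terms (assumed form).
import torch

def rlvr_loss(logprobs, rewards, psr_weight=1.0, nsr_weight=1.0):
    """logprobs: (batch,) summed token log-probs per sampled response.
    rewards: (batch,) verifiable rewards in {0, 1} (1 = correct answer)."""
    pos = (rewards > 0).float()                       # correct samples
    neg = 1.0 - pos                                   # incorrect samples
    # PSR raises probability on correct samples; NSR lowers it on incorrect ones.
    psr_term = -(pos * logprobs).sum() / pos.sum().clamp(min=1.0)
    nsr_term = (neg * logprobs).sum() / neg.sum().clamp(min=1.0)
    return psr_weight * psr_term + nsr_weight * nsr_term
```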
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Published:12/9/2025
RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · Sequence Policy Optimization · Long-Context Modeling · Reinforcement Learning for Math Reasoning
This study examines whether reinforcement learning (RL) truly enhances reasoning capabilities in language models, offering a transparent framework. Key findings include RL's effectiveness at the model's competence edge, with only a minimal pre-training seed needed for transfer.
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Published:10/9/2025
RL Training for Large Language Models · Hybrid Reward Optimization · Math Reasoning Benchmarks · Reward Model-Based Learning · Sparse Reward Problem
The HERO framework integrates verifiable rewards with reward models to address the limitations of sparse feedback in large language model reasoning tasks. Using stratified normalization and variance-aware weighting, HERO significantly improves performance on mathematical reasoning benchmarks.
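A minimal sketch of one way to read "stratified normalization and variance-aware weighting": reward-model scores are normalized separately within the verifier-correct and verifier-incorrect groups of a prompt's rollouts, and the dense term is weighted more when the verifier signal alone is uninformative. The function name, the alpha parameter, and the weighting direction are assumptions for illustration; the paper's exact scheme may differ.

```python
# Hypothetical hybrid reward combining sparse verifier outcomes with dense RM scores.
import numpy as np

def hybrid_reward(verifier, rm_scores, alpha=0.1, eps=1e-6):
    """verifier: (n,) 0/1 outcomes for n rollouts of one prompt.
    rm_scores: (n,) float reward-model scores for the same rollouts."""
    dense = np.zeros(len(rm_scores))
    for flag in (0, 1):                               # stratify by verifier outcome
        mask = verifier == flag
        if mask.any():
            group = rm_scores[mask]
            dense[mask] = (group - group.mean()) / (group.std() + eps)
    # Assumed variance-aware weighting: rely on the dense signal more when the
    # verifier outcomes have low variance (all rollouts correct or all incorrect).
    weight = alpha * (1.0 - verifier.std())
    return verifier + weight * dense
```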
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
Published:10/21/2025
RL Training for Large Language Models · Balanced Policy Optimization · Adaptive Clipping Mechanism · Off-Policy Optimization · Efficient Sample Replay
This paper presents BAPO, a method for stabilizing off-policy reinforcement learning for large language models by using balanced policy optimization with adaptive clipping, addressing issues of optimization imbalance and improving sample efficiency.
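As a rough illustration of "balanced policy optimization with adaptive clipping", the sketch below uses a PPO-style surrogate with asymmetric clip bounds that are nudged toward whichever side (positive or negative advantages) currently contributes less to the batch. The adaptation rule, step size, and target share are assumptions, not BAPO's published update.

```python
# Assumed sketch of an asymmetric-clip surrogate plus a simple bound-adaptation rule.
import torch

def clipped_surrogate(ratio, adv, clip_low, clip_high):
    """ratio: (batch,) pi_new / pi_old; adv: (batch,) advantages."""
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.min(ratio * adv, clipped * adv).mean()

def adapt_clip_bounds(adv, clip_low, clip_high, step=0.01, target=0.5):
    # Share of total |advantage| mass coming from positive samples in this batch.
    pos_share = adv.clamp(min=0).sum() / adv.abs().sum().clamp(min=1e-8)
    if pos_share > target:          # positives dominate -> loosen the lower clip bound
        clip_low += step
    else:                           # negatives dominate -> loosen the upper clip bound
        clip_high += step
    return clip_low, clip_high
```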
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
Published:10/8/2025
RL Training for Large Language Models · Agentic Retrieval-Augmented Generation · Hierarchical Process Rewards · Knowledge-Grounded Process Rewards · Search Decision Optimization
HiPRAG introduces a novel hierarchical process-reward method to tackle common over-search and under-search issues in agentic retrieval-augmented generation, significantly improving search efficiency and accuracy across multiple QA benchmarks and demonstrating the importance of optimizing search decisions.
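A small sketch of one plausible process-level reward that penalizes both failure modes named above: querying when retrieval was unnecessary (over-search) and skipping retrieval when it was needed (under-search). The reward magnitudes and the per-step necessity judgment are illustrative assumptions, not HiPRAG's exact hierarchical formulation.

```python
# Hypothetical per-step process reward discouraging over-search and under-search.
def step_reward(did_search, search_was_needed, answer_correct,
                outcome_bonus=1.0, efficiency_penalty=0.2):
    reward = outcome_bonus if answer_correct else 0.0
    if did_search and not search_was_needed:
        reward -= efficiency_penalty        # over-search: an unnecessary query
    if not did_search and search_was_needed:
        reward -= efficiency_penalty        # under-search: a guess without evidence
    return reward
```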
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
RL Training for Large Language Models · Long-Context Modeling · LLM Reasoning Capacity Enhancement · Sparse Attention Mechanism
DeepSeek-V3.2 is presented, balancing computational efficiency and reasoning capability through three innovations: a sparse attention mechanism that reduces complexity, a scalable reinforcement learning framework that rivals GPT-5, and a synthesis pipeline that enhances generalization.
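The "sparse attention mechanism that reduces complexity" can be illustrated with a generic top-k token-selection attention, where each query attends only to its highest-scoring keys; this is a simplified stand-in rather than DeepSeek-V3.2's actual design, and a real kernel would avoid materializing the full score matrix.

```python
# Generic top-k sparse attention sketch (illustrative, not DeepSeek-V3.2's mechanism).
import torch

def topk_sparse_attention(q, k, v, top_k=64):
    """q, k, v: (seq, d) tensors; each query keeps only its top_k keys."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)                    # (seq, seq)
    threshold = scores.topk(min(top_k, scores.shape[-1]), dim=-1).values[:, -1:]
    scores = scores.masked_fill(scores < threshold, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```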
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Published:4/9/2021
RL Training for Large Language Models · Large Language Model Fine-Tuning · Transformer-Based Efficient Forward Prediction · GPU Cluster Training · Pipeline Parallel Training
This paper introduces a novel interleaved pipeline parallelism schedule, combining tensor, pipeline, and data parallelism, to enhance the training efficiency of large language models on GPU clusters, achieving 502 petaFLOP/s on 3072 GPUs with over 10% throughput improvement.
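The interleaved schedule's core idea, assigning each pipeline rank several small non-contiguous model chunks ("virtual stages") instead of one contiguous block so the pipeline bubble shrinks, can be sketched as below; the layer counts and round-robin assignment are illustrative.

```python
# Sketch of round-robin virtual-stage assignment for an interleaved pipeline schedule.
def assign_virtual_stages(num_layers, pipeline_ranks, virtual_stages_per_rank):
    chunks = pipeline_ranks * virtual_stages_per_rank
    layers_per_chunk = num_layers // chunks
    assignment = {rank: [] for rank in range(pipeline_ranks)}
    for chunk in range(chunks):
        rank = chunk % pipeline_ranks        # rank 0 gets chunks 0, P, 2P, ...
        start = chunk * layers_per_chunk
        assignment[rank].append(list(range(start, start + layers_per_chunk)))
    return assignment

# 16 layers, 4 pipeline ranks, 2 virtual stages per rank:
# rank 0 -> layers [0, 1] and [8, 9]; rank 1 -> [2, 3] and [10, 11]; and so on.
print(assign_virtual_stages(16, 4, 2))
```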
$π^{*}_{0.6}$: a VLA That Learns From Experience
Published:11/19/2025
Vision-Language-Action Model · RL Training for Large Language Models · Experience-Based Reinforcement Learning · Robotic Data Collection and Optimization · Advantage-Conditioned Policies
The study presents RECAP, a method for training Vision-Language-Action models through real-world learning. The model, pre-trained using offline reinforcement learning, demonstrates significant performance improvements on tasks such as laundry folding and espresso making.
FlowRL: Matching Reward Distributions for LLM Reasoning
Published:9/19/2025
RL Training for Large Language Models · Flow-Balanced Optimization Methods · Reward Distribution Matching · KL Divergence-Based Policy Optimization · Mathematical reasoning tasks
FlowRL introduces a novel method that matches full reward distributions via flow balancing, enhancing diversity in reasoning. Experiments show FlowRL improves performance by 10% over GRPO and 5.1% over PPO on math tasks, demonstrating the significance of reward distribution matching.
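One way to read "matching full reward distributions via flow balancing" is a trajectory-balance-style squared residual that pushes the policy toward a reward-tilted reference distribution; the learned per-prompt normalizer log_z and the reward scale beta below are illustrative assumptions rather than FlowRL's exact objective.

```python
# Assumed trajectory-balance-style loss: drives pi(y|x) toward pi_ref(y|x)*exp(beta*r)/Z.
import torch

def flow_balance_loss(log_z, policy_logprob, ref_logprob, reward, beta=1.0):
    """policy_logprob, ref_logprob, reward: (batch,) tensors; log_z: learned scalar."""
    residual = log_z + policy_logprob - beta * reward - ref_logprob
    return (residual ** 2).mean()
```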
USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model
Published:9/21/2025
RL Training for Large Language Models · Conversational Recommender Systems · LLM-based Recommendation Systems · User-Simulator-Based Framework · Preference Optimization Dataset Construction
The USB-Rec framework enhances large language models' conversational recommendation capabilities through a user-simulator-based preference optimization dataset and a self-enhancement strategy. Extensive experiments show it consistently outperforms existing state-of-the-art methods.
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Published:9/12/2025
Vision-Language-Action Model · Reinforcement Learning for Math Reasoning · RL Training for Large Language Models · Multi-Environment Rendering · Efficient Reinforcement Learning Framework
SimpleVLA-RL is introduced to enhance the training of Vision-Language-Action models using reinforcement learning, addressing data scarcity and generalization issues. Results show state-of-the-art performance with OpenVLA-OFT, reducing reliance on labeled data.
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Published:5/18/2025
RL Training for Large Language Models · Group Relative Policy Optimization · Discriminative Constrained Optimization Framework · Large Reasoning Models · Enhancement of Mathematical Reasoning Capabilities
DisCO is a new framework for Large Reasoning Models, addressing limitations of Group Relative Policy Optimization. By using a discriminative objective and non-clipping scoring functions, it eliminates difficulty bias and achieves stable long-term training, enhancing mathematical reasoning capabilities.
Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Published:10/29/2025
RL Training for Large Language Models · Sequence Policy Optimization · Large Language Model Fine-Tuning
Learn-to-Ask learns proactive LLMs from offline expert logs without simulators by leveraging observed future data to infer turn-by-turn rewards, decomposing long-horizon tasks for effective training and deployment in real-world high-stakes domains.
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Published:9/30/2025
RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · Sequence Policy Optimization · Memory Mechanisms for LLMs · Test-Time Scaling Techniques
ReasoningBank distills self-judged experiences into general reasoning strategies, enabling agents to retrieve and update memories for continual improvement. Combined with MaTTS, it enhances learning efficiency and performance in continuous multi-task scenarios.
MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation
Published:10/28/2025
Generative Recommendation Systems · Large Language Model Fine-Tuning · RL Training for Large Language Models · Sequence Policy Optimization · Residual Quantized Variational Autoencoder (RQ-VAE)
MiniOneRec, the first open-source generative recommendation framework, uses a Residual Quantized VAE for semantic IDs (SIDs) and post-trains 0.5B–7B-parameter Qwen models, confirming scaling benefits and improving ranking accuracy and diversity via aligned SID processing and constrained RL.
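The Residual Quantized VAE step that turns an item embedding into a semantic ID (SID) can be sketched as repeated nearest-code quantization of the running residual; the codebook sizes, number of levels, and nearest-neighbor rule here are illustrative, not MiniOneRec's exact configuration.

```python
# Sketch of residual quantization producing a multi-level semantic ID (SID).
import numpy as np

def residual_quantize(embedding, codebooks):
    """embedding: (d,) item embedding; codebooks: list of (K, d) arrays.
    Returns one code index per level plus the running reconstruction."""
    residual = embedding.copy()
    sid, recon = [], np.zeros_like(embedding)
    for codebook in codebooks:
        idx = int(np.argmin(((codebook - residual) ** 2).sum(axis=1)))  # nearest code
        sid.append(idx)
        recon += codebook[idx]
        residual = residual - codebook[idx]
    return sid, recon

# Usage: three levels of 256 codes over a 32-dim embedding give a 3-token SID.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(3)]
sid, _ = residual_quantize(rng.normal(size=32), codebooks)
```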
Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
Published:11/1/2023
Large Language Model Fine-Tuning · RL Training for Large Language Models · LLM-guided motion planning · Dialogue Policy Planning · Self-Play Reinforcement Learning
PPDPP introduces a tunable dialogue policy planner enhancing LLMs' proactive dialogue capabilities via supervised fine-tuning and reinforcement learning, achieving superior generalization and performance across diverse applications.
Self-Improving LLM Agents at Test-Time
Published:10/8/2025
Large Language Model Fine-Tuning · RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · LLM Confidence Calibration · Self-Improving Large Language Models
This work introduces a test-time self-improvement method for LLM agents using uncertainty detection, self-generated data augmentation, and fine-tuning, achieving higher accuracy with fewer samples and enhancing robustness in complex tasks through distillation.
Spinning Straw into Gold: Relabeling LLM Agent Trajectories in Hindsight for Successful Demonstrations
Published:10/8/2025
Large Language Model Fine-Tuning · Sequence Policy Optimization · RL Training for Large Language Models · Long-Horizon Consistency Modeling · LLM Reasoning Capacity Enhancement
Hindsight Supervised Learning relabels LLM agent trajectories with actually achieved goals, using masking and reweighting to enhance fine-tuning in long-horizon tasks, showing improved performance and sample efficiency over baselines in ALFWorld and WebShop.
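A minimal sketch of the hindsight relabeling step described above: a failed rollout is rewritten as a demonstration of whatever goal it actually achieved and down-weighted relative to genuine successes before fine-tuning. The trajectory fields and scalar weights are illustrative assumptions, not the paper's exact masking and reweighting scheme.

```python
# Hypothetical hindsight relabeling of an agent trajectory for supervised fine-tuning.
def relabel_in_hindsight(trajectory, achieved_goal,
                         success_weight=1.0, hindsight_weight=0.5):
    """trajectory: dict with 'instruction', 'steps', and 'success' (bool)."""
    if trajectory["success"]:
        return {**trajectory, "weight": success_weight}
    # Rewrite the instruction to match what the agent actually accomplished, so the
    # (instruction, steps) pair becomes a self-consistent demonstration.
    return {
        "instruction": achieved_goal,
        "steps": trajectory["steps"],
        "success": True,
        "weight": hindsight_weight,   # down-weight relabeled data vs. real successes
    }
```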