Tags: RL Training for Large Language Models - Paper Library

HybridFlow: A Flexible and Efficient RLHF Framework

Published:9/28/2024

RL Training for Large Language Models · Hybrid Controller RL Framework · Reinforcement Learning from Human Feedback · Dataflow Computation Model · Distributed Computation Optimization

HybridFlow is a hybrid framework that integrates single-controller and multi-controller paradigms to enhance the efficiency and flexibility of RLHF systems. It features hierarchical APIs and a 3D-HybridEngine for efficient model weight repartitioning, achieving 1.53 to 20.57 times throughpu…

A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation

Published:12/12/2025

RL Training for Large Language Models · Markov Decision Process Modeling · Automated Policy Generation · Verifiable Stage-wise Modeling · Advanced Reinforcement Learning Applications

The A-LAMP framework automates the transition from natural language task descriptions to MDP modeling and policy generation. By decomposing modeling, coding, and training into verifiable stages, A-LAMP enhances policy generation capabilities, outpacing traditional large language…

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Published:6/2/2025

RL Training for Large Language Models · Negative Reinforcement Mechanism · Mathematical Reasoning Dataset · Positive and Negative Sample Reinforcement · Strategies for Enhancing Reasoning Capability

This study examines reinforcement learning with verifiable rewards (RLVR), breaking down the learning signal into Positive Sample Reinforcement (PSR) and Negative Sample Reinforcement (NSR). It finds that training solely with negative samples enhances model performance and divers…

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Published:12/9/2025

RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · Sequence Policy Optimization · Long-Context Modeling · Reinforcement Learning for Math Reasoning

This study examines whether reinforcement learning (RL) truly enhances reasoning capabilities in language models, offering a transparent framework. Key findings include RL's effectiveness at the model's competence edge, with minimal pretraining seed needed for transfer, while mi…

Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Published:10/9/2025

RL Training for Large Language Models · Hybrid Reward Optimization · Math Reasoning Benchmarks · Reward Model-Based Learning · Sparse Reward Problem

The HERO framework integrates verifiable rewards with reward models to address the limitations of sparse feedback in large language model reasoning tasks. Using stratified normalization and variance-aware weighting, HERO significantly improves performance on mathematical reasonin…

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Published:10/21/2025

RL Training for Large Language Models · Balanced Policy Optimization · Adaptive Clipping Mechanism · Off-Policy Optimization · Efficient Sample Replay

This paper presents BAPO, a method for stabilizing off-policy reinforcement learning for large language models by using balanced policy optimization with adaptive clipping, addressing issues of optimization imbalance and improving sample efficiency.

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Published:10/8/2025

RL Training for Large Language Models · Agentic Retrieval-Augmented Generation · Hierarchical Process Rewards · Knowledge-Grounded Process Rewards · Search Decision Optimization

HiPRAG introduces a novel hierarchical process rewards method to tackle common over-search and under-search issues in agentic retrieval-augmented generation, improving search efficiency and accuracy significantly across multiple QA benchmarks, demonstrating the importance of opti…

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

RL Training for Large Language Models · Long-Context Modeling · LLM Reasoning Capacity Enhancement · Sparse Attention Mechanism

DeepSeek-V3.2 is presented, balancing computational efficiency and reasoning capabilities through three innovations: a sparse attention mechanism that reduces complexity, a scalable reinforcement learning framework rivaling GPT-5, and a synthesis pipeline enhancing generalization…

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Published:4/9/2021

RL Training for Large Language Models · Large Language Model Fine-Tuning · Transformer-Based Efficient Forward Prediction · GPU Cluster Training · Pipeline Parallel Training

This paper introduces a novel interleaved pipeline parallelism schedule, combining tensor, pipeline, and data parallelism, to enhance the training efficiency of large language models on GPU clusters, achieving 502 petaFLOP/s on 3072 GPUs with over 10% throughput improvement.

$π^{*}_{0.6}$: a VLA That Learns From Experience

Published:11/19/2025

Vision-Language-Action Model · RL Training for Large Language Models · Experience-Based Reinforcement Learning · Robotic Data Collection and Optimization · Advantage-Conditioned Policies

The study presents RECAP, a method for training Vision-Language-Action models through real-world learning. The $π^{*}_{0.6}$ model, pretrained using offline reinforcement learning, demonstrates significant performance improvements on various tasks, such as laundry folding and es…

FlowRL: Matching Reward Distributions for LLM Reasoning

Published:9/19/2025

RL Training for Large Language Models · Flow-Balanced Optimization Methods · Reward Distribution Matching · KL Divergence-Based Policy Optimization · Mathematical reasoning tasks

FlowRL introduces a novel method that matches full reward distributions via flow balancing, enhancing diversity in reasoning. Experiments show FlowRL improves performance by 10% over GRPO and 5.1% over PPO in math tasks, demonstrating the significance of reward distribution match…

USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model

Published:9/21/2025

RL Training for Large Language Models · Conversational Recommender Systems · LLM-based Recommendation Systems · User-Simulator-Based Framework · Preference Optimization Dataset Construction

The USB-Rec framework enhances Large Language Models' capabilities in conversational recommendation through a user-simulator-based preference optimization dataset and a self-enhancement strategy. Extensive experiments show it consistently outperforms existing state-of-the-art met…

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Published:9/12/2025

Vision-Language-Action Model · Reinforcement Learning for Math Reasoning · RL Training for Large Language Models · Multi-Environment Rendering · Efficient Reinforcement Learning Framework

SimpleVLA-RL is introduced to enhance the training of Vision-Language-Action models using reinforcement learning, addressing data scarcity and generalization issues. Results show state-of-the-art performance on OpenVLA-OFT, reducing reliance on labeled data.

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Published:5/18/2025

RL Training for Large Language Models · Group Relative Policy Optimization · Discriminative Constrained Optimization Framework · Large Reasoning Models · Enhancement of Mathematical Reasoning Capabilities

DisCO is a new framework for Large Reasoning Models, addressing limitations of Group Relative Policy Optimization. By using a discriminative objective and non-clipping scoring functions, it eliminates difficulty bias and achieves stable long-term training, enhancing mathematical…

Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

Published:10/29/2025

RL Training for Large Language Models · Sequence Policy Optimization · Large Language Model Fine-Tuning

Learn-to-Ask learns proactive LLMs from offline expert logs without simulators by leveraging observed future data to infer turn-by-turn rewards, decomposing long-horizon tasks for effective training and deployment in real-world, high-stakes domains.

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Published:9/30/2025

RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · Sequence Policy Optimization · Memory Mechanisms for LLMs · Test-Time Scaling Techniques

ReasoningBank distills self-judged experiences into general reasoning strategies, enabling agents to retrieve and update memories for continual improvement. Combined with MaTTS, it enhances learning efficiency and performance in continuous multitask scenarios.

MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation

Published:10/28/2025

Generative Recommendation Systems · Large Language Model Fine-Tuning · RL Training for Large Language Models · Sequence Policy Optimization · Residual Quantized Variational Autoencoder (RQ-VAE)

MiniOneRec, the first open-source generative recommendation framework, uses Residual Quantized VAE for SID and post-trains 0.5B–7B parameter Qwen models, confirming scaling benefits and improving ranking accuracy and diversity via aligned SID processing and constrained RL.

Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents

Published:11/1/2023

Large Language Model Fine-Tuning · RL Training for Large Language Models · LLM-guided motion planning · Dialogue Policy Planning · Self-Play Reinforcement Learning

PPDPP introduces a tunable dialogue policy planner enhancing LLMs' proactive dialogue capabilities via supervised fine-tuning and reinforcement learning, achieving superior generalization and performance across diverse applications.

Self-Improving LLM Agents at Test-Time

Published:10/8/2025

Large Language Model Fine-Tuning · RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · LLM Confidence Calibration · Self-Improving Large Language Models

This work introduces a test-time self-improvement method for LLM agents using uncertainty detection, self-generated data augmentation, and fine-tuning, achieving higher accuracy with fewer samples and enhancing robustness in complex tasks through distillation.

Spinning Straw into Gold: Relabeling LLM Agent Trajectories in Hindsight for Successful Demonstrations

Published:10/8/2025

Large Language Model Fine-Tuning · Sequence Policy Optimization · RL Training for Large Language Models · Long-Horizon Consistency Modeling · LLM Reasoning Capacity Enhancement

Hindsight Supervised Learning relabels LLM agent trajectories with actual achieved goals, using masking and reweighting to enhance fine-tuning in long-horizon tasks, showing improved performance and sample efficiency over baselines in ALFWorld and WebShop.
