HybridFlow: A Flexible and Efficient RLHF Framework
TL;DR Summary
HybridFlow is a hybrid framework that integrates single- and multi-controller paradigms to enhance the efficiency and flexibility of RLHF systems. It features hierarchical APIs and a 3D-HybridEngine for efficient model weight repartitioning, achieving a 1.53× to 20.57× throughput improvement over state-of-the-art baselines.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-HybridEngine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate 1.53×~20.57× throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code will be available at https://github.com/volcengine/verl.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
HybridFlow: A Flexible and Efficient RLHF Framework
1.2. Authors
Guangming Sheng (The University of Hong Kong), Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin (ByteDance), Chuan Wu (The University of Hong Kong).
1.3. Journal/Conference
EuroSys '25 (Twentieth European Conference on Computer Systems). Comment: EuroSys is a premier conference in the field of computer systems, comparable to SOSP/OSDI, indicating that this work is a significant contribution to systems research, particularly in distributed computing and machine learning systems.
1.4. Publication Year
Published online September 28, 2024 (arXiv preprint). The paper is accepted to EuroSys '25, which takes place in March-April 2025.
1.5. Abstract
This paper addresses the inefficiencies and inflexibilities in existing Reinforcement Learning from Human Feedback (RLHF) frameworks used for Large Language Model (LLM) alignment. RLHF involves complex dataflows with distributed training and generation nodes connected by many-to-many multicast data dependencies. Existing systems either use a single-controller paradigm (inefficient due to dispatch overhead) or a multi-controller paradigm (inflexible due to tight coupling of computation and communication). The authors propose HybridFlow, a framework combining both paradigms. It features hierarchical APIs to decouple computation from data dependencies and a 3D-HybridEngine for efficient actor model resharding (switching between training and generation) with zero memory redundancy. Experimental results show a throughput improvement of 1.53× to 20.57× compared to state-of-the-art baselines.
1.6. Original Source Link
- arXiv: https://arxiv.org/abs/2409.19256
- PDF: https://arxiv.org/pdf/2409.19256v2.pdf
- Code: https://github.com/volcengine/verl
2. Executive Summary
2.1. Background & Motivation
Reinforcement Learning from Human Feedback (RLHF) is the standard method for aligning Large Language Models (LLMs) like GPT-4 or Llama to human values (e.g., helpfulness, harmlessness). Unlike standard supervised training, RLHF is a complex, multi-stage process involving four distinct models:
- Actor: The LLM being trained (generates responses).
- Critic: Evaluates the responses (outputs a value score).
- Reference Policy: An unchanged copy of the original LLM (prevents the Actor from deviating too far).
- Reward Model: Scores responses based on human preference.
The Core Problem: Implementing RLHF efficiently is difficult because it combines distributed LLM training (computation-intensive) with LLM generation (memory-bound) in a complex dependency graph.
- Traditional RL Frameworks (e.g., RLLib): Use a Single-Controller (centralized) approach. This is flexible but slow for LLMs because the central controller becomes a bottleneck when dispatching millions of micro-operations to GPUs.
- Current RLHF Systems (e.g., DeepSpeed-Chat, OpenRLHF): Use a Multi-Controller (decentralized) approach. Each GPU runs its own control loop. This is fast but inflexible. Changing the dataflow (e.g., trying a new algorithm like Safe-RLHF) requires rewriting complex, hard-coded communication logic embedded deep within the training code.
- Resource Inefficiency: Actor training and generation have different optimal parallelism strategies. Existing systems often use suboptimal setups or copy model weights inefficiently, wasting GPU memory and bandwidth.
2.2. Main Contributions / Findings
- Hybrid Programming Model: HybridFlow proposes a "Hierarchical Hybrid" paradigm. It uses a Single-Controller for high-level orchestration (managing the dataflow graph and dependencies) and a Multi-Controller for low-level execution (running the heavy distributed matrix multiplications on GPUs). This offers the "best of both worlds": flexibility and efficiency.
- 3D-HybridEngine: A specialized engine for the Actor model that allows different parallel strategies for Training (computation-heavy) and Generation (memory-heavy). It introduces a novel zero-redundancy model resharding technique, eliminating the memory overhead typically required when switching between these two modes.
- Auto-Mapping Algorithm: An algorithm that automatically determines the best way to place the four models onto a cluster of GPUs and selects the optimal parallelism strategy (Data, Tensor, Pipeline parallelism) for each to minimize end-to-end latency.
- Performance: Achieves speedups of 1.53× to 20.57× over strong baselines like DeepSpeed-Chat and NeMo-Aligner.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand HybridFlow, one must grasp the following concepts:
- RLHF Workflow (PPO): The most common RLHF algorithm, Proximal Policy Optimization (PPO), operates in loops (iterations). Each iteration has three phases:
  - Generation: The Actor generates text based on prompts.
  - Preparation (Experience Collection): The Actor, Critic, Reference, and Reward models all perform forward passes to calculate scores, log-probabilities, and advantages.
  - Training: The Actor and Critic update their weights based on the calculated loss using backpropagation.
The following figure (Figure 1 from the original paper) illustrates this dataflow for PPO and its variants (Safe-RLHF, ReMax), showing the complex dependencies between models.
The image is a schematic of RLHF algorithm workflows, covering the PPO, Safe-RLHF, and ReMax variants. The marked stages are Generation, Preparation, and Training, each corresponding to different computation nodes, and the connections between them show how the models cooperate across the algorithms.
- LLM Parallelism (3D Parallelism): LLMs are too large to fit on one GPU. Strategies to split them include:
  - Data Parallelism (DP): Replicate the model on multiple GPUs; split the input data batch.
  - Tensor Parallelism (TP): Split individual large matrix multiplications across GPUs (intra-layer).
  - Pipeline Parallelism (PP): Split the model layers across GPUs (inter-layer).
  - 3D Parallelism: Combining DP, TP, and PP. A configuration is often denoted as a tuple (p, t, d); a short sketch after this list makes the notation concrete.
  - ZeRO (Zero Redundancy Optimizer): A form of DP that shards optimizer states and gradients to save memory.
- Controller Paradigms:
  - Single-Controller: One central CPU process tells every GPU exactly what to do (e.g., "Receive data," "Compute MatMul"). Good for logic, bad for latency at scale.
  - Multi-Controller (SPMD, Single Program Multiple Data): Every GPU runs a copy of the program; the copies coordinate at synchronization points (e.g., AllReduce) themselves. Fast, but hard to change logic dynamically.
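To make the (p, t, d) notation concrete, here is a minimal, framework-agnostic sketch of how a GPU rank can be decomposed into pipeline/tensor/data coordinates. The ordering convention varies between frameworks; this is purely illustrative bookkeeping, not any particular library's implementation.

```python
# Minimal sketch: decompose a GPU rank into its (pp, dp, tp) coordinates
# for a (p, t, d) 3D-parallel configuration.
def rank_to_groups(rank: int, p: int, t: int, d: int) -> dict:
    assert 0 <= rank < p * t * d
    tp_rank = rank % t                # position inside the tensor-parallel group
    dp_rank = (rank // t) % d         # which data-parallel replica
    pp_rank = rank // (t * d)         # which pipeline stage
    return {"pp": pp_rank, "dp": dp_rank, "tp": tp_rank}

# Example: 8 GPUs with p=2 pipeline stages, t=2 tensor shards, d=2 replicas.
for r in range(8):
    print(r, rank_to_groups(r, p=2, t=2, d=2))
```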
3.2. Previous Works & Limitations
The authors categorize existing systems into two groups:
- General RL Frameworks (RLLib, RLLib Flow): Designed for small neural networks. They use the single-controller model. Limitation: Too much overhead for LLMs; they treat an LLM like a small MLP, choking on the control messages needed for distributed execution.
- Dedicated RLHF Systems (DeepSpeed-Chat, OpenRLHF, NeMo-Aligner): Designed for LLMs using the multi-controller model.
  - DeepSpeed-Chat: Colocates all models on the same GPUs. Limitation: inflexible placement; wasteful if models have different resource needs.
  - OpenRLHF: Places models on separate GPUs (standalone). Limitation: uses two separate copies of the Actor (one for training, one for generation), wasting massive memory.
  - NeMo-Aligner: Uses 3D parallelism but forces the same parallel strategy for training and generation. Limitation: inefficient generation (generation prefers TP, training prefers PP/DP).
The following figure (Figure 2 from the original paper) contrasts these paradigms. Figure 2(a) shows the rigid nesting of communication and computation in existing multi-controller systems. Figure 2(b) shows HybridFlow's decoupled approach.
The image is a schematic (Figure 2) showing the difference in programming models for RLHF systems. Part (a) depicts the existing multi-controller paradigm and highlights its limited flexibility and efficiency; part (b) shows HybridFlow's hybrid programming model, which combines single-controller and multi-controller architectures to decouple computation and data dependencies for higher flexibility and execution efficiency. Grey nodes indicate operations that are not currently executing.
The following table (Table 1 from the original paper) summarizes the key differences between HybridFlow and baselines.
| RLHF system | DeepSpeed-Chat | OpenRLHF | NeMo-Aligner | HybridFlow |
|---|---|---|---|---|
| Parallelism | Training: ZeRO; Generation: TP | Training: ZeRO; Generation: TP | 3D Parallelism for both training and generation | Training: 3D Parallelism; Generation: 3D Parallelism |
| Actor weights | Model resharding from ZeRO to TP | Two copies of actor weights for the two stages | Identical model partition in the two stages (shared weights) | Zero-redundancy model resharding |
| Model placement | Colocate all models on the same set of devices | Each model placed on separate devices | Actor/Ref colocated on some GPUs; Critic/RM colocated on other GPUs | Supports various model placements |
| Execution pattern | Sequential | Parallel | Parallel | Supports various execution patterns |

Note: "TP" stands for Tensor Parallelism, "ZeRO" for Zero Redundancy Optimizer.
4. Methodology
4.1. Principles: The Hybrid Programming Model
HybridFlow's core philosophy is decoupling. It separates:
- Intra-node Computation (The "Heavy Lifting"): Handled by a Multi-Controller paradigm. This includes the massively parallel matrix operations required for LLM training and generation.
- Inter-node Coordination (The "Logic"): Handled by a Single-Controller. This manages the dataflow, telling the system when to move data from the Actor to the Critic, or when to switch the Actor from "Generation Mode" to "Training Mode."
This hybrid approach allows the control logic to be written simply in Python (flexible) while the execution remains as fast as optimized C++/CUDA code (efficient).
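As a flavor of what this looks like in practice, here is a minimal sketch of a single-controller PPO loop sitting on top of multi-controller worker groups. The method names mirror those mentioned in this analysis (generate_sequences, update_actor, etc.), but the wiring is illustrative, not verl's exact API.

```python
# A minimal sketch (not verl's exact API): the single controller expresses one
# PPO iteration as plain Python, while each call below would fan out to a
# multi-controller worker group under the hood.
class WorkerGroup:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, method):
        # Stand-in for dispatching the named operation to all workers of this model.
        def call(batch):
            print(f"[{self.name}] {method} on {len(batch['prompts'])} samples")
            return batch
        return call

actor, critic, reference, reward = (WorkerGroup(n) for n in ["actor", "critic", "ref", "rm"])

def compute_advantages(batch):
    return batch  # placeholder for the controller-side advantage computation

for prompts in [["p1", "p2"], ["p3", "p4"]]:       # toy dataloader
    batch = {"prompts": prompts}
    batch = actor.generate_sequences(batch)         # generation phase
    batch = reference.compute_log_prob(batch)       # preparation phase
    batch = critic.compute_values(batch)
    batch = reward.compute_reward(batch)
    batch = compute_advantages(batch)
    critic.update_critic(batch)                     # training phase
    actor.update_actor(batch)
```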
The following figure (Figure 4 from the original paper) presents the overall architecture, showing the Controller interacting with the Resource Pool and Parallel Workers.
The image is a schematic of the HybridFlow architecture, showing the RLHF dataflow graph, model configuration, and device configuration, including the layers of user input, ParallelWorker, resource pools, and physical devices. It highlights the 3D-HybridEngine and the auto-mapping method used for efficient RLHF training and generation.
4.2. Hierarchical APIs
To implement this, HybridFlow provides a layered API structure:
4.2.1. Intra-node: 3DParallelWorker
This is a base class that wraps existing efficient LLM engines (like Megatron-LM or vLLM). It handles the setup of 3D parallel groups (Process Groups for NCCL).
- Function: It initializes the model weights on the GPUs using a specific (p, t, d) configuration.
- Model Classes: Specific classes like ActorWorker, CriticWorker, etc., inherit from this. They expose high-level methods like update_actor or generate_sequences (sketched below).
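A hypothetical sketch of what declaring such workers might look like. The base class is spelled ParallelWorker3D here only so that it is a valid Python identifier; the method bodies are stubs standing in for the real engine calls, not verl's actual implementation.

```python
# Hypothetical sketch of model workers built on the paper's 3DParallelWorker idea.
class ParallelWorker3D:
    def __init__(self, p, t, d):
        # A real implementation would create the NCCL process groups for the
        # (p, t, d) configuration and load the sharded weights here.
        self.p, self.t, self.d = p, t, d

class ActorWorker(ParallelWorker3D):
    def generate_sequences(self, prompts):
        return {"prompts": prompts, "responses": ["<generated>" for _ in prompts]}

    def update_actor(self, batch):
        return {"actor_loss": 0.0}

class CriticWorker(ParallelWorker3D):
    def compute_values(self, batch):
        return {**batch, "values": [0.0] * len(batch["prompts"])}

    def update_critic(self, batch):
        return {"critic_loss": 0.0}

actor = ActorWorker(p=1, t=2, d=4)
print(actor.generate_sequences(["Explain RLHF in one sentence."]))
```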
4.2.2. Inter-node: Transfer Protocols
This is a critical innovation. Data (like prompts or hidden states) needs to move between models that might be sharded differently. For example, the Actor and the Critic may run with different tensor-parallel and data-parallel sizes, so their data layouts do not match one-to-one.
- Mechanism: Each operation is decorated with @register(protocol). A protocol defines a collect function (how to gather data from the source) and a distribute function (how to scatter data to the destination).
- Example: In Figure 5(b) (below), the single controller coordinates data moving from the Actor (Generation) to the Critic (Inference). The controller doesn't touch the heavy tensors itself; it instructs the workers to transfer data directly, avoiding a bottleneck.
  - Step 1-3: The controller calls collect on the Actor.
  - Step 4: The controller passes handles/metadata to the Critic.
  - Step 5-6: The Critic calls distribute to fetch the actual data chunks directly from the Actor GPUs.

The following figure (Figure 5 from the original paper) illustrates the API structure and the asynchronous data transfer protocol; a minimal code sketch of the protocol mechanism follows the figure.
The image is a schematic of the structure and function of the hierarchical APIs. (a) shows the initialization of the Actor model and resource pool allocation; (b) illustrates the asynchronous data resharding process under the single-controller mode, involving multiple data transfer steps.
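To make the protocol mechanism concrete, here is a minimal sketch of a collect/distribute pair attached to a worker method via a register-style decorator. All names here are illustrative assumptions, not verl's actual API.

```python
from typing import Any, Callable, Dict, List

# Minimal sketch of the transfer-protocol idea described above (hypothetical names).
PROTOCOLS: Dict[str, Dict[str, Callable]] = {
    "dp_concat_split": {
        # collect: merge the per-rank output shards of the source model
        "collect": lambda shards: [x for shard in shards for x in shard],
        # distribute: re-split the merged batch for the destination's DP ranks
        "distribute": lambda batch, n: [batch[i::n] for i in range(n)],
    }
}

def register(protocol: str):
    # Decorator that tags a worker method with the protocol used to move its
    # inputs/outputs between differently-sharded models.
    def wrap(fn):
        fn.transfer_protocol = protocol
        return fn
    return wrap

class CriticWorker:
    @register(protocol="dp_concat_split")
    def compute_values(self, micro_batch: List[Any]):
        return [0.0 for _ in micro_batch]

# Single-controller view: gather the Actor's outputs (4 DP ranks), then scatter
# them to the Critic's 2 DP ranks. In the real system only metadata/handles flow
# through the controller; the tensors move worker-to-worker.
actor_shards = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
proto = PROTOCOLS[CriticWorker.compute_values.transfer_protocol]
merged = proto["collect"](actor_shards)
critic_inputs = proto["distribute"](merged, 2)
print(critic_inputs)  # [['a', 'c', 'e', 'g'], ['b', 'd', 'f', 'h']]
```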
4.3. The 3D-HybridEngine
The Actor model is unique in RLHF because it performs two very different tasks: Training (backward pass, gradients) and Generation (auto-regressive forward passes).
- Training is compute-bound and memory-heavy (it needs to store optimizer states). It often uses high Pipeline Parallelism (PP) or Data Parallelism (DP).
- Generation is memory-bandwidth bound. It benefits from Tensor Parallelism (TP) to reduce latency and large Data Parallelism to maximize throughput.
The Challenge: Switching between these two optimal configurations requires "resharding" (moving weights around GPUs). If done naively (e.g., gathering all weights to CPU and redistributing), it is slow and consumes double the memory (redundancy).
The Solution: Zero-Redundancy Resharding. HybridFlow proposes a specific way to construct the parallel groups so that the weights needed for Generation are a subset of the weights already present for Training.
4.3.1. Parallel Group Definition
Let the parallel configuration for Training be defined by group sizes p (pipeline), t (tensor), and d (data), so the total number of GPUs is N = p × t × d. For Generation, we define a new configuration (p_g, t_g, d_g × d):
- p_g: Pipeline size for generation.
- t_g: Tensor size for generation.
- d_g: Micro-Data Parallel (micro-DP) size. This is a virtual DP dimension added during generation.
- Relationship: The number of GPUs is constant, so N = p_g × t_g × d_g × d. Therefore, the micro DP size is derived as d_g = (p × t) / (p_g × t_g) (a quick sanity check in code follows below).
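To double-check this bookkeeping, here is a minimal sketch assuming the constraint N = p_g × t_g × d_g × d stated above; the configurations are made-up examples.

```python
def micro_dp_size(p, t, d, p_g, t_g):
    # Total GPU count must be preserved: p * t * d = p_g * t_g * (d_g * d),
    # hence d_g = (p * t) / (p_g * t_g).
    d_g, rem = divmod(p * t, p_g * t_g)
    assert rem == 0, "generation groups must evenly divide the training groups"
    assert p_g * t_g * d_g * d == p * t * d
    return d_g

# Example: train with (p=4, t=8, d=2) on 64 GPUs, generate with (p_g=1, t_g=4):
print(micro_dp_size(4, 8, 2, 1, 4))  # -> 8 micro-DP groups
```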
4.3.2. Zero-Redundancy Logic
The engine organizes the GPU ranks such that the Generation TP groups are sub-groups of Training TP groups (or aligned strategically).
- Process:
  - Training Stage: Weights are sharded according to the training configuration (p, t, d).
  - Transition: Instead of gathering all weights globally, each GPU only gathers the specific weights it needs for its new role in the generation configuration.
  - Generation Stage: GPUs execute generation.
  - Back to Training: Gradients/updates are synchronized back to the (p, t, d) layout.

The following figure (Figure 8 from the original paper) visually demonstrates this. In 8(b), notice how the "Generation" groups (blue boxes) overlap efficiently with the "Training" groups, unlike the naive approach in 8(a), which requires grey "Redundant" memory blocks.
The image is a schematic of model weight resharding between the training and generation stages. It contrasts two grouping methods, HybridFlow-V and HybridFlow's optimized parallel grouping, highlighting the all-gathering of weights and the handling of redundant training weights; information is exchanged within the different groups, emphasizing efficient computation and data dependencies.
4.3.3. Communication Volume Analysis
The paper provides a rigorous comparison of communication costs. Let M be the model size.
- DeepSpeed-Chat (naive all-gather): Gathers all weights across all t × p × d GPUs in the cluster on every transition.
- HybridFlow (optimized): Restricts the gather to within the small micro-DP group, so each GPU fetches only the shards it is missing for its generation role.
- Explanation: By keeping the gather local to the micro-DP group, the communication volume is significantly reduced compared to gathering across the entire cluster (t × p × d GPUs); a back-of-the-envelope example follows.
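A back-of-the-envelope illustration of why restricting the gather helps, under the assumption that each GPU holds a 1/(p × t) weight shard during training and needs a 1/(p_g × t_g) shard for generation. The numbers are made up; the paper's own accounting is more detailed.

```python
def full_gather_gb(model_gb, num_gpus):
    # Naive approach: every GPU all-gathers the entire model from all others.
    return (num_gpus - 1) / num_gpus * model_gb

def targeted_gather_gb(model_gb, p, t, p_g, t_g):
    # Zero-redundancy approach: a GPU already holds 1/(p*t) of the weights and
    # only fetches the rest of the 1/(p_g*t_g) slice it needs for generation.
    return model_gb / (p_g * t_g) - model_gb / (p * t)

M = 14.0                      # ~7B parameters in fp16, in GB (illustrative)
p, t, d = 4, 8, 2             # training layout on 64 GPUs
p_g, t_g = 1, 4               # generation layout

print(full_gather_gb(M, p * t * d))            # ~13.8 GB fetched per GPU
print(targeted_gather_gb(M, p, t, p_g, t_g))   # ~3.1 GB fetched per GPU
```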
4.4. Auto-Mapping Algorithm
Given a cluster of GPUs and the four models, how do we place them? HybridFlow uses Algorithm 1 to solve this.
Algorithm Logic:
- Enumerate Placements: Generate all valid ways to partition the models into colocation groups (e.g., {Actor}, {Critic}, {Ref, Reward} vs. {Actor, Critic, Ref, Reward}). This is a Bell-number problem, but it is small for 4 models (15 options).
- Resource Check: For each placement, calculate the minimum number of GPUs needed to fit the models into memory (prevent OOM).
- Iterate Allocations: Try assigning different numbers of GPUs to each model group, starting from that minimum.
- Auto-Parallel Search: For a given model on a given number of GPUs, call auto_parallel. This subroutine (Algorithm 2 in the appendix) iterates through valid (p, t, d) combinations and uses a Simulator (an analytical cost model) to predict latency.
- Cost Estimation (d_cost): Simulate the end-to-end RLHF iteration time.
  - If models are on different devices (Set A, Set B), their latencies combine as the max (they run in parallel).
  - If models are colocated (Set A), their latencies combine as the sum (they run sequentially).
- Selection: Pick the mapping with the lowest total cost. (A simplified search sketch appears after the figure below.)

The following figure (Figure 3 from the original paper) shows an example of a placement plan and its execution timeline.
The image is a schematic of the execution pattern of a dataflow graph D. The left side shows the model placement plan, and the right side shows the execution pattern on different machines (A, B, C), marking how the generation (Gen), reference model (Ref), and reward model (RM) parts work together.
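The following is a heavily simplified sketch of the search loop described above (memory constraints omitted). simulate_latency is a stand-in for the paper's analytical simulator, and all numbers are placeholders, not measurements.

```python
from itertools import product

MODELS = ["actor", "critic", "ref", "reward"]

def partitions(items):
    # Enumerate all set partitions of the models (15 options for 4 models).
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def simulate_latency(model, num_gpus):
    # Stand-in for the analytical simulator (which would also search the best
    # (p, t, d) configuration); returns a fake latency.
    base = {"actor": 8.0, "critic": 4.0, "ref": 2.0, "reward": 2.0}[model]
    return base / num_gpus

def iteration_cost(placement, gpus_per_group):
    # Colocated models run sequentially (sum); separate groups run in parallel (max).
    return max(
        sum(simulate_latency(m, g) for m in group)
        for group, g in zip(placement, gpus_per_group)
    )

def auto_map(total_gpus):
    best_cost, best_plan = float("inf"), None
    for placement in partitions(MODELS):
        for alloc in product(range(1, total_gpus + 1), repeat=len(placement)):
            if sum(alloc) != total_gpus:
                continue
            cost = iteration_cost(placement, alloc)
            if cost < best_cost:
                best_cost, best_plan = cost, (placement, alloc)
    return best_cost, best_plan

print(auto_map(8))
```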
5. Experimental Setup
5.1. Datasets
- Dataset: "Dahoas/full-hh-rlhf" from HuggingFace.
- Content: A dataset for "Helpful and Harmless" RLHF. It consists of prompts and human-ranked responses.
- Purpose: Standard benchmark for aligning LLMs.
5.2. Evaluation Metrics
- RLHF Throughput (tokens/sec):
- Concept: Measures the total processing speed of the system.
- Formula: Throughput = Total Tokens / RLHF Iteration Time (tokens/sec).
- Explanation: "Total Tokens" is the batch size multiplied by the sequence length (prompt + response). "Time" covers the Generation, Experience Collection, and Training phases of one iteration (a toy calculation follows).
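A toy calculation of the metric, with made-up numbers purely for illustration:

```python
# Toy illustration of the throughput metric above; all numbers are made up.
batch_size = 1024
prompt_len, response_len = 256, 512
iteration_seconds = 120.0  # generation + experience collection + training

total_tokens = batch_size * (prompt_len + response_len)
print(total_tokens / iteration_seconds, "tokens/s")  # -> 6553.6 tokens/s
```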
5.3. Baselines
- DeepSpeed-Chat: Microsoft's system. Colocates all models. Uses ZeRO for training, TP for generation.
- OpenRLHF: Ray-based. Places models on separate GPUs. Uses two full copies of the Actor (memory inefficient).
- NeMo-Aligner: NVIDIA's system. Uses 3D parallelism but keeps the same configuration for training/generation (inefficient generation).
5.4. Hardware & Configuration
- Cluster: 16 nodes, each with 8 NVIDIA A100-80GB GPUs (Total 128 GPUs).
- Network: 600GB/s NVLink intra-node; 200Gbps inter-node.
- Models: Llama-2 architecture, sizes 7B, 13B, 34B, 70B.
6. Results & Analysis
6.1. Core Results Analysis (Throughput)
HybridFlow consistently outperforms all baselines across different model sizes and algorithms (PPO, ReMax, Safe-RLHF).
The following figure (Figure 9 from the original paper) shows the PPO throughput comparison.
The image is a chart showing HybridFlow's throughput improvement over the baseline systems on the PPO and ReMax tasks; the numbers indicate HybridFlow's speedup over each of the other methods.
- Analysis:
  - vs. NeMo-Aligner: Massive speedups. NeMo suffers because it uses training-optimized parallelism for generation, which is very slow.
  - vs. DeepSpeed-Chat: Clear speedups. DeepSpeed loses time resharding weights globally and cannot run models in parallel.
  - vs. OpenRLHF: Clear speedups. OpenRLHF wastes memory on duplicate weights, limiting the batch size it can fit, and incurs heavy communication overhead.
6.2. Model Placement Analysis
The authors validated their Auto-Mapping algorithm by manually testing different placements: "Colocate", "Standalone", "Split".
The following figures (Figure 12 and 13 from the original paper) show these comparisons.

The image is a chart comparing the throughput (tokens/s) of HybridFlow against the other placements (Colocate, Standalone, and Split) under different GPU counts, with separate panels for the 13B and 34B models, to help illustrate HybridFlow's efficiency advantage.
- Finding: There is no single "best" placement.
- Small Cluster (16-64 GPUs): "Colocate" is often best (avoids idle GPUs).
- Large Cluster (128 GPUs): "Standalone" or "Split" becomes better (allows parallel execution of Actor and Critic).
- HybridFlow's Advantage: The "HybridFlow" bar (representing the Auto-Mapping choice) always matches the highest performing manual strategy, proving the algorithm works.
6.3. 3D-HybridEngine Efficiency
Two key aspects were analyzed: Transition Time (Resharding) and Generation Efficiency.
Transition Time: The following figure (Figure 14 from the original paper) compares the time taken to switch between training and generation.
The image is a bar chart showing the transition time of different RLHF frameworks (OpenRLHF, HybridFlow-V, DS-Chat, and HybridFlow) under different GPU counts, with panels for the 7B, 13B, 34B, and 70B models; HybridFlow clearly spends the least time.
- Result: HybridFlow (green) is significantly faster than DeepSpeed (blue) and OpenRLHF (orange). For 70B models, HybridFlow reduces transition overhead by 89.1%. This validates the "Zero-Redundancy" communication formula derived in Section 4.3.3.
Generation Efficiency: The following figure (Figure 15 from the original paper) shows why decoupling parallelism is vital.
The image is a bar chart showing generation time and transition time for the 7B and 13B actor models under different generation parallel sizes; blue bars denote generation time, orange bars denote transition time, and the x-axis marks the different parallel configurations, making the impact of each configuration clear.
- Analysis:
  - The bars show Generation Time (blue) + Transition Time (orange).
  - Far right (generation TP set equal to the training TP size): This mimics NeMo-Aligner (reusing the training parallelism for generation). Generation is very slow (tall blue bar).
  - Middle (a smaller generation TP, e.g., 2 or 4): Using a smaller Tensor Parallelism size for generation significantly speeds up execution. HybridFlow enables this optimization; NeMo does not.
6.4. Auto-Mapping Runtime
The following figure (Figure 16 from the original paper) shows the runtime of Algorithm 1.
The image is a bar chart showing how model size and GPU count affect the runtime of the device-mapping algorithm; the runtime grows noticeably with scale, ranging from roughly 10 seconds to 1000 seconds.
- Result: Even for large clusters (128 GPUs) and large models (70B), the algorithm finds the optimal plan in seconds or minutes, which is negligible compared to the days required for training.
7. Conclusion & Reflections
7.1. Conclusion Summary
HybridFlow successfully addresses the rigidity and inefficiency of prior RLHF systems. By decoupling the control logic (Single-Controller) from the execution logic (Multi-Controller), it allows flexible algorithm design. By decoupling the parallelism strategies of Training and Generation via the 3D-HybridEngine, it achieves optimal performance for both phases without memory waste. The Auto-Mapping algorithm ensures these components are deployed optimally on any hardware cluster. The result is a robust framework that is up to 20.57× faster than state-of-the-art industrial solutions.
7.2. Limitations & Future Work
- Fault Tolerance: While the paper mentions basic checkpointing, complex fault tolerance (e.g., handling a node failure during an asynchronous multicast) in a hybrid-controller setup is non-trivial and not deeply stress-tested in the evaluation.
- Resource Multiplexing: The current resource pool assumes exclusive access or simple sequential sharing. Fine-grained GPU sharing (e.g., running a small Reward Model kernel alongside a large Actor kernel) is listed as future work.
- Homogeneous Assumption: The Auto-Mapping algorithm largely assumes homogeneous GPUs, though the authors claim it is extensible.
7.3. Personal Insights & Critique
- System Design Philosophy: The "Hybrid" control model is a brilliant application of the "Control Plane / Data Plane" separation concept found in networking. The Single Controller acts as the Control Plane (slow, logic-heavy), and the Multi-Controller workers act as the Data Plane (fast, throughput-heavy). This is a pattern likely to be adopted by future complex AI systems beyond RLHF (e.g., multi-agent systems).
- The "Zero-Redundancy" Insight: The realization that Generation weights are just a subset of Training weights—and that smart rank assignment can eliminate data movement—is a classic systems optimization: trading a bit of complexity in setup (group management) for massive gains in runtime performance.
- Impact: This framework lowers the barrier for researching new RLHF algorithms. Previously, testing a new dataflow (like Safe-RLHF) required hacking deeply coupled C++/Python code. With HybridFlow, it's Python-level logic, which could accelerate algorithmic innovation in alignment.