ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
TL;DR Summary
ReinFlow is an online reinforcement learning framework for fine-tuning flow matching policies in robotic control, enhancing exploration and training stability. Experiments show significant improvements in reward and success rates while reducing computation time on challenging tasks.
Abstract
We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov Process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies obtained an average net growth of 135.36% after fine-tuning in challenging legged locomotion tasks while saving denoising steps and 82.63% of wall time compared to state-of-the-art diffusion RL fine-tuning method DPPO [43]. The success rate of the Shortcut Model policies in state and visual manipulation tasks achieved an average net increase of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, whose performance is comparable to fine-tuned DDIM policies while saving computation time for an average of 23.20%. Project webpage: https://reinflow.github.io/
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
1.2. Authors
- Tonghe Zhang (Robotics Institute, School of Computer Science, Carnegie Mellon University)
- Chao Yu (Department of Electronic Engineering, Tsinghua University; Beijing Zhongguancun Academy)
- Sichang Su (Department of Aerospace Engineering, The University of Texas at Austin)
- Yu Wang (Department of Electronic Engineering, Tsinghua University)
1.3. Journal/Conference
The paper was posted to arXiv on May 28, 2025. The "NeurIPS Paper Checklist" included in the appendix indicates a submission to NeurIPS (Conference on Neural Information Processing Systems), a top-tier venue in machine learning and artificial intelligence.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces ReinFlow, a framework for fine-tuning "Flow Matching" robotic control policies using online Reinforcement Learning (RL). Flow Matching is a powerful method for generating robot actions, but training it via RL is difficult due to mathematical complexities (specifically, calculating likelihoods for deterministic paths). ReinFlow solves this by injecting learnable noise into the flow, converting it into a discrete-time Markov Process. This allows for exact likelihood computation and stable training. Experiments show that ReinFlow significantly improves the success rates and rewards of pre-trained policies in locomotion and manipulation tasks, while being faster (computationally more efficient) than state-of-the-art diffusion-based RL methods like DPPO.
1.6. Original Source Link
- Source: arXiv
- Link: https://arxiv.org/abs/2505.22094
- Status: Preprint / Under Review (Submitted to NeurIPS).
2. Executive Summary
2.1. Background & Motivation
- The Problem: Modern robots are often trained using Imitation Learning (IL), where they mimic human demonstrations. However, human data is often scarce or imperfect (suboptimal). To make robots strictly better than their human teachers, they need to learn through trial and error, known as Reinforcement Learning (RL).
- The Gap: A new class of generative models called Flow Matching (similar to but faster than Diffusion models) is becoming popular for generating robot actions. However, applying RL to fine-tune these models is technically very hard.
- Flow models are typically deterministic (based on Ordinary Differential Equations, or ODEs).
- To train with RL, we need to calculate the probability (likelihood) of an action. For flow models, this usually involves complex, unstable, and slow estimations (like ODE solvers and trace estimators).
- This is especially difficult when trying to run the model fast (using very few "denoising steps").
- Innovation: The paper proposes a method to turn the deterministic flow process into a stochastic (random) process by injecting "learnable noise." This small change makes the math for RL much simpler and exact, allowing for stable and fast training.
2.2. Main Contributions / Findings
- Algorithm Design (ReinFlow): The authors propose the first online RL algorithm capable of stably fine-tuning flow matching policies. By injecting noise, they convert the flow into a discrete-time Markov process, enabling exact calculation of action probabilities without complex ODE approximations.
- Efficiency: The method works exceptionally well even with very few inference steps (e.g., 1 to 4 steps), making the robot react very quickly.
- Performance:
- In locomotion tasks (e.g., a robot running), ReinFlow increased rewards by an average of 135.36%.
- In manipulation tasks (e.g., a robot arm moving objects), it increased success rates by 40.34%.
- It reduced wall-clock training time by 82.63% in the locomotion benchmarks (and by 23.20% on average in manipulation) compared to the leading competitor, DPPO.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner needs to grasp four key concepts:
- Flow Matching Models: Imagine you have a pile of random noise (the "action" at virtual time $t = 0$) and you want to turn it into a useful robot action (the action at $t = 1$). Flow matching learns a "velocity field" (a map of directions) that pushes the noise points smoothly toward the valid action points over a virtual time from 0 to 1. It is defined by an Ordinary Differential Equation (ODE): $\frac{\mathrm{d}a_t}{\mathrm{d}t} = v_\theta(t, a_t, o)$, where $o$ is the observation.
- Denoising Steps: Solving an ODE is continuous, but computers are discrete. We chop the time from 0 to 1 into $K$ steps (e.g., $K = 4$). The model predicts the velocity at each step to update the action (see the sketch after this list). Fewer steps mean faster robot reaction but potentially less accuracy.
- Markov Decision Process (MDP): A mathematical framework for decision-making. It assumes the future state depends only on the current state and action, not the history.
- Policy Gradient (PPO): A popular RL algorithm. It improves a policy (the robot's brain) by increasing the probability of actions that lead to high rewards and decreasing those that lead to low rewards. To do this, it needs to calculate $\log \pi(a \mid s)$ (the log-probability of taking action $a$ given state $s$).
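To make the denoising-step idea concrete, here is a minimal sketch of deterministic Euler integration of a flow policy's ODE. It assumes PyTorch; `velocity_net`, `obs`, and `action_dim` are illustrative names, not the paper's code:

```python
import torch

def sample_action_flow(velocity_net, obs, action_dim, num_steps=4):
    """Euler-integrate the flow ODE da/dt = v_theta(t, a, obs) from t = 0 to t = 1."""
    a = torch.randn(obs.shape[0], action_dim)      # start from pure noise at t = 0
    dt = 1.0 / num_steps                           # step size
    for k in range(num_steps):
        t = torch.full((obs.shape[0], 1), k * dt)  # current virtual time
        a = a + velocity_net(t, a, obs) * dt       # follow the predicted velocity
    return a                                       # denoised action at t = 1
```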
3.2. Previous Works
- Diffusion Policy Policy Optimization (DPPO): This is the main rival. It applies RL to Diffusion Models (a cousin of Flow Matching). DPPO treats the denoising process as a multi-step decision process.
- Limitation: DPPO typically requires many steps (e.g., 10+) to work well, making it slower to train and run.
- Flow Q-Learning (FQL): An offline RL method for flow models. It learns from a fixed dataset without interacting with the environment.
- Limitation: As an offline method, it cannot explore the world to find new, better strategies effectively.
3.3. Differentiation Analysis
- Vs. Diffusion RL (DPPO): ReinFlow is designed for Flow Matching, which is inherently straighter and faster than diffusion. ReinFlow achieves comparable or better results with far fewer steps (e.g., 1 step vs. DPPO's 5-10 steps).
- Vs. Offline Flow RL (FQL): ReinFlow is an online algorithm. It interacts with the environment, collects new data, and updates the policy, allowing it to surpass the limits of the static training data.
- Vs. Standard Flow Training: Standard training minimizes the difference between predicted and expert velocities. ReinFlow instead optimizes a task-defined reward objective (e.g., "did the robot walk far?"), using noise injection to make the optimization mathematically tractable.
4. Methodology
4.1. Principles
The core principle of ReinFlow is "Stochastic Conversion for Tractability." Standard Flow Matching inference is deterministic: given an initial noise and an observation, the output action is fixed. In RL, we need randomness (stochasticity) for two reasons:
- Exploration: The robot must try slightly different actions to discover better strategies.
- Math: Policy gradient algorithms require calculating the probability density of an action. For a deterministic path, this density is a Dirac delta (infinite at one point, zero elsewhere), which is mathematically ill-defined for gradient-based optimization.
ReinFlow solves this by training a secondary network to inject Gaussian noise at each step of the flow integration, effectively turning the deterministic line into a fuzzy, probabilistic tube.
4.2. Core Methodology In-depth (Layer by Layer)
The following figure (Figure 7 from the original paper) illustrates the ReinFlow procedure, showing how the pre-trained policy and noise injection network work together:
The figure is a schematic of the overall ReinFlow architecture: sensor data passes through an encoder to extract visuomotor features, which feed a velocity head and a noise injection network; the noise injection network outputs learnable noise, and together they generate the robot's action. Arrows indicate the data flow among the modules.
Step 1: The Action Generation Process (The Noise-Injected Flow)
The robot generates an action through a multi-step process.
- Initialization: Start with a random noise sample $a_0 \sim \mathcal{N}(0, I)$. Here, $I$ is the identity matrix of the action dimension.
- Step-by-Step Integration: We move from time $t = 0$ to $t = 1$ in $K$ steps. At each step $k$, the model predicts a velocity (direction) $v_\theta$ and a noise scale $\sigma_\phi$. The update rule for the action $a_{k+1}$ given the previous action $a_k$ is:
$$a_{k+1} = a_k + v_\theta(t_k, a_k, o)\,\Delta t + \sigma_\phi(t_k, a_k, o)\,\epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0, I),$$
equivalently, $a_{k+1} \sim \mathcal{N}\!\left(a_k + v_\theta(t_k, a_k, o)\,\Delta t,\ \sigma_\phi(t_k, a_k, o)^2 I\right)$.
- $a_k$: The denoised action at step $k$.
- $v_\theta(t_k, a_k, o)$: The velocity predicted by the pre-trained policy network $v_\theta$, conditioned on time $t_k$, current action $a_k$, and observation $o$.
- $\Delta t$: The step size (e.g., if $K = 4$, $\Delta t = 0.25$).
- $\sigma_\phi(t_k, a_k, o)$: The standard deviation predicted by the new noise injection network $\sigma_\phi$. This is the key innovation. It makes the transition probabilistic.
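A minimal sketch of this noise-injected rollout, assuming PyTorch and hypothetical callables `velocity_net` ($v_\theta$) and `noise_net` ($\sigma_\phi$, assumed to return a positive standard deviation). It illustrates the update rule above rather than reproducing the authors' implementation:

```python
import torch

def sample_action_reinflow(velocity_net, noise_net, obs, action_dim, num_steps=4):
    """Noise-injected rollout: each denoising step becomes a Gaussian transition."""
    a = torch.randn(obs.shape[0], action_dim)        # a_0 ~ N(0, I)
    dt = 1.0 / num_steps
    chain = [a]
    for k in range(num_steps):
        t = torch.full((obs.shape[0], 1), k * dt)
        mean = a + velocity_net(t, a, obs) * dt      # deterministic Euler mean
        sigma = noise_net(t, a, obs)                 # learnable std (assumed positive)
        a = mean + sigma * torch.randn_like(mean)    # stochastic update
        chain.append(a)
    return chain                                     # denoising chain a_0, ..., a_K
```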
Step 2: Exact Likelihood Computation
Because we defined the transition from $a_k$ to $a_{k+1}$ as a Gaussian (Normal) distribution, we can calculate the log-probability of the entire trajectory exactly. This is much simpler than solving integrals for continuous flows.
The joint log-probability of the sequence of actions $a_{0:K}$ is the sum of the log-probabilities of each step:
$$\log p_{\theta,\phi}(a_{0:K} \mid o) = \log p(a_0) + \sum_{k=0}^{K-1} \log \mathcal{N}\!\left(a_{k+1};\ a_k + v_\theta(t_k, a_k, o)\,\Delta t,\ \sigma_\phi(t_k, a_k, o)^2 I\right)$$
- Why this matters: This formula gives us a scalar number representing "how likely was this sequence of actions?" We can take the derivative of this number with respect to our network parameters ($\theta$ and $\phi$) to perform RL updates; the prior term $\log p(a_0)$ is parameter-free and drops out of the gradient.
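Because every transition is Gaussian, the chain's log-probability can be accumulated in closed form. A sketch under the same assumptions (hypothetical `velocity_net` / `noise_net`) as the rollout above:

```python
import torch
from torch.distributions import Normal

def chain_log_prob(velocity_net, noise_net, obs, chain):
    """Exact log-probability of a denoising chain a_0, ..., a_K
    (excluding the parameter-free prior term log p(a_0))."""
    num_steps = len(chain) - 1
    dt = 1.0 / num_steps
    logp = torch.zeros(obs.shape[0])
    for k in range(num_steps):
        a_k, a_next = chain[k], chain[k + 1]
        t = torch.full((obs.shape[0], 1), k * dt)
        mean = a_k + velocity_net(t, a_k, obs) * dt
        sigma = noise_net(t, a_k, obs)
        # log N(a_{k+1}; mean, sigma^2 I), summed over action dimensions
        logp = logp + Normal(mean, sigma).log_prob(a_next).sum(dim=-1)
    return logp   # shape: (batch,)
```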
Step 3: Policy Optimization (RL Update)
The paper derives a Policy Gradient Theorem specifically for this Markov process. The goal is to maximize the expected discounted return $J(\theta, \phi)$. The gradient used to update the parameters takes the form:
$$\nabla_{\theta,\phi} J \propto c_\gamma\, \mathbb{E}\!\left[ A^{\pi}(s, a_K)\, \nabla_{\theta,\phi} \log p_{\theta,\phi}(a_{0:K} \mid s) \right]$$
- $c_\gamma$: A constant related to the discount factor $\gamma$.
- $A^{\pi}(s, a_K)$: The Advantage Function. It measures how much better the final action $a_K$ was compared to the average action in that state.
- $\nabla_{\theta,\phi} \log p_{\theta,\phi}(a_{0:K} \mid s)$: This is simply the gradient of the log-probability we calculated in Step 2.
Implementation: The authors use PPO (Proximal Policy Optimization), a stable RL algorithm. They optimize a "clipped surrogate loss" combined with regularization:
$$\mathcal{L} = -\,\mathbb{E}\!\left[\min\!\left(r\,\hat{A},\ \operatorname{clip}\!\left(r,\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}\right)\right] + \mathcal{L}_{\text{reg}},$$
where $r$ is the probability ratio between the updated and the previous policy (computed from the log-probabilities of Step 2) and $\hat{A}$ is the estimated advantage.
- The first part (with $\min$ and $\operatorname{clip}$) is the standard PPO loss. It ensures the policy doesn't change too drastically in one update.
- $\mathcal{L}_{\text{reg}}$: This is a regularization term (explained below).
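A minimal sketch of such a clipped surrogate, where `logp_new` and `logp_old` would come from the chain log-probability of Step 2, `advantages` from a learned critic, and `reg_term` stands for the regularization penalty of Step 4. Names and the sign convention are illustrative assumptions, not the paper's exact loss:

```python
import torch

def ppo_surrogate_loss(logp_new, logp_old, advantages, reg_term,
                       clip_eps=0.2, reg_coef=0.01):
    """Clipped PPO objective over denoising chains, plus a regularization penalty.

    logp_new / logp_old: log-probabilities of the sampled chains under the current
    and the data-collecting policy; advantages: critic-based advantage estimates;
    reg_term: a scalar penalty (e.g., a W2 distance or negative entropy).
    """
    ratio = torch.exp(logp_new - logp_old)                          # importance ratio r
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_obj = torch.min(unclipped, clipped).mean()               # standard PPO surrogate
    return -policy_obj + reg_coef * reg_term                        # loss to minimize
```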
Step 4: Regularization
To prevent the robot from forgetting what it learned from human demonstrations (catastrophic forgetting) or exploring too wildly, ReinFlow adds a regularization term $\mathcal{L}_{\text{reg}}$.
- Wasserstein-2 Regularization ($W_2$): Keeps the new policy close to the old (pre-trained) policy.
  - This basically says: "The new action should not be too far (in Euclidean distance) from the action the pre-trained policy would have produced."
- Entropy Regularization: Encourages randomness (exploration).
  - $\mathcal{H}$: The differential entropy function. For a Gaussian, entropy is proportional to the log of the standard deviation ($\mathcal{H} \propto \sum \log \sigma$ up to a constant). So, maximizing entropy means encouraging the noise net to output a larger $\sigma_\phi$.
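Illustrative sketches of the two regularizers: the entropy of a diagonal Gaussian up to an additive constant, and a simple squared-distance proxy for the Wasserstein-2 term between actions of the new and the frozen pre-trained policy. These are simplified stand-ins, not the paper's exact formulas:

```python
import torch

def entropy_bonus(sigma):
    """Differential entropy of a diagonal Gaussian, up to an additive constant:
    H = sum(log sigma) + const. A larger bonus encourages larger injected noise."""
    return torch.log(sigma).sum(dim=-1).mean()

def w2_penalty(actions_new, actions_pretrained):
    """Squared Euclidean distance between actions sampled from the fine-tuned
    policy and from the frozen pre-trained policy for the same observations
    (a simple proxy for keeping the new policy close to the old one)."""
    return ((actions_new - actions_pretrained) ** 2).sum(dim=-1).mean()
```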
5. Experimental Setup
5.1. Datasets
The experiments use two main benchmarks:
- OpenAI Gym (Locomotion): Tasks include `Hopper`, `Walker2d`, `Ant`, and `Humanoid`.
  - Data: From the D4RL dataset. These are standard offline RL datasets containing trajectories of varying quality (Medium, Medium-Expert).
  - Input: State vectors (proprioception, e.g., joint angles).
  - Goal: Move forward as fast/stably as possible.
- Franka Kitchen & Robomimic (Manipulation):
  - Franka Kitchen: A robot arm in a kitchen environment performing tasks like opening a microwave or moving a kettle. Uses sparse rewards.
  - Robomimic: Tasks like `PickPlaceCan`, `NutAssemblySquare`, and `TwoArmTransport`.
  - Input: Can be state-based or visual (pixels).
  - Data: Human-teleoperated demonstrations (often suboptimal or mixed quality).
  - Goal: Complete the task (binary success).
5.2. Evaluation Metrics
- Episode Reward:
  - Definition: The total accumulated score the robot gets in one episode (trial).
  - Formula: $R = \sum_{t=0}^{T} \gamma^{t} r_t$
  - Symbols: $\gamma$ is the discount factor (future rewards are worth less), $r_t$ is the reward at step $t$.
- Success Rate:
  - Definition: The percentage of episodes where the robot successfully completes the assigned task (e.g., correctly placing the can).
  - Formula: $\text{Success Rate} = \frac{\#\,\text{successful episodes}}{\#\,\text{total episodes}} \times 100\%$
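A minimal sketch of how these two metrics could be computed from logged rollouts (plain Python, illustrative names):

```python
def discounted_episode_reward(rewards, gamma=0.99):
    """R = sum_t gamma^t * r_t over one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def success_rate(episode_successes):
    """Percentage of episodes flagged as successful (list of 0/1 or booleans)."""
    return 100.0 * sum(episode_successes) / len(episode_successes)
```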
5.3. Baselines
- DPPO (Diffusion Policy Policy Optimization): The state-of-the-art method for fine-tuning Diffusion policies. ReinFlow aims to beat this in speed and match/beat it in performance.
- FQL (Flow Q-Learning): An offline method for Flow policies. Used to show the benefit of online interaction.
- Behavior Cloning (BC): The performance of the pre-trained model before any RL fine-tuning. This serves as the starting point.
6. Results & Analysis
6.1. Core Results Analysis
ReinFlow demonstrates strong superiority in both learning efficiency and final performance.
- Comparison to Pre-training: ReinFlow consistently improves upon the behavior cloning (BC) baseline, proving that the RL fine-tuning works effectively to correct suboptimal demonstrations.
- Comparison to DPPO (Diffusion): ReinFlow achieves comparable or better rewards/success rates but does so with significantly fewer denoising steps (e.g., 4 steps vs. DPPO's 5-10 steps). This translates to massive wall-clock time savings.
- Comparison to FQL (Offline Flow): ReinFlow outperforms FQL, highlighting the necessity of online interaction to correct distribution shifts and explore better strategies.
The following figure (Figure 3 from the original paper) shows the success rates in Robomimic tasks. Note how ReinFlow (orange/purple curves) rises faster and often higher than DPPO (blue).
The figure shows ReinFlow's success rates on different tasks: (a) the Can task, (b) the Square task, and (c) the Transport task. The curves compare ReinFlow-S, ReinFlow-R, DPPO, and a Gaussian baseline.
The following figure (Figure 1 from the original paper) highlights the Wall Time Efficiency. ReinFlow (red/purple) achieves high rewards much faster (in seconds/minutes) than DPPO (blue).
The figure contains four subplots showing average episode reward versus wall-clock time on the Hopper-v2, Walker2d-v2, Ant-v2, and Humanoid-v3 tasks. ReinFlow-S and ReinFlow-R show clear gains and outperform DPPO and FQL in most tasks.
6.2. Data Presentation (Tables)
The following are the results from Table 4 of the original paper, detailing the improvement ratios after fine-tuning.
(a) Average Episode Reward in Locomotion Tasks.
| Task | Algorithm | Pre-trained Episode Reward | Fine-tuned Episode Reward | Reward Net Increase Ratio |
|---|---|---|---|---|
| Hopper-v2 | ReinFlow-R | 1431.80 ± 27.57 | 3205.33 ± 32.09 | 123.87% |
| Hopper-v2 | ReinFlow-S | 1528.34 ± 14.91 | 3283.27 ± 27.48 | 114.83% |
| Walker2d-v2 | ReinFlow-R | 2739.90 ± 74.57 | 4108.57 ± 51.77 | 49.95% |
| Walker2d-v2 | ReinFlow-S | 2739.19 ± 134.30 | 4254.87 ± 56.56 | 55.33% |
| Ant-v2 | ReinFlow-R | 1230.54 ± 8.18 | 4009.18 ± 44.60 | 225.81% |
| Ant-v2 | ReinFlow-S | 2088.06 ± 79.34 | 4106.31 ± 79.45 | 225.81% (the source repeats 225.81%, likely a typo; transcribed faithfully) |
| Humanoid-v3 | ReinFlow-R | 1926.48 ± 41.48 | 5076.12 ± 37.47 | 163.49% |
| Humanoid-v3 | ReinFlow-S | 2122.03 ± 105.01 | 4748.55 ± 70.71 | 123.77% |
(b) Average Success Rate in Manipulation Tasks.
| Environment and Task | Algorithm | Pre-trained Success Rate | Fine-tuned Success Rate | Success Rate Net Increase |
|---|---|---|---|---|
| Kitchen-complete | ReinFlow-S | 73.16 ± 0.84% | 96.17 ± 3.65% | 23.01% |
| Kitchen-mixed | ReinFlow-S | 48.37 ± 0.78% | 74.63 ± 0.36% | 26.26% |
| Kitchen-partial | ReinFlow-S | 40.00 ± 0.28% | 84.59 ± 12.38% | 44.59% |
| Can (image) | ReinFlow-R | 59.00 ± 3.08% | 98.67 ± 0.47% | 39.67% |
| Can (image) | ReinFlow-S | 57.83 ± 1.25% | 98.50 ± 0.71% | 40.67% |
| Square (image) | ReinFlow-R | 25.00 ± 1.47% | 74.83 ± 0.24% | 49.83% |
| Square (image) | ReinFlow-S | 34.50 ± 1.22% | 74.67 ± 2.66% | 40.17% |
| Transport (image) | ReinFlow-S | 30.17 ± 2.46% | 88.67 ± 4.40% | 58.50% |
The following are the results from Table 3 of the original paper, comparing computational speed.
| Task | Algorithm | Single Iteration Time: Seed 1 (s) | Seed 2 (s) | Seed 3 (s) | Mean ± Std (s) |
|---|---|---|---|---|---|
| Hopper-v2 | ReinFlow-R | 11.598 | 11.704 | 11.843 | 11.715 ± 0.123 |
| Hopper-v2 | ReinFlow-S | 12.051 | 12.127 | 12.372 | 12.290 ± 0.141 |
| Hopper-v2 | DPPO | 99.502 | 99.616 | 98.021 | 99.046 ± 0.890 |
| Hopper-v2 | FQL | 4.373 | 4.366 | 4.515 | 4.418 ± 0.084 |
| Walker2d-v2 | ReinFlow-R | 11.861 | 11.446 | 11.382 | 11.563 ± 0.260 |
| Walker2d-v2 | ReinFlow-S | 12.393 | 12.690 | 13.975 | 13.019 ± 0.841 |
| Walker2d-v2 | DPPO | 101.151 | 106.125 | 98.470 | 101.915 ± 3.884 |
| Walker2d-v2 | FQL | 5.248 | 4.597 | 5.207 | 5.017 ± 0.365 |
| Ant-v0 | ReinFlow-R | 17.210 | 17.685 | 17.524 | 17.473 ± 0.242 |
| Ant-v0 | ReinFlow-S | 17.291 | 17.821 | 18.090 | 17.734 ± 0.407 |
| Ant-v0 | DPPO | 102.362 | 104.632 | 99.042 | 102.012 ± 2.811 |
| Ant-v0 | FQL | 5.242 | 4.950 | 5.3086 | 5.167 ± 0.191 |
| Humanoid-v3 | ReinFlow-R | 31.437 | 30.223 | 31.088 | 30.916 ± 0.625 |
| Humanoid-v3 | ReinFlow-S | 30.499 | 30.058 | 31.029 | 30.529 ± 0.486 |
| Humanoid-v3 | DPPO | 109.884 | 105.455 | 113.358 | 109.566 ± 3.961 |
| Humanoid-v3 | FQL | 5.245 | 4.981 | 5.522 | 5.249 ± 0.271 |
| Franka Kitchen | ReinFlow-S | 26.655 | 26.328 | 26.628 | 26.537 ± 0.182 |
| Franka Kitchen | DPPO | 81.584 | 84.646 | 83.245 | 83.158 ± 1.533 |
| Franka Kitchen | FQL | 5.245 | 4.981 | 5.522 | 5.249 ± 0.271 |
| Can (image) | ReinFlow-S | 219.943 | 216.529 | 217.711 | 218.061 ± 1.734 |
| Can (image) | DPPO | 310.974 | 307.811 | 308.014 | 308.933 ± 1.771 |
| Square (image) | ReinFlow-S | 313.457 | 312.3 | 313.862 | 313.206 ± 0.811 |
| Square (image) | DPPO | 438.506 | 440.212 | 434.773 | 437.830 ± 2.782 |
| Transport (image) | ReinFlow-S | 554.196 | 557.712 | 559.006 | 558.359 ± 0.915 |
| Transport (image) | DPPO | 406.607 | 439.268 | 412.077 | 419.317 ± 17.493 |
Analysis: ReinFlow is roughly 5x to 9x faster than DPPO per iteration in state-based tasks (Hopper, Ant) and about 1.4x faster on the Can and Square image tasks, though its single-iteration time is higher than DPPO's on Transport. The savings come mainly from needing fewer denoising steps.
6.3. Ablation Studies / Parameter Analysis
The authors investigated several key factors:
- Scaling Data & Steps: Even with more pre-training data, RL fine-tuning provides significant gains. ReinFlow works well even with just 1 denoising step. The following figure (Figure 4 from the original paper) shows this scaling behavior. Graph (a) shows that RL (orange) consistently improves over BC (blue) regardless of dataset size.
  The figure contains several plots of ReinFlow's performance: (a) average reward on Hopper-v2 versus the number of inference steps and pre-training epochs, (b) the success rate of the Shortcut policy on Square as pre-training epochs increase, and (c) the success rate of ReinFlow on Square as the number of samples grows.
- Noise Level: The magnitude of the injected noise is critical. Too little noise leads to limited exploration; too much noise destabilizes training. There is a "sweet spot" for the noise standard deviation.
- Regularization: Entropy regularization was generally found to be more effective than Wasserstein-2 regularization, especially in locomotion tasks, as it actively encourages diverse behaviors.
7. Conclusion & Reflections
7.1. Conclusion Summary
ReinFlow presents a robust framework for fine-tuning Flow Matching policies using online RL. By cleverly injecting noise to convert the flow into a Markov process, it makes likelihood computation exact and efficient. This allows robots to learn faster and better than with previous methods, successfully bridging the gap between the high-quality generation of Flow models and the exploration capabilities of Reinforcement Learning.
7.2. Limitations & Future Work
- Noise Sensitivity: The performance is sensitive to the magnitude of the injected noise. Currently, this is a hyperparameter that needs tuning. Future work could look into auto-tuning mechanisms.
- Sample Efficiency: ReinFlow uses PPO (an on-policy algorithm), which is stable but requires a lot of interaction data compared to off-policy methods. Future work aims to explore more sample-efficient algorithms.
- Scaling: While tested on visual tasks, applying this to very large Vision-Language-Action (VLA) models remains a future challenge.
7.3. Personal Insights & Critique
- Innovation: The "Stochastic Conversion" is a brilliant simplification. Instead of fighting the math of ODEs (trace estimators, etc.), the authors changed the problem definition (by adding noise) to make it fit standard RL tools (PPO). This is a great example of "changing the rules of the game" to solve a problem.
- Practicality: The massive reduction in wall-clock time (due to few-step inference) is the biggest selling point for real-world robotics, where inference speed is crucial for safety and responsiveness.
- Critique: The reliance on PPO is a safe choice but potentially limits the method's speed in terms of sample complexity (data needed), even if it is fast in wall-clock time. Exploring off-policy variants (like SAC) with the exact likelihood formula derived here would be a logical next step.