
ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

Published: 05/28/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ReinFlow is an online reinforcement learning framework for fine-tuning flow matching policies in robotic control, enhancing exploration and training stability. Experiments show significant improvements in reward and success rates while reducing computation time in challenging tasks.

Abstract

We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov Process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies obtained an average net growth of 135.36% after fine-tuning in challenging legged locomotion tasks while saving denoising steps and 82.63% of wall time compared to state-of-the-art diffusion RL fine-tuning method DPPO [43]. The success rate of the Shortcut Model policies in state and visual manipulation tasks achieved an average net increase of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, whose performance is comparable to fine-tuned DDIM policies while saving computation time for an average of 23.20%. Project webpage: https://reinflow.github.io/

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

1.2. Authors

  • Tonghe Zhang (Robotics Institute, School of Computer Science, Carnegie Mellon University)
  • Chao Yu (Department of Electronic Engineering, Tsinghua University; Beijing Zhongguancun Academy)
  • Sichang Su (Department of Aerospace Engineering, The University of Texas at Austin)
  • Yu Wang (Department of Electronic Engineering, Tsinghua University)

1.3. Journal/Conference

The paper was published on arXiv on May 28, 2025. The "NeurIPS Paper Checklist" included in the appendix indicates a submission to NeurIPS (Conference on Neural Information Processing Systems), a top-tier venue in machine learning and artificial intelligence.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces ReinFlow, a framework for fine-tuning "Flow Matching" robotic control policies using online Reinforcement Learning (RL). Flow Matching is a powerful method for generating robot actions, but training it via RL is difficult due to mathematical complexities (specifically, calculating likelihoods for deterministic paths). ReinFlow solves this by injecting learnable noise into the flow, converting it into a discrete-time Markov Process. This allows for exact likelihood computation and stable training. Experiments show that ReinFlow significantly improves the success rates and rewards of pre-trained policies in locomotion and manipulation tasks, while being faster (computationally more efficient) than state-of-the-art diffusion-based RL methods like DPPO.

2. Executive Summary

2.1. Background & Motivation

  • The Problem: Modern robots are often trained using Imitation Learning (IL), where they mimic human demonstrations. However, human data is often scarce or imperfect (suboptimal). To make robots strictly better than their human teachers, they need to learn through trial and error, known as Reinforcement Learning (RL).
  • The Gap: A new class of generative models called Flow Matching (similar to but faster than Diffusion models) is becoming popular for generating robot actions. However, applying RL to fine-tune these models is technically very hard.
    • Flow models are typically deterministic (based on Ordinary Differential Equations, or ODEs).
    • To train with RL, we need to calculate the probability (likelihood) of an action. For flow models, this usually involves complex, unstable, and slow estimations (like ODE solvers and trace estimators).
    • This is especially difficult when trying to run the model fast (using very few "denoising steps").
  • Innovation: The paper proposes a method to turn the deterministic flow process into a stochastic (random) process by injecting "learnable noise." This small change makes the math for RL much simpler and exact, allowing for stable and fast training.

2.2. Main Contributions / Findings

  1. Algorithm Design (ReinFlow): The authors propose the first online RL algorithm capable of stably fine-tuning flow matching policies. By injecting noise, they convert the flow into a discrete-time Markov process, enabling exact calculation of action probabilities without complex ODE approximations.
  2. Efficiency: The method works exceptionally well even with very few inference steps (e.g., 1 to 4 steps), making the robot react very quickly.
  3. Performance:
    • In locomotion tasks (e.g., a robot running), ReinFlow increased rewards by an average of 135.36%.
    • In manipulation tasks (e.g., a robot arm moving objects), it increased success rates by 40.34%.
    • It reduced wall-clock training time by 82.63% on the locomotion benchmarks compared to the leading competitor, DPPO.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner needs to grasp four key concepts:

  • Flow Matching Models: Imagine you have a pile of random noise (action $A_0$) and you want to turn it into a useful robot action (action $A_1$). Flow matching learns a "velocity field" (a map of directions) that pushes the noise points smoothly toward the valid action points over a virtual time $t$ from 0 to 1. It is defined by an Ordinary Differential Equation (ODE): $\frac{dX_t}{dt} = v(t, X_t)$.
  • Denoising Steps: Solving an ODE is continuous, but computers are discrete. We chop the time from 0 to 1 into $K$ steps (e.g., $t = 0, 0.25, 0.5, 0.75, 1$). The model predicts the velocity at each step to update the action. Fewer steps mean faster robot reaction but potentially less accuracy. (A minimal code sketch of this $K$-step integration follows this list.)
  • Markov Decision Process (MDP): A mathematical framework for decision-making. It assumes the future state depends only on the current state and action, not the history.
  • Policy Gradient (PPO): A popular RL algorithm. It improves a policy (the robot's brain) by increasing the probability of actions that lead to high rewards and decreasing those that lead to low rewards. To do this, it needs to calculate $\ln \pi(a|s)$ (the log-probability of taking action $a$ given state $s$).
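
The bullets above describe a plain flow-matching policy before any noise is injected. As a concrete illustration, here is a minimal sketch of the $K$-step Euler integration of a learned velocity field, assuming a hypothetical PyTorch callable `velocity_net(t, a, obs)` standing in for $v(t, X_t)$; names and signatures are illustrative, not the authors' code:

```python
import torch

def sample_action_ode(velocity_net, obs, action_dim, num_steps=4):
    """Deterministic K-step Euler integration of a flow-matching ODE:
    start from Gaussian noise a^0 and push it toward a useful action."""
    batch = obs.shape[0]
    a = torch.randn(batch, action_dim)          # a^0 ~ N(0, I)
    dt = 1.0 / num_steps                        # uniform step size Delta t
    for k in range(num_steps):
        t = torch.full((batch, 1), k * dt)      # current flow time t_k
        v = velocity_net(t, a, obs)             # predicted velocity v(t_k, a^k, o)
        a = a + v * dt                          # Euler update: a^{k+1} = a^k + v * dt
    return a                                    # final action a^K (approximately A_1)
```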

3.2. Previous Works

  • Diffusion Policy Policy Optimization (DPPO): This is the main rival. It applies RL to Diffusion Models (a cousin of Flow Matching). DPPO treats the denoising process as a multi-step decision process.
    • Limitation: DPPO typically requires many steps (e.g., 10+) to work well, making it slower to train and run.
  • Flow Q-Learning (FQL): An offline RL method for flow models. It learns from a fixed dataset without interacting with the environment.
    • Limitation: As an offline method, it cannot explore the world to find new, better strategies effectively.

3.3. Differentiation Analysis

  • Vs. Diffusion RL (DPPO): ReinFlow is designed for Flow Matching, which is inherently straighter and faster than diffusion. ReinFlow achieves comparable or better results with far fewer steps (e.g., 1 step vs. DPPO's 5-10 steps).
  • Vs. Offline Flow RL (FQL): ReinFlow is an online algorithm. It interacts with the environment, collects new data, and updates the policy, allowing it to surpass the limits of the static training data.
  • Vs. Standard Flow Training: Standard training minimizes the difference between predicted and expert velocities. ReinFlow minimizes a "reward loss" defined by the task (e.g., "did the robot walk far?"), using noise injection to make the optimization mathematically tractable.

4. Methodology

4.1. Principles

The core principle of ReinFlow is "Stochastic Conversion for Tractability." Standard Flow Matching inference is deterministic: given an initial noise and an observation, the output action is fixed. In RL, we need randomness (stochasticity) for two reasons:

  1. Exploration: The robot must try slightly different actions to discover better strategies.

  2. Math: Policy gradient algorithms require calculating the probability density of an action. For a deterministic path, this density is a "Dirac delta" (infinite at one point, zero elsewhere), which is mathematically broken for gradient descent.

    ReinFlow solves this by training a secondary network to inject Gaussian noise at each step of the flow integration, effectively turning the deterministic line into a fuzzy, probabilistic tube.
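
To see concretely why the deterministic limit is "mathematically broken," the toy calculation below evaluates a one-dimensional Gaussian log-density as its standard deviation shrinks toward zero; this is an illustration of the argument above, not code from the paper:

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a one-dimensional Gaussian N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

# In the deterministic limit sigma -> 0, the density collapses toward a Dirac delta:
# the log-density explodes at the mean and plummets everywhere else, so gradients of
# log pi(a|s) become unusable. A finite, learnable sigma keeps them well behaved.
for sigma in (1.0, 0.1, 1e-3, 1e-6):
    at_mean = gaussian_logpdf(0.0, 0.0, sigma)
    off_mean = gaussian_logpdf(1.0, 0.0, sigma)
    print(f"sigma={sigma:g}: logpdf(mean)={at_mean:.2f}, logpdf(mean+1)={off_mean:.2e}")
```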

4.2. Core Methodology In-depth (Layer by Layer)

The following figure (Figure 7 from the original paper) illustrates the ReinFlow procedure, showing how the pre-trained policy and noise injection network work together:

Figure 7: Fine-tuning a flow matching policy with the online RL algorithm ReinFlow (Alg. 1). The schematic shows the overall ReinFlow architecture: sensor data passes through an encoder that extracts visuomotor features; these features feed a velocity head and a noise injection network, whose output is the learnable noise; combining the two produces the robot's action. The diagram illustrates the data flow and how the components relate.

Step 1: The Action Generation Process (The Noise-Injected Flow)

The robot generates an action through a multi-step process.

  1. Initialization: Start with a random noise sample $a^0$ drawn from a standard normal distribution, $a^0 \sim \mathcal{N}(0, \mathbb{I}_{d_A})$, where $\mathbb{I}_{d_A}$ is the identity matrix of the action dimension $d_A$.

  2. Step-by-Step Integration: We move from time $t=0$ to $t=1$ in $K$ steps. At each step $k$, the model predicts a velocity $v_\theta$ (direction) and a noise scale $\sigma_{\theta'}$. The update rule for the action $a^{k+1}$ given the previous action $a^k$ is (a minimal code sketch follows the list below): $$a^{k+1} \sim \mathcal{N}\big(\,\cdot \mid a^k + v_\theta(t_k, a^k, o)\,\Delta t_k,\ \sigma_{\theta'}^2(t_k, a^k, o)\big)$$

    • $a^k$: The denoised action at step $k$.
    • $v_\theta(t_k, a^k, o)$: The velocity predicted by the pre-trained policy network $\theta$, conditioned on time $t_k$, current action $a^k$, and observation $o$.
    • $\Delta t_k$: The step size (e.g., if $K=4$, $\Delta t = 0.25$).
    • $\sigma_{\theta'}(t_k, a^k, o)$: The standard deviation predicted by the new noise injection network $\theta'$. This is the key innovation: it makes the transition probabilistic.
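
Putting the pieces together, a minimal sketch of the noise-injected sampling loop might look as follows, assuming hypothetical modules `velocity_net` and `noise_net` standing in for $v_\theta$ and $\sigma_{\theta'}$ (illustrative only, not the authors' implementation):

```python
import torch

def sample_action_reinflow(velocity_net, noise_net, obs, action_dim, num_steps=4):
    """One noise-injected denoising pass: each Euler step becomes a Gaussian
    transition whose mean is the ODE update and whose std is learned."""
    batch = obs.shape[0]
    a = torch.randn(batch, action_dim)                # a^0 ~ N(0, I)
    dt = 1.0 / num_steps
    chain, means, stds = [a], [], []
    for k in range(num_steps):
        t = torch.full((batch, 1), k * dt)
        mean = a + velocity_net(t, a, obs) * dt       # a^k + v_theta(t_k, a^k, o) * dt_k
        std = noise_net(t, a, obs)                    # sigma_theta'(t_k, a^k, o), positive
        a = mean + std * torch.randn_like(mean)       # a^{k+1} ~ N(mean, std^2)
        chain.append(a); means.append(mean); stds.append(std)
    return chain, means, stds                         # kept for the exact likelihood in Step 2
```

Keeping the per-step means and standard deviations is what makes the exact likelihood computation in Step 2 straightforward.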

Step 2: Exact Likelihood Computation

Because we defined the transition from $a^k$ to $a^{k+1}$ as a Gaussian (normal) distribution, we can calculate the log-probability of the entire trajectory exactly. This is much simpler than solving integrals for continuous flows.

The joint log-probability of the sequence of actions $(a^0, \dots, a^K)$ is the sum of the log-probabilities of each step:

$$\ln \pi(a^0, \dots, a^K \mid \boldsymbol{o};\, \theta, \theta') = \ln \mathcal{N}(a^0 \mid 0, \mathbb{I}_{d_A}) + \sum_{k=0}^{K-1} \ln \mathcal{N}\!\left( a^{k+1} \,\middle|\, a^k + v_\theta(t_k, a^k, \boldsymbol{o})\,\Delta t_k,\ \sigma_{\theta'}^2(t_k, a^k, \boldsymbol{o}) \right)$$

  • Why this matters: This formula gives us a scalar number representing "how likely was this sequence of actions?" We can take the derivative of this number with respect to our network parameters ($\theta$ and $\theta'$) to perform RL updates.
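
Under the same assumptions as the sampling sketch above, the exact chain log-probability is just a sum of Gaussian log-densities; the snippet below is a hedged illustration of the formula, not the authors' code:

```python
import torch
from torch.distributions import Normal

def chain_log_prob(chain, means, stds):
    """Exact log-probability of a denoising chain (a^0, ..., a^K) produced as above:
    a^0 ~ N(0, I) and a^{k+1} ~ N(means[k], stds[k]^2)."""
    a0 = chain[0]
    logp = Normal(torch.zeros_like(a0), torch.ones_like(a0)).log_prob(a0).sum(-1)
    for k in range(len(means)):
        logp = logp + Normal(means[k], stds[k]).log_prob(chain[k + 1]).sum(-1)
    return logp  # one value per batch element, differentiable w.r.t. theta and theta'
```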

Step 3: Policy Optimization (RL Update)

The paper derives a Policy Gradient Theorem specifically for this Markov process. The goal is to maximize the expected reward $J(\pi)$. The gradient used to update the parameters is:

$$\nabla_\theta J(\pi^\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{o \sim d_\rho^{\pi^\theta}(\cdot)}\, \mathbb{E}_{a^0, \dots, a^K \sim \pi^\theta(\cdot \mid o)} \left[ A^{\pi^\theta}(o, a^K)\, \nabla_\theta \sum_{k=0}^{K-1} \ln \pi^\theta(a^{k+1} \mid a^k, o) \right]$$

  • $\frac{1}{1-\gamma}$: A constant related to the discount factor $\gamma$.

  • $A^{\pi^\theta}(o, a^K)$: The Advantage Function. It measures how much better the final action $a^K$ was compared to the average action in that state.

  • $\nabla_\theta \sum \dots$: This is simply the gradient of the log-probability we calculated in Step 2.

    Implementation: The authors use PPO (Proximal Policy Optimization), a stable RL algorithm. They optimize a "clipped surrogate loss" combined with regularization:

$$\theta, \theta' = \underset{\theta, \theta'}{\arg\min}\ \frac{1}{B} \sum_{i=1}^{B} \left[ -\min\!\left( \frac{\pi_{\bar{\theta}}(\mathbf{a}_i \mid \boldsymbol{o}_i)}{\pi_{\bar{\theta}_{\mathrm{old}}}(\mathbf{a}_i \mid \boldsymbol{o}_i)}\, \widehat{A}_i,\ \mathrm{clip}\!\left( \frac{\pi_{\bar{\theta}}(\mathbf{a}_i \mid \boldsymbol{o}_i)}{\pi_{\bar{\theta}_{\mathrm{old}}}(\mathbf{a}_i \mid \boldsymbol{o}_i)},\, 1-\epsilon,\, 1+\epsilon \right) \widehat{A}_i \right) + \alpha \cdot \mathcal{R}(\mathbf{a}_i, \boldsymbol{o}_i; \bar{\theta}, \bar{\theta}_{\mathrm{old}}) \right]$$

  • The first part (with min and clip) is the standard PPO loss. It ensures the policy doesn't change too drastically in one update.
  • $\alpha \cdot \mathcal{R}$: This is a regularization term (explained below).
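
A minimal sketch of this clipped surrogate loss, operating on the chain log-probabilities from Step 2, could look like the following (PyTorch; illustrative names only, not the authors' implementation):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2, reg=None, alpha=0.0):
    """Clipped surrogate objective on chain log-probabilities.
    logp_new / logp_old: log pi(a | o) under the current / data-collecting policy."""
    ratio = torch.exp(logp_new - logp_old.detach())              # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()                 # standard PPO clip term
    if reg is not None:
        loss = loss + alpha * reg                                # optional regularizer R
    return loss
```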

Step 4: Regularization

To prevent the robot from forgetting what it learned from human demonstrations (catastrophic forgetting) or exploring too wildly, ReinFlow adds a regularization term $\mathcal{R}$.

  1. Wasserstein-2 Regularization ($W_2$): Keeps the new policy close to the old policy. $$\mathcal{R}_{\mathbb{W}_2}(\theta, \theta_{\mathrm{old}}) = \mathbb{E}_{o}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid o),\ a_{\mathrm{old}} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid o)} \left[ \frac{1}{2}\, \| a - a_{\mathrm{old}} \|_2^2 \right]$$

    • This basically says: "The new action $a$ should not be too far (in Euclidean distance) from the old action $a_{\mathrm{old}}$."
  2. Entropy Regularization: Encourages randomness (exploration). $$\mathcal{R}_{\mathbf{h}}(\bar{\theta}) := -\frac{1}{K+1}\, \mathbb{E} \left[ \mathbf{h}\big(\mathcal{N}(0, \mathbb{I}_{d_A})\big) + \sum_{k=0}^{K-1} \mathbf{h}\!\left( \mathcal{N}\big( a^k + v_\theta \Delta t_k,\ \sigma_{\theta'}^2 \big) \right) \right]$$

    • $\mathbf{h}$: The differential entropy function. For a Gaussian, the entropy equals $\ln \sigma$ plus a constant, so maximizing entropy encourages the noise network to output a larger $\sigma$.
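
As an illustration of how the two regularizers could be estimated from rollout samples and the per-step standard deviations, here is a hedged sketch; it omits the constant entropy of the fixed initial noise $a^0$ and is not the authors' implementation:

```python
import math
import torch

def w2_regularizer(actions_new, actions_old):
    """Sample-based surrogate of the W2 term: keep new actions near the frozen old policy's."""
    return 0.5 * (actions_new - actions_old.detach()).pow(2).sum(-1).mean()

def entropy_regularizer(stds):
    """Negative average Gaussian differential entropy over denoising steps.
    For N(mu, diag(sigma^2)): h = sum_i log sigma_i + d/2 * log(2*pi*e),
    so minimizing this term pushes the noise network toward larger sigma."""
    d = stds[0].shape[-1]
    per_step = [s.log().sum(-1) + 0.5 * d * math.log(2.0 * math.pi * math.e) for s in stds]
    return -torch.stack(per_step, dim=0).mean()
```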

5. Experimental Setup

5.1. Datasets

The experiments use two main benchmarks:

  1. OpenAI Gym (Locomotion): Tasks include Hopper, Walker2d, Ant, and Humanoid.
    • Data: From the D4RL dataset. These are standard offline RL datasets containing trajectories of varying quality (Medium, Medium-Expert).
    • Input: State vectors (proprioception, e.g., joint angles).
    • Goal: Move forward as fast/stable as possible.
  2. Franka Kitchen & Robomimic (Manipulation):
    • Franka Kitchen: A robot arm in a kitchen environment performing tasks like opening a microwave or moving a kettle. Uses sparse rewards.
    • Robomimic: Tasks like PickPlaceCan, NutAssemblySquare, TwoArmTransport.
    • Input: Can be state-based or visual (pixels).
    • Data: Human teleoperated demonstrations (often suboptimal or mixed quality).
    • Goal: Complete the task (binary success).

5.2. Evaluation Metrics

  1. Episode Reward:
    • Definition: The total accumulated score the robot gets in one episode (trial).
    • Formula: $J(\pi) = \mathbb{E}^{\pi}\left[ \sum_{h=0}^{+\infty} \gamma^h r_h \right]$
    • Symbols: $\gamma$ is the discount factor (future rewards are worth less), $r_h$ is the reward at step $h$.
  2. Success Rate:
    • Definition: The percentage of episodes where the robot successfully completes the assigned task (e.g., correctly placing the can).
    • Formula: $\text{Success Rate} = \frac{\text{Number of Successful Episodes}}{\text{Total Episodes}} \times 100\%$
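
For reference, both metrics are straightforward to compute from logged episodes; the small helpers below are an illustrative sketch, not evaluation code from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted episode return: sum_h gamma^h * r_h."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

def success_rate(successes):
    """Percentage of episodes flagged as successful."""
    return 100.0 * sum(1 for s in successes if s) / len(successes)
```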

5.3. Baselines

  • DPPO (Diffusion Policy Policy Optimization): The state-of-the-art method for fine-tuning Diffusion policies. ReinFlow aims to beat this in speed and match/beat it in performance.
  • FQL (Flow Q-Learning): An offline method for Flow policies. Used to show the benefit of online interaction.
  • Behavior Cloning (BC): The performance of the pre-trained model before any RL fine-tuning. This serves as the starting point.

6. Results & Analysis

6.1. Core Results Analysis

ReinFlow demonstrates strong superiority in both learning efficiency and final performance.

  • Comparison to Pre-training: ReinFlow consistently improves upon the behavior cloning (BC) baseline, proving that the RL fine-tuning works effectively to correct suboptimal demonstrations.

  • Comparison to DPPO (Diffusion): ReinFlow achieves comparable or better rewards/success rates but does so with significantly fewer denoising steps (e.g., 4 steps vs. DPPO's 5-10 steps). This translates to massive wall-clock time savings.

  • Comparison to FQL (Offline Flow): ReinFlow outperforms FQL, highlighting the necessity of online interaction to correct distribution shifts and explore better strategies.

    The following figure (Figure 3 from the original paper) shows the success rates in Robomimic tasks. Note how ReinFlow (orange/purple curves) rises faster and often higher than DPPO (blue).

    Figure 3: Success rates in visual manipulation tasks in Robomimic. The panels show success rates on (a) the Can task, (b) the Square task, and (c) the Transport task; the curves compare ReinFlow-S, ReinFlow-R, DPPO, and a Gaussian baseline.

The following figure (Figure 1 from the original paper) highlights the Wall Time Efficiency. ReinFlow (red/purple) achieves high rewards much faster (in seconds/minutes) than DPPO (blue).

Figure 1: Wall time efficiency in OpenAI Gym. Dashed lines indicate the behavior cloning level. The four subplots plot average return against wall-clock time on Hopper-v2, Walker2d-v2, Ant-v2, and Humanoid-v3; ReinFlow-S and ReinFlow-R improve substantially faster than DPPO and FQL on most tasks.

6.2. Data Presentation (Tables)

The following are the results from Table 4 of the original paper, detailing the improvement ratios after fine-tuning.

(a) Average Episode Reward in Locomotion Tasks.

| Task | Algorithm | Pre-trained Episode Reward | Fine-tuned Episode Reward | Reward Net Increase Ratio |
| --- | --- | --- | --- | --- |
| Hopper-v2 | ReinFlow-R | 1431.80 ± 27.57 | 3205.33 ± 32.09 | 123.87% |
| Hopper-v2 | ReinFlow-S | 1528.34 ± 14.91 | 3283.27 ± 27.48 | 114.83% |
| Walker2d-v2 | ReinFlow-R | 2739.90 ± 74.57 | 4108.57 ± 51.77 | 49.95% |
| Walker2d-v2 | ReinFlow-S | 2739.19 ± 134.30 | 4254.87 ± 56.56 | 55.33% |
| Ant-v2 | ReinFlow-R | 1230.54 ± 8.18 | 4009.18 ± 44.60 | 225.81% |
| Ant-v2 | ReinFlow-S | 2088.06 ± 79.34 | 4106.31 ± 79.45 | 225.81% (repeats the value above; likely a typo in the source, transcribed faithfully) |
| Humanoid-v3 | ReinFlow-R | 1926.48 ± 41.48 | 5076.12 ± 37.47 | 163.49% |
| Humanoid-v3 | ReinFlow-S | 2122.03 ± 105.01 | 4748.55 ± 70.71 | 123.77% |

(b) Average Success Rate in Manipulation Tasks.

| Environment and Task | Algorithm | Pre-trained Success Rate | Fine-tuned Success Rate | Success Rate Net Increase |
| --- | --- | --- | --- | --- |
| Kitchen-complete | ReinFlow-S | 73.16 ± 0.84% | 96.17 ± 3.65% | 23.01% |
| Kitchen-mixed | ReinFlow-S | 48.37 ± 0.78% | 74.63 ± 0.36% | 26.26% |
| Kitchen-partial | ReinFlow-S | 40.00 ± 0.28% | 84.59 ± 12.38% | 44.59% |
| Can (image) | ReinFlow-R | 59.00 ± 3.08% | 98.67 ± 0.47% | 39.67% |
| Can (image) | ReinFlow-S | 57.83 ± 1.25% | 98.50 ± 0.71% | 40.67% |
| Square (image) | ReinFlow-R | 25.00 ± 1.47% | 74.83 ± 0.24% | 49.83% |
| Square (image) | ReinFlow-S | 34.50 ± 1.22% | 74.67 ± 2.66% | 40.17% |
| Transport (image) | ReinFlow-S | 30.17 ± 2.46% | 88.67 ± 4.40% | 58.50% |

The following are the results from Table 3 of the original paper, comparing computational speed.

All values are single-iteration times in seconds, measured over three seeds.

| Task | Algorithm | First seed (s) | Second seed (s) | Third seed (s) | Mean ± Std (s) |
| --- | --- | --- | --- | --- | --- |
| Hopper-v2 | ReinFlow-R | 11.598 | 11.704 | 11.843 | 11.715 ± 0.123 |
| Hopper-v2 | ReinFlow-S | 12.051 | 12.127 | 12.372 | 12.290 ± 0.141 |
| Hopper-v2 | DPPO | 99.502 | 99.616 | 98.021 | 99.046 ± 0.890 |
| Hopper-v2 | FQL | 4.373 | 4.366 | 4.515 | 4.418 ± 0.084 |
| Walker2d-v2 | ReinFlow-R | 11.861 | 11.446 | 11.382 | 11.563 ± 0.260 |
| Walker2d-v2 | ReinFlow-S | 12.393 | 12.690 | 13.975 | 13.019 ± 0.841 |
| Walker2d-v2 | DPPO | 101.151 | 106.125 | 98.470 | 101.915 ± 3.884 |
| Walker2d-v2 | FQL | 5.248 | 4.597 | 5.207 | 5.017 ± 0.365 |
| Ant-v2 | ReinFlow-R | 17.210 | 17.685 | 17.524 | 17.473 ± 0.242 |
| Ant-v2 | ReinFlow-S | 17.291 | 17.821 | 18.090 | 17.734 ± 0.407 |
| Ant-v2 | DPPO | 102.362 | 104.632 | 99.042 | 102.012 ± 2.811 |
| Ant-v2 | FQL | 5.242 | 4.950 | 5.3086 | 5.167 ± 0.191 |
| Humanoid-v3 | ReinFlow-R | 31.437 | 30.223 | 31.088 | 30.916 ± 0.625 |
| Humanoid-v3 | ReinFlow-S | 30.499 | 30.058 | 31.029 | 30.529 ± 0.486 |
| Humanoid-v3 | DPPO | 109.884 | 105.455 | 113.358 | 109.566 ± 3.961 |
| Humanoid-v3 | FQL | 5.245 | 4.981 | 5.522 | 5.249 ± 0.271 |
| Franka Kitchen | ReinFlow-S | 26.655 | 26.328 | 26.628 | 26.537 ± 0.182 |
| Franka Kitchen | DPPO | 81.584 | 84.646 | 83.245 | 83.158 ± 1.533 |
| Franka Kitchen | FQL | 5.245 | 4.981 | 5.522 | 5.249 ± 0.271 |
| Can (image) | ReinFlow-S | 219.943 | 216.529 | 217.711 | 218.061 ± 1.734 |
| Can (image) | DPPO | 310.974 | 307.811 | 308.014 | 308.933 ± 1.771 |
| Square (image) | ReinFlow-S | 313.457 | 312.3 | 313.862 | 313.206 ± 0.811 |
| Square (image) | DPPO | 438.506 | 440.212 | 434.773 | 437.830 ± 2.782 |
| Transport (image) | ReinFlow-S | 554.196 | 557.712 | 559.006 | 558.359 ± 0.915 |
| Transport (image) | DPPO | 406.607 | 439.268 | 412.077 | 419.317 ± 17.493 |

Analysis: ReinFlow is roughly 5x to 9x faster per iteration than DPPO in state-based tasks such as Hopper and Ant, and also faster on the Can and Square image-based tasks, due to needing fewer denoising steps; Transport is the one case reported where DPPO's per-iteration time is lower.

6.3. Ablation Studies / Parameter Analysis

The authors investigated several key factors:

  1. Scaling Data & Steps: Even with more pre-training data, RL fine-tuning provides significant gains. ReinFlow works well even with just 1 denoising step. The following figure (Figure 4 from the original paper) shows this scaling behavior. Graph (a) shows that RL (orange) consistently improves over BC (blue) regardless of dataset size.

    Figure 4: RL offers an orthogonal scaling path beyond data or inference. The gain is invariant to denoising steps (4 steps in Hopper, 1 in Square). Panel (a) shows average return on Hopper-v2 versus inference steps and pre-training episodes; panel (b) shows the Shortcut policy's success rate on Square as pre-training episodes increase; panel (c) shows ReinFlow's success rate on Square as the number of samples increases.

  2. Noise Level: The magnitude of the injected noise is critical. Too little noise leads to limited exploration; too much noise destabilizes training. There is a "sweet spot" for the noise standard deviation.

  3. Regularization: Entropy regularization was generally found to be more effective than Wasserstein-2 regularization, especially in locomotion tasks, as it actively encourages diverse behaviors.

7. Conclusion & Reflections

7.1. Conclusion Summary

ReinFlow presents a robust framework for fine-tuning Flow Matching policies using online RL. By cleverly injecting noise to convert the flow into a Markov process, it makes likelihood computation exact and efficient. This allows robots to learn faster and better than with previous methods, successfully bridging the gap between the high-quality generation of Flow models and the exploration capabilities of Reinforcement Learning.

7.2. Limitations & Future Work

  • Noise Sensitivity: The performance is sensitive to the magnitude of the injected noise. Currently, this is a hyperparameter that needs tuning. Future work could look into auto-tuning mechanisms.
  • Sample Efficiency: ReinFlow uses PPO (an on-policy algorithm), which is stable but requires a lot of interaction data compared to off-policy methods. Future work aims to explore more sample-efficient algorithms.
  • Scaling: While tested on visual tasks, applying this to very large Vision-Language-Action (VLA) models remains a future challenge.

7.3. Personal Insights & Critique

  • Innovation: The "Stochastic Conversion" is a brilliant simplification. Instead of fighting the math of ODEs (trace estimators, etc.), the authors changed the problem definition (by adding noise) to make it fit standard RL tools (PPO). This is a great example of "changing the rules of the game" to solve a problem.
  • Practicality: The massive reduction in wall-clock time (due to few-step inference) is the biggest selling point for real-world robotics, where inference speed is crucial for safety and responsiveness.
  • Critique: The reliance on PPO is a safe choice but potentially limits the method's speed in terms of sample complexity (data needed), even if it is fast in wall-clock time. Exploring off-policy variants (like SAC) with the exact likelihood formula derived here would be a logical next step.
