Paper status: completed

Fast and Robust Visuomotor Riemannian Flow Matching Policy

Published: 12/14/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces the Riemannian Flow Matching Policy (RFMP) for visuomotor tasks, offering fast inference and easy training. It incorporates geometric constraints for robustness and outperforms traditional diffusion policies in real and simulated tasks.

Abstract

Diffusion-based visuomotor policies excel at learning complex robotic tasks by effectively combining visual data with high-dimensional, multi-modal action distributions. However, diffusion models often suffer from slow inference due to costly denoising processes or require complex sequential training arising from recent distilling approaches. This paper introduces Riemannian Flow Matching Policy (RFMP), a model that inherits the easy training and fast inference capabilities of flow matching (FM). Moreover, RFMP inherently incorporates geometric constraints commonly found in realistic robotic applications, as the robot state resides on a Riemannian manifold. To enhance the robustness of RFMP, we propose Stable RFMP (SRFMP), which leverages LaSalle's invariance principle to equip the dynamics of FM with stability to the support of a target Riemannian distribution. Rigorous evaluation on ten simulated and real-world tasks show that RFMP successfully learns and synthesizes complex sensorimotor policies on Euclidean and Riemannian spaces with efficient training and inference phases, outperforming Diffusion Policies and Consistency Policies.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Fast and Robust Visuomotor Riemannian Flow Matching Policy

1.2. Authors

The paper is authored by Haoran Ding, Noémie Jaquier, Jan Peters, and Leonel Rozo. Their affiliations and research backgrounds are:

  • Haoran Ding: Affiliated with Tongji University, the Technical University of Darmstadt, the Bosch Center for Artificial Intelligence (BCAI), and Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). His research focuses on visuomotor policy learning, generative models, and imitation learning.

  • Noémie Jaquier: An assistant professor at KTH Royal Institute of Technology, leading the Geometric Robot (GeoRob) Lab. Her expertise lies in leveraging differential geometry (including Riemannian manifolds) for robot learning, optimization, and control.

  • Jan Peters: A full professor at the Technische Universitaet Darmstadt and a department head at the German Research Center for Artificial Intelligence (DFKI). He is a highly influential researcher and IEEE Fellow with extensive work in machine learning for robotics.

  • Leonel Rozo: A lead research scientist at the Bosch Center for Artificial Intelligence (BCAI). His research focuses on learning robot motion skills from human demonstrations, combining machine learning, optimal control, and Riemannian manifold theory.

    The collective expertise of the authors is perfectly aligned with the paper's topic, combining deep knowledge of robotics, generative models, and the geometric foundations of robot motion.

1.3. Journal/Conference

The paper is an arXiv preprint. The authors' previous work on this topic [17] was published in the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024. IROS is one of the premier international conferences in robotics, known for its high standards and significant impact on the field.

1.4. Publication Year

The paper was submitted to arXiv in December 2024.

1.5. Abstract

The abstract introduces a new method for learning visuomotor policies for robots, named Riemannian Flow Matching Policy (RFMP). It addresses the slow inference speed of popular diffusion-based models. RFMP is built upon Flow Matching (FM), a generative modeling technique known for easy training and fast inference. A key feature of RFMP is its inherent ability to handle geometric constraints, as robot states (like orientation) often exist on Riemannian manifolds. To further improve robustness, the paper proposes Stable RFMP (SRFMP), which uses LaSalle's invariance principle to ensure the model's dynamics are stable and converge to the target action distribution. The authors evaluate their methods on ten simulated and real-world tasks, demonstrating that RFMP and SRFMP outperform state-of-the-art Diffusion Policies and Consistency Policies in both training efficiency and inference speed.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: In modern robotics, learning complex skills from human demonstrations (imitation learning) is a central challenge. Visuomotor policies, which map visual sensor data to robot actions, are key to this. Diffusion models, such as Diffusion Policies (DP), have become state-of-the-art because they excel at learning complex, multi-modal action distributions (i.e., when there are multiple valid ways to perform a task). However, their main drawback is slow inference: diffusion models generate actions through an iterative "denoising" process, which can take up to a second per action on a standard GPU. This delay makes them unsuitable for real-time, reactive robot control.

  • Existing Gaps:

    1. Inference Speed: While techniques like Denoising Diffusion Implicit Models (DDIM) offer some speed-up, they are still not fast enough for many applications.
    2. Training Complexity: Recent methods like Consistency Policies (CP) accelerate inference by "distilling" a pre-trained diffusion model into a faster student model. However, this introduces a complex, multi-stage training process that is computationally expensive and can be unstable.
    3. Geometric Constraints: Robot actions, particularly end-effector orientations (rotations), are not simple vectors in Euclidean space. They reside on curved spaces known as Riemannian manifolds (e.g., the sphere S3S^3 for quaternions). Standard diffusion models ignore this geometry, which can lead to invalid actions and require naive post-processing fixes like normalization.
  • Paper's Entry Point: The authors propose abandoning the diffusion framework in favor of a newer class of generative models: Flow Matching (FM). FM models are known for being much easier to train (simulation-free) and faster at inference because they learn a simpler Ordinary Differential Equation (ODE) flow. To address the geometric constraints, they leverage Riemannian Flow Matching (RFM), an extension of FM for data on manifolds. This forms the basis of their Riemannian Flow Matching Policy (RFMP). To tackle potential instability issues with RFMP, they introduce Stable RFMP (SRFMP), which incorporates principles from control theory to guarantee the learned flow converges robustly to the desired action distribution.

2.2. Main Contributions / Findings

The paper presents three main contributions:

  1. A Novel Stable Flow Matching Framework for Manifolds (SRFM): The authors introduce Stable Riemannian Flow Matching (SRFM). This is a theoretical contribution that extends Stable Flow Matching (SFM) from Euclidean spaces to Riemannian manifolds. It provides a principled way to construct a generative flow that is guaranteed to be stable and converge to a target distribution on a curved space.

  2. A Fast and Robust Policy Learning Algorithm (SRFMP): Building on SRFM, they propose the Stable Riemannian Flow Matching Policy (SRFMP). This is a practical algorithm for learning visuomotor policies that inherits the fast training and inference of RFMP but adds a crucial layer of robustness. The stability ensures that the policy's action generation process is reliable and less sensitive to integration parameters, which can further speed up inference.

  3. Comprehensive Empirical Validation: The authors conduct a rigorous evaluation of both RFMP and SRFMP across a wide range of 10 tasks, including simulated benchmarks (like Robomimic and Franka Kitchen) and two real-world robot manipulation tasks. Their experiments show that RFMP and SRFMP achieve performance comparable or superior to Diffusion Policies (DP) and Consistency Policies (CP) but with significantly faster inference and simpler, more efficient training. Notably, SRFMP demonstrates superior robustness, especially in long-horizon tasks.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, it's essential to grasp the following concepts:

  • Visuomotor Policy: In robotics, a policy $\pi(a|o)$ is a function that determines which action $a$ a robot should take given a certain observation $o$. A visuomotor policy specifically uses visual data (e.g., camera images) as part of its observation. The goal of imitation learning is to train a policy that mimics expert demonstrations.

  • Generative Models: These are machine learning models that learn the underlying probability distribution of a dataset. Once trained, they can generate new, synthetic data samples that resemble the training data. This paper focuses on generative models for creating robot action sequences.

  • Diffusion Models (DDPM/DDIM): A popular class of generative models.

    • Forward Process: They start with a real data sample (e.g., an expert action) and gradually add Gaussian noise over many steps until it becomes pure noise.
    • Reverse Process: They then train a neural network to reverse this process. The network learns to predict and remove the noise at each step.
    • Inference: To generate a new sample, the model starts with random noise and iteratively applies the denoising network to produce a clean data sample. This iterative process is computationally expensive and is the primary reason for the slow inference of Diffusion Policies (DP).
  • Continuous Normalizing Flows (CNF): A type of generative model that transforms a simple, known probability distribution (like a Gaussian) into a complex data distribution using an invertible transformation. This transformation is defined as the solution to an Ordinary Differential Equation (ODE): $\frac{d\mathbf{x}(t)}{dt} = v_t(\mathbf{x}(t))$, where $v_t$ is a neural network called a vector field. Training CNFs traditionally requires maximizing the data likelihood, which involves solving this ODE, making it slow.

  • Flow Matching (FM): A modern, efficient method for training CNFs. Instead of maximizing likelihood, FM defines a simple, fixed "target" flow (e.g., a straight line) between a noise sample and a data sample. It then trains the neural network vector field $v_t$ to directly match this simple target vector field. This is a regression problem, which avoids solving the ODE during training, making it significantly faster and more stable than traditional CNF training.

  • Riemannian Manifolds:

    • Intuition: A Riemannian manifold is a mathematical space that is "curved" but locally resembles "flat" Euclidean space. Think of the surface of a sphere: up close, it looks like a flat plane, but globally it's curved. This concept is vital for robotics because many robot states, like orientations, are not simple vectors. For example, a 3D rotation can be represented by a unit quaternion, which lives on the surface of a 4D hypersphere ($S^3$), a Riemannian manifold.
    • Key Tools:
      • Tangent Space ($T_x\mathcal{M}$): At any point $x$ on the manifold $\mathcal{M}$, the tangent space is the flat vector space that best approximates the manifold at that point. It's where we can perform standard vector operations like addition and scaling.
      • Exponential Map ($\mathrm{Exp}_x$): A function that "rolls" the flat tangent space onto the curved manifold. It takes a vector $u$ in the tangent space $T_x\mathcal{M}$ and maps it to a point $y$ on the manifold by moving along a geodesic (the "straightest" path on the manifold).
      • Logarithmic Map ($\mathrm{Log}_x$): The inverse of the exponential map. It takes a point $y$ on the manifold and finds the corresponding vector in the tangent space $T_x\mathcal{M}$.
      • Riemannian Metric ($g_x$): This defines how to measure distances and norms within each tangent space, accounting for the manifold's curvature.
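
To make these operations concrete, below is a minimal NumPy sketch of the exponential and logarithmic maps on the unit sphere, the manifold used for quaternion orientations in the paper. The function names and numerical tolerances are our own; this is an illustration, not the paper's implementation.

```python
import numpy as np

def sphere_exp(x, u, eps=1e-12):
    """Exponential map on the unit sphere: move from x along tangent vector u."""
    norm_u = np.linalg.norm(u)
    if norm_u < eps:
        return x
    return np.cos(norm_u) * x + np.sin(norm_u) * (u / norm_u)

def sphere_log(x, y, eps=1e-12):
    """Logarithmic map on the unit sphere: tangent vector at x pointing towards y."""
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(cos_theta)            # geodesic distance between x and y
    v = y - cos_theta * x                   # component of y orthogonal to x
    norm_v = np.linalg.norm(v)
    if norm_v < eps:
        return np.zeros_like(x)
    return theta * v / norm_v

# Round trip: Log followed by Exp recovers the target point.
x = np.array([0.0, 0.0, 1.0])
y = np.array([1.0, 0.0, 0.0])
u = sphere_log(x, y)
print(np.allclose(sphere_exp(x, u), y))  # True
```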

3.2. Previous Works

  • Diffusion Policy (DP) [2]: This seminal work applied diffusion models to visuomotor policy learning. It demonstrated state-of-the-art performance on various manipulation tasks by effectively modeling multi-modal action distributions. However, its major drawback, as highlighted by the current paper, is its slow inference, requiring 10 to 100 denoising steps.

  • Consistency Policy (CP) [3]: This work aims to accelerate DP. It uses a technique called consistency distillation. First, a standard DP model is trained (the "teacher"). Then, a new model (the "student") is trained to map any point along a denoising trajectory directly to the final clean action. This allows for much faster inference, sometimes in a single step. The downside is the two-stage training process, which increases overall training time and complexity.

  • Flow Matching (FM) [15]: The foundational paper for the method used. Lipman et al. proposed Conditional Flow Matching (CFM), where the model learns a vector field to transport a noise sample $\mathbf{x}_0$ to a data sample $\mathbf{x}_1$. The CFM loss is: $\ell_{\mathrm{CFM}}(\boldsymbol{\theta}) = \mathbb{E}_{t, q(\mathbf{x}_1), p(\mathbf{x}_0)} \| v_t(\mathbf{x}_t; \boldsymbol{\theta}) - u_t(\mathbf{x}_t | \mathbf{x}_1) \|_2^2$, where $v_t$ is the learned network, $u_t$ is a simple target vector field, and $\mathbf{x}_t$ is a point interpolated between $\mathbf{x}_0$ and $\mathbf{x}_1$. This simple regression loss is the key to FM's training efficiency.

  • Riemannian Flow Matching (RFM) [16]: This work extended FM to non-Euclidean spaces. The core idea is to define the flow and the loss function on a Riemannian manifold. The interpolation between noise $\mathbf{x}_0$ and data $\mathbf{x}_1$ is done along a geodesic path, and the loss is measured using the Riemannian metric: $\ell_{\mathrm{RCFM}} = \mathbb{E}_{t, q(\mathbf{x}_1), p(\mathbf{x}_0)} \| v_t(\mathbf{x}_t; \boldsymbol{\theta}) - u_t(\mathbf{x}_t | \mathbf{x}_1) \|_{g_{\mathbf{x}_t}}^2$. This allows the model to respect the geometric constraints of the data, which is what RFMP builds upon.

  • Stable Autonomous Flow Matching (SFM) [19, 20]: This work identified that standard FM flows can become unstable and diverge after the integration time reaches $t=1$. To fix this, they introduced stability using LaSalle's invariance principle from control theory. By designing a vector field that is the negative gradient of a specific energy function (a Lyapunov-like function), they ensure the flow converges to the target distribution and stays there. This is the core idea that SRFMP generalizes to Riemannian manifolds.
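
To make the CFM objective above concrete, here is a minimal PyTorch sketch of one training step under the common linear (optimal-transport) interpolation $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$ with target $u_t = \mathbf{x}_1 - \mathbf{x}_0$. The network architecture, hyperparameters, and toy data are illustrative assumptions, not the paper's setup.

```python
import math
import torch
import torch.nn as nn

# Placeholder vector-field network v_t(x_t; theta); architecture is illustrative.
dim = 2
vf = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.AdamW(vf.parameters(), lr=1e-3)

def cfm_step(x1):
    """One CFM training step: linear path x_t = (1-t)x0 + t x1, target u_t = x1 - x0."""
    x0 = torch.randn_like(x1)                      # prior (noise) sample
    t = torch.rand(x1.shape[0], 1)                 # t ~ U[0, 1]
    xt = (1 - t) * x0 + t * x1                     # interpolated point
    target = x1 - x0                               # conditional target vector field
    pred = vf(torch.cat([xt, t], dim=-1))          # v_t(x_t; theta)
    loss = ((pred - target) ** 2).sum(-1).mean()   # simple regression loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy "expert data": points on a ring.
angles = torch.rand(256, 1) * 2 * math.pi
x1 = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
print(cfm_step(x1))
```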

3.3. Technological Evolution

  1. Early Imitation Learning: Methods like Behavioral Cloning (BC) used simple regression or classification, often struggling with multi-modal and high-dimensional data.
  2. Generative Models in Robotics: Normalizing Flows were used to model complex action distributions but were slow to train.
  3. The Rise of Diffusion Models: Diffusion Policies (DP) emerged as a breakthrough, showing remarkable performance in learning complex, multi-modal visuomotor skills. Their main limitation became inference speed.
  4. Accelerating Diffusion: Consistency Models (CM) and Consistency Policies (CP) were developed to distill large diffusion models into faster ones, but at the cost of training complexity.
  5. The Flow Matching Alternative: Flow Matching (FM) was introduced as a generative modeling paradigm offering a better trade-off: simulation-free (easy) training and fast inference.
  6. This Paper's Position: This paper sits at the cutting edge. It adapts the FM paradigm for robotics, extends it to handle geometric constraints (RFMP), and introduces stability guarantees for robustness (SRFMP), positioning it as a direct and potentially superior competitor to the diffusion-based policy lineage.

3.4. Differentiation Analysis

| Feature | Diffusion Policy (DP) | Consistency Policy (CP) | RFMP / SRFMP (This Paper) |
| --- | --- | --- | --- |
| Core Generative Model | Diffusion models (SDE/DDPM) | Diffusion & consistency models | Flow matching (ODE/CNF) |
| Training Process | Single-stage, but complex score matching | Two-stage (train DP teacher, then distill student); complex and slow | Single-stage, simple regression loss; easy and fast |
| Inference Process | Iterative denoising (solving an SDE); slow | Direct mapping (distilled function); fast | ODE integration; fast, especially with SRFMP's 1-step trick |
| Handling Geometry | Not natively; requires post-processing (e.g., normalization), which is a naive fix | Same as DP | Native; the framework is defined on Riemannian manifolds, respecting data geometry by design |
| Robustness/Stability | No explicit guarantees | No explicit guarantees | RFMP can be unstable past $t=1$; SRFMP provides stability guarantees, ensuring convergence |

4. Methodology

This section details the technical core of the paper, explaining how RFMP and SRFMP are formulated and implemented.

4.1. Principles

The central idea is to frame visuomotor policy learning as a generative modeling problem using Riemannian Flow Matching. Instead of learning the complex, noisy process of diffusion models, the paper proposes learning a deterministic vector field that "flows" a simple noise distribution to the distribution of expert actions, conditioned on visual observations. The method is designed from the ground up to respect the geometry of robot action spaces. The stable version, SRFMP, further enhances this by ensuring the learned flow robustly converges to the target, preventing instability and enabling even faster inference.

4.2. Core Methodology In-depth

4.2.1. Riemannian Flow Matching Policy (RFMP)

RFMP adapts the RCFM framework for learning a visuomotor policy $\pi_{\theta}(\mathbf{a}|\mathbf{o})$.

1. Receding Horizon and Data Formulation To ensure smooth and temporally consistent actions, RFMP uses a receding horizon strategy. Instead of predicting a single action, the policy predicts a future sequence of actions.

  • Action Horizon Vector: The policy's output is a sequence $\mathbf{a} = [\mathbf{a}^s, \mathbf{a}^{s+1}, ..., \mathbf{a}^{s+T_p}]$, where $T_p$ is the prediction horizon.
  • Target Sample ($\mathbf{a}_1$): A sequence of expert actions sampled from the demonstration data, i.e., $\mathbf{a}_1 \sim q(\mathbf{a}_1)$.
  • Prior Sample ($\mathbf{a}_0$): A sequence of noise actions sampled from a base distribution $p_0$. To encourage smoothness, all actions in the sequence are identical: $\mathbf{a}_0 = [\mathbf{a}_b, \mathbf{a}_b, ..., \mathbf{a}_b]$, where $\mathbf{a}_b$ is sampled from a simple auxiliary distribution (e.g., Gaussian or uniform on the manifold).
  • Observation Vector ($\mathbf{o}$): To provide temporal context, the observation vector is not just the current state but a combination of recent observations: $\mathbf{o} = [\mathbf{o}^{s-1}, \mathbf{o}^c, s-c]$, where $\mathbf{o}^{s-1}$ is the previous observation and $\mathbf{o}^c$ is a context observation sampled from a window of size $T_o$.

2. The RFMP Loss Function The goal is to train a neural network $v_t(\mathbf{a}_t | \mathbf{o}; \boldsymbol{\theta})$ that approximates a target vector field $u_t(\mathbf{a}_t | \mathbf{a}_1)$. This is done by minimizing the following loss function: $$\ell_{\mathrm{RFMP}} = \mathbb{E}_{t, q(\mathbf{a}_1), p(\mathbf{a}_0)} \| v_t(\mathbf{a}_t | \mathbf{o}; \boldsymbol{\theta}) - u_t(\mathbf{a}_t | \mathbf{a}_1) \|_{g_{\mathbf{a}_t}}^2 \quad (11)$$ Let's break this down:

  • $\mathbb{E}_{t, q(\mathbf{a}_1), p(\mathbf{a}_0)}$: The expectation is taken over uniformly sampled time $t \in [0, 1]$, a target action sequence $\mathbf{a}_1$ from the expert data, and a prior noise sequence $\mathbf{a}_0$.
  • $v_t(\mathbf{a}_t | \mathbf{o}; \boldsymbol{\theta})$: This is the learned vector field, implemented as a neural network with parameters $\boldsymbol{\theta}$. It takes the interpolated action $\mathbf{a}_t$ and the observation vector $\mathbf{o}$ as input and outputs a tangent vector.
  • $u_t(\mathbf{a}_t | \mathbf{a}_1)$: This is the target vector field. It is simple and predefined. The paper uses a target field derived from a geodesic path.
  • $\mathbf{a}_t$: This is the action at time $t$ along the geodesic path from $\mathbf{a}_0$ to $\mathbf{a}_1$. It is computed using the exponential and logarithmic maps: $\mathbf{a}_t = \mathrm{Exp}_{\mathbf{a}_1}\big((1-t)\,\mathrm{Log}_{\mathbf{a}_1}(\mathbf{a}_0)\big) \quad (8)$. Here, $\mathrm{Log}_{\mathbf{a}_1}(\mathbf{a}_0)$ gives the tangent vector at $\mathbf{a}_1$ that points towards $\mathbf{a}_0$. Scaling it by $(1-t)$ and applying the exponential map $\mathrm{Exp}_{\mathbf{a}_1}$ yields $\mathbf{a}_0$ at $t=0$ and moves along the geodesic towards $\mathbf{a}_1$ as $t \to 1$.
  • $\| \cdot \|_{g_{\mathbf{a}_t}}^2$: This is the squared norm computed using the Riemannian metric at point $\mathbf{a}_t$. It measures the distance between the learned and target vectors in the tangent space at $\mathbf{a}_t$, correctly accounting for the manifold's geometry.
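
As an illustration of Eq. (8) and the regression target in Eq. (11), the sketch below builds one training pair $(\mathbf{a}_t, u_t)$ for a single action on the sphere. It re-uses the `sphere_exp` / `sphere_log` helpers from the earlier sketch; the constant-speed geodesic form of the target is one common choice and may differ in detail from the paper's exact derivation.

```python
import numpy as np
# Assumes sphere_exp / sphere_log from the earlier sketch are in scope.

def rfmp_training_pair(a0, a1, t):
    """Geodesic interpolation a_t (Eq. 8) and a conditional target field u_t on the sphere.

    a0: prior sample, a1: expert action, t in [0, 1).
    The target uses the constant-speed geodesic form u_t = Log_{a_t}(a1) / (1 - t),
    i.e. the time derivative of the interpolation (one common closed form).
    """
    at = sphere_exp(a1, (1.0 - t) * sphere_log(a1, a0))   # equals a0 at t=0, a1 at t=1
    ut = sphere_log(at, a1) / (1.0 - t)                   # tangent vector at a_t
    return at, ut

a0 = np.array([0.0, 1.0, 0.0])
a1 = np.array([0.0, 0.0, 1.0])
at, ut = rfmp_training_pair(a0, a1, t=0.3)
print(at, ut)  # regression target for v_t(a_t | o; theta) at this (t, a0, a1)
```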

3. Training and Inference (Algorithm 1)

  • Training: The training loop is simple. At each step, it samples $t$, $\mathbf{a}_0$, $\mathbf{a}_1$, and a corresponding $\mathbf{o}$. It computes $\mathbf{a}_t$ and the target vector field $u_t$ (which is the time derivative of Eq. 8), and then performs a gradient descent step on the loss in Eq. 11.
  • Inference: To generate an action sequence, the model:
    1. Samples an initial noise sequence $\mathbf{a}_0 \sim p_0$.

    2. Gets the current observation vector $\mathbf{o}$.

    3. Solves the ODE $\frac{d\mathbf{a}}{dt} = v_t(\mathbf{a} | \mathbf{o}; \boldsymbol{\theta})$ from $t=0$ to $t=1$ using a numerical solver (e.g., Euler's method). The projected Euler step is: $\boldsymbol{\xi}_{t+\Delta t} = \mathrm{Exp}_{\boldsymbol{\xi}_t}(v_{\boldsymbol{\theta}}(\boldsymbol{\xi}_t, \mathbf{o}) \Delta t)$.

    4. Executes the first few actions ($T_a$) of the generated sequence and repeats the process.

      Problem with RFMP: As shown in Figure 2, if the ODE is integrated beyond $t=1$, the flow can become unstable and move away from the target distribution, hurting performance.
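
A minimal sketch of the inference loop described above, assuming a learned vector field `vf(a, o, t)` and the sphere helpers from the earlier sketch; the tangent-space projection and the dummy field are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
# Assumes sphere_exp / sphere_log from the earlier sketch are in scope.

def rfmp_inference(vf, o, a0, n_steps=5):
    """Integrate da/dt = v_t(a | o) from t=0 to t=1 with projected Euler steps on the sphere."""
    a, dt = a0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        u = vf(a, o, t)                       # tangent vector predicted at a
        u = u - np.dot(u, a) * a              # project onto the tangent space at a (safety)
        a = sphere_exp(a, u * dt)             # geodesic step keeps a on the manifold
    return a

# Dummy vector field that always points towards a fixed target
# (illustration only; a trained field would also rescale with t).
target = np.array([0.0, 0.0, 1.0])
dummy_vf = lambda a, o, t: sphere_log(a, target)
a0 = np.array([1.0, 0.0, 0.0])
print(rfmp_inference(dummy_vf, o=None, a0=a0))
```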

4.2.2. Stable Riemannian Flow Matching Policy (SRFMP)

SRFMP is designed to fix the instability of RFMP by building a flow that is guaranteed to converge to the target distribution.

1. The Principle of Stability The method leverages LaSalle's invariance principle (Theorem 1). In simple terms, if you can define an "energy function" $H(\mathbf{x})$ that always decreases or stays constant as you move along the flow (i.e., its directional derivative $\mathcal{L}_u H \leq 0$), then the flow is guaranteed to converge to a set where this energy stops decreasing ($\mathcal{L}_u H = 0$). SRFMP designs its flow to do exactly this, with the target distribution being the low-energy region.

2. Augmenting the State Space LaSalle's principle applies to time-independent (autonomous) vector fields. However, the RFMP vector field is time-dependent. To resolve this, the state is augmented with a "pseudo-time" or "temperature" variable $\tau$.

  • Augmented State: $\boldsymbol{\xi} = [\mathbf{x}, \tau]$, where $\mathbf{x} \in \mathcal{M}$ is the action and $\tau \in \mathbb{R}$ is the temperature. The augmented state lives on the product manifold $\mathcal{M} \times \mathbb{R}$.

3. The SRFMP Formulation The core is to define an energy function $H$ and a vector field $u = -\nabla H$ that satisfies LaSalle's principle on the Riemannian manifold.

  • Energy Function ($H$): The paper defines a quadratic energy function based on the geodesic distance on the manifold: $$H(\boldsymbol{\xi} | \boldsymbol{\xi}_1) = \frac{1}{2} \mathrm{Log}_{\boldsymbol{\xi}_1}(\boldsymbol{\xi})^{\top} \mathbf{A}\, \mathrm{Log}_{\boldsymbol{\xi}_1}(\boldsymbol{\xi}) \quad (18)$$ Here, $\mathrm{Log}_{\boldsymbol{\xi}_1}(\boldsymbol{\xi})$ is the tangent vector at the target point $\boldsymbol{\xi}_1$ pointing to the current point $\boldsymbol{\xi}$. The energy is a weighted squared distance in the tangent space. $\mathbf{A}$ is a positive-definite matrix, chosen to be diagonal: $\mathbf{A} = \mathrm{diag}(\lambda_x \mathbf{I}, \lambda_{\tau})$.

  • Stable Target Vector Field ($u$): The stable vector field is the negative Riemannian gradient of $H$: $$u(\boldsymbol{\xi} | \boldsymbol{\xi}_1) = -\nabla_{\boldsymbol{\xi}} H(\boldsymbol{\xi} | \boldsymbol{\xi}_1)^{\top} = -\mathbf{A}\, \mathrm{Log}_{\boldsymbol{\xi}_1}(\boldsymbol{\xi}) \quad (19)$$ This can be split into two components: $$u(\boldsymbol{\xi}_t | \boldsymbol{\xi}_1) = \begin{bmatrix} u_{\mathbf{x}}(\mathbf{x}_t | \mathbf{x}_1) \\ u_{\tau}(\tau_t | \tau_1) \end{bmatrix} = \begin{bmatrix} -\lambda_x \mathrm{Log}_{\mathbf{x}_1}(\mathbf{x}_t) \\ -\lambda_{\tau}(\tau_t - \tau_1) \end{bmatrix} \quad (20)$$ The action component $u_{\mathbf{x}}$ pushes the point $\mathbf{x}_t$ towards the target $\mathbf{x}_1$ along a geodesic, and the temperature component $u_{\tau}$ pushes $\tau_t$ towards $\tau_1$.

  • Stable Flow ($\psi_t$): The flow generated by this vector field has a closed-form solution: $$\psi_t(\boldsymbol{\xi}_0 | \boldsymbol{\xi}_1) = \begin{bmatrix} \psi_t(\mathbf{x}_0 | \mathbf{x}_1) \\ \psi_t(\tau_0 | \tau_1) \end{bmatrix} = \begin{bmatrix} \mathrm{Exp}_{\mathbf{x}_1}\!\left(e^{-\lambda_x t}\, \mathrm{Log}_{\mathbf{x}_1}(\mathbf{x}_0)\right) \\ \tau_1 + e^{-\lambda_{\tau} t}(\tau_0 - \tau_1) \end{bmatrix} \quad (21)$$ This defines the interpolated point $\boldsymbol{\xi}_t$ used for training. The parameters $\lambda_x$ and $\lambda_{\tau}$ control the convergence speed of the action and temperature, respectively.

  • SRFMP Loss Function: The loss is analogous to RFMP's, but for the augmented, stable system: $$\ell_{\mathrm{SRFMP}} = \mathbb{E}_{t, q(\mathbf{a}_1), p(\mathbf{a}_0)} \| v(\boldsymbol{\xi}_t | \mathbf{o}; \boldsymbol{\theta}) - u(\boldsymbol{\xi}_t | \boldsymbol{\xi}_1) \|_{g_{\mathbf{a}_t}}^2 \quad (22)$$ The network $v(\boldsymbol{\xi}_t | \mathbf{o}; \boldsymbol{\theta})$ now learns the stable vector field $u$.
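
The sketch below implements the closed-form stable flow (Eq. 21) and the stable target field (Eq. 20) on the augmented state for a spherical action component, re-using the earlier sphere helpers. The values of $\lambda_x$ and $\lambda_{\tau}$ follow the paper's ablation; everything else is an illustrative assumption.

```python
import numpy as np
# Assumes sphere_exp / sphere_log from the earlier sketch are in scope.

lam_x, lam_tau = 2.5, 2.5   # convergence rates (the paper's ablation favours 2.5)

def srfmp_flow(x0, tau0, x1, tau1, t):
    """Closed-form stable flow psi_t on the augmented state (x, tau), Eq. (21)."""
    xt = sphere_exp(x1, np.exp(-lam_x * t) * sphere_log(x1, x0))
    taut = tau1 + np.exp(-lam_tau * t) * (tau0 - tau1)
    return xt, taut

def srfmp_target(xt, taut, x1, tau1):
    """Stable target field u = -A Log_{xi_1}(xi), split into components as in Eq. (20)."""
    ux = -lam_x * sphere_log(x1, xt)      # tangent vector at x1, mirroring Eq. (20)
    utau = -lam_tau * (taut - tau1)
    return ux, utau

x0, x1 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
xt, taut = srfmp_flow(x0, 0.0, x1, 1.0, t=0.5)
print(srfmp_target(xt, taut, x1, 1.0))   # regression targets for the SRFMP loss (Eq. 22)
```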

4. Accelerated Inference for SRFMP A key benefit of the stable formulation is the potential for single-step inference. The paper analyzes the discrete Euler integration step for the flow: $$\mathbf{x}_{t+1} \approx \mathrm{Exp}_{\mathbf{x}_t}(\lambda_x \mathrm{Log}_{\mathbf{x}_t}(\mathbf{x}_1) \Delta t) \quad (25)$$ The authors observe that if the learned vector field is perfect, setting the time step $\Delta t = 1/\lambda_x$ makes the flow converge to the target $\mathbf{x}_1$ in a single step. In practice, they use this large step for the first integration step to get close to the target quickly, and then use smaller steps for refinement. This significantly reduces the Number of Function Evaluations (NFE) needed for good performance.
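
A hedged sketch of this accelerated inference trick: the first Euler step uses $\Delta t = 1/\lambda_x$, which lands exactly on the target when the learned field is perfect (Eq. 25), followed by a few small refinement steps. Here `vf_x` is a stand-in for the learned action component of the vector field, not the paper's network.

```python
import numpy as np
# Assumes sphere_exp / sphere_log from the earlier sketch are in scope.

def srfmp_fast_inference(vf_x, o, x0, tau0, lam_x=2.5, refine_steps=2, dt_refine=0.1):
    """First Euler step with dt = 1/lam_x (ideally reaches x1 in one step), then refinements."""
    x, tau = x0, tau0
    x = sphere_exp(x, vf_x(x, tau, o) * (1.0 / lam_x))    # large first step (Eq. 25)
    for _ in range(refine_steps):
        x = sphere_exp(x, vf_x(x, tau, o) * dt_refine)    # small refinement steps
    return x

# "Perfect" field lam_x * Log_x(target), for illustration only.
target = np.array([0.0, 0.0, 1.0])
perfect_vf = lambda x, tau, o: 2.5 * sphere_log(x, target)
x0 = np.array([1.0, 0.0, 0.0])
print(srfmp_fast_inference(perfect_vf, o=None, x0=x0, tau0=0.0))  # lands on target after step 1
```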


5. Experimental Setup

The authors conducted a comprehensive set of experiments to validate RFMP and SRFMP.

5.1. Tasks and Datasets

The methods were evaluated on ten tasks across simulation and the real world, covering Euclidean and Riemannian action spaces.

  • Simulated Tasks:

    1. PUSH-T [2]: A 2D vision-based task where a circular agent must push a T-shaped block into a target zone. The action space is Euclidean ($\mathbb{R}^2$).
    2. SPHERE PUSH-T (New): A version of PUSH-T projected onto a sphere ($S^2$), creating a task with a Riemannian action space.
    3. Robomimic [10]: Five manipulation tasks (LIFT, CAN, SQUARE, TOOL HANG, TRANSPORT) from a large-scale benchmark. These tasks are evaluated with both state-based and vision-based observations. The action space is 7D Euclidean.
    4. Franka Kitchen [58]: A complex, long-horizon task requiring a sequence of four sub-tasks. The action space is a product of manifolds: $\mathbb{R}^3 \times S^3 \times \mathbb{R}^2$ (position, orientation as quaternion, gripper). This tests scalability to high-dimensional action spaces (1944-dimensional action sequence).
  • Real-World Tasks:

    1. PICK & PLACE: A Franka robot picks up a mug and places it on a plate. The orientation is mostly constant.

    2. MUG FLIPPING: The robot grasps a horizontally placed mug and places it upright, requiring complex orientation changes. The action space for both real-world tasks is $\mathbb{R}^3 \times S^3 \times \mathbb{R}^1$ (position, orientation, gripper).

      The following figure from the paper illustrates some of the robotic task setups.

      The figure is a schematic of the robotic task setups, showing the robot's actions in Euclidean and Riemannian spaces together with the corresponding sensor feedback, illustrating the effectiveness and robustness of the approach.

5.2. Implementation Details

  • Network Architecture: The vector field is parameterized by a UNet architecture, similar to the one used in Diffusion Policy [2]. For conditioning on observations and time, they use Feature-wise Linear Modulation (FiLM). For vision-based tasks, a ResNet-18 backbone processes the images.

  • Prior Distributions: For Euclidean tasks, a Gaussian prior is used. For Riemannian tasks (SPHERE PUSH-T, Franka Kitchen, real-world), they test a spherical uniform distribution and a wrapped Gaussian distribution. The figure below illustrates these priors on a sphere.

    Fig. 5. 2D visualization of prior distributions on the sphere: wrapped Gaussian distribution (left) and sphere uniform distribution (right). The wrapped Gaussian distribution is obtained by first sampling from a Euclidean Gaussian distribution on a tangent space of the sphere (the red points), followed by the exponential map, which projects the samples onto the sphere manifold $S^d$ (blue points). The spherical uniform distribution is computed by normalizing samples from a zero-mean Euclidean Gaussian distribution.

  • Hyperparameters: Models are trained using the AdamW optimizer. Key hyperparameters for SRFMP are the convergence rates $\lambda_x$ and $\lambda_{\tau}$, which were set to 2.5 based on an ablation study.

    The following are the results from Table I of the original paper, detailing the training configuration for each experiment.

    | Experiment | Image res. | Crop res. | RFMP VF # params | SRFMP VF # params | ResNet # params | Epochs | Batch size |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | PUSH-T tasks | | | | | | | |
    | Euclidean PUSH-T | 96 × 96 | 84 × 84 | 8.0 × 10^7 | 8.14 × 10^7 | 1.12 × 10^7 | 300 | 256 |
    | SPHERE PUSH-T | 100 × 100 | 84 × 84 | 8.0 × 10^7 | 8.14 × 10^7 | 1.12 × 10^7 | 300 | 256 |
    | State-based simulation tasks | | | | | | | |
    | Robomimic LIFT | N.A. | N.A. | 6.58 × 10^7 | 6.68 × 10^7 | N.A. | 50 | 256 |
    | Robomimic CAN | N.A. | N.A. | 6.58 × 10^7 | 6.68 × 10^7 | N.A. | 50 | 256 |
    | Robomimic SQUARE | N.A. | N.A. | 6.58 × 10^7 | 6.68 × 10^7 | N.A. | 50 | 512 |
    | Robomimic TOOL HANG | N.A. | N.A. | 6.58 × 10^7 | 6.68 × 10^7 | N.A. | 100 | 512 |
    | Franka Kitchen | N.A. | N.A. | 6.69 × 10^7 | 6.89 × 10^7 | N.A. | 500 | 128 |
    | Vision-based simulation tasks | | | | | | | |
    | Robomimic LIFT | 2 × 84 × 84 | 2 × 76 × 76 | 9.48 × 10^7 | 9.69 × 10^7 | 2 × 1.12 × 10^7 | 100 | 256 |
    | Robomimic CAN | 2 × 84 × 84 | 2 × 76 × 76 | 9.48 × 10^7 | 9.69 × 10^7 | 2 × 1.12 × 10^7 | 100 | 256 |
    | Robomimic SQUARE | 2 × 84 × 84 | 2 × 76 × 76 | 9.48 × 10^7 | 9.69 × 10^7 | 2 × 1.12 × 10^7 | 100 | 512 |
    | Robomimic TRANSPORT | 4 × 84 × 84 | 4 × 76 × 76 | 1.24 × 10^8 | 1.27 × 10^8 | 4 × 1.12 × 10^7 | 200 | 256 |
    | Real-world experiments | | | | | | | |
    | PICK & PLACE | 320 × 240 | 288 × 216 | 2.51 × 10^7 | 2.66 × 10^7 | 1.12 × 10^7 | 300 | 256 |
    | MUG FLIPPING | 320 × 240 | 256 × 192 | 2.51 × 10^7 | 2.66 × 10^7 | 1.12 × 10^7 | 300 | 256 |
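
For reference, the two Riemannian priors described above (Fig. 5) can be sampled as in the NumPy sketch below: a wrapped Gaussian via tangent-space sampling followed by the exponential map, and a spherical uniform via normalized Euclidean Gaussian samples. The parameter values are illustrative, and the helpers from the earlier sphere sketch are assumed.

```python
import numpy as np
# Assumes sphere_exp from the earlier sketch is in scope.

def wrapped_gaussian_prior(mu, sigma, n):
    """Sample in the tangent space at mu, then project onto the sphere with the exponential map."""
    d = mu.shape[0]
    samples = []
    for _ in range(n):
        v = np.random.randn(d) * sigma
        v = v - np.dot(v, mu) * mu          # keep the sample in the tangent space at mu
        samples.append(sphere_exp(mu, v))
    return np.stack(samples)

def sphere_uniform_prior(d, n):
    """Normalize zero-mean Euclidean Gaussian samples to obtain a uniform distribution on S^{d-1}."""
    x = np.random.randn(n, d)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

mu = np.array([0.0, 0.0, 1.0])
print(wrapped_gaussian_prior(mu, sigma=0.3, n=3))
print(sphere_uniform_prior(d=3, n=3))
```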

5.3. Evaluation Metrics

The policies are evaluated on three main criteria:

  1. Performance Score: This is a task-specific metric.

    • Conceptual Definition: It measures how successfully the robot completes its objective. For Push-T, it is the maximum coverage ratio of the T-block over the target area. For Robomimic, Franka Kitchen, and real-world tasks, it is the success rate (the percentage of trials where the task is fully completed).
    • Formula: Success Rate = $\frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}}$
  2. Inference Time (NFE):

    • Conceptual Definition: Number of Function Evaluations (NFE) is the number of times the learned vector field network $v_{\theta}$ is called to compute one action sequence. Since each call takes roughly the same amount of time, NFE is a direct, hardware-independent proxy for inference speed. Lower NFE means faster inference.
  3. Trajectory Smoothness (Jerkiness):

    • Conceptual Definition: Jerk is the rate of change of acceleration, measuring how smoothly the robot's motion changes. High jerkiness implies sudden, jerky movements, which are undesirable as they can cause wear on the robot and are less safe. Lower jerkiness is better.
    • Mathematical Formula: Jerk is the third time-derivative of position, $j(t) = \frac{d^3 p(t)}{dt^3}$. The paper reports the average magnitude of jerk over a trajectory.
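
A simple NumPy sketch of this metric via finite differences is shown below; the paper's exact smoothing and normalization may differ.

```python
import numpy as np

def mean_jerk(positions, dt):
    """Average jerk magnitude of a trajectory: third finite difference of position over time."""
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return np.mean(np.linalg.norm(jerk, axis=1))

# Smooth trajectory (low jerk) vs. the same trajectory with added noise (high jerk).
t = np.linspace(0, 1, 100)[:, None]
smooth = np.hstack([np.sin(np.pi * t), np.cos(np.pi * t), t])
noisy = smooth + 0.01 * np.random.randn(*smooth.shape)
print(mean_jerk(smooth, dt=0.01), mean_jerk(noisy, dt=0.01))
```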

5.4. Baselines

The proposed methods are compared against strong, representative baselines:

  • Diffusion Policy (DP) [2]: The state-of-the-art diffusion-based method. The paper uses the faster DDIM sampler for a stronger baseline. For tasks with Riemannian action spaces, DP's outputs are naively projected onto the manifold (e.g., normalizing quaternions), which is a common but sub-optimal practice.

  • Consistency Policy (CP) [3]: A state-of-the-art method for accelerating diffusion policies via distillation. This baseline is used to check if RFMP/SRFMP's speed benefits come at a performance cost compared to dedicated acceleration techniques.


6. Results & Analysis

6.1. Core Results Analysis

The experimental results consistently demonstrate the superiority of RFMP and SRFMP over diffusion-based methods in terms of efficiency and robustness.

6.1.1. Push-T Tasks: Euclidean and Sphere

On the simple Euclidean PUSH-T task, SRFMP shows exceptional performance even with a single function evaluation (NFE=1), achieving a score of 0.875, whereas DP completely fails (0.109). DP requires 10 NFE to catch up. This highlights the fast inference capability of the flow matching approach.

The following are the results from Table III of the original paper:

| Policy | NFE = 1 | NFE = 3 | NFE = 5 | NFE = 10 |
| --- | --- | --- | --- | --- |
| RFMP | 0.848 | 0.855 | 0.923 | 0.891 |
| SRFMP | 0.875 | 0.851 | 0.837 | 0.856 |
| DP | 0.109 | 0.79 | 0.838 | 0.862 |

On the Sphere PUSH-T task, which has a Riemannian action space, RFMP and SRFMP with a geometrically appropriate prior (spherical uniform) significantly outperform DP. DP performs poorly with geometric priors and only achieves good performance when using a standard Euclidean Gaussian prior followed by a naive normalization step. This shows the advantage of RFMP/SRFMP's native handling of geometry.

6.1.2. Stability and Robustness Analysis

The paper provides a clear demonstration of SRFMP's stability. Figure 1 and Figure 3 show that while the flow of RFMP can diverge after $t=1$, the flow of SRFMP converges and remains stable on the target distribution.

Fig. 1. Flows of the RFMP (top) and SRFMP (bottom) at times $t = \{0.0, 1.0, 1.5\}$. The policies are learned from pick-and-place demonstrations (black) and conditioned on visual observations. Note that the flow of SRFMP is stable to the target distribution at $t > 1.0$, enhancing the policy robustness and inference time.

The following experiment (Table VII) quantifies this. When the integration time is extended beyond $t=1$, RFMP's performance degrades sharply, while SRFMP's performance remains robustly high. This stability makes SRFMP more reliable and less sensitive to the choice of integration horizon.

The following are the results from Table VII of the original paper:

| Euclidean PUSH-T | t = 1.0 | t = 1.2 | t = 1.6 |
| --- | --- | --- | --- |
| RFMP | 0.855 | 0.492 | 0.191 |
| SRFMP | 0.862 | 0.851 | 0.829 |

| Sphere PUSH-T | t = 1.0 | t = 1.2 | t = 1.6 |
| --- | --- | --- | --- |
| RFMP | 0.736 | 0.574 | 0.264 |
| SRFMP | 0.727 | 0.736 | 0.685 |

6.1.3. Robomimic Benchmark: Training and Inference Efficiency

On the Robomimic benchmark, the results show two key advantages:

  1. Training Efficiency: As shown in the learning curves below (Figure 8), RFMP and SRFMP achieve high success rates with far fewer training epochs than DP. For example, on the LIFT and CAN tasks, they reach near-perfect performance in just 20 epochs, while DP is still struggling after 50.

    Fig. 8. Success rate (mean and standard deviation) on Robomimic tasks with state-based observations at different checkpoints. The model performance on the LIFT, CAN, and SQUARE tasks is checked every 10 epochs throughout the 50-epoch training process using 3 NFE. For the TOOL HANG task, the models are trained over 100 epochs and checked every 20 epochs using 10 NFE.

  2. Inference Efficiency: The results on vision-based tasks (Table XI) are particularly telling. RFMP and SRFMP consistently outperform both DP and the accelerated CP. For instance, on the SQUARE task, SRFMP with just 1 NFE (86% success) is already on par with CP (82-84%) and vastly better than DP (0%). Furthermore, RFMP/SRFMP training is much simpler and faster per epoch than CP's two-stage distillation process (Table X).

    The following are the results from Table XI of the original paper:

    | Task | Policy | NFE = 1 | NFE = 2 | NFE = 3 | NFE = 5 | NFE = 10 |
    | --- | --- | --- | --- | --- | --- | --- |
    | LIFT | RFMP | 1 | 1 | 1 | 1 | 1 |
    | LIFT | SRFMP | 1 | 1 | 1 | 1 | 1 |
    | LIFT | DP | 0 | 0.7 | 0.96 | 0.98 | 0.98 |
    | LIFT | CP | 0.5 | 0.44 | 0.46 | 0.46 | 0.4 |
    | CAN | RFMP | 0.78 | 0.82 | 0.9 | 0.96 | 0.94 |
    | CAN | SRFMP | 0.88 | 0.88 | 0.9 | 0.9 | 0.86 |
    | CAN | DP | 0 | 0.38 | 0.66 | 0.68 | 0.66 |
    | CAN | CP | 0.82 | 0.82 | 0.84 | 0.84 | 0.82 |
    | SQUARE | RFMP | 0.56 | 0.74 | 0.9 | 0.9 | 0.9 |
    | SQUARE | SRFMP | 0.86 | 0.82 | 0.9 | 0.88 | 0.9 |
    | SQUARE | DP | 0 | 0.04 | 0.16 | 0.26 | 0.12 |
    | SQUARE | CP | 0.32 | 0.52 | 0.54 | 0.48 | 0.52 |
    | TRANSPORT | RFMP | 0.6 | 0.82 | 0.78 | 0.78 | 0.8 |
    | TRANSPORT | SRFMP | 0.8 | 0.82 | 0.82 | 0.78 | 0.78 |
    | TRANSPORT | DP | 0 | 0.54 | 0.66 | 0.78 | 0.82 |
    | TRANSPORT | CP | 0.4 | 0.36 | 0.44 | 0.38 | 0.36 |

6.1.4. Franka Kitchen: Long-Horizon Task

The Franka Kitchen benchmark is the most striking result. On this complex, long-horizon task, SRFMP achieves a 100% success rate across all NFE values. In stark contrast, RFMP and DP almost completely fail. This suggests that for tasks requiring high precision over many steps, the stability and precision of SRFMP's flow are not just beneficial but essential.

The following are the results from Table XII of the original paper:

| Policy | NFE = 1 | NFE = 2 | NFE = 3 | NFE = 5 | NFE = 10 |
| --- | --- | --- | --- | --- | --- |
| RFMP | 0.04 | 0.08 | 0.12 | 0.14 | 0.14 |
| SRFMP | 1 | 1 | 1 | 1 | 1 |
| DP | 0 | 0 | 0 | 0.02 | 0.02 |

6.1.5. Real-World Experiments

The results on the real Franka robot mirror the simulation findings. On both PICK & PLACE and the more complex MUG FLIPPING tasks, RFMP and SRFMP achieve high success rates and smooth trajectories with very few NFE (as low as 2). DP, on the other hand, requires more NFE to become successful and produces much jerkier, less smooth motions at low NFE.

The following figure (Figure 12) summarizes the success rate and jerkiness on the real-world tasks.

Fig. 12. Success rate and predicted action jerkiness as a function of NFE on the PICK & PLACE and MUG FLIPPING tasks. The top row shows the success rate (left) and jerkiness (right) for PICK & PLACE; the bottom row shows the same for MUG FLIPPING.

6.2. Ablation Studies / Parameter Analysis

The paper includes an extensive ablation study on the PUSH-T task (Tables IV and V) to analyze the effect of key hyperparameters.

  • Observation Horizon ($T_o$): A short horizon ($T_o = 2$) works best. This is intuitive, as older visual information is less relevant for immediate action.

  • Prediction Horizon ($T_p$): A moderate horizon ($T_p = 16$) provides a good balance between temporal smoothness and responsiveness to new observations.

  • SRFMP Stability Parameters ($\lambda_x, \lambda_{\tau}$): The best performance for SRFMP was found when $\lambda_x = \lambda_{\tau} = 2.5$. This makes the convergence rates of the spatial and temperature components equal, leading to a flow that resembles an optimal transport map, which promotes fast convergence.


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Riemannian Flow Matching Policy (RFMP) and its robust variant, Stable RFMP (SRFMP), as powerful alternatives to diffusion-based models for visuomotor policy learning. The core findings are:

  • Efficiency: RFMP and SRFMP offer a superior trade-off between performance and efficiency, achieving fast inference (low NFE) and simple, efficient training compared to Diffusion Policies (DP) and Consistency Policies (CP).
  • Geometric Awareness: By building on Riemannian Flow Matching, the framework naturally handles geometric constraints in robot action spaces, a significant advantage over methods that treat all data as Euclidean.
  • Robustness: The proposed SRFMP, which incorporates stability guarantees via LaSalle's invariance principle, demonstrates remarkable robustness. It is less sensitive to integration parameters and excels in long-horizon, high-precision tasks where other methods fail. The comprehensive evaluations on 10 different tasks strongly support the claim that flow matching policies are a highly promising direction for learning complex robot behaviors.

7.2. Limitations & Future Work

The authors identify two main directions for future work:

  1. Exploring Equivariant Structures: The current policy network does not have built-in symmetries. Incorporating equivariant architectures (e.g., to handle changes in scale, rotation, and translation) could improve data efficiency and generalization, requiring fewer demonstrations.
  2. Multi-modal Perception: The current vision backbone is standard. Investigating more advanced, multi-modal perception backbones could enable the framework to tackle more complex, contact-rich manipulation tasks where information from touch or sound might be crucial.

7.3. Personal Insights & Critique

This paper is a strong piece of engineering and theoretical work with significant practical implications for robotics.

  • Positive Insights:

    • The paper's main strength is that it doesn't just propose an incremental improvement but presents a compelling shift in paradigm from diffusion to flow matching for policy learning. It elegantly solves multiple problems at once: inference speed, training complexity, and geometric correctness.
    • The introduction of stability via SRFMP is a standout contribution. The dramatic performance gap on the Franka Kitchen benchmark is a powerful demonstration of why theoretical guarantees like stability matter in practice, especially as tasks become more complex.
    • The idea of importing principles from classical control theory (LaSalle's invariance) into modern deep generative models is a very fruitful research direction. It bridges the gap between the rigorous, theory-driven world of control and the powerful, data-driven world of deep learning.
  • Potential Issues and Areas for Improvement:

    • Hyperparameter Sensitivity: While SRFMP performs exceptionally well, its performance seems tied to the choice of the stability parameters $\lambda_x$ and $\lambda_{\tau}$. The paper finds good values via an ablation study, but it's unclear how these would be chosen for a completely new task. Future work could investigate adaptive or learned schedules for these parameters.

    • Scalability to More Complex Manifolds: The paper demonstrates success on spheres ($S^2$, $S^3$) and products thereof. While powerful, it would be interesting to see how the framework scales to more complex manifolds that may lack closed-form geodesics, exponential, or logarithmic maps, as this would require numerical approximations that could impact performance and speed.

    • Comparison with Other Fast Samplers: The primary competitor is DP with DDIM. However, the field of fast diffusion sampling is evolving rapidly with methods like DPM-Solver, PF-ODE, etc. While FM is fundamentally different, a comparison against the absolute latest and fastest diffusion samplers would further strengthen the claims.

      Overall, this paper provides a clear, well-argued, and empirically validated case for using (Stable) Riemannian Flow Matching as the new backbone for learning high-performance visuomotor policies. It is a significant step towards creating robot learning algorithms that are not only powerful but also efficient and reliable enough for real-world deployment.
