Paper status: completed

Robot (Imitation) Learning

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work introduces behavioral cloning for robot imitation learning, leveraging offline multimodal expert demonstrations without reward design, enabling safe, direct observation-to-action mapping and overcoming real-world reinforcement learning challenges.

Abstract

Figure 18 | (A) Average (with standard deviation) evolution of the actuation levels over the first 5 recorded episodes in lerobot/svla_so101_pickplace. Proprioceptive states prove invaluable in determining the robot's state during an episode. (B) Camera frames are also recorded alongside measurements of the robot's state, capturing information about the robot's interaction with its environment.

4 Robot (Imitation) Learning

"The best material model of a cat is another, or preferably the same, cat." (Norbert Wiener)

TL;DR: Behavioral Cloning provides a natural platform to learn from real-world interactions without the need to design any reward function, and generative models prove more effective than point-wise policies at dealing with multimodal demonstration datasets.

Learning from human demonstrations provides a pragmatic alternative to the RL pipeline discussed in Section 3. Indeed, especially in real-world robotics, online exploration is typically costly and potentially unsafe, and designing (dense) reward signals is a brittle and task-specific process. Further, even success detection itself often requires bespoke instrumentation, while episodic training demands reliable resets.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Robot (Imitation) Learning

1.2. Authors

The provided text is an excerpt from a research paper or tutorial chapter. The authors are not explicitly listed within this specific excerpt.

1.3. Journal/Conference

The provided text appears to be a chapter from a tutorial or textbook, indicated by internal cross-references (e.g., the mention of "Section 3" and the phrase "This tutorial has charted...") and the general pedagogical tone. The specific journal or conference where this paper was originally published is not mentioned in the excerpt. Given the depth and structure, it could be a chapter in a comprehensive robotics learning guide.

1.4. Publication Year

The publication year is not explicitly stated in the provided excerpt. However, references within the text (e.g., Zhao et al. (2023), Chi et al. (2024), Shukor et al. (2025), Black et al. (2024), Lipman et al. (2023)) suggest that the content is highly current, likely published in late 2024 or early 2025, or reflecting research up to that period.

1.5. Abstract

The paper's abstract introduces Behavioral Cloning (BC) as a natural platform for learning from real-world robot interactions, eliminating the need for reward function design. It highlights that generative models are more effective than point-wise policies in handling multimodal demonstration datasets. The abstract positions learning from human demonstrations as a practical alternative to reinforcement learning (RL) for robotics, especially where exploration is costly and unsafe, and reward design is brittle and task-specific. BC frames control as an imitation learning problem, using offline, reward-free expert trajectories of variable lengths that may contain multiple strategies for the same goal. It learns a mapping from observations (images, proprioceptive information) to expert actions, thereby avoiding rewards and extensive exploration. Training supervised models on this non-i.i.d. sequential data enables offline learning, which mitigates exploration risks and eliminates handcrafted reward shaping, making BC scalable for hardware robot learning.

The PDF is available at /files/papers/6907680a971e575bdfc172d9/paper.pdf. This direct link suggests the document is either an officially published chapter/paper or a preprint hosted by an academic repository.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inherent difficulty and unsustainability of deploying Reinforcement Learning (RL) for real-world robotics. RL, while powerful, suffers from several critical drawbacks in practical robot applications:

  1. Costly and Unsafe Exploration: Robots exploring in the real world can damage themselves, their environment, or harm humans, making extensive exploration impractical and dangerous.

  2. Brittle and Task-Specific Reward Design: Crafting effective reward functions for complex robotic tasks is challenging, often requiring intricate hand-tuning that does not generalize well to different tasks or environments.

  3. Reliable Resets: Episodic training in RL demands reliable resets of the environment, which is often difficult or impossible to achieve on physical hardware.

    These challenges highlight a significant gap in scaling robot learning to hardware. The paper's entry point and innovative idea is to embrace Imitation Learning (IL), specifically Behavioral Cloning (BC), as a pragmatic and safer alternative. BC leverages existing expert demonstrations (e.g., human teleoperation data) to teach robots desired behaviors, bypassing the need for reward functions and risky exploration. The innovation further extends to using generative models to handle the multimodal nature of human demonstrations, where a single goal might be achieved through multiple, distinct strategies.

2.2. Main Contributions / Findings

The paper provides a comprehensive overview of the evolution and advancements in robot imitation learning, from basic Behavioral Cloning to sophisticated generalist robot policies. Its primary contributions and key findings include:

  1. Advocacy for Behavioral Cloning: It establishes BC as a natural and effective approach for robot learning, emphasizing its ability to learn from real-world interactions without reward design, thus limiting exploration risks and obviating handcrafted reward shaping.

  2. Generative Models for Multimodality: It highlights the superiority of generative models (such as Variational Auto-Encoders (VAEs), Diffusion Models (DMs), and Flow Matching (FM)) over point-wise policies in handling multimodal demonstration datasets. These models capture the diverse strategies present in human demonstrations, leading to more robust and flexible robot behaviors.

  3. Introduction of Advanced BC Methods:

    • Action Chunking with Transformers (ACT): Presents ACT as a method leveraging CVAEs and Transformers to learn sequences of actions (action chunks), inspired by human planning, which significantly improves the robot's ability to perform complex, multimodal behaviors.
    • Diffusion Policy (DP): Introduces DP, which applies Diffusion Models to imitation learning, demonstrating its effectiveness in predicting action chunks by conditioning a noise regressor on a history of observations. DP is noted for its data efficiency and training stability.
  4. Optimized Inference Strategies: The paper describes asynchronous inference techniques that decouple action chunk prediction from action execution, enabling responsive and efficient deployment of complex policies on real hardware, even with computational latency.

  5. Generalist Robot Policies (VLAs): It charts the paradigm shift towards generalist, multitask policies, exemplified by Vision-Language-Action (VLA) models like $\pi_0$ and SmolVLA. These models aim to operate across embodiments and tasks, guided by natural language instructions, by integrating Vision-Language Models (VLMs) with action experts trained via flow matching on large, diverse datasets.

  6. Emphasis on Open-Source Robotics: The paper underscores the growing importance of open-source efforts and decentralized datasets (e.g., SmolVLA) to democratize access to foundational robotics models, addressing the accessibility gap highlighted by proprietary models.

    The paper's findings collectively illustrate a clear trajectory in robot learning towards more adaptive, robust, and generalizable autonomous systems, moving beyond task-specific limitations through advanced imitation learning techniques and multimodal generative models.

3. Prerequisite Knowledge & Related Work

This section provides the foundational knowledge and context of previous works necessary to understand the advanced robot imitation learning techniques discussed in the paper.

3.1. Foundational Concepts

3.1.1. Behavioral Cloning (BC)

Behavioral Cloning (BC) is a straightforward yet powerful imitation learning technique where a robot learns to mimic the actions of an expert (typically a human) by observing demonstrations. It casts the problem as a supervised learning task: given pairs of observations (what the expert saw) and actions (what the expert did), the goal is to train a policy (a mapping function) that can predict the correct action for a given observation.

  • Observations ($o$, or state): These are the inputs the robot receives from its environment. In robotics, observations can be images from cameras (visual information) and proprioceptive information (e.g., joint angles, velocities, forces from the robot's own sensors).
  • Actions ($a$): These are the commands or controls the robot executes in the environment. They can be joint torques, joint position targets, end-effector poses, or even high-level discrete commands.
  • Expert Demonstrations ($\mathcal{D}$): A dataset consisting of sequences of observation-action pairs collected from a skilled human operator or another proficient system. These demonstrations are typically reward-free, meaning they do not explicitly contain reward signals (unlike in reinforcement learning).
  • Reward Function: In Reinforcement Learning (RL), a reward function defines the goal of a task by assigning numerical values to state-action transitions. Positive rewards encourage desired behaviors, while negative rewards (penalties) discourage undesired ones. BC bypasses the need for designing these often brittle and task-specific functions.
  • Offline Learning: Offline learning refers to training a policy entirely on a pre-collected, fixed dataset of demonstrations, without any further interaction or exploration in the real environment during training. This is crucial for robotics as it limits exploration risks (prevents the robot from performing dangerous actions).
  • Non-i.i.d. Sequential Data: i.i.d. stands for independent and identically distributed. Data is i.i.d. if each sample is drawn independently from the same underlying probability distribution. In expert demonstrations, sequential data (trajectories over time) is non-i.i.d. because actions at one timestep depend on previous observations and actions, and the sequence itself has temporal correlations. This violates the i.i.d. assumption often made by standard supervised learning algorithms, posing challenges like covariate shift.

3.1.2. Reinforcement Learning (RL)

Reinforcement Learning (RL) is an area of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, exploring the environment and receiving feedback in the form of rewards. It involves concepts like states, actions, rewards, policies, and value functions.

  • Exploration: The process by which an RL agent tries new, potentially suboptimal actions to discover better strategies for maximizing reward. In robotics, this can be costly and unsafe.

3.1.3. Generative Models (GMs)

Generative Models (GMs) are a class of statistical models that learn the underlying probability distribution of a dataset, p(x), and can then generate new samples that resemble the training data. Unlike discriminative models which learn to map inputs to outputs (e.g., classify an image), generative models learn to capture the intrinsic structure of the data. In imitation learning, GMs can learn the complex joint distribution of observations and actions, p(o, a), which is particularly useful for multimodal demonstrations.

  • Multimodal Demonstrations/Strategies: In robotics, a multimodal demonstration refers to scenarios where there are multiple, distinct ways or sequences of actions to achieve the same goal. For example, grasping an object can be done from different angles or with different hand configurations. Point-wise policies (traditional BC) often struggle with this by averaging across modes, leading to indecisive or suboptimal actions. GMs, by learning the full distribution, can represent and sample from these different modes.

3.1.4. Variational Auto-Encoders (VAEs)

Variational Auto-Encoders (VAEs) are a type of generative model that learn a latent variable model of the data. They consist of two main components:

  • Encoder ($q(z|x)$): An inference network that maps an input data point $x$ (e.g., an (o, a) pair) to a probability distribution in a lower-dimensional latent space $Z$. Instead of outputting a single point, it typically outputs the parameters (mean and variance) of a Gaussian distribution, from which a latent variable $z$ is sampled.
  • Decoder ($p(x|z)$): A generative network that maps a latent variable $z$ back to the original data space, attempting to reconstruct the input $x$.
  • Latent Variable ($z$): A hidden or unobserved variable that influences the observed data. In VAEs, $z$ captures compressed, meaningful representations of the input data.
  • Reconstruction Loss ($\mathbf{L}^{\mathrm{rec}}$): Measures how well the decoder can reconstruct the original input from the latent variable. Typically a mean squared error (MSE) or binary cross-entropy (BCE).
  • KL Divergence ($\mathrm{D}_{\mathrm{KL}}$): Kullback-Leibler divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. In VAEs, it regularizes the latent space by forcing the encoder's output distribution $q(z|x)$ to be close to a simple prior distribution $p(z)$ (often a standard Gaussian). This prevents the encoder from encoding too much information into $z$ and ensures the latent space is smooth and continuous.
  • Evidence Lower Bound (ELBO): The objective function that VAEs optimize. It is a lower bound on the true log-likelihood of the data. Maximizing the ELBO simultaneously optimizes reconstruction and regularization of the latent space.

3.1.5. Diffusion Models (DMs)

Diffusion Models (DMs) are another class of generative models that learn to generate data by reversing a gradual noise-adding process.

  • Forward Diffusion Process: In this process, Gaussian noise is gradually added to a data sample $x_0$ over several timesteps $T$, creating a sequence of noisy samples $x_1, x_2, \ldots, x_T$, where $x_T$ is almost pure noise. This is typically a Markov chain, meaning each noisy state $x_t$ only depends on the previous state $x_{t-1}$.
  • Reverse Denoising Process: The Diffusion Model learns to reverse this process, starting from pure noise and iteratively removing noise to generate a clean data sample. This is done by training a noise regressor (often a U-Net) to predict the noise added at each step, or the original data sample itself.
  • Denoising: The process of removing noise from a noisy input to recover a cleaner signal or data sample.
  • Gaussian Noise: Random noise drawn from a Gaussian (normal) distribution.

3.1.6. Flow Matching (FM)

Flow Matching (FM) is a recent generative modeling technique that provides a continuous-time, deterministic alternative to Diffusion Models. Instead of learning a discrete denoising process, FM learns a vector field that transports samples from a simple prior distribution (e.g., standard Gaussian) to a complex target data distribution along a continuous path.

  • Vector Field: A mathematical construct that assigns a vector to each point in space. In FM, the vector field describes the direction and magnitude of movement for data points as they transform from the prior to the data distribution.
  • Continuous Transformations: Unlike DMs that involve discrete steps, FM models the transformation as a continuous flow over time, typically from $t=0$ to $t=1$.
  • Deterministic Trajectories: At inference time, FM can generate samples by integrating the learned vector field along a deterministic path, which can be more efficient than the stochastic paths typically found in DMs.

3.1.7. Action Chunking

Action chunking refers to the strategy of predicting not just a single action at a given timestep, but a sequence or chunk of future actions $(a_t, a_{t+1}, \ldots, a_{t+H_a-1})$. This is motivated by how humans plan, often thinking several steps ahead.

  • Benefits: Can simplify learning long-horizon tasks, better capture temporal dependencies in actions, and potentially make policies more robust to small errors by correcting within the chunk.
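
As a purely illustrative sketch (not from the paper), the snippet below executes a predicted chunk of actions open-loop from a queue and only queries the policy again once the chunk is exhausted; the `policy` callable, the Gym-style `env`, and the chunk size are all assumptions made for the example.

```python
from collections import deque

import numpy as np


def run_with_action_chunks(policy, env, horizon=200, chunk_size=8):
    """Execute actions chunk-by-chunk: query the policy only when the current
    chunk is exhausted, instead of once per timestep.

    Assumes `policy(obs)` returns an array of shape (chunk_size, action_dim)
    and `env` follows the Gymnasium reset()/step() interface.
    """
    obs, _ = env.reset()
    queue = deque()
    for _ in range(horizon):
        if not queue:                          # chunk exhausted: replan
            chunk = policy(obs)                # (chunk_size, action_dim)
            queue.extend(np.asarray(chunk))
        action = queue.popleft()               # execute the next action of the chunk
        obs, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            break
```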

3.1.8. Transformers

Transformers are a neural network architecture primarily known for their success in Natural Language Processing (NLP) but now widely used in Computer Vision (CV) and robotics. Their key innovation is the self-attention mechanism.

  • Self-Attention: A mechanism that allows a model to weigh the importance of different parts of the input sequence when processing a specific element. For example, when encoding a word, it can decide how much to "attend" to other words in the sentence.
  • Encoder-Decoder Architecture: A common Transformer structure where an encoder processes the input sequence (e.g., observations) and a decoder generates the output sequence (e.g., action chunks). Cross-attention mechanisms allow the decoder to attend to the encoder's output.

3.1.9. Vision-Language Models (VLMs) and Vision-Language-Action (VLA) Models

  • Vision-Language Models (VLMs): Large neural networks trained on vast datasets of paired images and text. They learn to understand and relate information across both visual and linguistic modalities. This enables them to perform tasks like image captioning, visual question answering (VQA), and image generation from text prompts.
  • Vision-Language-Action (VLA) Models: Extend VLMs by incorporating the ability to generate actions for robots. They take visual observations and natural language instructions as input and output robot actions. VLAs aim to create generalist policies that can perform a wide range of tasks and adapt to new scenarios based on high-level human commands.
  • Generalist Policies: Robot policies that can perform a wide variety of tasks across different environments and robot embodiments (e.g., different robot arms, mobile bases). They are often guided by unstructured instructions like natural language.
  • Embodiment: Refers to the physical form and characteristics of a robot (e.g., number of arms, type of grippers, mobile base). Cross-embodiment capabilities mean a policy can work with different robot hardware configurations.

3.1.10. Covariate Shift

Covariate shift occurs when the distribution of input data (observations) in the training set differs from the distribution of input data encountered during deployment or testing. In Behavioral Cloning, if the learned policy makes a small error and drives the robot into a state that was not present in the expert demonstrations, subsequent predictions might be increasingly erroneous because the policy has never seen observations from that region of the state space. This can lead to compounding errors and instability.

3.2. Previous Works

The paper builds upon a rich history of imitation learning and generative models. Key prior works mentioned or implicitly foundational include:

  • Behavioral Cloning (Pomerleau, 1988): The foundational work on Behavioral Cloning, demonstrating the feasibility of training neural networks to mimic human driving behavior in autonomous vehicles. This established the core idea of supervised learning from expert demonstrations.
  • DAgger (Dataset Aggregation): A method designed to mitigate covariate shift in Behavioral Cloning. DAgger iteratively collects new expert demonstrations in states visited by the learned policy, aggregates them with the original dataset, and retrains the policy. While effective, the paper notes it "falls out of our scope" when a fixed offline dataset is used and no more data can be collected.
  • Variational Auto-Encoders (Kingma and Welling, 2013): This seminal work introduced VAEs, providing a principled framework for variational inference and generative modeling using latent variables and optimizing the Evidence Lower Bound (ELBO). This enabled BC to learn multimodal distributions more effectively than point-wise policies.
  • Conditional VAEs (Sohn et al., 2015): Extended VAEs to allow conditioning on additional input information. This is crucial for imitation learning, where the generated actions ($a$) are conditioned on observations ($o$) and latent variables ($z$), effectively learning $p(a|o, z)$.
  • Diffusion Models (Ho et al., 2020): Introduced a novel class of generative models that achieve high-quality sample generation by learning to reverse a Markov chain of Gaussian noise addition. This work provided the foundation for Diffusion Policy.
  • Flow Matching (Lipman et al., 2023): Presented Flow Matching as a method for generative modeling that learns continuous transformations via vector fields, offering a deterministic and potentially more efficient alternative to Diffusion Models for sampling.
  • Action Chunking with Transformers (ACT) (Zhao et al., 2023): This work specifically applies CVAEs and Transformer architectures to imitation learning for robotics, emphasizing action chunking for improved performance and more natural human-like planning.
  • Diffusion Policy (Chi et al., 2024): Applied Diffusion Models to robot imitation learning, demonstrating their effectiveness in generating action chunks conditioned on past observations.
  • ALOHA (A Low-cost, Open-source Hardware System for Bimanual Teleoperation) (Zhao et al., 2023, the same work that introduced ACT): A practical contribution enabling low-cost data collection, highlighted for its role in making robot learning more accessible.
  • PaliGemma, Llama2-7B, SmolVLM-2: These large language models (LLMs) and Vision-Language Models (VLMs) are mentioned as backbones for the generalist policies like $\pi_0$ and SmolVLA, leveraging their pre-trained knowledge.

3.3. Technological Evolution

The field of robot learning has undergone a significant evolution, as detailed in the paper:

  1. Traditional Model-Based Control: Initially, robotics relied heavily on dynamic models and handcrafted control laws. While precise, these methods were brittle and required extensive engineering for each specific task and environment, making them difficult to generalize or scale.

  2. Early Behavioral Cloning: The advent of Behavioral Cloning (BC) provided a more data-driven approach, using supervised learning to directly map observations to actions from expert demonstrations. This mitigated the need for explicit dynamic models but faced challenges with covariate shift and multimodal behaviors.

  3. Reinforcement Learning (RL): RL offered a powerful framework for learning optimal policies through interaction and rewards. However, its application in real-world robotics was hampered by the cost and safety concerns of exploration, and the difficulty of reward design. Efforts like HIL-SERL (mentioned briefly in conclusions) aimed to make RL more feasible by integrating human guidance.

  4. Generative Models for Multimodality: To address the multimodal nature of human demonstrations (where multiple valid strategies exist for a single task), point-wise BC policies evolved to incorporate generative models like VAEs and later Diffusion Models and Flow Matching. These models could learn and represent the full distribution of expert actions, enabling more flexible and robust behaviors.

  5. Action Chunking and Transformer Architectures: Action chunking (predicting sequences of actions) combined with powerful Transformer architectures (e.g., ACT, Diffusion Policy) further improved the performance of imitation learning, allowing robots to learn complex, temporally extended skills.

  6. Generalist Vision-Language-Action (VLA) Models: The most recent paradigm shift involves foundational models for robotics, drawing inspiration from the success of large language models (LLMs) and Vision-Language Models (VLMs). VLAs like $\pi_0$ and SmolVLA leverage vast multi-modal datasets, VLM backbones, and advanced generative models (flow matching) to learn generalist policies that can understand natural language instructions and operate across diverse tasks and robot embodiments. This represents a move towards truly versatile and adaptable robots.

  7. Open-Source and Decentralized Data: A parallel evolution is the increasing emphasis on open-source hardware, software, and decentralized data collection (e.g., Open-X, DROID, SmolVLA), aiming to democratize access to and advance robot learning research beyond large, proprietary institutions.

    This paper's work sits at the forefront of this evolution, particularly in the generative models and generalist VLA stages, summarizing the state-of-the-art and pointing towards future directions.

3.4. Differentiation Analysis

The paper's approach, and the advancements it surveys, differentiate from previous methods primarily in how they handle the complexity and diversity of real-world robot tasks and human demonstrations:

  • From Point-wise Policies to Generative Models:

    • Previous (Point-wise BC): Traditional Behavioral Cloning typically learns a deterministic mapping $f: \mathcal{O} \to \mathcal{A}$, predicting a single, specific action for a given observation. This is effective for unimodal behaviors but struggles with multimodal demonstrations (e.g., multiple ways to grasp an object), often averaging across modes to produce indecisive or suboptimal actions.
    • This Paper's Focus (Generative Models): The paper strongly advocates for generative models (VAEs, DMs, FMs) which learn the full probability distribution p(o, a) or $p(a|o)$. This allows them to capture and sample from the multimodal strategies present in expert data, yielding more robust and flexible behaviors. For example, a generative model can produce a distribution over different valid grasps, rather than an average, potentially ambiguous one.
  • From Single-Action Prediction to Action Chunking:

    • Previous (Single-Action BC): Early BC policies often predicted a single action $a_t$ at each timestep $t$. This can lead to compounding errors due to covariate shift and makes learning long-horizon tasks challenging.
    • This Paper's Focus (Action Chunking): Modern approaches like ACT and Diffusion Policy predict action chunks (sequences of actions $a_{t:t+H_a}$). This aligns with human planning, helps in capturing temporal dependencies, and provides a more stable foundation for sequential decision-making.
  • From Task-Specific Policies to Generalist VLAs:

    • Previous (Single-Task BC/RL): Historically, robot learning policies were trained for specific tasks (e.g., pick-and-place, block pushing) on isolated datasets. They lacked the ability to generalize to new tasks, environments, or robot embodiments.
    • This Paper's Focus (Generalist VLAs): The cutting-edge Vision-Language-Action (VLA) models ($\pi_0$, SmolVLA) discussed in the paper represent a paradigm shift towards generalist policies. They leverage Vision-Language Models (VLMs) as backbones and are trained on massive, diverse datasets. Key differentiators include:
      • Language Conditioning: Robots can receive instructions in natural language, enabling zero-shot generalization to novel variations of known tasks.
      • Cross-Embodiment Capabilities: These models can control different robot hardware configurations (e.g., varying numbers of degrees of freedom, mobile/static platforms) by intelligently handling action space differences.
      • Unified Architectures: They integrate visual perception, language understanding, and action generation within a single Transformer-based framework (often Mixture of Experts), making them efficient and broadly capable.
  • From Synchronous to Asynchronous Inference:

    • Previous (Synchronous Inference): Directly deploying complex policies on resource-constrained robot hardware often leads to latency and idle periods as the robot waits for policy computations.

    • This Paper's Focus (Optimized Async Inference): The paper introduces asynchronous inference strategies that decouple computation (PolicyServer) from execution (RobotClient). This allows computational burden to be managed remotely, while the robot continues executing actions from a queue, significantly improving responsiveness and practical deployability.

      In essence, the paper traces a progression from merely mimicking actions to understanding the underlying distribution of actions, planning sequences of actions, and ultimately to building highly versatile, language-conditioned robot agents that can operate broadly and efficiently in the real world.
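
To make the asynchronous-inference pattern described in the last point above concrete, here is a minimal single-process sketch (not the paper's PolicyServer/RobotClient implementation): one thread keeps a shared action queue filled by running the policy, while the control loop pops and executes actions at a fixed rate, so the robot never idles while the policy computes. All names and rates are illustrative assumptions.

```python
import queue
import threading
import time

import numpy as np

action_queue = queue.Queue()   # shared buffer of pending actions


def policy_worker(policy, get_latest_obs, refill_below=2):
    """'Server side': refill the queue with a fresh action chunk when it runs low."""
    while True:
        if action_queue.qsize() < refill_below:
            chunk = policy(get_latest_obs())        # (chunk_size, action_dim)
            for a in np.asarray(chunk):
                action_queue.put(a)
        time.sleep(0.01)


def robot_loop(send_action, control_hz=30, steps=300):
    """'Client side': pop and execute actions at a fixed control rate."""
    period = 1.0 / control_hz
    for _ in range(steps):
        start = time.monotonic()
        try:
            send_action(action_queue.get_nowait())  # execute next queued action
        except queue.Empty:
            pass                                    # policy lagging: hold current pose
        time.sleep(max(0.0, period - (time.monotonic() - start)))


# Typical wiring (illustrative):
# threading.Thread(target=policy_worker, args=(policy, get_obs), daemon=True).start()
# robot_loop(send_action)
```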

4. Methodology

This section provides a detailed breakdown of the technical methodologies discussed in the paper, integrating mathematical formulas with their procedural explanations.

4.1. Behavioral Cloning (BC)

Behavioral Cloning (BC) forms the foundational approach to imitation learning discussed. It frames the problem of learning control as a supervised learning task.

The core idea is to learn a mapping function $f$ from observations $o_t$ to expert actions $a_t$. This mapping $f: \mathcal{O} \mapsto \mathcal{A}$ is learned by minimizing a risk function $\mathcal{L}$ between the predicted actions $f(o_t)$ and the true expert actions $a_t$ observed in a dataset $\mathcal{D}$.

Formally, given a dataset $\mathcal{D} = \{ \tau^{(i)} \}_{i=1}^N$ of $N$ expert trajectories, where each trajectory $\tau^{(i)} = \{ (o_t^{(i)}, a_t^{(i)}) \}_{t=0}^{T_i}$ is a sequence of observation-action pairs of length $T_i$, BC aims to solve the following optimization problem:

$$\min_{f} \; \mathbb{E}_{(o_t, a_t) \sim p(\bullet)} \, \mathcal{L}\big(a_t, f(o_t)\big)$$

Here:

  • $f$ is the policy or mapping function we aim to learn, which takes an observation $o_t$ and outputs an action $a_t$.
  • $o_t \in \mathcal{O}$ represents the observations at time $t$ (e.g., images and proprioceptive information).
  • $a_t \in \mathcal{A}$ represents the expert actions at time $t$.
  • $p(\bullet)$ denotes the unknown expert's joint observation-action distribution $p: \mathcal{O} \times \mathcal{A} \mapsto [0, 1]$. In practice, this distribution is approximated by the collected dataset $\mathcal{D}$.
  • $\mathcal{L}: \mathcal{A} \times \mathcal{A} \mapsto \mathbb{R}$ is an arbitrary risk function (or loss function), such as Mean Squared Error (MSE) for continuous actions or Cross-Entropy for discrete actions, measuring the discrepancy between the true expert action $a_t$ and the policy's predicted action $f(o_t)$.
  • $\mathbb{E}$ denotes the expected value over the expert's distribution.
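
A minimal PyTorch sketch of this supervised objective, assuming a tensor dataset of (observation, action) pairs and an MSE risk; the network sizes, dimensions, and synthetic data below are illustrative stand-ins, not the paper's setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

obs_dim, act_dim = 32, 7                              # illustrative dimensions

# Point-estimate policy f: O -> A
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU(),
                       nn.Linear(256, act_dim))

# Expert demonstrations D = {(o_t, a_t)}; random tensors stand in for real data
observations, actions = torch.randn(10_000, obs_dim), torch.randn(10_000, act_dim)
loader = DataLoader(TensorDataset(observations, actions), batch_size=256, shuffle=True)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
risk = nn.MSELoss()                                   # L(a_t, f(o_t))

for epoch in range(10):
    for o_t, a_t in loader:
        loss = risk(policy(o_t), a_t)                 # empirical risk over D
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```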

4.1.1. Data Characteristics in BC

The dataset $\mathcal{D}$ for imitation learning consists of offline, reward-free expert trajectories. A key characteristic is that these samples are not i.i.d. (independent and identically distributed). Expert demonstrations are collected sequentially, meaning observations and actions are temporally correlated. This non-i.i.d. nature can pose challenges for standard supervised learning algorithms, as predictions at one step can lead to states not seen in the training data, a phenomenon known as covariate shift, which can cause compounding errors.

4.1.2. Limitations of Point-Estimate Policies

While conceptually elegant, point-estimate policies (where $f(o_t)$ outputs a single, deterministic action) trained with the above objective suffer from several issues:

  1. Covariate Shift: Small prediction errors can accumulate, driving the policy into out-of-distribution states where its performance degrades significantly.

  2. Multimodal Averaging: Expert demonstrations often contain multimodal strategies for achieving the same goal (e.g., multiple ways to grasp an object). A point-estimate policy trained with a typical risk function like MSE tends to average these different strategies, leading to an indecisive or unstable action that may not correspond to any successful expert behavior.

    To address the multimodality issue, the paper transitions to discussing generative models.
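
A tiny constructed illustration of the mode-averaging issue in point 2 above (not from the paper): if the same observation is demonstrated with two distinct expert actions, -1 and +1, a point estimate trained with MSE converges to their mean, 0, which matches neither mode.

```python
import torch

# Two expert "modes" for the identical observation, e.g. grasp from the left (-1)
# or from the right (+1).
expert_actions = torch.tensor([-1.0, +1.0])

a_hat = torch.zeros(1, requires_grad=True)            # point-estimate action
opt = torch.optim.SGD([a_hat], lr=0.1)
for _ in range(500):
    loss = ((a_hat - expert_actions) ** 2).mean()     # MSE against both modes
    opt.zero_grad()
    loss.backward()
    opt.step()

print(a_hat.item())   # ~0.0: an averaged action that corresponds to neither expert mode
```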

4.2. A (Concise) Introduction to Generative Models

Generative Models (GMs) are employed in imitation learning to approximate the unknown data distribution $p(x)$, which in the context of BC represents the expert's joint distribution over observation-action pairs (o, a). Given a finite set of $N$ pairs $\mathcal{D} = \{ (o, a)_i \}_{i=0}^N$ (assumed i.i.d. for the purpose of introducing GMs, even if the full trajectories are non-i.i.d.), GMs aim to learn a parametric distribution $p_\theta(o, a)$ such that:

  1. New samples generated from $p_\theta(\bullet)$ resemble those in $\mathcal{D}$.

  2. High likelihood is assigned to the observed regions of the true, unobservable p(o, a).

    Likelihood-based learning provides a principled objective for training GMs.

4.2.1. Variational Auto-Encoders

Variational Auto-Encoders (VAEs) are a prominent class of generative models that introduce an inductive bias: they posit that observed samples (o, a) are influenced by an unobservable latent variable $z \in Z$.

The joint distribution p(o, a) is modeled as a marginalization over this latent variable $z$ from the complete joint distribution p(o, a, z):

$$p(o, a) = \int_{\mathrm{supp}(Z)} p(o, a \mid z)\, p(z) \, dz$$

Here:

  • $z \in Z$ is the latent variable.

  • $p(z)$ is the prior distribution over the latent space $Z$, often chosen as a simple standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$.

  • $p(o, a \mid z)$ is the likelihood or decoder function, which generates an observation-action pair given a latent variable $z$.

  • $\int_{\mathrm{supp}(Z)} \dots dz$ denotes integration over the support of $Z$.

    Intuitively, $z$ can represent higher-level information about the task or strategy being performed by the human demonstrator, effectively capturing different multimodal behaviors.

To train a VAE, we want to maximize the log-likelihood of the observed data under the model. For a dataset $\mathcal{D}$ of $N$ i.i.d. observation-action pairs, the log-likelihood is:

$$\log p_\theta(\mathcal{D}) = \log \prod_{i=0}^{N} p_\theta\big((o, a)_i\big)$$

Substituting the latent variable model and introducing an approximate posterior $q_\phi(z \mid (o,a)_i)$ (the encoder network), the log-likelihood for a single data point $(o, a)_i$ can be expressed as:

$$\begin{aligned} \log p_\theta\big((o, a)_i\big) &= \log \int_{\mathrm{supp}(Z)} p_\theta\big((o, a)_i \mid z\big)\, p(z) \, dz \\ &= \log \int_{\mathrm{supp}(Z)} \frac{q_\phi\big(z \mid (o,a)_i\big)}{q_\phi\big(z \mid (o,a)_i\big)} \, p_\theta\big((o, a)_i \mid z\big)\, p(z) \, dz \\ &= \log \mathbb{E}_{z \sim q_\phi(z \mid (o,a)_i)} \left[ \frac{p(z)}{q_\phi\big(z \mid (o,a)_i\big)} \, p_\theta\big((o, a)_i \mid z\big) \right], \end{aligned}$$

where we multiply by $1 = \frac{q_\phi(z \mid (o,a)_i)}{q_\phi(z \mid (o,a)_i)}$ to introduce the approximate posterior $q_\phi$ into the expectation. The full sum over $N$ data points is then:

$$\log p_\theta(\mathcal{D}) = \sum_{i=0}^{N} \log \mathbb{E}_{z \sim q_\phi(z \mid (o,a)_i)} \left[ \frac{p(z)}{q_\phi\big(z \mid (o,a)_i\big)} \, p_\theta\big((o, a)_i \mid z\big) \right]$$

This log-likelihood is generally intractable when neural networks are used to model the distributions. Kingma and Welling (2013) introduced the Evidence Lower Bound (ELBO) as a tractable objective by applying Jensen's inequality ($\log \mathbb{E}[\bullet] \geq \mathbb{E}[\log(\bullet)]$). For a single data point $(o, a)_i$, the ELBO is:

$$\begin{aligned} \log p_\theta\big((o, a)_i\big) &\geq \mathbb{E}_{z \sim q_\phi(z \mid (o,a)_i)} \big[ \log p_\theta\big((o, a)_i \mid z\big) \big] + \mathbb{E}_{z \sim q_\phi(z \mid (o,a)_i)} \left[ \log \left( \frac{p(z)}{q_\phi\big(z \mid (o,a)_i\big)} \right) \right] \\ &= \mathbb{E}_{z \sim q_\phi(z \mid (o,a)_i)} \big[ \log p_\theta\big((o, a)_i \mid z\big) \big] - \mathrm{D}_{\mathrm{KL}}\big[ q_\phi\big(z \mid (o,a)_i\big) \,\|\, p(z) \big] \end{aligned}$$

Summing over all data points in $\mathcal{D}$, the total ELBO objective to maximize is:

$$\mathrm{ELBO}_{\mathcal{D}}(\theta, \phi) = \sum_{i=0}^{N} \Big( \mathbb{E}_{z \sim q_\phi(z \mid (o,a)_i)} \big[ \log p_\theta\big((o, a)_i \mid z\big) \big] - \mathrm{D}_{\mathrm{KL}}\big[ q_\phi\big(z \mid (o,a)_i\big) \,\|\, p(z) \big] \Big)$$

Here:

  • $\theta$ are the parameters of the decoder $p_\theta(o, a \mid z)$.

  • $\phi$ are the parameters of the encoder $q_\phi(z \mid o, a)$.

  • $\mathrm{D}_{\mathrm{KL}}[Q \| P]$ is the Kullback-Leibler divergence from distribution $P$ to distribution $Q$, defined as $\mathbb{E}_{x \sim Q}\left[\log \frac{Q(x)}{P(x)}\right]$. It measures the information lost when $P$ is used to approximate $Q$. In VAEs, this term encourages the encoder's distribution $q_\phi(z \mid o, a)$ to be close to the simple prior $p(z)$.

    The training objective is typically to minimize the negative ELBO, which can be factorized into two interpretable terms (considering one $(o, a) \sim \mathcal{D}$ for simplicity):

$$\begin{aligned} \min_{\theta, \phi} \, -\mathrm{ELBO}_{(o,a) \sim \mathcal{D}}(\theta, \phi) &= \min_{\theta, \phi} \, \mathbf{L}^{\mathrm{rec}}(\theta) + \mathbf{L}^{\mathrm{reg}}(\phi), \\ \mathbf{L}^{\mathrm{rec}}(\theta) &= -\mathbb{E}_{z \sim q_\phi(\bullet \mid o, a)} \big[ \log p_\theta(o, a \mid z) \big] \\ \mathbf{L}^{\mathrm{reg}}(\phi) &= \mathrm{D}_{\mathrm{KL}}\big[ q_\phi(z \mid o, a) \,\|\, p(z) \big]. \end{aligned}$$

  • $\mathbf{L}^{\mathrm{rec}}(\theta)$ is the reconstruction loss: it encourages the decoder to accurately reconstruct the input data.

  • $\mathbf{L}^{\mathrm{reg}}(\phi)$ is the regularization loss: it forces the encoder's approximate posterior $q_\phi(z \mid o, a)$ to resemble the prior distribution $p(z)$, usually a standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$. This enforces a smooth and continuous latent space.

    The expectation in $\mathbf{L}^{\mathrm{rec}}$ is often approximated via Monte Carlo (MC) estimates:

$$\mathbf{L}^{\mathrm{rec}} = -\mathbb{E}_{z \sim q_\phi(\bullet \mid o, a)} \big[ \log p_\theta(o, a \mid z) \big] \approx -\frac{1}{n} \sum_{i=0}^{n} \log p_\theta(o, a \mid z_i)$$

If $p_\theta(o, a \mid z)$ is parametrized as an isotropic Gaussian distribution with mean $\mu_\theta(z)$ and variance $\sigma^2$, the log-likelihood term simplifies, and the reconstruction loss becomes:

$$\mathbf{L}^{\mathrm{rec}} \approx \frac{1}{n} \sum_{i=0}^{n} \big\| (o, a) - \mu_\theta(z_i) \big\|_2^2$$

This means training a VAE amounts to:

  1. Minimizing the reconstruction loss $\mathbf{L}^{\mathrm{rec}}$ to ensure the decoder can reconstruct the input examples.

  2. Minimizing the regularization loss $\mathbf{L}^{\mathrm{reg}}$ to enforce information compression and a well-structured latent space.

    The KL-divergence term can be computed in closed form if $q_\phi$ is modeled as a Gaussian $\mathcal{N}(\mu_\phi(o,a), \Sigma_\phi(o,a))$ and $p(z)$ is a standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
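
A minimal sketch of these two terms for a Gaussian encoder and decoder, treating the input x as the concatenated (o, a) vector; the architecture is an illustrative assumption, and the KL term uses the standard closed form for a diagonal Gaussian against $\mathcal{N}(\mathbf{0}, \mathbf{I})$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VAE(nn.Module):
    def __init__(self, x_dim, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):                              # x = concatenated (o, a)
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar


def negative_elbo(x, x_rec, mu, logvar):
    # L_rec: Gaussian decoder => squared reconstruction error (single MC sample)
    l_rec = F.mse_loss(x_rec, x, reduction="sum") / x.shape[0]
    # L_reg: closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
    l_reg = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.shape[0]
    return l_rec + l_reg
```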

The image below (Figure 22 from the original paper) illustrates the latent variable model in a robotics application, showing how observed (o, a) variables are influenced by an unobservable latent variable $z$.

Figure 22 | (A) The latent variable model in a robotics application regulates influence between observed (o, a) variables and the unobservable latent variable $z$. (B) The corresponding encoder $q(z \mid o, a)$ and decoder $p(o, a \mid z)$ structures.

4.2.2. Diffusion Models

Diffusion Models (DMs) are another class of generative models that approximate an unknown data distribution by modeling a process of denoising. Unlike VAEs which use a single latent variable, DMs posit a Markov chain of latent variables where each $z_t$ depends only on $z_{t-1}$.

The generative process is decomposed into a series of Markovian interactions between latent variables:

$$\begin{aligned} p\big(\underbrace{o, a}_{= z_0}\big) &= \int_{\mathrm{supp}(Z_1)} \cdots \int_{\mathrm{supp}(Z_T)} p\big(z_0, z_1, \ldots, z_T\big) \, dz_1 \cdots dz_T \\ p\big(z_0, z_1, \ldots, z_T\big) &= p(z_T) \prod_{t=1}^{T} p\big(z_{t-1} \mid z_t\big), \end{aligned}$$

Here:

  • $z_0$ represents the original observation-action pair (o, a).

  • $z_1, \dots, z_T$ are increasingly noisy latent variables, where $z_T$ is almost pure noise (e.g., standard Gaussian).

  • $p(z_T)$ is a simple prior distribution for the final latent variable.

  • $p(z_{t-1} \mid z_t)$ is the reverse transition probability, which the DM learns to model, effectively denoising $z_t$ to obtain $z_{t-1}$.

    The image below (Figure 23 from the original paper) graphically represents this Markov chain of latent variables, where samples from the posterior distribution are progressively "higher up" in the hierarchy.


Figure 23 | A Markov chain with samples from the posterior distribution being progressively higher up in the hierarchy.

The core idea is to train a model to reverse a known forward diffusion process (adding noise). Ho et al. (2020) simplified the training objective for DMs. They assume a fixed isotropic Gaussian posterior for the forward process: $q(z_t \mid z_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\, z_{t-1}, \beta_t \mathbf{I})$.

The simplified training objective for Diffusion Models is given by:

$$\mathcal{L}(\theta) = \mathbb{E}_{t, z_0, \epsilon} \big[ \| \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_t}\, z_0 + \epsilon \sqrt{1 - \bar{\alpha}_t}, \, t \big) \|^2 \big], \quad t \sim \mathcal{U}(\{1, \dots, T\}), \quad z_0 \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

Here:

  • $\epsilon$ is the Gaussian noise sampled from $\mathcal{N}(\mathbf{0}, \mathbf{I})$.

  • $\epsilon_\theta(\cdot, t)$ is the noise regressor (a neural network, often a U-Net) parameterized by $\theta$, which is trained to predict the noise $\epsilon$ that was added to $z_0$ at timestep $t$.

  • $z_0$ is a sample from the target data distribution $\mathcal{D}$ (e.g., an observation-action pair).

  • $t$ is a random timestep sampled uniformly from $\{1, \dots, T\}$.

  • $\alpha_t = 1 - \beta_t$, and $\bar{\alpha}_t = \prod_{k=1}^{t} \alpha_k$. These terms are derived from the variance schedule $\beta_t$ of the forward diffusion process.

  • The argument to $\epsilon_\theta$ represents a noisy version of $z_0$ at time $t$, specifically $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \epsilon \sqrt{1 - \bar{\alpha}_t}$.

  • The objective minimizes the L2 norm (squared difference) between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$.

    At inference time (to generate a new sample), the model starts from pure noise $z_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively denoises it using the learned $\epsilon_\theta$. The process involves computing $z_{t-1}$ from $z_t$ using the predicted noise. This sampling formula is:

$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \Bigg( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(z_t, t) \Bigg) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

where:

  • $z_t$ is the noisy sample at time $t$.

  • $\epsilon_\theta(z_t, t)$ is the predicted noise.

  • $\sigma_t$ is a learned or fixed variance parameter.

  • This equation shows how to iteratively remove noise to move from $z_t$ to $z_{t-1}$, eventually reconstructing the original data $z_0$.

    The image below (Figure 24 from the original paper) illustrates how samples diffuse away from the original data points in a 2D space due to Gaussian noise.


Figure 24 | The figure illustrates the process of denoising samples from a tractable, easy-to-sample distribution.

The image below (Figure 25 from the original paper) visualizes how a diffused sample, initially a clean observation-action pair, gets corrupted by Gaussian noise.


Figure 25 | The figure shows how, when teleoperated, the points distribute along a diagonal.
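
A compact sketch of the training objective and sampling rule above for an unconditional DDPM over flat vectors $z_0$; the tiny MLP noise regressor, the linear variance schedule, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)                 # variance schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)             # \bar{alpha}_t

dim = 8                                               # dimensionality of z_0
eps_net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))


def training_loss(z0):
    """|| eps - eps_theta( sqrt(abar_t) z0 + sqrt(1 - abar_t) eps, t ) ||^2"""
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    ab = alpha_bars[t].unsqueeze(-1)
    zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps       # noisy sample at step t
    t_in = (t.float() / T).unsqueeze(-1)              # simple scalar time embedding
    eps_hat = eps_net(torch.cat([zt, t_in], dim=-1))
    return ((eps - eps_hat) ** 2).mean()


@torch.no_grad()
def sample(n=16):
    """Iteratively denoise pure noise with the reverse update rule (sigma_t^2 = beta_t)."""
    z = torch.randn(n, dim)
    for t in reversed(range(T)):
        t_in = torch.full((n, 1), t / T)
        eps_hat = eps_net(torch.cat([z, t_in], dim=-1))
        z = (z - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return z
```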

4.2.3. Flow Matching

Flow Matching (FM) extends Diffusion Models by modeling generative processes as continuous transformations rather than discrete steps. It learns a deterministic, differentiable flow $\psi$ that transports samples from a simple prior distribution $p_0$ (e.g., standard Gaussian) to a complex data distribution $p_1$. This flow is defined by a (possibly time-dependent) vector field $v$.

The continuous transformation is formalized by an Ordinary Differential Equation (ODE):

$$\frac{d}{dt}\psi(t, z) = v\big(t, \psi(t, z)\big), \qquad \psi(0, z) = z.$$

Here:

  • $\psi(t, z)$ describes the position of a particle that started at $z$ at time $t=0$, as it flows under the influence of the vector field $v$.

  • $t \in [0, 1]$ represents continuous time.

  • $v(t, \psi(t, z))$ is the vector field that dictates the velocity of the particle at position $\psi(t, z)$ and time $t$.

  • $\psi(0, z) = z$ means the flow starts at the initial point $z$ drawn from the prior distribution $p_0$.

    FM models learn to approximate a target vector field $u(t, z)$ such that the induced flows transform $p_0$ into $p_1$. For Diffusion Models, the corresponding vector field $u(t, z \mid z_0)$ is defined as:

$$u(t, z \mid z_0) = \frac{\frac{d}{dt}\alpha(1-t)}{1 - \big(\alpha(1-t)\big)^2}\,\big(\alpha(1-t)\, z - z_0\big), \qquad \alpha(t) = e^{-\frac{1}{2}\int_0^t \beta(s)\, ds}, \qquad \forall z_0 \in \mathcal{D}.$$

Here:

  • $z$ is the current point in the latent space at time $t$.

  • $z_0$ is a sample from the target data distribution $\mathcal{D}$.

  • $\alpha(t)$ and $\beta(s)$ are continuous generalizations of the discrete variance schedule from Diffusion Models.

  • This vector field is conditioned on $z_0$, meaning the flow path depends on the specific target data point.

    A common approach in Flow Matching is Conditional Flow Matching (CFM), which defines a simple path between a sample $z_1 \sim p_1$ (from the data distribution) and a sample $z_0 \sim p_0$ (from the prior distribution) using linear interpolation: $z_t = (1-t)\, z_0 + t\, z_1$. This results in a target vector field $u(t, z_t) = z_1 - z_0$.

FM models are trained with a simple regression objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{t, z_0, z_1} \big[ \| v_\theta\big((1-t)\, z_0 + t\, z_1, \, t\big) - (z_1 - z_0) \|^2 \big], \quad t \sim \mathcal{U}([0, 1]),$$

Here:

  • $v_\theta(z, t)$ is the vector field regressor (a neural network parameterized by $\theta$) that we train.

  • $z_0 \sim p_0(\bullet)$ is a sample from the prior distribution.

  • $z_1 \sim p_1(\bullet)$ is a sample from the data distribution.

  • $(1-t)\, z_0 + t\, z_1$ represents the interpolated point $z_t$ along the path from $z_0$ to $z_1$.

  • $z_1 - z_0$ is the target vector field for this linear path.

  • $t$ is sampled continuously from $\mathcal{U}([0, 1])$.

    At inference time, samples are generated by starting with $z_0 \sim p_0$ and numerically integrating the learned vector field $\frac{dz}{dt} = v_\theta(z_t, t)$ for $t \in [0, 1]$.
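
A minimal conditional-flow-matching sketch of the regression objective and the Euler integration described above; the small MLP vector-field regressor and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim = 8
v_net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))


def cfm_loss(z1):
    """|| v_theta((1-t) z0 + t z1, t) - (z1 - z0) ||^2 along linear interpolation paths."""
    z0 = torch.randn_like(z1)                         # sample from the prior p_0
    t = torch.rand(z1.shape[0], 1)                    # t ~ U([0, 1])
    zt = (1 - t) * z0 + t * z1                        # interpolated point
    target = z1 - z0                                  # target vector field for this path
    v_hat = v_net(torch.cat([zt, t], dim=-1))
    return ((v_hat - target) ** 2).mean()


@torch.no_grad()
def sample(n=16, steps=50):
    """Integrate dz/dt = v_theta(z, t) from t=0 to t=1 with explicit Euler steps."""
    z = torch.randn(n, dim)                           # start from the prior
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((n, 1), k * dt)
        z = z + dt * v_net(torch.cat([z, t], dim=-1))
    return z
```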

The image below (Figure 27 from the original paper) illustrates mass flows across a support for different vector fields, showing how Flow Matching learns to approximate target transformations.


Figure 27 | The figure illustrates flows of mass across the same support (top versus bottom, using two different time-invariant 2D fields $u_1(x,y) = (x, 0)$ and $u_2(x,y) = (x/\sqrt{2},\, y/\sqrt{2})$). Notice time flows continuously in [0, 1]. FM models learn to approximate a target vector field, thereby producing arbitrary (goal) transformations of an easy-to-sample initial distribution.

The image below (Figure 10 from the original paper) compares the distribution evolution of Diffusion Models and Flow Matching over 50 steps, demonstrating FM's more direct path.


Figure 10 | Comparison of Diffusion and Flow Matching methods on the joint distribution of robot observations and actions over $T = 50$ steps.

4.3. Action Chunking with Transformers (ACT)

Action Chunking with Transformers (ACT) (Zhao et al., 2023) applies Conditional Variational Auto-Encoders (CVAEs) and Transformer architectures to imitation learning, specifically focusing on learning action chunks ($a_{t:t+k}$) instead of single actions ($a_t$).

ACT leverages Conditional VAEs (CVAEs), which are a variation of standard VAEs that allow conditioning on additional input variables. In ACT, the policy distribution $p(a \mid o)$ is directly modeled. The CVAE objective adapts the ELBO from VAEs (Section 4.2.1) to include explicit conditioning on observations $o_i$ and a learned prior $p_\omega(z \mid o_i)$.

The CVAE ELBO objective used in ACT for a dataset $\mathcal{D}$ is:

$$\mathrm{ELBO}_{\mathcal{D}}(\theta, \phi, \omega) = \sum_{i=0}^{N} \Big( \mathbb{E}_{z \sim q_\phi(\cdot \mid o_i, a_i)} \big[ \log p_\theta(a_i \mid z, o_i) \big] - \mathrm{D}_{\mathrm{KL}}\big[ q_\phi(z \mid o_i, a_i) \,\|\, p_\omega(z \mid o_i) \big] \Big)$$

Here:

  • $\theta$ are the parameters of the decoder (likelihood) $p_\theta(a_i \mid z, o_i)$.

  • $\phi$ are the parameters of the approximate posterior (encoder) $q_\phi(z \mid o_i, a_i)$.

  • $\omega$ are the parameters of the conditional prior $p_\omega(z \mid o_i)$. This prior is now learned and conditioned on the observation $o_i$, providing a more informed latent-space regularization.

  • The first term is the reconstruction loss (expected log-likelihood of actions given latent variable and observation).

  • The second term is the KL divergence between the approximate posterior and the conditional prior.

    ACT is trained as a $\beta$-CVAE (Higgins et al., 2017), where the KL regularization term is scaled by a hyperparameter $\beta \in \mathbb{R}^+$. A higher $\beta$ leads to a less expressive latent space, enforcing more compression.

4.3.1. ACT Architecture and Inference

The ACT architecture (Figures 28, 29, 30) comprises:

  1. CVAE Encoder: (Figure 28) Takes input action chunks $a_{t:t+k}$, embeds them, aggregates them with positional encoding, and feeds them into a Transformer encoder to predict the style variable $z$. Importantly, this encoder is used only during training to provide a target for the latent space and is entirely disregarded at inference time. It primarily uses proprioceptive states for efficiency.

  2. CVAE Decoder: (Figure 29) A full encoder-decoder Transformer architecture. Camera observations from multiple views are embedded using pre-trained visual encoders and combined with positional embeddings. This visual information, along with proprioceptive information and the style variable $z$ (or a fixed value during inference), is fed to the Transformer to decode action chunks. The encoder part of the Transformer shares its matrices (K, V) with the decoder.

    The image below (Figure 28 from the original paper) shows the CVAE encoder used in ACT.


Figure 28 | The CVAE encoder used in ACT. Input action chunks are first embedded and aggregated with positional encoding, then fed to a Transformer encoder to predict the style variable $z$. The encoder is exclusively used to train the decoder, and it is entirely disregarded at inference time.

The image below (Figure 29 from the original paper) shows the CVAE decoder used in ACT.


Figure 29 | The CVAE decoder used in ACT, comprising a full encoder-decoder Transformer architecture. Camera observations from all $n$ camera views are first embedded using pre-trained visual encoders, and then aggregated with the corresponding positional embeddings. Then, the proprioceptive information and the style variable $z$ retrieved from the CVAE encoder are fed to the encoder-decoder Transformer for inference. The encoder shares the matrices K, V with the decoder, and is trained to decode fixed positional embeddings into action chunks.

The image below (Figure 30 from the original paper) shows the overall `ACT` framework.

![Schematic of Action Chunking with Transformers: single-step vs. chunked action execution (left); the model framework with the CVAE encoder and the Transformer encoder-decoder with cross-attention (middle); and the associated distributions $q_\phi(z|o,a)$, $p_\omega(z|o)$, $p_\theta(a|z,o)$ (right)](/files/papers/6907680a971e575bdfc172d9/images/13.jpg)

Figure 30 | The ACT framework designed to cope with high-dimensional multi-modal demonstration data, and a transformer-based CVAE architecture.

At `test time`, `ACT` employs a deterministic procedure for `sampling` $z$. Instead of sampling from a complex `posterior`, $z$ is simply set to $0$, given that the `conditional prior` on $z$ during training is typically a `standard Gaussian` $\mathcal{N}(\mathbf{0}, \mathbf{I})$. `Conditioning` on the observation $o$ is achieved by explicitly feeding `proprioceptive` and `visual observations` to the `decoder` $p_\theta(a|z,o)$.

### 4.3.2. Code Example: Training and Using ACT in Practice

The paper provides code snippets `Code 7` and `Code 8` for `training` and `using ACT`, respectively. These snippets demonstrate how to:
*   Configure `ACTPolicy` with input/output features.
*   Instantiate `LeRobotDataset` with `action chunking` `delta_timestamps`.
*   Run a `training loop` with `optimizer`, `dataloader`, and `loss calculation`.
*   Save/load `policy checkpoints` and `pre/post processors`.
*   Inference loop for `real robot deployment`, including `robot observation capture`, `preprocessing`, `action selection` via `model.select_action()`, `postprocessing`, and `sending actions` to the robot.
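
As a rough sketch in the spirit of the training snippet described above (Code 7), the following assembles an ACT training loop with the classes named in the text (`ACTPolicy`, `LeRobotDataset`, `delta_timestamps`). Import paths, config fields, and the 30 fps assumption may differ across lerobot versions, and `ACTPolicy` may additionally require dataset statistics and input/output feature specifications, which are omitted here for brevity.

```python
# A minimal sketch of an ACT training loop; import paths and config fields are assumptions.
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

chunk_size = 100
# Ask the dataset to return a chunk of future actions for every frame (assuming 30 fps).
delta_timestamps = {"action": [i / 30.0 for i in range(chunk_size)]}
dataset = LeRobotDataset("lerobot/svla_so101_pickplace", delta_timestamps=delta_timestamps)

policy = ACTPolicy(ACTConfig(chunk_size=chunk_size))  # features/stats omitted; see Code 7
policy.train()
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

for step, batch in enumerate(loader):
    output = policy.forward(batch)
    # Depending on the lerobot version, forward returns (loss, dict) or a dict with "loss".
    loss = output[0] if isinstance(output, tuple) else output["loss"]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}")

policy.save_pretrained("outputs/act_so101_pickplace")  # checkpoint for later deployment
```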

## 4.4. Diffusion Policy

`Diffusion Policy (DP)` (Chi et al., 2024) adapts `Diffusion Models` for `imitation learning` in robotics. Similar to `ACT`, `DP` focuses on predicting `action chunks` and modeling the `conditional distribution` $p(a|o)$ rather than the full joint $p(o,a)$. This choice is motivated by the computational burden of `diffusion` for `observations` and the primary goal of generating `controls` for robots.

The `Diffusion Policy` trains a `noise regressor` $\epsilon_\theta$ (introduced in Section 4.2.2) to predict the noise added to an `action chunk`, conditioned on `observations`.

The `conditional, simplified diffusion objective` for `DP` is:
$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\, a_{t:t+H_a},\, \epsilon} \Bigl[ \bigl\| \epsilon - \epsilon_\theta \bigl( \sqrt{\bar{\alpha}_t}\, a_{t:t+H_a} + \epsilon \sqrt{1 - \bar{\alpha}_t},\; t,\; o_{t-H_o:t} \bigr) \bigr\|^2 \Bigr], \qquad t \sim \mathcal{U}(\{1, \dots, T\}), \quad a_{t:t+H_a},\, o_{t-H_o:t} \sim \mathcal{D}, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
$$
Here:
*   $\epsilon$ is the `Gaussian noise` sampled from $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
*   $\epsilon_\theta(\dots, t, o_{t-H_o:t})$ is the `noise regressor` (parameterized by $\theta$) which now takes three inputs:
    1.  A noisy `action chunk`: $\sqrt{\bar{\alpha}_t} a_{t:t+H_a} + \epsilon \sqrt{1 - \bar{\alpha}_t}$ (the `action chunk` $a_{t:t+H_a}$ corrupted by noise at timestep $t$).
    2.  The `diffusion timestep` $t$.
    3.  A `stack of previous observations` $o_{t-H_o:t}$, providing critical context for the `action chunk` prediction.
*   $a_{t:t+H_a}$ is an `action chunk` of length $H_a$.
*   $o_{t-H_o:t}$ is a history of observations of length $H_o$.
*   The objective minimizes the `L2 norm` between the true noise and the predicted noise.

### 4.4.1. DP Architecture and Inference

`DP` typically uses a `U-Net` architecture for its `noise regressor` $\epsilon_\theta$ (Figure 31).
1.  An initial noisy `action chunk` $\ddot{a}_{t:t+H_a}$ is mapped to a high-dimensional space.
2.  `Image observations` and `robot poses` are also `embedded` and aggregated with the `action embeddings`.
3.  The `U-Net` is trained to regress the noise, with `observation information` conditioned at `every layer` of the `U-Net` block.

The image below (Figure 31 from the original paper) illustrates the `Diffusion Policy` architecture.

![The Diffusion Policy architecture: a stack of previous observations conditions the denoising of a group of actions at every U-Net layer](/files/papers/6907680a971e575bdfc172d9/images/14.jpg)

Figure 31 | The Diffusion Policy architecture, as in Chi et al. (2024). A stack of $H_o$ previous observations is used as external conditioning to denoise a group of $H_a$ actions. Conditioning is performed at every layer of a U-Net block. Diffusion Policy obtains fully-formed action chunks with as few as $T=10$ denoising steps.

At `inference time`, `DP` starts with a noisy `action chunk` and iteratively uses the `noise predictor` to subtract the estimated noise over $T$ steps (e.g., as few as 10 steps for `DP`) to obtain the desired `action chunk` $a_{t:t+H_a}$. This iterative `denoising` process is conditioned on the current `observations` $o_{t-H_o:t}$.

### 4.4.2. Code Example: Training and Using Diffusion Policies in Practice

The paper provides code snippets `Code 9` and `Code 10` for `training` and `using Diffusion Policy`. These snippets demonstrate similar steps to `ACT`, but for the `DiffusionPolicy` class, including:
*   Configuration of `DiffusionConfig` and `DiffusionPolicy`.
*   Dataset instantiation with `delta_timestamps` for both `observations` and `actions`.
*   Standard `training loop` and `real robot inference loop` using `model.select_action()`.
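
A rough sketch of the deployment side (the role of Code 10) might look like the following; `DiffusionPolicy` and `select_action()` are taken from the text, while the import path, checkpoint path, observation keys, and the `FakeRobot` stand-in are assumptions for illustration only.

```python
# A minimal sketch of a real-robot inference loop with a trained Diffusion Policy.
import torch
from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy

class FakeRobot:
    """Stand-in for a real robot interface (capture/send are normally hardware calls)."""
    def capture_observation(self):
        return {
            "observation.images.top": torch.rand(3, 96, 96),
            "observation.state": torch.zeros(6),
        }
    def send_action(self, action):
        print("sending action:", action.tolist())

device = "cuda" if torch.cuda.is_available() else "cpu"
policy = DiffusionPolicy.from_pretrained("outputs/diffusion_so101_pickplace").to(device)
policy.eval()
policy.reset()  # clear internal observation/action queues between episodes
robot = FakeRobot()

for _ in range(300):  # control steps
    obs = robot.capture_observation()
    batch = {k: v.unsqueeze(0).to(device) for k, v in obs.items()}
    with torch.no_grad():
        # Pops one action from the current chunk; re-runs denoising only when it runs out.
        action = policy.select_action(batch)
    robot.send_action(action.squeeze(0).cpu())
```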

## 4.5. Optimized Inference

The use of `action chunks` (sequences of $H_a$ low-level commands) in modern policies like `ACT` and `Diffusion Policy` opens opportunities for `optimized inference`, particularly for `asynchronous deployment` on real robots.

The challenge is to balance the computational cost of predicting an `action chunk` with the need for responsive, real-time control. Two main strategies are identified:
1.  **Synchronous Inference (Naïve):** The robot computes a new `action chunk` $\mathbf{A}_t \gets \pi(o_t)$ at every timestep $t$ and then consumes the first action $a_t = \mathrm{PopFront}(\mathbf{A}_t)$. This is `adaptive` but `resource-intensive`.
2.  **Open-Loop Inference:** The robot computes an `action chunk` $\mathbf{A}_t$ once, executes all $H_a$ actions, and only then captures a new observation $o_{t+H_a}$ to predict the next chunk. This reduces computation but results in `open-loop control` for extended periods, making it less responsive to dynamic environments.

The paper proposes an `asynchronous inference` approach to optimize this trade-off, decoupling `action chunk prediction` from `action execution`. This involves a `PolicyServer` (potentially on a powerful remote machine with GPUs) and a `RobotClient` (on the robot hardware).

The image below (Figure 32 from the original paper) illustrates the interaction between a `PolicyServer` and `RobotClient` for `asynchronous inference`.

![Timing diagram of the PolicyServer-RobotClient interaction, showing action execution, observation updates, and queue aggregation](/files/papers/6907680a971e575bdfc172d9/images/15.jpg)

Figure 32 | Timing diagram illustrating the interaction between the PolicyServer and RobotClient, showing the execution of the policy $\pi(o_k)$ based on observation $o_k$ and the reception of actions, including the timing of action execution, observation updates, and the aggregation function $f(\pi(o_0), \pi(o_k))$.

### 4.5.1. Asynchronous Inference Control-Loop

The `asynchronous inference control-loop` is formally described by `Algorithm 1`:

**Algorithm 1: Asynchronous inference control-loop**

1: **Input:** horizon $T$, chunk size $H_a$, threshold $g \in [0, 1]$
2: **Init:** capture $o_0$; send $o_0$ to PolicyServer; receive $\mathbf{A}_0 \gets \pi(o_0)$
3: **for** $t = 0$ **to** $T$ **do**
4:   $a_t \gets \mathrm{PopFront}(\mathbf{A}_t)$
5:   $\mathrm{Execute}(a_t)$  ▷ execute action at step $t$
6:   **if** $|\mathbf{A}_t| / H_a < g$ **then**  ▷ queue below threshold
7:     capture new observation $o_{t+1}$
8:     **if** $\mathrm{NeedsProcessing}(o_{t+1})$ **then**  ▷ similarity filter, or triggers direct processing
9:       async_handle $\gets \mathrm{AsyncInfer}(o_{t+1})$  ▷ trigger new chunk prediction (non-blocking)
10:      $\tilde{\mathbf{A}}_{t+1} \gets \pi(o_{t+1})$  ▷ new queue is predicted with the policy
11:      $\mathbf{A}_{t+1} \gets f(\mathbf{A}_t, \tilde{\mathbf{A}}_{t+1})$  ▷ aggregate overlaps (if any)
12:    **end if**
13:  **end if**
14:  **if** $\mathrm{NotCompleted}(\mathrm{async\_handle})$ **then**
15:    $\mathbf{A}_{t+1} \gets \mathbf{A}_t$  ▷ no update on queue (inference is not over just yet)
16:  **end if**
17: **end for**

Here's a breakdown of the algorithm:
*   **Initialization:** The `RobotClient` captures an initial observation $o_0$, sends it to the `PolicyServer`, and receives the first `action chunk` $\mathbf{A}_0$.
*   **Action Consumption (Lines 4-5):** At each timestep $t$, the `RobotClient` takes the first action from its current `action queue` $\mathbf{A}_t$ (`PopFront`) and executes it.
*   **Triggering New Inference (Lines 6-13):**
    *   The `RobotClient` monitors the size of its `action queue`. If the ratio of remaining actions to the `chunk size` $H_a$ falls below a `threshold` $g$ (i.e., $|\mathbf{A}_t| / H_a < g$), it signals that a new `action chunk` should be requested.
    *   A new observation $o_{t+1}$ is captured.
    *   A `similarity filter` $\mathrm{NeedsProcessing}(o_{t+1})$ can be applied to avoid sending redundant observations if the environment hasn't changed significantly.
    *   $\mathrm{AsyncInfer}(o_{t+1})$ triggers a non-blocking request to the `PolicyServer` for a new chunk $\tilde{\mathbf{A}}_{t+1}$. This allows the `RobotClient` to continue executing actions from its current queue while the `PolicyServer` computes.
    *   Once $\tilde{\mathbf{A}}_{t+1}$ is received, it is aggregated with the remaining part of the old chunk $\mathbf{A}_t$ using a function $f$ (e.g., an `exponential moving average` on overlapping sections) to form the new `action queue` $\mathbf{A}_{t+1}$.
*   **Handling Latency (Lines 14-16):** If the asynchronous inference for a new chunk is not yet completed ($\mathrm{NotCompleted}(\mathrm{async\_handle})$), the `RobotClient` continues to use the previous `action queue` $\mathbf{A}_t$.
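
The queue-management logic of Algorithm 1 can be sketched in plain Python with a background thread standing in for the PolicyServer; the policy call, observation capture, aggregation rule, and timings below are illustrative stand-ins, not the library's implementation.

```python
# Illustrative, self-contained sketch of Algorithm 1's asynchronous control loop.
from collections import deque
from concurrent.futures import ThreadPoolExecutor
import time

H_A, G, HORIZON = 50, 0.5, 500        # chunk size H_a, threshold g, horizon T

def get_observation():                 # stand-in for capturing camera/proprio data
    return time.time()

def policy(obs):                       # stand-in for pi(o): returns a chunk of H_A actions
    time.sleep(0.05)                   # emulate server-side inference latency
    return [f"a({obs:.0f},{i})" for i in range(H_A)]

def execute(action):                   # stand-in for sending the action to the motors
    time.sleep(0.01)                   # emulate the control cycle

def aggregate(old, new):               # f(A_t, A~_{t+1}): here we simply switch to the new chunk
    return deque(new)

executor = ThreadPoolExecutor(max_workers=1)   # plays the role of the PolicyServer
queue = deque(policy(get_observation()))       # A_0 <- pi(o_0)
pending = None

for t in range(HORIZON):
    execute(queue.popleft())                          # a_t <- PopFront(A_t)
    if len(queue) / H_A < G and pending is None:      # queue below threshold g
        pending = executor.submit(policy, get_observation())  # non-blocking AsyncInfer
    if pending is not None and pending.done():
        queue = aggregate(queue, pending.result())    # merge the freshly predicted chunk
        pending = None
    # otherwise: keep draining the current queue while the new chunk is being computed
```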

4.5.2. Analytical Study of Async Inference

The behavior of asynchronous inference can be analyzed to understand its trade-offs.

  • Latency (\ell): The total time required to get a new action chunk after sending an observation. This includes client-to-server communication (tCSt_{C \to S}), server inference time (S\ell_S), and server-to-client communication (tSCt_{S \to C}).
    • E[]=E[tCS]+E[S]+E[tSC]\mathbb{E}[\ell] = \mathbb{E}[t_{C \to S}] + \mathbb{E}[\ell_S] + \mathbb{E}[t_{S \to C}].
    • Assuming communication times are small, E[]E[S]\mathbb{E}[\ell] \simeq \mathbb{E}[\ell_S].
  • Control Cycle (Δt\Delta t): The time between consecutive robot actions. For 30 frames-per-second (fps), Δt=33ms\Delta t = 33 \mathrm{ms}.
  • Queue Exhaustion Threshold: To avoid idle periods (the robot waiting for a new chunk with an empty queue), the threshold $g$ must satisfy
$$
g \ge \frac{\mathbb{E}[\ell_S] / \Delta t}{H_a}.
$$
This shows that if server latency $\mathbb{E}[\ell_S]$ is high or the chunk size $H_a$ is small, $g$ needs to be set higher to trigger new inference requests earlier, ensuring the queue is replenished before it is empty.
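
As a purely illustrative instance of this bound (the numbers are assumptions, not measurements from the paper), take a server latency of $\mathbb{E}[\ell_S] = 100\,\mathrm{ms}$, a 30 fps control cycle ($\Delta t \approx 33\,\mathrm{ms}$), and a chunk size $H_a = 50$:
$$
g \;\ge\; \frac{100 / 33}{50} \;\approx\; 0.06,
$$
so triggering a new request whenever fewer than about 6% of the actions (roughly 3 actions) remain in the queue is already enough to hide the inference latency under these assumptions.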

4.5.3. Impact of Threshold gg

The parameter gg (fraction of chunk size remaining to trigger new inference) governs the trade-off:

  • Sequential Limit (g=0g=0): The RobotClient drains the entire action chunk before requesting a new one. This leads to idle periods during inference, equivalent to synchronous inference.

  • Asynchronous Inference (g(0,1)g \in (0,1)): The client consumes a (1-g) fraction of its queue before triggering new inference. This amortizes computation while keeping the queue filled.

  • Sync-Inference Limit (g=1g=1): A new observation is sent at virtually every timestep, similar to synchronous inference (as in Zhao et al., 2023). This keeps the queue almost always full but incurs a high compute price.

    The image below (Figure 33 from the original paper) illustrates how the action queue size evolves for different values of gg.


Figure 33 | Action queue size evolution at runtime for various levels of gg when (A) not filtering out observation based on joint-space similarity and (B) with joint-space similarity filtering, showing how filtering reduces unnecessary updates.

4.5.4. Code Example: Using Async Inference

The paper provides code snippets Code 11 and Code 12 for spinning up a remote server and attaching a robot client for asynchronous inference. These snippets demonstrate:

  • PolicyServerConfig and serve() function to start the server.
  • RobotClientConfig to configure the client with robot details, server address, policy type, and chunking parameters (chunk_size_threshold for gg, actions_per_chunk for HaH_a).
  • The RobotClient then initiates a control_loop that manages observation capture, asynchronous inference requests, and action execution.
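
The following sketch mirrors the server/client split described for Codes 11-12, using the names given in the text (`PolicyServerConfig`, `serve()`, `RobotClientConfig`, `chunk_size_threshold`, `actions_per_chunk`, `control_loop`); the module path, constructor arguments, and defaults are assumptions and may differ in lerobot.

```python
# A minimal sketch of asynchronous inference deployment; module paths and exact
# constructor arguments are assumptions, not the library's documented API.

# --- on the GPU machine ---
from lerobot.async_inference import PolicyServerConfig, serve   # assumed path

serve(PolicyServerConfig(host="0.0.0.0", port=8080))            # blocks, serving chunk requests

# --- on the robot ---
from lerobot.async_inference import RobotClient, RobotClientConfig  # assumed path

config = RobotClientConfig(
    server_address="192.168.1.42:8080",
    policy_type="act",                 # which policy the server should run
    actions_per_chunk=50,              # H_a
    chunk_size_threshold=0.5,          # g: request a new chunk below 50% remaining
)
RobotClient(config).control_loop()     # capture obs, async requests, action execution
```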

4.6. Generalist Robot Policies (VLAs)

The final advancement discussed is the development of generalist robot policies, often termed Vision-Language-Action (VLA) models. These aim to transcend task-specific limitations by learning policies that can operate across various embodiments and tasks, guided by natural language instructions.

The image below (Figure 34 from the original paper) illustrates the ambition of generalist robot policies to generalize to new deployment scenarios.


Figure 34 | The figure is a schematic diagram illustrating two different strategies of datasets and model architectures in computer vision/natural language processing and robotics: large-scale open datasets with scalable architectures for zero-shot task generalization, and small-scale task-specific datasets with small, crafted architectures for task-specific models.

The image below (Figure 35 from the original paper) provides a timeline of major robot imitation learning models and datasets, showcasing the evolution towards generalist policies.


Figure 35 | The figure is a timeline diagram showing the release dates of various robot imitation learning models and datasets from February 2022 to June 2025, distinguishing between large-scale closed-source and manageable-size open-source models.

The image below (Figure 36 from the original paper) illustrates the growth of datasets and model parameters over time in robot learning.


Figure 36 | The figure is a chart composed of three parts showing the number of trajectories in different robotic learning datasets (left), the growth trend of trajectory numbers over time on the LeRobot platform (middle), and the trend of model parameter numbers over time for different models (right).

Modern VLAs leverage unified Transformer models for computational efficiency, featuring specialized sub-components for visual perception and action prediction. They use language conditioning to achieve cross-task performance. Crucially, many VLAs avoid discrete action tokens by modeling continuous action distributions p(at:t+Haot)p(a_{t:t+H_a}|o_t) and use generative models (flow matching) to learn from inherently non-i.i.d. data.

4.6.1. VLMs for VLAs

Vision-Language Models (VLMs) are central to VLAs. They are designed to process both visual and textual modalities, often by using pre-trained LLMs and adapting them to multimodal data. They learn a rich semantic understanding of the world without explicit supervision for each concept. Integrating VLMs as the perceptual backbone for VLAs allows robots to leverage this semantic knowledge, improving generalization to novel scenarios.

4.6.2. π0\pi_0

π0\pi_0 (Black et al., 2024) is a VLA that combines a large VLM backbone (initialized from Gemma 2.6B) with a dedicated action expert for generating continuous actions via flow matching.

4.6.2.1. π0\pi_0 Architecture

The image below (Figure 37 from the original paper) illustrates the π0\pi_0 architecture.

Figure 37 | The $\pi_0$ architecture, as in Black et al. (2024). Vision and language embeddings are routed to the VLM backbone, and the action expert $p_\theta(a_{t:t+H_a}|o_t)$ generates the action chunk; the model is trained on trajectories from a mixture of closed and openly available datasets.

π0\pi_0 is a single, unified Transformer with two disjoint sets of weights ϕ\phi and θ\theta:

  1. VLM Backbone (fϕf_\phi): A large model (initialized from Gemma 2.6B) that processes multiple image frames (from various cameras) and a language instruction describing the task.

  2. Action Expert (vθv_\theta): A separate Transformer (e.g., 300M parameters) that processes the robot proprioceptive state qtq_t and a noised action chunk at:t+Haa_{t:t+H_a}' at a specific time τ\tau in the flow-matching process.

    The different expert networks (VLM backbone and action expert) operate separately but communicate through self-attention layers. π0\pi_0 uses a blockwise causal attention mask:

$$
\begin{array}{c|ccc}
 & \mathcal{T}_i & \mathcal{T}_q & \mathcal{T}_a \\
\hline
\mathcal{T}_i & 1 & 0 & 0 \\
\mathcal{T}_q & 1 & 1 & 0 \\
\mathcal{T}_a & 1 & 1 & 1
\end{array}
\qquad 1: \text{Bidirectional Attention}, \quad 0: \text{Masked Attention}
$$

Here:

  • Ti\mathcal{T}_i: Tokens for image and language.
  • Tq\mathcal{T}_q: Tokens for proprioceptive state.
  • Ta\mathcal{T}_a: Tokens for action chunk.
  • 1: Indicates bidirectional attention (full visibility between tokens within and across blocks).
  • 0: Indicates masked attention (no visibility). This mask indicates that:
    • Image/language tokens (Ti\mathcal{T}_i) can only attend to other image/language tokens.
    • Proprioceptive tokens (Tq\mathcal{T}_q) can attend to image/language tokens and other proprioceptive tokens.
    • Action tokens (Ta\mathcal{T}_a) can attend to all previous token types. This structure ensures efficient computation by preventing excessive communication between blocks, particularly between image/language tokens and proprioceptive/action tokens, which are handled by different expert networks.
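
To make the blockwise structure concrete, here is a minimal sketch of how such a mask could be constructed in PyTorch; the token ordering (image/language, proprioception, action) follows the description above, while the block sizes and the helper name are hypothetical.

```python
# A minimal sketch of the blockwise causal attention mask described above.
import torch

def blockwise_causal_mask(n_img_lang: int, n_proprio: int, n_action: int) -> torch.Tensor:
    """Build a boolean attention mask where True means 'may attend'."""
    sizes = [n_img_lang, n_proprio, n_action]
    starts = [0, sizes[0], sizes[0] + sizes[1]]
    total = sum(sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Block-level lower-triangular structure: query block j may attend to key block i iff i <= j,
    # which gives bidirectional attention within a block and causal attention across blocks.
    for j, (js, jn) in enumerate(zip(starts, sizes)):          # query block
        for i, (is_, in_) in enumerate(zip(starts, sizes)):    # key block
            if i <= j:
                mask[js:js + jn, is_:is_ + in_] = True
    return mask

mask = blockwise_causal_mask(n_img_lang=256, n_proprio=1, n_action=50)
print(mask.shape)  # torch.Size([307, 307])
```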

4.6.2.2. π0\pi_0 Training Objective

π0\pi_0 is trained using a flow matching loss, updating both the VLM backbone (ϕ\phi) and action expert (θ\theta) parameters jointly. The objective function is:

$$
\mathcal{L}(\phi, \theta) = \mathbb{E}_{\tau, \epsilon, o_t, a_{t:t+H_a}} \Bigl[ \bigl\| v_\theta \bigl( \underbrace{\tau a_{t:t+H_a} + (1 - \tau)\epsilon}_{\tilde{a}_{t:t+H_a}},\; o_t,\; \tau \bigr) - (\epsilon - a_{t:t+H_a}) \bigr\|^2 \Bigr], \qquad \tau \sim \mathrm{Beta}_{[0,s]}(1.5, 1), \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad o_t, a_{t:t+H_a} \sim \mathcal{D}
$$

Here:

  • vθ(a~t:t+Ha,ot,τ)v_\theta(\tilde{a}_{t:t+H_a}, o_t, \tau) is the vector field predicted by the action expert (parameterized by θ\theta), conditioned on a noised action chunk a~t:t+Ha\tilde{a}_{t:t+H_a}, the observation oto_t, and the flow timestep τ\tau.

  • a~t:t+Ha=τat:t+Ha+(1τ)ϵ\tilde{a}_{t:t+H_a} = \tau a_{t:t+H_a} + (1-\tau)\epsilon represents a linear interpolation between the true action chunk at:t+Haa_{t:t+H_a} and Gaussian noise ϵ\epsilon, representing a state along the flow path.

  • (ϵat:t+Ha)(\epsilon - a_{t:t+H_a}) is the target vector field for this specific flow matching setup, aiming to transform the noisy ϵ\epsilon to the clean at:t+Haa_{t:t+H_a}. This formulation appears to be a variant of Flow Matching where the target is not z1z0z_1-z_0 but effectively a noise prediction that guides the flow towards the true action chunk.

  • τ\tau is the flow timestep, sampled from a Beta distribution Beta[0,s](1.5,1)\mathrm{Beta}_{[0,s]}(1.5, 1) defined on the interval [0, s] (where s<1s < 1). This non-uniform sampling emphasizes higher noise levels (earlier timesteps) during training, helping the model learn the mean of the data distribution.

  • ϵN(0,I)\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) is Gaussian noise.

  • ot,at:t+HaDo_t, a_{t:t+H_a} \sim \mathcal{D} are observations and action chunks from the expert dataset.

    The image below (Figure 38 from the original paper) shows the modified Beta distribution used for sampling $\tau$.

Figure 38 | Unlike more traditional flow-matching algorithms, π0\pi_0 uses a modified distribution to sample the timestep τ\tau from during training and inference, favouring earlier timestamps corresponding to noisier chunks.

At inference time, π0\pi_0 generates actions by iteratively integrating the predicted vector field to denoise an initial noisy action chunk. The flow matching process enables faster inference with a limited number of denoising steps (as few as 10).
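
To make the objective and the inference-time integration concrete, here is a minimal sketch (not the authors' implementation); `velocity_model` is a hypothetical stand-in for the action expert $v_\theta$, the conditioning on observations is abstracted into a single `obs` argument, and the value of $s$ is assumed.

```python
# A minimal sketch of a pi_0-style flow-matching loss and Euler-integration sampling.
import torch

def flow_matching_loss(velocity_model, obs, actions, s=0.999):
    """actions: (B, H_a, action_dim); obs: whatever the model conditions on."""
    B = actions.shape[0]
    # Sample tau from Beta(1.5, 1) scaled to [0, s], emphasising noisier (earlier) timesteps.
    tau = torch.distributions.Beta(1.5, 1.0).sample((B, 1, 1)) * s
    eps = torch.randn_like(actions)
    noisy = tau * actions + (1.0 - tau) * eps   # linear interpolation between data and noise
    target = eps - actions                      # target vector field for this parameterization
    pred = velocity_model(noisy, obs, tau)
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def sample_actions(velocity_model, obs, chunk_shape, n_steps=10):
    """Integrate the learned field from noise (tau=0) toward data (tau=1) in n_steps steps."""
    a = torch.randn(chunk_shape)                # start from pure Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = torch.full((chunk_shape[0], 1, 1), k * dt)
        v = velocity_model(a, obs, tau)
        a = a - dt * v   # since the target is (eps - a), stepping against v moves toward data
    return a
```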

4.6.2.3. Data and Cross-Embodiment

π0\pi_0 is trained on a massive dataset called "the π\pi dataset" (10M+ trajectories), which combines proprietary data with open datasets like Open-X and DROID. It achieves cross-embodiment capabilities by training on diverse data and handling different robot embodiments by zero-padding actions for robots with fewer degrees of freedom to match a maximal configuration size. It also relies on exactly three camera views and uses masked image slots for fewer cameras.

4.6.2.4. Code Example: Using π0\pi_0

`Code 13` demonstrates using $\pi_0$, similar to previous examples but for the `PI0Policy` class, configured for multimodal inputs (`task` and `robot_type`) and multiple camera views.
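
A minimal sketch of such an inference call is shown below; the import path, the `lerobot/pi0` checkpoint name, and the observation keys and shapes follow common lerobot conventions but are assumptions here.

```python
# A minimal sketch of querying a pi_0 policy with language conditioning.
import torch
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy  # assumed import path

policy = PI0Policy.from_pretrained("lerobot/pi0")  # assumed checkpoint name
policy.eval()

batch = {
    # Illustrative camera keys and shapes; real deployments should match the config's features.
    "observation.images.cam_high": torch.zeros(1, 3, 224, 224),
    "observation.images.cam_left_wrist": torch.zeros(1, 3, 224, 224),
    "observation.images.cam_right_wrist": torch.zeros(1, 3, 224, 224),
    "observation.state": torch.zeros(1, 14),
    "task": ["pick up the cube and place it in the bin"],  # natural-language conditioning
}
with torch.no_grad():
    action = policy.select_action(batch)  # first action of the predicted chunk
print(action.shape)
```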

## 4.6.3. SmolVLA

`SmolVLA` `(Shukor et al., 2025)` is an `open-source VLA` designed for `efficiency` and `accessibility`. It also employs a `Mixture of Experts (MoE)` architecture with a `pre-trained VLM backbone` and a dedicated `action expert`, trained with `flow matching`.

### 4.6.3.1. SmolVLA Architecture

The image below (Figure 39 from the original paper) illustrates the `SmolVLA` architecture.

![Schematic of the SmolVLA architecture: image and language embeddings, self-attention and cross-attention blocks in the VLM, and the action expert module](/files/papers/6907680a971e575bdfc172d9/images/22.jpg)

Figure 39 | The SmolVLA architecture as in Shukor et al. (2025). SmolVLA is a compact MoE model trained with a mix of proprietary and open data. It explicitly truncates VLM layers and visual tokens, resulting in 7x less memory usage than π0\pi_0 (450M parameters vs. π0\pi_0's 3.3B).

Key design choices for efficiency and accessibility:
*   **Compact VLM Backbone:** Uses `SmolVLM-2` `(Marafoti et al., 2025)` as its `VLM backbone`, which itself uses `SigLIP` `(Zhai et al., 2023)` for visual encoding and `SmolLM2` `(Allal et al., 2025)` as a language decoder.
*   **Smaller Action Expert:** Consists of approximately 100M parameters.
*   **Reduced Visual Tokens and Layers:** It reduces `visual tokens` via `pixel shuffling` to a fixed budget (e.g., 64 tokens per frame) and skips upper `VLM layers` (using only features from the first N=L/2N=L/2 decoder layers) to halve compute needs.
*   **Interleaving Cross-Attention (CA) and Self-Attention (SA):** The `action expert` interleaves `CA` and `SA` layers. `SA` operates on `action tokens`, while `CA` uses `action tokens` as queries and projects `visual` and `proprioceptive tokens` from the `VLM backbone` as keys and values. This is shown to yield higher success and smoother `action chunks`.
*   **Causal Masking:** Employs `simple causal masking` rather than `blockwise causal attention masking`.

    These design choices result in a significantly smaller model (450M parameters) compared to $\pi_0$ (3.3B parameters). `SmolVLA` consumes `multi-view RGB images`, `natural-language instructions`, `projected proprioceptive state tokens`, and `noised action chunks` as inputs.
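
The pixel-shuffle token reduction mentioned above can be pictured as merging neighbouring patch tokens into wider ones; the sketch below (grid size, ratio, and feature width are assumed for illustration) shows one way to go from 256 to 64 visual tokens per frame.

```python
# A minimal sketch of pixel-shuffle-style visual-token reduction (shapes are assumptions).
import torch

def pixel_shuffle_tokens(tokens: torch.Tensor, grid: int, ratio: int = 2) -> torch.Tensor:
    """tokens: (B, grid*grid, D). Groups ratio x ratio neighbouring patch tokens into one
    token by concatenating along the feature dim, reducing the token count by ratio**2."""
    B, N, D = tokens.shape
    assert N == grid * grid
    x = tokens.view(B, grid, grid, D)
    x = x.view(B, grid // ratio, ratio, grid // ratio, ratio, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (grid // ratio) ** 2, ratio * ratio * D)
    return x  # e.g. 16x16 = 256 tokens -> 64 tokens with 4x wider features

reduced = pixel_shuffle_tokens(torch.randn(1, 256, 768), grid=16)
print(reduced.shape)  # torch.Size([1, 64, 3072])
```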

### 4.6.3.2. Data

`SmolVLA` pretrains exclusively on `450+ community datasets`, totaling `20k+ trajectories`. It addresses noise and missing information in these datasets by re-annotating tasks and standardizing viewpoints.

### 4.6.3.3. Code Example: Using SmolVLA

`Code 14` demonstrates `using SmolVLA`, similar to $\pi_0$ but specifically for the `SmolVLAPolicy` class, also configured for multimodal inputs and multiple camera views.

# 5. Experimental Setup

The provided text primarily focuses on introducing the theoretical foundations and architectural designs of various `robot imitation learning` models, rather than detailing specific experimental setups with quantitative results for each. Therefore, this section will outline the general characteristics of datasets, evaluation metrics (as implied), and baselines based on the models discussed.

## 5.1. Datasets
The paper references several types of datasets used in `robot imitation learning`, emphasizing the shift from small, proprietary datasets to large-scale, diverse, and often `open-source` collections.

*   **Expert Demonstrations ($\mathcal{D}$):** The fundamental data source for all `Behavioral Cloning` approaches. These are `offline`, `reward-free trajectories` consisting of `observation-action pairs` collected from human teleoperation or other expert systems. They often have `variable lengths` and may include `multimodal strategies` for the same goal.
*   **ALOHA (A Low-cost, Open-source Hardware for Bimanual Teleoperation):** Mentioned as a platform for collecting human demonstrations. `ALOHA`'s accessibility helps in generating data for `ACT`.
*   **Open-X Dataset:** A large-scale, `openly available` dataset of robot trajectories. `OpenVLA` (Kim et al., 2024) is trained exclusively on `Open-X` (970k+ trajectories). It is also a component of larger mixed datasets like the $\pi$ dataset.

  • DROID (Distributed Robot Interaction Dataset) (Khazatsky et al., 2025): Another large-scale, open-source dataset of robot manipulation trajectories. It serves as a building block for general-purpose robot policies.

  • The π\pi dataset (for π0\pi_0): This is described as the largest dataset used to develop a foundational robotics model to date, comprising 10M+ trajectories. It is a mix of proprietary and openly available data, including Open-X and DROID, but with only approximately 9.1% being openly available. It includes dexterous demonstrations across various robot configurations and 6 different tasks.

  • Community-Contributed Datasets (for SmolVLA): SmolVLA (Shukor et al., 2025) is specifically designed to train on 450+ community datasets, totaling 20k+ trajectories. This highlights an effort to democratize robot learning by leveraging data from accessible platforms like the SO-100 and SO-101 arms. The authors address potential noise or missing instructions in these datasets by re-annotating tasks and standardizing viewpoints.

  • LeRobotDataset (in Code Examples): The provided Python code snippets use lerobot/svla_so101_pickplace as an example LeRobotDataset. This indicates the use of standardized datasets for robot learning benchmarks and practical implementation.

    The image below (Figure 1 from the original paper) shows examples of time-varying command curves of multiple robotic arm joint movements alongside physical photos, illustrating the kind of interaction data collected.

Figure 1 | Time-varying action commands for multiple arm joints (panel A) alongside photographs of the arm executing the motions (panel B), illustrating the temporal structure of the commands and the kind of interaction data collected.

The image below (Figure 19 from the original paper) shows example observation-action pairs from a dataset.


Figure 19 | The image is an illustration from Figure 29 showing the CVAE decoder used in ACT, featuring a full encoder-decoder Transformer architecture. Camera observations from multiple views are embedded by pretrained visual encoders and combined with positional embeddings, then fed along with state embeddings and the style variable zz^* from the CVAE encoder into the Transformer for inference. The encoder shares matrices K, V with the decoder to decode fixed positional embeddings into action sequences.

The image below (Figure 20 from the original paper) shows examples of robot arm motions (joint angles) and corresponding visual observations and actions.


Figure 20 | The image is an illustration of robotic arm motions, showing joint angles at different time points q0,q110,q200q_0, q_{110}, q_{200} along with the corresponding visual observations o0,o110,o200o_0, o_{110}, o_{200} and actions a0,a110,a200a_0, a_{110}, a_{200}. It depicts the process of the robot learning actions from human demonstration data through behavioral cloning.

5.2. Evaluation Metrics

The paper discusses the capabilities and performance of different models, but it does not explicitly define the mathematical formulas for evaluation metrics in this excerpt. However, based on the context of robot learning and the claims made, the primary implicit evaluation metrics would revolve around:

  1. Success Rate:

    • Conceptual Definition: Measures the proportion of trials or episodes where the robot successfully completes the specified task, according to predefined success criteria. This is the most common and intuitive metric for task-oriented robotics.
    • Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} $
    • Symbol Explanation:
      • Number of Successful Trials: The count of attempts where the robot achieved the task goal.
      • Total Number of Trials: The total count of attempts made by the robot.
  2. Action Smoothness / Consistency:

    • Conceptual Definition: Assesses how fluid and natural the robot's movements are, often reflecting the stability and human-likeness of the learned policy. This can be quantified by measuring variations in joint velocities or accelerations.
    • Mathematical Formula: Not explicitly provided or universally standardized in the paper. It could involve metrics like Joint Jerk (the derivative of acceleration) or the L2 norm of action differences over time. For example, for a sequence of actions $a_t$: $ \text{Smoothness} = \frac{1}{T-1} \sum_{t=0}^{T-2} \|a_{t+1} - a_t\|_2^2 $
    • Symbol Explanation:
      • $T$: Total number of timesteps.
      • $a_t$: Action at time $t$.
      • $\|\cdot\|_2^2$: Squared L2 norm, measuring the magnitude of change between consecutive actions. Lower values indicate smoother actions.
  3. Generalization Capabilities:

    • Conceptual Definition: Measures the model's ability to perform well on new, unseen tasks, environments, or robot embodiments that were not part of the training data. This is crucial for generalist policies.
    • Quantification: Often evaluated by success rate on held-out tasks or zero-shot performance (performing tasks without any prior specific training for them).
  4. Computational Efficiency / Latency:

    • Conceptual Definition: Measures the computational resources (e.g., memory usage, inference time, number of denoising steps) required by the policy, especially important for real-time control on resource-constrained hardware.
    • Quantification: Reported as metrics like inference time (ms), memory footprint (GB), or number of denoising steps in Diffusion Models.

5.3. Baselines

The paper discusses comparisons with various baselines, either explicitly or implicitly:

  • Point-Wise Policies (Standard BC): This is the fundamental baseline for imitation learning. The paper states that generative models prove more effective than point-wise policies at dealing with multimodal demonstration datasets. This implies point-wise BC models struggle with averaging across modes.
  • Simple Supervised Objective (L1 loss): Zhao et al. (2023) (for ACT) ablated their GM approach against a simpler supervised objective using L1 loss (L1(a,a)=aa1\mathcal{L}_1(a, a') = \|a - a'\|_1). They found GM to be superior for human demonstrations (unscripted, multimodal) but comparable for scripted demonstrations.
  • RL Algorithms: The entire premise of imitation learning in the paper is to provide an alternative to RL, implying that RL (especially with its exploration risks and reward design challenges) serves as a conceptual baseline against which the pragmatism of BC is measured.
  • Base Models for Generalist Policies:
    • π0scratch\pi_0^{\mathrm{scratch}} Baseline: For π0\pi_0, a version trained from scratch is compared against the pre-trained and fine-tuned version, demonstrating the benefits of large-scale pre-training and high-quality data.
    • π0\pi_0 (for SmolVLA): SmolVLA is explicitly compared against π0\pi_0 to highlight its advancements in computational efficiency and accessibility while rivalling its performance.
  • Different Denoising Paradigms: For Diffusion Policy, comparisons are made implicitly against DDPM (Denoising Diffusion Probabilistic Models) and DDIM (Denoising Diffusion Implicit Models), with DP adopting a deterministic denoising paradigm in the spirit of DDIM to achieve fewer steps.

6. Results & Analysis

The paper, being a tutorial or survey-style chapter, primarily discusses the capabilities and general performance trends of the described methods rather than presenting detailed experimental results with quantitative tables. However, it provides key qualitative and some quantitative claims that highlight the effectiveness and advantages of each proposed approach.

6.1. Core Results Analysis

The analysis focuses on how different methodologies effectively address the challenges of robot imitation learning, particularly in handling multimodal demonstrations and scaling to generalist policies.

6.1.1. Behavioral Cloning with Generative Models

  • Effectiveness against Multimodality: The paper states that generative models (like VAEs, DMs, FMs) prove more effective than point-wise policies at dealing with multimodal demonstration datasets. This is because point-wise policies tend to average across modes, leading to indecisive or unstable actions, whereas generative models can represent and sample from the diverse strategies in the data.

6.1.2. Action Chunking with Transformers (ACT)

  • Handling Multimodal Human Demonstrations: Zhao et al. (2023) found that generative models (CVAEs in ACT) were significantly more competitive when learning from human demonstrations (which are inherently multimodal and often noisy) compared to a simple supervised learning objective (e.g., L1 loss). Specifically, learning from human demonstrations was hindered by -33.3% when using a standard supervised learning objective compared to a richer CVAE objective.
  • Scripted vs. Human Demonstrations: For scripted demonstrations (less multimodal), GM and supervised learning approaches showed comparable performance. This suggests the primary benefit of generative models in ACT lies in their ability to capture the complexity of real human behaviors.
  • Action Chunking Benefits: ACT's use of action chunking improved success rates dramatically, with success reportedly rising from 1% to 44% relative to point-wise (single-action) policies (implied by context or another referenced work).

6.1.3. Diffusion Policy (DP)

  • Data Efficiency: DP can be trained with as little as 50-150 demonstrations (approximately 15-60 minutes of teleoperation data) while exhibiting strong performance on various simulated and real-world tasks.
  • Outperformance across Dataset Sizes: Chi et al. (2024) found that DP reliably outperforms the considered baselines for all dataset sizes, demonstrating its robustness and scalability.
  • Inference Efficiency: DP can obtain fully-formed action chunks with as little as T=10T=10 denoising steps, making it efficient enough for real-time robotic control. This is a 10x improvement in denoising steps compared to some stochastic DPM variants.
  • Importance of Observation History: The combination of conditioning on a horizon of previous observations and action chunking at inference time is claimed to be essential for good performance and avoiding indecisiveness.

6.1.4. Optimized Inference (Asynchronous Inference)

  • Trade-off Governed by gg: The parameter gg (threshold for remaining actions in the queue) clearly illustrates a trade-off (Figure 33).
    • Small gg (e.g., g=0g=0 for sequential limit) results in idle periods as the robot waits for new chunks.
    • Large gg (e.g., g=1g=1 for sync-inference limit) assumes a highly accurate model and incurs a significant compute price by requesting new chunks almost constantly.
    • Choosing g(0,1)g \in (0,1) allows for a balance, amortizing computation while keeping the action queue filled.
  • Mitigating Stalling: By asynchronously predicting new action chunks and filtering out near-duplicate observations, the system effectively avoids the robot stalling due to an empty queue, providing smooth operation even with latency.

6.1.5. Generalist Robot Policies (π0\pi_0 and SmolVLA)

  • π0\pi_0 (Black et al., 2024):
    • Largest Dataset to Date: Trained on "the π\pi dataset" comprising 10M+ trajectories, which is claimed to be the largest dataset for a foundational robotics model.
    • Broadly Capable Base Model: Pre-training on the $\pi$ dataset yields a broadly capable base model that, when fine-tuned with extra high-quality data, consistently outperforms a π0scratch\pi_0^{\mathrm{scratch}} baseline across a variety of benchmarks.
    • Cross-Embodiment: Demonstrates the ability to control mobile and static manipulator robots with varying arm embodiments, attributing this to the large-scale cross-embodiment data in the π\pi dataset.
    • Faster Inference: Flow matching allows for faster inference with as few as 10 denoising steps.
  • SmolVLA (Shukor et al., 2025):
    • Efficiency and Accessibility: Designed as a compact model (450M parameters vs. π0\pi_0's 3.3B), resulting in 7x less memory usage than π0\pi_0. It is also 40% faster.
    • Comparable Performance: Despite its smaller size, SmolVLA is reported to rival π0\pi_0's performance in both real-world and simulated environments.
    • Open-Source and Community Data: Pretrains exclusively on 450+ community datasets, demonstrating the viability of open-source efforts.

6.2. Data Presentation (Figures)

The paper includes several figures that illustrate key concepts or qualitative results, rather than tables of quantitative experimental results. These figures were embedded and discussed in the methodology section, as they primarily depict architectures, processes, or high-level comparisons.

6.3. Ablation Studies / Parameter Analysis

The paper mentions the following ablation studies and parameter analyses:

  • ACT: Generative vs. Supervised Objective: Zhao et al. (2023) explicitly ablated their GM approach against a simpler supervised objective (using L1 loss). The finding that GM was more competitive for human demonstrations (unscripted, multimodal) validates the effectiveness of the CVAE's ability to model complex, multimodal action distributions.
  • ACT: Impact of Action Chunking: The paper references action chunking showing 1% vs. 44% success rate as an implicit ablation, highlighting the importance of predicting action sequences over single actions.
  • ACT: $\beta$-CVAE Hyperparameter: ACT is trained as a $\beta$-CVAE, where $\beta$ regulates the information condensed in the latent space. While specific results of $\beta$ tuning are not provided, this indicates a parameter that controls the expressivity of the latent space.
  • Diffusion Policy: Transformer Network for Noise Regressor: Chi et al. (2024) found that DPs were particularly performant when modeling $\epsilon_\theta$ with a Transformer-based network. However, they note the increased sensitivity of Transformer networks to hyperparameters, indicating a trade-off between performance and tuning complexity.
*   **Optimized Inference: Threshold gg:** The `asynchronous inference control-loop` explicitly includes the parameter gg. Figure 33 (Action queue size evolution) serves as an `ablation study` showing the impact of gg on queue management and `idle time`, validating the concept of dynamic queue replenishment.
  • $\pi_0$: Pre-training vs. From Scratch: The comparison between $\pi_0$ pretrained on the $\pi$ dataset and $\pi_0^{\mathrm{scratch}}$ serves as a crucial ablation. The finding that the pretrained version consistently outperforms the scratch baseline validates the benefit of large-scale pre-training and diverse data.
*   **π0\pi_0: Modified Timestep Sampling (`Beta` distribution for τ\tau):** The use of a `Beta distribution` Beta[0,s](1.5,1)\mathrm{Beta}_{[0,s]}(1.5, 1) for sampling τ\tau (instead of `uniform`) is an explicit design choice. This `ablation` emphasizes `higher noise levels` during training to learn the mean of the data distribution, and reducing the support `[0,s]` helps optimize inference time.
  • SmolVLA: Architectural Choices: SmolVLA's design choices (compact VLM backbone, smaller action expert, reduced visual tokens, skipped VLM layers, interleaving CA/SA) are implicitly ablations against larger or less optimized designs. The results validate that these choices lead to a smaller, faster, and memory-efficient model while retaining comparable performance to $\pi_0$.

    These analyses demonstrate the authors' rigor in understanding the contribution of different components and hyperparameters to the overall performance and efficiency of the proposed `robot learning` methodologies.

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary
This tutorial chapter provides a comprehensive overview of the transformative shift in `robot learning`, moving from traditional `model-based control` to dynamic, `data-driven approaches`. It highlights `Behavioral Cloning (BC)` as a practical, reward-free alternative to `Reinforcement Learning (RL)`, particularly emphasizing the benefits of `generative models` (`VAEs`, `Diffusion Models`, `Flow Matching`) for handling `multimodal expert demonstrations`. The chapter details advanced `imitation learning` techniques like `Action Chunking with Transformers (ACT)` and `Diffusion Policy`, which leverage `action chunking` and `Transformer` architectures to learn complex behaviors from human data efficiently. Furthermore, it introduces `optimized inference` strategies, such as `asynchronous inference`, crucial for real-time deployment on hardware. The tutorial culminates by discussing the emergence of `generalist robot policies` or `Vision-Language-Action (VLA) models` (e.g., $\pi_0$, `SmolVLA`), which integrate `Vision-Language Models (VLMs)` and `natural language conditioning` to achieve `cross-task` and `cross-embodiment capabilities`. A significant underlying theme is the growing importance of `open-source hardware`, `software`, and `decentralized data collection` to democratize and accelerate progress in the field of `foundational robotics`.

## 7.2. Limitations & Future Work
The paper implicitly or explicitly points out several limitations of current `robot learning` approaches and suggests directions for future work:

*   **Limitations of BC:**
    *   **Suboptimal Decisions:** `BC` can only reproduce behaviors that are "at best as good as those of the demonstrator" and has no mechanism to remedy `suboptimal decisions` made by the human expert.
    *   **Scarce Demonstrations:** `BC` can be problematic in `sequential decision-making tasks` where `expert demonstrations are scarce`, as data collection can be `expensive` or `time-consuming`.
    *   **Covariate Shift:** While `generative models` help with `multimodality`, the fundamental `covariate shift` issue in `sequential non-i.i.d. data` remains a challenge for purely `offline BC`.
*   **Limitations of Diffusion Models:**
    *   **Inference Time:** Standard `Diffusion Models` can be `computationally expensive` at `inference time`, requiring a large number of `denoising steps` to recover a sample. `Flow Matching` and `deterministic denoising` (e.g., in `Diffusion Policy`) aim to mitigate this.
*   **Challenges of Generalist Models:**
    *   **Computational Resources:** `Large-scale robotic foundational models` (like $\pi_0$) require immense `computational resources` for training and deployment, which are `unattainable for most research institutions`, creating an `accessibility gap`.
    *   **Data Accessibility:** Much of the data used for `large proprietary models` remains `closed-source`, hindering replication and further research.
    *   **Hyperparameter Sensitivity:** `Transformer networks` (used in `DP` and `VLAs`) can be `sensitive to hyperparameters`, making them potentially more challenging to train with `non-smooth action sequences`.
    *   **Data Gaps for Failure Recovery:** `Expert demonstrations` often lack data on `failure states` or `recovery behaviors`, making it difficult for `autonomous agents` to learn how to recover from near-failure situations.
*   **Future Work Directions:**
    *   **Open-Source Robotics:** Emphasizing the convergence of `open-source hardware` and `software` to democratize access and foster community contributions.
    *   **Decentralized Datasets:** Encouraging `decentralized data collection` by individual researchers and practitioners to build larger, more diverse datasets.
    *   **Efficient VLA Designs:** Developing `compact` and `compute-efficient VLA designs` (like `SmolVLA`) that can run on more modest hardware.
    *   **Robustness to Imperfect Data:** Developing methods to handle `noisy` or `missing instructions` and `unstandardized viewpoints` in `community-contributed datasets`.
    *   **Learning from Failures:** Researching ways to collect or synthesize data that allows models to learn `recovery behaviors` from suboptimal or failure states.

## 7.3. Personal Insights & Critique
This chapter provides an excellent, rigorous, and beginner-friendly overview of the rapid advancements in `robot imitation learning`. The structured progression from basic `Behavioral Cloning` to complex `generalist Vision-Language-Action models` is highly illuminating.

**Key Insights:**
*   **The Power of Generative Models:** The clear explanation of why `generative models` are superior to `point-wise policies` for `multimodal demonstrations` is a crucial takeaway. This addresses a fundamental limitation of early `BC` and underpins much of the subsequent progress.
*   **Pragmatism in Robotics:** The consistent emphasis on `offline learning`, `reward-free approaches`, and `optimized inference` demonstrates a pragmatic approach to `robot learning`—prioritizing safety, scalability, and real-world deployability over purely theoretical optimality. The `asynchronous inference control-loop` is a particularly clever engineering solution to bridge the gap between `computationally intensive policies` and `real-time robotic control`.
*   **The Foundational Model Paradigm in Robotics:** The discussion on `VLAs` clearly shows how `robotics` is adopting the `foundational model` paradigm from `NLP` and `CV`. The integration of `VLMs` for high-level understanding and `flow matching` for `continuous action generation` is a powerful combination for achieving broad `task generalization`.
*   **The Importance of Openness:** The contrast between proprietary models like $\pi_0$ (with its mostly closed dataset) and open-source efforts like SmolVLA is a critical observation. The argument for democratizing access to foundational models through open-source releases and community-contributed datasets is vital for fostering widespread research and preventing a concentration of power in a few large institutions.

Potential Critique/Areas for Improvement:

  • Lack of Quantitative Results: While excellent for a tutorial, the absence of detailed quantitative experimental results (tables of specific performance metrics across tasks/robots) makes it difficult to directly compare the models' empirical superiority beyond qualitative statements. For a deeper academic understanding, specific benchmarks would be beneficial.
  • Assumed i.i.d. for GM Intro: The introduction to generative models assumes i.i.d. samples from D\mathcal{D}, which then contrasts with the later discussion of non-i.i.d. sequential data in BC. While a common pedagogical simplification, explicitly bridging this gap (e.g., how GMs are applied to non-i.i.d. sequences through action chunking or conditioning) could enhance clarity.
  • Formula Derivations: Some formulas, particularly in the VAE and DM sections (e.g., the full log-likelihood derivation for DMs), are truncated or presented with a "where we..." that implies steps not fully shown. While the core objectives are present, a complete, self-contained derivation would be ideal for beginners.
  • Specifics of Aggregation Function ff: In Algorithm 1 for asynchronous inference, the aggregation function f(At,A~t+1)f(\mathbf{A}_t, \tilde{\mathbf{A}}_{t+1}) is mentioned but not detailed (e.g., "aggregate overlaps (if any)"). Providing common strategies for this (e.g., exponential moving average on overlapping chunks, linear blending) would be helpful.

Transferability: The methodologies discussed, particularly the use of generative models for multimodal output distributions and Transformer-based architectures for sequential decision-making, are highly transferable.

  • Multimodal Output Generation: Beyond robot actions, these techniques could apply to any domain requiring generation of diverse, context-dependent outputs, such as human-robot interaction (generating diverse responses), creative content generation (e.g., music, art), or drug discovery (generating diverse molecular structures).

  • Autonomous Systems beyond Robotics: The optimized asynchronous inference paradigm is directly applicable to any autonomous system (e.g., autonomous vehicles, drones, industrial control systems) where complex AI policies need to operate in real-time with latency constraints on edge devices.

  • Vision-Language Integration: The VLA framework demonstrates a powerful way to integrate high-level semantic understanding with low-level control, a pattern that could be adapted for intelligent agents in virtual environments or complex simulation platforms.

    Overall, this chapter serves as an excellent reference point for anyone seeking to understand the current landscape and future trajectory of robot learning, providing both foundational knowledge and insights into cutting-edge research.
