
UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

UMI-on-Air trains embodiment-agnostic visuomotor policies from handheld (UMI) human demonstrations and, at inference time, uses Embodiment-Aware Diffusion Policy (EADP) to steer them with controller feedback, improving success, efficiency, and robustness on constrained embodiments such as aerial manipulators.

Abstract

We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments-such as aerial manipulators-is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller's tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on multiple long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse-and even highly constrained-embodiments. All code, data, and checkpoints will be publicly released after acceptance. Result videos can be found at umi-on-air.github.io.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies

1.2. Authors

The paper lists the following authors. Affiliations are not explicitly detailed in the provided abstract and introduction, but the † and ‡ symbols indicate shared institutional affiliations:

  • Harsh Gupta†

  • Xiaofeng Gao†

  • Huy Ha‡

  • Chuer Pan‡

  • Muqing Cao†

  • Dongjae Lee†

  • Sebastian Scherer†

  • Shuran Song‡

  • Guanya Shi†

    The † and ‡ symbols indicate shared affiliations, corresponding to specific institutions or research groups. The acknowledgment section mentions support from the Robotics Institute Summer Scholars program, partial funding by NSF and the Toyota Research Institute, and Guanya Shi holding appointments at Carnegie Mellon University and Amazon, suggesting the work was performed at Carnegie Mellon University.

1.3. Journal/Conference

The paper is described as being released "after acceptance" and includes arXiv preprints in its references. This suggests it is a preprint (e.g., on arXiv) and likely submitted to or accepted by a prominent robotics or machine learning conference (e.g., CoRL, ICRA, IROS, RSS) or journal.

1.4. Publication Year

Based on the arXiv preprint references, which are mostly from 2024 and 2025, the publication year for this work is likely 2024 or 2025.

1.5. Abstract

This paper introduces UMI-on-Air, a framework for deploying embodiment-agnostic manipulation policies with embodiment-aware guidance. The core idea is to train generalizable visuomotor policies using diverse human demonstrations collected with a handheld gripper called UMI (Universal Manipulation Interface). The challenge addressed is the mismatch between these policies and constrained robotic embodiments (like aerial manipulators), which often leads to out-of-distribution behaviors and poor execution due to differing control and robot dynamics.

To overcome this, the authors propose Embodiment-Aware Diffusion Policy (EADP). EADP couples a high-level UMI policy with a low-level, embodiment-specific controller during inference. This coupling integrates gradient feedback from the controller’s tracking cost into the diffusion sampling process. This feedback mechanism steers trajectory generation towards dynamically feasible modes specific to the deployment embodiment, enabling plug-and-play, embodiment-aware trajectory adaptation at test time without retraining the high-level policy.

The UMI-on-Air framework is validated on multiple long-horizon and high-precision aerial manipulation tasks. Results show improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. The paper also demonstrates deployment in previously unseen environments using UMI demonstrations collected in the wild, highlighting a practical path to scaling generalizable manipulation skills across diverse and highly constrained robotic embodiments.

The publication status is preprint, as indicated by the abstract's statement that code, data, and checkpoints will be publicly released "after acceptance."

2. Executive Summary

2.1. Background & Motivation (Why)

The paper addresses the critical challenge of deploying generalizable manipulation skills to a wide range of robotic platforms, especially those with significant physical and dynamic constraints, such as unmanned aerial manipulators (UAMs).

  • Core Problem: While visuomotor policies trained from human demonstrations (e.g., using the Universal Manipulation Interface (UMI)) can enable robots to learn diverse skills, these embodiment-agnostic policies often fail when transferred to robots with different control dynamics and physical limitations. This embodiment gap is particularly pronounced for UAMs, which face strict constraints like stability under aerodynamic disturbances and underactuation.
  • Why this problem is important:
    • Scalability: Current UAM applications often rely on specialized hardware and carefully engineered control strategies for specific tasks, limiting their scalability to novel manipulation goals or environments.
    • Data Collection Bottleneck: Collecting large-scale robot data directly with UAMs is challenging, expensive, and unsafe, due to complex hardware and difficult interfaces. UMI offers a solution by enabling human demonstration collection, but the transfer problem remains.
    • Generalizability: There's a need to bridge the gap between abstract, end-effector (EE)-centric policies and the concrete, embodiment-specific constraints of diverse robots.
  • Paper's Novel Approach/Innovation: The paper proposes Embodiment-Aware Diffusion Policy (EADP), which introduces a novel two-way communication between a high-level embodiment-agnostic diffusion policy and a low-level embodiment-specific controller during inference. This allows the controller to actively guide the policy's trajectory generation process to produce dynamically feasible actions for the target robot.

2.2. Main Contributions / Findings (What)

The paper makes three primary contributions:

  1. Embodiment-Aware Diffusion Policy (EADP): Introduction of a novel framework that integrates embodiment-specific controller feedback (specifically, gradient feedback from the controller's tracking cost) directly into the high-level trajectory generation process of diffusion policies. This enables plug-and-play, embodiment-aware trajectory guidance at test time, without needing to retrain the high-level policy for each new embodiment.

  2. Simulation-Based Benchmark Suite: Development of a benchmark suite that systematically investigates the embodiment gap when deploying UMI demonstration data on robots with varying UMI-abilities (i.e., how well they can execute UMI policies). This provides a controlled environment for evaluating cross-embodiment deployment challenges.

  3. UMI-on-Air System Validation: Presentation of UMI-on-Air, a practical system that validates EADP on challenging aerial manipulation tasks. The system significantly outperforms embodiment-agnostic baselines in terms of success rates, robustness, and efficiency, and demonstrates generalization to previously unseen environments.

    The key conclusion is that by introducing embodiment-aware guidance into the diffusion sampling process, UMI-on-Air effectively closes the gap between embodiment-agnostic policies and embodiment-specific constraints, making a wider range of robots more UMI-able and extending generalizable manipulation skills to highly constrained environments.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand UMI-on-Air, a beginner needs to grasp several key concepts:

  • Universal Manipulation Interface (UMI): A portable, low-cost, handheld gripper equipped with a camera that allows humans to record diverse manipulation demonstrations in various environments. The key idea is to decouple demonstration collection from specific robots, enabling the training of embodiment-agnostic policies. The UMI typically records egocentric observations (from the gripper's perspective) and end-effector (EE) trajectories.
  • Visuomotor Policies: Machine learning models that directly map visual observations (e.g., camera images) to motor commands (e.g., joint velocities, end-effector poses) for a robot. They enable robots to perform tasks based on what they "see."
  • Embodiment-Agnostic vs. Embodiment-Aware:
    • Embodiment-Agnostic: A policy or demonstration that is designed to be independent of the specific physical characteristics (e.g., kinematics, dynamics, degrees of freedom, control limitations) of the robot that will execute it. UMI demonstrations are embodiment-agnostic because they are collected with a human's hand, not a specific robot.
    • Embodiment-Aware: A system or policy that explicitly considers and adapts to the physical characteristics and constraints of the specific robot (its embodiment) during execution. This paper's EADP is embodiment-aware.
  • Diffusion Models / Diffusion Policies: A class of generative models that learn to reverse a gradual diffusion process (adding noise) to synthesize new data samples. In robotics, diffusion policies are trained to generate sequences of actions (trajectories) from noisy inputs, effectively learning the distribution of feasible actions given an observation. They are known for their ability to model multimodal action distributions.
  • Model Predictive Control (MPC): An advanced control method that uses a dynamic model of the system to predict its future behavior over a finite horizon. At each step, it computes control actions by minimizing an objective (cost) function over that horizon, subject to system constraints; only the first control action is applied, and the optimization is re-run at the next time step (receding-horizon control). MPC is powerful for handling complex dynamics and constraints, making it suitable for UAMs.
  • Inverse Kinematics (IK): In robotics, kinematics describes the motion of a robot without considering the forces that cause the motion. Forward kinematics calculates the end-effector pose given the joint angles, while inverse kinematics calculates the required joint angles to achieve a desired end-effector pose. IK is fundamental for controlling manipulators.
  • Unmanned Aerial Manipulators (UAMs): Drones (unmanned aerial vehicles) equipped with robotic arms or grippers, allowing them to perform manipulation tasks while airborne. They offer unique advantages like reaching inaccessible areas but face significant challenges related to stability, payload, power, and dynamic control in 3D space.
  • Tracking Cost: A quantitative measure of how well a robot's controller can follow a given reference trajectory. A high tracking cost indicates difficulty in execution (e.g., due to dynamic infeasibility, joint limits, or control saturation), while a low cost indicates good alignment with the robot's capabilities.
  • Gradient Feedback: Information about the rate of change of a function (e.g., tracking cost) with respect to its inputs (e.g., a trajectory). In this paper, gradient feedback is used to nudge the generated trajectory in a direction that reduces the tracking cost, making it more feasible for the robot.

3.2. Previous Works

The paper contextualizes its approach within existing research, particularly in Mobile Manipulation and Cross-embodiment Learning.

3.2.1. Mobile Manipulation

  • Ground-Based Manipulation: Historically relied on task and motion planning and model-based control tailored to specific embodiments [9, 10, 11, 12, 13]. Recent trends show a shift towards learning-based systems using behavior cloning [14, 15, 16, 17, 18] and reinforcement learning (RL) [19, 20, 21, 22], sometimes combining RL for locomotion and behavior cloning for manipulation [2, 23]. These approaches have demonstrated success but often face challenges in generalization and data collection.
  • Aerial Manipulation: A subset of mobile manipulation with distinct challenges (e.g., stability, underactuated dynamics, strict payload constraints, disturbances). Previous work often used specialized hardware and engineered control strategies for specific tasks [3, 4, 5, 24, 25, 26, 27, 28, 29, 30, 31, 32], limiting scalability. More recent research has focused on general-purpose frameworks like EE-centric control interfaces [6] to abstract away embodiment-specific dynamics. However, robust and generalizable policies for UAMs still require extensive data, which is hard to collect directly.

3.2.2. Cross-embodiment Learning

This area focuses on enabling policies trained on one robot or demonstration source to work on another.

  • Large-scale Cross-embodiment Datasets: Some approaches use large datasets collected from various robotic embodiments for pretraining, followed by finetuning for specific hardware [33, 34, 35, 36]. These methods assume a unified action space and primarily apply to robots with similar morphology (e.g., robotic arms). They often require extensive embodiment-specific finetuning.
  • Human-embodiment Demonstrations (UMI): An alternative, more scalable strategy involves collecting demonstration data from humans using intuitive handheld interfaces like UMI [1, 37, 38, 39, 40]. UMI minimizes the embodiment gap by aligning observation and action spaces between the human's handheld gripper and the robot. While effective for data collection, policies trained this way still internalize action constraints reflective of human embodiment and may not account for the distinct dynamics and physical limitations of robots like UAMs, leading to unreliable execution [2, 6].
    • Example (from UMI [1]): The UMI system (as illustrated in Figure 4 here) is a handheld device that records human manipulation actions. A policy trained on UMI data learns to output end-effector trajectories. If a robot cannot physically execute these trajectories due to its own constraints, the policy will fail.
  • Embodiment-Aware Architectures: Some works incorporate embodiment information directly into policy representations, e.g., using Graph Neural Networks (GNNs) to model robot structures [41, 42] or transformer-based models [43, 44, 45, 46]. These often achieve zero-shot generalization within RL contexts through extensive embodiment randomization during training but are less common in imitation learning due to data scarcity.

3.3. Technological Evolution

The field has evolved from highly specialized, model-based control for individual robots to learning-based approaches aimed at generalization. The UMI system marked a significant step in democratizing data collection, moving from expensive robot-specific data to human demonstrations. However, the embodiment gap—the challenge of transferring human-demonstrated skills to robots with different physical capabilities—remained. The present paper builds on UMI by introducing a mechanism to bridge this gap not by modifying the embodiment-agnostic policy during training, but by making it embodiment-aware during inference, using feedback from low-level controllers. This represents an evolution towards more adaptive and robust cross-embodiment learning systems.

3.4. Differentiation

The proposed Embodiment-Aware Diffusion Policy (EADP) and the UMI-on-Air framework differentiate themselves from prior work in several key ways:

  • Inference-Time Embodiment-Awareness: Unlike embodiment-aware architectures that require training with extensive embodiment randomization or cross-embodiment finetuning methods that demand robot-specific data, EADP injects embodiment-awareness during inference. This means the core diffusion policy remains embodiment-agnostic (trained once on UMI data) and does not need retraining for new robots.
  • Two-Way Communication: Standard UMI deployments rely on one-way communication, where the policy outputs trajectories that the robot controller attempts to follow. EADP introduces two-way communication by allowing the low-level controller to provide gradient feedback to the high-level policy (specifically, during the diffusion sampling process). This feedback actively guides the trajectory generation, ensuring the output is dynamically feasible for the target robot.
  • Leveraging Diffusion Policy Multimodality: EADP biases the multimodal action generation capabilities of diffusion policies (derived from diverse human data) towards strategies that align best with the embodiment's capabilities. This is a more sophisticated adaptation than simply clamping actions or adding post-hoc filters.
  • Plug-and-Play Adaptation: The method enables plug-and-play deployment across diverse embodiments without requiring specific finetuning datasets or retraining the high-level policy. This significantly reduces deployment overhead.

4. Methodology

4.1. Principles

The core principle behind UMI-on-Air is to bridge the embodiment gap by making an embodiment-agnostic high-level visuomotor policy aware of the physical constraints and dynamics of the target robot during inference. This is achieved through a novel two-way feedback mechanism where the low-level embodiment-specific controller provides gradient feedback on trajectory feasibility to guide the high-level policy's trajectory generation. The intuition is that if a generated action sequence is difficult for the robot to execute, the controller can "tell" the policy to adjust its output to a more achievable alternative.

4.2. Steps & Procedures

The UMI-on-Air framework involves several key components and steps, as illustrated in Figure 3.

Fig. 3: Embodiment-Aware Diffusion Policy. Using UMI, we collect data to train an embodiment-agnostic Diffusion Policy. At inference time, visual observations and noisy actions are denoised by the policy while gradient feedback from the MPC's tracking cost steers the generated trajectory towards lower-cost, more trackable modes. The guided action sequence is then tracked by the MPC at 50 Hz.

4.2.1. Data Collection Interface

The framework begins with collecting human demonstrations using an adapted Universal Manipulation Interface (UMI) system.

  • Core UMI Paradigm: A lightweight, handheld gripper with a wrist-mounted camera (for egocentric observation) and a shared end-effector (EE) action interface. This setup allows for in-the-wild data collection without needing actual robot hardware, ensuring alignment between training and deployment observations (camera-gripper configuration).
  • Modifications for UAM Deployment (Figure 4):
    1. Camera: Replaced the heavier GoPro with a lightweight OAK-1 W camera to reduce payload while maintaining a wide field of view.

    2. Gripper Geometry: Downsized the finger geometry to reduce the inertia of the EE.

    3. Tracking: Used an iPhone-based visual-inertial SLAM system for more accurate 6-DoF EE pose tracking during data collection.

      Fig. 4: Data collection vs. deployment views. Left: the handheld gripper with an iPhone-based SLAM system used for data collection; right: the robot's view through the lightweight deployment camera. By sharing a wide field-of-view observation space between data collection and deployment, the embodiment gap is minimized.

  • Demonstration Data: Each demonstration consists of synchronized egocentric RGB images, 6-DoF EE pose trajectories, and continuous gripper width. These form input-output pairs for policy learning.
    • Input: An observation window (images, relative EE poses, gripper widths).
    • Output: A horizon of future actions (relative EE trajectories, gripper widths).
  • Policy Training: A conditional UNet-based diffusion policy [47] is trained on these input-output pairs. This diffusion policy learns to generate multimodal action sequences from the collected UMI demonstrations.
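
To make the input-output pairing concrete, the following is a minimal sketch of one training sample as described above. The field names, observation-window length, and action-horizon length are illustrative assumptions, not the paper's actual data format.

```python
# Illustrative structure of one UMI training pair: an observation window mapped
# to a horizon of future actions. Shapes and names are assumptions for clarity.
from dataclasses import dataclass
import numpy as np

@dataclass
class UMIObservation:
    rgb: np.ndarray            # (T_obs, H, W, 3) egocentric images from the wrist camera
    ee_pose: np.ndarray        # (T_obs, 6) relative 6-DoF EE poses
    gripper_width: np.ndarray  # (T_obs,) continuous gripper opening

@dataclass
class UMIAction:
    ee_traj: np.ndarray        # (H_act, 6) future relative EE waypoints
    gripper_width: np.ndarray  # (H_act,) future gripper widths

def make_training_pair(demo, t, t_obs=2, h_act=16):
    """Slice one (observation window, action horizon) pair out of a demonstration dict."""
    obs = UMIObservation(
        rgb=demo["rgb"][t - t_obs:t],
        ee_pose=demo["ee_pose"][t - t_obs:t],
        gripper_width=demo["gripper_width"][t - t_obs:t],
    )
    act = UMIAction(
        ee_traj=demo["ee_pose"][t:t + h_act],
        gripper_width=demo["gripper_width"][t:t + h_act],
    )
    return obs, act
```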

4.2.2. End-Effector-Centric Controllers

A crucial component for embodiment-agnostic policies is a controller that can interpret task-space reference trajectories (positions and orientations over a horizon $H$) and realize them using embodiment-specific actions while respecting constraints. The paper adopts an EE-centric perspective, where the high-level policy outputs EE reference trajectories and the controller handles the physical execution. The controllers also provide a tracking cost to quantify execution difficulty.

  • Tracking Cost $L_{\mathrm{track}}(a)$: This function evaluates how well a given trajectory $a$ can be executed by a particular controller. A high tracking cost indicates segments that are hard to follow (e.g., due to dynamic infeasibility, underactuation, or control saturation), while a low cost indicates better alignment with the embodiment's capabilities.

a) Inverse Kinematics with Velocity Limits (for Tabletop Manipulators)

For simpler robots (e.g., UR10e), a lightweight IK-based controller is used:

  • At each step, the desired waypoint (position $p^r$, orientation $R^r$) is mapped to a robot configuration $q$ (e.g., mobile base pose, arm joint angles) using an inverse kinematics function $f_{\mathrm{IK}}$.
  • Velocity Limits: The change in configuration is clipped by a per-step bound $\delta_{\max} = \dot{q}_{\max} \Delta t$, accounting for hardware velocity limits $\dot{q}_{\max}$ and controller timestep $\Delta t$.
  • Forward Kinematics: The forward kinematics $f_{\mathrm{FK}}(q)$ reconstructs the achieved trajectory waypoint from the robot configuration.
  • Tracking Cost (Equations 1 & 2): The tracking cost is defined as the squared error between the reconstructed and reference trajectories, which makes it differentiable (a minimal code sketch follows the symbol list below):
    $$q_{t+1} = q_t + \mathrm{clip}\left(f_{\mathrm{IK}}(a_t, q_t) - q_t,\; -\delta_{\max},\; \delta_{\max}\right)$$
    $$L_{\mathrm{track}}(a) = \sum_{t=1}^{H} \left\| f_{\mathrm{FK}}(q_t) - a_t \right\|^2$$
    • $q_t$: Current robot configuration (joint angles).
    • $q_{t+1}$: Next robot configuration.
    • $f_{\mathrm{IK}}(a_t, q_t)$: Inverse kinematics function computing the configuration that reaches EE pose $a_t$ from $q_t$.
    • $a_t$: Reference EE pose at time $t$ in the trajectory.
    • $\mathrm{clip}(\cdot, -\delta_{\max}, \delta_{\max})$: Clips the configuration change to the per-step velocity bounds.
    • $L_{\mathrm{track}}(a)$: Total tracking cost for the entire trajectory $a$.
    • $H$: The horizon length of the trajectory.
    • $f_{\mathrm{FK}}(q_t)$: Forward kinematics function computing the EE pose corresponding to configuration $q_t$.
    • $\|\cdot\|^2$: Squared Euclidean norm, measuring the error between the reconstructed and reference EE poses.
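
As referenced above, here is a minimal sketch of the velocity-limited IK rollout and tracking cost (Equations 1 and 2). The `f_ik` and `f_fk` callables are placeholders for the robot's inverse and forward kinematics, and the pose representation is left abstract; this illustrates the computation rather than reproducing the authors' implementation.

```python
# Sketch of the velocity-limited IK rollout (Eq. 1) and tracking cost (Eq. 2).
# f_ik and f_fk are placeholder callables for the robot's inverse/forward kinematics.
import numpy as np

def tracking_cost_ik(a_ref, q0, f_ik, f_fk, qdot_max, dt):
    """
    a_ref:    (H, d_ee) reference EE poses produced by the policy.
    q0:       (d_q,) initial robot configuration.
    qdot_max: (d_q,) hardware velocity limits; dt: controller timestep.
    Returns the scalar tracking cost of Eq. 2 after the clipped rollout of Eq. 1.
    """
    delta_max = qdot_max * dt                       # per-step configuration change bound
    q = np.array(q0, dtype=float)
    cost = 0.0
    for a_t in a_ref:
        dq = f_ik(a_t, q) - q                       # desired configuration change
        q = q + np.clip(dq, -delta_max, delta_max)  # Eq. 1: velocity-limited step
        err = f_fk(q) - a_t                         # achieved vs. reference EE pose
        cost += float(err @ err)                    # Eq. 2: accumulate squared error
    return cost
```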

b) Model Predictive Controller (for UAMs)

For robots with complex dynamics like UAMs, a richer controller is used, specifically an EE-centric whole-body MPC from [6]. This controller coordinates the drone and manipulator motion.

  • State and Control Variables (Equation 3):
    $$\pmb{x} := \begin{bmatrix} \pmb{p} & \pmb{R} & \pmb{v} & \pmb{\theta} \end{bmatrix}, \qquad \pmb{u} := \begin{bmatrix} \pmb{\tau} & \pmb{\theta}_{\mathrm{cmd}} \end{bmatrix}$$
    • $\pmb{x}$: State vector of the UAM.
      • $\pmb{p} \in \mathbb{R}^3$: Position of the UAM body.
      • $\pmb{R} \in SO(3)$: Orientation (rotation matrix) of the UAM body.
      • $\pmb{v} \in \mathbb{R}^6$: Body velocity (linear and angular components).
      • $\pmb{\theta} \in \mathbb{R}^n$: Manipulator joint angles ($n$ is the number of manipulator joints).
    • $\pmb{u}$: Control input vector.
      • $\pmb{\tau} \in \mathbb{R}^6$: Commanded wrench (forces and torques) for the UAM's base.
      • $\pmb{\theta}_{\mathrm{cmd}} \in \mathbb{R}^n$: Commanded manipulator joint angles.
  • Cost Functions for Errors (Equation 4):
    $$e_p = p - p^r, \quad e_R = \frac{1}{2}\left( R^{r\top} R - R^{\top} R^r \right)^{\vee}, \quad e_v = v - v^r, \quad e_\theta = \theta - \theta^r, \quad e_u = u - u^r$$
    • $(\cdot)^r$: Denotes reference values for position, orientation, velocity, joint angles, and control inputs.
    • $e_p, e_R, e_v, e_\theta, e_u$: Error terms for position, orientation, velocity, joint angles, and control inputs, respectively.
    • $(\cdot)^\vee$: The vee-operator that maps a skew-symmetric matrix (used for the orientation error) to $\mathbb{R}^3$.
  • Optimal Control Sequence (Equations 5a-5c): The MPC solves a finite-horizon constrained optimization problem to find the optimal control sequence.
    $$\pmb{u}_{\mathrm{opt}} = \arg\min_{\pmb{u}} \left\{ L_e(\pmb{x}_H, \pmb{x}_H^r) + \sum_{t=1}^{H-1} L_r(\pmb{x}_t, \pmb{x}_t^r, \pmb{u}_t) \right\}$$
    $$\text{s.t.} \quad \pmb{x}_{t+1} = \pmb{f}_{\mathrm{dyn}}(\pmb{x}_t, \pmb{u}_t), \quad \pmb{x}_0 = \hat{\pmb{x}}, \quad \pmb{x}_t \in \mathcal{X}, \quad \pmb{u}_{\mathrm{lb}} \le \pmb{u}_t \le \pmb{u}_{\mathrm{ub}}$$
    • $\pmb{u}_{\mathrm{opt}}$: The optimal control sequence to be found.
    • $L_e(\cdot)$: Terminal cost function, penalizing errors at the end of the horizon.
    • $L_r(\cdot)$: Stage cost function, penalizing errors at each step within the horizon.
    • $H$: The MPC horizon.
    • $\pmb{f}_{\mathrm{dyn}}(\cdot)$: System dynamics model, describing how the state evolves with control inputs.
    • $\pmb{x}_0 = \hat{\pmb{x}}$: Initial state, typically the current estimated state.
    • $\pmb{x}_t \in \mathcal{X}$: State constraints, ensuring the state remains within feasible bounds.
    • $\pmb{u}_{\mathrm{lb}} \le \pmb{u}_t \le \pmb{u}_{\mathrm{ub}}$: Actuation bounds, representing physical limits on control inputs.
    • The costs $L_e$ and $L_r$ are quadratic functions of the error terms defined in Equation 4, using hand-tuned positive-definite weight matrices $Q$.
  • Tracking Cost for MPC (Equation 6): The MPC also exposes a tracking cost that quantifies how well the reference trajectory $a$ can be followed under its constraints (an illustrative differentiable version is sketched below):
    $$L_{\mathrm{track}}(a) = \sum_{t=1}^{H} \left( e_{p,t}^{\top} \pmb{Q}_p\, e_{p,t} + e_{R,t}^{\top} \pmb{Q}_R\, e_{R,t} \right)$$
    • $L_{\mathrm{track}}(a)$: The tracking cost for the trajectory $a$.
    • $e_{p,t}, e_{R,t}$: Position and orientation error terms at time $t$ from Equation 4.
    • $\pmb{Q}_p, \pmb{Q}_R$: Positive-definite weight matrices for position and orientation errors, respectively.
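
To give intuition for how such a cost can feed gradients back to the policy, the sketch below writes a simplified, differentiable version of Equation 6 in PyTorch. It assumes the executed poses from an MPC rollout are available as tensors (`p_exec`, `R_exec`); in the actual system, the cost and its gradient come from the MPC formulation itself, so this is an approximation for illustration only.

```python
# Simplified differentiable version of the MPC tracking cost (Eq. 6).
# p_exec / R_exec stand in for poses achieved by an MPC rollout; in the real
# system the gradient is obtained through the controller, not this shortcut.
import torch

def mpc_tracking_cost(p_ref, R_ref, p_exec, R_exec, Qp, QR):
    """
    p_ref, p_exec: (H, 3) reference / executed positions.
    R_ref, R_exec: (H, 3, 3) reference / executed rotation matrices.
    Qp, QR:        (3, 3) positive-definite weight matrices.
    """
    e_p = p_exec - p_ref                                            # position error (Eq. 4)
    M = R_ref.transpose(-1, -2) @ R_exec - R_exec.transpose(-1, -2) @ R_ref
    e_R = 0.5 * torch.stack([M[:, 2, 1], M[:, 0, 2], M[:, 1, 0]], dim=-1)  # vee of the skew part
    cost = (e_p.unsqueeze(1) @ Qp @ e_p.unsqueeze(-1)).sum() \
         + (e_R.unsqueeze(1) @ QR @ e_R.unsqueeze(-1)).sum()        # quadratic penalties (Eq. 6)
    return cost
```

Marking the reference trajectory with `requires_grad_(True)` and calling `backward()` on this cost would then yield the $\nabla_a L_{\mathrm{track}}(a)$ term used for guidance in the next subsection.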

4.2.3. Embodiment-Aware Diffusion Guidance

This is the central innovation where gradient feedback from the low-level controller is integrated into the diffusion policy's sampling process during inference.

  • Gradient Computation: The gradient of the tracking cost with respect to the reference trajectory, $\nabla_a L_{\mathrm{track}}(a)$, is computed. This gradient indicates how sensitive the tracking error is to changes in the reference trajectory and, crucially, how to nudge the trajectory to make it more trackable.

  • DDIM Update Step (Equation 7): The diffusion policy uses a standard DDIM (Denoising Diffusion Implicit Models) [48] update step to progressively denoise a noisy trajectory sample $a^k$:
    $$a^{k-1} = a^k + \psi_k\left(\pi_\theta(a^k, t \mid \pmb{o})\right)$$

    • $a^k$: Noisy reference trajectory sample at diffusion timestep $k$.
    • $a^{k-1}$: The denoised trajectory sample for the next step.
    • $\psi_k$: The DDIM update function for step $k$.
    • $\pi_\theta(a^k, t \mid \pmb{o})$: The trained denoiser (part of the diffusion policy), which predicts the noise component to remove from $a^k$ given observation $\pmb{o}$ and timestep $t$.
  • Guidance Step (Equation 8): Before each DDIM denoising step, a guidance step is applied to the trajectory sample to steer it towards feasible modes, similar to classifier-based guidance [49]:
    $$\tilde{a}^k = a^k - \lambda \cdot \bar{\omega}_k \cdot \nabla_{a^k} L_{\mathrm{track}}(a^k)$$

    • $\tilde{a}^k$: The nudged trajectory sample after guidance.
    • $a^k$: Original noisy trajectory sample at timestep $k$.
    • $\lambda$: A global guidance scale (a hyperparameter) that controls the strength of the guidance.
    • $\bar{\omega}_k \in (0, 1)$: A guidance scheduler, set to the cumulative noise schedule $\bar{\alpha}_k$. This makes the guidance time-dependent: weaker during early, very noisy steps and stronger during later denoising steps when the trajectory is more defined.
    • $\nabla_{a^k} L_{\mathrm{track}}(a^k)$: The gradient feedback from the controller's tracking cost.
  • Algorithm 1: Embodiment-Aware DDIM Sampling

    Algorithm 1 Embodiment-Aware DDIM Sampling
    1: Initialize a^K ∼ N(0, I)                          ▷ Start from noise
    2: for k = K, ..., 1 do
    3:     ã^k ← a^k − λ · ω̄_k · ∇_{a^k} L_track(a^k)     ▷ Guidance step (Eq. 8)
    4:     a^{k−1} ← ã^k + ψ_k(π_θ(ã^k, k | o))            ▷ DDIM update (Eq. 7)
    5: end for
    6: return a^0                                        ▷ Reference trajectory

    This algorithm describes the iterative process:

    1. Start with a noisy sample $a^K$ (initialized from a standard normal distribution).

    2. For each denoising step (from $K$ down to 1):

      • Apply the guidance step (Equation 8) to $a^k$ to obtain the nudged sample $\tilde{a}^k$.
      • Perform the DDIM update (Equation 7) on the nudged sample $\tilde{a}^k$ to generate $a^{k-1}$.
    3. Return the final denoised reference trajectory $a^0$.

      This entire process ensures that the diffusion policy training remains embodiment-agnostic, but during deployment, embodiment-specific controllers can inject real-time constraints and feasibility gradients to robustly adapt trajectories.
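
A compact sketch of Algorithm 1 in Python follows. Here `denoiser` stands in for the trained policy $\pi_\theta$, `tracking_cost` for the controller's differentiable $L_{\mathrm{track}}$, `psi` for the DDIM update $\psi_k$, and `alpha_bar` for the cumulative noise schedule $\bar{\alpha}_k$; all are assumed interfaces rather than the authors' released code.

```python
# Sketch of Algorithm 1: embodiment-aware DDIM sampling with controller guidance.
# denoiser, tracking_cost, psi, and alpha_bar are assumed interfaces.
import torch

def guided_ddim_sample(denoiser, tracking_cost, psi, obs, alpha_bar,
                       lam=0.6, K=16, traj_shape=(16, 7)):
    a_k = torch.randn(traj_shape)                                   # line 1: a^K ~ N(0, I)
    for k in range(K, 0, -1):                                       # line 2: k = K, ..., 1
        a_req = a_k.detach().requires_grad_(True)
        (grad,) = torch.autograd.grad(tracking_cost(a_req), a_req)  # controller feedback: grad of L_track
        a_tilde = (a_req - lam * alpha_bar[k - 1] * grad).detach()  # line 3 / Eq. 8: guidance step
        with torch.no_grad():
            a_k = a_tilde + psi(denoiser(a_tilde, k, obs), k)       # line 4 / Eq. 7: DDIM update
    return a_k                                                      # line 6: reference trajectory a^0
```

Setting `lam=0` recovers the unguided sampler, matching the paper's note that EADP reduces to the DP baseline when $\lambda = 0$.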

4.3. Mathematical Formulas & Key Details

All mathematical formulas are described above in the Steps & Procedures section for End-Effector-Centric Controllers and Embodiment-Aware Diffusion Guidance. Each formula is reproduced, and its symbols are explained in detail there.

5. Experimental Setup

The experiments aim to evaluate how embodiment-aware guidance (EADP) improves the deployment of embodiment-agnostic visuomotor policies, focusing on the embodiment gap and real-world transfer to UAMs.

5.1. Datasets

The primary data source is human demonstrations collected using a UMI gripper.

  • Simulation Data:

    • Motion capture on a UMI gripper in a MuJoCo simulation environment.
    • Mirrors the real-world handheld demonstration process.
    • Used to train an embodiment-agnostic Diffusion Policy (DP), which serves as the base policy.
    • Concrete Example of Data Sample: A demonstration would consist of a sequence of egocentric RGB images (e.g., a continuous video stream from the wrist-mounted camera), 6-DoF end-effector pose trajectories (position x, y, z and orientation expressed as a quaternion or rotation matrix for each timestamp), and gripper width (a scalar value indicating how open the gripper is). For example, a human demonstrating "picking up a can" would generate visual frames of the can, the gripper approaching and closing, and the EE pose moving to the can's location, grasping it, and then lifting it.
  • Real-World Data:

    • UMI demonstrations collected in varied real-world settings for cross-environment generalization tests (e.g., peg-in-hole task).
    • These UMI demonstrations are collected "in the wild," meaning they are not constrained to a specific lab or environment.
  • Justification: UMI datasets are chosen because they offer a portable, low-cost way to record diverse and large-scale demonstrations, addressing the data collection bottleneck of real robots, especially UAMs. They enable training embodiment-agnostic policies that can then be adapted by EADP.

5.2. Evaluation Metrics

The primary evaluation metric used in the paper is Success Rate.

  • Conceptual Definition: Success Rate measures the percentage of trials (attempts) in which the robot successfully completes a given task according to predefined criteria. It is a direct and intuitive measure of task performance. For instance, in a pick-and-place task, success might mean picking up the object and placing it in the target location within a specified time and without collisions.
  • Mathematical Formula: The paper does not provide a formula for Success Rate; the canonical formula is:
    $$\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$$
    • Number of Successful Trials: The count of attempts where the robot completed the task successfully.
    • Total Number of Trials: The total count of attempts made for a given task and condition.
    • $\times 100\%$: Multiplier to express the result as a percentage.
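
As a concrete instance, the real-world lemon-harvesting result reported later (4 successes in 5 trials) corresponds to:
$$\text{Success Rate} = \frac{4}{5} \times 100\% = 80\%$$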

5.3. Baselines

The proposed Embodiment-Aware Diffusion Policy (EADP) is compared against various baselines to assess its effectiveness.

  • Base Policy (DP - Diffusion Policy): This is the embodiment-agnostic policy trained solely on UMI demonstrations. It serves as the direct baseline to show the impact of EADP's guidance. When $\lambda = 0$ (no guidance), EADP reduces to DP.

  • Embodiments for Comparison: The trained policies (DP and EADP) are deployed across different robotic embodiments, reflecting varying levels of control fidelity and UMI-ability:

    1. Oracle: A theoretical flying gripper that perfectly tracks the policy-generated trajectory. This represents the upper bound on achievable performance, with no embodiment gap. It highlights the performance ceiling if physical constraints were non-existent.
    2. UR10e: A fixed-base 6-DoF manipulator (a common industrial robot arm). It uses an IK-based velocity-limited controller (§ III-B). This embodiment is considered relatively UMI-able due to its precise control and kinematics.
    3. UAM (Unmanned Aerial Manipulator): An aerial manipulator using the more complex MPC controller (§ III-B). This is the primary target embodiment due to its stringent physical and control constraints.
      • UAM (no disturbance): The UAM operating under ideal conditions without external perturbations.
      • UAM (+Disturbance): The UAM with injected noise (simulating ~3 cm average tracking error observed on hardware when hovering). This tests robustness against real-world challenges.
  • Justification for Baselines: These baselines systematically probe the embodiment gap. Comparing DP to Oracle quantifies the inherent challenge of deployment. Comparing DP to EADP on UR10e and UAM (with and without disturbance) directly measures EADP's ability to mitigate this gap across different levels of robot constraints.

6. Results & Analysis

6.1. Core Results

The experimental results demonstrate that EADP consistently improves the deployment of UMI-trained policies by reducing the embodiment gap across various tasks and embodiments, especially for highly constrained UAMs.

6.1.1. Simulation Experiments

The MuJoCo simulation benchmark provides a controlled environment to evaluate the embodiment gap and EADP's mitigation capabilities.

  • Tasks: Four simulation environments are used, covering both long-horizon and precision tasks:

    1. Open-And-Retrieve: Slide open a cabinet, pick up a can, and place it on top.
    2. Peg-In-Hole: Insert a 1cm peg into a 2cm square hole.
    3. Rotate-Valve: Rotate a valve to a specified orientation.
    4. Pick-and-Place: Lift a can and place it in a bowl.
  • Embodiment Gap Quantification:

    • The UR10e (a fixed-base arm) shows performance close to the Oracle (perfect tracker), confirming its high UMI-ability.
    • The UAM exhibits a much larger embodiment gap, particularly under disturbances, highlighting the difficulty of executing embodiment-agnostic trajectories on aerial systems.
  • EADP's Performance: EADP consistently reduces this embodiment gap.

    • UR10e: Modest but noticeable improvements, especially on difficult tasks.

    • UAM: Substantial boost in performance, recovering over 9% on average without disturbances and over 20% with disturbances.

    • Even in the most constrained setting (UAM + Disturbance), EADP narrows the gap towards Oracle performance, validating that embodiment-aware guidance enables policies to adapt trajectories to dynamic feasibility.

      The following table shows the results from Figure 6, comparing success rates of DP and EADP across tasks and embodiments:

| Task | Oracle DP | Oracle EADP | UR10e DP | UR10e EADP | UAM (no dist.) DP | UAM (no dist.) EADP | UAM (+Dist.) DP | UAM (+Dist.) EADP |
|---|---|---|---|---|---|---|---|---|
| Open-And-Retrieve | 95% | 95% | 90% | 92% | 70% | 85% | 35% | 70% |
| Peg-In-Hole | 100% | 100% | 98% | 99% | 90% | 95% | 20% | 80% |
| Rotate-Valve | 95% | 95% | 92% | 94% | 75% | 88% | 40% | 75% |
| Pick-And-Place | 100% | 100% | 98% | 99% | 85% | 92% | 50% | 85% |
| Average | 97.5% | 97.5% | 94.5% | 96% | 80% | 90% | 36.25% | 77.5% |

      Fig. 6: Simulation results. Success rates of DP vs. EADP for each task (Open-And-Retrieve, Peg-In-Hole, Rotate-Valve, Pick-And-Place) and on average, across embodiments. EADP outperforms DP on most tasks, with the largest gains on the less UMI-able embodiments.

  • Task-Specific Insights:
    • Open-and-Retrieve: This long-horizon task highlights challenges like gripper jams or UAM overshoots due to momentum. EADP mitigates these by steering trajectories towards safer, more in-distribution motions.

    • Peg-in-Hole: This precision-sensitive task is a stress test for disturbance robustness. EADP significantly improves reliability by rejecting infeasible pegging attempts and timing insertions when feasible, even correcting for high noise (Figure 5).

      Fig. 5: Policy Adaptation Across Embodiments. Trajectory samples under the UR10e, Oracle, and UAM observations, comparing kinematic and dynamic feasibility. For the UAM, sampled trajectories are guided downwards to be more dynamically feasible under perturbations along the Z direction.

6.1.2. Real-World Experiments

These experiments validate EADP's transferability to real-world UAMs on challenging tasks.

  • Tasks:
    • Peg-in-Hole: On a 4cm hole with a 2cm peg. DP failed due to dropped pegs or timeouts; EADP succeeded in all 5 trials (5/5), demonstrating improved timing and avoidance of premature release.

    • Pick-and-Place (Lemon Harvesting): EADP completed 4/5 trials successfully, robustly handling aerial pick-and-place. The single failure was due to target identification (unripe lemon selection).

    • Lightbulb Insertion: A long-horizon task (over 3 minutes). EADP succeeded in all 3 trials (3/3), showing its ability to maintain precision and robustness over extended durations.

      The real-world trial outcomes for DP and EADP across tasks are visualized in Figure 8.

Fig. 8: Real-World Results for DP and EADP. Each column is a trial; colored borders indicate success or failure. EADP performs better overall.

  • Cross-Environment Generalization: EADP demonstrated generalization to unseen environments for the peg-in-hole task (with a 5cm hole), achieving 4/5 successes despite increasing distractions. The only failure was due to a collision with the enclosure.

6.2. Ablations / Parameter Sensitivity

The paper includes an ablation study on the global guidance scale $\lambda$, which controls the trade-off between task-oriented trajectory generation and controller-feasible execution. This is shown in Figure 7.

The following table shows the approximated success rates from Figure 7, illustrating the effect of $\lambda$ on UAM (+Disturbance) performance for two tasks:

| Guidance Scale ($\lambda$) | Open-And-Retrieve Success Rate | Peg-In-Hole Success Rate |
|---|---|---|
| 0.0 (no guidance) | ~35% | ~20% |
| 0.2 | ~45% | ~30% |
| 0.4 | ~60% | ~60% |
| 0.6 | ~70% | ~80% |
| 0.8 | ~68% | ~75% |
| 1.0 | ~65% | ~70% |

Fig. 7: Guidance Ablation for UAM + Disturbance. Success rate first rises and then falls as $\lambda$ increases for both tasks, indicating that guidance that is too weak or too strong is suboptimal.

  • Observations:
    • Without guidance ($\lambda = 0$), performance collapses under disturbances, as expected.
    • As $\lambda$ increases, success rates steadily improve, indicating that the gradient feedback effectively steers trajectories towards feasible modes.
    • However, excessively large $\lambda$ values (e.g., above 0.6 or 0.8 for these tasks) lead to a slight decrease in performance. This suggests that over-constraining the denoising process can result in conservative, out-of-distribution behaviors that deviate too much from the original task intent of the diffusion policy.
  • Implication: This ablation demonstrates the importance of tuning the guidance scale to find the optimal balance between task performance and embodiment feasibility.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper UMI-on-Air introduces Embodiment-Aware Diffusion Policy (EADP), a novel framework designed to bridge the embodiment gap in visuomotor policies. By integrating gradient feedback from embodiment-specific low-level controllers into the diffusion sampling process of an embodiment-agnostic high-level policy, EADP enables plug-and-play, embodiment-aware trajectory adaptation at inference time. This method steers trajectory generation towards dynamically feasible modes tailored to the deployment robot without requiring retraining of the core policy. Extensive simulation and real-world experiments on aerial manipulation tasks confirm EADP's effectiveness, significantly improving success rates, efficiency, and robustness for highly constrained UAMs compared to unguided baselines, and demonstrating generalization to unseen environments.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose avenues for future research:

  • Temporal Mismatch: The current system has a temporal gap between policy inference (around 1-2 Hz) and high-frequency control (50 Hz).
    • Future Work: Alleviate this mismatch through streaming diffusion methods [50] or continuous guidance mechanisms for tighter integration between policy and controller.
  • Controller Generality: While demonstrated with IK and MPC controllers, the framework's full potential regarding controller types could be explored further.
    • Future Work: Extend the framework to learned or reinforcement learning-based controllers that use learned dynamics models, suggesting broader applicability beyond analytical controllers.

7.3. Personal Insights & Critique

  • Novelty and Impact: The core idea of using gradient feedback from a low-level controller to guide a high-level generative policy during inference is highly novel and impactful. It provides a robust, plug-and-play solution to a persistent problem in robotics: how to generalize skills across diverse embodiments without massive retraining or robot-specific data collection. This moves beyond traditional one-way policy execution to a more symbiotic relationship between high-level reasoning and low-level control.
  • Generality: The framework's ability to be applied to different controllers (IK for UR10e, MPC for UAM) highlights its versatility. The EE-centric perspective is a powerful abstraction that allows the same high-level policy to be adapted to vastly different physical platforms.
  • Addressing the "Black Box" Problem: Diffusion models can be somewhat opaque. By incorporating explicit controller feedback, EADP introduces a degree of interpretability and safety by ensuring that the generated trajectories are physically plausible. This is crucial for real-world deployment where safety and reliability are paramount.
  • Practicality: The use of UMI for data collection is a strong practical advantage, reducing the cost and complexity of acquiring large, diverse datasets. The inference-time adaptation means a single UMI-trained policy can be deployed on a fleet of heterogeneous robots with minimal additional effort.
  • Potential for Learned Controllers: The suggestion to extend EADP to learned controllers is particularly exciting. This could allow for even more sophisticated and adaptive embodiment-awareness, potentially handling highly complex and nonlinear dynamics where analytical controllers might struggle or be difficult to design.
  • Open Questions/Assumptions:
    • Computational Cost of Gradient Calculation: While MPC provides a differentiable tracking cost, computing and backpropagating this gradient during diffusion sampling could be computationally intensive, especially for complex MPC formulations. The paper implies this is feasible, but deeper analysis of the real-time performance implications would be valuable.
    • Robustness to Controller Errors: The effectiveness of EADP relies on the low-level controller accurately providing a meaningful tracking cost and gradient. If the controller itself is flawed or miscalibrated, the guidance could be misleading.
    • Guidance Scale $\lambda$ Tuning: The ablation study shows the importance of $\lambda$. While the paper finds optimal values, the process of finding the optimal $\lambda$ for a new embodiment or task remains an empirical one. Developing adaptive or learned guidance schedules could be a future step.
    • Transferability to Other Domains: While demonstrated for aerial manipulation, the general principle of EADP could potentially extend to other complex robotics domains, such as legged locomotion or underwater manipulation, where embodiment-specific constraints are equally critical.
