UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies
TL;DR Summary
UMI-on-Air trains embodiment-agnostic visuomotor policies from human demonstrations and uses an Embodiment-Aware Diffusion Policy (EADP) to steer them toward dynamically feasible trajectories on constrained robots such as aerial manipulators, improving success rates, efficiency, and robustness across embodiments.
Abstract
We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments-such as aerial manipulators-is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller's tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on multiple long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse-and even highly constrained-embodiments. All code, data, and checkpoints will be publicly released after acceptance. Result videos can be found at umi-on-air.github.io.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies
1.2. Authors
The paper lists the following authors (affiliations are not explicitly detailed in the provided abstract/introduction; the symbols indicate shared affiliations):
- Harsh Gupta†
- Xiaofeng Gao†
- Huy Ha‡
- Chuer Pan‡
- Muqing Cao†
- Dongjae Lee†
- Sebastian Scherer†
- Shuran Song‡
- Guanya Shi†

The † and ‡ symbols typically indicate shared affiliations, often corresponding to specific institutions or research groups. The acknowledgment section mentions support from the Robotics Institute Summer Scholars program and partial funding by NSF and the Toyota Research Institute, and notes that Guanya Shi holds appointments at Carnegie Mellon University and Amazon, suggesting the work was performed at Carnegie Mellon University.
1.3. Journal/Conference
The abstract states that code, data, and checkpoints will be released "after acceptance," and the references include arXiv preprints. This suggests the paper is itself a preprint (e.g., on arXiv), likely submitted to or accepted by a prominent robotics or machine learning conference (e.g., CoRL, ICRA, IROS, RSS) or journal.
1.4. Publication Year
Based on the arXiv preprint references, which are mostly from 2024 and 2025, the publication year for this work is likely 2024 or 2025.
1.5. Abstract
This paper introduces UMI-on-Air, a framework for deploying embodiment-agnostic manipulation policies with embodiment-aware guidance. The core idea is to train generalizable visuomotor policies using diverse human demonstrations collected with a handheld gripper called UMI (Universal Manipulation Interface). The challenge addressed is the mismatch between these policies and constrained robotic embodiments (like aerial manipulators), which often leads to out-of-distribution behaviors and poor execution due to differing control and robot dynamics.
To overcome this, the authors propose Embodiment-Aware Diffusion Policy (EADP). EADP couples a high-level UMI policy with a low-level, embodiment-specific controller during inference. This coupling integrates gradient feedback from the controller’s tracking cost into the diffusion sampling process. This feedback mechanism steers trajectory generation towards dynamically feasible modes specific to the deployment embodiment, enabling plug-and-play, embodiment-aware trajectory adaptation at test time without retraining the high-level policy.
The UMI-on-Air framework is validated on multiple long-horizon and high-precision aerial manipulation tasks. Results show improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. The paper also demonstrates deployment in previously unseen environments using UMI demonstrations collected in the wild, highlighting a practical path to scaling generalizable manipulation skills across diverse and highly constrained robotic embodiments.
1.6. Original Source Link
/files/papers/68ff72b483c43dcf2b92fa4a/paper.pdf
This is a local file path, implying the PDF was provided directly rather than a public URL. The publication status is preprint, as indicated by the abstract's mention of public release "after acceptance."
2. Executive Summary
2.1. Background & Motivation (Why)
The paper addresses the critical challenge of deploying generalizable manipulation skills to a wide range of robotic platforms, especially those with significant physical and dynamic constraints, such as unmanned aerial manipulators (UAMs).
- Core Problem: While visuomotor policies trained from human demonstrations (e.g., using the Universal Manipulation Interface (UMI)) can enable robots to learn diverse skills, these embodiment-agnostic policies often fail when transferred to robots with different control dynamics and physical limitations. This embodiment gap is particularly pronounced for UAMs, which face strict constraints such as stability under aerodynamic disturbances and underactuation.
- Why this problem is important:
  - Scalability: Current UAM applications often rely on specialized hardware and carefully engineered control strategies for specific tasks, limiting their scalability to novel manipulation goals or environments.
  - Data Collection Bottleneck: Collecting large-scale robot data directly with UAMs is challenging, expensive, and unsafe, due to complex hardware and difficult interfaces. UMI offers a solution by enabling human demonstration collection, but the transfer problem remains.
  - Generalizability: There is a need to bridge the gap between abstract, end-effector (EE)-centric policies and the concrete, embodiment-specific constraints of diverse robots.
- Paper's Novel Approach/Innovation: The paper proposes Embodiment-Aware Diffusion Policy (EADP), which introduces a novel two-way communication between a high-level embodiment-agnostic diffusion policy and a low-level embodiment-specific controller during inference. This allows the controller to actively guide the policy's trajectory generation process to produce dynamically feasible actions for the target robot.
2.2. Main Contributions / Findings (What)
The paper makes three primary contributions:
- Embodiment-Aware Diffusion Policy (EADP): Introduction of a novel framework that integrates embodiment-specific controller feedback (specifically, gradient feedback from the controller's tracking cost) directly into the high-level trajectory generation process of diffusion policies. This enables plug-and-play, embodiment-aware trajectory guidance at test time, without needing to retrain the high-level policy for each new embodiment.
- Simulation-Based Benchmark Suite: Development of a benchmark suite that systematically investigates the embodiment gap when deploying UMI demonstration data on robots with varying UMI-abilities (i.e., how well they can execute UMI policies). This provides a controlled environment for evaluating cross-embodiment deployment challenges.
- UMI-on-Air System Validation: Presentation of UMI-on-Air, a practical system that validates EADP on challenging aerial manipulation tasks. The system significantly outperforms embodiment-agnostic baselines in terms of success rates, robustness, and efficiency, and demonstrates generalization to previously unseen environments.

The key conclusion is that by introducing embodiment-aware guidance into the diffusion sampling process, UMI-on-Air effectively closes the gap between embodiment-agnostic policies and embodiment-specific constraints, making a wider range of robots more UMI-able and extending generalizable manipulation skills to highly constrained environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand UMI-on-Air, a beginner needs to grasp several key concepts:
- Universal Manipulation Interface (UMI): A portable, low-cost, handheld gripper equipped with a camera that allows humans to record diverse manipulation demonstrations in various environments. The key idea is to decouple demonstration collection from specific robots, enabling the training of embodiment-agnostic policies. The UMI typically records egocentric observations (from the gripper's perspective) and end-effector (EE) trajectories.
- Visuomotor Policies: Machine learning models that directly map visual observations (e.g., camera images) to motor commands (e.g., joint velocities, end-effector poses) for a robot. They enable robots to perform tasks based on what they "see."
- Embodiment-Agnostic vs. Embodiment-Aware:
  - Embodiment-Agnostic: A policy or demonstration designed to be independent of the specific physical characteristics (e.g., kinematics, dynamics, degrees of freedom, control limitations) of the robot that will execute it. UMI demonstrations are embodiment-agnostic because they are collected with a human's hand, not a specific robot.
  - Embodiment-Aware: A system or policy that explicitly considers and adapts to the physical characteristics and constraints of the specific robot (its embodiment) during execution. This paper's EADP is embodiment-aware.
- Diffusion Models / Diffusion Policies: A class of generative models that learn to reverse a gradual diffusion process (adding noise) to synthesize new data samples. In robotics, diffusion policies are trained to generate sequences of actions (trajectories) from noisy inputs, effectively learning the distribution of feasible actions given an observation. They are known for their ability to model multimodal action distributions.
- Model Predictive Control (MPC): An advanced control method that uses a dynamic model of the system to predict its future behavior over a finite horizon. It computes control actions by minimizing an objective (cost) function over that horizon, subject to system constraints; only the first step of the computed control sequence is applied, and the optimization is re-solved at the next time step (receding-horizon control). MPC is powerful for handling complex dynamics and constraints, making it suitable for UAMs.
- Inverse Kinematics (IK): In robotics, kinematics describes the motion of a robot without considering the forces that cause the motion. Forward kinematics computes the end-effector pose given the joint angles, while inverse kinematics computes the joint angles required to achieve a desired end-effector pose. IK is fundamental for controlling manipulators.
- Unmanned Aerial Manipulators (UAMs): Drones (unmanned aerial vehicles) equipped with robotic arms or grippers, allowing them to perform manipulation tasks while airborne. They offer unique advantages, such as reaching inaccessible areas, but face significant challenges related to stability, payload, power, and dynamic control in 3D space.
- Tracking Cost: A quantitative measure of how well a robot's controller can follow a given reference trajectory. A high tracking cost indicates difficulty in execution (e.g., due to dynamic infeasibility, joint limits, or control saturation), while a low cost indicates good alignment with the robot's capabilities.
- Gradient Feedback: Information about the rate of change of a function (e.g., the tracking cost) with respect to its inputs (e.g., a trajectory). In this paper, gradient feedback is used to nudge the generated trajectory in a direction that reduces the tracking cost, making it more feasible for the robot.
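To make the last two concepts concrete, here is a minimal sketch (assuming PyTorch, which the document does not mandate) of how a differentiable tracking cost yields gradient feedback on a candidate trajectory. The per-step velocity-limit cost below is an illustrative stand-in, not the paper's actual controller:

```python
import torch

def tracking_cost(traj, max_step=0.05):
    """Toy differentiable tracking cost: penalize per-step motion beyond a velocity limit."""
    steps = (traj[1:] - traj[:-1]).norm(dim=-1)      # per-step displacement magnitudes
    excess = torch.clamp(steps - max_step, min=0.0)  # violation of the per-step limit
    return (excess ** 2).sum()

# A straight-line 16-waypoint EE position trajectory that moves a bit too fast.
traj = torch.linspace(0.0, 1.0, 16).unsqueeze(-1).repeat(1, 3)
traj.requires_grad_(True)

cost = tracking_cost(traj)
cost.backward()                              # gradient feedback: d(cost)/d(trajectory)
nudged = (traj - 0.1 * traj.grad).detach()   # nudge waypoints toward lower tracking cost
print(f"cost={cost.item():.4f}, max |grad|={traj.grad.abs().max().item():.4f}")
```

The nudging step here is exactly the intuition behind the guidance mechanism described in the methodology section: the gradient tells the policy which direction makes the trajectory easier for the embodiment to track.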
3.2. Previous Works
The paper contextualizes its approach within existing research, particularly in Mobile Manipulation and Cross-embodiment Learning.
3.2.1. Mobile Manipulation
- Ground-Based Manipulation: Historically relied on task and motion planning and model-based control tailored to specific embodiments [9, 10, 11, 12, 13]. Recent trends show a shift towards learning-based systems using behavior cloning [14, 15, 16, 17, 18] and reinforcement learning (RL) [19, 20, 21, 22], sometimes combining RL for locomotion with behavior cloning for manipulation [2, 23]. These approaches have demonstrated success but often face challenges in generalization and data collection.
- Aerial Manipulation: A subset of mobile manipulation with distinct challenges (e.g., stability, underactuated dynamics, strict payload constraints, disturbances). Previous work often used specialized hardware and engineered control strategies for specific tasks [3, 4, 5, 24, 25, 26, 27, 28, 29, 30, 31, 32], limiting scalability. More recent research has focused on general-purpose frameworks like EE-centric control interfaces [6] to abstract away embodiment-specific dynamics. However, robust and generalizable policies for UAMs still require extensive data, which is hard to collect directly.
3.2.2. Cross-embodiment Learning
This area focuses on enabling policies trained on one robot or demonstration source to work on another.
- Large-scale Cross-embodiment Datasets: Some approaches use large datasets collected from various robotic embodiments for pretraining, followed by finetuning for specific hardware [33, 34, 35, 36]. These methods assume a unified action space and primarily apply to robots with similar morphology (e.g., robotic arms). They often require extensive embodiment-specific finetuning.
- Human-embodiment Demonstrations (UMI): An alternative, more scalable strategy involves collecting demonstration data from humans using intuitive handheld interfaces like UMI [1, 37, 38, 39, 40]. UMI minimizes the embodiment gap by aligning observation and action spaces between the human's handheld gripper and the robot. While effective for data collection, policies trained this way still internalize action constraints reflective of human embodiment and may not account for the distinct dynamics and physical limitations of robots like UAMs, leading to unreliable execution [2, 6].
  - Example (from UMI [1]): The UMI system (as illustrated in Figure 4 here) is a handheld device that records human manipulation actions. A policy trained on UMI data learns to output end-effector trajectories. If a robot cannot physically execute these trajectories due to its own constraints, the policy will fail.
- Embodiment-Aware Architectures: Some works incorporate embodiment information directly into policy representations, e.g., using Graph Neural Networks (GNNs) to model robot structures [41, 42] or transformer-based models [43, 44, 45, 46]. These often achieve zero-shot generalization within RL contexts through extensive embodiment randomization during training but are less common in imitation learning due to data scarcity.
3.3. Technological Evolution
The field has evolved from highly specialized, model-based control for individual robots to learning-based approaches aimed at generalization. The UMI system marked a significant step in democratizing data collection, moving from expensive robot-specific data to human demonstrations. However, the embodiment gap—the challenge of transferring human-demonstrated skills to robots with different physical capabilities—remained. The present paper builds on UMI by introducing a mechanism to bridge this gap not by modifying the embodiment-agnostic policy during training, but by making it embodiment-aware during inference, using feedback from low-level controllers. This represents an evolution towards more adaptive and robust cross-embodiment learning systems.
3.4. Differentiation
The proposed Embodiment-Aware Diffusion Policy (EADP) and the UMI-on-Air framework differentiate themselves from prior work in several key ways:
- Inference-Time Embodiment-Awareness: Unlike embodiment-aware architectures that require training with extensive embodiment randomization, or cross-embodiment finetuning methods that demand robot-specific data, EADP injects embodiment-awareness during inference. The core diffusion policy remains embodiment-agnostic (trained once on UMI data) and does not need retraining for new robots.
- Two-Way Communication: Standard UMI deployments rely on one-way communication, where the policy outputs trajectories that the robot controller attempts to follow. EADP introduces two-way communication by allowing the low-level controller to provide gradient feedback to the high-level policy during the diffusion sampling process. This feedback actively guides the trajectory generation, ensuring the output is dynamically feasible for the target robot.
- Leveraging Diffusion Policy Multimodality: EADP biases the multimodal action generation capabilities of diffusion policies (derived from diverse human data) towards strategies that align best with the embodiment's capabilities. This is a more sophisticated adaptation than simply clamping actions or adding post-hoc filters.
- Plug-and-Play Adaptation: The method enables plug-and-play deployment across diverse embodiments without requiring specific finetuning datasets or retraining of the high-level policy. This significantly reduces deployment overhead.
4. Methodology
4.1. Principles
The core principle behind UMI-on-Air is to bridge the embodiment gap by making an embodiment-agnostic high-level visuomotor policy aware of the physical constraints and dynamics of the target robot during inference. This is achieved through a novel two-way feedback mechanism where the low-level embodiment-specific controller provides gradient feedback on trajectory feasibility to guide the high-level policy's trajectory generation. The intuition is that if a generated action sequence is difficult for the robot to execute, the controller can "tell" the policy to adjust its output to a more achievable alternative.
4.2. Steps & Procedures
The UMI-on-Air framework involves several key components and steps, as illustrated in Figure 3.

Fig. 3: Embodiment-Aware Diffusion Policy. UMI data is used to train an embodiment-agnostic Diffusion Policy. During denoising, the visual input and the noisy action sequence are denoised by the policy, while gradient feedback from the MPC's tracking cost steers the generated trajectory toward lower tracking cost. The guided action sequence is then tracked by the MPC at 50 Hz.
4.2.1. Data Collection Interface
The framework begins with collecting human demonstrations using an adapted Universal Manipulation Interface (UMI) system.
- Core UMI Paradigm: A lightweight, handheld gripper with a wrist-mounted camera (for egocentric observation) and a shared end-effector (EE) action interface. This setup allows for in-the-wild data collection without needing actual robot hardware, ensuring alignment between training and deployment observations (camera-gripper configuration).
- Modifications for UAM Deployment (Figure 4):
  - Camera: Replaced the heavier GoPro with a lightweight OAK-1 W camera to reduce payload while maintaining a wide field of view.
  - Gripper Geometry: Downsized the finger geometry to reduce the inertia of the EE.
  - Tracking: Used an iPhone-based visual-inertial SLAM system for more accurate 6-DoF EE pose tracking during data collection.

Fig. 4: Data-collection vs. deployment viewpoints. Left: the handheld gripper with a robust iPhone SLAM system used for data collection; right: the lightweight deployment camera on the robot. Both share a wide-field-of-view observation space; by aligning the camera-gripper configuration at collection and deployment time, the embodiment gap is minimized.
- Demonstration Data: Each demonstration consists of synchronized egocentric RGB images, 6-DoF EE pose trajectories, and continuous gripper width. These form input-output pairs for policy learning.
  - Input: An observation window (images, relative EE poses, gripper widths).
  - Output: A horizon of future actions (relative EE trajectories, gripper widths).
- Policy Training: A conditional UNet-based diffusion policy [47] is trained on these input-output pairs. This diffusion policy learns to generate multimodal action sequences from the collected UMI demonstrations.
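For intuition, the input-output pairing described above can be pictured with a small, hypothetical data container; the field names and array shapes below are illustrative only, since the document does not publish the exact data schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UMITrainingPair:
    """One training pair built from a UMI demonstration (illustrative schema)."""
    rgb_obs: np.ndarray         # (T_obs, H, W, 3) egocentric wrist-camera images
    ee_pose_obs: np.ndarray     # (T_obs, 7) relative EE poses: xyz + quaternion
    gripper_obs: np.ndarray     # (T_obs,)  gripper widths over the observation window
    ee_pose_action: np.ndarray  # (T_act, 7) future relative EE pose trajectory
    gripper_action: np.ndarray  # (T_act,)  future gripper widths

# Example with a 2-step observation window and a 16-step action horizon.
pair = UMITrainingPair(
    rgb_obs=np.zeros((2, 224, 224, 3), dtype=np.uint8),
    ee_pose_obs=np.zeros((2, 7)),
    gripper_obs=np.zeros(2),
    ee_pose_action=np.zeros((16, 7)),
    gripper_action=np.zeros(16),
)
print(pair.ee_pose_action.shape)
```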
4.2.2. End-Effector-Centric Controllers
A crucial component for embodiment-agnostic policies is a controller that can interpret task-space reference trajectories (positions and orientations over a horizon $H$) and realize them using embodiment-specific actions while respecting constraints. The paper adopts an EE-centric perspective: the high-level policy outputs EE reference trajectories, and the controller handles the physical execution. The controllers also expose a tracking cost that quantifies execution difficulty.
- Tracking Cost $\mathcal{L}_{\mathrm{track}}(a)$: This function evaluates how well a given reference trajectory $a$ can be executed by a particular controller. A high tracking cost indicates segments that are hard to follow (e.g., due to dynamic infeasibility, underactuation, or control saturation), while a low cost indicates better alignment with the embodiment's capabilities.
a) Inverse Kinematics with Velocity Limits (for Tabletop Manipulators)
For simpler robots (e.g., UR10e), a lightweight IK-based controller is used:
- At each step, the desired waypoint $p^{\mathrm{ref}}_{t+1}$ (position and orientation) is mapped to a robot configuration (e.g., mobile base pose, arm joint angles) using an inverse kinematics function.
- Velocity Limits: The change in configuration is clipped to a per-step velocity bound, accounting for hardware velocity limits and the controller timestep.
- Forward Kinematics: The forward kinematics reconstructs the trajectory waypoint from the robot configuration.
- Tracking Cost (Equations 1 & 2): The tracking cost is defined as the squared error between the reconstructed and reference trajectories, which yields a differentiable tracking cost:

$$q_{t+1} = q_t + \mathrm{clip}\big(\mathrm{IK}(q_t, p^{\mathrm{ref}}_{t+1}),\ \Delta q_{\min},\ \Delta q_{\max}\big)$$

$$\mathcal{L}_{\mathrm{track}}(a) = \sum_{t=1}^{H} \big\| \mathrm{FK}(q_t) - p^{\mathrm{ref}}_t \big\|^2$$

  - $q_t$: Current robot configuration (joint angles).
  - $q_{t+1}$: Next robot configuration.
  - $\mathrm{IK}(q_t, p^{\mathrm{ref}}_{t+1})$: Inverse kinematics function that computes the desired configuration change to reach EE pose $p^{\mathrm{ref}}_{t+1}$ from $q_t$.
  - $p^{\mathrm{ref}}_t$: Reference EE pose at time $t$ in the trajectory $a$.
  - $\mathrm{clip}(\cdot,\ \Delta q_{\min},\ \Delta q_{\max})$: Function that clips values to the specified minimum and maximum bounds.
  - $\Delta q_{\min}, \Delta q_{\max}$: Lower and upper bounds on the configuration change, representing velocity limits.
  - $\mathcal{L}_{\mathrm{track}}(a)$: The total tracking cost for the entire trajectory $a$.
  - $H$: The horizon length of the trajectory.
  - $\mathrm{FK}(q_t)$: Forward kinematics function that computes the EE pose corresponding to robot configuration $q_t$.
  - $\|\cdot\|^2$: Squared Euclidean norm, measuring the error between the reconstructed and reference EE poses.
b) Model Predictive Controller (for UAMs)
For robots with complex dynamics like UAMs, a richer controller is used, specifically an EE-centric whole-body MPC from [6]. This controller coordinates the drone and manipulator motion.
- State and Control Variables (Equation 3):
  - $x = (p, R, \nu, q)$: State vector of the UAM.
    - $p$: Position of the UAM body.
    - $R$: Orientation (rotation matrix) of the UAM body.
    - $\nu$: Body velocity (linear and angular components).
    - $q \in \mathbb{R}^{n_q}$: Manipulator joint angles ($n_q$ is the number of manipulator joints).
  - $u = (w, q^{\mathrm{cmd}})$: Control input vector.
    - $w$: Commanded wrench (forces and torques) for the UAM's base.
    - $q^{\mathrm{cmd}}$: Commanded manipulator joint angles.
- Cost Functions for Errors (Equation 4): Error terms are defined against reference values (denoted by $(\cdot)^{\mathrm{ref}}$):

$$e_p = p - p^{\mathrm{ref}}, \quad e_R = \tfrac{1}{2}\big(R^{\mathrm{ref}\top} R - R^\top R^{\mathrm{ref}}\big)^{\vee}, \quad e_\nu = \nu - \nu^{\mathrm{ref}}, \quad e_q = q - q^{\mathrm{ref}}, \quad e_u = u - u^{\mathrm{ref}}$$

  - $(\cdot)^{\mathrm{ref}}$: Reference values for position, orientation, velocity, joint angles, and control inputs.
  - $e_p, e_R, e_\nu, e_q, e_u$: Error terms for position, orientation, velocity, joint angles, and control inputs, respectively.
  - $(\cdot)^{\vee}$: The vee-operator that maps a skew-symmetric matrix (used for the orientation error) to $\mathbb{R}^3$.
- Optimal Control Sequence (Equations 5a-5c): The MPC solves a finite-horizon constrained optimization problem to find the optimal control sequence:

$$\min_{u_{0:N-1}} \ \ell_N(x_N) + \sum_{k=0}^{N-1} \ell(x_k, u_k) \quad \text{s.t.} \quad x_{k+1} = f(x_k, u_k), \quad x_0 = \hat{x}_0, \quad x_k \in \mathcal{X}, \quad u_k \in \mathcal{U}$$

  - $u_{0:N-1}$: The optimal control sequence to be found.
  - $\ell_N$: Terminal cost function, penalizing errors at the end of the horizon.
  - $\ell$: Stage cost function, penalizing errors at each step within the horizon.
  - $N$: The MPC horizon.
  - $f$: System dynamics model, describing how the state evolves under the control inputs.
  - $\hat{x}_0$: Initial state, typically the current estimated state.
  - $\mathcal{X}$: State constraints, ensuring the state remains within feasible bounds.
  - $\mathcal{U}$: Actuation bounds, representing physical limits on control inputs.
  - The costs $\ell$ and $\ell_N$ are quadratic functions of the error terms defined in Equation 4, using hand-tuned positive definite weight matrices $Q$.
- Tracking Cost for MPC (Equation 6): The MPC also exposes a tracking cost that quantifies how well the reference trajectory can be followed under its constraints:

$$\mathcal{L}_{\mathrm{track}}(a) = \sum_{k=0}^{N} e_{p,k}^\top Q_p\, e_{p,k} + e_{R,k}^\top Q_R\, e_{R,k}$$

  - $\mathcal{L}_{\mathrm{track}}(a)$: The tracking cost for the trajectory $a$.
  - $e_{p,k}, e_{R,k}$: Position and orientation error terms at time $k$, from Equation 4.
  - $Q_p, Q_R$: Positive definite weight matrices for position and orientation errors, respectively.
4.2.3. Embodiment-Aware Diffusion Guidance
This is the central innovation where gradient feedback from the low-level controller is integrated into the diffusion policy's sampling process during inference.
- Gradient Computation: The gradient of the tracking cost with respect to the reference trajectory, $\nabla_{a}\mathcal{L}_{\mathrm{track}}(a)$, is computed. This gradient indicates how sensitive the tracking error is to changes in the reference trajectory and, crucially, how to nudge the trajectory to make it more trackable.
- DDIM Update Step (Equation 7): The diffusion policy uses a standard DDIM (Denoising Diffusion Implicit Models) [48] update step to progressively denoise a noisy trajectory sample $a^k$:

$$a^{k-1} = \psi_k\big(a^k,\ \pi_\theta(a^k, k \mid o)\big)$$

  - $a^k$: Noisy reference trajectory sample at diffusion timestep $k$.
  - $a^{k-1}$: The denoised trajectory sample for the next step.
  - $\psi_k$: The DDIM update function for step $k$.
  - $\pi_\theta$: The trained denoiser (part of the diffusion policy), which predicts the noise component to remove from $a^k$ given observation data $o$ and timestep $k$.
- Guidance Step (Equation 8): Before the DDIM denoising, a guidance step is applied to the trajectory sample to steer it towards feasible modes, similar to classifier-based guidance [49]:

$$\bar{a}^k = a^k - \lambda\, \omega_k\, \nabla_{a^k}\mathcal{L}_{\mathrm{track}}(a^k)$$

  - $\bar{a}^k$: The nudged trajectory sample after guidance.
  - $a^k$: Original noisy trajectory sample at timestep $k$.
  - $\lambda$: A global guidance scale (a hyperparameter) that controls the strength of the guidance.
  - $\omega_k$: A guidance scheduler given by the cumulative noise schedule. This makes the guidance time-dependent: weaker during early, very noisy steps and stronger during later denoising steps when the trajectory is more defined.
  - $\nabla_{a^k}\mathcal{L}_{\mathrm{track}}(a^k)$: The gradient feedback from the controller's tracking cost.
-
Algorithm 1: Embodiment-Aware DDIM Sampling
```
Algorithm 1: Embodiment-Aware DDIM Sampling
  1: Initialize a^K ~ N(0, I)                         ▷ start from noise
  2: for k = K, ..., 1 do
  3:     ā^k ← a^k − λ · ω_k · ∇_{a^k} L_track(a^k)    ▷ guidance step (Eq. 8)
  4:     a^{k−1} ← ψ_k(ā^k, π_θ(ā^k, k | o))           ▷ DDIM update (Eq. 7)
  5: end for
  6: return a^0                                        ▷ guided reference trajectory
```
This algorithm describes the iterative process:
- Start with a noisy sample $a^K$ (initialized from a standard normal distribution).
- For each denoising step $k$ (from $K$ down to 1):
  - Apply the guidance step (Equation 8) to $a^k$ to obtain the nudged sample $\bar{a}^k$.
  - Perform the DDIM update (Equation 7) on the nudged sample $\bar{a}^k$ to generate $a^{k-1}$.
- Return the final denoised reference trajectory $a^0$.

This process keeps the diffusion policy training embodiment-agnostic, while at deployment time embodiment-specific controllers inject real-time constraints and feasibility gradients to robustly adapt trajectories.
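The sketch below mirrors Algorithm 1 with placeholder components (assuming PyTorch): `pi_theta`, `ddim_update`, the noise schedule, and the tracking cost are stand-ins rather than the paper's trained denoiser or MPC cost; only the control flow of a guidance step followed by a DDIM update reflects the method:

```python
import torch

K = 16                                      # number of denoising steps
alpha_bar = torch.linspace(0.99, 0.01, K)   # cumulative noise schedule (placeholder)
lam = 0.6                                   # global guidance scale lambda

def tracking_cost(a):
    # Placeholder differentiable tracking cost: penalize large per-step motion.
    return ((a[1:] - a[:-1]) ** 2).sum()

def pi_theta(a, k, obs):
    # Placeholder denoiser: a trained diffusion policy would predict the noise here.
    return 0.1 * torch.randn_like(a)

def ddim_update(a, eps, k):
    # Placeholder DDIM step psi_k: re-noise the predicted clean sample to level k-1.
    a0_pred = (a - torch.sqrt(1 - alpha_bar[k]) * eps) / torch.sqrt(alpha_bar[k])
    if k == 0:
        return a0_pred
    return torch.sqrt(alpha_bar[k - 1]) * a0_pred + torch.sqrt(1 - alpha_bar[k - 1]) * eps

def embodiment_aware_sampling(obs, horizon=16, dim=3):
    a = torch.randn(horizon, dim)                            # a^K ~ N(0, I)
    for k in range(K - 1, 0, -1):
        a = a.detach().requires_grad_(True)
        grad = torch.autograd.grad(tracking_cost(a), a)[0]   # controller gradient feedback
        a_bar = a - lam * alpha_bar[k] * grad                # guidance step (Eq. 8)
        a = ddim_update(a_bar, pi_theta(a_bar, k, obs), k)   # DDIM update (Eq. 7)
    return a.detach()                                        # a^0: guided reference trajectory

print(embodiment_aware_sampling(obs=None).shape)
```

Using the cumulative noise schedule as the guidance scheduler makes the correction weak on very noisy early samples and stronger as the trajectory takes shape, which is the time-dependence the text describes.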
4.3. Mathematical Formulas & Key Details
All mathematical formulas are described above in the Steps & Procedures section for End-Effector-Centric Controllers and Embodiment-Aware Diffusion Guidance. Each formula is reproduced, and its symbols are explained in detail there.
5. Experimental Setup
The experiments aim to evaluate how embodiment-aware guidance (EADP) improves the deployment of embodiment-agnostic visuomotor policies, focusing on the embodiment gap and real-world transfer to UAMs.
5.1. Datasets
The primary data source is human demonstrations collected using a UMI gripper.
- Simulation Data: Motion capture on a UMI gripper in a MuJoCo simulation environment.
  - Mirrors the real-world handheld demonstration process.
  - Used to train an embodiment-agnostic Diffusion Policy (DP), which serves as the base policy.
  - Concrete Example of Data Sample: A demonstration consists of a sequence of egocentric RGB images (e.g., a continuous video stream from the wrist-mounted camera), 6-DoF end-effector pose trajectories (position x, y, z and orientation expressed as a quaternion or rotation matrix for each timestamp), and gripper width (a scalar value indicating how open the gripper is). For example, a human demonstrating "picking up a can" would generate visual frames of the can, the gripper approaching and closing, and the EE pose moving to the can's location, grasping it, and then lifting it.
- Real-World Data: UMI demonstrations collected in varied real-world settings for cross-environment generalization tests (e.g., the peg-in-hole task).
  - These UMI demonstrations are collected "in the wild," meaning they are not constrained to a specific lab or environment.
- Justification: UMI datasets are chosen because they offer a portable, low-cost way to record diverse and large-scale demonstrations, addressing the data collection bottleneck of real robots, especially UAMs. They enable training embodiment-agnostic policies that can then be adapted by EADP.
5.2. Evaluation Metrics
The primary evaluation metric used in the paper is Success Rate.
- Conceptual Definition: Success Rate measures the percentage of trials (attempts) in which the robot successfully completes a given task according to predefined criteria. It is a direct and intuitive measure of task performance. For instance, in a pick-and-place task, success might mean picking up the object and placing it in the target location within a specified time and without collisions.
- Mathematical Formula: The paper does not provide a mathematical formula for Success Rate. The canonical formula is:

$$\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$$

  - Number of Successful Trials: The count of attempts in which the robot completed the task successfully.
  - Total Number of Trials: The total count of attempts made for a given task and condition.
  - The factor of 100% expresses the result as a percentage.
5.3. Baselines
The proposed Embodiment-Aware Diffusion Policy (EADP) is compared against various baselines to assess its effectiveness.
- Base Policy (DP, Diffusion Policy): The embodiment-agnostic policy trained solely on UMI demonstrations. It serves as the direct baseline to show the impact of EADP's guidance. When $\lambda = 0$ (no guidance), EADP reduces to DP.
- Embodiments for Comparison: The trained policies (DP and EADP) are deployed across different robotic embodiments, reflecting varying levels of control fidelity and UMI-ability:
  - Oracle: A theoretical flying gripper that perfectly tracks the policy-generated trajectory. This represents the upper bound on achievable performance, with no embodiment gap; it highlights the performance ceiling if physical constraints were non-existent.
  - UR10e: A fixed-base 6-DoF manipulator (a common industrial robot arm) using the IK-based velocity-limited controller (§ III-B). This embodiment is considered relatively UMI-able due to its precise control and kinematics.
  - UAM (Unmanned Aerial Manipulator): An aerial manipulator using the more complex MPC controller (§ III-B). This is the primary target embodiment due to its stringent physical and control constraints.
    - UAM (no disturbance): The UAM operating under ideal conditions without external perturbations.
    - UAM (+Disturbance): The UAM with injected noise (simulating the ~3 cm average tracking error observed on hardware when hovering). This tests robustness against real-world challenges.
- Justification for Baselines: These baselines systematically probe the embodiment gap. Comparing DP to Oracle quantifies the inherent challenge of deployment. Comparing DP to EADP on the UR10e and UAM (with and without disturbance) directly measures EADP's ability to mitigate this gap across different levels of robot constraints.
6. Results & Analysis
6.1. Core Results
The experimental results demonstrate that EADP consistently improves the deployment of UMI-trained policies by reducing the embodiment gap across various tasks and embodiments, especially for highly constrained UAMs.
6.1.1. Simulation Experiments
The MuJoCo simulation benchmark provides a controlled environment to evaluate the embodiment gap and EADP's mitigation capabilities.
- Tasks: Four simulation environments are used, covering both long-horizon and precision tasks:
  - Open-And-Retrieve: Slide open a cabinet, pick up a can, and place it on top.
  - Peg-In-Hole: Insert a 1 cm peg into a 2 cm square hole.
  - Rotate-Valve: Rotate a valve to a specified orientation.
  - Pick-And-Place: Lift a can and place it in a bowl.
- Embodiment Gap Quantification:
  - The UR10e (a fixed-base arm) shows performance close to the Oracle (perfect tracker), confirming its high UMI-ability.
  - The UAM exhibits a much larger embodiment gap, particularly under disturbances, highlighting the difficulty of executing embodiment-agnostic trajectories on aerial systems.
- EADP's Performance: EADP consistently reduces this embodiment gap.
  - UR10e: Modest but noticeable improvements, especially on difficult tasks.
  - UAM: Substantial boost in performance, recovering over 9% on average without disturbances and over 20% with disturbances.
  - Even in the most constrained setting (UAM + Disturbance), EADP narrows the gap towards Oracle performance, validating that embodiment-aware guidance enables policies to adapt trajectories to dynamic feasibility.

The following table shows the results from Figure 6, comparing success rates of DP and EADP across tasks and embodiments:

| Task | Oracle DP | Oracle EADP | UR10e DP | UR10e EADP | UAM (no dist.) DP | UAM (no dist.) EADP | UAM (+Dist.) DP | UAM (+Dist.) EADP |
|---|---|---|---|---|---|---|---|---|
| Open-And-Retrieve | 95% | 95% | 90% | 92% | 70% | 85% | 35% | 70% |
| Peg-In-Hole | 100% | 100% | 98% | 99% | 90% | 95% | 20% | 80% |
| Rotate-Valve | 95% | 95% | 92% | 94% | 75% | 88% | 40% | 75% |
| Pick-And-Place | 100% | 100% | 98% | 99% | 85% | 92% | 50% | 85% |
| Average | 97.5% | 97.5% | 94.5% | 96% | 80% | 90% | 36.25% | 77.5% |
Fig. 6: Simulation results. Success rates of DP and the proposed EADP across the four tasks (Open-And-Retrieve, Peg-In-Hole, Rotate-Valve, Pick-And-Place) and on average, for each embodiment; EADP outperforms DP on most tasks, with the largest gains on the less UMI-able embodiments.
- Task-Specific Insights:
  - Open-and-Retrieve: This long-horizon task highlights challenges such as gripper jams or UAM overshoots due to momentum. EADP mitigates these by steering trajectories towards safer, more in-distribution motions.
  - Peg-in-Hole: This precision-sensitive task is a stress test for disturbance robustness. EADP significantly improves reliability by rejecting infeasible pegging attempts and timing insertions when feasible, even correcting for high noise (Figure 5).
Fig. 5: Policy adaptation across embodiments. Trajectory samples under UR10e, Oracle, and UAM visual observations, comparing kinematic and dynamic feasibility; with guidance, the sampled trajectories are steered toward more dynamically feasible motions under perturbation.
6.1.2. Real-World Experiments
These experiments validate EADP's transferability to real-world UAMs on challenging tasks.
- Tasks:
  - Peg-in-Hole: On a 4 cm hole with a 2 cm peg. DP failed due to dropped pegs or timeouts; EADP succeeded in all 5 trials (5/5), demonstrating improved timing and avoidance of premature release.
  - Pick-and-Place (Lemon Harvesting): EADP completed 4/5 trials successfully, robustly handling aerial pick-and-place. The single failure was due to target identification (an unripe lemon was selected).
  - Lightbulb Insertion: A long-horizon task (over 3 minutes). EADP succeeded in all 3 trials (3/3), showing its ability to maintain precision and robustness over extended durations.

Figure 8 visualizes the real-world results for DP and EADP across tasks.
Fig. 8: Real-world results for DP and EADP across manipulation tasks. Each column is a separate trial; colored borders indicate success or failure. EADP performs better overall.
Cross-Environment Generalization: EADP demonstrated generalization to unseen environments for the peg-in-hole task (with a 5 cm hole), achieving 4/5 successes despite increased distractions. The only failure was due to a collision with the enclosure.
6.2. Ablations / Parameter Sensitivity
The paper includes an ablation study on the global guidance scale $\lambda$, which controls the trade-off between task-oriented trajectory generation and controller-feasible execution. This is shown in Figure 7.
The following table shows the approximate success rates from Figure 7, illustrating the effect of $\lambda$ on the UAM (+Disturbance) for two tasks:
| Guidance Scale (λ) | Open-And-Retrieve | Peg-In-Hole |
|---|---|---|
| 0.0 (No Guidance) | ~35% | ~20% |
| 0.2 | ~45% | ~30% |
| 0.4 | ~60% | ~60% |
| 0.6 | ~70% | ~80% |
| 0.8 | ~68% | ~75% |
| 1.0 | ~65% | ~70% |

Fig. 7: Guidance-scale ablation for the UAM under disturbance. Success rate first increases and then decreases as λ grows, showing that overly weak or overly strong guidance is suboptimal.
- Observations:
  - Without guidance ($\lambda = 0$), performance collapses under disturbances, as expected.
  - As $\lambda$ increases, success rates steadily improve, indicating that the gradient feedback effectively steers trajectories towards feasible modes.
  - However, larger values of $\lambda$ (e.g., above 0.6 or 0.8 for these tasks) can lead to a slight decrease in performance. This suggests that over-constraining the denoising process can result in conservative, out-of-distribution behaviors that deviate too much from the original task intent of the diffusion policy.
- Implication: This ablation demonstrates the importance of tuning the guidance scale $\lambda$ to find the right balance between task performance and embodiment feasibility.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper UMI-on-Air introduces Embodiment-Aware Diffusion Policy (EADP), a novel framework designed to bridge the embodiment gap in visuomotor policies. By integrating gradient feedback from embodiment-specific low-level controllers into the diffusion sampling process of an embodiment-agnostic high-level policy, EADP enables plug-and-play, embodiment-aware trajectory adaptation at inference time. This method steers trajectory generation towards dynamically feasible modes tailored to the deployment robot without requiring retraining of the core policy. Extensive simulation and real-world experiments on aerial manipulation tasks confirm EADP's effectiveness, significantly improving success rates, efficiency, and robustness for highly constrained UAMs compared to unguided baselines, and demonstrating generalization to unseen environments.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose avenues for future research:
- Temporal Mismatch: The current system has a temporal gap between policy inference and high-frequency control (50 Hz).
  - Future Work: Alleviate this mismatch through streaming diffusion methods [50] or continuous guidance mechanisms for tighter integration between policy and controller.
- Controller Generality: While demonstrated with IK and MPC controllers, the framework's full potential across controller types remains to be explored.
  - Future Work: Extend the framework to learned or reinforcement-learning-based controllers that use learned dynamics models, suggesting broader applicability beyond analytical controllers.
7.3. Personal Insights & Critique
- Novelty and Impact: The core idea of using gradient feedback from a low-level controller to guide a high-level generative policy during inference is highly novel and impactful. It provides a robust, plug-and-play solution to a persistent problem in robotics: how to generalize skills across diverse embodiments without massive retraining or robot-specific data collection. This moves beyond traditional one-way policy execution to a more symbiotic relationship between high-level reasoning and low-level control.
- Generality: The framework's applicability to different controllers (IK for the UR10e, MPC for the UAM) highlights its versatility. The EE-centric perspective is a powerful abstraction that allows the same high-level policy to be adapted to vastly different physical platforms.
- Addressing the "Black Box" Problem: Diffusion models can be somewhat opaque. By incorporating explicit controller feedback, EADP introduces a degree of interpretability and safety by ensuring that the generated trajectories are physically plausible. This is crucial for real-world deployment where safety and reliability are paramount.
- Practicality: The use of UMI for data collection is a strong practical advantage, reducing the cost and complexity of acquiring large, diverse datasets. The inference-time adaptation means a single UMI-trained policy can be deployed on a fleet of heterogeneous robots with minimal additional effort.
- Potential for Learned Controllers: The suggestion to extend EADP to learned controllers is particularly exciting. This could allow for even more sophisticated and adaptive embodiment-awareness, potentially handling highly complex and nonlinear dynamics where analytical controllers might struggle or be difficult to design.
- Open Questions/Assumptions:
  - Computational Cost of Gradient Calculation: While MPC provides a differentiable tracking cost, computing and backpropagating this gradient during diffusion sampling could be computationally intensive, especially for complex MPC formulations. The paper implies this is feasible, but a deeper analysis of the real-time performance implications would be valuable.
  - Robustness to Controller Errors: The effectiveness of EADP relies on the low-level controller providing a meaningful tracking cost and gradient. If the controller itself is flawed or miscalibrated, the guidance could be misleading.
  - Guidance Scale Tuning: The ablation study shows the importance of $\lambda$. While the paper finds effective values, identifying the optimal $\lambda$ for a new embodiment or task remains an empirical process. Developing adaptive or learned guidance schedules could be a future step.
  - Transferability to Other Domains: While demonstrated for aerial manipulation, the general principle of EADP could potentially extend to other complex robotics domains, such as legged locomotion or underwater manipulation, where embodiment-specific constraints are equally critical.