
UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers

Published: 07/15/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

UMI-on-Legs integrates handheld data collection with simulation-based whole-body control, enabling zero-shot cross-embodiment deployment and achieving over 70% success in diverse dynamic manipulation tasks on quadrupeds.

Abstract

We introduce UMI-on-Legs, a new framework that combines real-world and simulation data for quadruped manipulation systems. We scale task-centric data collection in the real world using a hand-held gripper (UMI), providing a cheap way to demonstrate task-relevant manipulation skills without a robot. Simultaneously, we scale robot-centric data in simulation by training a whole-body controller for task-tracking without task simulation setups. The interface between these two policies is end-effector trajectories in the task frame, inferred by the manipulation policy and passed to the whole-body controller for tracking. We evaluate UMI-on-Legs on prehensile, non-prehensile, and dynamic manipulation tasks, and report over 70% success rate on all tasks. Lastly, we demonstrate the zero-shot cross-embodiment deployment of a pre-trained manipulation policy checkpoint from prior work, originally intended for a fixed-base robot arm, on our quadruped system. We believe this framework provides a scalable path towards learning expressive manipulation skills on dynamic robot embodiments. Please check out our website for robot videos, code, and data: https://umi-on-legs.github.io

In-depth Reading

1. Bibliographic Information

1.1. Title

UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers

1.2. Authors

  • Huy Ha (* Equal Contribution, Stanford University, Columbia University)
  • Yihuai Gao (* Equal Contribution, Columbia University)
  • Zipeng Fu (Columbia University)
  • Jie Tan (Google DeepMind)
  • Shuran Song (Stanford University, Columbia University)

1.3. Journal/Conference

This paper was published on arXiv, indicating it is a preprint and has not yet undergone formal peer review for a specific journal or conference. However, the authors and affiliations (Stanford, Columbia, Google DeepMind) suggest a high-caliber research background. The work builds upon UMI and Diffusion Policy, which were presented at RSS, a top-tier robotics conference.

1.4. Publication Year

2024

1.5. Abstract

The paper introduces UMI-on-Legs, a novel framework designed for quadruped manipulation systems. It integrates real-world data, collected economically using a hand-held gripper (UMI) for task demonstrations, with simulation-trained whole-body controllers (WBCs). The core innovation lies in the interface: end-effector trajectories in the task frame, inferred by a manipulation policy (trained from UMI data) and tracked by the WBC. This design allows for scaling task-centric data collection without a robot and robot-centric data in simulation without complex task setups. The framework is evaluated on prehensile, non-prehensile, and dynamic manipulation tasks, achieving over 70% success rates. Notably, it demonstrates zero-shot cross-embodiment deployment of a pre-trained manipulation policy originally for a fixed-base robot arm onto their quadruped system. The authors propose UMI-on-Legs as a scalable path for learning expressive manipulation skills on dynamic robot embodiments.

2. Executive Summary

2.1. Background & Motivation (Why)

The field of robot learning faces a dilemma: real-world data collection provides direct task relevance but is bottlenecked by hardware costs, safety, and robot-specific data, while simulation offers safe exploration and infinite resets but struggles with task diversity and realistic object dynamics. Current legged manipulation systems often rely on body velocity commands or single-step body-frame end-effector targets, which are embodiment-specific (requiring the robot for data collection) and too simplistic for complex, dynamic manipulation. There is a critical need for a framework that combines the strengths of both real-world and simulation data collection in a scalable, generalizable, and user-friendly manner, especially for mobile manipulation on dynamic robot embodiments like quadrupeds.

The paper aims to address these challenges by providing:

  1. A method to scale task-centric data collection in the real world without needing a robot, reducing cost and safety concerns.
  2. A method to scale robot-centric data in simulation without requiring complex task-specific simulation setups or reward engineering.
  3. A robust and expressive interface between high-level manipulation policies and low-level whole-body controllers that allows for cross-embodiment generalization and handles the dynamic nature of quadruped robots.
  4. An accessible real-time odometry solution for in-the-wild mobile manipulation.

2.2. Main Contributions / Findings (What)

The main contributions of this work are:

  • UMI-on-Legs Framework: Introduction of a novel framework that combines real-world human demonstrations (using the UMI hand-held gripper) and simulation-based reinforcement learning to train cross-embodiment mobile manipulation systems. This framework allows real-world data collected without robots to be distilled into manipulation skills for various mobile robot platforms, leveraging robot-specific controllers.
  • Manipulation-Centric Whole-Body Controller: Proposal of a whole-body controller (WBC) that uses end-effector trajectories in the task-frame as its interface. This simple yet expressive interface enables zero-shot cross-embodiment deployment of existing manipulation policies (originally designed for fixed-base arms) onto mobile quadrupeds, while also facilitating complex dynamic skills. The WBC is trained in simulation to track these trajectories, allowing it to compensate for base perturbations and anticipate future movements.
  • Accessible Real-time Odometry System: Development and deployment of a real-time, robust, and accessible odometry approach based on Apple's ARKit using an iPhone. This addresses a common bottleneck in mobile manipulation systems by providing fast and precise task-space tracking for in-the-wild deployment.
  • Empirical Validation on Diverse Tasks: Successful evaluation of UMI-on-Legs on prehensile (cup rearrangement), non-prehensile (kettlebell pushing), and dynamic (ball tossing) manipulation tasks, reporting over 70% success rates on all tasks in real-world scenarios. The WBC demonstrates robustness to unexpected external dynamics.
  • Zero-Shot Cross-Embodiment Transfer: Demonstrated the direct deployment of a pre-trained manipulation policy (from UMI prior work) originally for a fixed-base robot arm onto the quadruped system without any fine-tuning, showcasing significant generalization capabilities.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Quadruped Robot: A robot with four legs, designed for locomotion over varied terrain. They are inherently dynamic systems, meaning their body can move significantly during tasks, posing challenges for stable manipulation.
  • Robot Manipulation: The ability of a robot to interact with objects in its environment, typically using an end-effector (like a gripper or hand) attached to a robotic arm. This can involve prehensile actions (grasping), non-prehensile actions (pushing), or dynamic actions (tossing).
  • Whole-Body Control (WBC): A control strategy for robots with multiple degrees of freedom (like a quadruped with an arm) that coordinates all joints (legs and arm) to achieve a desired task while maintaining balance and respecting physical limits. It goes beyond controlling just the arm or just the legs, integrating their movements.
  • End-effector Trajectory: A sequence of desired poses (position and orientation) over time for the robot's end-effector. Instead of controlling individual joint angles, this specifies the path the gripper should follow in space.
  • Task-Frame: A coordinate system defined relative to the task or object being manipulated. For example, if a robot is interacting with a cup on a table, the task-frame might be fixed to the table. This is in contrast to a body-frame, which is fixed to the robot's base and moves with the robot. Tracking in the task-frame means the robot tries to maintain the end-effector's position relative to the task, even if its own body moves.
  • Diffusion Policy: A type of visuo-motor policy that uses a diffusion model to predict robot actions. Diffusion models are generative models that learn to reverse a diffusion process (gradually adding noise to data) to generate new data samples. In robotics, they can be trained to generate action sequences (like end-effector trajectories) from visual observations by learning to denoise a noisy action sequence.
  • Behavior Cloning (BC): A machine learning technique where a robot learns a policy by imitating expert demonstrations. The robot observes a human (or another robot) performing a task and tries to replicate the observed actions given similar observations.
  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, getting positive rewards for desired behaviors and penalties for undesired ones.
  • Sim2Real Transfer: The process of training a robot policy or controller in a simulated environment and then deploying it successfully in the real world. This is challenging due to the reality gap (differences between simulation and reality), which often requires techniques like domain randomization or adaptive policies.
  • Odometry: The process of estimating a robot's position and orientation (pose) over time by using data from its onboard sensors, such as wheel encoders, IMUs (Inertial Measurement Units), or cameras. Visual-inertial odometry (VIO) combines camera images with IMU data for robust pose estimation, especially useful in dynamic environments.
  • Cross-embodiment Generalization: The ability of a robot policy or skill to transfer and work effectively on different robot platforms or configurations (e.g., from a fixed-base arm to a mobile quadruped) without significant re-training.
  • Prehensile/Non-prehensile Manipulation: Prehensile manipulation involves grasping and holding objects, while non-prehensile manipulation involves interacting with objects without grasping them, such as pushing, nudging, or tossing.

3.2. Previous Works

The paper extensively references several foundational and contemporary works that form the basis of UMI-on-Legs:

  • UMI [1]: The Universal Manipulation Interface (Chi et al., RSS 2024) is a key predecessor. It introduced the concept of collecting task-centric data using a hand-held gripper, enabling human demonstrations without requiring a robot. UMI-on-Legs directly leverages UMI for its real-world data collection strategy.
  • Diffusion Policy [2]: (Chi et al., RSS 2023) This work forms the core of the manipulation policy in UMI-on-Legs. It proposed using diffusion models for visuo-motor policy learning, enabling robust action generation from visual observations. UMI-on-Legs adopts this architecture.
  • Massively Parallel Simulators [3, 7]: Works like Isaac Gym [3] and methods from Rudin et al. [7] are crucial for scaling RL training for whole-body controllers. These simulators allow training thousands of robot instances in parallel, accelerating the learning process for complex control tasks and enabling Sim2Real transfer techniques like domain randomization.
  • DeepWBC [8]: (Fu et al., CoRL 2023) This is cited as a baseline for whole-body control in legged manipulation. DeepWBC typically trains policies to control both locomotion and manipulation but often uses body-frame end-effector targets, which the authors argue is less effective for compensating body movements compared to their task-frame approach.
  • MobileALOHA [27]: (Fu et al., arXiv 2024) This work proposed a wheeled-bimanual platform for demonstration collection to learn challenging mobile manipulation skills. While MobileALOHA focuses on demonstration collection for wheeled robots, UMI-on-Legs offers an alternative for quadrupeds and uses a different data collection method (hand-held gripper).
  • Prior Legged Manipulation Systems [8, 10-12]: Many prior works in legged manipulation have explored whole-body controllers. However, they often rely on body velocity commands or single-step body-frame end-effector targets, which are embodiment-specific and limited in representing complex trajectories. UMI-on-Legs differentiates itself by using task-frame end-effector trajectories and demonstrating zero-shot cross-embodiment transfer.

3.3. Technological Evolution

The field has evolved from traditional robot control (model-based, inverse kinematics) to learning-based approaches (behavior cloning, reinforcement learning). Early RL for locomotion often focused solely on leg movements [6, 28-30], while later works started integrating manipulation skills [31-33]. The development of massively parallel simulators [3, 7] significantly accelerated RL training and improved Sim2Real transfer [9]. Concurrently, advances in visuo-motor policies have led to more generalizable manipulation skills, often leveraging pre-trained visual encoders [1, 25, 43] and diffusion models [1, 2, 27]. The Universal Manipulation Interface (UMI) [1] represented a shift by enabling robot-agnostic data collection. UMI-on-Legs combines these threads: it uses UMI for task data, diffusion policies for manipulation skill inference, and massively parallel RL for whole-body control, specifically adapting these to the dynamic quadruped embodiment using a novel task-frame trajectory interface. The integration of readily available consumer electronics like iPhones for odometry further showcases an evolution towards more accessible and robust mobile manipulation systems.

3.4. Differentiation

UMI-on-Legs differentiates itself from prior work in several key aspects:

  • Data Collection Strategy: Unlike systems requiring the robot for data collection (teleoperation or demonstration replay [8, 10, 11, 27]), UMI-on-Legs uses a hand-held gripper (UMI) to collect task-centric data in the real world without involving the robot hardware. This significantly reduces cost, safety concerns, and the robot-specific nature of the collected data.
  • Whole-Body Controller Training: Instead of requiring full task simulation setups, UMI-on-Legs trains its WBC in simulation to track end-effector trajectories. This means the WBC doesn't need to interact with diverse objects or learn from complex task rewards in simulation, simplifying the sim-to-real transfer for the control aspect.
  • Interface Design: Most prior legged manipulation systems use body velocity commands or single-step body-frame end-effector targets [8, 10-12]. UMI-on-Legs uses end-effector trajectories in the task-frame as the interface. This provides preview information to the WBC for anticipation, is embodiment-agnostic beyond the end-effector, and enables precise and stable manipulation by actively compensating for base movements.
  • Cross-Embodiment Generalization: The task-frame trajectory interface allows for zero-shot cross-embodiment deployment of manipulation policies trained for fixed-base robot arms onto a mobile quadruped. This is a significant leap compared to policies that assume perfect lower-level controllers or are tied to specific robot kinematics.
  • Accessible Odometry: UMI-on-Legs employs an iPhone with ARKit for real-time odometry, offering a compact, self-contained, and accessible solution for in-the-wild deployment, contrasting with systems relying on motion capture [31] or AprilTags [8, 11, 36] which constrain deployment environments.

4. Methodology

The UMI-on-Legs framework is a modular system composed of two main components: a high-level manipulation policy and a low-level whole-body controller (WBC). These two policies communicate through a well-defined interface: end-effector trajectories expressed in the task-frame.

The overall method overview is depicted in Figure 3.

Fig 3: Method Overview. The system takes RGB images from a GoPro and infers a camera-frame end-effector trajectory using a diffusion policy running at 2 Hz (a), trained on real-world UMI demonstrations. The trajectory is transformed into the task frame, which serves as the interface to the whole-body controller (b). The controller, an MLP, outputs joint action commands at 50 Hz (c).

The system takes RGB images from a GoPro mounted on the robot's end-effector. A diffusion policy (trained using real-world UMI demonstrations) infers a camera-frame end-effector trajectory. This trajectory is then transformed into the task-space and passed to the WBC. The WBC outputs joint position targets at 50 Hz, which are then tracked by PD controllers for both the legs and the arm.

4.1. Principles

The core idea behind UMI-on-Legs is to decouple the learning of task-specific manipulation skills from robot-specific whole-body control.

  1. Task-Centric Data Collection: Human operators can demonstrate diverse and complex manipulation tasks using a simple hand-held gripper (UMI) in the real world. This data is robot-agnostic beyond the end-effector type.

  2. Robot-Centric Control Learning: A whole-body controller is trained in simulation to robustly track any given end-effector trajectory. This training does not require complex task simulations or object dynamics, simplifying the Sim2Real transfer for the controller.

  3. Expressive Interface: The end-effector trajectory in the task-frame serves as the communication bridge. It is simple enough for humans to demonstrate, yet expressive enough to capture complex motions, and provides preview information for the WBC to anticipate. This task-frame approach allows the WBC to actively compensate for base perturbations (body movements), making the manipulation precise and stable.

    The benefits of this interface design are highlighted:

  • Intuitive demonstration: Non-expert users can demonstrate tasks using hand-held devices (like UMI), as they only need to specify the desired end-effector path, not complex robot-specific actions.

  • High-level intention from preview horizon: Providing a trajectory (a sequence of future targets) allows the WBC to anticipate upcoming movements and coordinate the entire robot body. For example, bracing for a high-velocity toss or leaning the body instead of stepping to reach a target.

  • Precise and stable manipulation in the task-frame: By tracking actions in a task-space that is persistent regardless of base movement (Figure 4), the controller inherently learns to compensate for body movements and perturbations, leading to more stable manipulation.

    Fig 4: Task- vs. body-frame tracking. The whole-body controller (WBC) learns to track the target trajectory in the task frame (a), effectively compensating for base perturbations and therefore freeing the manipulation policy to focus on the task; tracking in the body frame (b) struggles to respond quickly to such perturbations.

  • Asynchronous multi-frequency execution: The interface naturally supports a hierarchical control architecture, where a low-frequency manipulation policy (1-5 Hz) can command a high-frequency low-level controller (50 Hz), accommodating different sensor and inference latencies (see the sketch after this list).

  • Compatible with any trajectory-based manipulation policy: This interface allows for plug-and-play functionality with existing or future trajectory-based manipulation policies, facilitating the transfer of "table-top" skills to "mobile" platforms.
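To make this two-rate design concrete, below is a minimal Python sketch of how a low-frequency policy and a high-frequency controller could share the latest trajectory through a thread-safe buffer. The names (`TrajectoryBuffer`, `infer_trajectory`, `step_controller`) and loop structure are illustrative assumptions, not the paper's implementation.

```python
import threading
import time

import numpy as np


class TrajectoryBuffer:
    """Thread-safe holder for the latest task-frame end-effector trajectory."""

    def __init__(self):
        self._lock = threading.Lock()
        self._trajectory = None  # e.g. (T, 9) array: 3D position + 6D rotation

    def write(self, trajectory: np.ndarray) -> None:
        with self._lock:
            self._trajectory = trajectory

    def read(self):
        with self._lock:
            return self._trajectory


def manipulation_loop(buffer, infer_trajectory, hz=2.0):
    """Low-frequency loop: diffusion-policy inference publishes a new trajectory."""
    while True:
        buffer.write(infer_trajectory())  # slow inference (hundreds of ms)
        time.sleep(1.0 / hz)


def control_loop(buffer, step_controller, hz=50.0):
    """High-frequency loop: always tracks the most recently published trajectory."""
    while True:
        trajectory = buffer.read()
        if trajectory is not None:
            step_controller(trajectory)  # WBC forward pass + PD joint targets
        time.sleep(1.0 / hz)


# The two loops would run in separate threads (or on separate machines):
# threading.Thread(target=manipulation_loop, args=(buf, policy)).start()
# threading.Thread(target=control_loop, args=(buf, wbc)).start()
```

Because the controller simply reads whatever trajectory is latest, the slow policy can update at its own pace without ever stalling the 50 Hz loop.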

4.2. Steps & Procedures

The UMI-on-Legs framework operates in two main stages: training and deployment.

4.2.1. Training Stage

  1. Manipulation Policy Training (Real-World Data, Offline):
    • Data Collection: Human operators use a hand-held gripper (UMI) to perform desired manipulation tasks in the real world. The UMI device records the end-effector poses and corresponding visual observations. This data is task-centric and embodiment-agnostic (meaning it doesn't depend on the robot's specific body).
    • Policy Training: A diffusion policy is trained using this collected data. The policy learns to predict sequences of future end-effector poses in the camera frame given visual inputs from a wrist-mounted camera.
  2. Whole-Body Controller (WBC) Training (Simulation, Offline):
    • Simulation Environment: A massively parallelized simulation environment (Isaac Gym or similar) is used. This environment contains the quadruped robot with its arm but does not require a complex task setup with objects or detailed object dynamics.
    • Trajectory Generation for Training: End-effector trajectories (such as those collected with UMI) are used as tracking targets. Since the WBC only needs trajectories to track, rather than full task setups, no objects or task-specific rewards have to be simulated.
    • Reinforcement Learning: An RL agent (the WBC) is trained to track these end-effector trajectories in the task-frame while coordinating all 18 degrees of freedom (12 for legs, 6 for arm).
    • Observation Space: The WBC observes the robot's 18 joint positions and velocities, base orientation and angular velocity, the previous action, and the end-effector trajectory (densely sampled target poses from -60 ms to +60 ms at 20 ms intervals, plus a target 1000 ms into the future) inferred by the manipulation policy. Each end-effector pose is represented by a 3D position vector and a 6D rotation representation [52] (see the sketch after this list).
    • Reward Function: The RL agent is rewarded for minimizing position error and orientation error to the target pose. Regularization and shaping terms (e.g., joint limits, action rate, root height, collision, body-EE alignment, even mass distribution, feet under hips) are also included. A sigma curriculum is used for the position and orientation error rewards to encourage precision.
    • Sim2Real Transfer Techniques: During training, domain randomization (random pushes, joint friction, damping, contact frictions, body/arm masses, center of masses) is applied to bridge the reality gap. Crucially, control latency is modeled, and the robot is randomly transported to account for odometry noise.
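As a minimal sketch of this observation assembly, the snippet below concatenates proprioception with the sampled trajectory targets, using the 6D rotation representation of [52] (the first two columns of the rotation matrix). The exact feature ordering and the 6D encoding of the base orientation are assumptions for illustration.

```python
import numpy as np


def rotation_to_6d(R: np.ndarray) -> np.ndarray:
    """6D rotation representation [52]: the first two columns of a 3x3
    rotation matrix, flattened. Continuous, unlike Euler angles/quaternions."""
    return R[:, :2].reshape(6, order="F")


def make_wbc_observation(joint_pos, joint_vel, base_orn_6d, base_ang_vel,
                         prev_action, target_poses):
    """Concatenate proprioception and trajectory targets into one WBC input.

    joint_pos, joint_vel: (18,) each -- 12 leg + 6 arm joints.
    base_orn_6d: (6,) base orientation (6D representation assumed here).
    base_ang_vel: (3,).
    prev_action: (18,) previous joint position targets.
    target_poses: 8 (position, rotation-matrix) pairs -- the poses sampled
        at -60 ms..+60 ms in 20 ms steps plus one pose 1000 ms ahead,
        all expressed in the task frame.
    """
    target_feats = np.concatenate(
        [np.concatenate([p, rotation_to_6d(R)]) for p, R in target_poses]
    )  # 8 poses * (3 + 6) = 72 dims
    return np.concatenate([joint_pos, joint_vel, base_orn_6d, base_ang_vel,
                           prev_action, target_feats])
```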

4.2.2. Deployment Stage

  1. Real-time Sensing:
    • RGB images are captured by a GoPro mounted on the end-effector.
    • Robot proprioception (joint positions, velocities, base orientation, angular velocity) is gathered from the robot's internal sensors.
    • Real-time odometry (robot's pose in the global task-frame) is provided by an iPhone running ARKit, mounted on the robot's base.
  2. High-Level Manipulation Policy Inference (Low Frequency):
    • The RGB images are fed to the diffusion policy (running on an external RTX 4090 desktop) to infer a camera-frame end-effector trajectory.
    • This camera-frame trajectory is then transformed into the task-frame using the iPhone's odometry and the robot's kinematic chain.
    • This manipulation policy typically runs at a low frequency (e.g., 1-5 Hz).
  3. Low-Level Whole-Body Control Execution (High Frequency):
    • The inferred task-frame end-effector trajectory is passed to the WBC (running on the robot's Jetson computer).
    • The WBC (an MLP network) takes proprioceptive feedback, odometry, and the target trajectory as input.
    • It outputs joint position targets for all 18 DOF at a high frequency (e.g., 50 Hz).
    • Separate PD controllers for the legs and arm track these joint position targets (a sketch of this control step and the task-frame transform follows this list).
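The sketch below illustrates the two deployment-time operations just described, assuming poses are represented as homogeneous 4x4 transforms: re-expressing the policy's camera-frame trajectory in the task frame via the odometry pose, and one 50 Hz control step in which the WBC MLP emits joint targets tracked by a PD law. Gains, names, and the observation layout are illustrative, not the paper's values.

```python
import numpy as np


def to_task_frame(T_task_camera: np.ndarray, poses_camera: np.ndarray) -> np.ndarray:
    """Re-express end-effector poses from the camera frame in the task frame.

    T_task_camera: (4, 4) camera pose in the task frame, composed from the
        iPhone odometry (base pose in the task frame) and the robot's
        kinematic chain (camera pose in the base frame).
    poses_camera: (T, 4, 4) homogeneous end-effector poses from the policy.
    """
    return T_task_camera @ poses_camera  # matmul broadcasts over all T poses


def control_step(wbc_mlp, observation: np.ndarray, kp=30.0, kd=0.75):
    """One 50 Hz step: the WBC MLP emits joint position targets, which a PD
    law converts to torques. Gains here are illustrative placeholders."""
    q_target = wbc_mlp(observation)               # (18,) joint position targets
    q, dq = observation[:18], observation[18:36]  # proprioception slices (assumed layout)
    torque = kp * (q_target - q) - kd * dq        # PD tracking (done on-motor in reality)
    return q_target, torque
```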

4.3. Mathematical Formulas & Key Details

4.3.1. Manipulation Policy with Behavior Cloning

  • Architecture: The manipulation policy uses a U-Net [50] architecture for its diffusion policy [2] with a DDIM scheduler [51] (a schematic sampling loop follows this list). It leverages a pre-trained CLIP vision encoder [25] for visual feature extraction.
  • Action Horizon: An action horizon of 4 is used, meaning the policy predicts 4 future end-effector poses or trajectory segments into the future, providing more information to the low-level controller.
  • Training Data: For tasks like cup rearrangement, pre-trained checkpoints from UMI [1] are directly used. For new tasks like pushing and tossing, diffusion policies are trained from scratch using newly collected UMI data.
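To clarify how a DDIM-based diffusion policy turns noise into an action sequence, here is a schematic, deterministic DDIM sampling loop (eta = 0). The denoiser signature, the 10-dimensional action (3D position + 6D rotation + gripper width), and the noise schedule are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np


def ddim_sample_actions(denoiser, obs_features, alpha_bars, horizon=4, act_dim=10):
    """Deterministic DDIM sampling (eta = 0) for a diffusion policy.

    denoiser(x_t, t, obs) predicts the noise eps in the noisy action
    sequence x_t at diffusion step t, conditioned on visual features.
    alpha_bars[t] is the cumulative noise schedule (near 1 at t = 0).
    """
    x = np.random.randn(horizon, act_dim)       # start from pure Gaussian noise
    steps = list(range(len(alpha_bars)))[::-1]  # e.g. [15, 14, ..., 0]
    for i, t in enumerate(steps):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps = denoiser(x, t, obs_features)
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)      # predicted clean actions
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps  # DDIM update
    return x  # (horizon, act_dim): future end-effector poses + gripper width
```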

4.3.2. Whole-body Controller with Reinforcement Learning

  • Reward Function for Trajectory Tracking: The core of the WBC's training objective is to minimize the error between the end-effector's current pose and the target pose within the trajectory. The reward for this objective is:

    $$R_{\mathrm{pose}} = \exp \left( - \left( \frac{\epsilon_{\mathrm{pos}}}{\sigma_{\mathrm{pos}}} + \frac{\epsilon_{\mathrm{orn}}}{\sigma_{\mathrm{orn}}} \right) \right)$$

    Where:

    • $R_{\mathrm{pose}}$: The reward obtained for tracking the target pose.

    • $\epsilon_{\mathrm{pos}}$: The position error, i.e., the Euclidean distance between the actual and target end-effector positions.

    • $\epsilon_{\mathrm{orn}}$: The orientation error, i.e., the angular difference (in radians) between the actual and target end-effector orientations. The paper uses a 6D rotation representation [52] for orientation.

    • $\sigma_{\mathrm{pos}}$: A scaling term for the position error.

    • $\sigma_{\mathrm{orn}}$: A scaling term for the orientation error.

      Explanation: This exponential reward encourages both errors to be as small as possible. The scaling terms ($\sigma_{\mathrm{pos}}$, $\sigma_{\mathrm{orn}}$) set the reward's sensitivity: smaller $\sigma$ values make the reward drop off more steeply for small errors, pushing the policy towards higher precision. Combining the position and orientation terms, rather than rewarding them separately, leads to more balanced precision in both.

  • Sigma Curriculum: A sigma curriculum is applied: the values of $\sigma_{\mathrm{pos}}$ and $\sigma_{\mathrm{orn}}$ are gradually decreased during training as the position and orientation errors shrink. This allows broad exploration early in training (larger $\sigma$ values, reward less sensitive to small errors) and forces high precision later (smaller $\sigma$ values, reward highly sensitive to small errors). A worked sketch of this reward and curriculum appears after this list.

    • $\sigma_{\mathrm{pos}}$ steps through [2, 0.1, 0.5, 0.1, 0.05, 0.01, 0.005] as the position error falls below [100, 1.0, 0.8, 0.5, 0.4, 0.2, 0.1], respectively.
    • $\sigma_{\mathrm{orn}}$ steps through [8.0, 4.0, 2.0, 1.0, 0.5] as the orientation error falls below [100.0, 1.0, 0.8, 0.6, 0.2], respectively.
  • Additional Reward Terms (Regularization and Shaping): The WBC incorporates several other reward terms, standard in RL for locomotion and manipulation, to ensure stable, safe, and energy-efficient behavior. These include:

    • Joint Limit: Penalizes exceeding joint physical limits.
    • Joint Acceleration: Penalizes rapid changes in joint velocities, promoting smooth movements.
    • Joint Torque: Penalizes high torques, encouraging energy efficiency and preventing motor overheating.
    • Root Height: Rewards maintaining a desired body height.
    • Collision: Penalizes unwanted collisions (e.g., robot body hitting the ground).
    • Action Rate: Penalizes large changes in successive actions, promoting smoother control signals.
    • Body-EE Alignment: Regularizes specific arm joints (0 and 3) to keep the gripper aligned with the body, preventing awkward poses.
    • Even Mass Distribution: Regularizes the standard deviation of forces exerted by the four feet, encouraging balanced weight distribution and reducing motor overheating.
    • Feet Under Hips: Regularizes the feet's planar position to be near their respective hips, improving standing stability. The specific weights for these reward terms are provided in the Appendix (Table 5).
  • Policy Network Architecture: A multi-layer perceptron (MLP) is used as the controller network. It maps observations (robot state, end-effector trajectory) to target joint positions for both the 12 DOF legs and the 6 DOF arm. This MLP has a fast inference time (approximately 0.06 ms), enabling 50 Hz deployment.
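Below is a small Python sketch of the tracking reward and sigma curriculum described above. The sigma and threshold values are quoted from the text; their units (meters for position, radians for orientation are assumed) and the threshold-stepping rule are illustrative.

```python
import numpy as np

# Curriculum values quoted from the text; units and stepping rule assumed.
POS_SIGMAS = [2, 0.1, 0.5, 0.1, 0.05, 0.01, 0.005]
POS_THRESHOLDS = [100, 1.0, 0.8, 0.5, 0.4, 0.2, 0.1]
ORN_SIGMAS = [8.0, 4.0, 2.0, 1.0, 0.5]
ORN_THRESHOLDS = [100.0, 1.0, 0.8, 0.6, 0.2]


def curriculum_sigma(error, thresholds, sigmas):
    """Return the sigma for the tightest threshold the current error beats."""
    sigma = sigmas[0]
    for threshold, s in zip(thresholds, sigmas):
        if error < threshold:
            sigma = s  # thresholds shrink left-to-right, so keep tightening
    return sigma


def pose_reward(pos_err, orn_err, sigma_pos, sigma_orn):
    """R_pose = exp(-(eps_pos / sigma_pos + eps_orn / sigma_orn))."""
    return float(np.exp(-(pos_err / sigma_pos + orn_err / sigma_orn)))


# Example: 5 cm position error, ~0.06 rad orientation error. As errors shrink,
# the curriculum tightens the sigmas and the reward approaches 1 only when
# the policy tracks with high precision.
r = pose_reward(0.05, 0.06,
                curriculum_sigma(0.05, POS_THRESHOLDS, POS_SIGMAS),
                curriculum_sigma(0.06, ORN_THRESHOLDS, ORN_SIGMAS))
```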

4.3.3. System Integration

  • Robot Hardware:
    • Quadruped: Unitree Go2 (12 DOF).
    • Manipulator: ARX5 robot arm (6 DOF).
    • End-effector: Customized with Finray grippers and a GoPro (to match UMI gripper).
    • Power: Both robot and arm are powered by the Go2's battery.
  • Compute:
    • Whole-body controller runs on the Go2's Jetson onboard computer.
    • Diffusion policy inference runs on a separate desktop RTX 4090 over an internet connection.
  • Odometry: An iPhone mounted on the robot's base provides pose estimation via ARKit. It connects to the Jetson via Ethernet.
    • Placement: The iPhone is mounted on the rear of the robot, facing backward at a 60° angle. This avoids arm-phone collisions, minimizes motion blur and visual occlusion, and ensures that ARKit can track the environment even when the Go2 bends.
  • Sim2Real Transfer Details:
    • Robustness Training: Random push forces are applied to the robot during RL training to enhance robustness.
    • Domain Randomization: Joint friction, damping, contact frictions, body and arm masses, and center of masses are randomized.
    • Latency Modeling: Crucially, control latency is modeled during training. The paper reports that simulating a 20 ms end-to-end control latency gives the best real-world performance. A pose latency of 10 ms is also simulated for odometry (see the delay-buffer sketch after this list).
    • Odometry Noise Handling: To account for odometry system noise, the robot's pose is randomly transported every 20 seconds mid-episode.
    • URDF Accuracy: For highly dynamic tasks, precise URDF (Unified Robot Description Format) models are critical. The authors disassembled the ARX5 arm, re-weighed components, and recomputed mass, center of mass, and inertia matrices for accurate URDF creation, which was essential for both simulation stability and real-world deployment.
  • Latency Challenges and Mitigation:
    • Control Latency: ROS2 communication, motor encoder readouts, WBC inference, and motor execution all introduce latency. A simulated 20 ms end-to-end control latency was found to best match the real system.
    • Pose Estimator Latency: Motion capture has ~8 ms latency, while iPhone ARKit has ~140 ms. A 10 ms pose latency is simulated. Attempts to compensate for the iPhone's latency by integrating base velocity estimates (from foot contacts, joint positions/velocities, and IMU readings) forward in time did not significantly reduce shaking.
    • Python ROS2 Latency: Due to Python's Global Interpreter Lock, ROS2 callback functions can block each other. Optimizing and detaching long-running callbacks into separate ROS2 nodes improved joint observation updates to 200 Hz and policy inference to 50 Hz.
  • Safety Measures:
    • Shaking/Oscillation: Action rate regularization (penalizing large changes in actions) was increased during training. The onboard Lidar was disabled as it introduced shaking.
    • Overheat Shutdowns: The calf joint's linkage design requires higher torque. Increased torque regularization during training allowed continuous operation for up to 30 minutes.
    • Unsafe Configurations: Reward terms were added to regularize the body to a more balanced pose with the center of mass centrally located, preventing the robot from twisting its legs into unsafe, high-precision configurations often favored in simulation.
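As a minimal illustration of the latency modeling above, the sketch below replays values with a fixed delay, which is a standard way to approximate a 20 ms control latency (or a 10 ms pose latency) in a fixed-step simulator. The 250 Hz simulation rate in the usage example is an assumption, not a number from the paper.

```python
from collections import deque


class DelayBuffer:
    """Replay values with a fixed delay to model latency in simulation.

    A latency of `delay_s` seconds at a simulation step of `dt_s` seconds
    is approximated by queueing round(delay_s / dt_s) steps.
    """

    def __init__(self, delay_s, dt_s, initial):
        self.steps = max(0, round(delay_s / dt_s))
        self.queue = deque([initial] * self.steps)

    def push_pop(self, value):
        """Insert the newest value; return the value from `delay_s` ago."""
        if self.steps == 0:
            return value
        self.queue.append(value)
        return self.queue.popleft()


# Example: 20 ms end-to-end control latency at an assumed 250 Hz sim step.
buf = DelayBuffer(delay_s=0.020, dt_s=0.004, initial=0.0)
applied = [buf.push_pop(a) for a in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
# applied == [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] -- actions lag by 5 steps
```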

5. Experimental Setup

The authors designed a series of experiments to validate the UMI-on-Legs framework, focusing on capability, robustness, and scalability.

5.1. Datasets

  • Real-world UMI Demonstrations:
    • Cup Rearrangement Task: The paper directly uses a pre-trained UMI checkpoint [1]. This policy was trained on 1400 episodes of human demonstrations collected using a hand-held gripper for placing an espresso cup on its saucer with a specific handle orientation.
    • Pushing and Tossing Tasks: For these tasks, diffusion policies were trained from scratch using UMI data collected by the authors.
      • Dynamic Tossing: 500 episodes of human demonstration for grasping and tossing.
      • Kettlebell Pushing: 25 episodes of human demonstration for pushing.
    • Characteristics: The data includes end-effector poses and wrist-mounted camera views. It is task-centric and robot-agnostic beyond the end-effector.
  • Synthetic Trajectories for WBC Training: For whole-body controller training, end-effector trajectories (e.g., from UMI data) are provided as targets in simulation. The training environment includes the robot but no complex task objects, reducing the need for diverse simulated assets.
  • Justification: Using UMI data allows for diverse, task-relevant manipulation skills to be collected cheaply in the real world without a robot. Training the WBC in simulation with these trajectories avoids the difficulty of simulating diverse objects and engineering task-specific rewards for controller learning.

5.2. Evaluation Metrics

The experiments use a combination of simulation and real-world metrics to assess performance.

  • Simulation Metrics (Averaged over 500 episodes):

    1. Position Error (cm):

      • Conceptual Definition: Measures the average Euclidean distance between the end-effector's actual position and its target position over time. It quantifies the accuracy of the controller in reaching the desired spatial location. A lower value indicates better precision.
      • Formula: While the paper does not explicitly provide the formula for this metric, the standard definition is the Euclidean distance (see the code sketch after this metrics list): $$\epsilon_{\mathrm{pos}} = \| \mathbf{p}_{\mathrm{actual}} - \mathbf{p}_{\mathrm{target}} \|_2$$ Where:
        • $\mathbf{p}_{\mathrm{actual}}$: The 3D position vector of the end-effector in the task-frame.
        • $\mathbf{p}_{\mathrm{target}}$: The 3D target position vector from the trajectory in the task-frame.
        • $\| \cdot \|_2$: The Euclidean norm (distance).
      • Unit: Centimeters (cm).
    2. Orientation Error (deg or rad):

      • Conceptual Definition: Measures the average angular difference between the end-effector's actual orientation and its target orientation over time. It quantifies the accuracy of the controller in achieving the desired angular pose. A lower value indicates better orientation control.
      • Formula: The paper mentions using a 6D rotation representation [52]. While it does not give a single canonical error formula for this representation, a common choice for the angular error between two rotation matrices ($R_1$, $R_2$) is: $$\epsilon_{\mathrm{orn}} = \arccos \left( \frac{\mathrm{trace}(R_1^T R_2) - 1}{2} \right)$$ Where:
        • $R_1$: Actual end-effector rotation matrix.
        • $R_2$: Target end-effector rotation matrix.
        • $\mathrm{trace}(\cdot)$: The sum of the diagonal elements of a matrix. The result is in radians and can be converted to degrees.
      • Unit: Degrees (deg) or Radians (rad).
    3. Survival (%):

      • Conceptual Definition: Represents the percentage of episodes where the robot avoids a terminal collision and successfully completes the desired duration of the task. A terminal collision is defined as any robot part contact that is not the robot's feet or gripper. It indicates the overall stability and safety of the policy. Higher is better.
      • Formula: (Number of successful episodes / Total number of episodes) × 100%.
    4. Electrical Power Usage (kW):

      • Conceptual Definition: Estimates the average electrical power consumed by the robot's motors during an episode. It's calculated based on real hardware's voltages, manufacturer-reported torque constants, and simulated motor torques. It measures the energy efficiency of the policy. A lower value indicates better efficiency.
      • Formula: This is typically derived from the sum of (torque × angular velocity) over all joints, potentially incorporating motor efficiency and voltage. The paper states it is based on "real hardware's voltages, manufacturers' reported torque constants, and the simulation's motor torques," implying a model-based estimation.
      • Unit: Kilowatts (kW).
  • Real-World Metrics (Averaged over 20 episodes):

    1. Success Rate (%):
      • Conceptual Definition: The percentage of real-world trials where the robot successfully completes the defined task according to task-specific criteria (e.g., the ball landing within 40 cm of a target, or the cup placed upright with its handle oriented correctly). Higher is better.
      • Formula: (Number of successful trials / Total number of trials) × 100%.
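A minimal sketch of how these metrics can be computed, assuming poses are given as task-frame positions (meters) and 3x3 rotation matrices. The power estimate here is a simplified mechanical proxy; the paper's version additionally folds in measured voltages and manufacturer torque constants.

```python
import numpy as np


def position_error_cm(p_actual, p_target):
    """Euclidean task-frame position error, meters converted to centimeters."""
    return 100.0 * float(np.linalg.norm(np.asarray(p_actual) - np.asarray(p_target)))


def orientation_error_deg(R_actual, R_target):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    cos_angle = (np.trace(R_actual.T @ R_target) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def electrical_power_kw(torques, joint_vels):
    """Simplified proxy: sum of |torque * angular velocity|, in kW. The paper's
    estimate also uses measured voltages and motor torque constants."""
    return float(np.sum(np.abs(np.asarray(torques) * np.asarray(joint_vels))) / 1000.0)


def success_rate(successes, trials):
    """(Number of successful trials / total trials) * 100%."""
    return 100.0 * successes / trials
```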

5.3. Baselines

The paper primarily uses ablations of its own proposed WBC design and DeepWBC as baselines for comparison in simulation.

  • Ablation Studies:
    • (-) Preview: Removes the trajectory observations (future target poses) from the WBC's input, meaning the controller only sees the current target.
    • (-) Task-space: The WBC is trained to track end-effector targets in the body-frame instead of the task-frame.
    • (-) UMI Traj: The WBC is trained on simpler, heuristically generated trajectories instead of real-world manipulation trajectories from UMI.
    • DeepWBC [8]: A baseline representing prior whole-body control approaches, conceptually similar to removing preview, task-space tracking, and UMI trajectories simultaneously. It often relies on body-frame targets.
  • Odometry Comparison: For real-world evaluations, the proposed iPhone-based odometry system is compared against an OptiTrack motion capture system, which serves as an oracle (ground truth) for pose estimation.
  • No-Preview Baseline (Real-World): For the dynamic tossing task, the behavior of a no-preview baseline is visually demonstrated in the real world to highlight the importance of trajectory information.

5.4. Task Descriptions

5.4.1. Capability: Whole-Body Dynamic Tossing

  • Goal: The robot must grasp a ball and toss it towards a target bin 90 cm away.
  • Success Criteria: The ball must land within 40 cm of the bin's center.
  • Difficulty: Requires dynamic whole-body coordination, precise timing, power from all DOFs, and momentary balancing on one or two feet. This is challenging for legged systems due to base stabilization during dynamic arm movements.

5.4.2. Robustness: End-effector Reaching Leads to Robust Whole-body Pushing

  • Goal: Push a 10 lbs (or 20 lbs) kettlebell such that it slides upright into a goal region.
  • Setup: The kettlebell's initial distance from the goal varies between 80 cm and 120 cm across episodes.
  • Difficulty: Tests the controller's ability to handle unexpected external dynamics (large disturbances, varying friction, feet slippage) in a zero-shot fashion. Small imprecision in the contact point can cause the kettlebell to topple.

5.4.3. Scalability: Plug-and-play Cross-Embodiment Manipulation Policies (In-the-wild Cup Rearrangement)

  • Goal: Place an espresso cup on its saucer with its handle pointed to the left of the robot.
  • Setup: The cup and saucer are randomly positioned within a 20 cm radius, with random cup orientation. Evaluated in unseen environments (in-the-wild).
  • Success Criteria: The cup must be upright on the saucer with its handle within ±15° of pointing directly left.
  • Difficulty: This task primarily validates zero-shot cross-embodiment generalization by deploying a pre-trained manipulation policy (from UMI, for a fixed-base UR5e arm) onto the 18 DOF quadruped system (6 DOF ARX5 arm on a 12 DOF base) without fine-tuning. It also requires precise 6 DOF end-effector movements for prehensile and non-prehensile actions, which are traditionally difficult for low-cost legged manipulation systems.

6. Results & Analysis

6.1. Core Results

6.1.1. Capability: Whole-Body Dynamic Tossing

The dynamic tossing experiment aimed to validate the system's ability to handle complex, dynamic whole-body manipulation skills. The WBC successfully discovered a whole-body tossing strategy that involves coordinated movements across all joints, momentary balancing, and effective use of body mass for stability.

The following table shows the results from Table 1, which evaluates the tossing task in simulation for various WBC configurations:

| Approach | Pos Err (cm) ↓ | Orn Err (deg) ↓ | Survival (%) ↑ | Power (kW) ↓ |
| --- | --- | --- | --- | --- |
| Ours | 2.12 | 3.35 | 98.4 | 3.82 |
| (-) Preview | 3.02 | 4.23 | 93.0 | 4.74 |
| (-) Task-space | 15.49 | 15.55 | 0.0 | 3.95 |
| (-) UMI Traj | 2.48 | 15.67 | 97.4 | 3.69 |
| DeepWBC [8] | 22.2 | 66.22 | 0.0 | 5.92 |

Analysis of Tossing Evaluation in Sim (Table 1):

  • Ours (full proposed system): Achieves the best balance of low position error (2.12 cm), low orientation error (3.35 deg), high survival (98.4%), and reasonable power usage (3.82 kW). This demonstrates the effectiveness of combining task-frame trajectory tracking with preview information from UMI data.

  • (-) Preview: Removing trajectory observations (future targets) from the WBC leads to a noticeable increase in both position (3.02 cm) and orientation (4.23 deg) errors, and a drop in survival (93.0%). This highlights the importance of the preview horizon for the WBC to anticipate and plan whole-body coordination, especially for dynamic tasks. It also uses more power (4.74 kW) because its movements are reactive rather than proactive.

  • (-) Task-space: When the WBC tracks targets in the body-frame instead of the task-frame, performance degrades dramatically. Position error (15.49 cm) and orientation error (15.55 deg) become very high, and survival drops to 0.0%. This is a critical finding, confirming that task-frame tracking is essential for stable manipulation on dynamic quadruped embodiments, as it enables the controller to compensate for base movements.

  • (-) UMI Traj: Training the WBC on simpler, non-UMI trajectories results in increased orientation error (15.67 deg) while position error (2.48 cm) remains relatively low and survival (97.4%) stays high. This suggests that the quality and expressiveness of UMI-collected trajectories are important for learning precise orientation control, but the WBC can still maintain stability with simpler trajectories.

  • DeepWBC [8]: This baseline, representing a more traditional body-frame approach without trajectory preview or UMI data, performs worst, with extremely high position (22.2 cm) and orientation (66.22 deg) errors and 0.0% survival. This strongly validates the design choices of UMI-on-Legs.

    In real-world experiments, UMI-on-Legs achieved a 75% success rate with motion capture (oracle) and 70% with iPhone odometry on the dynamic tossing task. The authors also showed that replaying calibrated tossing trajectories with motion capture (an "oracle manipulation policy") improved success by 10%, suggesting that the remaining misses stem from imprecise manipulation policies rather than the WBC.

    Fig 5: Dynamic tossing requires dynamic whole-body coordination. Our controller (top row) discovers a jump-toss strategy, tucking in and landing cleanly, while the no-preview baseline (bottom row) cannot anticipate where the target will go next, moves frantically, and drops the ball. The bar chart on the right compares the success rates of three policies, the best reaching 85%.

6.1.2. Robustness: End-effector Reaching Leads to Robust Whole-body Pushing

The kettlebell pushing task demonstrated the WBC's zero-shot robustness to unexpected external dynamics.

  • Success Rates: The system achieved a 95% success rate with motion capture and 90% with iPhone odometry when pushing a 10 lbs kettlebell. Even with a 20 lbs kettlebell, 4 out of 5 episodes were successful.

  • Adaptive Strategy: The WBC learned to adapt its strategy in response to disturbances. When encountering the kettlebell's resistance, it would lean forward to exert more force, a strategy that would cause the robot to fall without the kettlebell. This demonstrates the WBC's ability to coordinate whole-body movements to maintain balance and achieve the task goal even under significant external forces. The controller's ability to move its legs forward to reach faraway target poses, without being explicitly commanded base velocity, also highlights its intelligent coordination.

    (Figure) A composite illustration showing the quadruped performing the pushing and grasping behaviors (labeled "Push" and "Catch") with the UMI-on-Legs framework, along with a plot of the robot's knee-joint velocity over time and success-rate bar charts across different objects.

6.1.3. Scalability: Plug-and-play Cross-Embodiment Manipulation Policies (In-the-wild Cup Rearrangement)

This experiment showcased the ability of UMI-on-Legs to deploy a pre-trained manipulation policy from prior work (originally for a fixed-base UR5e arm) onto the quadruped system with zero-shot generalization.

  • Success Rates: The system achieved an 85% success rate using motion capture and 80% using iPhone odometry. This is a remarkable result given the cross-embodiment gap (fixed-base 6 DoF industrial arm vs. mobile 18 DoF legged manipulator) and the precision requirements of the task.

  • Dynamic Adaptation: The quadruped dynamically tilted and leaned its base to increase its reach range or counterbalance the arm, effectively compensating for the kinematic reach limitations of its smaller ARX5 arm compared to the UR5e the policy was originally trained for.

  • In-the-Wild Performance: The task was evaluated in unseen environments including unstable terrain (grass, dirt), collapsible tables, and direct sunlight, demonstrating the generalizability of the diffusion policy and the robustness of the WBC in challenging conditions.

    Fig 7: Making Existing Manipulation Policies Mobile. Although intended for a fixed-base arm with larger reach, the pre-trained policy achieves an 80% success rate zero-shot on our quadruped system with our learned whole-body controller. Left: the original fixed-base arm setup; middle: the quadruped with the learned whole-body controller; right: a success-rate comparison showing the 80% zero-shot transfer.

6.2. Data Presentation (Tables)

Table 2: Hyperparameters for the dynamic tossing task:

| Hyperparameter | Value |
| --- | --- |
| Training Set Trajectory Number | 500 |
| Diffusion Policy Visual Observation Horizon | 2 |
| Diffusion Policy Proprioception Horizon | 4 |
| Diffusion Policy Output Steps | 64 |
| Diffusion Policy Execution Steps | 40 |
| Diffusion Policy Execution Frequency | 12 Hz |
| Trajectory Update Smoothing Time | 0.1 s |

Table 3: Hyperparameters for the whole-body pushing task:

| Hyperparameter | Value |
| --- | --- |
| Training Set Trajectory Number | 25 |
| Diffusion Policy Visual Observation Horizon | 2 |
| Diffusion Policy Proprioception Horizon | 2 |
| Diffusion Policy Output Steps | 32 |
| Diffusion Policy Execution Steps | 10 |
| Diffusion Policy Execution Frequency | 10 Hz |
| Trajectory Update Smoothing Time | 0.3 s |

Table 4: Hyperparameters for the UMI cup rearrangement task:

| Hyperparameter | Value |
| --- | --- |
| Training Set Trajectory Number | 1400 |
| Diffusion Policy Visual Observation Horizon | 1 |
| Diffusion Policy Proprioception Horizon | 2 |
| Diffusion Policy Output Steps | 16 |
| Diffusion Policy Execution Steps | 8 |
| Diffusion Policy Execution Frequency | 5 Hz |
| Trajectory Update Smoothing Time | 0.1 s |
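The "Trajectory Update Smoothing Time" rows in Tables 2-4 refer to blending each newly inferred trajectory into the one currently being tracked. The paper does not spell out the blending rule; the sketch below assumes simple linear interpolation on positions over the smoothing window (orientations would need slerp).

```python
import numpy as np


def blend_trajectories(old_traj, new_traj, t_since_update, smoothing_time):
    """Linearly ramp from the previous trajectory to the newly inferred one.

    old_traj, new_traj: (T, 3) arrays of task-frame target positions sampled
        on the same time grid. alpha ramps 0 -> 1 over `smoothing_time`
        seconds after a policy update, so the controller never sees a step
        discontinuity in its target.
    """
    alpha = float(np.clip(t_since_update / smoothing_time, 0.0, 1.0))
    return (1.0 - alpha) * np.asarray(old_traj) + alpha * np.asarray(new_traj)
```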

Table 5: Reward terms:

| Name | Weight |
| --- | --- |
| Joint Limit | -10 |
| Joint Acceleration | -2.5e-7 |
| Joint Torque | -1e-4 |
| Root Height | -1 |
| Collision | -1 |
| Action Rate | -0.01 |
| Body-EE Alignment | -1 |
| Even Mass Distribution | -1 |
| Feet Under Hips | -1 |
| Pose Reaching | 4 |

Table 6: Domain randomization hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| Init XY Position | [-0.1 m, 0.1 m] |
| Init Z Orientation | [-0.05 rad, 0.05 rad] |
| Joint Damping | [0.01, 0.5] |
| Joint Friction | [0.0, 0.05] |
| Geometry Friction | [0.1, 8.0] |
| Mass Randomization | [-0.25, 0.25] |
| Center of Mass Randomization | [-0.1 m, 0.1 m] |

Table 7: System cost:

| Item | Cost ($) |
| --- | --- |
| Unitree Go2 Edu Plus | 12,500.00 |
| ARX5 Robot Arm | 10,000.00 |
| GoPro Hero9 | 210.99 |
| GoPro Media Mod | 79.99 |
| GoPro Max Lens Mod | 68.69 |
| iPhone 15 Pro | 999.00 |
| Elgato Capture Card | 147.34 |
| Total | 24,006.01 |

6.3. Ablations / Parameter Sensitivity

The ablation studies (Table 1) clearly demonstrate the critical role of key design choices in UMI-on-Legs:

  • Preview Horizon: The performance degradation of (-) Preview confirms that providing future trajectory information is crucial for the WBC to anticipate dynamic movements, leading to smoother and safer whole-body coordination. Without it, the WBC becomes more reactive, leading to higher errors and lower survival.
  • Task-frame Tracking: The catastrophic failure of (-) Task-space is the most significant finding from the ablations. It unequivocally shows that training the WBC to track end-effector trajectories relative to a task-frame (independent of the robot's base motion) is absolutely essential for achieving stable and precise manipulation on a dynamic quadruped. Traditional body-frame tracking is insufficient to compensate for the robot's inherent base movements during manipulation.
  • UMI Trajectories for WBC Training: While not as severe as task-frame ablation, training the WBC with (-) UMI Traj (simpler trajectories) still leads to a significant increase in orientation error. This implies that real-world-like UMI trajectories provide richer, more diverse motion patterns that help the WBC learn more precise orientation control, which is vital for tasks like cup rearrangement.

6.4. Failure Analysis

  • Tossing Task: Misses were often attributed to imprecise manipulation policies rather than the WBC. When provided with oracle trajectories (calibrated motion capture replay), the success rate improved by 10%, suggesting that scaling up tossing data collection for the manipulation policy could further close this gap.
  • Pushing Task: The most common failure was the kettlebell toppling sideways. While more human demonstrations could teach the robot to reorient, the arm itself was too weak to pick up the kettlebell once it fell. Hardware limitations (overheating) were also noted, suggesting that training with simulated forces [10] could lead to more elegant and energy-efficient pushing behaviors.
  • Grasping for Tossing (Appendix): Initial attempts at diffusion policies for grasping and tossing suffered from distributional drift due to small shakes from the controller during the grasping phase. More data is hypothesized to resolve this.
  • System Reliability (Appendix): The system's general reliability for fully-untethered deployment was not high enough.
    • Overheating: The calf joints of the Unitree Go2 frequently overheated (within 10-30 minutes) due to a linkage design requiring higher torque, leading to emergency shutdowns. Torque regularization during training helped extend operation time.
    • Power Fluctuation: The battery voltage varied with charge level, affecting arm performance (too high when full, dampening behavior when low), requiring voltage adaptors.
  • Velocity Integration for Odometry (Appendix): iPhone ARKit pose estimates sometimes drifted heavily, especially during dynamic actions like tossing. The ~140 ms latency from movement to pose update introduced a significant Sim2Real gap, leading to low-frequency oscillations. Integrating base velocity estimates from proprioceptive sensors to project the iPhone pose forward did not significantly improve performance, indicating that the latency and shaking require more robust mitigation.

6.5. Things That Did Not Work (Appendix)

  • Privileged Policy Distillation and Observation History: Attempts to include privileged information (e.g., ground truth poses, base velocities, kp/kd) to train a privileged policy and distill it did not boost performance and introduced instability, especially with observation histories longer than 1 step. This was attributed to imprecise Python ROS2 timers causing out-of-distribution observation histories.
  • Precise Grasping for Tossing: As mentioned in failure analysis, training diffusion policies for precise grasping was challenging due to controller shakes causing distributional drift.
  • Velocity Integration for Odometry: As mentioned in failure analysis, integrating inertial-legged velocity estimates to compensate for iPhone ARKit latency did not significantly improve performance due to remaining shaking behavior and latency issues.

7. Conclusion & Reflections

7.1. Conclusion Summary

UMI-on-Legs presents a robust and scalable framework for mobile manipulation on quadruped robots. By strategically combining real-world human demonstrations (via UMI) for task-centric data and simulation-based reinforcement learning for robot-centric whole-body control, the framework overcomes common challenges in robot learning. The key innovation lies in its interface: end-effector trajectories in the task-frame, which enables zero-shot cross-embodiment deployment of manipulation policies and allows the whole-body controller to effectively coordinate the entire robot to perform complex, dynamic, and precise manipulation tasks while compensating for base movements. Demonstrated on tossing, pushing, and cup rearrangement tasks, UMI-on-Legs achieves high success rates across diverse challenges and offers an accessible real-time odometry solution via an iPhone.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Gripper-based Manipulation Only: The current system is limited to gripper-based manipulation, inheriting this from the UMI framework. Future work could extend the interface to support whole-body manipulation (e.g., using legs for manipulation), inspired by emerging whole-body demonstration devices and quadruped manipulation research [31, 54, 55].
  • Embodiment-Specific Constraints: While the interface is embodiment-agnostic, incorporating embodiment-specific constraints (e.g., kinematic limits, torque limits) back into the high-level manipulation policy is an important next step. This would allow the policy to generate more feasible trajectories for a given robot.
  • Complete Mobile Manipulation Platform: The current system lacks capabilities essential for a complete mobile manipulation platform, such as collision avoidance [56] and force feedback/control [10]. Integrating these features would enhance safety and interaction capabilities.
  • Hardware Robustness: As noted in the appendix, the system faces hardware reliability issues like motor overheating and power fluctuations during prolonged dynamic tasks. Addressing these through hardware improvements or more energy-efficient control policies is crucial.
  • Odometry Latency: Despite using an iPhone for odometry, a significant latency (~140 ms) was observed, leading to oscillations and a Sim2Real gap. Improving low-latency odometry solutions or better latency compensation in the WBC is important.
  • Manipulation Policy Precision: For highly dynamic tasks like tossing, manipulation policy precision was identified as a limiting factor, suggesting that more demonstration data could improve performance.

7.3. Personal Insights & Critique

UMI-on-Legs represents a significant step forward in making mobile manipulation on dynamic quadrupeds more accessible and scalable. The core insight of decoupling task-centric demonstration from robot-centric control via an embodiment-agnostic trajectory interface is powerful. This modularity allows researchers to improve manipulation policies and whole-body controllers independently, knowing they can be effectively integrated.

Strengths:

  • Novel Interface: The task-frame end-effector trajectory interface is a crucial innovation. It simplifies human demonstration, enhances WBC performance through preview, and enables true cross-embodiment transfer, which is a persistent challenge in robotics.
  • Scalability: The framework's ability to leverage UMI for cheap real-world data and massively parallel simulation for WBC training offers a highly scalable path for developing complex skills.
  • Accessibility: The use of an iPhone for odometry drastically lowers the barrier to entry for mobile manipulation research, moving away from expensive motion capture systems.
  • Rigorous Ablations: The comprehensive ablation studies clearly validate the necessity of each component (e.g., task-frame tracking, preview horizon), strengthening the paper's conclusions.
  • Diversity of Tasks: Demonstrating success on prehensile, non-prehensile, and highly dynamic tasks (tossing) showcases the versatility and robustness of the framework.

Critique and Open Questions:

  • Dependency on UMI: While a strength for data collection, the reliance on UMI (a specific gripper-based demonstration device) might limit the immediate applicability to whole-body manipulation or diverse end-effector types. Future iterations might explore more generalized demonstration interfaces.

  • Real-time Odometry Trade-offs: While accessible, the iPhone ARKit odometry with its notable latency (~140 ms) might pose fundamental limits for extremely fast or precise dynamic tasks. Further research into fusing VIO with proprioceptive sensing or deploying onboard visual odometry solutions (like Lidar-Inertial Odometry) on the Jetson could offer lower-latency, more robust alternatives.

  • Sim2Real Gap for Policy: While the WBC uses domain randomization, the manipulation policy is trained on real-world UMI data. The zero-shot transfer of a pre-trained policy to a different robot embodiment still involves an implicit Sim2Real gap (or Real2Real embodiment gap) for the policy itself regarding subtle kinematic or dynamic differences. How does the WBC compensate for trajectories that might be kinematically challenging or energetically expensive for the target robot, even if valid for the source robot? The WBC dynamically tilting and leaning is a good start, but more explicit feedback or awareness in the manipulation policy could be beneficial.

  • Generality of Reward Function Weights: The reward terms for the WBC involve many manually tuned weights (Table 5) and sigma curriculum schedules. While effective, this tuning process can be tedious. Investigating meta-learning approaches or automated reward design could reduce this manual effort for new robots or tasks.

  • Cost vs. Performance: The system's cost (~$24k) is presented as relatively low, which is a significant advantage. However, the hardware limitations (overheating, battery issues) suggest that lower-cost hardware still presents practical challenges for continuous, high-performance operation. This highlights the ongoing trade-off between accessibility and industrial-grade reliability.

    Overall, UMI-on-Legs provides a compelling blueprint for future mobile manipulation systems. Its modular design, task-frame interface, and accessible odometry pave the way for more widespread adoption and development of complex robot skills, particularly on dynamic legged platforms. The framework's ability to bridge human intuition with robot capabilities is a testament to its forward-thinking design.
