UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers
TL;DR Summary
UMI-on-Legs integrates handheld data collection with simulation-based whole-body control, enabling zero-shot cross-embodiment deployment and achieving over 70% success in diverse dynamic manipulation tasks on quadrupeds.
Abstract
We introduce UMI-on-Legs, a new framework that combines real-world and simulation data for quadruped manipulation systems. We scale task-centric data collection in the real world using a hand-held gripper (UMI), providing a cheap way to demonstrate task-relevant manipulation skills without a robot. Simultaneously, we scale robot-centric data in simulation by training a whole-body controller for task-tracking without task simulation setups. The interface between these two policies is end-effector trajectories in the task frame, inferred by the manipulation policy and passed to the whole-body controller for tracking. We evaluate UMI-on-Legs on prehensile, non-prehensile, and dynamic manipulation tasks, and report over 70% success rate on all tasks. Lastly, we demonstrate the zero-shot cross-embodiment deployment of a pre-trained manipulation policy checkpoint from prior work, originally intended for a fixed-base robot arm, on our quadruped system. We believe this framework provides a scalable path towards learning expressive manipulation skills on dynamic robot embodiments. Please check out our website for robot videos, code, and data: https://umi-on-legs.github.io
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers
1.2. Authors
- Huy Ha (* Equal Contribution, Stanford University, Columbia University)
- Yihuai Gao (* Equal Contribution, Stanford University)
- Zipeng Fu (Stanford University)
- Jie Tan (Google DeepMind)
- Shuran Song (Stanford University, Columbia University)
1.3. Journal/Conference
This paper was published on arXiv, indicating it is a preprint and has not yet undergone formal peer review for a specific journal or conference. However, the authors and affiliations (Stanford, Columbia, Google DeepMind) suggest a high-caliber research background. The work builds upon UMI and Diffusion Policy, which were presented at RSS, a top-tier robotics conference.
1.4. Publication Year
2024
1.5. Abstract
The paper introduces UMI-on-Legs, a novel framework designed for quadruped manipulation systems. It integrates real-world data, collected economically using a hand-held gripper (UMI) for task demonstrations, with simulation-trained whole-body controllers (WBCs). The core innovation lies in the interface: end-effector trajectories in the task frame, inferred by a manipulation policy (trained from UMI data) and tracked by the WBC. This design allows for scaling task-centric data collection without a robot and robot-centric data in simulation without complex task setups. The framework is evaluated on prehensile, non-prehensile, and dynamic manipulation tasks, achieving over 70% success rates. Notably, it demonstrates zero-shot cross-embodiment deployment of a pre-trained manipulation policy originally for a fixed-base robot arm onto their quadruped system. The authors propose UMI-on-Legs as a scalable path for learning expressive manipulation skills on dynamic robot embodiments.
1.6. Original Source Link
- Official Source/PDF Link: https://arxiv.org/abs/2407.10353v1
- Publication Status: This is a preprint uploaded to arXiv on 2024-07-14.
2. Executive Summary
2.1. Background & Motivation (Why)
The field of robot learning faces a dilemma: real-world data collection provides direct task relevance but is bottlenecked by hardware costs, safety, and robot-specific data, while simulation offers safe exploration and infinite resets but struggles with task diversity and realistic object dynamics. Current legged manipulation systems often rely on body velocity commands or single-step body-frame end-effector targets, which are embodiment-specific (requiring the robot for data collection) and too simplistic for complex, dynamic manipulation. There is a critical need for a framework that combines the strengths of both real-world and simulation data collection in a scalable, generalizable, and user-friendly manner, especially for mobile manipulation on dynamic robot embodiments like quadrupeds.
The paper aims to address these challenges by providing:
- A method to scale task-centric data collection in the real world without needing a robot, reducing cost and safety concerns.
- A method to scale robot-centric data in simulation without requiring complex task-specific simulation setups or reward engineering.
- A robust and expressive interface between high-level manipulation policies and low-level whole-body controllers that allows for cross-embodiment generalization and handles the dynamic nature of quadruped robots.
- An accessible real-time odometry solution for in-the-wild mobile manipulation.
2.2. Main Contributions / Findings (What)
The main contributions of this work are:
- UMI-on-Legs Framework: Introduction of a novel framework that combines real-world human demonstrations (using the UMI hand-held gripper) and simulation-based reinforcement learning to train cross-embodiment mobile manipulation systems. This framework allows real-world data collected without robots to be distilled into manipulation skills for various mobile robot platforms, leveraging robot-specific controllers.
- Manipulation-Centric Whole-Body Controller: Proposal of a whole-body controller (WBC) that uses end-effector trajectories in the task frame as its interface. This simple yet expressive interface enables zero-shot cross-embodiment deployment of existing manipulation policies (originally designed for fixed-base arms) onto mobile quadrupeds, while also facilitating complex dynamic skills. The WBC is trained in simulation to track these trajectories, allowing it to compensate for base perturbations and anticipate future movements.
- Accessible Real-time Odometry System: Development and deployment of a real-time, robust, and accessible odometry approach based on Apple's ARKit running on an iPhone. This addresses a common bottleneck in mobile manipulation systems by providing fast and precise task-space tracking for in-the-wild deployment.
- Empirical Validation on Diverse Tasks: Successful evaluation of UMI-on-Legs on prehensile (cup rearrangement), non-prehensile (kettlebell pushing), and dynamic (ball tossing) manipulation tasks, reporting over 70% success rates on all tasks in real-world scenarios. The WBC demonstrates robustness to unexpected external dynamics.
- Zero-Shot Cross-Embodiment Transfer: Demonstrated the direct deployment of a pre-trained manipulation policy (from prior UMI work), originally for a fixed-base robot arm, onto the quadruped system without any fine-tuning, showcasing significant generalization capabilities.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Quadruped Robot: A robot with four legs, designed for locomotion over varied terrain. They are inherently dynamic systems, meaning their body can move significantly during tasks, posing challenges for stable manipulation.
- Robot Manipulation: The ability of a robot to interact with objects in its environment, typically using an end-effector (like a gripper or hand) attached to a robotic arm. This can involve prehensile actions (grasping), non-prehensile actions (pushing), or dynamic actions (tossing).
- Whole-Body Control (WBC): A control strategy for robots with multiple degrees of freedom (like a quadruped with an arm) that coordinates all joints (legs and arm) to achieve a desired task while maintaining balance and respecting physical limits. It goes beyond controlling just the arm or just the legs, integrating their movements.
- End-effector Trajectory: A sequence of desired poses (position and orientation) over time for the robot's end-effector. Instead of controlling individual joint angles, this specifies the path the gripper should follow in space.
- Task-Frame: A coordinate system defined relative to the task or object being manipulated. For example, if a robot is interacting with a cup on a table, the task frame might be fixed to the table. This is in contrast to a body frame, which is fixed to the robot's base and moves with the robot. Tracking in the task frame means the robot tries to maintain the end-effector's position relative to the task, even if its own body moves.
- Diffusion Policy: A type of visuo-motor policy that uses a diffusion model to predict robot actions. Diffusion models are generative models that learn to reverse a diffusion process (gradually adding noise to data) to generate new data samples. In robotics, they can be trained to generate action sequences (like end-effector trajectories) from visual observations by learning to denoise a noisy action sequence.
- Behavior Cloning (BC): A machine learning technique where a robot learns a policy by imitating expert demonstrations. The robot observes a human (or another robot) performing a task and tries to replicate the observed actions given similar observations.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, getting positive rewards for desired behaviors and penalties for undesired ones.
- Sim2Real Transfer: The process of training a robot policy or controller in a simulated environment and then deploying it successfully in the real world. This is challenging due to the reality gap (differences between simulation and reality), which often requires techniques like domain randomization or adaptive policies.
- Odometry: The process of estimating a robot's position and orientation (pose) over time using data from onboard sensors, such as wheel encoders, IMUs (Inertial Measurement Units), or cameras. Visual-inertial odometry (VIO) combines camera images with IMU data for robust pose estimation, especially useful in dynamic environments.
- Cross-embodiment Generalization: The ability of a robot policy or skill to transfer and work effectively on different robot platforms or configurations (e.g., from a fixed-base arm to a mobile quadruped) without significant re-training.
- Prehensile/Non-prehensile Manipulation: Prehensile manipulation involves grasping and holding objects, while non-prehensile manipulation involves interacting with objects without grasping them, such as pushing, nudging, or tossing.
3.2. Previous Works
The paper extensively references several foundational and contemporary works that form the basis of UMI-on-Legs:
- UMI [1]: The Universal Manipulation Interface (Chi et al., RSS 2024) is a key predecessor. It introduced the concept of collecting task-centric data using a hand-held gripper, enabling human demonstrations without requiring a robot. UMI-on-Legs directly leverages UMI for its real-world data collection strategy.
- Diffusion Policy [2]: (Chi et al., RSS 2023) This work forms the core of the manipulation policy in UMI-on-Legs. It proposed using diffusion models for visuo-motor policy learning, enabling robust action generation from visual observations. UMI-on-Legs adopts this architecture.
- Massively Parallel Simulators [3, 7]: Works like Isaac Gym [3] and methods from Rudin et al. [7] are crucial for scaling RL training for whole-body controllers. These simulators allow training thousands of robot instances in parallel, accelerating the learning process for complex control tasks and enabling Sim2Real transfer techniques like domain randomization.
- DeepWBC [8]: (Fu et al., CoRL 2023) This is cited as a baseline for whole-body control in legged manipulation. DeepWBC typically trains policies to control both locomotion and manipulation but often uses body-frame end-effector targets, which the authors argue are less effective for compensating body movements than their task-frame approach.
- MobileALOHA [27]: (Fu et al., arXiv 2024) This work proposed a wheeled bimanual platform for demonstration collection to learn challenging mobile manipulation skills. While MobileALOHA focuses on demonstration collection for wheeled robots, UMI-on-Legs offers an alternative for quadrupeds and uses a different data collection method (hand-held gripper).
- Prior Legged Manipulation Systems [8, 10-12]: Many prior works in legged manipulation have explored whole-body controllers. However, they often rely on body velocity commands or single-step body-frame end-effector targets, which are embodiment-specific and limited in representing complex trajectories. UMI-on-Legs differentiates itself by using task-frame end-effector trajectories and demonstrating zero-shot cross-embodiment transfer.
3.3. Technological Evolution
The field has evolved from traditional robot control (model-based, inverse kinematics) to learning-based approaches (behavior cloning, reinforcement learning). Early RL for locomotion often focused solely on leg movements [6, 28-30], while later works started integrating manipulation skills [31-33]. The development of massively parallel simulators [3, 7] significantly accelerated RL training and improved Sim2Real transfer [9]. Concurrently, advances in visuo-motor policies have led to more generalizable manipulation skills, often leveraging pre-trained visual encoders [1, 25, 43] and diffusion models [1, 2, 27]. The Universal Manipulation Interface (UMI) [1] represented a shift by enabling robot-agnostic data collection. UMI-on-Legs combines these threads: it uses UMI for task data, diffusion policies for manipulation skill inference, and massively parallel RL for whole-body control, specifically adapting these to the dynamic quadruped embodiment using a novel task-frame trajectory interface. The integration of readily available consumer electronics like iPhones for odometry further showcases an evolution towards more accessible and robust mobile manipulation systems.
3.4. Differentiation
UMI-on-Legs differentiates itself from prior work in several key aspects:
- Data Collection Strategy: Unlike systems requiring the robot for data collection (teleoperation or demonstration replay [8, 10, 11, 27]), UMI-on-Legs uses a hand-held gripper (UMI) to collect task-centric data in the real world without involving the robot hardware. This significantly reduces cost, safety concerns, and the robot-specific nature of the collected data.
- Whole-Body Controller Training: Instead of requiring full task simulation setups, UMI-on-Legs trains its WBC in simulation to track end-effector trajectories. This means the WBC doesn't need to interact with diverse objects or learn from complex task rewards in simulation, simplifying the sim-to-real transfer for the control aspect.
- Interface Design: Most prior legged manipulation systems use body velocity commands or single-step body-frame end-effector targets [8, 10-12]. UMI-on-Legs uses end-effector trajectories in the task frame as the interface. This provides preview information to the WBC for anticipation, is embodiment-agnostic beyond the end-effector, and enables precise and stable manipulation by actively compensating for base movements.
- Cross-Embodiment Generalization: The task-frame trajectory interface allows zero-shot cross-embodiment deployment of manipulation policies trained for fixed-base robot arms onto a mobile quadruped. This is a significant leap compared to policies that assume perfect lower-level controllers or are tied to specific robot kinematics.
- Accessible Odometry: UMI-on-Legs employs an iPhone with ARKit for real-time odometry, offering a compact, self-contained, and accessible solution for in-the-wild deployment, contrasting with systems relying on motion capture [31] or AprilTags [8, 11, 36], which constrain deployment environments.
4. Methodology
The UMI-on-Legs framework is a modular system composed of two main components: a high-level manipulation policy and a low-level whole-body controller (WBC). These two policies communicate through a well-defined interface: end-effector trajectories expressed in the task-frame.
The overall method overview is depicted in Figure 3.

This figure is a schematic overview of the method (Figure 3 in the paper). The system takes RGB images captured by a GoPro; a diffusion policy running at 2 Hz infers camera-frame end-effector trajectories (a). The trajectories are transformed into the task space, which serves as the interface to the whole-body controller (b). The controller, a multi-layer perceptron (MLP), outputs joint action commands at 50 Hz (c).
The system takes RGB images from a GoPro mounted on the robot's end-effector. A diffusion policy (trained using real-world UMI demonstrations) infers a camera-frame end-effector trajectory. This trajectory is then transformed into the task space and passed to the WBC. The WBC outputs joint position targets at 50 Hz, which are then tracked by PD controllers for both the legs and the arm.
4.1. Principles
The core idea behind UMI-on-Legs is to decouple the learning of task-specific manipulation skills from robot-specific whole-body control.
- Task-Centric Data Collection: Human operators can demonstrate diverse and complex manipulation tasks using a simple hand-held gripper (UMI) in the real world. This data is robot-agnostic beyond the end-effector type.
- Robot-Centric Control Learning: A whole-body controller is trained in simulation to robustly track any given end-effector trajectory. This training does not require complex task simulations or object dynamics, simplifying the Sim2Real transfer for the controller.
- Expressive Interface: The end-effector trajectory in the task frame serves as the communication bridge. It is simple enough for humans to demonstrate, yet expressive enough to capture complex motions, and provides preview information for the WBC to anticipate. This task-frame approach allows the WBC to actively compensate for base perturbations (body movements), making manipulation precise and stable.

The benefits of this interface design are highlighted:
- Intuitive demonstration: Non-expert users can demonstrate tasks using hand-held devices (like UMI), as they only need to specify the desired end-effector path, not complex robot-specific actions.
- High-level intention from preview horizon: Providing a trajectory (a sequence of future targets) allows the WBC to anticipate upcoming movements and coordinate the entire robot body, for example bracing for a high-velocity toss or leaning the body instead of stepping to reach a target.
- Precise and stable manipulation in the task frame: By tracking actions in a task space that is persistent regardless of base movement (Figure 4), the controller inherently learns to compensate for body movements and perturbations, leading to more stable manipulation. (Figure 4 in the paper contrasts trajectory tracking in the task frame versus the robot's body frame: operating the whole-body controller in the task frame effectively compensates for base perturbations and improves the manipulation policy's task execution, whereas body-frame tracking struggles to respond quickly to disturbances.)
- Asynchronous multi-frequency execution: The interface naturally supports a hierarchical control architecture, where a low-frequency manipulation policy (1-5 Hz) can command a high-frequency low-level controller (50 Hz), accommodating different sensor and inference latencies.
- Compatible with any trajectory-based manipulation policy: This interface allows plug-and-play functionality with existing or future trajectory-based manipulation policies, facilitating the transfer of "table-top" skills to "mobile" platforms.
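To make the task-frame compensation argument concrete, here is a minimal numpy sketch (function and frame names are illustrative, not from the paper's code) showing that a fixed task-frame target, re-expressed in the body frame, shifts by exactly the base perturbation — which is why a controller tracking in the task frame inherently compensates for base movement:

```python
import numpy as np

def task_to_body(T_task_base: np.ndarray, p_task: np.ndarray) -> np.ndarray:
    """Express a task-frame end-effector target in the robot's body frame.
    T_task_base is the base pose in the task frame (e.g. from odometry),
    given as a 4x4 homogeneous transform. Illustrative sketch only."""
    T_base_task = np.linalg.inv(T_task_base)
    return (T_base_task @ np.append(p_task, 1.0))[:3]

# A fixed task-frame target...
p_task = np.array([0.5, 0.0, 0.3])

# ...seen from an unperturbed base:
T0 = np.eye(4)
p0 = task_to_body(T0, p_task)   # stays at [0.5, 0.0, 0.3]

# ...and from a base that was pushed 10 cm sideways:
T1 = np.eye(4)
T1[1, 3] = 0.10
p1 = task_to_body(T1, p_task)   # shifts to [0.5, -0.1, 0.3] (up to float eps)
```

The body-frame target moves opposite to the perturbation, so tracking it restores the end-effector to the same task-space pose; a body-frame target formulation would not see the perturbation at all.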
4.2. Steps & Procedures
The UMI-on-Legs framework operates in two main stages: training and deployment.
4.2.1. Training Stage
- Manipulation Policy Training (Real-World Data, Offline):
  - Data Collection: Human operators use a hand-held gripper (UMI) to perform the desired manipulation tasks in the real world. The UMI device records end-effector poses and corresponding visual observations. This data is task-centric and embodiment-agnostic (it does not depend on the robot's specific body).
  - Policy Training: A diffusion policy is trained using this collected data. The policy learns to predict sequences of future end-effector poses in the camera frame given visual inputs from a wrist-mounted camera.
- Whole-Body Controller (WBC) Training (Simulation, Offline):
  - Simulation Environment: A massively parallelized simulation environment (Isaac Gym or similar) is used. This environment contains the quadruped robot with its arm but does not require a complex task setup with objects or detailed object dynamics.
  - Trajectory Generation for Training: End-effector trajectories (similar to those collected with UMI) are used as targets. The paper states that UMI trajectories are used, but also notes that the WBC can be trained without task simulations, implying a simplified trajectory generation might be used for WBC pre-training.
  - Reinforcement Learning: An RL agent (the WBC) is trained to track these end-effector trajectories in the task frame while coordinating all 18 degrees of freedom (12 for the legs, 6 for the arm).
  - Observation Space: The WBC observes the robot's 18 joint positions and velocities, base orientation and angular velocity, the previous action, and the end-effector trajectory (target poses densely sampled from -60 ms to +60 ms at 20 ms intervals, plus a target 1000 ms into the future) inferred by the manipulation policy. End-effector pose is represented by a 3D position vector and a 6D rotation representation [52].
  - Reward Function: The RL agent is rewarded for minimizing position error and orientation error to the target pose. Regularization and shaping terms (e.g., joint limits, action rate, root height, collision, body-EE alignment, even mass distribution, feet under hips) are also included. A sigma curriculum is used for the position and orientation error rewards to encourage precision.
  - Sim2Real Transfer Techniques: During training, domain randomization (random pushes, joint friction, damping, contact frictions, body/arm masses, centers of mass) is applied to bridge the reality gap. Crucially, control latency is modeled, and the robot is randomly teleported to account for odometry noise.
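The 6D rotation representation referenced above (Zhou et al. [52]) keeps the first two columns of the rotation matrix and recovers a full rotation via Gram-Schmidt orthonormalization. A small numpy sketch of this standard conversion (not the paper's code):

```python
import numpy as np

def rotation_6d_to_matrix(d6: np.ndarray) -> np.ndarray:
    """Convert a 6D rotation representation (the first two columns of a
    rotation matrix, possibly unnormalized) into a full 3x3 rotation
    matrix via Gram-Schmidt, following Zhou et al."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)              # first orthonormal column
    a2_proj = a2 - np.dot(b1, a2) * b1        # remove component along b1
    b2 = a2_proj / np.linalg.norm(a2_proj)    # second orthonormal column
    b3 = np.cross(b1, b2)                     # third column by cross product
    return np.stack([b1, b2, b3], axis=1)

# The identity rotation corresponds to d6 = [1, 0, 0, 0, 1, 0]
R = rotation_6d_to_matrix(np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
```

This representation is popular for learned policies because, unlike quaternions or Euler angles, it is continuous: nearby rotations always map to nearby 6D vectors, which makes regression targets better behaved.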
4.2.2. Deployment Stage
- Real-time Sensing:
  - RGB images are captured by a GoPro mounted on the end-effector.
  - Robot proprioception (joint positions, velocities, base orientation, angular velocity) is gathered from the robot's internal sensors.
  - Real-time odometry (the robot's pose in the global task frame) is provided by an iPhone running ARKit, mounted on the robot's base.
- High-Level Manipulation Policy Inference (Low Frequency):
  - The RGB images are fed to the diffusion policy (running on an external RTX 4090 desktop) to infer a camera-frame end-effector trajectory.
  - This camera-frame trajectory is then transformed into the task frame using the iPhone's odometry and the robot's kinematic chain.
  - The manipulation policy typically runs at a low frequency (e.g., 1-5 Hz).
- Low-Level Whole-Body Control Execution (High Frequency):
  - The inferred task-frame end-effector trajectory is passed to the WBC (running on the robot's onboard Jetson computer).
  - The WBC (an MLP network) takes proprioceptive feedback, odometry, and the target trajectory as input.
  - It outputs joint position targets for all 18 DOF at a high frequency (e.g., 50 Hz).
  - Separate PD controllers for the legs and arm track these joint position targets.
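The camera-frame-to-task-frame conversion in this deployment loop is a plain composition of homogeneous transforms. A hedged sketch (assuming 4x4 matrices and illustrative names; the actual implementation may differ):

```python
import numpy as np

def to_task_frame(T_task_base: np.ndarray,
                  T_base_camera: np.ndarray,
                  traj_camera: list) -> list:
    """Re-express camera-frame end-effector waypoints in the task frame.
    T_task_base would come from odometry (e.g. iPhone ARKit) and
    T_base_camera from the robot's kinematic chain; all poses are 4x4
    homogeneous transforms. Illustrative sketch, not the paper's code."""
    T_task_camera = T_task_base @ T_base_camera
    return [T_task_camera @ T for T in traj_camera]

# Example: base 0.2 m ahead of the task origin, identity camera mount,
# and one waypoint 0.1 m ahead of the camera -> x-translation ≈ 0.3 m.
T_task_base = np.eye(4)
T_task_base[0, 3] = 0.2
waypoint = np.eye(4)
waypoint[0, 3] = 0.1
traj_task = to_task_frame(T_task_base, np.eye(4), [waypoint])
```

Because the waypoints are anchored to the task frame once converted, the WBC can keep tracking them at 50 Hz even as the odometry reports the base drifting between policy inferences.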
4.3. Mathematical Formulas & Key Details
4.3.1. Manipulation Policy with Behavior Cloning
- Architecture: The manipulation policy uses a U-Net [50] architecture for its diffusion policy [2] with a DDIM scheduler [51]. It leverages a pre-trained CLIP vision encoder [25] for visual feature extraction.
- Action Horizon: An action horizon of 4 is used, meaning the policy predicts 4 future end-effector poses or trajectory segments, providing more information to the low-level controller.
- Training Data: For tasks like cup rearrangement, pre-trained checkpoints from UMI [1] are used directly. For new tasks like pushing and tossing, diffusion policies are trained from scratch using newly collected UMI data.
4.3.2. Whole-body Controller with Reinforcement Learning
- Reward Function for Trajectory Tracking: The core of the WBC's training objective is to minimize the error between the end-effector's current pose and the target pose within the trajectory. The reward for this objective is given by:

  $$r_{\text{track}} = \exp\left(-\left(\frac{e_{\text{pos}}}{\sigma_{\text{pos}}} + \frac{e_{\text{ori}}}{\sigma_{\text{ori}}}\right)\right)$$

  Where:
  - $r_{\text{track}}$: The reward obtained for tracking the target pose.
  - $e_{\text{pos}}$: The position error, i.e., the Euclidean distance between the actual end-effector position and the target end-effector position.
  - $e_{\text{ori}}$: The orientation error, i.e., the angular difference (e.g., measured in radians) between the actual end-effector orientation and the target end-effector orientation. The paper uses a 6D rotation representation [52] for orientation.
  - $\sigma_{\text{pos}}$: A scaling term (standard deviation) for the position error.
  - $\sigma_{\text{ori}}$: A scaling term (standard deviation) for the orientation error.

  Explanation: This exponential reward function encourages errors to be as small as possible. The scaling terms ($\sigma_{\text{pos}}$, $\sigma_{\text{ori}}$) determine the sensitivity of the reward to errors: smaller values make the reward drop off more steeply for small errors, pushing the policy towards higher precision. Combining the position and orientation terms inside a single exponential, rather than rewarding them separately, leads to more balanced precision in both.
- Sigma Curriculum: A sigma curriculum is applied, meaning the values of $\sigma_{\text{pos}}$ and $\sigma_{\text{ori}}$ are gradually decreased during training as the position and orientation errors become smaller. This allows for broader exploration in the early stages of training (larger values make the reward less sensitive to small errors) and forces the policy to achieve high precision in later stages (smaller values make the reward highly sensitive to small errors).
  - $\sigma_{\text{pos}}$ is set to [2, 0.1, 0.5, 0.1, 0.05, 0.01, 0.005] as the position error falls below [100, 1.0, 0.8, 0.5, 0.4, 0.2, 0.1], respectively.
  - $\sigma_{\text{ori}}$ is set to [8.0, 4.0, 2.0, 1.0, 0.5] as the orientation error falls below [100.0, 1.0, 0.8, 0.6, 0.2], respectively.
- Additional Reward Terms (Regularization and Shaping): The WBC incorporates several other reward terms, standard in RL for locomotion and manipulation, to ensure stable, safe, and energy-efficient behavior. These include:
  - Joint Limit: Penalizes exceeding the joints' physical limits.
  - Joint Acceleration: Penalizes rapid changes in joint velocities, promoting smooth movements.
  - Joint Torque: Penalizes high torques, encouraging energy efficiency and preventing motor overheating.
  - Root Height: Rewards maintaining a desired body height.
  - Collision: Penalizes unwanted collisions (e.g., the robot body hitting the ground).
  - Action Rate: Penalizes large changes between successive actions, promoting smoother control signals.
  - Body-EE Alignment: Regularizes specific arm joints (0 and 3) to keep the gripper aligned with the body, preventing awkward poses.
  - Even Mass Distribution: Regularizes the standard deviation of the forces exerted by the four feet, encouraging balanced weight distribution and reducing motor overheating.
  - Feet Under Hips: Regularizes the feet's planar positions to stay near their respective hips, improving standing stability.

  The specific weights for these reward terms are provided in the Appendix (Table 5).
- Policy Network Architecture: A multi-layer perceptron (MLP) is used as the controller network. It maps observations (robot state, end-effector trajectory) to target joint positions for both the 12-DOF legs and the 6-DOF arm. The MLP's fast inference time enables onboard deployment.
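Putting the tracking reward and sigma curriculum together, here is a minimal sketch of one plausible reading (the exact reward form and curriculum update rule in the paper's code may differ; the schedule values are those quoted above):

```python
import numpy as np

# Curriculum schedules as quoted in the text (Section 4.3.2)
POS_SIGMAS = [2, 0.1, 0.5, 0.1, 0.05, 0.01, 0.005]
POS_THRESH = [100, 1.0, 0.8, 0.5, 0.4, 0.2, 0.1]
ORI_SIGMAS = [8.0, 4.0, 2.0, 1.0, 0.5]
ORI_THRESH = [100.0, 1.0, 0.8, 0.6, 0.2]

def curriculum_sigma(running_err: float, sigmas, thresholds) -> float:
    """Advance to the tightest sigma whose error threshold the running
    (e.g. averaged) tracking error has beaten."""
    sigma = sigmas[0]
    for s, t in zip(sigmas, thresholds):
        if running_err < t:
            sigma = s
    return sigma

def tracking_reward(e_pos: float, e_ori: float,
                    sigma_pos: float, sigma_ori: float) -> float:
    """Combined exponential tracking reward: maximal (1.0) at zero error,
    decaying as position/orientation errors grow."""
    return float(np.exp(-(e_pos / sigma_pos + e_ori / sigma_ori)))

# Perfect tracking yields the maximum reward of 1.0
r = tracking_reward(0.0, 0.0, sigma_pos=0.05, sigma_ori=0.5)
```

As the policy's running error shrinks, `curriculum_sigma` returns smaller sigmas, so the same absolute error earns exponentially less reward and the policy is pushed toward millimeter-level precision.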
4.3.3. System Integration
- Robot Hardware:
  - Quadruped: Unitree Go2 (12 DOF).
  - Manipulator: ARX5 robot arm (6 DOF).
  - End-effector: Customized with Finray grippers and a GoPro (to match the UMI gripper).
  - Power: Both the robot and the arm are powered by the Go2's battery.
- Compute: The whole-body controller runs on the Go2's onboard Jetson computer, while diffusion policy inference runs on a separate desktop RTX 4090 over an internet connection.
- Odometry: An iPhone mounted on the robot's base provides pose estimation via ARKit. It connects to the Jetson via Ethernet.
  - Placement: The iPhone is mounted on the rear of the robot, facing backward at an angle. This avoids arm-phone collisions, minimizes motion blur and visual occlusion, and ensures ARKit can keep tracking the environment even when the Go2 bends.
- Sim2Real Transfer Details:
  - Robustness Training: Random push forces are applied to the robot during RL training to enhance robustness.
  - Domain Randomization: Joint friction, damping, contact frictions, body and arm masses, and centers of mass are randomized.
  - Latency Modeling: Crucially, control latency is modeled during training. The paper reports that simulating a 20 ms end-to-end control latency gives the best real-world performance. Pose latency is also simulated for the odometry.
  - Odometry Noise Handling: To account for odometry system noise, the robot's pose is randomly teleported every 20 seconds mid-episode.
  - URDF Accuracy: For highly dynamic tasks, precise URDF (Unified Robot Description Format) models are critical. The authors disassembled the ARX5 arm, re-weighed its components, and recomputed mass, center of mass, and inertia matrices for an accurate URDF, which was essential for both simulation stability and real-world deployment.
- Latency Challenges and Mitigation:
  - Control Latency: ROS2 communication, motor encoder readouts, WBC inference, and motor execution all introduce latency. A 20 ms end-to-end control latency was found to work best in simulation to match the real system.
  - Pose Estimator Latency: Both motion capture and iPhone ARKit introduce latency, which is simulated as pose latency during training. Attempts to compensate for the iPhone's latency by integrating base velocity estimates (from foot contacts, joint positions/velocities, and IMU readings) forward in time did not significantly reduce shaking.
  - Python ROS2 Latency: Due to Python's Global Interpreter Lock, ROS2 callback functions can block each other. Optimizing and detaching long-running callbacks into separate ROS2 nodes sped up joint observation updates and policy inference.
- Safety Measures:
  - Shaking/Oscillation: Action rate regularization (penalizing large changes in actions) was increased during training. The onboard Lidar was disabled because it introduced shaking.
  - Overheat Shutdowns: The calf joint's linkage design requires higher torque. Increased torque regularization during training allowed continuous operation for up to 30 minutes.
  - Unsafe Configurations: Reward terms were added to regularize the body to a more balanced pose with a centrally located center of mass, preventing the robot from twisting its legs into the unsafe, high-precision configurations often favored in simulation.
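Latency modeling of the kind described above can be sketched as a simple action-delay buffer: at a 50 Hz control rate, 20 ms corresponds to delaying each action by one control step (an illustrative sketch, not the paper's implementation):

```python
from collections import deque

class LatencyBuffer:
    """Simulate end-to-end control latency by delaying actions a fixed
    number of control steps. At 50 Hz, 20 ms latency = 1 step.
    Illustrative sketch; the paper's simulator may model latency differently."""

    def __init__(self, delay_steps: int, default_action):
        # Pre-fill with a default action so the first few steps return
        # "stale" commands, as a real delayed pipeline would.
        self.buf = deque([default_action] * delay_steps)

    def step(self, action):
        """Enqueue the freshly computed action; return the delayed one
        that actually reaches the motors this control step."""
        self.buf.append(action)
        return self.buf.popleft()

# With one step of delay, action i is executed at step i + 1.
lb = LatencyBuffer(delay_steps=1, default_action=0)
first = lb.step(1)   # returns the default action (0), not 1
second = lb.step(2)  # returns 1, delayed by one step
```

Training the WBC against this delayed view of its own commands is what lets the policy remain stable when real motors, encoders, and ROS2 messaging add the same delay at deployment time.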
5. Experimental Setup
The authors designed a series of experiments to validate the UMI-on-Legs framework, focusing on capability, robustness, and scalability.
5.1. Datasets
- Real-world UMI Demonstrations:
  - Cup Rearrangement Task: The paper directly uses a pre-trained UMI checkpoint [1]. This policy was trained on 1400 episodes of human demonstrations collected using a hand-held gripper for placing an espresso cup on its saucer with a specific handle orientation.
  - Pushing and Tossing Tasks: For these tasks, diffusion policies were trained from scratch using UMI data collected by the authors: 500 episodes of human demonstration for grasping and tossing (dynamic tossing), and 25 episodes for kettlebell pushing.
  - Characteristics: The data includes end-effector poses and wrist-mounted camera views. It is task-centric and robot-agnostic beyond the end-effector.
- Synthetic Trajectories for WBC Training: For whole-body controller training, end-effector trajectories (e.g., from UMI data) are provided as targets in simulation. The training environment includes the robot but no complex task objects, reducing the need for diverse simulated assets.
- Justification: Using UMI data allows diverse, task-relevant manipulation skills to be collected cheaply in the real world without a robot. Training the WBC in simulation with these trajectories avoids the difficulty of simulating diverse objects and engineering task-specific rewards for controller learning.
5.2. Evaluation Metrics
The experiments use a combination of simulation and real-world metrics to assess performance.
- Simulation Metrics (Averaged over 500 episodes):
  - Position Error (cm):
    - Conceptual Definition: Measures the average Euclidean distance between the end-effector's actual position and its target position over time. It quantifies the accuracy of the controller in reaching the desired spatial location. A lower value indicates better precision.
    - Formula: While the paper does not explicitly provide the formula for position error as a metric, the standard definition is the Euclidean distance:
      $$e_{\text{pos}} = \lVert \mathbf{p}_{\text{ee}} - \mathbf{p}_{\text{target}} \rVert_2$$
      Where:
      - $\mathbf{p}_{\text{ee}}$: the 3D position vector of the end-effector in the task frame.
      - $\mathbf{p}_{\text{target}}$: the 3D target position vector from the trajectory in the task frame.
      - $\lVert \cdot \rVert_2$: the Euclidean norm (distance).
    - Unit: Centimeters (cm).
  - Orientation Error (deg or rad):
    - Conceptual Definition: Measures the average angular difference between the end-effector's actual orientation and its target orientation over time. It quantifies the accuracy of the controller in achieving the desired angular pose. A lower value indicates better orientation control.
    - Formula: The paper mentions using a 6D rotation representation [52]. While a single canonical formula for orientation error for this representation isn't explicitly given in the paper, a common choice of angular error between two rotation matrices is the geodesic distance:
      $$e_{\text{orn}} = \arccos\!\left(\frac{\operatorname{tr}\!\left(\mathbf{R}_{\text{ee}}^{\top} \mathbf{R}_{\text{target}}\right) - 1}{2}\right)$$
      Where:
      - $\mathbf{R}_{\text{ee}}$: actual end-effector rotation matrix.
      - $\mathbf{R}_{\text{target}}$: target end-effector rotation matrix.
      - $\operatorname{tr}(\cdot)$: the sum of the diagonal elements of a matrix. The result is in radians and can be converted to degrees.
    - Unit: Degrees (deg) or Radians (rad).
  - Survival (%):
    - Conceptual Definition: Represents the percentage of episodes where the robot avoids a terminal collision and completes the desired duration of the task. A terminal collision is defined as contact by any robot part other than the feet or gripper. It indicates the overall stability and safety of the policy. Higher is better.
    - Formula: (Number of surviving episodes / Total number of episodes) × 100%.
  - Electrical Power Usage (kW):
    - Conceptual Definition: Estimates the average electrical power consumed by the robot's motors during an episode. It is calculated from the real hardware's voltages, manufacturer-reported torque constants, and simulated motor torques, and measures the energy efficiency of the policy. A lower value indicates better efficiency.
    - Formula: Typically derived from the sum of torque × angular velocity over all joints, potentially incorporating motor efficiency and voltage. The paper states it is based on "real hardware's voltages, manufacturers' reported torque constants, and the simulation's motor torques," implying a model-based estimation.
    - Unit: Kilowatts (kW).
- Real-World Metrics (Averaged over 20 episodes):
  - Success Rate (%):
    - Conceptual Definition: The percentage of real-world trials where the robot successfully completes the defined task according to task-specific criteria (e.g., the ball landing within a set radius of the target, or the cup placed upright with its handle oriented correctly). Higher is better.
    - Formula: (Number of successful trials / Total number of trials) × 100%.
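The simulation metrics above can be sketched in a few lines (a minimal illustration; the function names are ours, and the power estimate keeps only the torque-velocity product, omitting the voltage and torque-constant terms the paper folds in):

```python
import numpy as np

def position_error_cm(p_ee, p_target):
    """Mean Euclidean distance between actual and target positions (m), in cm."""
    return 100.0 * np.mean(np.linalg.norm(p_ee - p_target, axis=-1))

def orientation_error_deg(R_ee, R_target):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos_theta = (np.trace(R_ee.T @ R_target) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def survival_rate(terminated_early):
    """Percentage of episodes with no terminal collision."""
    terminated_early = np.asarray(terminated_early)
    return 100.0 * np.mean(~terminated_early)

def mean_power_kw(torques, joint_velocities):
    """Mechanical power estimate: per-step sum of |torque * velocity|, in kW."""
    return np.mean(np.sum(np.abs(torques * joint_velocities), axis=-1)) / 1000.0

# Identity vs. a 90-degree yaw rotation gives a 90-degree orientation error.
R_id = np.eye(3)
R_yaw90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(orientation_error_deg(R_id, R_yaw90))
```

The `np.clip` guards the `arccos` against floating-point values slightly outside [-1, 1], a standard precaution when comparing nearly identical rotations.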
5.3. Baselines
The paper primarily uses ablations of its own proposed WBC design and DeepWBC as baselines for comparison in simulation.
- Ablation Studies:
  - (-) Preview: Removes the trajectory observations (future target poses) from the WBC's input, so the controller only sees the current target.
  - (-) Task-space: The WBC is trained to track end-effector targets in the body frame instead of the task frame.
  - (-) UMI Traj: The WBC is trained on simpler, heuristically generated trajectories instead of real-world manipulation trajectories from UMI.
  - DeepWBC [8]: A baseline representing prior whole-body control approaches, conceptually similar to removing preview, task-space tracking, and UMI trajectories simultaneously. It relies on body-frame targets.
- Odometry Comparison: For real-world evaluations, the proposed iPhone-based odometry system is compared against an OptiTrack motion capture system, which serves as an oracle (ground truth) for pose estimation.
- No-Preview Baseline (Real-World): For the dynamic tossing task, the behavior of a no-preview baseline is demonstrated visually in the real world to highlight the importance of trajectory information.
5.4. Task Descriptions
5.4.1. Capability: Whole-Body Dynamic Tossing
- Goal: The robot must grasp a ball and toss it towards a target bin some distance away.
- Success Criteria: The ball must land within a set radius of the bin's center.
- Difficulty: Requires dynamic whole-body coordination, precise timing, power from all DOFs, and momentary balancing on one or two feet. This is challenging for legged systems because the base must be stabilized during dynamic arm movements.
5.4.2. Robustness: End-effector Reaching Leads to Robust Whole-body Pushing
- Goal: Push a 10 lbs (or 20 lbs) kettlebell such that it slides upright into a goal region.
- Setup: The kettlebell's initial distance from the goal varies between episodes.
- Difficulty: Tests the controller's ability to handle unexpected external dynamics (large disturbances, varying friction, feet slippage) in a zero-shot fashion. Small imprecision in the contact point can cause the kettlebell to topple.
5.4.3. Scalability: Plug-and-play Cross-Embodiment Manipulation Policies (In-the-wild Cup Rearrangement)
- Goal: Place an espresso cup on its saucer with its handle pointed to the left of the robot.
- Setup: The cup and saucer are randomly positioned within a fixed radius, with random cup orientation. Evaluated in unseen environments (in-the-wild).
- Success Criteria: The cup must be upright on the saucer with its handle within a set angular tolerance of pointing directly left.
- Difficulty: This task primarily validates zero-shot cross-embodiment generalization by deploying a pre-trained manipulation policy (from UMI, for a fixed-base UR5e arm) onto the quadruped system (18 DOF, with an ARX5 arm) without fine-tuning. It also requires precise 6 DOF end-effector movements for prehensile and non-prehensile actions, which are traditionally difficult for low-cost legged manipulation systems.
6. Results & Analysis
6.1. Core Results
6.1.1. Capability: Whole-Body Dynamic Tossing
The dynamic tossing experiment aimed to validate the system's ability to handle complex, dynamic whole-body manipulation skills. The WBC successfully discovered a whole-body tossing strategy that involves coordinated movements across all joints, momentary balancing, and effective use of body mass for stability.
The following table shows the results from Table 1, which evaluates the tossing task in simulation for various WBC configurations:
| Approach | Pos Err (cm ↓) | Orn Err (deg ↓) | Survival (% ↑) | Power (kW ↓) |
|---|---|---|---|---|
| Ours | 2.12 | 3.35 | 98.4 | 3.82 |
| (-) Preview | 3.02 | 4.23 | 93.0 | 4.74 |
| (-) Task-space | 15.49 | 15.55 | 0.0 | 3.95 |
| (-) UMI Traj | 2.48 | 15.67 | 97.4 | 3.69 |
| DeepWBC [8] | 22.2 | 66.22 | 0.0 | 5.92 |
Analysis of Tossing Evaluation in Sim (Table 1):
- Ours (full proposed system): Achieves the best balance of low position error (2.12 cm), low orientation error (3.35 deg), high survival (98.4%), and reasonable power usage (3.82 kW). This demonstrates the effectiveness of combining task-frame trajectory tracking with preview information from UMI data.
- (-) Preview: Removing trajectory observations (future targets) from the WBC leads to a noticeable increase in both position (3.02 cm) and orientation (4.23 deg) errors, and a drop in survival (93.0%). This highlights the importance of a preview horizon for the WBC to anticipate and plan whole-body coordination, especially for dynamic tasks. It also uses more power (4.74 kW) because its movements are reactive rather than proactive.
- (-) Task-space: When the WBC tracks targets in the body frame instead of the task frame, performance degrades severely. Position error (15.49 cm) and orientation error (15.55 deg) become very high, and survival drops to 0.0%. This is a critical finding, confirming that task-frame tracking is essential for stable manipulation on dynamic quadruped embodiments, as it enables the controller to compensate for base movements.
- (-) UMI Traj: Training the WBC on simpler, non-UMI trajectories results in increased orientation error (15.67 deg) while position error (2.48 cm) remains relatively low and survival (97.4%) stays high. This suggests that the quality and expressiveness of UMI-collected trajectories matter for learning precise orientation control, though the WBC can still maintain stability with simpler trajectories for position.
- DeepWBC [8]: This baseline, representing a more traditional body-frame approach without trajectory preview or UMI data, performs worst, with extremely high position (22.2 cm) and orientation (66.22 deg) errors and 0.0% survival. This strongly validates the design choices of UMI-on-Legs.

In real-world experiments, UMI-on-Legs achieved high success rates on the dynamic tossing task with both motion capture (oracle) and iPhone odometry. The authors also showed that using calibrated tossing trajectory replay with motion capture (acting as an "oracle manipulation policy") improved success, suggesting that the remaining misses are due to imprecise manipulation policies rather than the WBC.
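The role of preview can be illustrated with a sketch of how a target window might be sliced from a task-frame trajectory (a hypothetical construction, not the released code; the (-) Preview ablation corresponds to a window of length one):

```python
import numpy as np

def preview_observation(trajectory, t, horizon=3):
    """Slice the current target plus `horizon` future targets from a
    task-frame trajectory, clamping indices at the final pose.

    `trajectory` is an (N, D) array of end-effector targets. With
    horizon=0 the controller sees only the current target, which is
    the (-) Preview ablation: it must react instead of anticipate.
    """
    idx = np.minimum(np.arange(t, t + horizon + 1), len(trajectory) - 1)
    return trajectory[idx]

traj = np.arange(5, dtype=float).reshape(-1, 1)  # five 1-D "poses"
window = preview_observation(traj, t=3, horizon=3)
print(window.ravel())  # current target plus clamped future targets
```

With the window, a controller tracking a fast toss can start accelerating its base before the peak velocity is demanded; without it, each target arrives as a surprise, matching the higher errors and power of the no-preview ablation.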
Figure 5: The dynamic tossing task. Top row: our method, showing the robot jumping into the toss, tucking in, and landing. Bottom row: the no-preview method, which produces frantic motions. The bar chart on the right compares the success rates of three policies, the highest reaching 85%.
6.1.2. Robustness: End-effector Reaching Leads to Robust Whole-body Pushing
The kettlebell pushing task demonstrated the WBC's zero-shot robustness to unexpected external dynamics.
- Success Rates: The system achieved high success rates with both motion capture and iPhone odometry when pushing a 10 lbs kettlebell. Even with a 20 lbs kettlebell, 4 out of 5 episodes were successful.
- Adaptive Strategy: The WBC learned to adapt its strategy in response to disturbances. When encountering the kettlebell's resistance, it would lean forward to exert more force, a strategy that would cause the robot to fall without the kettlebell. This demonstrates the WBC's ability to coordinate whole-body movements to maintain balance and achieve the task goal even under significant external forces. The controller's ability to step its legs forward to reach faraway target poses, without being explicitly commanded a base velocity, also highlights its coordination.
Figure 6: Composite figure showing the quadruped performing the push and catch behaviors with the UMI-on-Legs framework, along with knee-joint velocity curves over time and a bar chart comparing success rates across objects.
6.1.3. Scalability: Plug-and-play Cross-Embodiment Manipulation Policies (In-the-wild Cup Rearrangement)
This experiment showcased the ability of UMI-on-Legs to deploy a pre-trained manipulation policy from prior work (originally for a fixed-base UR5e arm) onto the quadruped system with zero-shot generalization.
- Success Rates: The system achieved high success rates using both motion capture and iPhone odometry. This is a remarkable result given the cross-embodiment gap (fixed-base 6 DoF industrial arm vs. mobile 18 DoF legged manipulator) and the precision requirements of the task.
- Dynamic Adaptation: The quadruped dynamically tilted and leaned its base to increase its reach range or counterbalance the arm, effectively compensating for the kinematic reach limitations of its smaller ARX5 arm relative to the UR5e the policy was originally trained for.
- In-the-Wild Performance: The task was evaluated in unseen environments including unstable terrain (grass, dirt), collapsible tables, and direct sunlight, demonstrating the generalizability of the diffusion policy and the robustness of the WBC in challenging conditions.
Figure 7: Three-part figure illustrating the transfer of an existing manipulation policy to a quadruped. Left: a fixed-base robot arm performing the task; middle: the quadruped with the learned whole-body controller performing the same task; right: a success-rate bar chart showing the quadruped completing the task with an 80% zero-shot transfer success rate.
6.2. Data Presentation (Tables)
Table 2: Hyperparameters for the dynamic tossing task.
| Hyperparameters | Values |
|---|---|
| Training Set Trajectory Number | 500 |
| Diffusion Policy Visual Observation Horizon | 2 |
| Diffusion Policy Proprioception Horizon | 4 |
| Diffusion Policy Output Steps | 64 |
| Diffusion Policy Execution Steps | 40 |
| Diffusion Policy Execution Frequency | 12Hz |
| Trajectory Update Smoothing Time | 0.1s |
Table 3: Hyperparameters for the whole-body pushing task.
| Hyperparameters | Values |
|---|---|
| Training Set Trajectory Number | 25 |
| Diffusion Policy Visual Observation Horizon | 2 |
| Diffusion Policy Proprioception Horizon | 2 |
| Diffusion Policy Output Steps | 32 |
| Diffusion Policy Execution Steps | 10 |
| Diffusion Policy Execution Frequency | 10Hz |
| Trajectory Update Smoothing Time | 0.3s |
Table 4: Hyperparameters for the UMI cup rearrangement task.
| Hyperparameters | Values |
|---|---|
| Training Set Trajectory Number | 1400 |
| Diffusion Policy Visual Observation Horizon | 1 |
| Diffusion Policy Proprioception Horizon | 2 |
| Diffusion Policy Output Steps | 16 |
| Diffusion Policy Execution Steps | 8 |
| Diffusion Policy Execution Frequency | 5Hz |
| Trajectory Update Smoothing Time | 0.1s |
Table 5: Reward terms and their weights.
| Name | Weight |
|---|---|
| Joint Limit | -10 |
| Joint Acceleration | -2.5e-7 |
| Joint Torque | -1e-4 |
| Root Height Collision | -1 |
| Action Rate | -1, -0.01 |
| Body-EE Alignment | -1 |
| Even Mass Distribution | -1 |
| Feet Under Hips | -1 |
| Pose Reaching | 4 |
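The reward terms in Table 5 combine as a weighted sum; a minimal sketch follows (the per-term raw values are placeholders, since the paper defines each term's exact formulation, not reproduced here):

```python
# Weights from Table 5. Each raw term (e.g., a joint-limit violation
# measure, a torque magnitude, a pose-reaching score) is computed per
# step by the simulator and scaled by its weight.
WEIGHTS = {
    "joint_limit": -10.0,
    "joint_acceleration": -2.5e-7,
    "joint_torque": -1e-4,
    "root_height_collision": -1.0,
    "action_rate": -1.0,
    "body_ee_alignment": -1.0,
    "even_mass_distribution": -1.0,
    "feet_under_hips": -1.0,
    "pose_reaching": 4.0,
}

def total_reward(terms):
    """Sum each raw term value scaled by its Table 5 weight."""
    return sum(WEIGHTS[name] * value for name, value in terms.items())

# Toy step: a partial pose-reaching score, penalized by applied torque.
print(total_reward({"pose_reaching": 0.5, "joint_torque": 100.0}))  # ~1.99
```

Note the single positive term (pose reaching) against many small regularizers, which is how the controller is pushed toward tracking while torque, acceleration, and posture penalties keep behaviors safe and efficient.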
Table 6: Domain randomization hyperparameters.
| Hyperparameters | Values |
|---|---|
| Init XY Position | [-0.1m,0.1m] |
| Init Z Orientation | [-0.05rad,0.05rad] |
| Joint Damping | [0.01,0.5] |
| Joint Friction | [0.0,0.05] |
| Geometry Friction | [0.1,8.0] |
| Mass Randomization | [-0.25,0.25] |
| Center of Mass Randomization | [-0.1m,0.1m] |
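Table 6's ranges are typically applied by sampling each parameter uniformly per episode; a minimal sketch (interpreting the unitless mass and center-of-mass entries as additive offsets, which is an assumption on our part):

```python
import random

# Ranges from Table 6, units as listed in the table.
RANDOMIZATION = {
    "init_xy_position": (-0.1, 0.1),      # m
    "init_z_orientation": (-0.05, 0.05),  # rad
    "joint_damping": (0.01, 0.5),
    "joint_friction": (0.0, 0.05),
    "geometry_friction": (0.1, 8.0),
    "mass_offset": (-0.25, 0.25),         # assumed additive
    "com_offset": (-0.1, 0.1),            # m, assumed additive
}

def sample_randomization(rng):
    """Draw one set of physics parameters, uniform within each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION.items()}

params = sample_randomization(random.Random(0))
assert all(RANDOMIZATION[k][0] <= v <= RANDOMIZATION[k][1] for k, v in params.items())
print(sorted(params))
```

Resampling these per episode forces the WBC to succeed under many plausible physics, which is what lets a simulation-trained controller survive real-world friction and mass variation.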
Table 7: System cost.
| Item | Cost ($) |
|---|---|
| Unitree Go2 Edu Plus | 12,500.00 |
| ARX5 Robot Arm | 10,000.00 |
| GoPro Hero9 | 210.99 |
| GoPro Media Mod | 79.99 |
| GoPro Max Lens Mod | 68.69 |
| iPhone 15 Pro | 999.00 |
| Elgato Capture Card | 147.34 |
| Total | 24,006.01 |
6.3. Ablations / Parameter Sensitivity
The ablation studies (Table 1) clearly demonstrate the critical role of key design choices in UMI-on-Legs:
- Preview Horizon: The performance degradation of (-) Preview confirms that providing future trajectory information is crucial for the WBC to anticipate dynamic movements, leading to smoother and safer whole-body coordination. Without it, the WBC becomes more reactive, leading to higher errors and lower survival.
- Task-frame Tracking: The catastrophic failure of (-) Task-space is the most significant finding from the ablations. It unequivocally shows that training the WBC to track end-effector trajectories relative to a task frame (independent of the robot's base motion) is essential for stable, precise manipulation on a dynamic quadruped. Traditional body-frame tracking cannot compensate for the robot's inherent base movements during manipulation.
- UMI Trajectories for WBC Training: While less severe than the task-frame ablation, training the WBC on the simpler (-) UMI Traj trajectories still leads to a significant increase in orientation error. This implies that real-world-like UMI trajectories provide richer, more diverse motion patterns that help the WBC learn more precise orientation control, which is vital for tasks like cup rearrangement.
6.4. Failure Analysis
- Tossing Task: Misses were often attributed to imprecise manipulation policies rather than the WBC. When provided with oracle trajectories (calibrated motion capture replay), the success rate improved, suggesting that scaling up tossing data collection for the manipulation policy could further close this gap.
- Pushing Task: The most common failure was the kettlebell toppling sideways. While more human demonstrations could teach the robot to reorient it, the arm itself was too weak to pick the kettlebell up once it fell. Hardware limitations (overheating) were also noted, suggesting that training with simulated forces [10] could lead to more elegant and energy-efficient pushing behaviors.
- Grasping for Tossing (Appendix): Initial attempts at diffusion policies for grasping and tossing suffered from distributional drift caused by small shakes from the controller during the grasping phase. More data is hypothesized to resolve this.
- System Reliability (Appendix): The system's general reliability was not high enough for fully-untethered deployment.
  - Overheating: The calf joints of the Unitree Go2 frequently overheated (within 10-30 minutes) because their linkage design requires higher torque, leading to emergency shutdowns. Torque regularization during training helped extend operation time.
  - Power Fluctuation: The battery voltage varied with charge level, affecting arm performance (too high when full, dampening behavior when low), requiring voltage adaptors.
- Velocity Integration for Odometry (Appendix): iPhone ARKit pose estimates sometimes drifted heavily, especially during dynamic actions like tossing. The 140ms latency from movement to pose update introduced a significant Sim2Real gap, leading to low-frequency oscillations. Integrating base velocity estimates from proprioceptive sensors to project the iPhone pose forward did not significantly improve performance, indicating that the latency and shaking require more robust mitigation.
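The velocity-integration fix the authors attempted can be sketched as a constant-velocity forward projection of the delayed pose (a simplification; a real implementation would also handle orientation and timestamp alignment):

```python
import numpy as np

def project_pose_forward(p_latest, v_base, latency_s=0.14):
    """Project a delayed position estimate forward by the sensing latency,
    assuming the base keeps its current estimated velocity.

    This mirrors the attempted fix: the iPhone pose arrives ~140ms late,
    so the latest inertial-legged velocity estimate is used to
    extrapolate where the base is *now*. The paper reports this did not
    help much, since shaking and residual latency remained.
    """
    return p_latest + v_base * latency_s

p = np.array([1.0, 0.0, 0.3])   # delayed iPhone position estimate (m)
v = np.array([0.5, 0.0, 0.0])   # proprioceptive base velocity (m/s)
print(project_pose_forward(p, v))  # roughly [1.07, 0.0, 0.3]
```

The weakness is visible in the model itself: a constant-velocity assumption is worst exactly during the high-acceleration phases (tossing) where latency hurts most, which is consistent with the reported lack of improvement.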
6.5. Things That Did Not Work (Appendix)
- Privileged Policy Distillation and Observation History: Attempts to include privileged information (e.g., ground truth poses, base velocities, kp/kd gains) to train a privileged policy and distill it did not boost performance and introduced instability, especially with observation histories longer than 1 step. This was attributed to imprecise Python ROS 2 timers causing out-of-distribution observation histories.
- Precise Grasping for Tossing: As mentioned in the failure analysis, training diffusion policies for precise grasping was challenging because controller shakes caused distributional drift.
- Velocity Integration for Odometry: As mentioned in the failure analysis, integrating inertial-legged velocity estimates to compensate for iPhone ARKit latency did not significantly improve performance due to the remaining shaking behavior and latency issues.
7. Conclusion & Reflections
7.1. Conclusion Summary
UMI-on-Legs presents a robust and scalable framework for mobile manipulation on quadruped robots. By strategically combining real-world human demonstrations (via UMI) for task-centric data and simulation-based reinforcement learning for robot-centric whole-body control, the framework overcomes common challenges in robot learning. The key innovation lies in its interface: end-effector trajectories in the task-frame, which enables zero-shot cross-embodiment deployment of manipulation policies and allows the whole-body controller to effectively coordinate the entire robot to perform complex, dynamic, and precise manipulation tasks while compensating for base movements. Demonstrated on tossing, pushing, and cup rearrangement tasks, UMI-on-Legs achieves high success rates across diverse challenges and offers an accessible real-time odometry solution via an iPhone.
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- Gripper-based Manipulation Only: The current system is limited to gripper-based manipulation, inherited from the UMI framework. Future work could extend the interface to support whole-body manipulation (e.g., using legs for manipulation), inspired by emerging whole-body demonstration devices and quadruped manipulation research [31, 54, 55].
- Embodiment-Specific Constraints: While the interface is embodiment-agnostic, incorporating embodiment-specific constraints (e.g., kinematic limits, torque limits) back into the high-level manipulation policy is an important next step. This would let the policy generate trajectories that are more feasible for a given robot.
- Complete Mobile Manipulation Platform: The current system lacks capabilities essential for a complete mobile manipulation platform, such as collision avoidance [56] and force feedback/control [10]. Integrating these features would enhance safety and interaction capabilities.
- Hardware Robustness: As noted in the appendix, the system faces hardware reliability issues like motor overheating and power fluctuations during prolonged dynamic tasks. Addressing these through hardware improvements or more energy-efficient control policies is crucial.
- Odometry Latency: Despite using an iPhone for odometry, a significant latency (140ms) was observed, leading to oscillations and a Sim2Real gap. Better low-latency odometry or latency compensation in the WBC is needed.
- Manipulation Policy Precision: For highly dynamic tasks like tossing, manipulation policy precision was identified as a limiting factor, suggesting that more demonstration data could improve performance.
7.3. Personal Insights & Critique
UMI-on-Legs represents a significant step forward in making mobile manipulation on dynamic quadrupeds more accessible and scalable. The core insight of decoupling task-centric demonstration from robot-centric control via an embodiment-agnostic trajectory interface is powerful. This modularity allows researchers to improve manipulation policies and whole-body controllers independently, knowing they can be effectively integrated.
Strengths:
- Novel Interface: The task-frame end-effector trajectory interface is a crucial innovation. It simplifies human demonstration, enhances WBC performance through preview, and enables true cross-embodiment transfer, a persistent challenge in robotics.
- Scalability: The framework's ability to leverage UMI for cheap real-world data and massively parallel simulation for WBC training offers a highly scalable path to developing complex skills.
- Accessibility: Using an iPhone for odometry drastically lowers the barrier to entry for mobile manipulation research, moving away from expensive motion capture systems.
- Rigorous Ablations: The comprehensive ablation studies clearly validate the necessity of each component (e.g., task-frame tracking, preview horizon), strengthening the paper's conclusions.
- Diversity of Tasks: Demonstrating success on prehensile, non-prehensile, and highly dynamic tasks (tossing) showcases the versatility and robustness of the framework.
Critique and Open Questions:
- Dependency on UMI: While a strength for data collection, the reliance on UMI (a specific gripper-based demonstration device) might limit immediate applicability to whole-body manipulation or diverse end-effector types. Future iterations might explore more general demonstration interfaces.
- Real-time Odometry Trade-offs: While accessible, iPhone ARKit odometry with its notable latency (140ms) might pose fundamental limits for extremely fast or precise dynamic tasks. Further research into fusing VIO with proprioceptive sensing, or deploying onboard visual odometry solutions (like Lidar-Inertial Odometry) on the Jetson, could offer lower-latency, more robust alternatives.
- Sim2Real Gap for Policy: While the WBC uses domain randomization, the manipulation policy is trained on real-world UMI data. The zero-shot transfer of a pre-trained policy to a different robot embodiment still involves an implicit Sim2Real gap (or Real2Real embodiment gap) for the policy itself, given subtle kinematic and dynamic differences. How does the WBC compensate for trajectories that might be kinematically challenging or energetically expensive for the target robot, even if valid for the source robot? The WBC dynamically tilting and leaning is a good start, but more explicit feedback or embodiment awareness in the manipulation policy could be beneficial.
- Generality of Reward Function Weights: The reward terms for the WBC involve many manually tuned weights (Table 5) and sigma curriculum schedules. While effective, this tuning process can be tedious. Investigating meta-learning approaches or automated reward design could reduce this manual effort for new robots or tasks.
- Cost vs. Performance: The system's cost (~$24k) is presented as relatively low, which is a significant advantage. However, the hardware limitations (overheating, battery issues) show that lower-cost hardware still presents practical challenges for continuous, high-performance operation, highlighting the ongoing trade-off between accessibility and industrial-grade reliability.

Overall, UMI-on-Legs provides a compelling blueprint for future mobile manipulation systems. Its modular design, task-frame interface, and accessible odometry pave the way for more widespread adoption and development of complex robot skills, particularly on dynamic legged platforms. The framework's ability to bridge human intuition with robot capabilities is a testament to its forward-thinking design.