Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
TL;DR Summary
UMI enables direct robot skill transfer from in-the-wild human demonstrations via a portable, low-cost hand-held gripper and a hardware-agnostic policy interface. The system facilitates learning generalizable robot policies for complex, dynamic tasks, showing zero-shot generalization to novel environments and objects.
Abstract
We present Universal Manipulation Interface (UMI) -- a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.
In-depth Reading
1. Bibliographic Information
- Title: Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
- Authors: Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, Shuran Song
- Affiliations: Stanford University, Columbia University, Toyota Research Institute (TRI)
- Journal/Conference: This paper is a preprint available on arXiv. It has not yet undergone formal peer review for a specific conference or journal at the time of this analysis.
- Publication Year: 2024 (First submitted February 2024)
- Abstract: The paper introduces the Universal Manipulation Interface (UMI), a complete framework for collecting data from human demonstrations and learning robot policies that can be deployed in the real world. The core innovation is a low-cost, portable, hand-held gripper that allows people to demonstrate complex, bimanual, and dynamic tasks anywhere ("in-the-wild") without needing a robot present during data collection. The framework includes a carefully designed policy learning interface that handles issues like system latency and makes the learned skills transferable across different robot platforms. The authors demonstrate that policies trained with UMI can perform a variety of challenging tasks (dynamic, bimanual, precise, long-horizon) and can generalize to new objects and environments in a zero-shot manner (i.e., without any additional training). The entire system is open-sourced.
- Original Source Link:
- arXiv page: https://arxiv.org/abs/2402.10329
- PDF link: http://arxiv.org/pdf/2402.10329
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Teaching robots complex manipulation skills is a major challenge in robotics. Existing methods are flawed. Teleoperation, where a human controls a robot directly, is expensive, requires expert operators, and is tied to a specific lab setup, limiting data diversity. Learning from in-the-wild human videos (e.g., from YouTube) suffers from a large "embodiment gap"—human hands and perspectives are very different from a robot's, making it hard to transfer skills.
- Gaps in Prior Work: Previous attempts using hand-held grippers for data collection offered a promising middle ground but were limited. They struggled to capture precise and dynamic actions, often due to insufficient visual information (narrow camera view, lack of depth), inaccurate action tracking (e.g., using methods prone to scale ambiguity), and a failure to account for real-world system latencies during deployment. This restricted them to simple, slow "pick-and-place" tasks.
- Fresh Angle/Innovation: UMI tackles these issues head-on with a holistic framework that combines a cleverly designed demonstration interface (the physical gripper) and a robust policy interface (the software and learning representations). The key idea is to collect rich, high-quality data in a robot-agnostic way and then carefully bridge the gap to real-world robot execution, enabling direct skill transfer for a much wider and more complex range of tasks than previously possible.
-
Main Contributions / Findings (What):
- 1. A Novel Hand-Held Demonstration Interface: A portable, low-cost (BoM of ~$371) 3D-printed gripper equipped with a GoPro camera, a wide-angle fisheye lens, side mirrors (for implicit stereo vision), and an IMU. This design captures rich visual context and enables precise, robust tracking of dynamic actions.
- 2. A Hardware-Agnostic Policy Learning Interface: A set of techniques to ensure skills learned from human demonstrations can be deployed on different robots. This includes:
  - Inference-time latency matching to compensate for delays in sensors and robot actuation.
  - A relative-trajectory action representation that makes the policy robust to tracking errors and independent of a global coordinate system.
- 3. Demonstration of New Robot Capabilities: Policies trained with UMI successfully perform tasks previously difficult to teach via imitation learning, including:
  - Dynamic: Tossing objects into bins.
  - Bimanual: Folding a sweater with two arms.
  - Precise: Arranging a cup and saucer.
  - Long-Horizon: A multi-step dishwashing task.
- 4. Zero-Shot Generalization: By collecting data across many diverse environments ("in-the-wild"), the trained cup-arrangement policy achieves a 71.7% success rate in completely new environments with unseen cups, a level of generalization rarely seen in behavior cloning.
- 5. Open-Sourced System: The complete hardware design and software pipeline are made publicly available, aiming to democratize robot data collection.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Imitation Learning (IL): A machine learning paradigm in which an agent (e.g., a robot) learns to perform a task by observing demonstrations from an expert (e.g., a human).
  - Behavior Cloning (BC): The simplest form of IL, where the agent learns a direct mapping (a policy) from observations to actions, effectively "cloning" the expert's behavior. The UMI framework primarily uses BC.
  - Teleoperation: The process of remotely controlling a robot. It is a common way to collect demonstration data for IL but, as noted, can be cumbersome and restrictive.
  - Embodiment Gap: The difference in physical form (kinematics, dynamics, sensors) between the demonstrator (e.g., a human) and the learner (e.g., a robot). A large embodiment gap makes it difficult to transfer skills; UMI is designed to minimize this gap.
  - SLAM (Simultaneous Localization and Mapping): The problem of building a map of an unknown environment while simultaneously tracking one's own location within it. UMI uses a visual-inertial SLAM system to track the 6DoF pose of the hand-held gripper.
  - 6DoF (Six Degrees of Freedom): The ability of a rigid body to move in 3D space, consisting of three translations (forward/back, up/down, left/right) and three rotations (pitch, yaw, roll). Accurately tracking the 6DoF pose of the gripper is essential for capturing manipulation actions.
- Previous Works & Differentiation: The paper positions UMI against three main lines of work:
  - 1. Teleoperated Robot Data: Systems like ALOHA use a leader-follower setup for intuitive teleoperation.
    - Limitation: They require a physical robot to be present during data collection, which restricts collection to the lab and makes it difficult to gather diverse "in-the-wild" data.
    - UMI's Advantage: UMI decouples data collection from the robot. A person can take the UMI gripper anywhere to record demonstrations, which are then used to train a policy for any compatible robot later.
  - 2. Visual Demonstrations from Human Video: This approach uses existing videos (e.g., from YouTube) to learn robot skills.
    - Limitation: It suffers from a large observation and action embodiment gap. Precise actions are hard to infer from videos of human hands, and the camera perspective differs from what a robot would see.
    - UMI's Advantage: UMI uses a gripper that mimics a robot's end-effector and a wrist-mounted camera that provides a robot-like perspective. This minimizes both the action and observation embodiment gaps, enabling much more direct and effective skill transfer.
  - 3. Hand-Held Grippers for Quasi-static Actions: Previous work, such as Song et al. (2020) and Pinto et al. (2021), used hand-held devices for data collection.
    - Limitation: These systems often struggled to robustly and accurately track the gripper's 6DoF pose, especially during fast movements, limiting them to slow, quasi-static tasks like grasping or pick-and-place.
    - UMI's Advantage: UMI integrates a state-of-the-art visual-inertial SLAM system with IMU data from the GoPro, providing robust, scale-accurate 6DoF tracking even during fast, dynamic motions. Combined with latency matching, this unlocks the ability to teach dynamic tasks like tossing.
4. Methodology (Core Technology & Implementation)
The UMI framework is composed of two tightly integrated components: the Demonstration Interface (hardware) and the Policy Interface (software).
A. Demonstration Interface Design
The goal is to create a portable, low-cost device that captures all the information needed for learning complex manipulation skills. The hardware is a hand-held, 3D-printed parallel-jaw gripper. The paper's system-overview figure (Image 8) illustrates the core components, from the human demonstration setup on the left, to the observation space in the middle, to the final robot setup on the right. The key design decisions (HD1-HD6) are:
-
HD1. Wrist-mounted cameras as input observation: The only sensor is a GoPro camera mounted on the gripper, mimicking a robot's wrist camera. This design minimizes the observation gap between demonstration and deployment, makes the system portable, and naturally diversifies the data through camera motion.
-
HD2. Fisheye lens for visual context: A 155° fisheye lens captures a wide field of view, providing more visual context for the policy and improving the robustness of SLAM tracking. The raw fisheye image is used directly as input, since rectifying it would heavily distort the image.
-
HD3. Side mirrors for implicit stereo: To provide depth information without a heavy RGB-D camera, two physical mirrors are placed in the camera's peripheral view. These mirrors create virtual camera views, providing implicit stereo information from a single image. The authors found that digitally reflecting the mirror content before feeding it to the policy yielded the best results (see the sketch after this list).
-
HD4. IMU-aware tracking: The GoPro's built-in IMU (Inertial Measurement Unit) data is synchronized with the video and fed into an inertial-monocular SLAM system (ORB-SLAM3). This allows robust tracking even under motion blur and recovers the true metric scale of movements, which is critical for precise actions and bimanual coordination.
-
HD5. Continuous gripper control: Instead of a simple open/close action, the gripper's width is continuously tracked using fiducial markers (ArUco markers) on the fingers. This allows for more nuanced interactions, such as precise timing for releasing an object during a toss. The soft fingers also allow for implicit force control.
-
HD6. Kinematic-based data filtering: Although data is collected in a robot-agnostic way, a post-processing step filters out trajectories that would be impossible for a specific target robot (e.g., out of reach, violates joint limits).
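To make HD3's "digital reflection" concrete, here is a minimal sketch, not the released UMI pipeline: the mirror regions of the wrist-camera frame are horizontally flipped in place so their content reads as upright virtual camera views. The crop boxes, function name, and the GoPro-like resolution are hypothetical placeholders; the real mirror locations depend on the gripper and camera geometry.

```python
"""Illustrative sketch of digitally reflecting the side-mirror regions (HD3).

Crop boxes below are hypothetical; the actual mirror positions in the fisheye
frame depend on the gripper hardware.
"""
import numpy as np

# Hypothetical pixel boxes (y0, y1, x0, x1) where the two mirrors appear.
LEFT_MIRROR = (300, 780, 0, 220)
RIGHT_MIRROR = (300, 780, 2500, 2704)


def reflect_mirror_regions(frame: np.ndarray) -> np.ndarray:
    """Return a copy of the frame with both mirror crops flipped horizontally,
    undoing the left-right flip introduced by the physical mirrors."""
    out = frame.copy()
    for y0, y1, x0, x1 in (LEFT_MIRROR, RIGHT_MIRROR):
        out[y0:y1, x0:x1] = out[y0:y1, x0:x1][:, ::-1]
    return out


if __name__ == "__main__":
    fake_frame = np.zeros((2028, 2704, 3), dtype=np.uint8)  # GoPro-like resolution
    _ = reflect_mirror_regions(fake_frame)
```

The appeal of this design is that the policy receives implicit stereo cues from a single RGB stream, avoiding the weight and latency of an RGB-D camera on the hand-held gripper.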
B. Policy Interface Design
The goal is to create a policy learning pipeline that is agnostic to the specific robot hardware, allowing skills to be transferred easily. The authors use Diffusion Policy as their learning algorithm, but the interface design is general.
-
PD1. Inference-time latency matching: This is a crucial step for real-world deployment, especially for dynamic tasks.
- Observation Latency: Different sensors (camera, robot joint encoders) have different delays. At inference time, all observation streams are synchronized to the timestamp of the camera image (which is usually the slowest).
- Action Latency: There is a delay between when a command is sent and when the robot/gripper actually executes it. The system measures this execution latency for each component and sends commands ahead of time to compensate, ensuring the robot reaches the desired pose at the desired time.
-
PD2. Relative end-effector pose: To avoid being tied to a specific robot base location or a global coordinate frame, all poses are represented relatively.
- Relative EE trajectory as action: The policy outputs a sequence of future end-effector (EE) poses, all defined relative to the current EE pose at the start of the prediction. This is more robust to tracking drift and calibration errors than predicting absolute poses or accumulating errors with delta poses (see the sketch after this list).
- Relative EE trajectory as proprioception: Historical EE poses are also fed to the policy as a relative trajectory, providing implicit velocity information.
- Relative inter-gripper proprioception: For bimanual tasks, the relative pose between the two grippers is given to the policy as an additional input. This was found to be critical for tasks requiring tight coordination between the two arms.
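The following is a minimal sketch of PD1 and PD2, not the authors' implementation. It assumes poses are 4x4 homogeneous transforms; the function names (relative_to_absolute, absolute_to_relative, schedule_commands) and the example numbers are illustrative choices.

```python
"""Sketch of the relative-trajectory action representation (PD2) and
latency-compensated command scheduling (PD1), assuming 4x4 homogeneous poses."""
import numpy as np


def absolute_to_relative(T_ee_current: np.ndarray, future_traj: np.ndarray) -> np.ndarray:
    """Training-time direction: express future gripper poses relative to the
    pose at the start of the prediction horizon, T_rel_i = inv(T_now) @ T_i."""
    return np.einsum('ij,njk->nik', np.linalg.inv(T_ee_current), future_traj)


def relative_to_absolute(T_ee_current: np.ndarray, relative_traj: np.ndarray) -> np.ndarray:
    """Deployment-time direction: turn the policy's relative trajectory into
    absolute targets in the robot base frame, T_cmd_i = T_now @ T_rel_i."""
    return np.einsum('ij,njk->nik', T_ee_current, relative_traj)


def schedule_commands(target_times: np.ndarray, actuation_latency: float) -> np.ndarray:
    """Latency matching for actions: send each command earlier by the measured
    execution latency so the robot reaches the pose at the desired time."""
    return target_times - actuation_latency


if __name__ == "__main__":
    T_now = np.eye(4)                    # current EE pose (identity for the demo)
    rel = np.tile(np.eye(4), (3, 1, 1))  # a trivial 3-step relative trajectory
    rel[:, 0, 3] = [0.01, 0.02, 0.03]    # move 1-3 cm along x
    targets = relative_to_absolute(T_now, rel)
    send_times = schedule_commands(np.array([0.1, 0.2, 0.3]), actuation_latency=0.08)
    print(targets.shape, send_times)     # (3, 4, 4) [0.02 0.12 0.22]
```

Because the trajectory is anchored to the current EE pose rather than a world frame, any slowly varying SLAM drift or base-calibration offset cancels out between the observation and the action.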
5. Experimental Setup
The effectiveness of UMI is evaluated through a series of real-world experiments.
-
Tasks:
- Cup Arrangement (Precise): Place a cup on a saucer with the handle facing a specific direction.
- Dynamic Tossing (Dynamic): Sort objects by tossing them into the correct bins, which are placed beyond the robot's reach.
- Bimanual Cloth Folding (Bimanual): Use two robot arms to fold a sweater.
- Dish Washing (Long-Horizon): A 7-step task involving turning on a faucet, washing a plate with a sponge, and placing it back.
- In-the-Wild Cup Arrangement (Generalization): The cup arrangement task is scaled up with data collected in 30 diverse locations using 15 different cups. The policy is then tested in 2 completely new environments with both seen and unseen cups.
-
Datasets: Data is collected by 1-3 non-expert demonstrators. The number of demonstration episodes ranges from 250 to 305 for the narrow-domain tasks and 1400 for the in-the-wild task.
-
Evaluation Metrics: The primary metric is task success rate, which is manually judged by a human operator based on pre-defined criteria for each task. Experiments are repeated multiple times (typically 20 trials) with randomized initial object and robot states.
-
Baselines: The authors perform extensive ablation studies to validate their design choices. Baselines include:
- No Fisheye lens (using a standard rectilinear view).
- Alternative action spaces (absolute pose, delta pose).
- Without side mirrors / without digital reflection of mirrors.
- No latency matching.
- No inter-gripper proprioception for the bimanual task.
- Using a smaller vision encoder (ResNet-34 vs. CLIP-pretrained ViT).
- Training only on narrow-domain data for the generalization task.
6. Results & Analysis
Core Results (Capability Experiments)
- Cup Arrangement: The full UMI system achieves a 100% (20/20) success rate.
- Ablations show significant performance drops: using a standard FoV lens drops the success rate to 55%, and using an absolute action space drops it to a mere 25% due to calibration sensitivity. Using digitally reflected side mirrors improves performance from 90% (no mirrors) to 100%.
- The same policy also achieved a 90% success rate on a different robot (Franka Emika FR2), demonstrating cross-embodiment transferability.
- Dynamic Tossing: UMI achieves an 87.5% success rate.
- The most critical factor here is latency matching. Disabling latency matching cuts the success rate to 57.5%, as the robot's movements become jittery and the gripper release timing is off.
- Bimanual Cloth Folding: UMI achieves a 70% success rate.
- The key to bimanual coordination is relative inter-gripper proprioception. Without this information, the success rate plummets to 30% because the arms fail to synchronize their actions, such as grasping the hem of the sweater simultaneously.
- Dish Washing: UMI achieves a 70% success rate on this complex, long-horizon task.
-
This task highlights the need for a powerful vision model. Using a large, pre-trained vision encoder (CLIP-ViT-B/16) was essential: a smaller ResNet-34 trained from scratch completely failed (0% success), as it could not learn a reactive behavior. The final policy was also robust to perturbations, such as adding more sauce mid-task (see the sketch after this list).
-
The paper's robustness figure (Image 3) shows the policy handling various changes, including base movement (a), novel objects (b), different lighting (c), and perturbations during the task (d).
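As a concrete picture of the encoder ablation above, here is a minimal sketch, not the paper's training code, contrasting a CLIP-pretrained ViT-B/16 backbone with a from-scratch ResNet-34. It assumes the standard Hugging Face checkpoint "openai/clip-vit-base-patch16" and the torchvision resnet34 constructor; the policy head and training loop are omitted.

```python
"""Sketch of the two visual backbones compared in the dish-washing ablation:
a CLIP-pretrained ViT-B/16 versus a ResNet-34 trained from scratch."""
import torch
from torchvision.models import resnet34
from transformers import CLIPVisionModel

# CLIP-pretrained ViT-B/16 (frozen or fine-tuned in practice).
clip_vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")

# ResNet-34 with random weights, i.e. trained from scratch.
scratch_resnet = resnet34(weights=None)

images = torch.randn(2, 3, 224, 224)  # dummy batch of wrist-camera crops

with torch.no_grad():
    vit_feat = clip_vit(pixel_values=images).pooler_output  # (2, 768) features
    res_out = scratch_resnet(images)                        # (2, 1000) logits here

print(vit_feat.shape, res_out.shape)
```

The takeaway reported by the authors is that large-scale visual pre-training, not just encoder size, is what enables the reactive, long-horizon behavior.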
In-the-Wild Generalization Experiments
This is one of the most significant results of the paper.
- The UMI policy, trained on 1400 demonstrations from 30 diverse environments, achieved a combined 71.7% success rate in two completely unseen environments (a busy outdoor cafe and a water fountain) with unseen cups.
- In contrast, a baseline policy trained only on the 305 demonstrations from the lab environment (but with the same powerful ViT-L vision backbone) achieved a 0% success rate. It failed to even attempt to move towards the cup.
- Takeaway: This strongly demonstrates that the generalization capability comes from the diversity of the in-the-wild data, not just from using a large pre-trained model. UMI's portability is the key enabler for collecting such data.
Data Collection Throughput and Accuracy
-
Throughput: UMI is significantly faster than traditional teleoperation. For the cup arrangement task, data collection with UMI was over 3x faster than with a spacemouse. While not as fast as a bare human hand, it strikes a practical balance between speed and collecting robot-compatible data. Teleoperation failed completely on the dynamic tossing task.

-
Accuracy: A benchmark against a Motion Capture (MoCap) system shows the SLAM-based tracking is highly accurate.
-
Absolute Trajectory Error (ATE): 6.1 mm position, 3.5° rotation.
-
Relative Pose Error (RPE) between grippers: 10.1 mm position, 0.8° rotation. This level of accuracy is sufficient for learning precise manipulation skills.
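For reference, here is a generic sketch of how these two tracking metrics are typically computed; it is not the authors' benchmarking script. It assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in a common frame, and it aggregates with RMSE, which is a common convention rather than a detail confirmed by the paper.

```python
"""Generic ATE and inter-gripper RPE computations over (N, 4, 4) pose arrays,
assuming time-aligned trajectories in a shared frame."""
import numpy as np


def position_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """ATE (position part): RMSE of per-frame translation error, in meters."""
    err = est[:, :3, 3] - gt[:, :3, 3]
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))


def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> np.ndarray:
    """Per-frame geodesic rotation error in degrees (the rotational counterpart
    of the position errors above), for (N, 3, 3) rotation matrices."""
    R_delta = np.einsum('nij,nkj->nik', R_est, R_gt)  # R_est @ R_gt^T
    trace = np.clip((np.trace(R_delta, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(trace))


def inter_gripper_rpe(left: np.ndarray, right: np.ndarray,
                      left_gt: np.ndarray, right_gt: np.ndarray) -> float:
    """RPE between the two grippers: error of the estimated left-to-right
    relative pose versus the ground-truth relative pose (position part)."""
    rel_est = np.einsum('nij,njk->nik', np.linalg.inv(left), right)
    rel_gt = np.einsum('nij,njk->nik', np.linalg.inv(left_gt), right_gt)
    return position_rmse(rel_est, rel_gt)
```

The inter-gripper RPE matters most for bimanual tasks, since the two arms are coordinated through their relative pose rather than through a shared global frame.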

7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully presents UMI, a comprehensive and practical framework for teaching robots complex skills from in-the-wild human demonstrations. By combining a thoughtfully designed hand-held data collection device with a robust, hardware-agnostic policy interface, UMI overcomes key limitations of prior methods. It enables the learning of dynamic, bimanual, and long-horizon tasks and, most importantly, facilitates zero-shot generalization to novel scenes and objects by making large-scale, diverse data collection feasible for anyone. The open-sourcing of the project aims to foster a community-driven effort to create large, decentralized robotics datasets.
-
Limitations & Future Work: The authors acknowledge several limitations:
- Kinematic Mismatches: The framework currently relies on filtering out demonstrations that are kinematically impossible for the target robot. Future work could explore methods to learn from these "infeasible" demonstrations, adapting the skill to the robot's capabilities.
- Reliance on Visual Texture: The SLAM system requires sufficient texture in the environment to function reliably. It may struggle in visually sparse environments (e.g., rooms with plain white walls).
- Data Collection Efficiency: While faster than teleoperation, collecting data with the UMI gripper is still slower and more cumbersome than demonstrating with bare hands. Future improvements could focus on lighter, more ergonomic hardware designs.
-
Personal Insights & Critique:
- Holistic System Design: The strength of this paper lies not in a single algorithmic novelty but in the meticulous, end-to-end design of a complete system. Every component—from the physical mirrors to the relative action representation—is thoughtfully chosen to solve a specific, practical problem in transferring human skills to robots.
- Practicality and Accessibility: By using a low-cost, off-the-shelf camera (GoPro) and 3D-printed parts, the authors have created a system that is highly accessible. This, combined with the open-source release, genuinely has the potential to "democratize" robot data collection as the authors hope.
- Generalization is Key: The in-the-wild generalization result is the paper's crowning achievement. It provides strong evidence that diverse, large-scale datasets are a critical component for building truly generalist robots, and UMI provides a scalable way to create such datasets.
- Critique: The reliance on manual, subjective evaluation of task success is a common issue in manipulation research but remains a weakness. While necessary for such complex tasks, it makes reproducibility and exact comparison challenging. Furthermore, while the system is robot-agnostic, the policy is still tied to a parallel-jaw gripper end-effector, limiting the scope of tasks to non-dexterous manipulation. Extending this framework to dexterous hands would be a significant and valuable next step.