Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
TL;DR Summary
UMI enables direct robot skill transfer from in-the-wild human demonstrations via a portable, low-cost hand-held gripper and a hardware-agnostic policy interface. The system facilitates learning generalizable robot policies for complex, dynamic tasks, showing zero-shot generalization to novel environments and objects.
Abstract
We present Universal Manipulation Interface (UMI) -- a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.
In-depth Reading
1. Bibliographic Information
- Title: Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
- Authors: Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, Shuran Song
- Affiliations: Stanford University, Columbia University, Toyota Research Institute (TRI)
- Journal/Conference: This paper is a preprint available on arXiv. It has not yet undergone formal peer review for a specific conference or journal at the time of this analysis.
- Publication Year: 2024 (First submitted February 2024)
- Abstract: The paper introduces the Universal Manipulation Interface (UMI), a complete framework for collecting data from human demonstrations and learning robot policies that can be deployed in the real world. The core innovation is a low-cost, portable, hand-held gripper that allows people to demonstrate complex, bimanual, and dynamic tasks anywhere ("in-the-wild") without needing a robot present during data collection. The framework includes a carefully designed policy learning interface that handles issues like system latency and makes the learned skills transferable across different robot platforms. The authors demonstrate that policies trained with UMI can perform a variety of challenging tasks (dynamic, bimanual, precise, long-horizon) and can generalize to new objects and environments in a zero-shot manner (i.e., without any additional training). The entire system is open-sourced.
- Original Source Link:
- arXiv page: https://arxiv.org/abs/2402.10329
- PDF link: http://arxiv.org/pdf/2402.10329
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Teaching robots complex manipulation skills is a major challenge in robotics. Existing methods are flawed. Teleoperation, where a human controls a robot directly, is expensive, requires expert operators, and is tied to a specific lab setup, limiting data diversity. Learning from in-the-wild human videos (e.g., from YouTube) suffers from a large "embodiment gap"—human hands and perspectives are very different from a robot's, making it hard to transfer skills.
- Gaps in Prior Work: Previous attempts using hand-held grippers for data collection offered a promising middle ground but were limited. They struggled to capture precise and dynamic actions, often due to insufficient visual information (narrow camera view, lack of depth), inaccurate action tracking (e.g., using methods prone to scale ambiguity), and a failure to account for real-world system latencies during deployment. This restricted them to simple, slow "pick-and-place" tasks.
- Fresh Angle/Innovation: UMI tackles these issues head-on with a holistic framework that combines a cleverly designed demonstration interface (the physical gripper) and a robust policy interface (the software and learning representations). The key idea is to collect rich, high-quality data in a robot-agnostic way and then carefully bridge the gap to real-world robot execution, enabling direct skill transfer for a much wider and more complex range of tasks than previously possible.
-
Main Contributions / Findings (What):
- 1. A Novel Hand-Held Demonstration Interface: A portable, low-cost (BoM of ~$371) 3D-printed gripper equipped with a GoPro camera, a wide-angle fisheye lens, side mirrors (for implicit stereo vision), and an IMU. This design captures rich visual context and enables precise, robust tracking of dynamic actions.
- 2. A Hardware-Agnostic Policy Learning Interface: A set of techniques to ensure skills learned from human demonstrations can be deployed on different robots. This includes:
  - Inference-time latency matching to compensate for delays in sensors and robot actuation.
  - A relative-trajectory action representation that makes the policy robust to tracking errors and independent of a global coordinate system.
- 3. Demonstration of New Robot Capabilities: Policies trained with UMI successfully perform tasks previously difficult to teach via imitation learning, including:
  - Dynamic: Tossing objects into bins.
  - Bimanual: Folding a sweater with two arms.
  - Precise: Arranging a cup and saucer.
  - Long-Horizon: A multi-step dishwashing task.
- 4. Zero-Shot Generalization: By collecting data across many diverse environments ("in-the-wild"), the trained cup-arrangement policy achieves a 71.7% success rate in completely new environments with unseen cups, a level of generalization rarely seen in behavior cloning.
- 5. Open-Sourced System: The complete hardware design and software pipeline are made publicly available, aiming to democratize robot data collection.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Imitation Learning (IL): A machine learning paradigm in which an agent (e.g., a robot) learns to perform a task by observing demonstrations from an expert (e.g., a human).
  - Behavior Cloning (BC): The simplest form of IL, where the agent learns a direct mapping (a policy) from observations to actions, effectively "cloning" the expert's behavior. The UMI framework primarily uses BC.
  - Teleoperation: The process of remotely controlling a robot. It is a common way to collect demonstration data for IL but, as noted, can be cumbersome and restrictive.
  - Embodiment Gap: The difference in physical form (kinematics, dynamics, sensors) between the demonstrator (e.g., a human) and the learner (e.g., a robot). A large embodiment gap makes it difficult to transfer skills; UMI is designed to minimize this gap.
  - SLAM (Simultaneous Localization and Mapping): The problem of building a map of an unknown environment while simultaneously tracking one's own location within it. UMI uses a visual-inertial SLAM system to track the 6DoF pose of the hand-held gripper.
  - 6DoF (Six Degrees of Freedom): The ability of a rigid body to move in 3D space, consisting of three translations (forward/back, up/down, left/right) and three rotations (pitch, yaw, roll). Accurately tracking the 6DoF pose of the gripper is essential for capturing manipulation actions.
- Previous Works & Differentiation: The paper positions UMI against three main lines of work:
  - 1. Teleoperated Robot Data: Systems like ALOHA use a leader-follower setup for intuitive teleoperation.
    - Limitation: They require a physical robot to be present during data collection, which restricts collection to the lab and makes it difficult to gather diverse "in-the-wild" data.
    - UMI's Advantage: UMI decouples data collection from the robot. A person can take the UMI gripper anywhere to record demonstrations, which are then used to train a policy for any compatible robot later.
  - 2. Visual Demonstrations from Human Video: This approach uses existing videos (e.g., from YouTube) to learn robot skills.
    - Limitation: It suffers from a large observation and action embodiment gap. Precise actions are hard to infer from videos of human hands, and the camera perspective differs from what a robot would see.
    - UMI's Advantage: UMI uses a gripper that mimics a robot's end-effector and a wrist-mounted camera that provides a robot-like perspective. This minimizes both the action and observation embodiment gaps, enabling much more direct and effective skill transfer.
  - 3. Hand-Held Grippers for Quasi-static Actions: Previous work, such as Song et al. (2020) and Pinto et al. (2021), used hand-held devices for data collection.
    - Limitation: These systems often struggled to robustly and accurately track the gripper's 6DoF pose, especially during fast movements, limiting them to slow, quasi-static tasks like grasping or pick-and-place.
    - UMI's Advantage: UMI integrates a state-of-the-art visual-inertial SLAM system with IMU data from the GoPro, providing robust, scale-accurate 6DoF tracking even during fast, dynamic motions. Combined with latency matching, this unlocks the ability to teach dynamic tasks like tossing.
4. Methodology (Core Technology & Implementation)
The UMI framework is composed of two tightly integrated components: the Demonstration Interface (hardware) and the Policy Interface (software).
A. Demonstration Interface Design
The goal is to create a portable, low-cost device that captures all the information needed for learning complex manipulation skills. The hardware is a hand-held, 3D-printed parallel-jaw gripper. The paper's system-overview figure (Image 8) illustrates the core components, from the human demonstration setup on the left, to the observation space in the middle, to the final robot setup on the right. The key design decisions (HD1-HD6) are:
-
HD1. Wrist-mounted cameras as input observation: The only sensor is a GoPro camera mounted on the gripper, mimicking a robot's wrist camera. This design minimizes the observation gap between demonstration and deployment, makes the system portable, and naturally diversifies the data through camera motion.
-
HD2. Fisheye lens for visual context: A 155° fisheye lens captures a wide field of view, providing more visual context for the policy and improving the robustness of SLAM tracking. The raw fisheye image is used directly as input, since rectifying it would heavily distort the image.
-
HD3. Side mirrors for implicit stereo: To provide depth information without a heavy RGB-D camera, two physical mirrors are placed in the camera's peripheral view. These mirrors create virtual camera views, providing implicit stereo information from a single image. The authors found that digitally reflecting the mirror content before feeding it to the policy yielded the best results (see the sketch after this list).
-
HD4. IMU-aware tracking: The GoPro's built-in IMU (Inertial Measurement Unit) data is synchronized with the video and fed into an inertial-monocular SLAM system (ORB-SLAM3). This allows robust tracking even under motion blur and recovers the true metric scale of movements, which is critical for precise actions and bimanual coordination.
-
HD5. Continuous gripper control: Instead of a simple open/close action, the gripper's width is continuously tracked using fiducial markers (ArUco markers) on the fingers. This allows for more nuanced interactions, such as precise timing for releasing an object during a toss. The soft fingers also allow for implicit force control.
-
HD6. Kinematic-based data filtering: Although data is collected in a robot-agnostic way, a post-processing step filters out trajectories that would be impossible for a specific target robot (e.g., out of reach, violates joint limits).
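To make HD3's "digital reflection" concrete, here is a minimal sketch, not the released UMI pipeline: the mirror regions of the wrist-camera frame are horizontally flipped in place so their content reads as upright virtual camera views. The crop boxes, function name, and the GoPro-like resolution are hypothetical placeholders; the real mirror locations depend on the gripper and camera geometry.

```python
"""Illustrative sketch of digitally reflecting the side-mirror regions (HD3).

Crop boxes below are hypothetical; the actual mirror positions in the fisheye
frame depend on the gripper hardware.
"""
import numpy as np

# Hypothetical pixel boxes (y0, y1, x0, x1) where the two mirrors appear.
LEFT_MIRROR = (300, 780, 0, 220)
RIGHT_MIRROR = (300, 780, 2500, 2704)


def reflect_mirror_regions(frame: np.ndarray) -> np.ndarray:
    """Return a copy of the frame with both mirror crops flipped horizontally,
    undoing the left-right flip introduced by the physical mirrors."""
    out = frame.copy()
    for y0, y1, x0, x1 in (LEFT_MIRROR, RIGHT_MIRROR):
        out[y0:y1, x0:x1] = out[y0:y1, x0:x1][:, ::-1]
    return out


if __name__ == "__main__":
    fake_frame = np.zeros((2028, 2704, 3), dtype=np.uint8)  # GoPro-like resolution
    _ = reflect_mirror_regions(fake_frame)
```

The appeal of this design is that the policy receives implicit stereo cues from a single RGB stream, avoiding the weight and latency of an RGB-D camera on the hand-held gripper.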
B. Policy Interface Design
The goal is to create a policy learning pipeline that is agnostic to the specific robot hardware, allowing skills to be transferred easily. The authors use Diffusion Policy as their learning algorithm, but the interface design is general.
-
PD1. Inference-time latency matching: This is a crucial step for real-world deployment, especially for dynamic tasks.
- Observation Latency: Different sensors (camera, robot joint encoders) have different delays. At inference time, all observation streams are synchronized to the timestamp of the camera image (which is usually the slowest).
- Action Latency: There is a delay between when a command is sent and when the robot/gripper actually executes it. The system measures this execution latency for each component and sends commands ahead of time to compensate, ensuring the robot reaches the desired pose at the desired time.
-
PD2. Relative end-effector pose: To avoid being tied to a specific robot base location or a global coordinate frame, all poses are represented relatively.
- Relative EE trajectory as action: The policy outputs a sequence of future end-effector (EE) poses, all defined relative to the current EE pose at the start of the prediction. This is more robust to tracking drift and calibration errors than predicting absolute poses or accumulating errors with delta poses (see the sketch after this list).
- Relative EE trajectory as proprioception: Historical EE poses are also fed to the policy as a relative trajectory, providing implicit velocity information.
- Relative inter-gripper proprioception: For bimanual tasks, the relative pose between the two grippers is given to the policy as an additional input. This was found to be critical for tasks requiring tight coordination between the two arms.
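The following is a minimal sketch of PD1 and PD2, not the authors' implementation. It assumes poses are 4x4 homogeneous transforms; the function names (relative_to_absolute, absolute_to_relative, schedule_commands) and the example numbers are illustrative choices.

```python
"""Sketch of the relative-trajectory action representation (PD2) and
latency-compensated command scheduling (PD1), assuming 4x4 homogeneous poses."""
import numpy as np


def absolute_to_relative(T_ee_current: np.ndarray, future_traj: np.ndarray) -> np.ndarray:
    """Training-time direction: express future gripper poses relative to the
    pose at the start of the prediction horizon, T_rel_i = inv(T_now) @ T_i."""
    return np.einsum('ij,njk->nik', np.linalg.inv(T_ee_current), future_traj)


def relative_to_absolute(T_ee_current: np.ndarray, relative_traj: np.ndarray) -> np.ndarray:
    """Deployment-time direction: turn the policy's relative trajectory into
    absolute targets in the robot base frame, T_cmd_i = T_now @ T_rel_i."""
    return np.einsum('ij,njk->nik', T_ee_current, relative_traj)


def schedule_commands(target_times: np.ndarray, actuation_latency: float) -> np.ndarray:
    """Latency matching for actions: send each command earlier by the measured
    execution latency so the robot reaches the pose at the desired time."""
    return target_times - actuation_latency


if __name__ == "__main__":
    T_now = np.eye(4)                    # current EE pose (identity for the demo)
    rel = np.tile(np.eye(4), (3, 1, 1))  # a trivial 3-step relative trajectory
    rel[:, 0, 3] = [0.01, 0.02, 0.03]    # move 1-3 cm along x
    targets = relative_to_absolute(T_now, rel)
    send_times = schedule_commands(np.array([0.1, 0.2, 0.3]), actuation_latency=0.08)
    print(targets.shape, send_times)     # (3, 4, 4) [0.02 0.12 0.22]
```

Because the trajectory is anchored to the current EE pose rather than a world frame, any slowly varying SLAM drift or base-calibration offset cancels out between the observation and the action.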
5. Experimental Setup
The effectiveness of UMI is evaluated through a series of real-world experiments.
-
Tasks:
- Cup Arrangement (Precise): Place a cup on a saucer with the handle facing a specific direction.
- Dynamic Tossing (Dynamic): Sort objects by tossing them into the correct bins, which are placed beyond the robot's reach.
- Bimanual Cloth Folding (Bimanual): Use two robot arms to fold a sweater.
- Dish Washing (Long-Horizon): A 7-step task involving turning on a faucet, washing a plate with a sponge, and placing it back.
- In-the-Wild Cup Arrangement (Generalization): The cup arrangement task is scaled up with data collected in 30 diverse locations using 15 different cups. The policy is then tested in 2 completely new environments with both seen and unseen cups.
-
Datasets: Data is collected by 1-3 non-expert demonstrators. The number of demonstration episodes ranges from 250 to 305 for the narrow-domain tasks and 1400 for the in-the-wild task.
-
Evaluation Metrics: The primary metric is task success rate, which is manually judged by a human operator based on pre-defined criteria for each task. Experiments are repeated multiple times (typically 20 trials) with randomized initial object and robot states.
-
Baselines: The authors perform extensive ablation studies to validate their design choices. Baselines include:
- No Fisheye lens (using a standard rectilinear view).
- Alternative action spaces (absolute pose, delta pose).
- Without side mirrors / without digital reflection of mirrors.
- No latency matching.
- No inter-gripper proprioception for the bimanual task.
- Using a smaller vision encoder (ResNet-34 vs. CLIP-pretrained ViT).
- Training only on narrow-domain data for the generalization task.
6. Results & Analysis
Core Results (Capability Experiments)
- Cup Arrangement: The full UMI system achieves a 100% (20/20) success rate.
- Ablations show significant performance drops: using a standard FoV lens drops the success rate to 55%, and using an absolute action space drops it to a mere 25% due to calibration sensitivity. Using digitally reflected side mirrors improves performance from 90% (no mirrors) to 100%.
- The same policy also achieved a 90% success rate on a different robot (Franka Emika FR2), demonstrating cross-embodiment transferability.
- Dynamic Tossing: UMI achieves an 87.5% success rate.
- The most critical factor here is latency matching. Disabling latency matching cuts the success rate to 57.5%, as the robot's movements become jittery and the gripper release timing is off.
- Bimanual Cloth Folding: UMI achieves a 70% success rate.
- The key to bimanual coordination is relative inter-gripper proprioception. Without this information, the success rate plummets to 30% because the arms fail to synchronize their actions, such as grasping the hem of the sweater simultaneously.
- Dish Washing: UMI achieves a 70% success rate on this complex, long-horizon task.
-
This task highlights the need for a powerful vision model. Using a large, pre-trained vision encoder (CLIP-ViT-B/16) was essential: a smaller ResNet-34 trained from scratch completely failed (0% success), as it could not learn a reactive behavior. The final policy was also robust to perturbations, such as adding more sauce mid-task (see the sketch after this list).
-
The paper's robustness figure (Image 3) shows the policy handling various changes, including base movement (a), novel objects (b), different lighting (c), and perturbations during the task (d).
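As a concrete picture of the encoder ablation above, here is a minimal sketch, not the paper's training code, contrasting a CLIP-pretrained ViT-B/16 backbone with a from-scratch ResNet-34. It assumes the standard Hugging Face checkpoint "openai/clip-vit-base-patch16" and the torchvision resnet34 constructor; the policy head and training loop are omitted.

```python
"""Sketch of the two visual backbones compared in the dish-washing ablation:
a CLIP-pretrained ViT-B/16 versus a ResNet-34 trained from scratch."""
import torch
from torchvision.models import resnet34
from transformers import CLIPVisionModel

# CLIP-pretrained ViT-B/16 (frozen or fine-tuned in practice).
clip_vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")

# ResNet-34 with random weights, i.e. trained from scratch.
scratch_resnet = resnet34(weights=None)

images = torch.randn(2, 3, 224, 224)  # dummy batch of wrist-camera crops

with torch.no_grad():
    vit_feat = clip_vit(pixel_values=images).pooler_output  # (2, 768) features
    res_out = scratch_resnet(images)                        # (2, 1000) logits here

print(vit_feat.shape, res_out.shape)
```

The takeaway reported by the authors is that large-scale visual pre-training, not just encoder size, is what enables the reactive, long-horizon behavior.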
In-the-Wild Generalization Experiments
This is one of the most significant results of the paper.
- The UMI policy, trained on 1400 demonstrations from 30 diverse environments, achieved a combined 71.7% success rate in two completely unseen environments (a busy outdoor cafe and a water fountain) with unseen cups.
- In contrast, a baseline policy trained only on the 305 demonstrations from the lab environment (but with the same powerful ViT-L vision backbone) achieved a 0% success rate. It failed to even attempt to move towards the cup.
- Takeaway: This strongly demonstrates that the generalization capability comes from the diversity of the in-the-wild data, not just from using a large pre-trained model. UMI's portability is the key enabler for collecting such data.
Data Collection Throughput and Accuracy
-
Throughput: UMI is significantly faster than traditional teleoperation. For the cup arrangement task, data collection with UMI was over 3x faster than with a spacemouse. While not as fast as a bare human hand, it strikes a practical balance between speed and collecting robot-compatible data. Teleoperation failed completely on the dynamic tossing task.

-
Accuracy: A benchmark against a Motion Capture (MoCap) system shows the SLAM-based tracking is highly accurate.
-
Absolute Trajectory Error (ATE): 6.1 mm position, 3.5° rotation.
-
Relative Pose Error (RPE) between grippers: 10.1 mm position, 0.8° rotation. This level of accuracy is sufficient for learning precise manipulation skills.
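For reference, here is a generic sketch of how these two tracking metrics are typically computed; it is not the authors' benchmarking script. It assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in a common frame, and it aggregates with RMSE, which is a common convention rather than a detail confirmed by the paper.

```python
"""Generic ATE and inter-gripper RPE computations over (N, 4, 4) pose arrays,
assuming time-aligned trajectories in a shared frame."""
import numpy as np


def position_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """ATE (position part): RMSE of per-frame translation error, in meters."""
    err = est[:, :3, 3] - gt[:, :3, 3]
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))


def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> np.ndarray:
    """Per-frame geodesic rotation error in degrees (the rotational counterpart
    of the position errors above), for (N, 3, 3) rotation matrices."""
    R_delta = np.einsum('nij,nkj->nik', R_est, R_gt)  # R_est @ R_gt^T
    trace = np.clip((np.trace(R_delta, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(trace))


def inter_gripper_rpe(left: np.ndarray, right: np.ndarray,
                      left_gt: np.ndarray, right_gt: np.ndarray) -> float:
    """RPE between the two grippers: error of the estimated left-to-right
    relative pose versus the ground-truth relative pose (position part)."""
    rel_est = np.einsum('nij,njk->nik', np.linalg.inv(left), right)
    rel_gt = np.einsum('nij,njk->nik', np.linalg.inv(left_gt), right_gt)
    return position_rmse(rel_est, rel_gt)
```

The inter-gripper RPE matters most for bimanual tasks, since the two arms are coordinated through their relative pose rather than through a shared global frame.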

7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully presents UMI, a comprehensive and practical framework for teaching robots complex skills from in-the-wild human demonstrations. By combining a thoughtfully designed hand-held data collection device with a robust, hardware-agnostic policy interface, UMI overcomes key limitations of prior methods. It enables the learning of dynamic, bimanual, and long-horizon tasks and, most importantly, facilitates zero-shot generalization to novel scenes and objects by making large-scale, diverse data collection feasible for anyone. The open-sourcing of the project aims to foster a community-driven effort to create large, decentralized robotics datasets.
-
Limitations & Future Work: The authors acknowledge several limitations:
- Kinematic Mismatches: The framework currently relies on filtering out demonstrations that are kinematically impossible for the target robot. Future work could explore methods to learn from these "infeasible" demonstrations, adapting the skill to the robot's capabilities.
- Reliance on Visual Texture: The SLAM system requires sufficient texture in the environment to function reliably. It may struggle in visually sparse environments (e.g., rooms with plain white walls).
- Data Collection Efficiency: While faster than teleoperation, collecting data with the UMI gripper is still slower and more cumbersome than demonstrating with bare hands. Future improvements could focus on lighter, more ergonomic hardware designs.
-
Personal Insights & Critique:
- Holistic System Design: The strength of this paper lies not in a single algorithmic novelty but in the meticulous, end-to-end design of a complete system. Every component—from the physical mirrors to the relative action representation—is thoughtfully chosen to solve a specific, practical problem in transferring human skills to robots.
- Practicality and Accessibility: By using a low-cost, off-the-shelf camera (GoPro) and 3D-printed parts, the authors have created a system that is highly accessible. This, combined with the open-source release, genuinely has the potential to "democratize" robot data collection as the authors hope.
- Generalization is Key: The in-the-wild generalization result is the paper's crowning achievement. It provides strong evidence that diverse, large-scale datasets are a critical component for building truly generalist robots, and UMI provides a scalable way to create such datasets.
- Critique: The reliance on manual, subjective evaluation of task success is a common issue in manipulation research but remains a weakness. While necessary for such complex tasks, it makes reproducibility and exact comparison challenging. Furthermore, while the system is robot-agnostic, the policy is still tied to a parallel-jaw gripper end-effector, limiting the scope of tasks to non-dexterous manipulation. Extending this framework to dexterous hands would be a significant and valuable next step.