ActiveUMI: Robotic Manipulation with Active Perception from Robot-Free Human Demonstrations
TL;DR Summary
ActiveUMI pairs a portable VR teleoperation kit with sensorized controllers for precise human-robot alignment and captures the operator's active egocentric perception, yielding policies that reach 70% success on complex bimanual tasks and generalize strongly to novel objects and environments.
Abstract
We present ActiveUMI, a framework for a data collection system that transfers in-the-wild human demonstrations to robots capable of complex bimanual manipulation. ActiveUMI couples a portable VR teleoperation kit with sensorized controllers that mirror the robot's end-effectors, bridging human-robot kinematics via precise pose alignment. To ensure mobility and data quality, we introduce several key techniques, including immersive 3D model rendering, a self-contained wearable computer, and efficient calibration methods. ActiveUMI's defining feature is its capture of active, egocentric perception. By recording an operator's deliberate head movements via a head-mounted display, our system learns the crucial link between visual attention and manipulation. We evaluate ActiveUMI on six challenging bimanual tasks. Policies trained exclusively on ActiveUMI data achieve an average success rate of 70% on in-distribution tasks and demonstrate strong generalization, retaining a 56% success rate when tested on novel objects and in new environments. Our results demonstrate that portable data collection systems, when coupled with learned active perception, provide an effective and scalable pathway toward creating generalizable and highly capable real-world robot policies.
In-depth Reading
1. Bibliographic Information
1.1. Title
ActiveUMI: Robotic Manipulation with Active Perception from Robot-Free Human Demonstrations
1.2. Authors
Qiyuan Zeng, Chengmeng Li, Jude St. John, Zhongyi Zhou, Junjie Wen, Guorui Feng, Yichen Zhu, Yi Xu. Affiliations: Shanghai University, Stanford University, Midea Group.
1.3. Journal/Conference
This paper is published on arXiv, a preprint server for research papers. It is currently a preprint, meaning it has not yet undergone formal peer review or been published at a specific conference or in a journal. arXiv is a widely respected platform for quickly disseminating research in physics, mathematics, computer science, and other fields.
1.4. Publication Year
2025
1.5. Abstract
The paper introduces ActiveUMI, a framework designed for collecting data to train robots for complex bimanual manipulation using "in-the-wild" human demonstrations. ActiveUMI utilizes a portable VR teleoperation kit with sensorized controllers that mimic the robot's end-effectors, establishing precise kinematic alignment between human and robot. Key features for mobility and data quality include immersive 3D model rendering, a self-contained wearable computer, and efficient calibration methods. A distinguishing aspect of ActiveUMI is its ability to capture active, egocentric perception by recording the operator's deliberate head movements via a head-mounted display (HMD). This allows the system to learn the critical relationship between visual attention and manipulation actions. Evaluated on six challenging bimanual tasks, policies trained solely on ActiveUMI data achieve an average success rate of 70% on in-distribution tasks. Furthermore, these policies exhibit strong generalization, maintaining a 56% success rate on novel objects and in new environments. The authors conclude that portable data collection systems, when combined with learned active perception, offer an effective and scalable pathway for developing generalizable and highly capable real-world robot policies.
1.6. Original Source Link
https://arxiv.org/abs/2510.01607v1
1.7. PDF Link
https://arxiv.org/pdf/2510.01607v1.pdf
2. Executive Summary
2.1. Background & Motivation (Why)
The paper addresses a fundamental challenge in robotics: scaling up data collection to train advanced robot foundation models, which are AI models designed to perform a wide range of robotic tasks. While these models hold great promise for generalist robot policies, they are currently limited by the scarcity and quality of available robot data compared to the vast datasets used for large language models.
Current data collection methods face significant limitations:
- In-lab teleoperation: This involves a human directly controlling a robot, which is effective but expensive and difficult to scale to massive datasets.
- Human videos: Using videos of humans performing tasks can provide rich data, but there is a significant "cross-embodiment gap" – the difference in physical form and capabilities between a human and a robot – making direct transfer difficult.
- Simulation: Training robots in virtual environments offers scalability, but often suffers from a "sim-to-real gap," meaning policies learned in simulation don't transfer perfectly to physical hardware due to differences in physics, sensor noise, and other real-world complexities.
A critical oversight in many existing "in-the-wild" data collection systems, especially those using sensorized hand-held interfaces, is the neglect of active, egocentric perception. Humans naturally move their heads to manage visual occlusions (when an object is hidden from view), gather context, and focus attention. However, most robot systems primarily rely on static or wrist-mounted cameras. These cameras' viewpoints are often constrained by the robot's end-effector (gripper) movements, rather than being guided by the perceptual needs of the task. This makes it difficult for robots to handle complex, long-horizon tasks (tasks requiring many steps over a long period) or fine manipulation, as they lack the ability to actively choose the most informative viewpoint.
The paper's novel approach, ActiveUMI, aims to bridge these gaps by providing a scalable data collection method that not only captures action-aligned trajectories but also explicitly records and leverages active, egocentric perception (the operator's head movements) to enable policies to learn optimal viewpoint selection.
2.2. Main Contributions / Findings (What)
The ActiveUMI framework introduces several key contributions:
- Portable VR Teleoperation System: A self-contained, portable VR teleoperation kit with sensorized controllers that precisely mirror the robot's end-effectors. This system allows for "in-the-wild" (outside laboratory settings) data collection, promoting scalability and accessibility.
- Hardware Architecture for Embodiment Alignment: A specially designed hardware architecture that allows the target robot's custom grippers to be mounted directly onto VR controllers, ensuring tight alignment between natural human movement and robot embodiment. This includes immersive 3D model rendering within VR, a wearable computer, and efficient calibration methods.
- Capture of Active, Egocentric Perception: The defining feature of ActiveUMI is its ability to record the operator's deliberate head movements via a Head-Mounted Display (HMD). This explicit tracking of visual attention enables policies to learn how to actively control their own viewpoint, which is crucial for overcoming occlusions and performing complex tasks.
- Demonstrated Effectiveness on Challenging Bimanual Tasks: Policies trained exclusively on ActiveUMI data achieve a high average success rate of 70% on six challenging bimanual tasks (e.g., block disassembly, shirt folding, rope boxing).
- Significant Improvement over Baselines: ActiveUMI policies demonstrate a 44% and 38% improvement in average success rate compared to non-active perception counterparts (policies trained from wrist-centric views or static third-person cameras, respectively).
- Strong Generalization Capabilities: The learned policies show robust generalization, retaining a 56% success rate when tested on novel objects and in new environments, indicating that the learned active perception is transferable.
- Improved Data Collection Efficiency and Accuracy: ActiveUMI significantly speeds up data collection compared to conventional teleoperation (e.g., requiring only 1.49x to 2.06x the time of a direct human demonstration on long-horizon tasks, versus 2.63x to 3.27x for conventional teleoperation) and offers 2.5x smaller Relative Pose Error (RPE) compared to the Universal Manipulation Interface (UMI) system, leading to higher quality data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand ActiveUMI, a few foundational concepts are important:
- Robot Foundation Models (RFMs) / Vision-Language-Action (VLA) Models: These are large-scale AI models, similar to large language models (LLMs) but designed for robotics. They aim to interpret visual information and linguistic instructions, and then generate actions for robots to perform various tasks. The goal is to create "generalist" robots that can adapt to many different situations without needing to be re-programmed for each new task. The paper uses π0 (PiZero), a state-of-the-art VLA model, in its experiments.
- Teleoperation: This is the remote control of a robot by a human operator. In the context of data collection, teleoperation allows humans to demonstrate tasks directly, generating action-aligned trajectories for robots to learn from.
- In-the-wild Data Collection: This refers to collecting data in natural, unstructured environments outside of a controlled laboratory setting. It's crucial for training robust robot policies that can operate effectively in diverse real-world scenarios.
- Egocentric Perception: This is a first-person perspective, typically from the robot's "eyes" or a human operator's Head-Mounted Display (HMD). It provides a view directly from the agent's point of view, which is essential for tasks requiring precise hand-eye coordination.
- Active Perception: Beyond just having an egocentric view, active perception means the agent (robot or human operator) can deliberately control its viewpoint (e.g., by moving its head or camera) to acquire the most relevant visual information for the task. This contrasts with static cameras or cameras fixed to an end-effector, which have passive viewpoints.
- Bimanual Manipulation: This involves using two robotic arms or end-effectors (like human hands) simultaneously to perform a task. Many complex real-world tasks (e.g., folding laundry, assembling objects) require bimanual coordination.
- Six-Degrees-of-Freedom (6-DoF) Pose Tracking: This refers to tracking an object's position and orientation in 3D space. "Position" means its location along the X, Y, and Z axes (3 degrees of freedom), and "orientation" means its rotation around these axes (roll, pitch, and yaw, another 3 degrees of freedom). VR systems like the Meta Quest use sophisticated sensors to track this for the controllers and headset (a minimal code sketch of such a pose appears after this list).
- SLAM (Simultaneous Localization and Mapping): This is a computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's location within it. VR headsets use SLAM-like techniques (e.g., inside-out tracking with cameras and IMUs) to localize themselves and their controllers in the physical world.
- Fisheye Camera: A type of ultra-wide-angle lens that produces strong visual distortion intended to create a wide panoramic or hemispherical image. In robotics, these are often used on robot wrists to capture a broad view of the immediate operational environment.
- Proprioception: This refers to the robot's internal sense of its own body's position, movement, and force. For example, joint angles, motor torques, or gripper force readings are proprioceptive data. It's distinct from external sensing like vision.
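To make the 6-DoF pose concept concrete, the sketch below shows one plausible way to represent a tracked pose in code; the class name, fields, and quaternion-based orientation are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    """A 6-DoF pose: 3 positional and 3 rotational degrees of freedom.

    Orientation is stored as a unit quaternion (w, x, y, z), a common
    singularity-free alternative to roll/pitch/yaw angles.
    """
    x: float          # position along X (meters)
    y: float          # position along Y (meters)
    z: float          # position along Z (meters)
    qw: float = 1.0   # quaternion scalar part (identity rotation by default)
    qx: float = 0.0   # quaternion vector part
    qy: float = 0.0
    qz: float = 0.0

# Example: a controller 30 cm in front of the world origin, unrotated.
right_controller = Pose6DoF(x=0.3, y=0.0, z=0.0)
```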
3.2. Previous Works
The paper contextualizes ActiveUMI within the broader landscape of robotic data collection and foundation models:
- Robot Foundation Models (RFMs): The paper references a surge of interest in RFMs and Vision-Language-Action (VLA) models (e.g., [6, 8, 10, 19, 21, 22, 35, 47, 50] for RFMs; [4, 5, 11, 12, 24, 25, 27, 31, 32, 38, 40, 41, 52, 53] for VLA models). These works highlight the ambition to create generalist robots but also underscore the massive data requirements, often citing the disparity with web-scale data available for large language models.
- Addressing Data Scarcity: Several approaches are mentioned to tackle the data scarcity problem:
- User-friendly teleoperation systems: Works like [3, 7, 16, 26, 37, 48] aim to make human control of robots easier, but ActiveUMI argues these are still expensive and hard to scale.
- Large-scale simulation data: Papers such as [1, 17, 30] leverage simulations, but ActiveUMI points out the inherent "sim-to-real gap."
- Repurposing human videos: Approaches from [18, 20, 23, 29, 33, 36, 45, 46, 54] use human videos, but ActiveUMI notes the "embodiment gap" between human and robot kinematics.
- In-the-Wild Data Collection Systems with Sensorized Hand-held Interfaces: ActiveUMI builds upon a line of research focused on portable systems for collecting data outside the lab:
- DexCap [39]: Uses a wearable glove to capture precise wrist and fingertip poses for dexterous tasks.
- AirExo [13, 14]: Leverages low-cost hardware with direct kinematic mapping for arm manipulation.
- DoGlove [49]: A low-cost haptic force feedback glove system for teleoperation and manipulation.
- Dexop [15]: A passive hand exoskeleton for collecting rich sensory data for dexterous manipulation.
- NuEXO [51]: Portable exoskeleton hardware for teleoperation and collecting humanoid data.
- Universal Manipulation Interface (UMI) [9]: This is highlighted as the most related work. UMI introduced a simple handheld controller for scalable bimanual data collection. ActiveUMI directly addresses a key limitation of UMI.
- DexUMI [44] and FastUMI [28]: Extensions of UMI, with DexUMI adapting it for dexterous hands and FastUMI redesigning it for rapid deployment with an extra camera.
3.3. Technological Evolution
The field of robot learning has evolved from relying heavily on in-lab, costly teleoperation or simulation, to exploring more scalable approaches like leveraging human videos or specialized "in-the-wild" data collection systems. Early systems focused on capturing hand/arm movements but often used static or wrist-mounted cameras. This limited their ability to handle tasks requiring dynamic viewpoint changes.
ActiveUMI represents an evolution by explicitly integrating active, egocentric perception. While systems like UMI provided scalable human-robot kinematic alignment, ActiveUMI's novelty lies in recognizing and addressing the critical role of human head movements (visual attention) in manipulation. It extends the UMI concept by treating the operator's head pose as a learnable input, thereby enabling the robot to choose its viewpoint, much like a human. This addresses a crucial gap where previous systems, despite advancements in hardware portability and kinematic mapping, still struggled with occlusions and context-dependent visual information due to their passive camera setups. Vision-in-Action [43] is cited as a related work also focusing on active perception, indicating a growing trend in this direction.
3.4. Differentiation
ActiveUMI differentiates itself from previous works primarily through its integration of active, egocentric perception with a portable, high-fidelity human-robot kinematic mapping system.
- Vs. Traditional Teleoperation/Simulation: ActiveUMI offers a more scalable and cost-effective data collection method than traditional in-lab teleoperation, while mitigating the sim-to-real and embodiment gaps associated with simulation and human videos, respectively.
- Vs. UMI-style Systems (e.g., UMI, DexUMI, FastUMI): This is the most direct comparison.
- Commonality: ActiveUMI shares with UMI the core idea of using hand-held controllers to capture bimanual human demonstrations and map them to robot actions, enabling "in-the-wild" data collection. It also shares the goal of bridging human-robot kinematics.
- Key Differentiation (Active Perception): The primary distinction is ActiveUMI's explicit capture and utilization of the operator's head movements via a VR Head-Mounted Display (HMD). UMI and its variants primarily rely on wrist-mounted cameras, whose viewpoints are passively determined by the robot's arm movements. ActiveUMI, by contrast, learns to actively control a mobile head camera, enabling the robot to dynamically adjust its viewpoint to overcome occlusions and acquire task-critical information. This fundamentally changes how the robot "sees" and interacts with its environment.
- Hardware Flexibility: ActiveUMI's design allows for greater hardware flexibility by adapting a modified Meta Quest controller to the target robot's existing end-effector, rather than being built around a specific, non-interchangeable gripper as some UMI systems might be.
- Accuracy: The paper also shows ActiveUMI achieving 2.5x smaller Relative Pose Error (RPE) compared to UMI, indicating higher fidelity in kinematic mapping.
- Vs. Other Wearable Systems (e.g., DexCap, AirExo, DoGlove, Dexop, NuEXO): While these systems also focus on "in-the-wild" data collection and various forms of human-robot mapping, ActiveUMI stands out by specifically addressing the visual attention aspect through active head camera control, which is often overlooked in systems primarily focused on hand/arm pose capture.
- Vs. Vision-in-Action [43]: While Vision-in-Action also focuses on designing a teleoperation system for active perception, ActiveUMI's core contribution integrates this with a highly portable, scalable, and high-fidelity VR-based teleoperation system for bimanual manipulation, and explicitly tracks operator head movements for viewpoint learning.
4. Methodology
The ActiveUMI framework is built on two core principles: (i) tight alignment of robot embodiment with natural human movement, and (ii) enabling active perception.
4.1. Principles
The core idea behind ActiveUMI is to address the limitations of current robot data collection by directly translating natural human interaction patterns, including visual attention, into robot control.
- Embodiment Fidelity and Kinematic Alignment: The system aims to minimize the "embodiment gap" by mirroring the robot's end-effectors with sensorized human controllers. This ensures that the human's natural movements and interactions with objects can be directly translated into robot actions. The intuition is that if the human's input directly corresponds to the robot's physical capabilities, the robot can learn more effectively from demonstrations.
- Active, Egocentric Perception: The central principle is that "how a human looks" is as important as "how a human acts." Humans instinctively move their heads to manage occlusions, gain context, and focus on relevant details during manipulation. ActiveUMI hypothesizes that by capturing these deliberate head movements and incorporating them into the learned policy, robots can also learn to actively control their camera viewpoints, thereby improving performance on complex tasks that require dynamic visual information.
4.2. Steps & Procedures
ActiveUMI encompasses a hardware system for data collection and a learning paradigm for active perception.
4.2.1. Data Collection System for ActiveUMI
The data collection system is designed to be portable and efficient, using consumer-grade VR equipment.
- Hardware Setup (Figure 2):
  - VR Gripper Controller: Modified Meta Quest 3s controllers are used. These controllers offer high-precision 6-DoF pose tracking via the headset's inside-out tracking system, which triangulates their pose using integrated infrared (IR) LEDs.
    - Procedure: The controllers are rigidly mounted onto a replica of the target robot's end-effector (gripper). This ensures that the controller's tracked 6-DoF pose (position: x, y, z; orientation: roll, pitch, yaw) directly represents the robot's end-effector pose.
  - Gripper Actuation: A micro-motor is integrated into each controller to drive the open/close motion of the gripper.
    - Procedure: An identical copy of the robot's gripper is attached to the operator's controller. When the operator triggers the gripper action on the controller, the micro-motor actuates the attached gripper, and this action is recorded.
  - Wrist-Mounted Fisheye Camera: Each controller is augmented with a fisheye camera.
    - Procedure: These cameras are positioned to maximize their field of view, capturing comprehensive visual information from the robot's "wrist perspective." This "wrist view" complements the head-mounted camera's view.
  - Head-Mounted Display (HMD): The Meta Quest 3s HMD serves two critical roles.
    - Localization: Its robust SLAM system provides a stable world coordinate system and tracks the 6-DoF poses of both the operator's head and the controllers.
    - Dynamic Top Camera: The HMD's front-facing color cameras act as a dynamic, egocentric "top camera," providing a global perspective coupled with the operator's line of sight.
  - Wearable Device: A compact computational unit (small computer) is worn on the operator's back.
    - Procedure: This allows the entire system to be self-contained and portable, enabling data collection in diverse "in-the-wild" environments without being tethered to a stationary workstation.
- Immersive Data Collection (Figure 4):
  - Procedure: A 3D model of the robotic arms is rendered within the VR environment. These virtual arms are precisely aligned with the operator's hand-held controllers, providing real-time visual feedback on the robot's virtual movements. This helps the operator visualize and understand the robot's actions as they demonstrate.
  - Figure 4. Immersive data collection. The system provides the operator with critical visual feedback by rendering the robot's arms in the VR environment, with coordinate axes indicating their poses.
Data Recording (Figure 3):
-
The system records the 6-DoF pose data from the left controller, right controller, and the HMD. These correspond directly to the robot's two gripper tips and its head-mounted camera.
-
All data is recorded in absolute coordinates relative to a unified world coordinate system established during an initial calibration phase.
-
The wrist-mounted fisheye camera feeds are also recorded.
该图像是示意图,展示了ActiveUMI系统的数据采集与模型评估流程。左侧展示了通过VR设备和多视角相机收集的多通道数据,右侧展示了机器人执行任务时的模型输出及其正确与错误的示范对比。
The data collection and model evaluation process.
-
4.2.2. Active Perception for Policy Learning
The core innovation is learning from the operator's head movements.
- Recording Head Movements: The real-time 6-DoF pose of the operator's HMD is recorded as an additional input stream alongside the hand actions and wrist camera feeds. This captures the operator's visual attention and viewpoint selection.
- Policy Learning: During policy training, the model learns the correlation between the operator's head movements (visual attention patterns) and their corresponding hand actions needed to perform tasks. This means the policy learns not just what to do, but also where to look.
- Deployment: When the trained policy is executed on the real robot, it predicts a 6-DoF pose for the robot's head camera in addition to the arm actions. This predicted motion is executed by a dedicated low-level controller driving a third robot arm that carries the head camera. The robot can thus dynamically adjust its viewpoint, actively mimicking the learned human attention patterns, overcoming occlusions, and enhancing performance. The experimental platform has two 6-DoF arms for bimanual manipulation and a third 6-DoF arm for the active, mobile viewpoint, for a total of 20 DoF (6+6+6, plus 2 for the grippers).
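The deployment loop described above can be summarized in pseudocode. The sketch below is a minimal illustration, assuming a policy that maps three camera images plus proprioception to a 20-dimensional action (two 6-DoF end-effector poses, one 6-DoF head-camera pose, and two gripper commands); all function names, field names, and the action layout are hypothetical, not the paper's actual API.

```python
import numpy as np

def split_action(action: np.ndarray) -> dict:
    """Split a 20-D action vector into named components.

    Layout assumed here: [left EE pose (6), right EE pose (6),
    head-camera pose (6), left gripper (1), right gripper (1)].
    """
    assert action.shape == (20,)
    return {
        "left_ee_pose":  action[0:6],
        "right_ee_pose": action[6:12],
        "head_pose":     action[12:18],
        "grippers":      action[18:20],
    }

def control_loop(policy, robot, hz: float = 30.0):
    """Hypothetical deployment loop in which the policy also commands the head camera."""
    while not robot.task_done():
        obs = {
            "head_image": robot.read_head_camera(),        # active egocentric view
            "left_wrist_image": robot.read_left_wrist(),   # fisheye wrist view
            "right_wrist_image": robot.read_right_wrist(),
            "proprio": robot.read_proprioception(),
        }
        act = split_action(policy.predict(obs))
        robot.move_arms(act["left_ee_pose"], act["right_ee_pose"], act["grippers"])
        # The third arm executes the predicted viewpoint, mimicking human head motion.
        robot.move_head_camera(act["head_pose"])
        robot.sleep(1.0 / hz)
```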
4.2.3. Calibrating End-Effector for Precise Data Collection
To ensure high-quality and consistent data, several calibration methods are employed:
- In-Situ Environment Setup:
- Procedure: Operators can press a designated 'B' button on the controller to reset the 6-DoF zero-point (origin) of the base coordinate system. The coordinate system's axes are rendered in real-time within the VR headset, allowing the operator to intuitively align the virtual reference frame with the physical workspace. This flexible reset ensures a consistent starting point for each data collection session.
- Gripper Placeholder:
- Procedure: A physical jig or "docking station" is used for the VR controllers. This placeholder can be placed anywhere in the workspace. When the controllers are seated in it, their relative distance and pose are fixed to a known, predefined state. Pressing a button while docked instantly calibrates the virtual coordinate system, aligning its origin and orientation with this known physical configuration.
- Haptic Feedback for Zero-Point Position:
- Procedure: To aid in precise zero-point calibration, the controller's motor generates a high-frequency vibration when a gripper moves within 3cm of the base coordinate system's origin. This tactile cue helps operators confirm alignment without needing to rely solely on visual readouts, improving speed and efficiency.
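These calibration aids lend themselves to a simple event-driven sketch. The code below illustrates the 'B'-button zero-point reset and the haptic cue that fires when a gripper comes within 3 cm of the calibrated origin; the controller API shown (button callback, vibration call) is an assumption for illustration, not the Meta Quest SDK.

```python
import numpy as np

HAPTIC_RADIUS_M = 0.03  # vibrate when a gripper is within 3 cm of the origin

class CalibrationHelper:
    def __init__(self):
        # Origin of the base coordinate system, expressed in the headset's SLAM frame.
        self.origin = np.zeros(3)

    def on_button_b(self, controller_position: np.ndarray):
        """Reset the zero-point of the base frame to the current controller position."""
        self.origin = controller_position.copy()

    def update(self, controller, controller_position: np.ndarray):
        """Per-frame check: emit a high-frequency haptic pulse near the calibrated origin."""
        if np.linalg.norm(controller_position - self.origin) < HAPTIC_RADIUS_M:
            controller.vibrate(frequency_hz=200, amplitude=0.8)  # hypothetical call
```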
4.3. Mathematical Formulas & Key Details
The paper presents the formulas for Relative Pose Error (RPE) in the context of data collection accuracy.
- Absolute Error ($e$): The absolute error is the difference in magnitude between the actual distance measured after robot replay and the nominal (ground truth) distance: $e = |d_{\text{playback}} - d_{\text{nominal}}|$
  - $d_{\text{playback}}$: The actual distance measured between the inside of the two grippers during playback on the real robot. This is what the robot actually did.
  - $d_{\text{nominal}}$: The nominal distance, which is the ground truth distance between the two grippers as manually recorded by the operator using a tape measure during the demonstration. This is what the robot should have done.
  - $|\cdot|$: Denotes the absolute value, ensuring the error is always a positive quantity.
- Relative Pose Error (RPE): The RPE expresses the absolute error as a percentage of the nominal distance, providing a normalized measure of accuracy: $\text{RPE} = \frac{e}{d_{\text{nominal}}} \times 100\%$
  - $e$: The absolute error, as calculated above.
  - $d_{\text{nominal}}$: The nominal (ground truth) distance.
  - $\times 100\%$: Converts the ratio into a percentage.

These formulas are critical for evaluating the fidelity of the ActiveUMI system in accurately capturing and replaying human demonstration trajectories on a robot. A lower RPE indicates higher precision in data collection and playback.
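For reference, both error measures are straightforward to compute. The helper below follows the definitions given in this section (distances may be in any consistent unit, e.g. centimeters).

```python
def absolute_error(d_playback: float, d_nominal: float) -> float:
    """Absolute error e = |d_playback - d_nominal|."""
    return abs(d_playback - d_nominal)

def relative_pose_error(d_playback: float, d_nominal: float) -> float:
    """RPE = |d_playback - d_nominal| / d_nominal * 100 (%)."""
    return absolute_error(d_playback, d_nominal) / d_nominal * 100.0

# Example: a 50 cm nominal separation replayed as 50.5 cm gives a 1% RPE.
print(relative_pose_error(50.5, 50.0))  # -> 1.0
```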
5. Experimental Setup
5.1. Datasets
The paper describes the data collected using the ActiveUMI system as "in-the-wild human demonstrations." The dataset comprises recordings of human operators performing six challenging bimanual manipulation tasks using the ActiveUMI VR teleoperation kit.
- Origin: The data is collected by human operators using the ActiveUMI portable system in various real-world environments ("in-the-wild").
- Characteristics:
- Multimodal: Includes 6-DoF pose data for two gripper controllers and the operator's head (from the HMD), as well as visual feeds from wrist-mounted fisheye cameras and the HMD's front-facing cameras (acting as an egocentric head camera).
- Action-aligned: Human actions and corresponding visual attention cues (head movements) are synchronously recorded.
- Bimanual: Designed for tasks requiring two robot arms.
- Egocentric: The head camera data provides a first-person perspective, crucial for active perception.
- Real-world: Collected from diverse environments, enhancing generalization.
- Data Sample: A single data point at any given time step (collected at 30 Hz) consists of the following fields (a code sketch of such a record appears after this list):
- 6-DoF pose (position and orientation) of Left Controller.
- 6-DoF pose (position and orientation) of Right Controller.
- 6-DoF pose (position and orientation) of Operator's Head (from HMD).
- Image frame from Left Wrist Fisheye Camera.
- Image frame from Right Wrist Fisheye Camera.
- Image frame from HMD Front-facing Cameras (Egocentric Head Camera).
- Gripper actuation state (open/close) for both left and right grippers.
- Justification: The ActiveUMI dataset is specifically designed to overcome the limitations of existing datasets by integrating active, egocentric perception from "in-the-wild" human demonstrations. This approach aims to provide highly relevant and diverse data for training robust and generalizable robot policies, particularly for tasks involving occlusions, long horizons, and fine manipulation, where viewpoint control is critical.
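As referenced above, the sketch below shows one way such a 30 Hz record might be laid out in code; the class and field names are illustrative assumptions rather than the released data schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ActiveUMIStep:
    """One demonstration timestep (recorded at 30 Hz), per the fields listed above."""
    left_controller_pose: np.ndarray   # 6-DoF pose (position + orientation)
    right_controller_pose: np.ndarray  # 6-DoF pose
    head_pose: np.ndarray              # 6-DoF HMD pose (active egocentric viewpoint)
    left_wrist_image: np.ndarray       # fisheye frame, e.g. HxWx3 uint8
    right_wrist_image: np.ndarray      # fisheye frame
    head_image: np.ndarray             # HMD front-facing camera frame
    left_gripper_open: float           # gripper actuation state in [0, 1]
    right_gripper_open: float
    timestamp: float                   # seconds since episode start
```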
5.2. Evaluation Metrics
The primary evaluation metric for policy performance is success rate. For data collection accuracy, Relative Pose Error (RPE) is used.
- Success Rate:
  - Conceptual Definition: Measures the percentage of trials in which the robot successfully completes the assigned task according to predefined criteria. This is a direct and intuitive measure of a policy's effectiveness.
  - Mathematical Formula: $\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$
  - Symbol Explanation:
    - Number of Successful Trials: The count of times the robot completed the task to specification.
    - Total Number of Trials: The total number of attempts made by the robot for a given task.
- Relative Pose Error (RPE):
  - Conceptual Definition: RPE quantifies the accuracy of the ActiveUMI system in capturing human demonstrations and subsequently replaying them on a real robot. It measures the percentage difference between a manually recorded, nominal distance (ground truth) and the actual distance achieved by the robot during replay. A lower RPE indicates higher fidelity and precision in the data collection and the robot's execution.
  - Mathematical Formula: $\text{RPE} = \frac{|d_{\text{playback}} - d_{\text{nominal}}|}{d_{\text{nominal}}} \times 100\%$
  - Symbol Explanation:
    - $d_{\text{playback}}$: The distance measured between the inside of the two grippers when the robot replicates the demonstrated pose.
    - $d_{\text{nominal}}$: The nominal distance, which is the ground truth distance manually recorded by the operator during the demonstration.
    - $|d_{\text{playback}} - d_{\text{nominal}}|$: The absolute difference between the playback distance and the nominal distance.
    - RPE: The Relative Pose Error, expressed as a percentage.
5.3. Baselines
The paper compares ActiveUMI's performance against two main baselines, designed to isolate the impact of active and egocentric perception:
- UMI (Wrist-Camera-Only Baseline):
  - Description: This configuration mimics the standard setup of Universal Manipulation Interface (UMI) style methods. The robot relies solely on two fisheye wrist-mounted cameras for visual perception. The head camera component is entirely removed.
  - Why representative: It represents a common approach in "in-the-wild" data collection where perception is tightly coupled with the end-effectors, often overlooking a dedicated, dynamic viewpoint. This setup has 14 Degrees of Freedom (DoF) – 6 DoF plus a gripper for each of the two bimanual arms, excluding any head camera control.
  - Model Adaptation: For the VLA model π0 (PiZero), the visual tokens corresponding to the missing third-camera view are padded (filled with placeholder data) to maintain architectural consistency (a sketch of this padding appears after this list).
- UMI w/ Fixed Head Camera (Static Top-Down Camera Baseline):
  - Description: In this baseline, a head camera is present, but it is mounted in a static, top-down position. This removes the "active" perception component, meaning the camera's viewpoint cannot be adjusted dynamically based on task needs or learned attention.
  - Why representative: This setup provides a third-person, global view, which is often considered complementary information for complex bimanual tasks. It helps evaluate whether merely having a third camera (even if static) is sufficient, or whether active control of that camera is essential. This setup also has 14 DoF (6 DoF plus a gripper per arm, with the head camera fixed).
- Shared Model: For all three configurations (ActiveUMI, Fixed Head Camera, Wrist-Camera-Only), the same base VLA model, π0 (PiZero), is used to train policies, ensuring a fair comparison of the perception setups. All experiments were conducted over 10 trials unless specified otherwise.
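The padding adaptation for the wrist-camera-only baseline can be pictured as below. This is a minimal sketch assuming the VLA backbone expects a fixed set of three camera views per timestep; the zero-image padding shown is an illustrative stand-in for however π0's visual tokens are actually filled, and the resolution is an assumption.

```python
import numpy as np

EXPECTED_VIEWS = ["head", "left_wrist", "right_wrist"]
IMAGE_SHAPE = (224, 224, 3)  # assumed input resolution

def pad_missing_views(views: dict) -> dict:
    """Fill absent camera views with zero images so the model's input layout
    (and hence its visual token count) stays identical across baselines."""
    padded = {}
    for name in EXPECTED_VIEWS:
        img = views.get(name)
        padded[name] = img if img is not None else np.zeros(IMAGE_SHAPE, dtype=np.uint8)
    return padded

# Wrist-camera-only baseline: no head view is available, so it is zero-padded.
obs = pad_missing_views({"left_wrist": np.ones(IMAGE_SHAPE, np.uint8),
                         "right_wrist": np.ones(IMAGE_SHAPE, np.uint8)})
```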
5.4. Task Descriptions
The experiments were conducted on six challenging bimanual tasks, designed to test a diverse range of robotic skills:
- Block disassembly: A precision task requiring the robot to separate two small, interlocked blocks and sort them into a designated box. This tests fine motor skills and sequential manipulation.
- Shirt folding: A deformable object manipulation task that demands accurate state recognition (e.g., recognizing unfolded or partially folded states) to correctly fold a cloth. This is challenging due to the non-rigid nature of the object.
- Rope boxing: A long-horizon task where the robot must neatly guide a long rope into a box. This tests continuous manipulation and planning over an extended sequence of actions.
- Toolbox cleaning: An articulated object manipulation task requiring the robot to operate a hinge to close the lid of a toolbox. This tests interaction with hinged mechanisms and understanding of object articulation.
- Bottle placing: A task designed to test the policy's generalization and robustness to significant randomization in object positions. This assesses how well the policy can handle variability in the environment.
- Take Drink from Bag: This task appears in the results tables but is not explicitly described in the paper's task description section (4.1); it is presumably a similar manipulation task, likely involving grasping a drink and removing it from a bag.
Figure 5. Examples of the bimanual manipulation tasks, showing step-by-step action sequences for block disassembly, shirt folding, rope boxing, toolbox cleaning, and bottle placing.
6. Results & Analysis
6.1. Core Results
6.1.1. How Important is the Egocentric Active Perception?
This section evaluates the impact of ActiveUMI's core feature: active, egocentric perception. Policies trained with ActiveUMI are compared against those trained with a fixed head camera and those relying only on wrist cameras (UMI baseline).
The following table shows the results from Table 1 (In-Domain):
| Camera View | Bottle placing | Rope boxing | Shirt folding | Block disassembly | Take Drink from Bag | Average |
| --- | --- | --- | --- | --- | --- | --- |
| UMI | 60% | 20% | 10% | 0% | 40% | 26% |
| UMI w/ Fixed Head Camera | 60% | 40% | 40% | 20% | 50% | 42% |
| ActiveUMI | 90% | 70% | 80% | 30% | 80% | 70% |
Analysis: The results clearly demonstrate that equipping the robot with active perception (ActiveUMI) significantly outperforms both baselines across all evaluated tasks.
- ActiveUMI's Superiority: ActiveUMI achieves an average success rate of 70%, which is 44% higher than the UMI baseline (26%) and 28% higher than the UMI with Fixed Head Camera baseline (42%).
- Task-Specific Improvements: For tasks like "Bottle placing" and "Take Drink from Bag," ActiveUMI achieves 90% and 80% success rates respectively, significantly higher than the baselines. Even on the challenging "Block disassembly" task, where all methods struggled, ActiveUMI still achieved 30%, while UMI had 0%.
- Fixed Head Camera vs. Wrist-Only: The "UMI w/ Fixed Head Camera" consistently outperforms the "UMI" (wrist-camera-only) baseline (42% vs. 26% average). This indicates that even a static third-person view provides valuable complementary information for complex bimanual tasks, suggesting that having an additional, broader perspective is beneficial.
- Hypothesized Reasons: The authors suggest two main drivers for ActiveUMI's improved performance:
- Compensation for Demonstrator Motion: During in-the-wild data collection, human demonstrators naturally move their heads and bodies. An active camera allows the learned policy to compensate for these movements, treating them as deliberate viewpoint selections rather than just observation noise.
- Task-Critical Information Acquisition: Active viewpoint selection enables the policy to acquire crucial visual information on demand, such as verifying a grasp or inspecting a specific part of an object that might otherwise be occluded.
6.1.2. Mixed Training with Teleoperated Data
This section investigates the optimal strategy for combining ActiveUMI data with a small amount of high-quality teleoperated data (typically more precise but less scalable). Experiments were conducted on the "shirt folding" task over 20 trials.
The following table shows the results from Table 3:
| Teleoperated Data Ratio | 10% | 1% | 0% |
| --- | --- | --- | --- |
| Avg. Success Rate | 90% | 95% | 80% |
Analysis:
- Improvement with Teleoperated Data: Adding even a small amount of teleoperated data significantly improves performance on the "shirt folding" task. Training exclusively on ActiveUMI data yielded an 80% success rate.
- Optimal Mixture: Interestingly, a very small mixture of teleoperated data (1%) achieved the highest success rate of 95%, outperforming the 10% mixture (90%). This suggests a "sweet spot" where a minimal amount of high-fidelity data can effectively fine-tune a policy trained on large-scale, potentially noisier, ActiveUMI data.
- Implication: This finding aligns with previous research suggesting that policies can be effectively trained by combining large-scale, lower-cost data with a small fraction of high-quality real-world demonstrations. It highlights that ActiveUMI data is highly sample-efficient and can significantly lower the cost of developing robust robot foundation models.
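The mixed-training recipe can be expressed as a simple sampling scheme. The sketch below draws each training example from the teleoperated pool with the chosen probability (e.g., 0.01 for the 1% mixture); this is an illustrative implementation, since the paper does not specify whether mixing is done per-sample, per-episode, or by dataset duplication.

```python
import random

def sample_mixed_batch(activeumi_data, teleop_data, batch_size: int,
                       teleop_ratio: float = 0.01):
    """Draw a batch in which roughly `teleop_ratio` of examples are teleoperated."""
    batch = []
    for _ in range(batch_size):
        pool = teleop_data if random.random() < teleop_ratio else activeumi_data
        batch.append(random.choice(pool))
    return batch

# Example: 1% high-fidelity teleoperated data mixed into ActiveUMI training batches.
# batch = sample_mixed_batch(activeumi_steps, teleop_steps, batch_size=256, teleop_ratio=0.01)
```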
6.1.3. Generalization Capability of ActiveUMI for In-the-Wild Data Collection
A crucial test for any robust policy is its ability to generalize to novel objects and unseen environments. Policies trained on ActiveUMI data were tested in a new environment with the same tasks.
The following table shows the results from Table 2 (New Environment):
| Camera View | Bottle placing | Rope boxing | Shirt folding | Block disassembly | Take Drink from Bag | Average |
| --- | --- | --- | --- | --- | --- | --- |
| UMI | 30% | 0% | 0% | 0% | 0% | 6% |
| UMI w/ Fixed Head Camera | 30% | 10% | 20% | 0% | 20% | 16% |
| ActiveUMI | 70% | 50% | 80% | 30% | 50% | 56% |
Analysis:
- Strong Generalization of ActiveUMI: ActiveUMI policies demonstrate strong generalization capabilities, achieving an average success rate of 56% in novel environments. While this is a drop from the 70% in-domain performance, it represents a significant retention of capabilities.
- Baselines Collapse: In stark contrast, the baselines perform very poorly in new environments:
- "UMI w/ Fixed Head Camera" drops to a mere 16% average success rate.
- "UMI" (wrist-camera-only) baseline's performance plummets to just 6% average success, failing completely on several tasks (0% success).
- Importance of Active Viewpoint Control: This dramatic difference highlights that policies relying on static or constrained viewpoints (like wrist cameras or fixed head cameras) fail to adapt when the visual context changes significantly. The ability of ActiveUMI policies to actively control their viewpoint makes them more resilient to visual shifts and novel scene configurations, validating the quality and transferability of the "in-the-wild" data with active perception.
6.1.4. Data Collection Throughput and Accuracy
This section examines the practical aspects of ActiveUMI's efficiency and precision.
Throughput (Efficiency):
- Measurement: The time required to complete two long-horizon tasks, "rope boxing" and "shirt folding," was measured for three methods: ActiveUMI, teleoperation of a real robot via a VR kit, and direct human demonstration.
- Results (Figure 6(d)):
- For "rope boxing," ActiveUMI was 2.06x slower than a direct human demonstration, while conventional teleoperation was 3.27x slower.
- For "shirt folding," ActiveUMI was 1.49x slower than a direct human demonstration, compared to 2.63x for teleoperation.
- Analysis: ActiveUMI significantly speeds up data collection compared to conventional teleoperation. While it is still slower than a human performing the task directly (due to the overhead of operating the system), it offers a practical and efficient middle ground, combining much of the efficiency of natural human motion with the ability to generate robot-compatible data.
Data Collection Accuracy (Figure 6(e)):
- Measurement: The Relative Pose Error (RPE) was measured by comparing the distance between the two grippers as demonstrated by an operator with ActiveUMI controllers ($d_{\text{nominal}}$) and the actual distance achieved when the robot replayed that trajectory ($d_{\text{playback}}$). This was done for various nominal distances from 10 cm to 100 cm.
- Results (Figure 6(e)): The RPE of ActiveUMI is 2.5x smaller than that of the UMI system.
- Analysis: This low RPE indicates that ActiveUMI provides much better data quality in terms of kinematic fidelity. The authors attribute this precision to the advantages of the underlying VR system used for pose tracking. Higher accuracy in collected data directly translates to better-trained policies, as the robot learns from more precise demonstrations.
Figure 6. Throughput and accuracy comparisons. ActiveUMI, conventional teleoperation of a real robot, and direct bare-hand human demonstration are compared on the rope boxing and shirt folding tasks; ActiveUMI outperforms teleoperation in both completion time and Relative Pose Error (RPE).
6.2. Data Presentation (Tables)
All relevant tables have been transcribed and presented in the respective subsections above.
6.3. Ablations / Parameter Sensitivity
The paper includes a specific ablation study on the impact of different camera views (Table 1 and Table 2) and a parameter sensitivity study on the ratio of mixed teleoperated data (Table 3).
- Ablation of Active Perception (Tables 1 & 2): By comparing ActiveUMI (full system with active head camera) against "UMI w/ Fixed Head Camera" and "UMI" (wrist-camera-only), the study directly isolates the contribution of active, egocentric perception. The results overwhelmingly show that the active head camera component is critical, significantly improving both in-domain performance (Table 1) and generalization to novel environments (Table 2). This reveals that not only having a head camera but also actively controlling its viewpoint is essential for robustness and capability.
- Sensitivity to Mixed Teleoperated Data Ratio (Table 3): This study explores how the proportion of high-fidelity teleoperated data affects policy performance when combined with ActiveUMI data. It shows that even a very small amount (1%) of teleoperated data can lead to the best performance (95% success), outperforming both 0% and 10% teleoperated data. This indicates that ActiveUMI data forms a strong base, and policies are highly sensitive to even tiny infusions of highly precise, expert demonstrations for fine-tuning. It highlights the sample efficiency of the ActiveUMI data when paired with strategic data mixing.
These studies confirm the importance of ActiveUMI's key components and provide insights into optimal training strategies.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces ActiveUMI, a novel framework for collecting "in-the-wild" human demonstrations to train complex bimanual robotic manipulation policies. The core innovation lies in integrating a portable VR teleoperation kit with the capture of active, egocentric perception – explicitly recording the operator's head movements alongside their hand actions. This allows robots to learn not just what to do, but also where to look. ActiveUMI policies achieved an impressive 70% average success rate on diverse bimanual tasks and demonstrated strong generalization (56% success in novel environments). Critically, it outperformed baselines lacking active perception by 38-44%, underscoring that viewpoint control is as crucial as action execution. Furthermore, ActiveUMI offers superior data collection efficiency and accuracy compared to prior methods like UMI. The ability to effectively train policies with predominantly ActiveUMI data, fine-tuned with minimal teleoperated data, presents a scalable and cost-effective pathway for developing generalizable robot foundation models.
7.2. Limitations & Future Work
The paper does not explicitly state a dedicated "Limitations" or "Future Work" section. However, based on the context and typical challenges in the field, potential limitations and implied areas for future work could include:
- Human Operator Dependence: While ActiveUMI makes data collection more accessible, it still relies on human operators for demonstrations. The quality and diversity of the collected data are inherently tied to the operator's skill, consistency, and ability to generate varied demonstrations. Future work could explore methods to reduce this dependency or to automatically augment collected data.
- Complexity of 20-DoF Control: While the system works, precisely coordinating 20-DoF (two 6-DoF arms, one 6-DoF head camera arm, and two gripper actuations) for learning might become very complex for humans to demonstrate for highly intricate tasks, potentially limiting the complexity of tasks that can be reliably demonstrated.
- Hardware and Software Maintenance: Being a VR-based system with custom modifications (micro-motors, fisheye cameras), long-term robustness, maintenance, and software updates for the consumer-grade VR equipment might pose challenges for widespread, sustained "in-the-wild" deployment.
- Sim-to-Real for Active Camera: While the paper mentions the robot's low-level controller executes the predicted head motion, there could still be subtle sim-to-real gaps in precisely matching the human-demonstrated viewpoint control on a physical robot arm, especially under dynamic conditions or with varying robot arm kinematics.
- Task Scope: The evaluated tasks, while challenging, represent a specific set of bimanual manipulation scenarios. Future work could involve expanding to tasks with even greater complexity, unstructured environments, or interaction with highly dynamic objects or human co-operators.
- Causal Understanding of Attention: While ActiveUMI learns the correlation between head movements and actions, it doesn't necessarily learn the causal reasons for human visual attention. Future research could delve into more interpretable models that understand why a particular viewpoint is chosen at a specific moment.
- Integration with Other Modalities: The current system primarily focuses on vision and proprioception. Future work could explore integrating other sensory modalities (e.g., touch, force feedback) into the active perception framework.
7.3. Personal Insights & Critique
- Novelty of Active Perception Integration: The explicit and rigorous integration of active, egocentric perception from human head movements into robot policy learning is a significant and genuinely novel contribution. It addresses a fundamental aspect of human intelligence (how we use our eyes to guide our actions) that has often been overlooked or passively handled in robot learning. This is a crucial step towards more human-like and capable robots.
- Scalability and Portability: The "in-the-wild" and portable nature of ActiveUMI (wearable computer, consumer VR hardware) is highly commendable. This design choice directly tackles the critical bottleneck of data scarcity for robot foundation models, making high-quality data collection much more accessible and widespread.
- Rigorous Evaluation: The experimental design, comparing against strong baselines (UMI, fixed head camera) and evaluating both in-domain and generalization performance, provides convincing evidence for ActiveUMI's effectiveness. The ablation study clearly highlights the value of the active perception component.
- Practical Implications of Mixed Training: The finding that a very small amount of teleoperated data (1%) can significantly boost performance when mixed with ActiveUMI data is a powerful insight. It suggests a practical, cost-effective strategy for training robust robot policies: leverage large volumes of scalable "in-the-wild" data from ActiveUMI, then fine-tune with minimal expert demonstrations. This could dramatically lower the barrier to entry for developing capable robot systems.
- Potential for Embodied AI: By allowing the robot to learn to control its own viewpoint, ActiveUMI contributes to the broader field of embodied AI, where agents interact with and perceive their environment in an active, intelligent manner. This could lead to more robust and adaptable robots in complex, unstructured environments.
- Critique - "PourWater" Discrepancy: The abstract mentions a "PourWater" task while the results tables use "Take Drink from Bag". This minor inconsistency might confuse a reader who expects to find "PourWater" in the results. While "Take Drink from Bag" is likely a similar manipulation task, clarifying this in the paper would improve precision.
- Open Question - Human Workload: While ActiveUMI improves data collection throughput, the mental and physical workload for the human operator to precisely demonstrate complex bimanual tasks while also deliberately controlling their head movements might still be substantial, particularly for very long or repetitive tasks. Further research could investigate ways to reduce this cognitive load while maintaining data quality.
- Untested Assumptions - Optimality of Human Head Movements: The system assumes that human head movements during a demonstration represent optimal or near-optimal visual attention strategies for the robot. While this is a reasonable heuristic, there might be situations where a robot's optimal viewing strategy differs due to its specific camera characteristics or kinematics.
Overall, ActiveUMI presents a compelling and well-supported solution to a critical problem in robot learning, demonstrating that explicit learning of active perception from human demonstrations is a powerful approach for developing generalizable and highly capable robotic manipulation skills.