OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction
TL;DR Summary
The paper presents OmniRetarget, an engine addressing the embodiment gap in humanoid robots by preserving key interactive relationships. It generates high-quality trajectories for reinforcement learning through an interaction mesh, enabling complex task execution for durations of up to 30 seconds.
Abstract
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8-hour trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.
1. Bibliographic Information
1.1. Title
OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction
1.2. Authors
Lujie Yang (Amazon FAR, MIT), Xiaoyu Huang (Amazon FAR, UC Berkeley), Zhen Wu (Amazon FAR), Angjoo Kanazawa (Amazon FAR, UC Berkeley), Pieter Abbeel (Amazon FAR, UC Berkeley), Carmelo Sferrazza (Amazon FAR), C. Karen Liu (Amazon FAR, Stanford), Rocky Duan (Amazon FAR), Guanya Shi (Amazon FAR, CMU).
1.3. Journal/Conference
This paper was published as a preprint on ArXiv (v2, September 2025). The authors are affiliated with top-tier institutions (MIT, UC Berkeley, Stanford, CMU) and Amazon’s Frontier AI & Robotics (FAR) division, which is known for high-impact research in humanoid robotics and deep reinforcement learning.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses the "embodiment gap" in teaching humanoid robots complex skills. While human motion capture (MoCap) data is abundant, retargeting it to robots often results in physical errors like foot-skating or ignoring object interactions. The authors propose OmniRetarget, an engine that uses an "interaction mesh" to model and preserve spatial relationships between the robot, the environment, and objects. By optimizing a Laplacian deformation energy subject to hard physical constraints (like joint limits and collision avoidance), it generates high-quality kinematic references. These references allow reinforcement learning (RL) policies to learn complex tasks (like parkour and carrying objects) with very simple reward structures and successful zero-shot transfer to real hardware (Unitree G1).
1.6. Original Source Link
- PDF Link: https://arxiv.org/pdf/2509.26633v2.pdf
- Project Website: https://omniretarget.github.io
2. Executive Summary
2.1. Background & Motivation
Teaching humanoid robots to move like humans usually involves motion retargeting—taking a human's recorded movements and mapping them onto a robot's skeleton. However, robots have different limb lengths, joint limits, and shapes compared to humans (the embodiment gap).
Current methods suffer from two main flaws:
- Physical Artifacts: They produce "foot-skating" (feet sliding on the floor when they should be still) and "penetration" (limbs clipping through objects or the floor).
- Interaction Loss: Most methods focus only on the body's joints and ignore the context. For example, if a human picks up a box, a standard retargeter might move the robot's hands to the right coordinates but fail to ensure the hands actually "pinch" the box or maintain the correct distance from the chest.
OmniRetarget aims to solve this by explicitly modeling the spatial relationship between the agent and its surroundings, ensuring that if a human was touching a wall or holding an object, the robot does so as well.
2.2. Main Contributions / Findings
- Interaction-Preserving Framework: Introduced a system that handles robot-object-terrain interactions simultaneously while enforcing hard physical constraints.
- Data Augmentation Engine: A method to turn one human demonstration into hundreds of robot variations by parametrically changing object sizes, positions, and terrain heights.
- Large-Scale Dataset: Generated over 8 hours of high-quality, retargeted loco-manipulation trajectories from existing datasets like OMOMO and LAFAN1.
- Minimalist RL Training: Demonstrated that high-quality data removes the need for "reward engineering." They trained a Unitree G1 humanoid to perform complex 30-second sequences (carrying a chair, climbing, rolling) using only 5 basic reward terms.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, several core concepts must be defined:
- Humanoid Whole-Body Loco-Manipulation: This refers to a robot using its entire body (arms, legs, torso) to move (locomotion) and interact with objects (manipulation) simultaneously.
- Motion Retargeting: The process of translating a motion sequence from a source (human) to a target (robot) while preserving the "essence" of the movement.
- Interaction Mesh: A concept from computer graphics where a "web" of connections is created between joints and external points (like the surface of a table). If the body moves, the "web" stretches or shrinks, helping maintain relative distances.
- Laplacian Deformation: A mathematical technique used in geometry processing. It focuses on the relative position of a point to its neighbors rather than its absolute global coordinates. This is key for preserving shapes and contact points.
- Hard Constraints vs. Soft Penalties: In optimization, a hard constraint is a rule that must be followed (e.g., "the foot must not go below the floor"). A soft penalty is a suggestion (e.g., "try not to go below the floor, but it's okay if you do, it just costs you points"). OmniRetarget uses hard constraints for physical realism.
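The hard-versus-soft distinction can be made concrete with a toy optimization (a minimal sketch using `scipy.optimize.minimize`; the floor height, penalty weight, and target value are illustrative, not from the paper):

```python
from scipy.optimize import minimize

# Toy 1-D example: the optimizer wants to pull a foot toward a target
# height below the floor (z = -0.05), but the floor is at z = 0.
target = -0.05

# Soft penalty: going below the floor only "costs points", so the
# optimum can still sit slightly below z = 0 (residual penetration).
soft = minimize(lambda z: (z[0] - target) ** 2 + 10.0 * max(0.0, -z[0]) ** 2,
                x0=[0.1])

# Hard constraint: z >= 0 must hold at the returned solution.
hard = minimize(lambda z: (z[0] - target) ** 2, x0=[0.1],
                constraints=[{"type": "ineq", "fun": lambda z: z[0]}])

print(soft.x[0])  # slightly negative: the foot still penetrates
print(hard.x[0])  # clamped at the floor
```

Increasing the soft-penalty weight shrinks the violation but never eliminates it, which is the basic argument for hard constraints.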
3.2. Previous Works
The authors compare OmniRetarget against three major paradigms:
- PHC (Perpetual Humanoid Control): A real-time method that uses keypoint matching but often suffers from limb penetration and lacks scene awareness.
- GMR (General Motion Retargeting): An optimization-based method that matches positions and orientations but doesn't explicitly handle environment interactions.
- VideoMimic: A recent framework that retargets from videos using soft penalties, which can be difficult to tune and often fails to prevent artifacts like foot-skating.
3.3. Technological Evolution
Historically, retargeting was a graphics problem (making characters in movies move). Robotics added the requirement of physics. Early robot retargeting used Inverse Kinematics (IK) to place hands and feet. OmniRetarget represents the next step: moving from "keypoint placement" to "interaction preservation," treating the robot and its environment as a single interconnected system.
4. Methodology
4.1. Principles
The core idea of OmniRetarget is that motion is not just about where the joints are, but how the joints relate to the environment. The theoretical basis is the Interaction Mesh. By defining a volumetric mesh that connects the robot to the terrain and objects, the system can use Laplacian coordinates to ensure that the "local neighborhood" of every contact point remains consistent between the human and the robot.
4.2. Core Methodology In-depth (Layer by Layer)
The following figure (Figure 2 from the original paper) shows the system architecture:
Figure caption (translated): A schematic of the OmniRetarget workflow, covering the retargeting of human motion, interaction-mesh pairing, and reinforcement-learning training. The left side shows human motion data (e.g., from LAFAN1 and OMOMO); the right side shows the trained robot's execution and real-world interaction. The formula describes the mesh-matching optimization.
4.2.1. Interaction Mesh Construction
First, the system identifies key points: robot joints, points on the surface of an object, and points on the terrain. It then uses Delaunay tetrahedralization to create a mesh. This creates a network of connections (edges) between these points.
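A minimal sketch of this construction (random stand-in keypoints, not real data; `scipy.spatial.Delaunay` performs the tetrahedralization in 3-D, and the tetrahedra's edges become the mesh connections):

```python
import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
# Hypothetical keypoint sets (positions are placeholders for illustration):
points = np.vstack([
    rng.uniform(0.0, 1.0, size=(12, 3)),   # robot joint positions
    rng.uniform(1.0, 1.5, size=(8, 3)),    # samples on the object surface
    rng.uniform(-0.2, 0.0, size=(6, 3)),   # samples on the terrain
])

tet = Delaunay(points)         # Delaunay tetrahedralization
edges = set()
for simplex in tet.simplices:  # each simplex is 4 vertex indices
    for i, j in combinations(simplex, 2):
        edges.add((min(i, j), max(i, j)))  # undirected edge set

print(len(edges))  # connections of the interaction mesh
```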
4.2.2. The Laplacian Coordinate Objective
The "essence" of the interaction is captured by the Laplacian coordinate. For the $i$-th keypoint at time $t$, the Laplacian coordinate is defined as:

$$\delta_i^t = p_i^t - \sum_{j \in \mathcal{N}(i)} w_{ij}\, p_j^t$$

Where:
- $p_i^t$ is the coordinate of the point.
- $\mathcal{N}(i)$ is the set of neighbors connected to point $i$ in the mesh.
- $w_{ij}$ is the weight of the connection (the authors use uniform weights, $w_{ij} = 1/|\mathcal{N}(i)|$).

The primary goal is to minimize the Laplacian deformation energy $E_L$, which measures the difference between the human's relative spatial layout and the robot's:

$$E_L = \sum_{t}\sum_{i} \left\lVert \delta_i^{t,\text{robot}} - \delta_i^{t,\text{human}} \right\rVert^2$$
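To make this concrete, here is a small numerical sketch (a toy 4-keypoint mesh with hand-picked neighbor sets, not the paper's actual mesh), assuming uniform weights:

```python
import numpy as np

def laplacian_coords(points, neighbors):
    """Laplacian coordinate of each point: its position minus the mean of
    its mesh neighbors (uniform weights, i.e. w_ij = 1/|N(i)|)."""
    return np.array([p - points[nbrs].mean(axis=0)
                     for p, nbrs in zip(points, neighbors)])

# Toy interaction mesh: 4 keypoints with assumed neighbor sets.
neighbors = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]
human = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.]])
robot = human * 0.9 + 0.05  # a shorter-limbed, shifted embodiment

d_h = laplacian_coords(human, neighbors)
d_r = laplacian_coords(robot, neighbors)

# Deformation energy: squared differences of Laplacian coordinates.
E = np.sum(np.linalg.norm(d_r - d_h, axis=1) ** 2)
print(E)
```

Note that the uniform +0.05 translation contributes nothing to E: Laplacian coordinates are translation-invariant, so only the change in relative layout (the 0.9 scaling) is penalized.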
4.2.3. Optimization with Hard Constraints
To find the robot's configuration $q_t$ (which includes joint angles and the base pose), the system solves a constrained optimization problem for every frame $t$:

$$\min_{q_t} \; E_L(q_t)$$

Subject to:
- Collision Avoidance: $\phi(q_t) \geq 0$ (where $\phi$ is the Signed Distance Function).
- Joint Limits: $q_{\min} \leq q_t \leq q_{\max}$.
- Velocity Limits: $\lvert \dot{q}_t \rvert \leq \dot{q}_{\max}$.
- Foot Stance Stability: $p_{\text{foot}}^t = p_{\text{foot}}^{t-1}$ (if the foot is supposed to be stationary, its position must not change).
This is solved using a Sequential Quadratic Programming (SQP) solver. This means the complex, non-linear problem is broken down into simpler quadratic problems that are solved iteratively.
4.2.4. Systematic Data Augmentation
OmniRetarget can generate new data by perturbing the target mesh.
- Object Augmentation: They apply translations and rotations to the object.
- Anchoring: To ensure the robot doesn't just "teleport" with the object, they add a penalty to keep the lower body close to the original path while the upper body adapts to the new object position.
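A sketch of the augmentation loop (the sampling ranges, pose parameterization, and penalty weight are assumptions for illustration, not the paper's values):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(42)

def augment_object_pose(position, xy_range=0.2, yaw_range_deg=30.0):
    """Sample a perturbed object pose (planar translation + yaw)
    around the demonstrated pose."""
    dxy = rng.uniform(-xy_range, xy_range, size=2)
    dyaw = rng.uniform(-yaw_range_deg, yaw_range_deg)
    new_pos = position + np.array([dxy[0], dxy[1], 0.0])
    return new_pos, R.from_euler("z", dyaw, degrees=True)

def anchoring_penalty(lower_body_path, reference_path, weight=1.0):
    """Soft penalty keeping the lower-body path near the original demo."""
    return weight * np.sum((lower_body_path - reference_path) ** 2)

demo_obj = np.array([0.5, 0.0, 0.8])  # demonstrated object position
variants = [augment_object_pose(demo_obj) for _ in range(5)]
print(len(variants))  # one demonstration -> several object configurations
```

Each perturbed pose is then fed back through the same interaction-mesh optimization, with the anchoring penalty added to the objective.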
5. Experimental Setup
5.1. Datasets
The authors used three primary sources:
- OMOMO: A dataset of humans interacting with objects (e.g., picking up boxes).
- LAFAN1: A high-quality locomotion dataset (walking, running, turning).
- In-House MoCap: Custom recordings for terrain-specific tasks like climbing high platforms or crawling.
The following figure (Figure 3 from the original paper) shows cross-embodiment interactions:
Figure caption (translated): A schematic showing object carrying and platform climbing across different robots, in four panels: object carrying with the H1 and T1 (a, b) and climbing with the H1 and T1 (c, d), illustrating cross-embodiment robot-object and robot-terrain interaction.
5.2. Evaluation Metrics
5.2.1. Penetration
- Conceptual Definition: Measures how much the robot's body parts "clip" into objects or the environment.
- Mathematical Formula:
$$\text{Penetration} = \frac{1}{T} \int_{0}^{T} \mathbb{1}(\phi(q_{t}) < 0)\, dt$$
- Symbol Explanation: $T$ is total time, $\mathbb{1}(\cdot)$ is an indicator function (1 if true, 0 if false), and $\phi(q_t)$ is the signed distance to the nearest obstacle.
5.2.2. Foot Skating
- Conceptual Definition: Quantifies the unintended sliding of a foot that is supposed to be in contact with the ground.
- Mathematical Formula: $\dot{p}_{\text{foot}}^t = 0$ for all $t$ in the stance phase (the reported metric is the maximum stance-phase foot speed $\max_t \lVert \dot{p}_{\text{foot}}^t \rVert$).
- Symbol Explanation: $\dot{p}_{\text{foot}}^t$ is the velocity of the foot during the stance phase.
5.2.3. Contact Preservation
- Conceptual Definition: Measures the percentage of time a desired contact (e.g., hand on box) is maintained.
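The three metrics reduce to simple per-frame statistics; here is a sketch on made-up signals (array values and shapes are illustrative only, not measured data):

```python
import numpy as np

# Toy per-frame signals: signed distance to the nearest obstacle (m),
# foot speed during stance frames (m/s), and a desired-contact flag.
sdf = np.array([0.03, 0.01, -0.004, -0.002, 0.02])
foot_vel_stance = np.array([0.0, 0.004, 0.012, 0.002])
contact_ok = np.array([True, True, False, True, True])

penetration_duration = np.mean(sdf < 0)      # fraction of frames with phi < 0
max_depth_cm = max(0.0, -sdf.min()) * 100.0  # deepest violation, in cm
max_skate_vel = foot_vel_stance.max()        # peak stance-foot speed
contact_preservation = contact_ok.mean()     # fraction of frames in contact

print(penetration_duration, max_depth_cm, max_skate_vel, contact_preservation)
```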
5.3. Baselines
- PHC: Keypoint-based trajectory optimization.
- GMR: Inverse Kinematics with position and orientation matching.
- VideoMimic: Video-based retargeting using soft physical penalties.
6. Results & Analysis
6.1. Core Results Analysis
OmniRetarget significantly reduced physical artifacts. While baselines like PHC and GMR frequently showed feet sliding or hands penetrating deep into boxes, OmniRetarget achieved zero foot-skating and negligible penetration (maximum depth around 1.3 cm, versus 5-8.5 cm for the baselines). This physical cleanliness directly led to higher RL success rates (82.20% vs. 71.28% for PHC on object-interaction tasks).
6.2. Data Presentation (Tables)
The following are the results from Table II of the original paper:
| Method | Penetration Duration ↓ | Penetration Max Depth (cm) ↓ | Foot Skating Duration ↓ | Foot Skating Max Vel. (cm/s) ↓ | Contact Preservation Duration ↑ | RL Success Rate ↑ |
|---|---|---|---|---|---|---|
| Robot-Object Interaction (Retargeting from the OMOMO Dataset) | ||||||
| PHC [10] | 0.68 ± 0.21 | 5.11 ± 3.09 | 0.05 ± 0.05 | 1.40 ± 0.80 | 0.96 ± 0.09 | 71.28% ± 22.55% |
| GMR [9] | 0.83 ± 0.14 | 8.50 ± 3.94 | 0.02 ± 0.01 | 1.46 ± 0.45 | 0.99 ± 0.04 | 50.83% ± 23.89% |
| VideoMimic [11] | 0.60 ± 0.27 | 7.48 ± 4.95 | 0.12 ± 0.07 | 1.50 ± 0.70 | 0.77 ± 0.25 | 3.85% ± 8.41% |
| OMNIRETARGET | 0.00 ± 0.01 | 1.34 ± 0.34 | 0 | 0 | 0.96 ± 0.09 | 82.20% ± 9.74% |
| Robot-Terrain Interaction (Retargeting from the In-House MoCap Dataset) | ||||||
| PHC | 0.66 ± 0.36 | 7.74 ± 4.53 | 0.15 ± 0.04 | 2.03 ± 1.83 | 0.45 ± 0.28 | 52.63% ± 49.93% |
| GMR | 0.91 ± 0.16 | 5.72 ± 3.84 | 0.04 ± 0.05 | 1.75 ± 3.01 | 0.67 ± 0.26 | 78.94% ± 40.77% |
| VideoMimic | 0.83 ± 0.11 | 5.97 ± 3.58 | 0.14 ± 0.05 | 1.85 ± 1.38 | 0.47 ± 0.25 | 51.75% ± 49.23% |
| OMNIRETARGET | 0.01 ± 0.02 | 1.37 ± 0.18 | 0 | 0 | 0.72 ± 0.19 | 94.73% ± 22.33% |
6.3. Ablation Studies / Parameter Analysis
The authors showed that the Interaction Mesh is the critical component. Without it, the robot might follow the human's joints but lose the "grasp" on the object. They also demonstrated that using hard constraints for foot contact is what eliminates foot-skating, a common issue in RL-trained humanoids that leads to poor real-world stability.
7. Conclusion & Reflections
7.1. Conclusion Summary
OmniRetarget proves that the quality of reference data is more important than the complexity of the RL reward function. By using an interaction-aware mesh and hard kinematic constraints, they generated a dataset that allows robots to perform high-dynamic tasks (wall-flips, rolling, climbing) that were previously very difficult to train.
7.2. Limitations & Future Work
- Static Interaction Mesh: The mesh structure is currently fixed for a specific task. Dynamic changes (e.g., catching a falling object) might require a mesh that updates its topology in real-time.
- Geometry Dependency: The system requires accurate 3D models of the objects and terrain. Future work could integrate this with real-time vision (computer vision) to handle unknown environments.
- Sequential Optimization: Frame-by-frame optimization can occasionally lead to jerky movements if the human data is very noisy.
7.3. Personal Insights & Critique
The paper makes a compelling case for "principled data generation." In the AI era, there is a tendency to throw more compute and complex rewards at a problem. OmniRetarget takes a step back and applies classical geometry processing (Laplacian meshes) to clean the data before the AI sees it. This "clean data" approach is likely the key to making humanoid robots more reliable in home and factory settings where precise interaction is non-negotiable. However, the reliance on a Sequential Quadratic Programming (SQP) solver might be computationally expensive for massive datasets, suggesting a need for even faster optimization techniques in the future.