
OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

Published: 10/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents OmniRetarget, an engine addressing the embodiment gap in humanoid robots by preserving key interactive relationships. It generates high-quality trajectories for reinforcement learning through an interaction mesh, enabling complex task execution for durations up to 30 seconds.

Abstract

A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8-hour trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

1.2. Authors

Lujie Yang (Amazon FAR, MIT), Xiaoyu Huang (Amazon FAR, UC Berkeley), Zhen Wu (Amazon FAR), Angjoo Kanazawa (Amazon FAR, UC Berkeley), Pieter Abbeel (Amazon FAR, UC Berkeley), Carmelo Sferrazza (Amazon FAR), C. Karen Liu (Amazon FAR, Stanford), Rocky Duan (Amazon FAR), Guanya Shi (Amazon FAR, CMU).

1.3. Journal/Conference

This paper was published as a preprint on ArXiv (v2, September 2025). The authors are affiliated with top-tier institutions (MIT, UC Berkeley, Stanford, CMU) and Amazon’s Frontier AI & Robotics (FAR) division, which is known for high-impact research in humanoid robotics and deep reinforcement learning.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the "embodiment gap" in teaching humanoid robots complex skills. While human motion capture (MoCap) data is abundant, retargeting it to robots often results in physical errors like foot-skating or ignoring object interactions. The authors propose OmniRetarget, an engine that uses an "interaction mesh" to model and preserve spatial relationships between the robot, the environment, and objects. By optimizing a Laplacian deformation energy subject to hard physical constraints (like joint limits and collision avoidance), it generates high-quality kinematic references. These references allow reinforcement learning (RL) policies to learn complex tasks (like parkour and carrying objects) with very simple reward structures and successful zero-shot transfer to real hardware (Unitree G1).

2. Executive Summary

2.1. Background & Motivation

Teaching humanoid robots to move like humans usually involves motion retargeting—taking a human's recorded movements and mapping them onto a robot's skeleton. However, robots have different limb lengths, joint limits, and shapes compared to humans (the embodiment gap).

Current methods suffer from two main flaws:

  1. Physical Artifacts: They produce "foot-skating" (feet sliding on the floor when they should be still) and "penetration" (limbs clipping through objects or the floor).

  2. Interaction Loss: Most methods focus only on the body's joints and ignore the context. For example, if a human picks up a box, a standard retargeter might move the robot's hands to the right coordinates but fail to ensure the hands actually "pinch" the box or maintain the correct distance from the chest.

    OmniRetarget aims to solve this by explicitly modeling the spatial relationship between the agent and its surroundings, ensuring that if a human was touching a wall or holding an object, the robot does so as well.

2.2. Main Contributions / Findings

  • Interaction-Preserving Framework: Introduced a system that handles robot-object-terrain interactions simultaneously while enforcing hard physical constraints.

  • Data Augmentation Engine: A method to turn one human demonstration into hundreds of robot variations by parametrically changing object sizes, positions, and terrain heights.

  • Large-Scale Dataset: Generated over 8 hours of high-quality, retargeted loco-manipulation trajectories from existing datasets like OMOMO and LAFAN1.

  • Minimalist RL Training: Demonstrated that high-quality data removes the need for "reward engineering." They trained a Unitree G1 humanoid to perform 30-second complex sequences (carrying a chair, climbing, rolling) using only 5 basic reward terms.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, several core concepts must be defined:

  • Humanoid Whole-Body Loco-Manipulation: This refers to a robot using its entire body (arms, legs, torso) to move (locomotion) and interact with objects (manipulation) simultaneously.
  • Motion Retargeting: The process of translating a motion sequence from a source (human) to a target (robot) while preserving the "essence" of the movement.
  • Interaction Mesh: A concept from computer graphics where a "web" of connections is created between joints and external points (like the surface of a table). If the body moves, the "web" stretches or shrinks, helping maintain relative distances.
  • Laplacian Deformation: A mathematical technique used in geometry processing. It focuses on the relative position of a point to its neighbors rather than its absolute global coordinates. This is key for preserving shapes and contact points.
  • Hard Constraints vs. Soft Penalties: In optimization, a hard constraint is a rule that must be followed (e.g., "the foot must not go below the floor"). A soft penalty is a suggestion (e.g., "try not to go below the floor, but it's okay if you do, it just costs you points"). OmniRetarget uses hard constraints for physical realism.
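The distinction can be made concrete with a toy optimization in SciPy (an illustration of the general idea, not code from the paper): a hard constraint forbids floor penetration outright, while a soft penalty only discourages it and leaves a residual violation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: pull a foot height z toward -0.05 (below the floor).
objective = lambda z: (z[0] + 0.05) ** 2

# Hard constraint: z >= 0 must hold exactly at the solution.
hard = minimize(objective, x0=[0.1], method="SLSQP",
                constraints=[{"type": "ineq", "fun": lambda z: z[0]}])

# Soft penalty: violating z >= 0 merely costs extra, so the optimizer
# trades a little penetration for a lower objective.
w = 1.0
soft = minimize(lambda z: objective(z) + w * max(0.0, -z[0]) ** 2,
                x0=[0.1], method="Nelder-Mead")

print(hard.x[0])  # ~0.0: the floor is respected exactly
print(soft.x[0])  # slightly negative: residual penetration remains
```

With the penalty weight `w = 1`, the soft optimum sits at z ≈ -0.025: the penalty only balances against the objective instead of overriding it, which is exactly the failure mode OmniRetarget avoids by using hard constraints.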

3.2. Previous Works

The authors compare OmniRetarget against three major paradigms:

  1. PHC (Perpetual Humanoid Control): A real-time method that uses keypoint matching but often suffers from limb penetration and lacks scene awareness.
  2. GMR (General Motion Retargeting): An optimization-based method that matches positions and orientations but doesn't explicitly handle environment interactions.
  3. VideoMimic: A recent framework that retargets from videos using soft penalties, which can be difficult to tune and often fails to prevent artifacts like foot-skating.

3.3. Technological Evolution

Historically, retargeting was a graphics problem (making characters in movies move). Robotics added the requirement of physics. Early robot retargeting used Inverse Kinematics (IK) to place hands and feet. OmniRetarget represents the next step: moving from "keypoint placement" to "interaction preservation," treating the robot and its environment as a single interconnected system.


4. Methodology

4.1. Principles

The core idea of OmniRetarget is that motion is not just about where the joints are, but how the joints relate to the environment. The theoretical basis is the Interaction Mesh. By defining a volumetric mesh that connects the robot to the terrain and objects, the system can use Laplacian coordinates to ensure that the "local neighborhood" of every contact point remains consistent between the human and the robot.

4.2. Core Methodology In-depth (Layer by Layer)

The following figure (Figure 2 from the original paper) shows the system architecture:

Fig. 2: OmniRetarget overview. Human demonstrations (e.g., from LAFAN1 and OMOMO) are retargeted to the robot via interaction-mesh-based optimization and used to train RL policies that transfer to real-world humanoids. The schematic shows the workflow — human motion retargeting, interaction-mesh pairing, and RL training — with the mesh-matching objective $\min \|L_{source} - L_{target}\|^2$.

4.2.1. Interaction Mesh Construction

First, the system identifies key points: robot joints, points on the surface of an object, and points on the terrain. It then uses Delaunay tetrahedralization to create a mesh. This creates a network of connections (edges) between these points.
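A minimal sketch of this construction using SciPy's Delaunay triangulation (for 3-D input it produces tetrahedra); the three keypoint sets here are random stand-ins, not the paper's actual sampling scheme:

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)

# Hypothetical keypoints: robot joints, object-surface samples, terrain points.
robot_pts   = rng.uniform(0.0, 1.0, size=(8, 3))
object_pts  = rng.uniform(1.0, 2.0, size=(6, 3))
terrain_pts = rng.uniform(-1.0, 0.0, size=(6, 3))
points = np.vstack([robot_pts, object_pts, terrain_pts])

# Delaunay tetrahedralization over all keypoints.
tet = Delaunay(points)

# Extract the undirected edge set: each tetrahedron contributes 6 edges.
edges = set()
for simplex in tet.simplices:
    for a in range(4):
        for b in range(a + 1, 4):
            edges.add(tuple(sorted((simplex[a], simplex[b]))))

# Neighbor lists N(i), used later for the Laplacian coordinates.
neighbors = {i: set() for i in range(len(points))}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

print(len(edges), "edges connect", len(points), "keypoints")
```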

4.2.2. The Laplacian Coordinate Objective

The "essence" of the interaction is captured by the Laplacian coordinate. For any point $p_{t,i}$ (the $i$-th keypoint at time $t$), the Laplacian coordinate $L(p_{t,i})$ is defined as:

$$L(p_{t,i}) = p_{t,i} - \sum_{j \in \mathcal{N}(i)} w_{ij} \, p_{t,j}$$

Where:

  • $p_{t,i}$ is the coordinate of the point.

  • $\mathcal{N}(i)$ is the set of neighbors connected to point $i$ in the mesh.

  • $w_{ij}$ is the weight of the connection (the authors use uniform weights, $w_{ij} = 1/|\mathcal{N}(i)|$).

    The primary goal is to minimize the Laplacian deformation energy $E_L$, which measures the difference between the human's relative spatial layout and the robot's:

$$E_{L} = \sum_{i} \| L(p_{t,i}^{source}) - L(p_{t,i}^{target}) \|^{2}$$
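The Laplacian coordinate and deformation energy can be sketched in a few lines of NumPy; the fully connected 4-point mesh below is a hypothetical toy, not the paper's interaction mesh:

```python
import numpy as np

def laplacian_coords(points, neighbors):
    """L(p_i) = p_i minus the mean of its mesh neighbors (uniform w_ij = 1/|N(i)|)."""
    L = np.empty_like(points)
    for i, nbrs in neighbors.items():
        L[i] = points[i] - points[list(nbrs)].mean(axis=0)
    return L

def deformation_energy(src_pts, tgt_pts, neighbors):
    """E_L: summed squared difference of source and target Laplacian coordinates."""
    d = laplacian_coords(src_pts, neighbors) - laplacian_coords(tgt_pts, neighbors)
    return float((d ** 2).sum())

# Tiny mesh: 4 keypoints, fully connected.
neighbors = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
src = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])

# A rigid translation preserves every Laplacian coordinate, so E_L stays 0:
print(deformation_energy(src, src + np.array([5., -2, 3]), neighbors) < 1e-12)  # True

# Stretching one vertex deforms the local layout, so E_L grows:
tgt = src.copy(); tgt[3, 2] = 2.0
print(deformation_energy(src, tgt, neighbors) > 0)  # True
```

This translation invariance is why Laplacian coordinates preserve relative layout (contacts, clearances) rather than absolute positions.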

4.2.3. Optimization with Hard Constraints

To find the robot's configuration $q_t$ (which includes joint angles and the base pose), the system solves a constrained optimization problem for every frame $t$:

$$q_{t}^{\star} = \arg \min_{q_{t}} \sum_{i} \| L(p_{t,i}^{source}) - L(p_{t,i}^{target}(q_{t})) \|^{2} + \| q_{t} - q_{t-1} \|_{Q}^{2}$$

Subject to:

  1. Collision Avoidance: $\phi_{j}(q_{t}) \geq 0, \forall j$ (where $\phi_{j}$ is the Signed Distance Function).

  2. Joint Limits: $q_{min} \leq q_{t} \leq q_{max}$.

  3. Velocity Limits: $\nu_{min} \cdot dt \leq q_{t} - q_{t-1} \leq \nu_{max} \cdot dt$.

  4. Foot Stance Stability: $p_{t}^{F} = p_{t-1}^{F}$ (if the foot is supposed to be stationary, its position must not change).

    This is solved using a Sequential Quadratic Programming (SQP) solver. This means the complex, non-linear problem is broken down into simpler quadratic problems that are solved iteratively.
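The structure of this per-frame problem can be illustrated with SciPy's SLSQP (itself an SQP-type method) on a hypothetical planar 2-link arm; plain keypoint matching stands in for the Laplacian energy, and all dimensions and limits are illustrative, not the paper's G1 model:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 2-link arm: q = (q1, q2) joint angles; fk() returns the
# elbow and hand keypoints that enter the objective.
L1, L2 = 0.5, 0.4

def fk(q):
    elbow = np.array([L1 * np.cos(q[0]), L1 * np.sin(q[0])])
    hand = elbow + np.array([L2 * np.cos(q[0] + q[1]), L2 * np.sin(q[0] + q[1])])
    return elbow, hand

def solve_frame(kp_target, q_prev, dt=1/30, v_max=3.0, Q=0.1):
    # Objective: keypoint-matching term (stand-in for the Laplacian energy)
    # plus the smoothness regularizer ||q_t - q_{t-1}||_Q^2.
    def cost(q):
        _, hand = fk(q)
        return np.sum((hand - kp_target) ** 2) + Q * np.sum((q - q_prev) ** 2)

    cons = [
        # Hard constraint 1: hand stays above the floor (phi(q) >= 0).
        {"type": "ineq", "fun": lambda q: fk(q)[1][1]},
        # Hard constraint 3: velocity limits |q_t - q_{t-1}| <= v_max * dt.
        {"type": "ineq", "fun": lambda q: v_max * dt - np.abs(q - q_prev)},
    ]
    # Hard constraint 2: joint limits q_min <= q <= q_max, passed as bounds.
    bounds = [(-np.pi, np.pi), (-2.5, 2.5)]
    res = minimize(cost, q_prev, method="SLSQP", bounds=bounds, constraints=cons)
    return res.x

# Solve two consecutive frames, warm-starting each from the previous solution.
q = np.array([0.3, 0.5])
for target in [np.array([0.6, 0.4]), np.array([0.6, 0.3])]:
    q = solve_frame(target, q)
    print(np.round(q, 3), "hand:", np.round(fk(q)[1], 3))
```

Warm-starting each frame from the previous solution mirrors the sequential, per-frame structure described above.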

4.2.4. Systematic Data Augmentation

OmniRetarget can generate new data by perturbing the target mesh $\mathcal{P}_{t}^{target}$.

  • Object Augmentation: They apply translations $\Delta p_{obj}$ and rotations $\Delta \theta_{obj}$ to the object.

  • Anchoring: To ensure the robot doesn't just "teleport" with the object, they add a penalty $\| q_{t} - \bar{q}_{t}^{\star} \|_{W}$ to keep the lower body close to the original path while the upper body adapts to the new object position.
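A minimal sketch of the object-augmentation step, assuming a simple yaw-plus-translation perturbation about the object centroid (the sampling ranges and keypoints are illustrative, not the paper's; the anchoring penalty would enter the downstream re-optimization, not this sampling step):

```python
import numpy as np

def augment_object(obj_pts, dp, dtheta):
    """Rotate the object keypoints by yaw dtheta about their centroid,
    then translate by dp — one sample of the augmentation family."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    centroid = obj_pts.mean(axis=0)
    return (obj_pts - centroid) @ R.T + centroid + dp

rng = np.random.default_rng(7)
# Hypothetical box keypoints in front of the robot at chest height.
obj_pts = rng.uniform(-0.2, 0.2, size=(6, 3)) + np.array([0.5, 0.0, 0.8])

# Turn one demonstration's object pose into many perturbed configurations.
variants = [augment_object(obj_pts,
                           dp=rng.uniform(-0.1, 0.1, size=3),
                           dtheta=rng.uniform(-0.3, 0.3))
            for _ in range(100)]
print(len(variants), "augmented object configurations from one demo")
```

Because each variant is a rigid transform, the object's own geometry is preserved; only its pose relative to the robot and terrain changes, which is what the interaction mesh then re-solves for.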


5. Experimental Setup

5.1. Datasets

The authors used three primary sources:

  1. OMOMO: A dataset of humans interacting with objects (e.g., picking up boxes).

  2. LAFAN1: A high-quality locomotion dataset (walking, running, turning).

  3. In-House MoCap: Custom recordings for terrain-specific tasks like climbing high platforms or crawling.

    The following figure (Figure 3 from the original paper) shows cross-embodiment interactions:

    Fig. 3: Cross-embodiment robot-object-terrain interaction. The schematic shows four parts: object carrying by the H1 and T1 robots (a, b) and platform climbing by the H1 and T1 robots (c, d), illustrating cross-embodiment interaction with objects and terrain.

5.2. Evaluation Metrics

5.2.1. Penetration

  1. Conceptual Definition: Measures how much the robot's body parts "clip" into objects or the environment.
  2. Mathematical Formula: $\text{Penetration} = \frac{1}{T} \int_{0}^{T} \mathbb{1}(\phi(q_{t}) < 0) \, dt$
  3. Symbol Explanation: $T$ is total time, $\mathbb{1}$ is an indicator function (1 if true, 0 if false), and $\phi(q_{t})$ is the signed distance to the nearest obstacle.

5.2.2. Foot Skating

  1. Conceptual Definition: Quantifies the unintended sliding of a foot that is supposed to be in contact with the ground.
  2. Mathematical Formula: $V_{skate} = \| \dot{p}_{t}^{F} \|$ for all $t \in \text{Stance}$
  3. Symbol Explanation: $\dot{p}_{t}^{F}$ is the velocity of the foot during the stance phase.

5.2.3. Contact Preservation

  1. Conceptual Definition: Measures the percentage of time a desired contact (e.g., hand on box) is maintained.

5.3. Baselines

  • PHC: Keypoint-based trajectory optimization.

  • GMR: Inverse Kinematics with position and orientation matching.

  • VideoMimic: Video-based retargeting using soft physical penalties.


6. Results & Analysis

6.1. Core Results Analysis

OmniRetarget significantly reduced physical artifacts. While baselines like PHC and GMR frequently showed feet sliding or hands penetrating deep into boxes, OmniRetarget achieved zero foot-skating and negligible penetration (< 1.4 cm). This physical cleanliness directly led to higher RL success rates (82.2% for object tasks vs. 71.3% for PHC).

6.2. Data Presentation (Tables)

The following are the results from Table II of the original paper:

Robot-Object Interaction (Retargeting from the OMOMO Dataset):

| Method | Penetration Duration ↓ | Penetration Max Depth (cm) ↓ | Foot Skating Duration ↓ | Foot Skating Max Vel. (cm/s) ↓ | Contact Preservation Duration ↑ | Downstream RL Success Rate ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| PHC [10] | 0.68 ± 0.21 | 5.11 ± 3.09 | 0.05 ± 0.05 | 1.40 ± 0.80 | 0.96 ± 0.09 | 71.28% ± 22.55% |
| GMR [9] | 0.83 ± 0.14 | 8.50 ± 3.94 | 0.02 ± 0.01 | 1.46 ± 0.45 | 0.99 ± 0.04 | 50.83% ± 23.89% |
| VideoMimic [11] | 0.60 ± 0.27 | 7.48 ± 4.95 | 0.12 ± 0.07 | 1.50 ± 0.70 | 0.77 ± 0.25 | 3.85% ± 8.41% |
| OMNIRETARGET | 0.00 ± 0.01 | 1.34 ± 0.34 | 0 | 0 | 0.96 ± 0.09 | 82.20% ± 9.74% |

Robot-Terrain Interaction (Retargeting from the In-House MoCap Dataset):

| Method | Penetration Duration ↓ | Penetration Max Depth (cm) ↓ | Foot Skating Duration ↓ | Foot Skating Max Vel. (cm/s) ↓ | Contact Preservation Duration ↑ | Downstream RL Success Rate ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| PHC | 0.66 ± 0.36 | 7.74 ± 4.53 | 0.15 ± 0.04 | 2.03 ± 1.83 | 0.45 ± 0.28 | 52.63% ± 49.93% |
| GMR | 0.91 ± 0.16 | 5.72 ± 3.84 | 0.04 ± 0.05 | 1.75 ± 3.01 | 0.67 ± 0.26 | 78.94% ± 40.77% |
| VideoMimic | 0.83 ± 0.11 | 5.97 ± 3.58 | 0.14 ± 0.05 | 1.85 ± 1.38 | 0.47 ± 0.25 | 51.75% ± 49.23% |
| OMNIRETARGET | 0.01 ± 0.02 | 1.37 ± 0.18 | 0 | 0 | 0.72 ± 0.19 | 94.73% ± 22.33% |

6.3. Ablation Studies / Parameter Analysis

The authors showed that the Interaction Mesh is the critical component. Without it, the robot might follow the human's joints but lose the "grasp" on the object. They also demonstrated that using hard constraints for foot contact is what eliminates foot-skating, a common issue in RL-trained humanoids that leads to poor real-world stability.


7. Conclusion & Reflections

7.1. Conclusion Summary

OmniRetarget proves that the quality of reference data is more important than the complexity of the RL reward function. By using an interaction-aware mesh and hard kinematic constraints, they generated a dataset that allows robots to perform high-dynamic tasks (wall-flips, rolling, climbing) that were previously very difficult to train.

7.2. Limitations & Future Work

  • Static Interaction Mesh: The mesh structure is currently fixed for a specific task. Dynamic changes (e.g., catching a falling object) might require a mesh that updates its topology in real-time.
  • Geometry Dependency: The system requires accurate 3D models of the objects and terrain. Future work could integrate this with real-time vision (computer vision) to handle unknown environments.
  • Sequential Optimization: Frame-by-frame optimization can occasionally lead to jerky movements if the human data is very noisy.

7.3. Personal Insights & Critique

The paper makes a compelling case for "principled data generation." In the AI era, there is a tendency to throw more compute and complex rewards at a problem. OmniRetarget takes a step back and applies classical geometry processing (Laplacian meshes) to clean the data before the AI sees it. This "clean data" approach is likely the key to making humanoid robots more reliable in home and factory settings where precise interaction is non-negotiable. However, the reliance on a Sequential Quadratic Programming (SQP) solver might be computationally expensive for massive datasets, suggesting a need for even faster optimization techniques in the future.
