Paper status: completed

Retargeting Matters: General Motion Retargeting for Humanoid Motion Tracking

Published: 10/03/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work proposes General Motion Retargeting (GMR) to reduce artifacts in humanoid motion tracking. GMR improves retargeting quality and policy robustness without extensive reward tuning, outperforming existing open-source methods in fidelity and success rate.

Abstract

Humanoid motion tracking policies are central to building teleoperation pipelines and hierarchical controllers, yet they face a fundamental challenge: the embodiment gap between humans and humanoid robots. Current approaches address this gap by retargeting human motion data to humanoid embodiments and then training reinforcement learning (RL) policies to imitate these reference trajectories. However, artifacts introduced during retargeting, such as foot sliding, self-penetration, and physically infeasible motion are often left in the reference trajectories for the RL policy to correct. While prior work has demonstrated motion tracking abilities, they often require extensive reward engineering and domain randomization to succeed. In this paper, we systematically evaluate how retargeting quality affects policy performance when excessive reward tuning is suppressed. To address issues that we identify with existing retargeting methods, we propose a new retargeting method, General Motion Retargeting (GMR). We evaluate GMR alongside two open-source retargeters, PHC and ProtoMotions, as well as with a high-quality closed-source dataset from Unitree. Using BeyondMimic for policy training, we isolate retargeting effects without reward tuning. Our experiments on a diverse subset of the LAFAN1 dataset reveal that while most motions can be tracked, artifacts in retargeted data significantly reduce policy robustness, particularly for dynamic or long sequences. GMR consistently outperforms existing open-source methods in both tracking performance and faithfulness to the source motion, achieving perceptual fidelity and policy success rates close to the closed-source baseline. Website: https://jaraujo98.github.io/retargeting_matters. Code: https://github.com/YanjieZe/GMR.


1. Bibliographic Information

1.1. Title

Retargeting Matters: General Motion Retargeting for Humanoid Motion Tracking

1.2. Authors

João Pedro Araújo†, Yanjie Ze†, Pei Xu†, Jiajun Wu*, C. Karen Liu*. Affiliation: Stanford University.

1.3. Journal/Conference

This paper is published as a preprint on arXiv (arXiv:2510.02252v1). While it has not yet been peer-reviewed for a journal or conference, the authors are affiliated with Stanford University, a highly reputable institution in computer science, robotics, and artificial intelligence, suggesting strong academic backing and potential for significant impact in the field.

1.4. Publication Year

2025

1.5. Abstract

This paper addresses the embodiment gap between humans and humanoid robots in motion tracking for teleoperation and hierarchical control. Current approaches retarget human motion data to robots, then train Reinforcement Learning (RL) policies. However, artifacts like foot sliding, self-penetration, and physically infeasible motion often remain in these retargeted reference trajectories, forcing RL policies to correct them, often requiring extensive reward engineering and domain randomization. The authors systematically evaluate how retargeting quality impacts policy performance when excessive reward tuning is suppressed. To mitigate issues in existing methods, they propose General Motion Retargeting (GMR). They compare GMR with two open-source retargeters (PHC and ProtoMotions) and a high-quality closed-source dataset from Unitree. Using BeyondMimic for policy training, they isolate retargeting effects without reward tuning. Experiments on a diverse subset of the LAFAN1 dataset show that while most motions can be tracked, artifacts significantly reduce policy robustness, especially for dynamic or long sequences. GMR consistently outperforms open-source methods in tracking performance and faithfulness to source motion, achieving perceptual fidelity and policy success rates comparable to the closed-source baseline.

https://arxiv.org/abs/2510.02252v1 (Preprint)

https://arxiv.org/pdf/2510.02252v1.pdf

2. Executive Summary

2.1. Background & Motivation

The paper tackles a fundamental challenge in humanoid robotics: building effective teleoperation pipelines and hierarchical controllers that rely on humanoid motion tracking. The core problem lies in the embodiment gap—significant morphological, kinematic, and dynamic differences—between humans and humanoid robots. Existing methods address this by kinematically retargeting human motion data to robotic embodiments, then training Reinforcement Learning (RL) policies to imitate these reference trajectories.

The primary motivation for this research stems from a critical oversight in current practices: the presence of artifacts in retargeted data. These artifacts, such as foot sliding, ground penetration, self-penetration, and physically infeasible motion, are often passed directly to RL policies. While prior work has shown that policies can be trained on such data, it typically demands extensive reward engineering and domain randomization to compensate for these imperfections. The authors hypothesize that without these extensive engineering efforts, the quality of the retargeted motion plays a much more significant role in policy performance and robustness. There is a clear gap in understanding the direct impact of retargeting quality on RL policy learning when reward tuning is suppressed.

2.2. Main Contributions / Findings

The paper makes several key contributions:

  • Systematic Evaluation of Retargeting Quality: It conducts a rigorous and systematic evaluation of how retargeting quality affects RL policy performance in motion tracking tasks, specifically when excessive reward tuning and domain randomization are suppressed. This isolates the impact of the retargeting process itself.
  • Proposal of General Motion Retargeting (GMR): The paper introduces a new retargeting method, GMR, designed to address the specific shortcomings (deviation from source motion, foot sliding, ground penetrations, self-intersections) identified in existing open-source retargeters like PHC and ProtoMotions. GMR employs a novel non-uniform local scaling procedure followed by a two-stage optimization.
  • Identification of Critical Retargeting Artifacts: The research identifies specific types of artifacts (physically inconsistent height, self-intersections, and sudden jumps in joint values) that significantly reduce policy robustness and can make certain motions unlearnable without substantial reward engineering.
  • Demonstration of GMR's Superior Performance: Through extensive experiments on a diverse subset of the LAFAN1 dataset, GMR consistently outperforms existing open-source methods in both tracking performance (lower errors) and faithfulness to the source motion (confirmed by a user study). GMR achieves perceptual fidelity and policy success rates that are close to a high-quality, closed-source baseline dataset from Unitree.
  • Emphasis on Initial Frame Stability: The paper highlights the importance of the initial frame of the reference motion for policy success, recommending that both start and end poses be stable for safe policy deployment.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the contributions of this paper, a foundational understanding of several key concepts is necessary, particularly in robotics, computer graphics, and machine learning.

  • Humanoid Motion Tracking: This refers to the task of enabling a humanoid robot to replicate or imitate a given human motion. The goal is for the robot's movements (e.g., walking, running, dancing) to closely match those of a human, considering the robot's physical constraints. This is crucial for applications like teleoperation (controlling a robot remotely) and hierarchical control (breaking down complex tasks into simpler, manageable sub-tasks).

  • Embodiment Gap: This term describes the inherent differences between human bodies and humanoid robot bodies. These differences include variations in bone length, joint range of motion, kinematic structure (how joints are connected), body shape, mass distribution, and actuation mechanisms (how joints are moved). Overcoming this gap is central to successfully transferring human motions to robots.

  • Kinematic Retargeting: This is a data processing technique used to adapt a motion from a source character (e.g., a human) to a target character (e.g., a humanoid robot) that may have a different skeleton or morphology. The process involves mapping the joint positions, orientations, or end-effector trajectories of the source to the target, while respecting the target's physical constraints and maintaining the perceptual characteristics of the original motion.

  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal.

    • Policy: The policy in RL is the agent's strategy, mapping observed states of the environment to actions to be taken. In motion tracking, the policy learns to control the robot's joints to follow a reference motion.
    • Reward Engineering (or Reward Shaping): The process of designing the reward function that guides an RL agent's learning. Well-designed rewards are crucial for successful learning, but complex tasks often require extensive, hand-tuned reward functions.
    • Domain Randomization: A technique used in RL to train policies in simulation that generalize better to the real world. It involves randomizing various physical parameters (e.g., friction, mass, sensor noise) within the simulation during training, making the policy robust to variations and uncertainties in the target environment.
  • SMPL (Skinned Multi-Person Linear Model): A widely used statistical 3D human body model in computer graphics. It represents human body shapes and poses using a low-dimensional parameter space.

    • It takes shape parameters ($\beta^h$) and pose parameters ($\theta^h$) as input.
    • It outputs the 3D locations of the vertices of a posed human body mesh ($\nu^h = f_{\mathrm{SMPL}}(\beta^h, \theta^h)$).
    • A joint regressor ($J$) is then used to derive 3D joint positions ($j^h = J \nu^h$) from the mesh vertices (see the sketch at the end of this list).
  • Inverse Kinematics (IK): In robotics and computer graphics, IK is the mathematical process of calculating the joint parameters (e.g., angles) of an articulated body (like a robot arm or human skeleton) that will achieve a desired position and orientation for a specified end-effector (e.g., a hand or foot). It's the inverse problem of forward kinematics.

  • Forward Kinematics (FK): The calculation of the position and orientation of the end-effector from the given joint parameters (angles) and link lengths of an articulated body.

  • LAFAN1 Dataset: A publicly available motion capture dataset consisting of a variety of human locomotion and expressive motions. It's often used as a source for reference motions in character animation and robotics research.

  • Unitree G1 Robot: A specific model of a humanoid robot, often used as a target embodiment in research for its dynamic capabilities.
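
To make the SMPL pipeline described above concrete, here is a minimal sketch of the joint-regression step. `f_smpl` and `J_regressor` are placeholders for the actual model (in practice loaded via, e.g., the smplx package); shapes follow the standard SMPL convention (6890 mesh vertices, 24 joints).

```python
import numpy as np

def joints_from_smpl(beta, theta, f_smpl, J_regressor):
    """Sketch of SMPL joint regression.

    beta:        (10,) shape parameters (beta^h)
    theta:       (72,) axis-angle pose parameters (theta^h)
    f_smpl:      callable, (beta, theta) -> (6890, 3) posed mesh vertices
    J_regressor: (24, 6890) linear joint regressor (J)
    returns:     (24, 3) 3D joint positions (j^h)
    """
    vertices = f_smpl(beta, theta)   # nu^h = f_SMPL(beta^h, theta^h)
    return J_regressor @ vertices    # j^h = J nu^h
```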

3.2. Previous Works

The paper contextualizes its work within a broad landscape of motion retargeting and humanoid control research.

  • Motion Retargeting in Computer Graphics:

    • Classic Methods [16, 17, 18, 19]: These often employ optimization-based approaches and rely on heuristically defined kinematic constraints to map motions. They are typically concerned with visual fidelity for character animation.
    • Data-driven Approaches [20, 21, 22, 23, 24, 25]: With the rise of deep learning, methods requiring paired data for supervised learning, semantic labels for unsupervised training, or language models for visual evaluation have emerged. However, the paper notes that the difficulty of acquiring paired data for real robots limits their direct application in humanoid control.
  • Motion Retargeting in Robotics:

    • Naïve Approaches [3, 5]: These directly copy joint rotations from human to robot. This often leads to severe artifacts like floating, foot penetrations, sliding, and end-effector drift due to topological and morphological differences. Additional processing is needed to convert $SO(3)$ joint spaces (common in humans) to revolute joints (common in robots).
    • Whole-Body Geometric Retargeting (WBGR) [35, 1]: These approaches use Inverse Kinematics (IK) to address joint space misalignment.
      • Vanilla WBGR: Ignores size differences and matches orientations of key links.
      • HumanMimic [36]: Solves IK for Cartesian position matching of key points, using manually defined scaling coefficients.
    • SMPL-based Retargeting (e.g., PHC [2, 14, 11, 38]): These methods leverage the SMPL body model. PHC fits the robot shape to an SMPL model, then calculates target 3D joint coordinates using the human pose parameters. An optimization then solves for robot root translation, orientation, and joint angles to minimize the position error between the posed robot and the SMPL-derived targets.
      • Critique: PHC uses gradient descent and forward kinematics, which can be time-consuming. Crucially, it does not take into account contact states during retargeting, leading to floating, foot sliding, and penetrations. SMPL is also designed for human bodies and may not represent robots with large morphological discrepancies well.
    • Differential IK Solvers (e.g., ProtoMotions [40], KungfuBot [8]): These methods scale Cartesian joint positions of the source motion and then calculate generalized velocities that, when integrated, reduce position and orientation errors.
      • ProtoMotions [40]: Uses global axis-aligned scaling factors and Mink [41] (a differential IK solver) to minimize a weighted sum of position and orientation errors between key bodies.
      • KungfuBot [8]: Uses the ProtoMotions approach but with scaling disabled.
    • Learning-based Retargeting for Humanoids [32, 33, 34]: Some works explore learning-based methods but often focus on simpler arm/upper-body motions or lack data for whole-body tasks.
  • Humanoid Control with RL:

    • Many recent works [2, 3, 4, 5, 7, 8, 6, 10, 15] use RL-based approaches to learn policies for humanoid control by imitating reference motions.
    • BeyondMimic [15]: This is the policy training pipeline used in the current paper for evaluation. It's highlighted as a method that works for various reference motions without extensive reward tuning or domain randomization, making it suitable for isolating retargeting effects.
    • Other influential works: VideoMimic [10] (visual imitation), Hub [7] (extreme balance), ASAP [11] (sim-to-real for agile skills).

3.3. Technological Evolution

The field of humanoid motion control has evolved from simple direct mapping and heuristic-based retargeting to sophisticated optimization-based and data-driven approaches. Early methods struggled with the embodiment gap, leading to unrealistic motion artifacts. The introduction of parametric body models like SMPL significantly improved the ability to represent human motion and apply it to robots, albeit with remaining challenges in contact consistency and physical feasibility.

The advent of Reinforcement Learning has enabled robots to learn complex, dynamic skills from these retargeted motions. However, this often came at the cost of extensive reward engineering and domain randomization to make policies robust enough to handle the artifacts inherent in retargeted data.

This paper represents a crucial step in this evolution by shifting focus back to the quality of the retargeting process itself. Instead of relying solely on RL to correct for poor retargeting, it seeks to improve the source data, thereby simplifying the RL training task and leading to more robust policies. GMR contributes to this by providing a more robust retargeting solution that explicitly addresses scaling and kinematic constraints to minimize artifacts.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of the GMR approach are:

  • Addressing Scaling Artifacts: GMR's primary differentiation lies in its non-uniform local scaling procedure (Step 3). Unlike PHC, which relies on SMPL for a global scaling that can introduce distortions when fitting non-humanoid robots, or ProtoMotions, which uses global axis-aligned scaling factors that might not account for local body part proportions, GMR allows custom local scale factors for different key bodies. This is crucial for accurately translating human proportions to a robot without introducing foot sliding or self-penetration. The emphasis on a uniform scaling factor for the root translation is a key insight for avoiding foot sliding.

  • Two-Stage Optimization for Robustness: GMR employs a two-stage optimization process (Steps 4 and 5) to solve the IK problem. The first stage focuses on end-effectors and body orientations to get a good initial guess, while the second stage fine-tunes by considering the positions of all key bodies. This sequential approach helps avoid local optimization minima and provides a more stable solution compared to single-stage optimizations. PHC uses gradient descent over all frames, which can be time-consuming and potentially less robust to local minima for complex poses. ProtoMotions uses differential IK but does not employ the two-stage initial-setup-then-fine-tuning scheme that GMR does.

  • Explicit Artifact Mitigation: GMR is explicitly designed to fix glaring artifacts like deviation from source motion, foot sliding, ground penetrations, and self-intersections, which are often a consequence of the scaling and IK methods used in PHC and ProtoMotions. The paper directly evaluates and highlights these artifacts in the competitor methods.

  • Independent Policy Training Environment: The use of BeyondMimic for policy training is a methodological strength. Since BeyondMimic is developed independently and does not require reward tuning or extensive domain randomization, it provides a fair and unbiased platform to evaluate the direct impact of retargeting quality, which is often masked by RL engineering efforts in other works.

    In summary, GMR differentiates itself by providing a more nuanced and robust kinematic retargeting solution that prioritizes artifact prevention through intelligent scaling and multi-stage optimization, leading to higher quality reference motions that are easier for RL policies to learn from, even without heavy reward engineering.

4. Methodology

The core of this paper lies in its proposed General Motion Retargeting (GMR) pipeline, designed to produce high-quality, artifact-free retargeted motions for humanoid robots. The methodology section also details how existing retargeting methods are applied and how policies are trained and evaluated.

4.1. Principles

The fundamental principle behind GMR is to generate reference motions that are as physically feasible and perceptually faithful as possible, thereby simplifying the task for Reinforcement Learning (RL) policies. The authors identify that many artifacts in retargeted motions (like foot sliding, ground penetration, self-intersection, physically infeasible motion) stem from inadequate handling of scaling and kinematic constraints. GMR addresses this through two main strategies:

  1. Non-Uniform Local Scaling: Instead of global scaling, GMR employs a flexible, non-uniform scaling procedure that accounts for specific differences in body part proportions between human and robot, especially for the root translation, to maintain contact and avoid large-scale distortions.
  2. Two-Stage Optimization for IK: It uses a robust, two-stage Inverse Kinematics (IK) optimization approach that first focuses on achieving crucial end-effector and orientation targets to provide a stable initial guess, then fine-tunes the entire robot pose with more comprehensive position and rotation constraints. This helps in avoiding local minima and producing more stable and accurate poses.

4.2. Core Methodology In-depth (Layer by Layer)

The General Motion Retargeting (GMR) pipeline consists of five sequential steps to transform source human motion into a target robot motion.

4.2.1. Step 1: Human-Robot Key Body Matching

The initial step involves establishing a correspondence between the relevant body parts (key bodies) of the source human skeleton and the target humanoid skeleton. This mapping, denoted as M\mathcal{M}, is user-defined and typically includes torso, head, legs, feet, arms, and hands. This mapping is crucial as it informs the subsequent Inverse Kinematics (IK) optimization problems, specifying which human body parts should be tracked by which robot body parts. The user can also assign weights for the position and orientation tracking errors of these matched key bodies, allowing for prioritization (e.g., feet and hands might have higher weights).

4.2.2. Step 2: Human-Robot Cartesian Space Rest Pose Alignment

This step aims to align the rest poses of the human and robot in Cartesian space before motion application. The orientations of the human bodies are offset to match the orientations of the corresponding robot bodies when both are in their respective rest poses. In some cases, a local offset to the position of a body might also be added. This pre-alignment helps to mitigate artifacts that arise from initial pose discrepancies, such as the toed-in artifact mentioned in prior work [2], ensuring a more natural starting configuration for the retargeted motion.
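
As a concrete illustration, the following is a minimal sketch of this rest-pose alignment, assuming per-body rest orientations are available as SciPy rotations: a fixed left-multiplying offset is computed once per matched pair so that, at rest, the offset human orientation coincides with the robot body orientation; the same offset is then applied to every motion frame. The application direction is an assumption for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def rest_pose_offset(R_human_rest: R, R_robot_rest: R) -> R:
    # Left-multiplying offset: R_off * R_human_rest == R_robot_rest
    return R_robot_rest * R_human_rest.inv()

def align_frame(R_off: R, R_human_t: R) -> R:
    # Apply the same fixed offset to every motion frame.
    return R_off * R_human_t

# Example: a human body at rest faces +X while the robot body faces +Y.
R_h = R.from_euler("z", 0, degrees=True)
R_r = R.from_euler("z", 90, degrees=True)
R_off = rest_pose_offset(R_h, R_r)
assert np.allclose(align_frame(R_off, R_h).as_matrix(), R_r.as_matrix())
```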

4.2.3. Step 3: Human Data Non-Uniform Local Scaling

This step is identified as critical for avoiding many artifacts found in other retargeting methods. The GMR approach to scaling is unique in its flexibility and local specificity. First, a general scaling factor is calculated based on the height ($h$) of the source human skeleton. This general factor is then used to adjust a set of custom local scale factors ($s_b$) defined for each key body $b$. This allows for non-uniform scaling, accommodating the fact that the human upper body might scale differently from the lower body when mapped to a robot.

The target body positions in Cartesian space ($\mathbf{p}_b^{\mathrm{target}}$) are computed using the following formula: $ \mathbf{p}_b^{\mathrm{target}} = \frac{h}{h_{\mathrm{ref}}} s_b (\mathbf{p}_j^{\mathrm{source}} - \mathbf{p}_{\mathrm{root}}^{\mathrm{source}}) + \frac{h}{h_{\mathrm{ref}}} s_{\mathrm{root}} \mathbf{p}_{\mathrm{root}}^{\mathrm{source}} $ Where:

  • $\mathbf{p}_b^{\mathrm{target}}$ is the target Cartesian position for body $b$ on the robot.

  • $h$ is the current height of the human source skeleton.

  • $h_{\mathrm{ref}}$ is a reference height used when defining the scaling factors (a baseline height relative to which the $s_b$ are set).

  • $s_b$ is the custom local scaling factor for body $b$.

  • $\mathbf{p}_j^{\mathrm{source}}$ is the Cartesian position of the human joint corresponding to body $b$.

  • $\mathbf{p}_{\mathrm{root}}^{\mathrm{source}}$ is the Cartesian position of the human root joint.

  • $s_{\mathrm{root}}$ is the scaling factor applied specifically to the root position.

    When the body $b$ in question is the root, the scaling equation simplifies to: $ \mathbf{p}_{\mathrm{root}}^{\mathrm{target}} = \frac{h}{h_{\mathrm{ref}}} s_{\mathrm{root}} \mathbf{p}_{\mathrm{root}}^{\mathrm{source}} $ The authors emphasize that scaling the root translation by a uniform scaling factor is crucial to avoid introducing foot sliding artifacts. This means the root's horizontal movement should be scaled uniformly, preserving its relative motion pattern.
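
A minimal sketch of this scaling rule, directly transcribing the equations above (names are illustrative; `s_b` would come from the user-defined per-body scale table):

```python
import numpy as np

def scale_body_position(p_src, p_root_src, s_b, s_root, h, h_ref):
    """p_src: (3,) human joint position; p_root_src: (3,) human root.

    Root-relative offsets get the local factor s_b; the root itself
    gets a single uniform factor s_root, which avoids foot sliding.
    """
    k = h / h_ref
    return k * s_b * (p_src - p_root_src) + k * s_root * p_root_src

# The root position reduces to the uniform case:
# p_root_target = (h / h_ref) * s_root * p_root_src
```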

4.2.4. Step 4: Solving Robot IK with Rotation Constraints

This is the first stage of a two-stage optimization process to find the robot's generalized coordinates ($\mathbf{q}$, which include root translation, root rotation, and joint values). This stage is designed to avoid local optimization minima by focusing on critical aspects first. Given a target pose, the following optimization problem is solved: $ \begin{array}{rl} \min_{\mathbf{q}} & \sum_{(i,j) \in \mathcal{M}} (w_1)_{i,j}^R \| R_i^h \ominus R_j(\mathbf{q}) \|_2^2 + \sum_{(i,j) \in \mathcal{M}_{\mathrm{ee}}} (w_1)_{i,j}^p \| \mathbf{p}_i^{\mathrm{target}} - \mathbf{p}_j(\mathbf{q}) \|_2^2 \\ \mathrm{subject~to} & \mathbf{q}^- \leq \mathbf{q} \leq \mathbf{q}^+ \end{array} $ Where:

  • $\mathbf{q}$ represents the robot's generalized coordinates (root translation, root rotation, and joint angles).

  • $R_i^h \in SO(3)$ is the orientation of human body $i$. $SO(3)$ denotes the special orthogonal group of 3x3 rotation matrices, which describes 3D rotations.

  • $\mathbf{p}_j(\mathbf{q})$ and $R_j(\mathbf{q}) \in SO(3)$ are the Cartesian position and orientation of robot body $j$, respectively, computed through forward kinematics given $\mathbf{q}$.

  • $R_i \ominus R_j$ denotes the orientation difference between $R_i$ and $R_j$ expressed in the Lie algebra. Specifically, $R_i \ominus R_j := \log(R_i^{-1} R_j) \in \mathfrak{so}(3)$, where the logarithmic map (the inverse of the exponential map) converts the relative rotation into a 3-vector representing the magnitude and axis of the rotational error. $\| \cdot \|_2^2$ then measures the squared Euclidean norm of this error.

  • $\mathcal{M}$ is the full set of human-robot key body matches defined in Step 1.

  • $\mathcal{M}_{\mathrm{ee}}$ is a subset of $\mathcal{M}$ containing only the end-effectors (hands and feet), which are often prioritized for accurate tracking.

  • $(w_1)_{i,j}^p$ and $(w_1)_{i,j}^R$ are the weights for the position and rotation errors, respectively, in this first optimization stage. These weights allow for prioritizing certain body parts or types of errors.

  • $\mathbf{q}^-$ and $\mathbf{q}^+$ are the minimum and maximum joint limits of the robot, enforcing physical constraints.

    The root position and orientation components of $\mathbf{q}$ are initialized using the scaled position $\mathbf{p}_{\mathrm{root}}^{\mathrm{target}}$ (from Step 3) and the yaw component of the human root key body orientation.

This problem is solved using Mink [41], a differential IK solver. Instead of directly finding $\mathbf{q}$, Mink computes generalized velocities $\dot{\mathbf{q}}$ that, when integrated, reduce the cost function. The optimization solved by Mink is: $ \begin{array}{rl} \min_{\dot{\mathbf{q}}} & \| e(\mathbf{q}) + J(\mathbf{q}) \dot{\mathbf{q}} \|_W^2 \\ \mathrm{subject~to} & \mathbf{q}^- \le \mathbf{q} + \dot{\mathbf{q}} \Delta t \le \mathbf{q}^+ \end{array} $ Where:

  • $e(\mathbf{q})$ is the loss function from the previous equation (Eq. 4 in the paper), representing the vector of errors to be minimized.

  • $J(\mathbf{q}) = \frac{\partial e}{\partial \mathbf{q}}$ is the Jacobian matrix of the loss function with respect to $\mathbf{q}$. The Jacobian describes how the errors change with respect to infinitesimal changes in the generalized coordinates.

  • $W$ is a weight matrix induced by the individual weights $(w_*)_{i,j}^p$ and $(w_*)_{i,j}^R$, controlling the relative importance of different error components.

  • $\Delta t$ is a parameter specific to the differential IK solver and does not necessarily correspond to the actual time difference between motion frames; it controls the integration step size.

    The solver runs until convergence (change in the value function below a threshold, set to 0.001) or until a maximum number of iterations (10) is reached.
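
For intuition, the following is a self-contained damped least-squares sketch of one differential-IK step in the spirit of the objective above. It is not Mink's actual solver (Mink formulates a quadratic program), and the box constraint is approximated by clamping after integration; all names are illustrative.

```python
import numpy as np

def diff_ik_step(q, e, J, W, q_min, q_max, dt=0.02, damping=1e-4):
    """One step of min || e(q) + J(q) qdot ||_W^2 via damped least squares.

    q: (n,) generalized coords; e: (m,) stacked task errors;
    J: (m, n) task Jacobian de/dq; W: (m,) per-error weights.
    """
    JW = J * W[:, None]                      # row-weighted Jacobian: diag(W) J
    H = JW.T @ J + damping * np.eye(len(q))  # damped normal equations
    g = JW.T @ e
    qdot = -np.linalg.solve(H, g)            # descent step on the cost
    q_next = np.clip(q + qdot * dt, q_min, q_max)  # crude joint-limit handling
    return q_next

# In a loop, iterate until the cost change drops below 1e-3 or
# 10 iterations are reached, matching the termination rule above.
```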

4.2.5. Step 5: Fine Tuning using Rotation & Translation Constraints

The solution obtained from Step 4 is used as the initial guess for this final optimization stage. This stage aims to fine-tune the robot's pose by considering position and orientation constraints for all key bodies, not just end-effectors, with a potentially different set of weights. The optimization problem is: $ \begin{array}{rl} \min_{\mathbf{q}} & \sum_{(i,j) \in \mathcal{M}} \left[ (w_2)_{i,j}^R \| R_i^h \ominus R_j(\mathbf{q}) \|_2^2 + (w_2)_{i,j}^p \| \mathbf{p}_i^{\mathrm{target}} - \mathbf{p}_j(\mathbf{q}^r) \|_2^2 \right] \\ \mathrm{subject~to} & \mathbf{q}^- \leq \mathbf{q} \leq \mathbf{q}^+ \end{array} $ Where:

  • The notation is similar to Step 4, but $(w_2)_{i,j}^p$ and $(w_2)_{i,j}^R$ are a different set of weights from the first stage. This allows for a more detailed and holistic constraint satisfaction in the fine-tuning phase.

  • The term $\mathbf{p}_j(\mathbf{q}^r)$ implicitly refers to the robot body positions obtained through forward kinematics from the generalized coordinates $\mathbf{q}$. The notation $\mathbf{q}^r$ might be a typo and should likely be $\mathbf{q}$ to be consistent with the argument of $\mathbf{p}_j(\cdot)$.

    The termination conditions (convergence or max iterations) are the same as in Step 4.

4.2.6. Application to Motion Sequences

The described GMR method is applied sequentially to each frame of a motion sequence. The retargeting result from the previous frame is used as the initial guess for the optimization in Step 4 of the current frame. This temporal coherence helps ensure smooth and continuous motion. After a full motion has been retargeted, a post-processing step is performed: forward kinematics is used to compute the height of all robot bodies over time. The minimum height across all bodies and frames is then subtracted from the global translation to correct for any residual height artifacts (e.g., floating or ground penetration).
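
A minimal sketch of this height correction, assuming body heights have already been computed with forward kinematics for every frame:

```python
import numpy as np

def correct_ground_offset(root_positions, body_heights):
    """root_positions: (T, 3) root translations;
    body_heights: (T, num_bodies) z-heights from forward kinematics.
    Returns a corrected copy of the root translations.
    """
    offset = body_heights.min()      # lowest point over all bodies and frames
    corrected = root_positions.copy()
    corrected[:, 2] -= offset        # shift so the minimum sits at z = 0
    return corrected
```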

4.3. Retargeting Methods for Comparison

The paper evaluates GMR against several other retargeting methods:

  • PHC [38, 2, 14, 11]: This method is designed for motions in SMPL format. It first fits an SMPL shape to the robot skeleton. Then, for each frame, it uses the human SMPL pose parameters to calculate target 3D joint coordinates for the robot. An optimization (using gradient descent with Adam for root pose and Adadelta for joint angles) minimizes the position error between the forward kinematics of the robot and these SMPL-derived targets. Joint limits are enforced by clamping. A post-processing step adjusts the root translation based on the lowest body height. The authors note PHC's issues with contact states and computational time.

  • ProtoMotions [40]: This package includes an optimization-based retargeting algorithm. It scales the source motion using custom scaling factors for each world frame axis. It then employs Mink [41] (a differential IK solver) to minimize joint position and orientation errors between the scaled human and robot key bodies. It also includes a post-processing step to set the lowest height to match the lowest height in the source motion.

  • Unitree (Closed-Source): This refers to a high-quality, pre-retargeted dataset from Unitree, likely generated by a proprietary method not publicly available. It serves as a high-quality baseline to gauge the performance of open-source methods.

4.4. Data Processing

  • Source Data: A diverse sample from the LAFAN1 dataset [42] is chosen, including simple, dynamic, and complex motions (e.g., walking, dancing, martial arts). Motions with complex environmental interactions (e.g., crawling) are excluded, except for a cartwheel where feet/hands are always in contact.
  • Target Robot: Unitree G1 robot.
  • Format Conversion: LAFAN1 data is in BVH format.
    • GMR is directly compatible with BVH.
    • PHC and ProtoMotions require SMPL or SMPL-X format. The authors convert BVH to SMPL(-X) by:
      1. Fitting SMPL(-X) shape parameters ($\beta$) to the BVH skeleton by minimizing joint position error (with $\|\beta\|_2$ regularization to prevent distortions; see the shape-fitting sketch after this list).
      2. Copying matching joint 3D rotations from the LAFAN1 skeleton (which has a similar kinematic structure to SMPL(-X)).
      3. Calculating root translation as the offset that minimizes position error between the posed LAFAN1 skeleton and the posed SMPL(-X) skeleton.
    • SMPL-X [43] is preferred over SMPL for ProtoMotions due to better fit to LAFAN1.
  • Post-processing for PHC: The authors found PHC's built-in foot penetration fix sometimes led to severe floating. They applied a custom fix: forward kinematics to get minimum body height per frame, then offsetting the entire motion by the mean minimum body height. Other methods didn't require this.
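
As referenced above, here is a hedged sketch of the shape-fitting step. `smpl_rest_joints` is a placeholder for a function returning SMPL(-X) rest-pose joint positions for a given $\beta$ (in practice it would wrap a model from the smplx package), and the regularization weight is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_shape(bvh_joints, smpl_rest_joints, n_betas=10, reg=1e-3):
    """bvh_joints: (J, 3) rest-pose joints of the BVH skeleton,
    ordered to match the SMPL(-X) joints returned by smpl_rest_joints.
    """
    def loss(beta):
        joints = smpl_rest_joints(beta)            # (J, 3) rest-pose joints
        pos_err = np.sum((joints - bvh_joints) ** 2)
        return pos_err + reg * np.sum(beta ** 2)   # ||beta||_2 regularizer

    res = minimize(loss, np.zeros(n_betas), method="L-BFGS-B")
    return res.x
```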

4.5. Motion Tracking Evaluation

  • Policy Training: BeyondMimic [15] is used to train individual motion imitation policies for each retargeted motion. This choice is deliberate because BeyondMimic is designed to work without reward tuning or extensive domain randomization, making it ideal for isolating the impact of retargeting quality. Training is performed in IsaacSim.
  • Robustness Evaluation: Policies are evaluated under various conditions to test their robustness:
    • sim: 100 rollouts in IsaacSim without domain randomization.
    • sim-dr: 4096 rollouts in IsaacSim with domain randomization enabled (simulating noisy sensors, state estimation errors, model parameter errors).
    • sim2sim: 100 rollouts using a ROS node running MuJoCo, mimicking a real-world deployment scenario. This setup introduces timing and synchronization conditions from ROS and noise from a state estimation module, but without simulator parameter tuning or privileged information.
  • Rollout Termination: Each rollout continues until the robot falls (anchor body height or orientation deviates beyond a threshold) or the reference motion ends.
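
A minimal sketch of such a termination check (the thresholds are illustrative, not the paper's values):

```python
import numpy as np

def should_terminate(anchor_h, ref_h, anchor_R, ref_R,
                     h_tol=0.3, ang_tol=0.8):
    """anchor_R, ref_R: (3, 3) rotation matrices; heights in meters."""
    if abs(anchor_h - ref_h) > h_tol:            # anchor body height deviation
        return True
    R_rel = anchor_R.T @ ref_R
    cos = (np.trace(R_rel) - 1.0) / 2.0
    angle = np.arccos(np.clip(cos, -1.0, 1.0))   # geodesic deviation, rad
    return angle > ang_tol                       # orientation deviation
```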

4.6. User Study Evaluation

To assess perceptual faithfulness (Q3), a user study is conducted.

  • Methodology: Participants are shown a 5-second clip of a reference motion (rendered from SMPL-X fit of LAFAN1 data) and two retargeted clips. One retargeted clip is always GMR, and the other is either Unitree, PHC, or ProtoMotions.
  • Blinding & Randomization: Users are not told which method produced which video, and the order of presentation is randomized.
  • Task: Users select which of the two retargeted videos they believe is closer to the reference motion, with an option for "can't find a difference."
  • Coverage: 15 motions are randomly sampled. Each of the 3 competitor methods is compared against GMR for every motion, resulting in 45 questions per user.
  • Participants: 20 users.

5. Experimental Setup

5.1. Datasets

The primary dataset used for source human motion is the LAFAN1 dataset [42].

  • Source: LAFAN1 is a motion capture dataset.

  • Scale & Characteristics: The authors selected a diverse subset of 21 sequences from LAFAN1. These motions vary significantly in length (from 5 seconds to 2 minutes) and difficulty, including:

    • Simple locomotion: walking, turning.
    • Dynamic and complex motions: martial arts, kicks, dancing, running, hopping, jumping.
  • Domain: Human motions.

  • Format: BVH format.

  • Exclusions: Motions with complex interaction with the environment (e.g., crawling, getting up from the floor) are generally excluded. An exception is made for a cartwheel motion, as the robot maintains contact with either feet or hands throughout.

  • Target Robot: All motions are retargeted to a Unitree G1 robot. This specific humanoid robot serves as the target embodiment for all experiments.

    An example of data samples would be the diverse motions themselves, such as a walk, run, dance, jump, or kick sequence, as shown qualitatively in the user study interface (Fig. 1 of the original paper). Each sequence comprises a series of poses over time for the human skeleton.

These datasets were chosen because LAFAN1 provides a rich variety of human movements, allowing for a comprehensive test of retargeting capabilities across different motion complexities. The Unitree G1 is a commonly used humanoid robot platform in research, making the results relevant to the robotics community.

5.2. Evaluation Metrics

The evaluation focuses on both the policy's ability to maintain balance and its tracking performance.

  • Success Rate:

    1. Conceptual Definition: This metric quantifies the percentage of rollouts (simulation runs) where the policy successfully completes the reference motion from start to end without the robot falling. A rollout is considered a failure if the robot's anchor body height or orientation deviates from the reference by more than a predefined threshold, leading to episode termination.
    2. Mathematical Formula: Not explicitly provided as a formula in the paper, but conceptually it is: $ \text{Success Rate} = \frac{\text{Number of Successful Rollouts}}{\text{Total Number of Rollouts}} \times 100\% $
    3. Symbol Explanation:
      • Number of Successful Rollouts: The count of simulation runs where the robot completed the motion without falling.
      • Total Number of Rollouts: The total number of simulation runs performed for a given policy and evaluation setting.
  • Average position error of body parts in global coordinates ($E_{\mathrm{g\text{-}mpbpe}}$):

    1. Conceptual Definition: This metric measures the average Euclidean distance between the global 3D positions of corresponding body parts (joints or links) of the robot and the retargeted reference motion across all time steps where the policy is active. It indicates how accurately the robot's overall spatial configuration matches the reference.
    2. Mathematical Formula: Assuming a set of $N_b$ key body parts, averaged over $T$ active frames: $ E_{\mathrm{g\text{-}mpbpe}} = \frac{1}{T \cdot N_b} \sum_{t=1}^{T} \sum_{k=1}^{N_b} \| \mathbf{p}_{k,t}^{\text{robot}} - \mathbf{p}_{k,t}^{\text{reference}} \|_2 $
    3. Symbol Explanation:
      • $E_{\mathrm{g\text{-}mpbpe}}$: Average position error of body parts in global coordinates, reported in millimeters (mm).
      • $T$: Total number of frames (time steps) during which the policy is actively tracking the motion.
      • $N_b$: Number of key body parts considered for error calculation.
      • $\mathbf{p}_{k,t}^{\text{robot}}$: Global 3D position vector of robot body part $k$ at time step $t$.
      • $\mathbf{p}_{k,t}^{\text{reference}}$: Global 3D position vector of the corresponding reference body part $k$ at time step $t$.
      • $\| \cdot \|_2$: The Euclidean (L2) norm, representing the 3D distance.
  • Average position error of body parts relative to the root position ($E_{\mathrm{mpbpe}}$):

    1. Conceptual Definition: This metric measures the average Euclidean distance between the relative 3D positions of corresponding body parts (relative to the robot's root) of the robot and the retargeted reference motion. It assesses how well the robot's internal pose or body shape matches the reference, irrespective of its global position.
    2. Mathematical Formula: $ E_{\mathrm{mpbpe}} = \frac{1}{T \cdot N_b} \sum_{t=1}^{T} \sum_{k=1}^{N_b} \| (\mathbf{p}_{k,t}^{\text{robot}} - \mathbf{p}_{\text{root},t}^{\text{robot}}) - (\mathbf{p}_{k,t}^{\text{reference}} - \mathbf{p}_{\text{root},t}^{\text{reference}}) \|_2 $
    3. Symbol Explanation:
      • $E_{\mathrm{mpbpe}}$: Average position error of body parts relative to the root position, reported in millimeters (mm).
      • $T$: Total number of frames where the policy is active.
      • $N_b$: Number of key body parts considered.
      • $\mathbf{p}_{k,t}^{\text{robot}}$: Global 3D position vector of robot body part $k$ at time step $t$.
      • $\mathbf{p}_{\text{root},t}^{\text{robot}}$: Global 3D position vector of the robot's root at time step $t$.
      • $\mathbf{p}_{k,t}^{\text{reference}}$: Global 3D position vector of reference body part $k$ at time step $t$.
      • $\mathbf{p}_{\text{root},t}^{\text{reference}}$: Global 3D position vector of the reference's root at time step $t$.
      • $\| \cdot \|_2$: The Euclidean (L2) norm.
  • Average angular error of joint rotations ($E_{\mathrm{mpjpe}}$):

    1. Conceptual Definition: This metric quantifies the average angular difference between the joint rotations of the robot and the retargeted reference motion across all active frames. It directly measures the accuracy of the robot's joint configurations relative to the target.
    2. Mathematical Formula: Assuming $N_j$ joints, with the angular difference measured as the geodesic distance between orientations; for quaternions $q_1, q_2$ this is $2 \cdot \mathrm{acos}(|q_1 \cdot q_2|)$: $ E_{\mathrm{mpjpe}} = \frac{1}{T \cdot N_j} \sum_{t=1}^{T} \sum_{k=1}^{N_j} \mathrm{angular\_distance}(\mathbf{R}_{k,t}^{\text{robot}}, \mathbf{R}_{k,t}^{\text{reference}}) $
    3. Symbol Explanation:
      • $E_{\mathrm{mpjpe}}$: Average angular error of joint rotations, reported in $10^{-3}$ radians (rad).
      • $T$: Total number of frames where the policy is active.
      • $N_j$: Number of joints considered for error calculation.
      • $\mathbf{R}_{k,t}^{\text{robot}}$: Rotation (e.g., quaternion or rotation matrix) of robot joint $k$ at time step $t$.
      • $\mathbf{R}_{k,t}^{\text{reference}}$: Rotation of the corresponding reference joint $k$ at time step $t$.
      • $\mathrm{angular\_distance}(\cdot, \cdot)$: A function that calculates the shortest angular distance between two 3D rotations, typically the geodesic distance on $SO(3)$ or in quaternion space. (A NumPy sketch of all three tracking metrics follows this list.)
  • User Study - Perceptual Faithfulness: This is a subjective metric evaluated through a user study. Participants judge which retargeted motion looks more similar or faithful to the original human reference motion. This assesses the aesthetic quality and naturalness of the retargeted motion from a human perception standpoint.
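
As noted above, here is a NumPy sketch of the three tracking metrics under stated assumptions about array shapes; it transcribes the definitions and is an interpretation, not the authors' evaluation code.

```python
import numpy as np

def g_mpbpe(p_robot, p_ref):
    """Global position error, mm. p_*: (T, Nb, 3) body positions in meters."""
    return np.linalg.norm(p_robot - p_ref, axis=-1).mean() * 1000.0

def mpbpe(p_robot, p_ref, root_robot, root_ref):
    """Root-relative position error, mm. root_*: (T, 3) root positions."""
    rel_robot = p_robot - root_robot[:, None, :]
    rel_ref = p_ref - root_ref[:, None, :]
    return np.linalg.norm(rel_robot - rel_ref, axis=-1).mean() * 1000.0

def mpjpe(R_robot, R_ref):
    """Angular error, 1e-3 rad. R_*: (T, Nj, 3, 3) rotation matrices."""
    R_rel = np.einsum("...ij,...ik->...jk", R_robot, R_ref)  # R_robot^T R_ref
    cos = (np.trace(R_rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    angle = np.arccos(np.clip(cos, -1.0, 1.0))   # geodesic distance on SO(3)
    return angle.mean() * 1000.0
```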

5.3. Baselines

The proposed GMR method is evaluated against three other retargeting methods to provide a comprehensive comparison:

  • PHC [38]: This is a widely used open-source retargeting method that relies on SMPL for scaling and gradient descent for Inverse Kinematics (IK). It serves as a representative of current SMPL-based retargeting approaches in robotics research.

  • ProtoMotions [40]: Another open-source method that utilizes global axis-aligned scaling and a differential IK solver (Mink) for position and orientation matching. It represents a different approach to IK-based retargeting.

  • Unitree (Closed-Source): This is a high-quality retargeted motion dataset provided by Unitree, generated using a proprietary, closed-source method. It acts as a strong upper-bound baseline for performance, representing what can be achieved with potentially more sophisticated or hand-tuned industrial solutions.

    These baselines are chosen because they represent different paradigms in motion retargeting (SMPL-based, differential IK, and industrial best practice) and are either widely adopted in the community (PHC, ProtoMotions) or provide a benchmark for high-quality results (Unitree).

6. Results & Analysis

The experimental results systematically evaluate the impact of different retargeting methods on RL policy performance and perceptual faithfulness. The analysis focuses on success rates, tracking errors, and user study feedback.

6.1. Core Results Analysis

6.1.1. Policy Success Rates

The following are the results from [Table I] of the original paper:

Motion Length (s) sim sim-dr sim2sim
PHC GMR PM U PHC GMR PM U PHC GMR PM U
(Success rates in %. PM = ProtoMotions, U = Unitree.)
Walk 1 33 100 100 100 100 100 100 100 100 100 100 100 100
Walk 2 5.5 23 100 100 100 53.54 100 99.98 100 100* 100* 100* 100*
Turn 1 12.3 93 100 100 100 87.18 99.98 99.95 100 100* 100* 99* 100*
Turn 2 12.3 100 100 100 100 99.95 99.98 100 99.98 99 100 100 99
Walk (old) 33 100 100 100 100 100 100 100 100 100 100 100 100
Walk (army) 13 100 100 100 100 99.85 98.63 99.95 99.95 100 100 99 100
Hop 13 95 100 100 100 92.97 100 100 100 100 100 100 100
Walk (knees) 19.58 100 100 100 100 99.98 100 100 100 100 100 100 100
Dance 1 118 0 100 100 99 0 99.46 99.24 99.95 0 100 100 100
Dance 2 130.5 0 100 100 100 0.02 99.9 99.88 99.98 0 100 100 100
Dance 3 120 100 100 100 100 100 100 99.95 100 99 100 100 100
Dance 4 20 100 100 100 100 100 100 100 100 99 100 100 100
Dance 5 68.4 100 96 100 100 100 92.75 99.98 100 100 51 100 100
Run (slow) 50 100 100 100 100 99.19 99.88 99.95 99.98 100 100 100 100
Run 11 100 100 100 100 99.98 100 99.95 100 100 100 100 100
Run (stop & go) 37 17 98 20 100 20.46 91.24 40.26 99.83 74 100 26 100
Hop around 18 100 100 100 100 100 100 100 100 100 100 100 100
Hopscotch 10 100 100 100 100 100 100 100 99.98 100 100 100 100
Jump and rotate 21 100 100 100 100 99.98 100 99.9 100 99 100 100 99
Kung fu 8.6 100 100 100 100 100 99.95 100 100 100 100 100 100
Various sports 42.58 100 100 100 100 99.98 99.98 99.95 100 100 100 100 99

Analysis:

  • Overall Performance: Out of 21 motions, 11 motions achieved success rates above 98% across all retargeting methods, and 3 achieved perfect 100% success (Walk 1, Walk (old), Hop around). This indicates that for many motions, all methods can generate trackable references.

  • Impact of Retargeting Quality: For the remaining 7 motions, there's a significant variation in performance.

    • Unitree: Consistently achieves near-perfect performance across all motions and evaluation settings (sim, sim-dr, sim2sim), validating its quality as a baseline.
    • GMR: Shows strong performance, closely following Unitree. It achieves 100% success for Dance 1 and Dance 2, where PHC completely fails. Its performance dips for Dance 5 (51% in sim2sim) and Run (stop & go) (91.24% in sim-dr, 98% in sim).
    • ProtoMotions (PM): Generally performs well, comparable to GMR in many cases, but shows a significant drop for Run (stop & go) (20% in sim, 40.26% in sim-dr, 26% in sim2sim).
    • PHC: Exhibits the lowest performance, with catastrophic failures (0% success) for Dance 1 and Dance 2 across all settings. It also struggles with Walk 2 (23% in sim), Turn 1 (93% in sim), and Run (stop & go) (17% in sim).
  • Long-Horizon Motions: The paper notes that long motions aren't inherently challenging, as PHC still achieves 100% success for Dance 3 (two minutes long). Failures are attributed to retargeting artifacts.

  • Robustness Settings: The performance typically degrades from sim to sim-dr to sim2sim, as more randomization and real-world factors (like ROS timing, state estimation noise) are introduced. Methods producing higher quality references (Unitree, GMR) maintain high success rates even in the more challenging sim-dr and sim2sim environments.

    This confirms the answer to Q1: the choice of retargeting method critically impacts policy performance, especially for dynamic or challenging sequences, and without extensive reward engineering.

6.1.2. Tracking Error Statistics

The following are the results from [Table II] of the original paper:

Eg-mpbpe, mm Empbpe, mm Empjpe, 10^-3 rad
Statistics PHC GMR PM U PHC GMR PM U PHC GMR PM U
Min 71.8 59.9 66.0 51.1 20.9 18.1 24.1 18.2 569.5 362.0 499.0 355.5
Median 111.9 91.2 101.9 73.4 29.9 27.6 30.4 23.1 739.8 546.0 599.7 467.2
Mean 247.8 104.1 139.7 77.2 40.2 28.1 33.2 23.2 778.5 561.7 641.8 483.0
Max 1062.3 200.0 915.6 131.4 134.4 48.0 107.9 28.9 1336.1 1044.8 1397.9 678.5

Analysis:

  • Lower Tracking Errors for GMR and Unitree: GMR and Unitree consistently demonstrate significantly lower mean and median tracking errors across all three metrics ($E_{\mathrm{g\text{-}mpbpe}}$, $E_{\mathrm{mpbpe}}$, $E_{\mathrm{mpjpe}}$).

    • For global position error ($E_{\mathrm{g\text{-}mpbpe}}$), GMR's mean error is 104.1 mm, much lower than PHC (247.8 mm) and ProtoMotions (139.7 mm), and close to Unitree (77.2 mm).
    • Similar trends are observed for relative position error ($E_{\mathrm{mpbpe}}$) and angular joint error ($E_{\mathrm{mpjpe}}$).
  • Consistency: The lower maximum errors for GMR (e.g., 200.0 mm for $E_{\mathrm{g\text{-}mpbpe}}$, compared to PHC's 1062.3 mm and ProtoMotions' 915.6 mm) indicate that GMR produces more consistently high-quality retargeted motions, avoiding the extreme deviations that plague PHC and ProtoMotions.

  • Correlation with Success Rate: High tracking errors for PHC and ProtoMotions correlate with their lower success rates for certain motions. This suggests that policies trained on these less accurate references struggle to maintain stability, even if they sometimes complete the task.

    The tracking error statistics provide a more detailed picture, showing that while policies might sometimes succeed with lower quality retargets, they do so with considerable tracking errors, indicating difficulty in accurately imitating the reference.

6.1.3. Impact of Retargeting Artifacts

To answer Q2, the paper directly links low sim2sim success rates to specific artifacts in the retargeted motions:

  • Ground Penetration (PHC): The PHC retarget of Dance 1 and Dance 2 motions exhibited severe ground penetration (up to 60 cm). This physically impossible state makes it impossible for an RL policy to track correctly, resulting in 0% success rates. As illustrated in Fig. 3(a) of the original paper:

    [Figure 3(a): a robot model with the Unitree logo in a kneeling pose, sunk below the ground plane.]

    The above figure shows an example of ground penetration in a PHC-retargeted motion, a physically infeasible reference for a policy to track.

  • Self-Intersection (ProtoMotions): The ProtoMotions retarget of the Run (stop & go) motion showed robot legs intersecting with each other. This self-intersection is a physically infeasible state, leading to instability and low success rates (20-40% for ProtoMotions). As illustrated in Fig. 3(b) of the original paper:

    [Figure 3(b): a humanoid robot pose during a dynamic motion, with the legs intersecting.]

    The above figure shows an example of self-intersection in ProtoMotions retargeted motion, where robot limbs pass through each other.

  • Sudden Jumps in Joint Values (GMR): The GMR retarget of the Dance 5 motion displayed many sudden jumps in the waist roll value. Although GMR generally performs very well, this rare optimization artifact can still occur and drastically reduce policy robustness (GMR's sim2sim success for Dance 5 was 51%). These abrupt velocity spikes make the motion difficult to execute smoothly. As illustrated in Fig. 3(c) of the original paper:

    Fig. 3: Example artifacts found in the retargeted references with low success rates. [Figure 3(c): line plot of waist roll and pitch joint angles over frames, with joint limits marked as red dashed lines; the trajectory approaches or exceeds the limits.]

    The above figure shows sudden jumps in waist roll and pitch values in a GMR retargeted motion, indicating abrupt changes in joint angles.

These observations highlight that physically inconsistent height, self-intersections, and sudden jumps in joint values are critical artifacts that must be avoided in retargeting to ensure RL policy success and robustness.
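A simple way to screen retargeted sequences for the last artifact is to finite-difference the joint trajectories and flag velocity spikes. A minimal sketch (the threshold and frame rate are illustrative, not values from the paper):

```python
import numpy as np

def find_joint_jumps(joint_angles, fps=30.0, max_vel=10.0):
    """joint_angles: (T, n_joints) in radians.

    Returns (frame, joint) indices where the finite-difference joint
    velocity exceeds max_vel rad/s, i.e. candidate "sudden jumps".
    """
    vel = np.diff(joint_angles, axis=0) * fps        # rad/s between frames
    frames, joints = np.nonzero(np.abs(vel) > max_vel)
    return list(zip(frames.tolist(), joints.tolist()))
```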

6.1.4. User Study Results

To answer Q3, a user study (N=20 participants) was conducted to assess the perceptual faithfulness of retargeted motions to the source human motion. The following are the results from [Fig. 4] of the original paper:

Fig. 4: User study (N = 20) results for comparing GMR to other retargets in terms of faithfulness to the source motion. The bars represent the percentage of responses (blue: prefer GMR; green: no preference; orange: prefer the other method).

The above figure displays the user study results comparing GMR to other retargeting methods (Unitree, PHC, ProtoMotions) in terms of faithfulness to the source motion. Bars represent the percentage of responses.

Analysis:

  • GMR vs. PHC/ProtoMotions: Users consistently found GMR to be more faithful to the reference motion compared to PHC and ProtoMotions. This suggests that GMR successfully preserves the look and naturalness of the human motion better than other open-source methods.

  • GMR vs. Unitree: The comparison between GMR and Unitree is closer. While Unitree is still perceived as more faithful by users, the difference is less pronounced, and a substantial percentage of users reported no difference. This indicates that GMR achieves a perceptual fidelity very close to a high-quality, closed-source baseline.

    This user study confirms that GMR produces retargeted motions that are not only more physically feasible but also perceptually superior, addressing concerns about whether policies are learning the intended motion.

6.1.5. Impact of the First Reference Frame

The paper reiterates a finding from prior work [7] that the starting frame of the reference motion can significantly impact policy performance. The following are the results from [Table III] of the original paper:

Motion Start frame PHC GMR PM U
Walk 2 0 100 64 100 100
Walk 2 7 100 100 100 100
Turn 1 0 14 100 86 47
Turn 1 49 100 100 99 100

Analysis:

  • For Walk 2, GMR has a lower success rate (64%) when starting from frame 0 compared to 100% when starting from frame 7.

  • For Turn 1, PHC and Unitree show drastically reduced success rates (14% and 47% respectively) when starting from frame 0, but achieve 100% when starting from frame 49. Even ProtoMotions sees an improvement from 86% to 99%.

    This emphasizes the practical importance of carefully selecting an initial pose that the robot can safely and robustly reach before policy inference begins, and similarly, an end pose for safe deactivation.

6.2. Ablation Studies / Parameter Analysis

The paper does not present explicit ablation studies in the traditional sense (e.g., removing components of GMR to measure their individual contribution). However, it implicitly touches on the parameter analysis and tuning aspect of GMR. The authors note that the sudden jumps in waist roll for GMR's Dance 5 motion are a rare occurrence introduced during the optimization phase. They suggest that "some motions might require further weight tuning to achieve optimal results" for the optimization weights used in Steps 4 and 5 of GMR. This implies that while the default parameters work well broadly, specific challenging motions might benefit from customized tuning, similar to reward engineering in RL, though applied to the retargeting process itself.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously demonstrates the critical impact of motion retargeting quality on the performance and robustness of humanoid motion tracking policies. By suppressing reward tuning and domain randomization in the RL training process (BeyondMimic), the authors successfully isolated the effects of retargeting artifacts. The proposed General Motion Retargeting (GMR) method, with its non-uniform local scaling and two-stage optimization, effectively addresses common pitfalls of existing open-source retargeters (PHC, ProtoMotions). GMR consistently yields higher policy success rates and lower tracking errors, approaching the performance of a high-quality closed-source baseline (Unitree). Crucially, the research identifies specific artifacts (ground penetrations, self-intersections, sudden joint value jumps) as detrimental to policy learning and robustness. A user study further validates GMR's perceptual faithfulness to source motions, ensuring that RL policies are trained on visually accurate references.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose directions for future research:

  • Data Source Diversity: The study exclusively used the LAFAN1 dataset. Future work should explore more diverse sources like the AMASS dataset or human motion reconstructed from monocular video to validate GMR's generalizability.
  • Robot Embodiment Diversity: The experiments were confined to the Unitree G1 robot. Although BeyondMimic and the retargeting algorithms are general, extending the analysis to other humanoid robots (e.g., Unitree H1) would be valuable.
  • Interactive Motions: The current work primarily focused on non-interactive motions. Future research should investigate the impact of retargeting on motion sequences involving interactions with the environment, objects, or other robots, which introduces additional complexities like contact forces and object manipulation.
  • Optimization Tuning: While GMR's default parameters work well, some complex motions might still benefit from further weight tuning in its optimization stages to eliminate occasional artifacts, suggesting an area for potential automation or more adaptive parameter selection.

7.3. Personal Insights & Critique

This paper provides a crucial perspective by emphasizing the upstream data quality in RL-based robotics. Too often, RL is expected to be a panacea, capable of learning from noisy or imperfect data by brute force of reward engineering and domain randomization. This work clearly illustrates that garbage in, garbage out still holds true; a high-quality reference motion significantly eases the RL learning burden and leads to more robust and successful policies.

The identification of specific artifacts like ground penetration and self-intersection as critical failure modes is a valuable takeaway. These are not merely cosmetic issues but fundamentally break the physical realism of the reference, making it impossible for a physics-based RL agent to imitate. GMR's approach to non-uniform local scaling is particularly insightful for addressing the embodiment gap, as it acknowledges that human and robot proportions don't scale uniformly across all body parts.

A potential area for improvement or future exploration could be to integrate contact-aware optimization directly into GMR's pipeline. While GMR implicitly addresses foot contact through uniform root scaling and post-processing, explicit contact constraints (e.g., ensuring feet remain on the ground and hands are placed correctly for complex motions) could further enhance realism and reduce artifacts, especially for interactive tasks. Additionally, exploring learning-based approaches to automatically determine optimal scaling factors and optimization weights for GMR could reduce the need for manual tuning and increase its applicability across an even wider range of robots and motions. The user study is a strong component, grounding technical metrics in human perception, which is often the ultimate judge of naturalness in humanoid motion.
