
PHUMA: Physically-Grounded Humanoid Locomotion Dataset

Published: 10/30/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

PHUMA is a physically grounded humanoid locomotion dataset built from large-scale video via physics-constrained retargeting. By eliminating artifacts such as foot skating, it enables stable and diverse motion imitation; policies trained on it outperform those trained on prior datasets in unseen-motion imitation and path-following tasks.

Abstract

Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale, while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA in two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

PHUMA: Physically-Grounded Humanoid Locomotion Dataset

1.2. Authors

Kyungmin Lee, Sibeen Kim, Minho Park, Hyunseung Kim, Dongyoon Hwang, Hojoon Lee, Jaegul Choo. All authors are affiliated with KAIST.

1.3. Journal/Conference

This paper was released as a preprint on arXiv, with an announced publication date of October 30, 2025. arXiv is a well-regarded open-access repository for preprints of scientific papers, particularly in fields like physics, mathematics, computer science, and quantitative biology. While preprints are not peer-reviewed, they allow for rapid dissemination of research and feedback from the scientific community. A related workshop venue (the ICRA 2025 Workshop: Human-Centered Robot Learning in the Era of Big Data and Large Models) is mentioned in a citation (Mao et al., 2025), indicating the paper's relevance to major robotics and AI conferences.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces PHUMA, a Physically-grounded HUMAnoid locomotion dataset designed to address the limitations of existing motion imitation approaches for humanoid robots. Current methods often rely on expensive, scarce, and limited motion capture (MoCap) datasets like AMASS. While recent efforts, such as Humanoid-X, attempt to scale data collection using large-scale internet videos, they frequently suffer from physical artifacts like floating, penetration, and foot skating, which hinder stable imitation. PHUMA tackles these issues by leveraging human video at scale, combined with careful data curation and a novel physics-constrained retargeting method called PhySINK. PhySINK enforces joint limits, ensures ground contact, and eliminates foot skating, yielding motions that are both extensive and physically reliable. The dataset's effectiveness was evaluated in two scenarios: (i) imitating unseen motions from self-recorded test videos and (ii) path following with only pelvis guidance. In both cases, policies trained with PHUMA significantly outperformed those trained with Humanoid-X and AMASS, demonstrating substantial gains in imitating diverse motions. The code and dataset are made publicly available.

Official Source Link: https://arxiv.org/abs/2510.26236
PDF Link: https://arxiv.org/pdf/2510.26236v1.pdf

2. Executive Summary

2.1. Background & Motivation

The deployment of humanoid robots in real-world scenarios necessitates locomotion that is both stable and humanlike. While traditional reinforcement learning (RL) has advanced quadrupedal locomotion, directly applying these techniques to humanoids often results in effective but non-humanlike gaits. Motion imitation has emerged as a promising paradigm, where robot policies are trained to replicate human movements. This process typically involves three stages: collecting human motion data, retargeting it to the robot's morphology, and using RL to track these retargeted trajectories.

The core problem the paper addresses is the fundamental constraint on motion imitation due to the scale, diversity, and physical feasibility of available human motion data. High-quality motion capture (MoCap) datasets, such as AMASS (Archive of Motion Capture as Surface Shapes), provide relatively physically feasible motions but are limited in size and diversity, often dominated by simple actions like walking or reaching. This scarcity makes it challenging to train robust policies for a wide range of complex human behaviors.

To overcome this limitation, recent research has explored scaling data collection by converting large volumes of internet videos into motion data, exemplified by Humanoid-X. However, this approach introduces significant challenges:

  1. Physical Artifacts: Video-to-motion models often misestimate global pelvis translation, leading to floating (robot hovering above the ground) or ground penetration (robot's limbs sinking into the ground).

  2. Retargeting Issues: The subsequent retargeting stage, which adapts human motion to robot morphology, often prioritizes joint alignment over physical plausibility. This can result in joint violations (joint angles exceeding mechanical limits) and foot skating (feet sliding on the ground during contact), as visually depicted in Figure 1. These artifacts make the motions unstable and unreliable for training physical robots, hindering stable imitation.

    The problem is important because without large-scale, diverse, and physically reliable motion data, humanoid robots cannot acquire the full spectrum of humanlike behaviors required for general-purpose embodied AI. Existing solutions either lack scale/diversity or introduce critical physical inconsistencies.

The paper's entry point is to bridge this gap by leveraging the scalability of internet video data while rigorously enforcing physical plausibility.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. PHUMA Dataset: Introduction of PHUMA (Physically-grounded HUMAnoid locomotion dataset), a large-scale dataset (73.0 hours, 76.0K clips) that leverages human video at scale while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, yielding motions that are both extensive and physically reliable. It aggregates motions from various sources, including motion capture and human videos, after undergoing a physics-aware curation pipeline.

  2. PhySINK Retargeting Method: Development of PhySINK (Physically-grounded Shape-adaptive Inverse Kinematics), a novel retargeting method. PhySINK extends existing Shape-Adaptive Inverse Kinematics (SINK) approaches by incorporating explicit joint feasibility, grounding, and anti-skating loss terms into its optimization objective. This ensures that the retargeted motions maintain fidelity to the source human movement while remaining physically plausible for a robot.

  3. Enhanced Motion Quality: PHUMA provides substantially more physically plausible motions compared to existing datasets, achieving significantly higher percentages of joint feasibility, non-floating, non-penetration, and non-skating frames than AMASS and Humanoid-X (as shown in Figure 2b and Table 2).

  4. Superior Performance in Motion Imitation and Path Following: Policies trained with PHUMA consistently outperform those trained with Humanoid-X and AMASS in two key evaluation settings:

    • Motion Imitation: Achieved 1.2× and 2.1× higher success rates than AMASS and Humanoid-X, respectively, when imitating unseen motions from self-recorded test videos (Figure 2c, Table 4).
    • Path Following: Improved the overall success rate by 1.4× over AMASS, with even greater gains in the vertical (1.6×) and horizontal (2.1×) motion categories when using pelvis-only guidance (Figure 2d, Table 5).
  5. Public Release: The authors commit to releasing PHUMA as a public resource, along with code and trained models, to foster future research in humanoid locomotion.

    The key findings are that motion data for humanoid locomotion requires not only scale and diversity but critically, also physical reliability. PHUMA addresses the specific problems of physical artifacts introduced by video-to-motion conversion and unconstrained retargeting, leading to more robust and capable humanoid control policies.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Humanoid Locomotion: This refers to the ability of humanoid robots (robots designed to resemble the human body, typically with two arms, two legs, and a torso) to move around in an environment. The goal is often to achieve both stable (maintaining balance and not falling) and humanlike (natural and efficient, resembling human movement patterns) gaits. This is crucial for robots intended to operate in human-centric environments.

  • Reinforcement Learning (RL): A subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, receiving feedback (rewards or penalties) for its actions. In the context of humanoid locomotion, an RL agent (the robot's controller) learns to generate motor commands to move the robot, aiming to achieve a task (e.g., walking, imitating a motion) while being stable. Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a popular RL algorithm used for training policies.

  • Motion Imitation: A specific paradigm within RL or control where the goal is to train a robot to replicate or track a given reference motion (e.g., a human walking, jumping, or dancing). Instead of defining a task with hand-crafted rewards, the robot is rewarded for how closely its movements match the target motion. This approach allows robots to acquire complex, humanlike behaviors without explicit programming for each movement.

  • Motion Capture (MoCap): A technology used to record the movement of people or objects. Typically, markers are placed on a human subject's body, and multiple cameras or sensors track the 3D positions of these markers over time. This generates highly accurate and precise 3D kinematic data (joint positions, orientations, velocities) that captures human motion. MoCap datasets like AMASS are considered high-quality but are expensive and time-consuming to produce, leading to limited scale and diversity.

  • SMPL/SMPL-X Model: SMPL (Skinned Multi-Person Linear Model) and its extension SMPL-X are statistical 3D human body models. These models can represent a wide range of human body shapes and poses using a small set of parameters. Given shape parameters (e.g., height, weight) and pose parameters (joint angles), SMPL/SMPL-X can generate a realistic 3D mesh of a human body. These models are widely used as a standard representation for human motion data, allowing for easy manipulation and retargeting.

  • Inverse Kinematics (IK): In robotics and computer graphics, Inverse Kinematics refers to the process of determining the joint angles of a kinematic chain (like a robot arm or a human skeleton) that would result in a desired position and orientation of a specific end-effector (e.g., a hand or foot). Given a target spatial position for an end-effector, IK algorithms calculate the required joint configurations to reach that target. Traditional IK often struggles with preserving the natural motion style or accounting for morphological differences between humans and robots.

  • Shape-Adaptive Inverse Kinematics (SINK): An advanced form of IK that specifically addresses the motion mismatch problem in retargeting. Unlike traditional IK which might apply rigid scaling, SINK methods first adapt the source human model (e.g., SMPL) to match the body shape and limb proportions of the target robot. Only then is the motion aligned by matching global or local joint positions/orientations. While better at preserving motion style, SINK approaches are often physically under-constrained, meaning they do not explicitly account for physical limitations like joint limits or ground interactions, leading to artifacts.

  • Physical Artifacts: These are undesirable and unrealistic behaviors that arise in simulated or retargeted motions, making them unphysical for a real robot or simulation. The paper specifically mentions:

    • Floating: The robot's feet or body hovering above the ground, losing contact when it should be touching.
    • Penetration: The robot's limbs or body sinking into or passing through the ground or other objects.
    • Foot Skating: The robot's feet sliding across the ground surface during phases when they should be firmly planted.
    • Joint Violations: Joint angles exceeding the physical or mechanical limits of the robot's actuators or structure, leading to unnatural or impossible poses.

3.2. Previous Works

The paper builds upon and differentiates itself from several key prior studies:

  • Motion Capture Datasets (e.g., LaFAN1, AMASS):

    • LaFAN1 (Harvey et al., 2020): A relatively small-scale motion capture dataset (a few hours). While providing high-quality data, its limited size restricts the diversity of motions that can be learned.
    • AMASS (Mahmood et al., 2019): The most extensive and widely-used MoCap dataset. It aggregates various MoCap libraries into a common SMPL format. Despite its size, AMASS is still limited in scale and diversity, often dominated by walking motions and conducted in indoor lab settings. The paper notes that AMASS is of moderate quality and, even when retargeted with SINK, still exhibits physical reliability issues (Figure 2b).
    • Limitation: Both LaFAN1 and AMASS suffer from scalability and diversity issues inherent to MoCap data collection. They provide a high proportion of physically feasible motions, but their content is mostly simple movements.
  • Video-to-Motion Datasets (e.g., Humanoid-X):

    • Humanoid-X (Mao et al., 2025): This dataset represents a trend of scaling data collection by leveraging vast internet videos. It converts videos to SMPL representations using video-to-motion models (like VIBE by Kocabas et al., 2020) and then retargets them to humanoid embodiments. Humanoid-X is large-scale but of lower quality in terms of physical plausibility.
    • Limitation: The pipeline used by Humanoid-X (video-to-motion followed by retargeting) introduces significant physical artifacts. The video-to-motion models often misestimate global pelvis translation, leading to floating or ground penetration. The retargeting stage prioritizes joint alignment over physical plausibility, causing joint violations and foot skating (as highlighted in Figure 1). These issues make the data less stable for training robust robot policies.
  • MaskedMimic (Tessler et al., 2024): This is the reinforcement learning framework used by the authors for policy training. MaskedMimic provides a unified approach for motion tracking and supports both full-body state and partial-body state (e.g., pelvis-only) imitation. The paper uses this framework to train its policies, demonstrating that the improvements observed are due to the PHUMA dataset and PhySINK method, rather than a different RL training methodology.

    • Note: While not a dataset, MaskedMimic is a crucial methodological prerequisite as it's the learning algorithm for which PHUMA provides the training data.
  • SINK (Shape-Adaptive Inverse Kinematics) Methods (He et al., 2024b;a; 2025a;b): These methods are a direct predecessor to PhySINK. SINK aims to preserve motion style by adapting the human model to the robot's proportions before aligning motions.

    • Limitation: As discussed, SINK approaches are physically under-constrained. They effectively match kinematic poses but introduce physical artifacts like joint limit violations and implausible ground interactions (floating, penetration, skating) because they don't explicitly optimize for these physical constraints.

3.3. Technological Evolution

The field of humanoid locomotion has evolved significantly:

  1. Early RL for Locomotion: Initially, reinforcement learning focused on task-oriented rewards to achieve locomotion (e.g., Hwangbo et al., 2019; Lee et al., 2020). While effective, these often produced non-humanlike gaits.
  2. Emergence of Motion Imitation: To achieve humanlike behaviors, the focus shifted to motion imitation, using human motion data as a reference. This involved MoCap data and retargeting.
  3. MoCap Data Limitations: MoCap datasets (like CMU, LaFAN1, AMASS) provided high-quality, physically sound motions but were inherently limited in scale and diversity due to the expensive and labor-intensive collection process. They often lacked complex or diverse movements.
  4. Scaling with Video Data: The advent of robust video-to-motion models (e.g., VIBE by Kocabas et al., 2020; TRAM by Wang et al., 2024) enabled researchers to leverage large-scale internet videos to generate vast amounts of human motion data (e.g., Humanoid-X). This addressed the scale and diversity problem.
  5. New Challenges with Video Data: However, video-derived motions and unconstrained retargeting introduced new problems: physical artifacts (floating, penetration, skating, joint violations) that undermined the stability and reliability of the data for robot training.
  6. PHUMA's Position: This paper's work, PHUMA, represents the next evolutionary step. It aims to combine the scale and diversity benefits of video-derived data with the physical reliability of MoCap data. It does this through a physics-aware curation pipeline and a novel physics-constrained retargeting method (PhySINK), seeking to deliver both quantity and quality.

3.4. Differentiation Analysis

Compared to the main methods in related work, PHUMA's core differences and innovations are:

  • Addressing the Scale vs. Quality Trade-off:

    • Vs. MoCap Datasets (e.g., AMASS, LaFAN1): While AMASS and LaFAN1 offer relatively high-quality, physically feasible motions, they are limited in scale and diversity. PHUMA vastly expands the scale and diversity of motions available for training (73 hours vs. 20.9 hours for AMASS) by incorporating video-derived data, while simultaneously maintaining high physical reliability.
    • Vs. Large-Scale Video Datasets (e.g., Humanoid-X): Humanoid-X achieves large scale by converting internet videos but suffers from significant physical artifacts (floating, penetration, skating, joint violations). PHUMA directly addresses these quality issues through its physics-aware motion curation and PhySINK retargeting, making the large-scale data physically reliable. Figure 1 visually highlights this differentiation in physical reliability.
  • Novel Retargeting Method (PhySINK):

    • Vs. Traditional IK: Traditional IK methods often fail to preserve motion style and produce unnatural gaits due to morphological differences between humans and robots. PhySINK, like SINK, adapts to body shape, but goes further.
    • Vs. SINK Methods: SINK methods improve pose matching and style preservation but are physically under-constrained, leading to artifacts. PhySINK innovates by augmenting SINK with explicit physical constraint losses for joint feasibility, grounding, and anti-skating. This is a critical addition that ensures the retargeted motions are not only stylistically faithful but also physically plausible for simulation and real-world robot execution (Table 2 demonstrates this significant improvement).
  • Comprehensive Data Curation: PHUMA includes a meticulous physics-aware motion curation pipeline (Section 3.1) that filters out problematic motions (e.g., high jitter, unstable CoM, incorrect foot contact) from raw video-derived data. This proactive cleaning step is crucial for ensuring the quality of the large-scale dataset, a level of rigor not explicitly highlighted in similar video-based datasets.

    In essence, PHUMA differentiates itself by providing a solution that marries the scale and diversity of video-derived data with the physical reliability typically associated with expensive MoCap data, thus overcoming the primary limitations of previous approaches.

4. Methodology

The core idea behind PHUMA is to create a large-scale, physically reliable dataset for humanoid locomotion by building upon existing large motion datasets (like Humanoid-X) and meticulously filtering and retargeting them with physical constraints. The theoretical basis is that for humanoid robots to effectively learn humanlike behaviors through motion imitation, the reference motions must not only be diverse and abundant but also strictly adhere to physical laws and robot mechanical limitations. Unphysical motions lead to unstable learning and poor real-world performance.

The PHUMA pipeline consists of four main stages, as illustrated in Figure 3. The first two stages are central to constructing the PHUMA dataset itself: (1) Motion Curation to filter out problematic motions, and (2) Motion Retargeting using PhySINK to adapt filtered motions to the humanoid while enforcing physical plausibility. The subsequent stages (3) Policy Learning and (4) Inference use this curated data for training and application.

4.1. Core Methodology In-depth (Layer by Layer)

4.1.1. Overall PHUMA Pipeline (Figure 3)

The PHUMA pipeline integrates several steps to achieve physically-grounded humanoid locomotion:

  1. Motion Curation: This initial stage focuses on processing diverse human motion data, which can come from motion capture systems or be reconstructed from videos. The key task here is to identify and filter out infeasible motions that contain physical artifacts like root jitter, floating, penetration, joint violations, or actions requiring external objects (e.g., sitting on a chair that isn't modeled). This step aims to maximize the physical plausibility of the raw motion data.

  2. Motion Retargeting (PhySINK): The cleaned motion data is then retargeted to the specific humanoid robot morphology. This crucial step employs PhySINK, the paper's novel physics-constrained retargeting method. PhySINK enforces joint limits, ground contact, and anti-skating constraints during the retargeting process, ensuring that the resulting robot motions are both kinematically similar to the human source and physically executable by the robot.

  3. Policy Learning: With a robust and physically plausible motion dataset, reinforcement learning (RL) is used to train a policy (a neural network controller) that can imitate these retargeted motions. The policy learns to generate motor commands for the humanoid robot to track the desired motion trajectories. The MaskedMimic framework is used for this purpose.

  4. Application (Inference): The trained policy is then used to control the humanoid robot. This allows the robot to imitate motions from unseen videos (which are first converted to motion data via a video-to-motion model and then retargeted) or to perform tasks like path following.

    The following sections detail the first two stages, which constitute the core of PHUMA's dataset construction methodology.

4.1.2. Physics-Aware Motion Curation (Section 3.1)

The goal of this stage is to refine raw motion data, which often contains artifacts, making it physically implausible for a humanoid robot. The process targets severe jitter, instabilities from interactions with unmodeled objects, and incorrect foot-ground contact.

  • Low-Pass Noise Filtering for Motion Data: To mitigate high-frequency jitter often present in raw motion data (especially from video-to-motion models), a low-pass Butterworth filter is applied to all motion channels. This filter smooths out rapid, noisy fluctuations while preserving the essential movement patterns.

    • The filter used is a zero-phase, 4th-order Butterworth low-pass filter.
    • The sampling frequency f_s is 30 Hz.
    • For root translation (the overall movement of the robot's base), the cutoff frequency is set to 3 Hz. This means slower movements are preserved, while faster, jerky translations are attenuated.
    • For global orientation (the overall rotation of the robot in space) and body pose (the configuration of its joints), the cutoff frequency is 6 Hz. This allows for more dynamic rotational and pose changes compared to root translation, while still smoothing noise.
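This smoothing step can be sketched in a few lines. The following is a minimal illustration assuming SciPy, using the cutoffs quoted above; the toy signal and the function name `lowpass_zero_phase` are our own, not from the paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 30.0  # sampling frequency of the motion data (Hz)

def lowpass_zero_phase(signal, cutoff_hz, fs=FS, order=4):
    """Zero-phase 4th-order Butterworth low-pass filter along the time axis.

    filtfilt runs the filter forward and then backward, cancelling the
    phase lag so the smoothed motion stays time-aligned with the original.
    """
    b, a = butter(order, cutoff_hz / (0.5 * fs), btype="low")
    return filtfilt(b, a, signal, axis=0)

# Toy example: a slow 1 Hz oscillation of the root with 10 Hz jitter on top.
t = np.arange(0, 4, 1 / FS)                      # 4 s of motion at 30 Hz
root_z = np.sin(2 * np.pi * t) + 0.05 * np.sin(2 * np.pi * 10 * t)
smoothed = lowpass_zero_phase(root_z[:, None], cutoff_hz=3.0)  # 3 Hz for root translation
```

Orientation and body-pose channels would be filtered the same way, with `cutoff_hz=6.0`.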
  • Extracting Ground Contact Information: Establishing a consistent ground plane in the world frame is crucial for correcting foot-ground contact issues like floating and penetration. Raw motions, especially video-derived ones, often lack a true ground reference as they are defined in a camera's coordinate frame.

    1. Foot Vertex Selection (Figure 6 & Table 6): The method identifies a specific subset of SMPL-X foot vertices that are most indicative of ground interaction. These are the 22 vertically lowest vertices from each of the four key foot regions: Left Heel (LH), Left Toe (LT), Right Heel (RH), and Right Toe (RT). This totals 88 vertices. These vertices are illustrated in Figure 6. The following are the results from Table 6 of the original paper:

      | Region | Vertex indices |
      | --- | --- |
      | Left heel | 8888, 8889, 8891, 8909, 8910, 8911, 8913, 8914, 8915, 8916, 8917, 8918, 8919, 8920, 8921, 8922, 8923, 8924, 8925, 8929, 8930, 8934 |
      | Left toe | 5773, 5781, 5782, 5791, 5793, 5805, 5808, 5816, 5817, 5830, 5831, 5859, 5860, 5906, 5907, 5908, 5909, 5912, 5914, 5915, 5916, 5917 |
      | Right heel | 8676, 8677, 8679, 8697, 8698, 8699, 8701, 8702, 8703, 8704, 8705, 8706, 8707, 8708, 8709, 8710, 8711, 8712, 8713, 8714, 8715, 8716 |
      | Right toe | 8467, 8475, 8476, 8485, 8487, 8499, 8502, 8510, 8511, 8524, 8525, 8553, 8554, 8600, 8601, 8602, 8603, 8606, 8608, 8609, 8610, 8611 |

      As shown in Figure 6, these selected vertices accurately represent the ground-contact areas.

      Figure 6: SMPL-X Foot Vertices for Ground-Contact Detection. This figure illustrates the selected foot vertices on the SMPL-X model used to detect ground contact. Green and orange points denote the left heel and left toe, while blue and pink represent the right heel and right toe, respectively. The remaining foot vertices are shown in light-gray. The clusters of colored points correspond to the specific parts of the foot that are used to check for contact with the ground, making the process more accurate and robust than using a single point.

    2. Majority-Voting Scheme for Ground Height: To establish a single, consistent ground plane without introducing jitter, a majority-voting scheme is used:

      • For each frame t, the minimum vertical position (z-coordinate) among the 88 selected foot vertices is recorded as a candidate ground height, g_t.
      • Each candidate g_t is then evaluated by counting the total number of foot vertices (across all frames in the sequence) that fall within a narrow tolerance band of δ = 2.5 cm around that candidate height.
      • The candidate height g* that garners the most votes (i.e., has the most foot vertices consistently near it) is selected as the optimal ground plane.
      • Finally, the entire motion sequence is translated vertically so that g* sits at height zero, establishing a consistent ground reference.
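A minimal sketch of this voting scheme follows; the function name, array layout, and NumPy-based vote counting are our own simplifications, not the paper's implementation:

```python
import numpy as np

def estimate_ground_height(foot_verts_z, tol=0.025):
    """Majority-voting ground-plane estimate.

    foot_verts_z: (T, V) array of vertical positions of the selected foot
    vertices over T frames. tol is the ±2.5 cm tolerance band from the text.
    """
    # Per-frame candidate: the lowest foot vertex in that frame.
    candidates = foot_verts_z.min(axis=1)                       # shape (T,)
    # Vote: how many vertices, over ALL frames, lie within tol of each candidate.
    flat = foot_verts_z.ravel()
    votes = [(np.abs(flat - g) <= tol).sum() for g in candidates]
    # The most-supported candidate becomes the ground height g*.
    return candidates[int(np.argmax(votes))]
```

Subtracting the returned height from every vertical coordinate then places the estimated ground plane at z = 0, as described above.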
  • Filtering Motion Data by Physical Information: After establishing a reliable ground plane, the motion sequences are segmented into 4-second clips. Each clip is then evaluated against specific physical criteria, and those failing to meet the thresholds are discarded. This "chunk-and-filter" process maximizes the retention of viable segments from longer, potentially flawed sequences. The following are the results from Table 7 of the original paper:

    | Metric | Threshold |
    | --- | --- |
    | Root jerk | < 50 m/s³ |
    | Foot contact score | > 0.6 |
    | Minimum pelvis height | > 0.6 m |
    | Maximum pelvis height | < 1.5 m |
    | Pelvis distance to base of support | < 6 cm |
    | Spine1 distance to base of support | < 11 cm |

    The metrics and their thresholds are:

    1. Root jerk: This measures the rate of change of root acceleration. High root jerk indicates abrupt or unnatural movements. Clips with root jerk exceeding 50 m/s³ are discarded to ensure smooth and physically plausible trajectories.
    2. Foot contact score: This metric quantifies the consistency and sufficiency of foot-ground interactions based on graded ground-contact signals. It is calculated as the average maximum contact score across the four foot regions over all frames in a clip: $ \mathrm{Foot\ contact\ score} = \frac{1}{T} \sum_{t=1}^{T} \max\left( c_t^{lh}, c_t^{lt}, c_t^{rh}, c_t^{rt} \right) $ where:
      • T: The total number of frames in the sub-clip.
      • c_t^{lh}, c_t^{lt}, c_t^{rh}, c_t^{rt}: Contact scores at frame t for the Left Heel, Left Toe, Right Heel, and Right Toe regions, respectively. These scores indicate how strongly each region is in contact with the ground (e.g., based on vertex proximity to the ground).
      • max: The maximum contact score among the four foot regions at each frame t. This ensures that if any part of the foot is firmly in contact, that frame contributes positively to the score. A score below 0.6 indicates significant penetration or floating and results in the clip being discarded. The authors note that motions involving airborne phases (like jumps) can still satisfy this criterion as long as contact before and after the airborne phase is consistent.
    3. Minimum pelvis height and Maximum pelvis height: These criteria filter out segments where the humanoid is positioned unnaturally low (e.g., excessively crouched, lying down, or ground penetration) or unnaturally high (e.g., floating or an exaggerated jump that exceeds reasonable bounds). The minimum pelvis height must be greater than 0.6 m, and the maximum pelvis height must be less than 1.5 m.
    4. Pelvis distance to base of support and Spine1 distance to base of support: These metrics assess the robot's balance and stability. The base of support is defined as the convex hull formed by the horizontal-plane projections of the left foot, right foot, left ankle, and right ankle joints. The SMPL-X model's center of mass (CoM) typically lies between the pelvis and spine1 joints. Large deviations of these joints' horizontal projections from the base of support indicate imbalance or instability that would be infeasible for humanoids.
      • Pelvis distance to base of support must be less than 6 cm.
      • Spine1 distance to base of support must be less than 11 cm.
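The chunk-and-filter pass over 4-second clips can be sketched as follows. This is a simplified illustration, not the authors' implementation: the base-of-support checks are omitted, root jerk is approximated by scaled third finite differences, and all names are our own:

```python
import numpy as np

# Thresholds from Table 7, applied per 4-second clip.
MAX_ROOT_JERK = 50.0               # m/s^3
MIN_CONTACT_SCORE = 0.6
PELVIS_HEIGHT_RANGE = (0.6, 1.5)   # m

def foot_contact_score(contacts):
    """contacts: (T, 4) graded contact scores for LH, LT, RH, RT.
    Each frame is scored by its best-contacting foot region."""
    return contacts.max(axis=1).mean()

def keep_clip(root_pos, contacts, pelvis_z, fs=30.0):
    """Return True if a clip passes the (simplified) physical filters."""
    # Root jerk: third finite difference of position, scaled by fs^3.
    jerk = np.diff(root_pos, n=3, axis=0) * fs**3
    if np.abs(jerk).max() > MAX_ROOT_JERK:
        return False
    if foot_contact_score(contacts) < MIN_CONTACT_SCORE:
        return False
    lo, hi = PELVIS_HEIGHT_RANGE
    return bool(pelvis_z.min() > lo and pelvis_z.max() < hi)
```

Running this over every 4-second segment of a long sequence retains the viable chunks while discarding only the flawed ones, as the text describes.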
  • Data Augmentation: Finally, these curated motions are augmented with data from other high-quality sources such as LaFAN1, LocoMuJoCo, and the authors' own video captures, contributing to the diversity and scale of the final PHUMA dataset.

4.1.3. Physics-Constrained Motion Retargeting (PhySINK) (Section 3.2)

This section details the PhySINK method, which extends Shape-Adaptive Inverse Kinematics (SINK) with explicit physical constraints to produce motions that are both stylistically faithful and physically plausible. PhySINK optimizes the humanoid joint positions ($q_t$) and root translation ($\gamma_t$) over time $t$.

The following are the results from Figure 4 of the original paper:

Figure 4: Common physical artifacts in motion retargeting. From left to right: Motion Mismatch, Joint Violation, Floating, Penetration, and Skating.

  • Motion Fidelity Loss ($\mathcal{L}_{\text{Fidelity}}$): This component ensures that the retargeted motion closely matches the source human motion in both global joint positions and local limb orientations, and includes a smoothness term. It is defined as: $ \mathcal{L}_{\text{global-match}} = \sum_{t} \sum_{i} \left\| p_i^{\text{SMPL-X}}(t) - p_i^{\text{Humanoid}}(t) \right\|_1 $ $ \mathcal{L}_{\text{local-match}} = \sum_{t} \sum_{i \neq j} m_{ij} \underbrace{\left\| \Delta p_{ij}^{\text{SMPL-X}}(t) - \Delta p_{ij}^{\text{Humanoid}}(t) \right\|_2^2}_{\text{position}} + \sum_{t} \sum_{i \neq j} m_{ij} \underbrace{\left( 1 - \frac{\Delta p_{ij}^{\text{SMPL-X}}(t) \cdot \Delta p_{ij}^{\text{Humanoid}}(t)}{\left\| \Delta p_{ij}^{\text{SMPL-X}}(t) \right\| \left\| \Delta p_{ij}^{\text{Humanoid}}(t) \right\|} \right)}_{\text{orientation}} $ $ \mathcal{L}_{\text{smooth}} = \sum_{t} \left\| \dot{q}_t - 2\dot{q}_{t+1} + \dot{q}_{t+2} \right\|_1 + \sum_{t} \left\| \dot{\gamma}_t - 2\dot{\gamma}_{t+1} + \dot{\gamma}_{t+2} \right\|_1 $ $ \mathcal{L}_{\text{Fidelity}} = w_{\text{global-match}} \mathcal{L}_{\text{global-match}} + w_{\text{local-match}} \mathcal{L}_{\text{local-match}} + w_{\text{smooth}} \mathcal{L}_{\text{smooth}} $ where:

    • $t$: Time index (frame number).
    • $i$: Joint index.
    • $p_i^{\text{SMPL-X}}(t)$: Global 3D position of joint $i$ at time $t$ for the source SMPL-X human model.
    • $p_i^{\text{Humanoid}}(t)$: Global 3D position of joint $i$ at time $t$ for the target humanoid robot.
    • $\|\cdot\|_1$: L1-norm (Manhattan distance), used for the global-match loss.
    • $\Delta p_{ij}$: Position difference (vector) between joint $i$ and joint $j$.
    • $m_{ij}$: A binary mask that equals 1 if joints $i$ and $j$ are immediate neighbors in the humanoid kinematic tree (i.e., connected by a bone), and 0 otherwise. This restricts the local-match loss to the relative configuration of connected limbs.
    • $\|\cdot\|_2^2$: Squared L2-norm (Euclidean distance), used for the position component of the local-match loss.
    • The orientation term in $\mathcal{L}_{\text{local-match}}$ is a cosine-similarity term, $(1 - \cos\theta)$, where $\cos\theta = \frac{\Delta p_{ij}^{\text{SMPL-X}}(t) \cdot \Delta p_{ij}^{\text{Humanoid}}(t)}{\|\Delta p_{ij}^{\text{SMPL-X}}(t)\| \, \|\Delta p_{ij}^{\text{Humanoid}}(t)\|}$. It penalizes differences in the orientation of the vectors connecting neighboring joints in the human and humanoid models, preserving local limb orientations.
    • $\dot{q}_t$: Joint-angle velocities at time $t$.
    • $\dot{\gamma}_t$: Root-translation velocity at time $t$.
    • $\|\dot{q}_t - 2\dot{q}_{t+1} + \dot{q}_{t+2}\|_1$ and $\|\dot{\gamma}_t - 2\dot{\gamma}_{t+1} + \dot{\gamma}_{t+2}\|_1$: Second-order finite differences of the velocities (i.e., discrete jerk-like terms). Minimizing them suppresses abrupt velocity changes, preventing jerky motions.
    • $w_{\text{global-match}}, w_{\text{local-match}}, w_{\text{smooth}}$: Weighting coefficients that set the relative importance of each term in the overall fidelity loss.
    • Motion Fidelity (%): The average percentage of frames where the mean per-joint position error is below $10\:\mathrm{cm}$ and the mean per-link orientation error is below 10 degrees.
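To make the fidelity terms concrete, here is a minimal NumPy sketch of the global-match and smoothness components (the local-match term is omitted for brevity; function and weight names are illustrative, not from the paper's code):

```python
import numpy as np

def fidelity_loss(p_src, p_tgt, q, gamma, w_global=1.0, w_smooth=1.0):
    """Illustrative fidelity terms: global-match (L1) plus smoothness.

    p_src, p_tgt: (T, J, 3) global joint positions (SMPL-X / humanoid)
    q:            (T, D)    humanoid joint angles
    gamma:        (T, 3)    root translation
    """
    # Global match: L1 distance between corresponding joints, summed over t, i.
    l_global = np.abs(p_src - p_tgt).sum()
    # Smoothness: L1 norm of second finite differences of the velocities.
    q_dot = np.diff(q, axis=0)
    g_dot = np.diff(gamma, axis=0)
    l_smooth = (np.abs(q_dot[:-2] - 2 * q_dot[1:-1] + q_dot[2:]).sum()
                + np.abs(g_dot[:-2] - 2 * g_dot[1:-1] + g_dot[2:]).sum())
    return w_global * l_global + w_smooth * l_smooth
```

A constant-velocity trajectory with perfectly matched joints incurs zero loss, as expected from the definitions.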
  • Joint Feasibility Loss ($\mathcal{L}_{\text{Feasibility}}$): This loss penalizes joint angles ($q_t$) and joint velocities ($\dot{q}_t$) that approach or exceed the predefined operational limits of the humanoid robot, preventing joint violations. $ \mathcal{L}_{\text{position-violation}} = \sum_t \left[ \max(0,\, q_t - 0.98\, q_{\max}) + \max(0,\, 0.98\, q_{\min} - q_t) \right] $ $ \mathcal{L}_{\text{velocity-violation}} = \sum_t \left[ \max(0,\, \dot{q}_t - 0.98\, \dot{q}_{\max}) + \max(0,\, 0.98\, \dot{q}_{\min} - \dot{q}_t) \right] $ $ \mathcal{L}_{\text{Feasibility}} = \mathcal{L}_{\text{position-violation}} + \mathcal{L}_{\text{velocity-violation}} $ where:

    • $q_t$: Joint angle at time $t$.
    • $q_{\max}$: Upper mechanical limit for the joint angle.
    • $q_{\min}$: Lower mechanical limit for the joint angle.
    • $\max(0, x)$: A rectified-linear (ReLU-like) function that penalizes only positive violations.
    • The terms $0.98\,q_{\max}$ and $0.98\,q_{\min}$ introduce a soft limit, penalizing angles within $2\%$ of the physical limit so the optimization stays comfortably within bounds rather than riding against them.
    • $\dot{q}_t$: Joint angular velocity at time $t$.
    • $\dot{q}_{\max}$: Upper mechanical limit for joint angular velocity.
    • $\dot{q}_{\min}$: Lower mechanical limit for joint angular velocity.
    • Joint Feasibility (%): The percentage of frames where all joint positions and velocities remain within $98\%$ of their mechanical limits.
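The soft-limit hinge penalties translate directly into code; a minimal sketch assuming per-frame arrays of angles and velocities (names illustrative):

```python
import numpy as np

def feasibility_loss(q, q_min, q_max, qd, qd_min, qd_max, margin=0.98):
    """Hinge penalties for joint angles/velocities outside 98% soft limits.

    q, qd:            (T, D) joint angles and angular velocities
    q_min/q_max etc.: (D,)   mechanical limits per joint
    """
    # Position violations: only values beyond the 98% band are penalized.
    pos = (np.maximum(0.0, q - margin * q_max)
           + np.maximum(0.0, margin * q_min - q)).sum()
    # Velocity violations, same hinge form.
    vel = (np.maximum(0.0, qd - margin * qd_max)
           + np.maximum(0.0, margin * qd_min - qd)).sum()
    return pos + vel
```

The hinge form makes the loss zero (and gradient-free) for any frame that stays inside the soft band, so it only shapes the trajectory where a violation actually occurs.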
  • Grounding Loss ($\mathcal{L}_{\text{Ground}}$): This loss addresses floating and penetration artifacts by enforcing that the humanoid's foot regions remain on the ground plane during frames where contact is detected. $ \mathcal{L}_{\text{Ground}} = \sum_{i \in \{\text{LH}, \text{LT}, \text{RH}, \text{RT}\}} \sum_t c_t^i \left\| p_t^i(z) \right\|_2^2 $ where:

    • $i \in \{\text{LH}, \text{LT}, \text{RH}, \text{RT}\}$: Index over the four foot regions (left heel, left toe, right heel, right toe).
    • $t$: Time index.
    • $c_t^i$: Contact score for foot region $i$ at frame $t$: typically 1 if the region is in contact with the ground and 0 otherwise (or a continuous value indicating degree of contact).
    • $p_t^i(z)$: Vertical position ($z$-coordinate) of foot region $i$ at time $t$; the loss penalizes it whenever contact is detected.
    • $\|\cdot\|_2^2$: Squared L2-norm.
    • Non-Floating (%): Percentage of contact frames where the foot is within $1\:\mathrm{cm}$ above the ground.
    • Non-Penetration (%): Percentage of contact frames where the foot is within $1\:\mathrm{cm}$ below the ground.
  • Skating Loss ($\mathcal{L}_{\text{Skate}}$): This loss prevents foot sliding by penalizing the horizontal velocity of any foot region detected to be in contact with the ground. $ \mathcal{L}_{\text{Skate}} = \sum_{i \in \{\text{LH}, \text{LT}, \text{RH}, \text{RT}\}} \sum_t c_t^i \left\| \dot{p}_t^i(x, y) \right\|_2 $ where:

    • $i \in \{\text{LH}, \text{LT}, \text{RH}, \text{RT}\}$: Index over the four foot regions.
    • $t$: Time index.
    • $c_t^i$: Contact score for foot region $i$ at frame $t$; the loss is active only when contact is detected.
    • $\dot{p}_t^i(x, y)$: Horizontal velocity (in the $x$–$y$ plane) of foot region $i$ at time $t$.
    • $\|\cdot\|_2$: L2-norm.
    • Non-Skating (%): Percentage of contact frames where the foot's horizontal velocity is below $10\:\mathrm{cm/s}$.
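The grounding and skating penalties share the same contact-masking pattern; a minimal NumPy sketch assuming per-frame foot kinematics (array layout and names are illustrative):

```python
import numpy as np

def ground_and_skate_losses(foot_z, foot_vxy, contact):
    """Contact-masked penalties for floating/penetration and sliding.

    foot_z:   (T, 4)    vertical positions of [LH, LT, RH, RT]
    foot_vxy: (T, 4, 2) horizontal velocities of the same regions
    contact:  (T, 4)    contact scores in [0, 1]
    """
    # Grounding: squared height of any region scored as in contact.
    l_ground = (contact * foot_z ** 2).sum()
    # Skating: horizontal speed (L2 norm) of any region scored as in contact.
    speed = np.linalg.norm(foot_vxy, axis=-1)
    l_skate = (contact * speed).sum()
    return l_ground, l_skate
```

Because both terms are gated by the contact score, swing-phase feet (contact ≈ 0) are free to lift and move, while stance-phase feet are pinned to the plane and held still.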
  • PhySINK Objective ($\mathcal{L}_{\text{PhySINK}}$): The complete PhySINK objective is a weighted sum of the motion fidelity loss and the physical constraint terms. By optimizing this augmented objective, PhySINK generates motions that maintain kinematic similarity to the source while remaining physically plausible. $ \mathcal{L}_{\text{PhySINK}} = \mathcal{L}_{\text{Fidelity}} + w_{\text{Feasibility}} \mathcal{L}_{\text{Feasibility}} + w_{\text{Ground}} \mathcal{L}_{\text{Ground}} + w_{\text{Skate}} \mathcal{L}_{\text{Skate}} $ where:

    • $w_{\text{Feasibility}}, w_{\text{Ground}}, w_{\text{Skate}}$: Weighting coefficients for the Joint Feasibility, Grounding, and Skating losses, respectively, determining the importance of each physical constraint during optimization. The baseline SINK method uses only $\mathcal{L}_{\text{Fidelity}}$.

      This detailed integration of physical constraints directly into the retargeting optimization is what distinguishes PhySINK and enables the creation of the physically-grounded PHUMA dataset. The authors validate PhySINK by retargeting PHUMA to two Unitree robots (G1 and H1-2) and comparing it against a standard IK solver and the SINK framework, showing progressive performance enhancement with the addition of each physical constraint loss (Table 2 in results).

5. Experimental Setup

The experiments are designed to evaluate the effectiveness of PhySINK and the PHUMA dataset across three main research questions: (RQ1) PhySINK vs. established retargeting methods; (RQ2) PHUMA vs. prior datasets for motion imitation; and (RQ3) PHUMA's impact on path-following with simplified control.

5.1. Datasets

The following datasets were used in the experiments:

  • PHUMA Dataset:

    • Source: The paper's novel dataset. It aggregates motions from LocoMuJoCo, GRAB, EgoBody, LAFAN1, AMASS (all motion capture), HAA500, Motion-X Video, HuMMan, AIST, IDEA400 (all human video), and PHUMA Video (authors' own video captures).

    • Scale: Large-scale, containing 73.0 hours of physically-grounded motion across 76.0K clips.

    • Characteristics: Designed to be both large-scale and high-quality (physically reliable), correcting for physical artifacts through curation and PhySINK.

    • Why chosen: It is the primary subject of the paper, demonstrating the efficacy of the proposed methodology. The following are the results from Table 1 of the original paper:

      | Dataset | # Clip | # Frame | Duration | Source |
      | --- | --- | --- | --- | --- |
      | LocoMuJoCo (Al-Hafez et al., 2023) | 0.78K | 0.93M | 0.86h | Motion Capture |
      | GRAB (Taheri et al., 2020) | 1.73K | 0.20M | 1.88h | Motion Capture |
      | EgoBody (Zhang et al., 2022) | 2.12K | 0.24M | 2.19h | Motion Capture |
      | LAFAN1 (Harvey et al., 2020) | 2.18K | 0.26M | 2.40h | Motion Capture |
      | AMASS (Mahmood et al., 2019) | 21.73K | 2.25M | 20.86h | Motion Capture |
      | HAA500 (Chung et al., 2021) | 1.76K | 0.11M | 1.01h | Human Video |
      | Motion-X Video (Lin et al., 2023) | 33.04K | 3.45M | 31.98h | Human Video |
      | HuMMan (Cai et al., 2022) | 0.50K | 0.05M | 0.47h | Human Video |
      | AIST (Tsuchida et al., 2019) | 1.75K | 0.18M | 1.66h | Human Video |
      | IDEA400 (Lin et al., 2023) | 9.94K | 0.98M | 9.10h | Human Video |
      | PHUMA Video | 0.50K | 0.06M | 0.56h | Human Video |
      | PHUMA | 76.01K | 7.88M | 72.96h | |
  • LaFAN1 (Harvey et al., 2020):

    • Source: Motion capture.
    • Scale: Small-scale (2.4 hours).
    • Characteristics: High-quality, but limited in diversity.
    • Why chosen: Represents a baseline for high-quality but small-scale motion data.
  • AMASS (Mahmood et al., 2019):

    • Source: Motion capture.
    • Scale: Medium-scale (20.9 hours).
    • Characteristics: Widely used and of moderate quality, but limited in diversity (primarily walking motions).
    • Why chosen: A common benchmark, representing the state-of-the-art in widely available MoCap data. Used with the SINK retargeting method for fair comparison, as it provides human motion source data.
  • Humanoid-X (Mao et al., 2025):

    • Source: Large-scale internet videos converted to motion.
    • Scale: Large-scale (231.4 hours).
    • Characteristics: High diversity, but lower quality due to physical artifacts (floating, penetration, skating, joint violations) inherent in video-to-motion conversion and unconstrained retargeting.
    • Why chosen: Represents the trend of scaling data with video, highlighting the problems PHUMA aims to solve. Used directly as a pre-existing humanoid dataset.
  • Self-Collected Video Dataset (for evaluation):

    • Source: Custom video recordings by the authors.

    • Scale: 504 sequences across 11 motion types.

    • Characteristics: Designed for unbiased evaluation on completely unseen motions not present in any training data. It has a uniform distribution across 11 motion types.

    • Why chosen: To ensure fair evaluation of generalization capabilities beyond the training data. As can be seen from the following Figure 9, this dataset is processed through a pipeline:

      Figure 9: Overview of the Self-collected Data Pipeline. This figure illustrates the three main steps of our data collection pipeline: (left) a self-recorded video of a human motion, (center) the motion extracted using a video-to-motion model, and (right) the final motion retargeted to a humanoid robot.

    Example of data sample: A self-recorded video of a human performing a specific motion (e.g., a jump), which is then converted to SMPL parameters using TRAM (Wang et al., 2024), and finally retargeted to the humanoid robot using PhySINK.

5.2. Evaluation Metrics

For every evaluation metric, the paper implicitly uses specific definitions and thresholds.

  • Success Rate (for Motion Tracking):

    1. Conceptual Definition: This metric quantifies the effectiveness of a policy in imitating a target motion. It measures the proportion of motion sequences where the robot successfully tracks the reference motion within a specified deviation threshold over its entire duration. A higher success rate indicates better imitation performance.
    2. Mathematical Formula: While not explicitly provided as a formula in the paper, it is defined conceptually as: $ \text{Success Rate} = \frac{\text{Number of successfully imitated motions}}{\text{Total number of evaluated motions}} \times 100\% $ A motion is considered "successfully imitated" if the mean per-joint position error (or pelvis tracking error for path following) remains below a certain threshold for the entire duration of the motion.
    3. Symbol Explanation:
      • Number of successfully imitated motions: The count of individual motion sequences where the policy achieves the tracking goal within the defined error tolerance.
      • Total number of evaluated motions: The total count of motion sequences used for testing.
    • Thresholds:
      • Standard Threshold: $0.5\:\mathrm{m}$. This is a conventional threshold used in prior motion imitation studies. However, the authors argue it incorrectly classifies scenarios as successful when humanoids remain stationary during jumps or stay upright during squatting motions, failing to capture true imitation quality.
      • Stricter Threshold (Proposed): $0.15\:\mathrm{m}$. The authors adopt this stricter threshold for a more meaningful measure of imitation quality and clearer performance differences. It is applied to the mean per-joint position error for full-state tracking, and to the pelvis tracking error for path following.
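The metric is straightforward to compute once per-frame tracking errors are available; a minimal sketch (per-clip error arrays and the function name are illustrative assumptions):

```python
import numpy as np

def success_rate(per_frame_errors, threshold=0.15):
    """Percentage of clips whose mean per-joint position error stays below
    `threshold` (metres) for every frame of the motion.

    per_frame_errors: list of (T_k,) arrays, one per evaluated motion
    """
    successes = sum(1 for e in per_frame_errors if np.all(e < threshold))
    return 100.0 * successes / len(per_frame_errors)
```

A single frame over the threshold marks the whole clip as a failure, which is what makes the 0.15 m variant so much stricter than the 0.5 m one.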
  • Pelvis Path Following Success Rate:

    1. Conceptual Definition: This metric is a specialized success rate for the path following task. It measures how effectively the policy can track the pelvis trajectory of a reference motion, even with pelvis-only guidance. A motion is successful if the robot's pelvis stays within the deviation threshold of the target pelvis trajectory throughout the sequence.
    2. Mathematical Formula: Conceptually similar to the general success rate, but specifically focused on pelvis error: $ \text{Pelvis Path Following Success Rate} = \frac{\text{Number of motions with successful pelvis tracking}}{\text{Total number of evaluated motions}} \times 100\% $
    3. Symbol Explanation:
      • Number of motions with successful pelvis tracking: Count of motions where the robot's pelvis position stays within the $0.15\:\mathrm{m}$ threshold of the target pelvis trajectory.
      • Total number of evaluated motions: Total motions tested.
  • Motion Categories (for evaluation breakdown): To assess how well policies generalize across different types of human movement, evaluations are organized into four categories:

    • Stationary: Motions involving minimal displacement, such as stand and reach.
    • Angular: Motions with significant rotational components or changes in body orientation, like bend, twist, turn, and kick.
    • Vertical: Motions involving significant vertical displacement of the body, such as squat, lunge, and jump.
    • Horizontal: Motions primarily involving horizontal translation, like walk and run.

5.3. Baselines

The paper compares its methodology against several representative baselines:

  • Retargeting Methods (for RQ1):

    • IK (Inverse Kinematics): A standard, general-purpose IK solver (specifically, mink by Zakka, 2025). This represents a basic approach to mapping human end-effector positions to robot joint angles, often without explicit consideration of body proportions or physical constraints.
    • SINK (Shape-Adaptive Inverse Kinematics): This represents a learning-based IK method that adapts the source human model to match the body shape and limb proportions of the target robot. It aims to preserve motion style better than traditional IK but is physically under-constrained.
  • Prior Datasets (for RQ2 and RQ3):

    • LaFAN1 (Harvey et al., 2020): A small-scale, high-quality motion capture dataset. It serves as a baseline for systems trained on limited but clean data.
    • AMASS (Mahmood et al., 2019): A medium-scale, moderate-quality motion capture dataset. It's a widely-used benchmark. For comparison, AMASS data is retargeted using SINK (the best method prior to PhySINK for preserving style).
    • Humanoid-X (Mao et al., 2025): A large-scale, lower-quality dataset derived from internet videos. It represents the trade-off of scale over data fidelity, serving as a direct contrast to PHUMA's physically-grounded approach.

5.4. Training Framework and Robots

  • Training Framework: All policies are trained using the MaskedMimic framework (Tessler et al., 2024).

    • Full-State Motion Tracking (RQ1, RQ2): Policies receive the current proprioceptive state ($s_t^p$) (joint positions, orientations, velocities) and full goal states ($s_t^g$) (target motion trajectories), and output joint angle commands ($a_t$) executed via PD controllers. The reward function measures how well the humanoid matches the target motion.
    • Partial-State Protocol / Pelvis-Only Guidance (RQ3): This involves a teacher-student learning approach:
      1. A full-state teacher policy is first trained on full-body reference motion data.
      2. A student policy is then trained using knowledge distillation to mimic the teacher's actions, but it receives only pelvis position and rotation as input. This enables pelvis path-following control while maintaining humanlike movement.
    • Reinforcement Learning Algorithm: PPO (Proximal Policy Optimization) (Schulman et al., 2017) is used for training.
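The teacher-student step is standard knowledge distillation: the student sees a masked observation and regresses onto the teacher's actions. A minimal sketch, where the observation layout (pelvis position + 6D rotation in the first 9 entries) and the MSE objective are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def pelvis_only_obs(full_goal_obs):
    """Mask a full goal observation down to pelvis information only.
    Assumed layout: [pelvis_pos(3), pelvis_rot6d(6), other bodies...]."""
    return full_goal_obs[..., :9]

def distillation_loss(student_actions, teacher_actions):
    """Regress student actions (pelvis-only input) onto teacher actions
    (full-state input); mean-squared error is one common choice."""
    return np.mean((student_actions - teacher_actions) ** 2)
```

During student training, each state is fed to the teacher in full and to the student through `pelvis_only_obs`, and the loss above is minimized over collected rollouts.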
  • Simulation Environment: All experiments are conducted in the IsaacGym simulator.

  • Humanoid Robots:

    • Unitree G1: A humanoid robot with 29 Degrees of Freedom (DoF).
    • Unitree H1-2: A humanoid robot with 21 DoF (excluding wrist joints). These robots represent common humanoid platforms for locomotion research.

5.5. Observation Space Compositions

The observation space for the RL agent comprises two main components: proprioceptive states (what the robot senses about itself) and goal states (information about the target motion).

The following are the results from Table 8 of the original paper:

| State | G1 | H1-2 |
| --- | --- | --- |
| (a) Proprioceptive State | | |
| Root height | 1 | 1 |
| Body position | 32 × 3 | 24 × 3 |
| Body rotation | 33 × 6 | 25 × 6 |
| Body velocity | 33 × 3 | 25 × 3 |
| Body angular velocity | 33 × 3 | 25 × 3 |
| (b) Goal State | | |
| Relative body position | 33 × 15 × 3 | 25 × 15 × 3 |
| Absolute body position | 33 × 15 × 3 | 25 × 15 × 3 |
| Relative body rotation | 33 × 15 × 6 | 25 × 15 × 6 |
| Absolute body rotation | 33 × 15 × 6 | 25 × 15 × 6 |
| Time | 33 × 15 × 1 | 25 × 15 × 1 |
| Total dim | 9898 | 7498 |
  • Proprioceptive States: These are measurements from the robot's own sensors about its current state.

    • Root height: The vertical position of the robot's root (usually the pelvis). Dimension: 1.
    • Body position: The 3D positions of the bodies (links or segments) in the robot's kinematic chain, excluding the root. G1 has 32 such bodies ($32 \times 3 = 96$ dimensions); H1-2 has 24 ($24 \times 3 = 72$ dimensions).
    • Body rotation: The orientation of all bodies (including the root), in a 6D rotation representation (e.g., two columns of a rotation matrix). G1 has 33 bodies ($33 \times 6 = 198$ dimensions); H1-2 has 25 ($25 \times 6 = 150$ dimensions).
    • Body velocity: The linear velocity of all bodies (including the root). G1: $33 \times 3 = 99$ dimensions; H1-2: $25 \times 3 = 75$ dimensions.
    • Body angular velocity: The angular velocity of all bodies (including the root). G1: $33 \times 3 = 99$ dimensions; H1-2: $25 \times 3 = 75$ dimensions.
  • Goal States: Information about the target motion trajectory that the robot needs to imitate. This information is typically provided for a future horizon (e.g., 15 timesteps ahead).

    • Relative body position: The difference between the next 15 timesteps of reference motion states and the robot's current proprioceptive state, indicating where the target body parts should be relative to the robot's current pose.
      • Dimension: $33 \times 15 \times 3$ for G1, $25 \times 15 \times 3$ for H1-2.
    • Absolute body position: The 3D positions of the target body parts relative to the reference motion's root position, providing a root-relative coordinate frame that is robust to global shifts.
      • Dimension: $33 \times 15 \times 3$ for G1, $25 \times 15 \times 3$ for H1-2.
    • Relative body rotation: The difference in orientation between the next 15 timesteps of reference motion states and the robot's current proprioceptive state (6D rotation representation).
      • Dimension: $33 \times 15 \times 6$ for G1, $25 \times 15 \times 6$ for H1-2.
    • Absolute body rotation: The orientations of the target body parts relative to the reference motion's root orientation (6D rotation representation).
      • Dimension: $33 \times 15 \times 6$ for G1, $25 \times 15 \times 6$ for H1-2.
    • Time: Time-related information, indicating progress or phase within the motion sequence.
      • Dimension: $33 \times 15 \times 1$ for G1, $25 \times 15 \times 1$ for H1-2.

        The total observation dimension works out to 9,898 for G1 (and, summing the same per-term sizes, 7,498 for H1-2), a very high-dimensional state representation.
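The totals can be sanity-checked directly from the per-term sizes; a short helper (the function name and body counts are taken from the text above):

```python
def obs_dim(n_bodies_rot, n_bodies_pos, horizon=15):
    """Total observation dimension from the per-term sizes in Table 8.

    n_bodies_rot: bodies including the root (rotations/velocities)
    n_bodies_pos: bodies excluding the root (positions)
    """
    proprio = (1                      # root height
               + n_bodies_pos * 3     # body position
               + n_bodies_rot * 6     # body rotation (6D)
               + n_bodies_rot * 3     # body velocity
               + n_bodies_rot * 3)    # body angular velocity
    goal = (n_bodies_rot * horizon * 3 * 2    # relative + absolute position
            + n_bodies_rot * horizon * 6 * 2  # relative + absolute rotation
            + n_bodies_rot * horizon * 1)     # time
    return proprio + goal
```

With 33/32 bodies this reproduces G1's 9,898-dimensional observation; the same formula with 25/24 bodies gives 7,498 for H1-2.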

5.6. Reward Function

The reward function guides the reinforcement learning agent during training, shaping its behavior to effectively track the target motion. It comprises motion tracking task rewards and regularization rewards.

The following are the results from Table 9 of the original paper:

| Term | Expression | Weight |
| --- | --- | --- |
| (a) Task | | |
| Global body position | $\exp(-100 \cdot \Vert p_t - p_t^* \Vert^2)$ | 0.5 |
| Root height | $\exp(-100 \cdot (h_{\mathrm{root}} - h_{\mathrm{root}}^*)^2)$ | 0.2 |
| Global body rotation | $\exp(-10 \cdot \Vert \theta_t - \theta_t^* \Vert^2)$ | 0.3 |
| Global body velocity | $\exp(-0.5 \cdot \Vert v_t - v_t^* \Vert^2)$ | 0.1 |
| Global body angular velocity | $\exp(-0.1 \cdot \Vert \omega_t - \omega_t^* \Vert^2)$ | 0.1 |
| (b) Regularization | | |
| Power consumption | $\Vert F \cdot \dot{q} \Vert_1$ | -1e-05 |
| Action rate | $\Vert a_t - a_{t-1} \Vert^2$ | -0.2 |
  • Motion Tracking Task Rewards: These terms encourage the policy to precisely match the reference motion. They are typically exponentially decaying functions of the error, meaning smaller errors yield much higher rewards.

    • Global body position: Rewards aligning the positions of all robot bodies with the target body positions.
      • Expression: $\exp(-100 \cdot \Vert p_t - p_t^* \Vert^2)$
      • Weight: 0.5
      • $p_t$: Current global body position; $p_t^*$: target global body position.
      • $\Vert \cdot \Vert^2$: Squared L2-norm of the position error.
      • The large coefficient (100) heavily penalizes even small position errors, encouraging precise matching.
    • Root height: Rewards maintaining the correct vertical position of the robot's root (pelvis).
      • Expression: $\exp(-100 \cdot (h_{\mathrm{root}} - h_{\mathrm{root}}^*)^2)$
      • Weight: 0.2
      • $h_{\mathrm{root}}$: Current root height; $h_{\mathrm{root}}^*$: target root height.
      • As with body position, a high penalty coefficient is applied to height error.
    • Global body rotation: Rewards aligning the orientations of all robot bodies with the target orientations.
      • Expression: $\exp(-10 \cdot \Vert \theta_t - \theta_t^* \Vert^2)$
      • Weight: 0.3
      • $\theta_t$: Current global body rotation; $\theta_t^*$: target global body rotation.
      • $\Vert \cdot \Vert^2$: Squared L2-norm of the rotation error (e.g., quaternion or 6D rotation error). The coefficient (10) is smaller than for position, indicating slightly lower sensitivity to rotational errors.
    • Global body velocity: Rewards matching the linear velocities of all robot bodies with the target velocities.
      • Expression: $\exp(-0.5 \cdot \Vert v_t - v_t^* \Vert^2)$
      • Weight: 0.1
      • $v_t$: Current global body linear velocity; $v_t^*$: target global body linear velocity.
    • Global body angular velocity: Rewards matching the angular velocities of all robot bodies with the target angular velocities.
      • Expression: $\exp(-0.1 \cdot \Vert \omega_t - \omega_t^* \Vert^2)$
      • Weight: 0.1
      • $\omega_t$: Current global body angular velocity; $\omega_t^*$: target global body angular velocity.
      • The smaller coefficients for the velocity terms (0.5 and 0.1) suggest that exact velocity matching is less critical than correct positions and orientations.
  • Regularization Rewards: These terms penalize undesirable behaviors to promote smooth and stable motion execution.

    • Power consumption: Penalizes the power exerted by the robot's motors. This encourages energy-efficient movements.
      • Expression: $\Vert F \cdot \dot{q} \Vert_1$ (actuator torque times joint velocity, an L1 measure of mechanical power)
      • Weight: $-1\mathrm{e}{-05}$
      • $F$: Torque applied by the actuators.
      • $\dot{q}$: Joint velocity.
      • The small negative weight gently encourages the robot to minimize energy use without sacrificing task performance.
    • Action rate: Penalizes large changes between consecutive actions (joint commands). This helps to ensure smooth joint movements and prevent abrupt motion transitions, which can lead to instability or hardware wear.
      • Expression: $\Vert a_t - a_{t-1} \Vert^2$
      • Weight: -0.2
      • $a_t$: Current action (joint command); $a_{t-1}$: previous action.
      • $\Vert \cdot \Vert^2$: Squared L2-norm of the difference between consecutive actions.

        The overall reward is the sum of all these terms, weighted by their respective coefficients.
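The weighted sum can be sketched directly; a minimal NumPy version assuming the scalar squared errors have already been computed per timestep (the function name and argument layout are illustrative):

```python
import numpy as np

def tracking_reward(pos_err2, h_err2, rot_err2, vel_err2, angvel_err2,
                    power, action_diff2):
    """Weighted sum of the exponential task rewards and regularizers
    from Table 9; each *_err2 is a precomputed squared error."""
    return (0.5 * np.exp(-100.0 * pos_err2)       # global body position
            + 0.2 * np.exp(-100.0 * h_err2)       # root height
            + 0.3 * np.exp(-10.0 * rot_err2)      # global body rotation
            + 0.1 * np.exp(-0.5 * vel_err2)       # global body velocity
            + 0.1 * np.exp(-0.1 * angvel_err2)    # global body angular velocity
            - 1e-5 * power                        # power consumption penalty
            - 0.2 * action_diff2)                 # action-rate penalty
```

Perfect tracking with no regularization penalty yields the maximum task reward of 0.5 + 0.2 + 0.3 + 0.1 + 0.1 = 1.2 per step.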

5.7. PPO Hyperparameter

The Proximal Policy Optimization (PPO) algorithm is used for training the policies. The following are the detailed hyperparameter settings:

The following are the results from Table 10 of the original paper:

| Hyperparameter | Value |
| --- | --- |
| Optimizer | Adam |
| Num envs | 8192 |
| Mini batches | 32 |
| Learning epochs | 1 |
| Entropy coefficient | 0.0 |
| Value loss coefficient | 0.5 |
| Clip param | 0.2 |
| Max grad norm | 50.0 |
| Init noise std | -2.9 |
| Actor learning rate | 2e-5 |
| Critic learning rate | 1e-4 |
| GAE decay factor (λ) | 0.95 |
| GAE discount factor (γ) | 0.99 |
| Actor Transformer dimension | 512 |
| Actor layers | 4 |
| Actor heads | 4 |
| Critic MLP size | [1024, 1024, 1024, 1024] |
| Activation | ReLU |
  • Optimizer: Adam. A popular optimization algorithm in deep learning, known for its efficiency and good performance.
  • Num envs: 8192. This refers to the number of parallel simulation environments run simultaneously during training. A large number of environments helps with data collection efficiency and exploration in RL.
  • Mini Batches: 32. The number of samples (trajectories or state-action pairs) processed in each update step of the neural network.
  • Learning epochs: 1. The number of times the collected data is iterated over for policy updates per PPO iteration.
  • Entropy coefficient: 0.0. This term encourages exploration during training by adding a bonus for higher entropy (more randomness) in the policy's actions. A value of 0.0 means this regularization is not used in this specific setup.
  • Value loss coefficient: 0.5. This weights the value function loss component in the overall PPO objective. The value function estimates the expected future reward from a given state.
  • Clip param: 0.2. This is the ϵ\epsilon parameter in PPO's clipped surrogate objective, which limits the policy update size to prevent large, destabilizing changes.
  • Max grad norm: 50.0. Gradient clipping is used to prevent exploding gradients by scaling down gradients if their L2 norm exceeds this threshold.
  • Init noise std: -2.9. This is most plausibly the logarithm of the initial standard deviation of the policy's Gaussian action noise, giving an effective initial std of $\exp(-2.9) \approx 0.055$; parameterizing the std by its log is a common way to keep it positive during training.
  • Actor learning rate: $2\mathrm{e}{-5}$. The learning rate for the actor (policy) network, which outputs actions.
  • Critic learning rate: $1\mathrm{e}{-4}$. The learning rate for the critic (value) network, which estimates state values.
  • GAE decay factor (λ): 0.95. The lambda parameter for Generalized Advantage Estimation (GAE), which balances bias and variance in advantage estimates.
  • GAE discount factor (γ): 0.99. The discount factor for future rewards, determining how much future rewards are valued relative to immediate rewards. A value close to 1 indicates a long planning horizon.
  • Actor Transformer dimension: 512. The dimensionality of the internal representation used by the Transformer network within the actor policy. This indicates the actor uses a Transformer architecture.
  • Actor layers: 4. The number of layers in the actor Transformer network.
  • Actor heads: 4. The number of attention heads in the actor Transformer network.
  • Critic MLP size: [1024, 1024, 1024, 1024]. The architecture of the critic network, which is a Multi-Layer Perceptron (MLP) with four hidden layers, each having 1024 units.
  • Activation: ReLU (Rectified Linear Unit). The activation function used in the neural networks.
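The PPO-specific quantities listed above (the clipped surrogate with ε = 0.2, and GAE with λ = 0.95, γ = 0.99) can be sketched in a few lines. This is a generic PPO/GAE illustration, not the paper's actual training code:

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.
    rewards, dones: (T,); values: (T+1,), with a bootstrap value at index T."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]                      # zero out across episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        last = delta + gamma * lam * mask * last   # lambda-weighted recursion
        adv[t] = last
    return adv

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip=0.2):
    """PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The clamp keeps the probability ratio within [0.8, 1.2], which is exactly the "limits the policy update size" behavior described for the clip parameter.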

6. Results & Analysis

The experiments evaluate PhySINK and PHUMA across three research questions using the Unitree G1 and H1-2 humanoids and two test sets: PHUMA Test (a 10% held-out split of PHUMA) and Unseen Video (504 self-collected sequences). The primary metric is success rate with a strict 0.15 m deviation threshold.
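As a concrete reading of that metric, here is a minimal sketch assuming an episode counts as a success only if the per-frame tracking deviation never exceeds the threshold (the paper's exact success definition, e.g. which body links are measured, may differ):

```python
import numpy as np

def episode_success(robot_pos, ref_pos, threshold=0.15):
    """True if the robot never deviates from the reference by more than
    `threshold` meters at any frame; positions have shape (T, 3)."""
    deviation = np.linalg.norm(robot_pos - ref_pos, axis=-1)
    return bool(np.all(deviation <= threshold))

def success_rate(episodes, threshold=0.15):
    """Percentage of (robot_pos, ref_pos) episode pairs that succeed."""
    flags = [episode_success(r, g, threshold) for r, g in episodes]
    return 100.0 * float(np.mean(flags))
```

Under this reading, loosening `threshold` to 0.5 makes inaccurate-but-upright episodes count as successes, which is the effect examined in Section 6.3.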

6.1. Core Results Analysis

6.1.1. PhySINK Retargeting Method Effectiveness (RQ1)

This section evaluates how PhySINK compares to established retargeting approaches (IK and SINK) in terms of motion imitation performance. The policies are trained on AMASS data, retargeted using each of the three methods.

The following are the results from Table 3 of the original paper:

Columns 2-6: PHUMA Test; columns 7-11: Unseen Video.

(a) G1

| Retarget | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|
| IK | 52.8 | 75.3 | 43.9 | 24.3 | 44.2 | 54.0 | 80.3 | 54.6 | 32.7 | 43.3 |
| SINK | 76.2 | 88.5 | 72.1 | 56.8 | 66.8 | 70.2 | 90.7 | 75.0 | 62.7 | 44.1 |
| PhySINK | 79.5 | 89.9 | 76.1 | 61.1 | 69.5 | 72.8 | 93.3 | 78.2 | 65.5 | 47.3 |

(b) H1-2

| Retarget | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|
| IK | 45.3 | 70.9 | 35.7 | 15.2 | 35.0 | 54.2 | 78.0 | 60.7 | 30.1 | 28.6 |
| SINK | 54.4 | 74.9 | 45.9 | 17.2 | 49.6 | 64.3 | 87.3 | 59.7 | 46.0 | 63.9 |
| PhySINK | 64.3 | 83.6 | 57.0 | 27.7 | 55.9 | 72.4 | 99.2 | 66.3 | 57.4 | 63.1 |

Analysis of Table 3:

  • PhySINK consistently outperforms others: PhySINK achieves the highest success rates across all motion categories (Stationary, Angular, Vertical, Horizontal) and on both humanoid robots (G1 and H1-2) for both PHUMA Test and Unseen Video sets. This demonstrates that physically-grounded retargeting directly translates to better imitation performance.
  • SINK's improvement over IK: SINK significantly improves performance over IK for both robots and test sets. This highlights the importance of shape-adaptive retargeting in preserving motion style and making it more imitable. For example, on G1 PHUMA Test, SINK (76.2%) is much better than IK (52.8%).
  • PhySINK's further gains: PhySINK further improves upon SINK, particularly noticeable in dynamic motions (Vertical and Horizontal categories) where physical constraints are most critical. For G1 on PHUMA Test, PhySINK (61.1%) is better than SINK (56.8%) in Vertical motions. For H1-2 on PHUMA Test, PhySINK (27.7%) is significantly better than SINK (17.2%) in Vertical motions. This validates the importance of explicitly addressing joint limits, ground contact, and skating during retargeting.
  • Performance on Unseen Video: The trends hold for the Unseen Video test set, indicating that the benefits of PhySINK generalize well to novel motions.

6.1.2. PHUMA Dataset Effectiveness (RQ2)

This section compares the motion tracking performance of policies trained on LaFAN1, AMASS, Humanoid-X, and PHUMA.

The following are the results from Table 4 of the original paper:

Columns 3-7: PHUMA Test; columns 8-12: Unseen Video.

(a) G1

| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LaFAN1 | 2.4 | 46.1 | 66.1 | 36.2 | 24.0 | 42.5 | 28.4 | 46.9 | 28.4 | 19.6 | 10.5 |
| AMASS | 20.9 | 76.2 | 88.5 | 72.1 | 56.8 | 66.8 | 70.2 | 90.7 | 75.0 | 62.7 | 44.1 |
| Humanoid-X | 231.4 | 50.6 | 78.4 | 43.0 | 26.0 | 31.8 | 39.1 | 78.0 | 39.6 | 23.0 | 6.5 |
| PHUMA | 73.0 | 92.7 | 95.6 | 91.7 | 86.0 | 85.6 | 82.9 | 96.7 | 88.0 | 71.8 | 67.1 |

(b) H1-2

| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LaFAN1 | 2.4 | 62.0 | 79.3 | 54.7 | 26.6 | 58.9 | 70.8 | 92.4 | 66.7 | 56.4 | 68.2 |
| AMASS | 20.9 | 54.4 | 74.9 | 45.9 | 17.2 | 49.6 | 64.3 | 87.3 | 59.7 | 46.0 | 63.9 |
| Humanoid-X | 231.4 | 49.7 | 74.6 | 40.4 | 17.0 | 37.3 | 60.5 | 88.3 | 60.0 | 48.7 | 39.7 |
| PHUMA | 73.0 | 82.7 | 91.5 | 79.5 | 68.1 | 68.4 | 78.6 | 97.5 | 76.8 | 74.5 | 63.8 |

Analysis of Table 4:

  • PHUMA's Dominance: Policies trained on PHUMA achieve the highest success rates across all motion categories, on both G1 and H1-2 humanoids, and for both test sets. For G1 on PHUMA Test, PHUMA scores 92.7% total success, significantly outperforming AMASS (76.2%), Humanoid-X (50.6%), and LaFAN1 (46.1%). This is a strong validation of PHUMA's combined benefits of large scale and high physical quality.

  • Scale Alone is Insufficient: Humanoid-X, despite being by far the largest dataset (231.4 hours), significantly underperforms AMASS (20.9 hours) and even LaFAN1 in some cases (e.g., H1-2 Total on PHUMA Test: 49.7% vs. LaFAN1's 62.0%). This clearly demonstrates that scale alone is insufficient if data quality (physical reliability) is compromised. The physical artifacts in Humanoid-X hinder effective policy learning.

  • Quality Alone is Insufficient: LaFAN1 (2.4 hours) offers high-quality data but its limited scale and diversity lead to poorer generalization, especially on the Unseen Video test set (G1 total 28.4%). Even AMASS (20.9 hours), while better than LaFAN1, shows limitations in covering diverse motion types.

  • Importance of Diversity (Figure 8): The paper notes that LaFAN1 and AMASS have uneven motion distributions, lacking certain categories (e.g., reach, bend, squat) or being dominated by simple ones. PHUMA provides a more balanced and comprehensive coverage (Figure 8), leading to better generalization across diverse behaviors.

    Figure 8: Motion Type Distribution per Dataset. This radar chart compares the total duration of each motion type across PHUMA, AMASS, and LaFAN1 datasets.

    Figure 8 visually confirms PHUMA's superior motion type distribution. It shows that PHUMA has more balanced coverage and substantially more examples per motion type compared to AMASS and LaFAN1, which have prominent peaks for specific motions (e.g., walk in AMASS) and gaps in others.

  • Performance on Unseen Motions: PHUMA maintains its strong performance on the Unseen Video test set (G1 total 82.9%), demonstrating excellent generalization capabilities for motions not encountered during training. This is crucial for real-world application.

6.1.3. Pelvis-Only Path Following Control Performance (RQ3)

This section evaluates PHUMA's effectiveness for pelvis path-following control, a simplified control mode that uses only pelvis position and rotation as input. Policies are trained using a teacher-student distillation approach, with teachers trained on AMASS and PHUMA.

The following are the results from Table 5 of the original paper:

Columns 2-6: PHUMA Test; columns 7-11: Unseen Video.

(a) G1

| Dataset | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|
| AMASS | 60.5 | 85.6 | 60.1 | 51.4 | 66.5 | 54.8 | 83.6 | 66.5 | 33.0 | 27.5 |
| PHUMA | 84.5 | 94.6 | 86.1 | 83.7 | 90.2 | 74.6 | 98.3 | 83.3 | 54.3 | 57.1 |

(b) H1-2

| Dataset | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|
| AMASS | 60.4 | 84.0 | 62.8 | 43.6 | 78.7 | 72.3 | 96.6 | 77.3 | 52.1 | 72.5 |
| PHUMA | 73.9 | 91.2 | 76.5 | 66.9 | 84.8 | 78.1 | 96.6 | 77.8 | 60.6 | 78.0 |

Analysis of Table 5:

  • PHUMA's Superiority in Path Following: Policies trained on PHUMA consistently outperform those trained on AMASS across all motion categories and humanoids for both test sets. For G1 on PHUMA Test, PHUMA achieves 84.5% total success rate compared to AMASS's 60.5%.

  • Pronounced Gains in Dynamic Motions: The improvement is particularly significant for Vertical and Horizontal motions. For G1 on PHUMA Test, PHUMA (83.7% vertical, 90.2% horizontal) shows substantial gains over AMASS (51.4% vertical, 66.5% horizontal). This is attributed to AMASS's limitation in diverse dynamic movements.

  • Running Motion Example (Figure 5): The paper provides a qualitative example of running motion, where AMASS-trained policies frequently fail, while PHUMA-trained policies maintain robust performance. This visually confirms PHUMA's ability to enable more diverse and dynamic humanoid control.

    Figure 5: Path following on running motion. We visualize the robot's trajectory in a running motion. The target pelvis path is visualized with a green line. Top row presents results from a policy trained on AMASS, while bottom row presents results from a policy trained on PHUMA.

    Figure 5 clearly illustrates the difference: the AMASS-trained robot (top row) deviates significantly from the green target path and often falls, while the PHUMA-trained robot (bottom row) stays close to the path, demonstrating successful and stable running.

  • Practical Value: These results validate the practical value of PHUMA for complex control, especially in scenarios where only partial state information (like pelvis-only) is available, making it more applicable to real-world control interfaces.
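A minimal behavior-cloning loop illustrates the teacher-student distillation behind this control mode. Dimensions and MLP architectures here are hypothetical placeholders (the paper's actor is a Transformer operating on privileged state):

```python
import torch
import torch.nn as nn

# Hypothetical observation/action sizes, for illustration only.
FULL_OBS, PELVIS_OBS, ACT = 64, 13, 29

# Teacher sees privileged full-body state; student sees only pelvis targets.
teacher = nn.Sequential(nn.Linear(FULL_OBS, 256), nn.ReLU(), nn.Linear(256, ACT))
student = nn.Sequential(nn.Linear(PELVIS_OBS, 256), nn.ReLU(), nn.Linear(256, ACT))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(full_obs, pelvis_obs):
    """One distillation step: regress the student's actions onto the
    frozen teacher's actions for the same simulation states."""
    with torch.no_grad():
        target = teacher(full_obs)
    loss = ((student(pelvis_obs) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point is the asymmetry of inputs: the student must reproduce full-body behavior from pelvis-only guidance, so the quality and diversity of the teacher's training data (PHUMA vs. AMASS) directly bounds what the student can learn.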

6.2. Ablation Studies / Parameter Analysis

The ablation study in Table 2 directly verifies the effectiveness of PhySINK's components.

The following are the results from Table 2 of the original paper:

(a) G1

| Method | Motion Fidelity (%) | Joint Feasibility (%) | Non-Floating (%) | Non-Penetration (%) | Non-Skating (%) |
|---|---|---|---|---|---|
| IK | 27.6 | 91.7 | 55.6 | 47.8 | 59.7 |
| SINK | 94.8 | 95.9 | 96.4 | 14.9 | 55.4 |
| + Joint Feasibility Loss | 94.9 | 100.0 | 96.4 | 14.8 | 55.6 |
| + Grounding Loss | 94.9 | 100.0 | 99.9 | 97.2 | 53.6 |
| + Skating Loss = PhySINK | 94.8 | 100.0 | 99.9 | 96.8 | 89.7 |

(b) H1-2

| Method | Motion Fidelity (%) | Joint Feasibility (%) | Non-Floating (%) | Non-Penetration (%) | Non-Skating (%) |
|---|---|---|---|---|---|
| IK | 36.3 | 80.9 | 57.7 | 45.2 | 56.1 |
| SINK | 93.9 | 15.3 | 42.2 | 81.4 | 47.9 |
| + Joint Feasibility Loss | 94.0 | 99.9 | 44.4 | 79.9 | 50.7 |
| + Grounding Loss | 93.9 | 99.9 | 99.8 | 98.1 | 49.3 |
| + Skating Loss = PhySINK | 93.9 | 99.9 | 99.7 | 97.7 | 87.7 |

Analysis of Table 2:

  • IK vs. SINK: IK struggles with Motion Fidelity (27.6% for G1, 36.3% for H1-2), indicating it fails to accurately reproduce human motion style. SINK dramatically improves Motion Fidelity (94.8% for G1, 93.9% for H1-2), but at the cost of physical plausibility. For example, for G1, SINK has poor Non-Penetration (14.9%) and Non-Skating (55.4%). For H1-2, SINK has very low Joint Feasibility (15.3%) and poor Non-Floating (42.2%). This confirms SINK's physically under-constrained nature.

  • Impact of + Joint Feasibility Loss: Adding this loss term significantly boosts Joint Feasibility to nearly 100% for both robots (from 95.9% to 100% for G1, and dramatically from 15.3% to 99.9% for H1-2). This shows that the loss term effectively enforces joint limits without negatively impacting Motion Fidelity.

  • Impact of + Grounding Loss: This term drastically improves ground contact metrics. For G1, Non-Floating increases to 99.9% and Non-Penetration to 97.2% (from 96.4% and 14.8% respectively). Similar significant gains are observed for H1-2. This confirms its effectiveness in preventing feet from floating or penetrating the ground.

  • Impact of + Skating Loss (= PhySINK): The final addition of the Skating Loss brings Non-Skating performance to nearly 90% (89.7% for G1, 87.7% for H1-2), a substantial improvement over the previous stages (e.g., 53.6% for G1 before this loss). Crucially, this is achieved while maintaining high Motion Fidelity and other physical metrics.

  • Overall PhySINK Performance: The full PhySINK method (with all losses) preserves high Motion Fidelity while achieving strong results across all physical metrics (near 100% Joint Feasibility, near 100% Non-Floating and Non-Penetration, and nearly 90% Non-Skating). This thoroughly validates the design of PhySINK and the necessity of each added physical constraint loss.
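Illustrative forms of the three added loss terms, reconstructed from their descriptions in the ablation above (these are plausible shapes, not the paper's exact formulation):

```python
import torch

def joint_feasibility_loss(q, q_min, q_max):
    """Penalize joint angles outside their limits; zero when within limits."""
    violation = torch.relu(q - q_max) + torch.relu(q_min - q)
    return violation.pow(2).mean()

def grounding_loss(foot_heights, contact_mask):
    """Pull contact feet onto the ground plane: penalizes both floating
    (height > 0) and penetration (height < 0) on contact frames.
    foot_heights, contact_mask: (T, num_feet)."""
    return (foot_heights.pow(2) * contact_mask).sum() / contact_mask.sum().clamp(min=1)

def skating_loss(foot_pos, contact_mask):
    """Penalize horizontal foot displacement between consecutive frames
    while the foot stays in contact. foot_pos: (T, num_feet, 3)."""
    disp = foot_pos[1:, :, :2] - foot_pos[:-1, :, :2]
    both_contact = contact_mask[1:] * contact_mask[:-1]
    return (disp.pow(2).sum(-1) * both_contact).sum() / both_contact.sum().clamp(min=1)
```

Each term is zero exactly when its constraint is satisfied, which matches the ablation pattern in Table 2: adding a term drives its metric toward 100% while leaving Motion Fidelity essentially unchanged.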

6.3. Success Rate Threshold Analysis

The paper performs an additional analysis to justify its choice of a stricter success rate threshold (0.15 m instead of the conventional 0.5 m).

The following are the results from Table 11 of the original paper (PHUMA Test set):

Columns 3-7: success threshold 0.15 m; columns 8-12: success threshold 0.5 m.

(a) G1

| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LaFAN1 | 2.4 | 46.1 | 66.1 | 36.2 | 24.0 | 42.5 | 74.8 | 87.8 | 69.2 | 47.1 | 72.6 |
| AMASS | 20.9 | 76.2 | 88.5 | 72.1 | 56.8 | 66.8 | 90.2 | 95.0 | 87.9 | 81.1 | 83.7 |
| Humanoid-X | 231.4 | 50.6 | 78.4 | 43.0 | 26.0 | 31.8 | 78.4 | 91.3 | 72.9 | 59.5 | 65.9 |
| PHUMA | 73.0 | 92.7 | 95.6 | 91.7 | 86.0 | 85.6 | 97.1 | 98.7 | 96.5 | 94.4 | 92.5 |

(b) H1-2

| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LaFAN1 | 2.4 | 62.0 | 79.3 | 54.7 | 26.6 | 58.9 | 70.8 | 92.4 | 66.7 | 56.4 | 68.2 |
| AMASS | 20.9 | 54.4 | 74.9 | 45.9 | 17.2 | 49.6 | 70.4 | 86.3 | 62.6 | 41.4 | 65.9 |
| Humanoid-X | 231.4 | 49.7 | 74.6 | 40.4 | 17.0 | 37.3 | 54.8 | 78.5 | 45.2 | 22.1 | 43.2 |
| PHUMA | 73.0 | 82.7 | 91.5 | 79.5 | 68.1 | 68.4 | 92.0 | 9.6 | 89.7 | 85.6 | 79.4 |

The following are the results from Table 12 of the original paper (Unseen Video test set):

Columns 3-7: success threshold 0.15 m; columns 8-12: success threshold 0.5 m.

(a) G1

| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LaFAN1 | 2.4 | 28.4 | 46.9 | 28.4 | 19.6 | 10.5 | 78.2 | 85.5 | 70.8 | 76.3 | 80.8 |
| AMASS | 20.9 | 70.2 | 90.7 | 75.0 | 62.7 | 44.1 | 92.3 | 99.2 | 92.1 | 82.1 | 88.0 |
| Humanoid-X | 231.4 | 39.1 | 78.0 | 39.6 | 23.0 | 6.5 | 84.1 | 98.3 | 79.9 | 76.0 | 76.2 |
| PHUMA | 73.0 | 82.9 | 96.7 | 88.0 | 71.8 | 67.1 | 93.7 | 100.0 | 96.8 | 85.9 | 84.7 |

(b) H1-2

| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LaFAN1 | 2.4 | 70.8 | 92.4 | 66.7 | 56.4 | 68.2 | 85.5 | 97.5 | 79.0 | 77.5 | 90.0 |
| AMASS | 20.9 | 64.3 | 87.3 | 59.7 | 46.0 | 63.9 | 80.4 | 93.3 | 69.9 | 72.8 | 89.0 |
| Humanoid-X | 231.4 | 60.5 | 88.3 | 60.0 | 48.7 | 39.7 | 68.7 | 93.3 | 65.1 | 60.2 | 50.5 |
| PHUMA | 73.0 | 78.6 | 97.5 | 76.8 | 74.5 | 63.8 | 89.9 | 99.2 | 89.4 | 84.6 | 83.9 |

Analysis of Table 11 and 12:

  • Masking True Differences: Under the loose 0.5 m threshold, performance differences between datasets appear relatively modest. For example, on G1 (PHUMA Test), AMASS achieves 90.2% and PHUMA achieves 97.1%. Humanoid-X (78.4%) also seems competitive. This can mask scenarios where robots remain upright but perform motions inaccurately (e.g., staying still instead of jumping).
  • Revealing True Quality with Stricter Threshold: When the stricter 0.15 m threshold is applied, the performance differences become substantially more pronounced and accurately reflect imitation quality. For G1 (PHUMA Test), PHUMA's lead becomes much clearer (92.7% vs. AMASS's 76.2% and Humanoid-X's 50.6%). This validates the authors' claim that the stricter threshold is a more meaningful measure of imitation quality, showing that PHUMA-trained policies achieve more precise motion tracking.
  • Consistency Across Test Sets: The pattern holds for the Unseen Video test set as well. PHUMA consistently delivers superior results, especially under the stricter threshold, confirming its ability to produce highly accurate and generalizable policies.
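The masking effect is easy to reproduce with a toy example: a policy that stays upright but drifts about 0.3 m from the reference passes the loose 0.5 m threshold yet fails the strict 0.15 m one. This assumes a never-exceed reading of the deviation metric:

```python
import numpy as np

def success(deviation, threshold):
    """Episode success under a never-exceed deviation threshold
    (an illustrative reading of the metric)."""
    return bool(np.all(deviation <= threshold))

drifting = np.full(100, 0.30)  # upright, but ~0.3 m off the reference
tracking = np.full(100, 0.05)  # tight tracking

assert success(drifting, 0.5) and success(tracking, 0.5)        # loose: both pass
assert not success(drifting, 0.15) and success(tracking, 0.15)  # strict: only the tracker passes
```

A dataset whose policies mostly produce "drifting" behavior thus looks competitive at 0.5 m and collapses at 0.15 m, which is exactly the Humanoid-X pattern in the tables above.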

6.4. Data Presentation (Tables)

All tables from the original paper are transcribed in full in the sections above.

6.5. Overall Performance Visualization (Figure 2)

The following are the results from Figure 2 of the original paper:

Figure 2: Overview of datasets and performance. PHUMA is both large-scale and physically reliable, which translates into higher success rates in motion imitation and pelvis path following. (a) Feasible and infeasible human motion sources in each dataset. (b) Physical reliability, with AMASS retargeted using a standard learning-based inverse kinematics method. (c) Success rate on unseen motions. (d) Success rate in path-following. Results are reported on the Unitree G1 humanoid.

Analysis of Figure 2:

  • (a) Feasible and Infeasible Motion Sources: Shows that Humanoid-X starts with a huge amount of data, but a significant portion is deemed infeasible. PHUMA effectively curates this to a smaller, but higher-quality set. AMASS has less infeasible data but is much smaller overall.

  • (b) Physical Reliability: This radar chart visually represents the quantitative improvements from PhySINK and PHUMA. PHUMA shows near-perfect scores for Joint Feasibility, Non-Floating, and Non-Penetration, and very high Non-Skating, significantly outperforming AMASS (even with SINK retargeting) across all physical metrics. Humanoid-X is not explicitly shown here but is known to have lower reliability.

  • (c) Success Rate on Unseen Motions: PHUMA (dark blue bar) has the highest success rate, followed by AMASS (light blue) and then Humanoid-X (orange). This summarizes the findings from Table 4 for unseen motions, clearly showing PHUMA's superior imitation performance.

  • (d) Success Rate in Path-Following: Again, PHUMA (dark blue) leads AMASS (light blue) across all motion categories, with particularly strong gains in Vertical and Horizontal motions, reinforcing the results from Table 5.

    Figure 2 serves as a concise visual summary, strongly supporting the paper's central claims about PHUMA's scale, physical reliability, and resulting performance gains.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces PHUMA, a Physically-grounded HUMAnoid locomotion dataset, addressing a critical bottleneck in humanoid motion imitation: the lack of large-scale, diverse, and physically reliable motion data. The authors effectively demonstrate that while large-scale video-derived datasets (like Humanoid-X) offer diversity, they are plagued by physical artifacts (floating, penetration, joint violations, skating). Conversely, high-quality motion capture datasets (AMASS, LaFAN1) are limited in scale and diversity.

PHUMA overcomes this dilemma through a two-pronged approach:

  1. Physics-aware Motion Curation: A rigorous filtering process that refines raw motion data, removing problematic segments based on metrics like root jerk, foot contact score, pelvis height, and center of mass stability.

  2. PhySINK Retargeting: A novel physics-constrained retargeting method that extends Shape-Adaptive Inverse Kinematics (SINK) by incorporating explicit joint feasibility, grounding, and anti-skating loss terms. This ensures that retargeted motions are not only kinematically faithful but also physically plausible for humanoid robots.

    Experimental results consistently show that policies trained on PHUMA significantly outperform those trained on AMASS and Humanoid-X in both motion imitation of unseen videos and pelvis-guided path following tasks, especially for dynamic vertical and horizontal motions. This validates the central thesis that achieving progress in humanoid locomotion requires both scale and physical reliability in motion data.
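The curation step in point 1 can be sketched as a per-frame filter followed by chunking. The 50 m/s³ root-jerk threshold is stated by the paper; the pelvis-height range and minimum chunk length below are illustrative placeholders:

```python
import numpy as np

def root_jerk(root_pos, dt):
    """Per-frame jerk magnitude (m/s^3) of the root trajectory, shape (T, 3)."""
    vel = np.gradient(root_pos, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    jerk = np.gradient(acc, dt, axis=0)
    return np.linalg.norm(jerk, axis=-1)

def keep_frame(jerk_mag, pelvis_h, max_jerk=50.0, h_range=(0.3, 1.2)):
    """Per-frame keep/drop decision. The height range is illustrative;
    Table 7 of the paper lists the actual thresholds."""
    return (jerk_mag <= max_jerk) and (h_range[0] <= pelvis_h <= h_range[1])

def chunk_and_filter(keep_mask, min_len=30):
    """Split a sequence at dropped frames; keep only chunks long enough to use."""
    chunks, start = [], None
    for i, keep in enumerate(list(keep_mask) + [False]):  # sentinel closes last run
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            if i - start >= min_len:
                chunks.append((start, i))
            start = None
    return chunks
```

This is the "chunk-and-filter" idea: rather than discarding a whole clip for one bad frame, only the offending segments are dropped, maximizing usable data from noisy video reconstructions.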

7.2. Limitations & Future Work

The authors acknowledge two key areas for future work:

  1. Sim-to-Real Transfer: The current work focuses on training policies in simulation. A crucial next step is to enable these PHUMA-trained policies to produce physically reliable motions on real humanoid robots. This involves addressing the inevitable reality gap between simulation and the physical world.

  2. Vision-Based Control: The current policies rely on privileged state inputs (perfect knowledge of joint angles, velocities, etc.). Future work aims to transition to vision-based control, where video observations replace these state inputs. This would better align with real-world perception challenges and allow robots to interpret human motion directly from visual cues.

    While not explicitly stated as limitations by the authors, one could infer some current boundaries:

  • Computational Cost: The physics-aware curation and PhySINK retargeting are computationally intensive processes. Scaling this pipeline to even larger, more diverse raw video datasets might introduce new challenges.
  • Dependence on Video-to-Motion Models: The quality of PHUMA still fundamentally relies on the initial video-to-motion models (e.g., TRAM). While curation helps, inherent inaccuracies or biases in these models could propagate.

7.3. Personal Insights & Critique

This paper offers a highly valuable contribution to the field of humanoid robotics and motion imitation. My key insights are:

  • The "Quality over Quantity" (or "Quality with Quantity") Principle: The paper powerfully illustrates that for robotics, simply having a large amount of data is not enough if that data is unphysical or unreliable. The significant underperformance of Humanoid-X despite its massive scale is a stark reminder. PHUMA's success lies in its disciplined approach to ensuring both scale and physical plausibility, a lesson that extends beyond motion imitation to other data-driven robotics tasks.
  • Generalizability of Physics-Constrained Optimization: The PhySINK method is a robust and generalizable approach. Its loss terms (joint feasibility, grounding, skating) are fundamental physical constraints that apply to virtually any physically-embodied robot. This framework could potentially be adapted to other robot morphologies (e.g., quadrupedal robots with different foot contact dynamics) or even to tasks beyond locomotion where physical interaction is paramount.
  • Importance of Data Curation: The detailed physics-aware motion curation pipeline (Section 3.1) is often overlooked in discussions of large datasets but is critical for high-quality outcomes. The strategic "chunk-and-filter" approach is smart, maximizing usable segments from noisy raw data.
  • Impact on Low-Level Control: The improved performance in pelvis-only path following is particularly inspiring. This demonstrates that higher-quality motion data enables more robust and expressive control even with simplified, abstract guidance. This is crucial for real-world applications where full-body state information might not always be available or easy to specify.
  • Potential for Real-World Deployment: The focus on reducing artifacts like floating, penetration, and skating makes the generated motions inherently safer and more stable for real robot deployment, even before sim-to-real transfer is fully achieved. Joint limit enforcement is especially critical for protecting robot hardware.

Potential Issues or Areas for Improvement:

  • Human Subjectivity in Curation: While the curation process uses quantitative metrics (Table 7), defining thresholds (e.g., 50 m/s³ for root jerk, 0.6 for foot contact score) might involve some expert judgment or tuning. The robustness of these thresholds across extremely diverse raw video sources could be further explored.

  • Computational Cost of PhySINK: The PhySINK optimization involves summing multiple loss terms over all frames and joints. While effective, the computational cost might become a limiting factor for extremely long motion sequences or real-time retargeting in some contexts. An analysis of the runtime performance of PhySINK could be beneficial.

  • Complexity of Reward Function: The MaskedMimic reward function (Table 9) is quite complex, with many terms and weights. While effective, tuning these weights can be challenging. Future work could investigate how sensitive the training is to these weights and if simpler reward structures could achieve comparable results with PHUMA's high-quality data.

  • Long-Term Physical Interactions: The current constraints primarily focus on immediate physical plausibility (joint limits, ground contact, no skating). For complex manipulation or interaction tasks, more advanced physical modeling (e.g., stability over longer horizons, contact forces, object interaction) might be needed.

    Overall, PHUMA represents a significant step forward, solidifying the foundation for more capable and humanlike humanoid robots by proving that data quality, achieved through meticulous processing, is as important as data quantity.
