PHUMA: Physically-Grounded Humanoid Locomotion Dataset
TL;DR Summary
PHUMA is a physically-grounded humanoid locomotion dataset built from large-scale video through careful curation and physics-constrained retargeting. It eliminates artifacts such as foot skating, enables stable and diverse motion imitation, and outperforms prior datasets on unseen-motion imitation and path-following tasks.
Abstract
Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale, while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA in two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA.
In-depth Reading
1. Bibliographic Information
1.1. Title
PHUMA: Physically-Grounded Humanoid Locomotion Dataset
1.2. Authors
Kyungmin Lee, Sibeen Kim, Minho Park, Hyunseung Kim, Dongyoon Hwang, Hojoon Lee, Jaegul Choo. All authors are affiliated with KAIST.
1.3. Journal/Conference
This paper is published as a preprint on arXiv, dated 2025-10-30. arXiv is a well-regarded open-access repository for preprints of scientific papers, particularly in fields such as physics, mathematics, computer science, and quantitative biology. While preprints are not peer-reviewed, they allow rapid dissemination of research and feedback from the scientific community. A related venue (the ICRA 2025 Workshop: Human-Centered Robot Learning in the Era of Big Data and Large Models) is mentioned in a citation (Mao et al., 2025), indicating the work's relevance to major robotics and AI conferences.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces PHUMA, a Physically-grounded HUMAnoid locomotion dataset designed to address the limitations of existing motion imitation approaches for humanoid robots. Current methods often rely on expensive, scarce, and limited motion capture (MoCap) datasets like AMASS. While recent efforts, such as Humanoid-X, attempt to scale data collection using large-scale internet videos, they frequently suffer from physical artifacts like floating, penetration, and foot skating, which hinder stable imitation. PHUMA tackles these issues by leveraging human video at scale, combined with careful data curation and a novel physics-constrained retargeting method called PhySINK. PhySINK enforces joint limits, ensures ground contact, and eliminates foot skating, yielding motions that are both extensive and physically reliable. The dataset's effectiveness was evaluated in two scenarios: (i) imitating unseen motions from self-recorded test videos and (ii) path following with only pelvis guidance. In both cases, policies trained with PHUMA significantly outperformed those trained with Humanoid-X and AMASS, demonstrating substantial gains in imitating diverse motions. The code and dataset are made publicly available.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2510.26236 PDF Link: https://arxiv.org/pdf/2510.26236v1.pdf
2. Executive Summary
2.1. Background & Motivation
The deployment of humanoid robots in real-world scenarios necessitates locomotion that is both stable and humanlike. While traditional reinforcement learning (RL) has advanced quadrupedal locomotion, directly applying these techniques to humanoids often results in effective but non-humanlike gaits. Motion imitation has emerged as a promising paradigm, where robot policies are trained to replicate human movements. This process typically involves three stages: collecting human motion data, retargeting it to the robot's morphology, and using RL to track these retargeted trajectories.
The core problem the paper addresses is the fundamental constraint on motion imitation due to the scale, diversity, and physical feasibility of available human motion data. High-quality motion capture (MoCap) datasets, such as AMASS (Archive of Motion Capture as Surface Shapes), provide relatively physically feasible motions but are limited in size and diversity, often dominated by simple actions like walking or reaching. This scarcity makes it challenging to train robust policies for a wide range of complex human behaviors.
To overcome this limitation, recent research has explored scaling data collection by converting large volumes of internet videos into motion data, exemplified by Humanoid-X. However, this approach introduces significant challenges:
- Physical Artifacts: Video-to-motion models often misestimate global pelvis translation, leading to floating (the robot hovering above the ground) or ground penetration (the robot's limbs sinking into the ground).
- Retargeting Issues: The subsequent retargeting stage, which adapts human motion to robot morphology, often prioritizes joint alignment over physical plausibility. This can result in joint violations (joint angles exceeding mechanical limits) and foot skating (feet sliding on the ground during contact), as visually depicted in Figure 1. These artifacts make the motions unstable and unreliable for training physical robots, hindering stable imitation.

The problem is important because, without large-scale, diverse, and physically reliable motion data, humanoid robots cannot acquire the full spectrum of humanlike behaviors required for general-purpose embodied AI. Existing solutions either lack scale and diversity or introduce critical physical inconsistencies.
The paper's entry point is to bridge this gap by leveraging the scalability of internet video data while rigorously enforcing physical plausibility.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- PHUMA Dataset: Introduction of PHUMA (Physically-grounded HUMAnoid locomotion dataset), a large-scale dataset (73.0 hours, 76.0K clips) that leverages human video at scale while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, yielding motions that are both extensive and physically reliable. It aggregates motions from various sources, including motion capture and human videos, after a physics-aware curation pipeline.
- PhySINK Retargeting Method: Development of PhySINK (Physically-grounded Shape-adaptive Inverse Kinematics), a novel retargeting method. PhySINK extends existing Shape-Adaptive Inverse Kinematics (SINK) approaches by incorporating explicit joint-feasibility, grounding, and anti-skating loss terms into its optimization objective, ensuring that retargeted motions stay faithful to the source human movement while remaining physically plausible for a robot.
- Enhanced Motion Quality: PHUMA provides substantially more physically plausible motions than existing datasets, achieving significantly higher percentages of joint-feasible, non-floating, non-penetrating, and non-skating frames than AMASS and Humanoid-X (as shown in Figure 2b and Table 2).
- Superior Performance in Motion Imitation and Path Following: Policies trained with PHUMA consistently outperform those trained with Humanoid-X and AMASS in two key evaluation settings:
  - Motion Imitation: Achieved higher success rates than both AMASS and Humanoid-X when imitating unseen motions from self-recorded test videos (Figure 2c, Table 4).
  - Path Following: Improved the overall success rate over AMASS, with even larger gains in the vertical and horizontal motion categories, using pelvis-only guidance (Figure 2d, Table 5).
- Public Release: The authors commit to releasing PHUMA as a public resource, along with code and trained models, to foster future research in humanoid locomotion.
The key finding is that motion data for humanoid locomotion requires not only scale and diversity but, critically, physical reliability. PHUMA addresses the specific problems of physical artifacts introduced by video-to-motion conversion and unconstrained retargeting, leading to more robust and capable humanoid control policies.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Humanoid Locomotion: The ability of humanoid robots (robots designed to resemble the human body, typically with two arms, two legs, and a torso) to move around in an environment. The goal is often to achieve gaits that are both stable (maintaining balance without falling) and humanlike (natural and efficient, resembling human movement patterns). This is crucial for robots intended to operate in human-centric environments.
- Reinforcement Learning (RL): A subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns through trial and error, receiving feedback (rewards or penalties) for its actions. In the context of humanoid locomotion, an RL agent (the robot's controller) learns to generate motor commands that move the robot, aiming to achieve a task (e.g., walking, imitating a motion) while remaining stable. Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a popular RL algorithm for training such policies.
- Motion Imitation: A control paradigm in which a robot is trained to replicate or track a given reference motion (e.g., a human walking, jumping, or dancing). Instead of defining a task with hand-crafted rewards, the robot is rewarded for how closely its movements match the target motion. This approach allows robots to acquire complex, humanlike behaviors without explicit programming for each movement.
- Motion Capture (MoCap): A technology for recording the movement of people or objects. Typically, markers are placed on a human subject's body, and multiple cameras or sensors track the 3D positions of these markers over time. This yields highly accurate 3D kinematic data (joint positions, orientations, velocities). MoCap datasets like AMASS are considered high-quality but are expensive and time-consuming to produce, leading to limited scale and diversity.
- SMPL/SMPL-X Model: SMPL (Skinned Multi-Person Linear Model) and its extension SMPL-X are statistical 3D human body models that represent a wide range of human body shapes and poses with a small set of parameters. Given shape parameters (e.g., height, build) and pose parameters (joint angles), SMPL/SMPL-X generates a realistic 3D mesh of a human body. These models are widely used as a standard representation for human motion data, enabling easy manipulation and retargeting.
- Inverse Kinematics (IK): In robotics and computer graphics, inverse kinematics is the process of determining the joint angles of a kinematic chain (such as a robot arm or a human skeleton) that produce a desired position and orientation of a specific end-effector (e.g., a hand or foot). Given a target end-effector pose, IK algorithms compute the joint configurations needed to reach it. Traditional IK often struggles to preserve natural motion style or to account for morphological differences between humans and robots.
- Shape-Adaptive Inverse Kinematics (SINK): An advanced form of IK that addresses the motion-mismatch problem in retargeting. Unlike traditional IK, which might apply rigid scaling, SINK methods first adapt the source human model (e.g., SMPL) to match the body shape and limb proportions of the target robot; only then is the motion aligned by matching global or local joint positions and orientations. While better at preserving motion style, SINK approaches are often physically under-constrained: they do not explicitly account for physical limitations like joint limits or ground interactions, leading to artifacts.
- Physical Artifacts: These are undesirable and unrealistic behaviors that arise in simulated or retargeted motions, making them unphysical for a real robot or simulation. The paper specifically mentions:
- Floating: The robot's feet or body hovering above the ground, losing contact when it should be touching.
- Penetration: The robot's limbs or body sinking into or passing through the ground or other objects.
- Foot Skating: The robot's feet sliding across the ground surface during phases when they should be firmly planted.
- Joint Violations: Joint angles exceeding the physical or mechanical limits of the robot's actuators or structure, leading to unnatural or impossible poses.
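To make these definitions concrete, the following is a minimal sketch of how such artifacts could be detected from a foot trajectory. The function, array shapes, and tolerance values are hypothetical illustrations, not part of the paper's pipeline.

```python
import numpy as np

def detect_artifacts(foot_z, foot_xy, contact_tol=0.02, skate_tol=0.01):
    """Flag floating, penetration, and foot skating per frame.

    foot_z  : (T,) height of the lowest foot vertex per frame (metres)
    foot_xy : (T, 2) horizontal position of that vertex
    The tolerances are illustrative, not the paper's values.
    """
    floating = foot_z > contact_tol          # foot hovers above the ground
    penetration = foot_z < -contact_tol      # foot sinks below the ground
    in_contact = ~floating & ~penetration
    # horizontal slide between consecutive frames
    slide = np.linalg.norm(np.diff(foot_xy, axis=0), axis=1)
    # skating: the foot stays in contact yet slides horizontally
    skating = in_contact[:-1] & in_contact[1:] & (slide > skate_tol)
    return floating, penetration, skating
```

Joint violations, the fourth artifact, would instead be checked against the robot's per-joint angle limits rather than against the ground.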
3.2. Previous Works
The paper builds upon and differentiates itself from several key prior studies:
- Motion Capture Datasets (e.g., LaFAN1, AMASS):
  - LaFAN1 (Harvey et al., 2020): A relatively small-scale motion capture dataset (a few hours). While it provides high-quality data, its limited size restricts the diversity of motions that can be learned.
  - AMASS (Mahmood et al., 2019): The most extensive and widely used MoCap dataset, aggregating various MoCap libraries into a common SMPL format. Despite its size, AMASS is still limited in scale and diversity, often dominated by walking motions recorded in indoor lab settings. The paper notes that AMASS is moderate-quality and, even when retargeted with SINK, still exhibits physical reliability issues (Figure 2b).
  - Limitation: Both LaFAN1 and AMASS suffer from the scalability and diversity issues inherent to MoCap data collection. They provide a high proportion of physically feasible motions, but their content consists mostly of simple movements.
- Video-to-Motion Datasets (e.g., Humanoid-X):
  - Humanoid-X (Mao et al., 2025): This dataset represents a trend of scaling data collection by leveraging vast internet videos. It converts videos to SMPL representations using video-to-motion models (such as VIBE by Kocabas et al., 2020) and then retargets them to humanoid embodiments. Humanoid-X is large-scale but lower-quality in terms of physical plausibility.
  - Limitation: The Humanoid-X pipeline (video-to-motion followed by retargeting) introduces significant physical artifacts. The video-to-motion models often misestimate global pelvis translation, leading to floating or ground penetration, while the retargeting stage prioritizes joint alignment over physical plausibility, causing joint violations and foot skating (as highlighted in Figure 1). These issues make the data less stable for training robust robot policies.
- MaskedMimic (Tessler et al., 2024): The reinforcement learning framework used by the authors for policy training. MaskedMimic provides a unified approach to motion tracking and supports both full-body-state and partial-body-state (e.g., pelvis-only) imitation. The paper uses this framework to train its policies, demonstrating that the observed improvements stem from the PHUMA dataset and the PhySINK method rather than from a different RL training methodology.
  - Note: While not a dataset, MaskedMimic is a crucial methodological prerequisite, as it is the learning algorithm for which PHUMA provides the training data.
- SINK (Shape-Adaptive Inverse Kinematics) Methods (He et al., 2024b;a; 2025a;b): These methods are the direct predecessor of PhySINK. SINK aims to preserve motion style by adapting the human model to the robot's proportions before aligning motions.
  - Limitation: As discussed, SINK approaches are physically under-constrained. They effectively match kinematic poses but introduce physical artifacts such as joint-limit violations and implausible ground interactions (floating, penetration, skating) because they do not explicitly optimize for these physical constraints.
3.3. Technological Evolution
The field of humanoid locomotion has evolved significantly:
- Early RL for Locomotion: Initially, reinforcement learning relied on task-oriented rewards to achieve locomotion (e.g., Hwangbo et al., 2019; Lee et al., 2020). While effective, these approaches often produced non-humanlike gaits.
- Emergence of Motion Imitation: To achieve humanlike behaviors, the focus shifted to motion imitation, using human motion data as a reference. This involved MoCap data and retargeting.
- MoCap Data Limitations: MoCap datasets (such as CMU, LaFAN1, and AMASS) provided high-quality, physically sound motions but were inherently limited in scale and diversity due to the expensive, labor-intensive collection process. They often lacked complex or diverse movements.
- Scaling with Video Data: The advent of robust video-to-motion models (e.g., VIBE by Kocabas et al., 2020; TRAM by Wang et al., 2024) enabled researchers to leverage large-scale internet videos to generate vast amounts of human motion data (e.g., Humanoid-X). This addressed the scale and diversity problem.
- New Challenges with Video Data: However, video-derived motions and unconstrained retargeting introduced new problems: physical artifacts (floating, penetration, skating, joint violations) that undermined the stability and reliability of the data for robot training.
- PHUMA's Position: PHUMA represents the next evolutionary step. It aims to combine the scale and diversity benefits of video-derived data with the physical reliability of MoCap data, via a physics-aware curation pipeline and a novel physics-constrained retargeting method (PhySINK), delivering both quantity and quality.
3.4. Differentiation Analysis
Compared to the main methods in related work, PHUMA's core differences and innovations are:
- Addressing the Scale vs. Quality Trade-off:
  - Vs. MoCap Datasets (e.g., AMASS, LaFAN1): While AMASS and LaFAN1 offer relatively high-quality, physically feasible motions, they are limited in scale and diversity. PHUMA vastly expands the scale and diversity of motions available for training (73 hours vs. 20.9 hours for AMASS) by incorporating video-derived data, while simultaneously maintaining high physical reliability.
  - Vs. Large-Scale Video Datasets (e.g., Humanoid-X): Humanoid-X achieves large scale by converting internet videos but suffers from significant physical artifacts (floating, penetration, skating, joint violations). PHUMA directly addresses these quality issues through its physics-aware motion curation and PhySINK retargeting, making the large-scale data physically reliable. Figure 1 visually highlights this difference in physical reliability.
- Novel Retargeting Method (PhySINK):
  - Vs. Traditional IK: Traditional IK methods often fail to preserve motion style and produce unnatural gaits due to morphological differences between humans and robots. PhySINK, like SINK, adapts to body shape, but goes further.
  - Vs. SINK Methods: SINK methods improve pose matching and style preservation but are physically under-constrained, leading to artifacts. PhySINK innovates by augmenting SINK with explicit physical-constraint losses for joint feasibility, grounding, and anti-skating. This critical addition ensures that the retargeted motions are not only stylistically faithful but also physically plausible for simulation and real-world robot execution (Table 2 demonstrates this significant improvement).
- Comprehensive Data Curation: PHUMA includes a meticulous physics-aware motion curation pipeline (Section 3.1) that filters out problematic motions (e.g., high jitter, unstable center of mass, incorrect foot contact) from raw video-derived data. This proactive cleaning step is crucial for ensuring the quality of the large-scale dataset, a level of rigor not explicitly highlighted in similar video-based datasets.

In essence, PHUMA differentiates itself by marrying the scale and diversity of video-derived data with the physical reliability typically associated with expensive MoCap data, thus overcoming the primary limitations of previous approaches.
4. Methodology
The core idea behind PHUMA is to create a large-scale, physically reliable dataset for humanoid locomotion by building upon existing large motion datasets (like Humanoid-X) and meticulously filtering and retargeting them with physical constraints. The theoretical basis is that for humanoid robots to effectively learn humanlike behaviors through motion imitation, the reference motions must not only be diverse and abundant but also strictly adhere to physical laws and robot mechanical limitations. Unphysical motions lead to unstable learning and poor real-world performance.
The PHUMA pipeline consists of four main stages, as illustrated in Figure 3. The first two stages are central to constructing the PHUMA dataset itself: (1) Motion Curation to filter out problematic motions, and (2) Motion Retargeting using PhySINK to adapt filtered motions to the humanoid while enforcing physical plausibility. The subsequent stages (3) Policy Learning and (4) Inference use this curated data for training and application.
4.1. Core Methodology In-depth (Layer by Layer)
4.1.1. Overall PHUMA Pipeline (Figure 3)
The PHUMA pipeline integrates several steps to achieve physically-grounded humanoid locomotion:
- Motion Curation: This initial stage processes diverse human motion data, which can come from motion capture systems or be reconstructed from videos. The key task is to identify and filter out infeasible motions that contain physical artifacts such as root jitter, floating, penetration, joint violations, or actions requiring external objects (e.g., sitting on a chair that is not modeled). This step maximizes the physical plausibility of the raw motion data.
- Motion Retargeting (PhySINK): The cleaned motion data is then retargeted to the specific humanoid robot morphology. This crucial step employs PhySINK, the paper's novel physics-constrained retargeting method. PhySINK enforces joint limits, ground contact, and anti-skating constraints during the retargeting process, ensuring that the resulting robot motions are both kinematically similar to the human source and physically executable by the robot.
- Policy Learning: With a robust and physically plausible motion dataset, reinforcement learning (RL) is used to train a policy (a neural network controller) that imitates the retargeted motions. The policy learns to generate motor commands for the humanoid robot to track the desired motion trajectories. The MaskedMimic framework is used for this purpose.
- Application (Inference): The trained policy is then used to control the humanoid robot. This allows the robot to imitate motions from unseen videos (which are first converted to motion data via a video-to-motion model and then retargeted) or to perform tasks like path following.

The following sections detail the first two stages, which constitute the core of PHUMA's dataset construction methodology.
4.1.2. Physics-Aware Motion Curation (Section 3.1)
The goal of this stage is to refine raw motion data, which often contains artifacts, making it physically implausible for a humanoid robot. The process targets severe jitter, instabilities from interactions with unmodeled objects, and incorrect foot-ground contact.
- Low-Pass Noise Filtering for Motion Data: To mitigate the high-frequency jitter often present in raw motion data (especially from video-to-motion models), a low-pass Butterworth filter is applied to all motion channels. This filter smooths out rapid, noisy fluctuations while preserving the essential movement patterns.
  - The filter is a zero-phase, 4th-order Butterworth low-pass filter applied at the motion data's sampling frequency.
  - For root translation (the overall movement of the robot's base), a lower cutoff frequency is used: slower movements are preserved, while faster, jerky translations are attenuated.
  - For global orientation (the overall rotation of the robot in space) and body pose (the configuration of its joints), a higher cutoff frequency is used, allowing more dynamic rotational and pose changes than for root translation while still smoothing noise.
- Extracting Ground Contact Information: Establishing a consistent ground plane in the world frame is crucial for correcting foot-ground contact issues like floating and penetration. Raw motions, especially video-derived ones, often lack a true ground reference because they are defined in a camera's coordinate frame.
  - Foot Vertex Selection (Figure 6 & Table 6): The method identifies a specific subset of SMPL-X foot vertices that are most indicative of ground interaction: the 22 vertically lowest vertices from each of the four key foot regions, Left Heel (LH), Left Toe (LT), Right Heel (RH), and Right Toe (RT), for a total of 88 vertices. These vertices are illustrated in Figure 6. The following are the results from Table 6 of the original paper:

    | Region | Vertex indices |
    | --- | --- |
    | Left heel | 8888, 8889, 8891, 8909, 8910, 8911, 8913, 8914, 8915, 8916, 8917, 8918, 8919, 8920, 8921, 8922, 8923, 8924, 8925, 8929, 8930, 8934 |
    | Left toe | 5773, 5781, 5782, 5791, 5793, 5805, 5808, 5816, 5817, 5830, 5831, 5859, 5860, 5906, 5907, 5908, 5909, 5912, 5914, 5915, 5916, 5917 |
    | Right heel | 8676, 8677, 8679, 8697, 8698, 8699, 8701, 8702, 8703, 8704, 8705, 8706, 8707, 8708, 8709, 8710, 8711, 8712, 8713, 8714, 8715, 8716 |
    | Right toe | 8467, 8475, 8476, 8485, 8487, 8499, 8502, 8510, 8511, 8524, 8525, 8553, 8554, 8600, 8601, 8602, 8603, 8606, 8608, 8609, 8610, 8611 |

    As shown in Figure 6, these selected vertices accurately represent the ground-contact areas.
Figure 6: SMPL-X Foot Vertices for Ground-Contact Detection. This figure illustrates the selected foot vertices on the SMPL-X model used to detect ground contact. Green and orange points denote the left heel and left toe, while blue and pink represent the right heel and right toe, respectively. The remaining foot vertices are shown in light-gray. The clusters of colored points correspond to the specific parts of the foot that are used to check for contact with the ground, making the process more accurate and robust than using a single point. -
  - Majority-Voting Scheme for Ground Height: To establish a single, consistent ground plane without introducing jitter, a majority-voting scheme is used:
    - For each frame t, the minimum vertical (z) position among the 88 selected foot vertices is recorded as a candidate ground-plane height.
    - Each candidate is evaluated by counting the total number of foot vertices (across all frames in the sequence) that fall within a narrow tolerance band around that candidate height.
    - The candidate height that garners the most votes (i.e., has the most foot vertices consistently near it) is selected as the optimal ground plane.
    - Finally, the entire motion sequence is translated vertically so that the selected ground height sits at zero, establishing a consistent ground reference.
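The voting scheme above can be sketched in a few lines of NumPy. The function name, array shapes, and the tolerance value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def estimate_ground_height(foot_vertex_z, tol=0.01):
    """Majority-vote a ground height from per-frame foot-vertex heights.

    foot_vertex_z : (T, V) vertical coordinates of the selected foot
    vertices over T frames. Each frame's minimum height is a candidate
    ground level; the candidate with the most vertices (over all
    frames) inside a +/- tol band wins. `tol` is an assumed value.
    """
    candidates = foot_vertex_z.min(axis=1)            # one candidate per frame
    votes = [np.sum(np.abs(foot_vertex_z - c) < tol) for c in candidates]
    return float(candidates[int(np.argmax(votes))])

# The whole sequence would then be shifted vertically so that this
# estimated height becomes z = 0.
```

Because the vote aggregates over the entire sequence, a single frame with a spuriously low foot (e.g., a reconstruction glitch) cannot drag the ground plane down.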
- Filtering Motion Data by Physical Information: After establishing a reliable ground plane, the motion sequences are segmented into 4-second clips. Each clip is evaluated against specific physical criteria, and clips that fail the thresholds are discarded. This "chunk-and-filter" process maximizes the retention of viable segments from longer, potentially flawed sequences. The following are the results from Table 7 of the original paper:

  | Metric | Threshold |
  | --- | --- |
  | Root jerk | < 50 m/s³ |
  | Foot contact score | > 0.6 |
  | Minimum pelvis height | > 0.6 m |
  | Maximum pelvis height | < 1.5 m |
  | Pelvis distance to base of support | < 6 cm |
  | Spine1 distance to base of support | < 11 cm |

  The metrics and their thresholds are:
  - Root jerk: the rate of change of root acceleration. High root jerk indicates abrupt or unnatural movements; clips whose root jerk exceeds 50 m/s³ are discarded to ensure smooth, physically plausible trajectories.
  - Foot contact score: quantifies the consistency and sufficiency of foot-ground interactions based on graded ground-contact signals. It is the average, over all frames in a clip, of the maximum contact score across the four foot regions:

    $$\mathrm{Foot\ contact\ score} = \frac{1}{T} \sum_{t=1}^{T} \max\left(c_t^{lh},\, c_t^{lt},\, c_t^{rh},\, c_t^{rt}\right)$$

    where $T$ is the total number of frames in the sub-clip and $c_t^{lh}, c_t^{lt}, c_t^{rh}, c_t^{rt}$ are the contact scores at frame $t$ for the Left Heel, Left Toe, Right Heel, and Right Toe regions (e.g., based on vertex proximity to the ground). Taking the maximum means a frame counts as grounded if any part of the foot is firmly in contact. A score below 0.6 indicates significant penetration or floating and results in the clip being discarded. The authors note that motions with airborne phases (like jumps) can still satisfy this criterion as long as contact before and after the airborne phase is consistent.
  - Minimum and maximum pelvis height: filter out segments where the humanoid is positioned unnaturally low (e.g., excessively crouched, lying down, or penetrating the ground) or unnaturally high (e.g., floating or an exaggerated jump beyond reasonable bounds). The minimum pelvis height must exceed 0.6 m, and the maximum pelvis height must stay below 1.5 m.
  - Pelvis and Spine1 distance to base of support: assess the robot's balance and stability. The base of support is the convex hull formed by the horizontal-plane projections of the left foot, right foot, left ankle, and right ankle joints. The SMPL-X model's center of mass (CoM) typically lies between the pelvis and spine1 joints, so large horizontal deviations of these joints from the base of support indicate imbalance or instability that would be infeasible for humanoids. The pelvis distance must be less than 6 cm, and the spine1 distance less than 11 cm.
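As an illustration, the foot contact score and a simplified clip filter can be written as follows. `keep_clip` is a hypothetical helper applying a subset of the Table 7 thresholds; the base-of-support checks are omitted for brevity.

```python
import numpy as np

def foot_contact_score(contact):
    """Average over frames of the maximum contact score across the four
    foot regions (left heel, left toe, right heel, right toe).

    contact : (T, 4) graded contact scores in [0, 1].
    """
    return contact.max(axis=1).mean()

def keep_clip(contact, root_jerk_max, pelvis_z):
    """Return True if a 4-second clip passes the curation thresholds
    (subset of Table 7; base-of-support checks omitted)."""
    return bool(
        root_jerk_max < 50.0                    # m/s^3
        and foot_contact_score(contact) > 0.6
        and pelvis_z.min() > 0.6                # m
        and pelvis_z.max() < 1.5                # m
    )
```

Note how a jump clip can still pass: frames in flight contribute 0 to the per-frame maximum, but as long as most frames have at least one region firmly grounded, the average stays above 0.6.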
- Data Augmentation: Finally, the curated motions are augmented with data from other high-quality sources such as LaFAN1, LocoMuJoCo, and the authors' own video captures, contributing to the diversity and scale of the final PHUMA dataset.
4.1.3. Physics-Constrained Motion Retargeting (PhySINK) (Section 3.2)
This section details the PhySINK method, which extends Shape-Adaptive Inverse Kinematics (SINK) with explicit physical constraints to produce motions that are both stylistically faithful and physically plausible. PhySINK optimizes the humanoid joint configuration $q_t$ and root translation $\gamma_t$ over time $t$.
Figure 4 of the original paper illustrates the common physical artifacts in motion retargeting. From left to right: Motion Mismatch, Joint Violation, Floating, Penetration, and Skating.
-
Motion Fidelity Loss ($\mathcal{L}_{\mathrm{Fidelity}}$): This component ensures that the retargeted motion closely matches the source human motion in terms of both global joint positions and local limb orientations, and includes a smoothness term. It is defined as:

$ \mathcal{L}_{\mathrm{global\text{-}match}} = \sum_{t} \sum_{i} \left\| p_i^{\mathrm{SMPL\text{-}X}}(t) - p_i^{\mathrm{Humanoid}}(t) \right\|_1 $

$ \mathcal{L}_{\mathrm{local\text{-}match}} = \sum_{t} \sum_{i \neq j} m_{ij} \underbrace{\left\| \Delta p_{ij}^{\mathrm{SMPL\text{-}X}}(t) - \Delta p_{ij}^{\mathrm{Humanoid}}(t) \right\|_2^2}_{\mathrm{position}} + \sum_{t} \sum_{i \neq j} m_{ij} \underbrace{\left( 1 - \frac{\Delta p_{ij}^{\mathrm{SMPL\text{-}X}}(t) \cdot \Delta p_{ij}^{\mathrm{Humanoid}}(t)}{\left\| \Delta p_{ij}^{\mathrm{SMPL\text{-}X}}(t) \right\| \left\| \Delta p_{ij}^{\mathrm{Humanoid}}(t) \right\|} \right)}_{\mathrm{orientation}} $

$ \mathcal{L}_{\mathrm{smooth}} = \sum_{t} \left\| \dot{q}_t - 2\dot{q}_{t+1} + \dot{q}_{t+2} \right\|_1 + \sum_{t} \left\| \dot{\gamma}_t - 2\dot{\gamma}_{t+1} + \dot{\gamma}_{t+2} \right\|_1 $

$ \mathcal{L}_{\mathrm{Fidelity}} = w_{\mathrm{global\text{-}match}} \mathcal{L}_{\mathrm{global\text{-}match}} + w_{\mathrm{local\text{-}match}} \mathcal{L}_{\mathrm{local\text{-}match}} + w_{\mathrm{smooth}} \mathcal{L}_{\mathrm{smooth}} $

where:
- $t$: Time index (frame number).
- $i$: Joint index.
- $p_i^{\mathrm{SMPL\text{-}X}}(t)$: Global 3D position of joint $i$ at time $t$ for the source SMPL-X human model.
- $p_i^{\mathrm{Humanoid}}(t)$: Global 3D position of joint $i$ at time $t$ for the target humanoid robot.
- $\left\|\cdot\right\|_1$: L1-norm (Manhattan distance), used for the global-match loss.
- $\Delta p_{ij}$: Position difference (vector) between joint $i$ and joint $j$.
- $m_{ij}$: A binary mask that equals 1 if joints $i$ and $j$ are immediate neighbors in the humanoid kinematic tree (i.e., connected by a bone), and 0 otherwise. This focuses the local-match loss on maintaining the relative configuration of connected limbs.
- $\left\|\cdot\right\|_2^2$: Squared L2-norm (Euclidean distance), used for the position component of the local-match loss.
- The orientation term in $\mathcal{L}_{\mathrm{local\text{-}match}}$ is a cosine-similarity penalty, $1 - \cos\theta$, where $\theta$ is the angle between the vectors connecting neighboring joints in the human and humanoid models. Penalizing it preserves local limb orientations.
- $\dot{q}_t$: Velocity of the joint angles at time $t$.
- $\dot{\gamma}_t$: Velocity of the root translation at time $t$.
- $\dot{q}_t - 2\dot{q}_{t+1} + \dot{q}_{t+2}$ and $\dot{\gamma}_t - 2\dot{\gamma}_{t+1} + \dot{\gamma}_{t+2}$: Second-order finite-difference approximations of acceleration. Minimizing them promotes smoothness in the joint-angle and root-translation velocities, preventing jerky motions.
- $w_{\mathrm{global\text{-}match}}, w_{\mathrm{local\text{-}match}}, w_{\mathrm{smooth}}$: Weighting coefficients that set the relative importance of each term in the overall fidelity loss.
- Motion Fidelity (%): The average percentage of frames where the mean per-joint position error is below a fixed distance threshold and the mean per-link orientation error is below 10 degrees.
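As a concrete illustration, the fidelity terms can be sketched in NumPy (a minimal sketch; the function names, array shapes, and the small `eps` stabilizer are my own, not from the paper):

```python
import numpy as np

def global_match_loss(p_src, p_tgt):
    # L1 distance between source (SMPL-X) and humanoid global joint positions.
    # p_src, p_tgt: (T, J, 3) arrays.
    return np.abs(p_src - p_tgt).sum()

def local_match_loss(dp_src, dp_tgt, mask, eps=1e-8):
    # dp_src, dp_tgt: (T, J, J, 3) pairwise joint-offset vectors;
    # mask: (J, J) binary adjacency of the humanoid kinematic tree.
    pos = ((dp_src - dp_tgt) ** 2).sum(-1)                      # squared-L2 position term
    cos = (dp_src * dp_tgt).sum(-1) / (
        np.linalg.norm(dp_src, axis=-1) * np.linalg.norm(dp_tgt, axis=-1) + eps)
    return (mask * (pos + 1.0 - cos)).sum()                     # 1 - cosine = orientation term

def smoothness_loss(qdot):
    # Second-order finite difference of velocities, L1-penalized.
    # qdot: (T, D) joint (or root-translation) velocities.
    accel = qdot[:-2] - 2.0 * qdot[1:-1] + qdot[2:]
    return np.abs(accel).sum()
```

Each term is zero for a perfect match (identical positions, identical bone vectors, constant velocity) and grows with the mismatch, matching the structure of the equations above.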
-
Joint Feasibility Loss ($\mathcal{L}_{\mathrm{Feasibility}}$): This loss penalizes joint angles ($q$) and joint velocities ($\dot{q}$) that approach or exceed the predefined operational limits of the humanoid robot, preventing joint violations.

$ \mathcal{L}_{\mathrm{position\text{-}violation}} = \sum_{t} \left[ \max(0, q_t - 0.98 q_{\max}) + \max(0, 0.98 q_{\min} - q_t) \right] $

$ \mathcal{L}_{\mathrm{velocity\text{-}violation}} = \sum_{t} \left[ \max(0, \dot{q}_t - 0.98 \dot{q}_{\max}) + \max(0, 0.98 \dot{q}_{\min} - \dot{q}_t) \right] $

$ \mathcal{L}_{\mathrm{Feasibility}} = \mathcal{L}_{\mathrm{position\text{-}violation}} + \mathcal{L}_{\mathrm{velocity\text{-}violation}} $

where:
- $q_t$: Current joint angle at time $t$.
- $q_{\max}$: Upper mechanical limit for the joint angle.
- $q_{\min}$: Lower mechanical limit for the joint angle.
- $\max(0, \cdot)$: A rectified linear unit (ReLU)-like function that only penalizes positive violations.
- The factors $0.98 q_{\max}$ and $0.98 q_{\min}$ introduce a soft limit at 98% of the mechanical range, encouraging the optimization to stay comfortably within bounds rather than hitting them exactly.
- $\dot{q}_t$: Current joint angular velocity at time $t$.
- $\dot{q}_{\max}$: Upper mechanical limit for joint angular velocity.
- $\dot{q}_{\min}$: Lower mechanical limit for joint angular velocity.
- Joint Feasibility (%): The percentage of frames where all joint positions and velocities remain within 98% of their predefined mechanical limits.
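The hinge-style limit penalty can be sketched as follows (a minimal sketch; the array shapes and the `margin` keyword are my own conventions):

```python
import numpy as np

def feasibility_loss(q, q_min, q_max, margin=0.98):
    # Penalize joint angles (or velocities) outside the softened limit band.
    # q: (T, D) trajectory; q_min, q_max: (D,) mechanical limits.
    upper = np.maximum(0.0, q - margin * q_max)   # beyond 98% of the upper limit
    lower = np.maximum(0.0, margin * q_min - q)   # beyond 98% of the lower limit
    return (upper + lower).sum()
```

Applied once to joint positions and once to joint velocities, the two sums add up to the feasibility loss; inside the soft band the gradient is zero, so the fidelity terms dominate there.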
-
Grounding Loss ($\mathcal{L}_{\mathrm{Ground}}$): This loss term addresses floating and penetration artifacts by enforcing that the humanoid's foot regions remain on the ground plane during frames where contact is detected.

$ \mathcal{L}_{\mathrm{Ground}} = \sum_{i \in \{\mathrm{LH}, \mathrm{LT}, \mathrm{RH}, \mathrm{RT}\}} \sum_{t} c_t^i \left\| p_t^i(z) \right\|_2^2 $

where:
- $i$: Index over the four foot regions (Left Heel, Left Toe, Right Heel, Right Toe).
- $t$: Time index.
- $c_t^i$: A contact score for foot region $i$ at frame $t$, typically 1 if the foot region is in contact with the ground and 0 otherwise (or a continuous value indicating the degree of contact).
- $p_t^i(z)$: The vertical position ($z$-coordinate) of foot region $i$ at time $t$; the loss penalizes any vertical offset when contact is detected.
- $\left\|\cdot\right\|_2^2$: Squared L2-norm.
- Non-Floating (%): Percentage of contact frames where the foot is within a small tolerance above the ground.
- Non-Penetration (%): Percentage of contact frames where the foot is within a small tolerance below the ground.
-
Skating Loss ($\mathcal{L}_{\mathrm{Skate}}$): This loss prevents foot sliding by penalizing the horizontal velocity of any foot region that is detected to be in contact with the ground.

$ \mathcal{L}_{\mathrm{Skate}} = \sum_{i \in \{\mathrm{LH}, \mathrm{LT}, \mathrm{RH}, \mathrm{RT}\}} \sum_{t} c_t^i \left\| \dot{p}_t^i(x, y) \right\|_2 $

where:
- $i$: Index over the four foot regions.
- $t$: Time index.
- $c_t^i$: The contact score for foot region $i$ at frame $t$; the loss is only active when contact is detected.
- $\dot{p}_t^i(x, y)$: The horizontal velocity (velocity in the $x$-$y$ plane) of foot region $i$ at time $t$.
- $\left\|\cdot\right\|_2$: L2-norm.
- Non-Skating (%): Percentage of contact frames where the foot's horizontal velocity is below a small threshold.
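Both contact-conditioned losses can be sketched together (a minimal sketch; the contact arrays and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def ground_loss(contact, foot_pos):
    # Penalize squared vertical (z) offset of foot regions during contact.
    # contact: (T, 4) scores for LH, LT, RH, RT; foot_pos: (T, 4, 3).
    return (contact * foot_pos[..., 2] ** 2).sum()

def skate_loss(contact, foot_vel):
    # Penalize horizontal (x, y) speed of foot regions during contact.
    # foot_vel: (T, 4, 3) linear velocities of the foot regions.
    return (contact * np.linalg.norm(foot_vel[..., :2], axis=-1)).sum()
```

A foot that is planted (zero height, zero horizontal velocity) during contact contributes nothing; floating, penetration, or sliding during contact all add to the objective.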
-
PhySINK Objective ($\mathcal{L}_{\mathrm{PhySINK}}$): The complete PhySINK objective is a weighted sum of the motion fidelity loss and all the physical constraint terms. By optimizing this augmented objective, PhySINK generates motions that maintain kinematic similarity to the source while being physically plausible.

$ \mathcal{L}_{\mathrm{PhySINK}} = \mathcal{L}_{\mathrm{Fidelity}} + w_{\mathrm{Feasibility}} \mathcal{L}_{\mathrm{Feasibility}} + w_{\mathrm{Ground}} \mathcal{L}_{\mathrm{Ground}} + w_{\mathrm{Skate}} \mathcal{L}_{\mathrm{Skate}} $

where $w_{\mathrm{Feasibility}}$, $w_{\mathrm{Ground}}$, and $w_{\mathrm{Skate}}$ are weighting coefficients for the Joint Feasibility, Grounding, and Skating losses, respectively; they determine the importance of each physical constraint during optimization. The baseline SINK method would only use $\mathcal{L}_{\mathrm{Fidelity}}$.

This integration of physical constraints directly into the retargeting optimization is what distinguishes PhySINK and enables the creation of the physically-grounded PHUMA dataset. The authors validate PhySINK by retargeting PHUMA to two Unitree robots (G1 and H1-2) and comparing it against a standard IK solver and the SINK framework, showing progressive performance gains with the addition of each physical constraint loss (Table 2 in the results).
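In code, the augmented objective is simply a weighted sum of scalar terms; a sketch (the default weight values are placeholders, not the paper's):

```python
def physink_objective(losses, w_feasibility=1.0, w_ground=1.0, w_skate=1.0):
    # losses: dict of scalar loss terms computed for the current iterate.
    # Each term is differentiable in (q, gamma), so in practice the sum is
    # minimized jointly with a gradient-based optimizer.
    return (losses["fidelity"]
            + w_feasibility * losses["feasibility"]
            + w_ground * losses["ground"]
            + w_skate * losses["skate"])
```

Setting the three physics weights to zero recovers a fidelity-only objective, i.e., the baseline SINK behavior.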
5. Experimental Setup
The experiments are designed to evaluate the effectiveness of PhySINK and the PHUMA dataset across three main research questions: (RQ1) PhySINK vs. established retargeting methods; (RQ2) PHUMA vs. prior datasets for motion imitation; and (RQ3) PHUMA's impact on path-following with simplified control.
5.1. Datasets
The following datasets were used in the experiments:
-
PHUMA Dataset:
-
Source: The paper's novel dataset. It aggregates motions from LocoMuJoCo, GRAB, EgoBody, LAFAN1, and AMASS (all motion capture), plus HAA500, Motion-X Video, HuMMan, AIST, IDEA400 (all human video), and PHUMA Video (the authors' own video captures).
-
Scale: Large-scale, containing 73.0 hours of physically-grounded motion across 76.0K clips.
-
Characteristics: Designed to be both large-scale and high-quality (physically reliable), correcting physical artifacts through careful curation and PhySINK.
-
Why chosen: It is the primary subject of the paper, demonstrating the efficacy of the proposed methodology. The following are the results from Table 1 of the original paper:
| Dataset | # Clip | # Frame | Duration | Source |
| LocoMuJoCo (Al-Hafez et al., 2023) | 0.78K | 0.93M | 0.86h | Motion Capture |
| GRAB (Taheri et al., 2020) | 1.73K | 0.20M | 1.88h | Motion Capture |
| EgoBody (Zhang et al., 2022) | 2.12K | 0.24M | 2.19h | Motion Capture |
| LAFAN1 (Harvey et al., 2020) | 2.18K | 0.26M | 2.40h | Motion Capture |
| AMASS (Mahmood et al., 2019) | 21.73K | 2.25M | 20.86h | Motion Capture |
| HAA500 (Chung et al., 2021) | 1.76K | 0.11M | 1.01h | Human Video |
| Motion-X Video (Lin et al., 2023) | 33.04K | 3.45M | 31.98h | Human Video |
| HuMMan (Cai et al., 2022) | 0.50K | 0.05M | 0.47h | Human Video |
| AIST (Tsuchida et al., 2019) | 1.75K | 0.18M | 1.66h | Human Video |
| IDEA400 (Lin et al., 2023) | 9.94K | 0.98M | 9.10h | Human Video |
| PHUMA Video | 0.50K | 0.06M | 0.56h | Human Video |
| PHUMA | 76.01K | 7.88M | 72.96h | |
-
LaFAN1 (Harvey et al., 2020):
- Source: Motion capture.
- Scale: Small-scale (2.4 hours).
- Characteristics: High-quality, but limited in diversity.
- Why chosen: Represents a baseline for high-quality but small-scale motion data.
-
AMASS (Mahmood et al., 2019):
- Source: Motion capture.
- Scale: Medium-scale (20.9 hours).
- Characteristics: Widely-used, moderate-quality, but limited in diversity, primarily walking motions.
- Why chosen: A common benchmark, representing the state of the art in widely available MoCap data. Used with the SINK retargeting method for fair comparison, as it provides human motion source data.
-
Humanoid-X (Mao et al., 2025):
- Source: Large-scale internet videos converted to motion.
- Scale: Large-scale (231.4 hours).
- Characteristics: High diversity, but lower quality due to physical artifacts (floating, penetration, skating, joint violations) inherent in video-to-motion conversion and unconstrained retargeting.
- Why chosen: Represents the trend of scaling data with video, highlighting the problems PHUMA aims to solve. Used directly as a pre-existing humanoid dataset.
-
Self-Collected Video Dataset (for evaluation):
-
Source: Custom video recordings by the authors.
-
Scale: 504 sequences across 11 motion types.
-
Characteristics: Designed for unbiased evaluation on completely unseen motions not present in any training data, with a uniform distribution across 11 motion types.
-
Why chosen: To ensure fair evaluation of generalization capabilities beyond the training data. As shown in Figure 9, this dataset is processed through a pipeline:
Figure 9: Overview of the Self-collected Data Pipeline. This figure illustrates the three main steps of our data collection pipeline: (left) a self-recorded video of a human motion, (center) the motion extracted using a video-to-motion model, and (right) the final motion retargeted to a humanoid robot.
Example of a data sample: A self-recorded video of a human performing a specific motion (e.g., a jump), which is converted to SMPL parameters using TRAM (Wang et al., 2024) and finally retargeted to the humanoid robot using PhySINK.
5.2. Evaluation Metrics
For every evaluation metric, the paper implicitly uses specific definitions and thresholds.
-
Success Rate (for Motion Tracking):
- Conceptual Definition: This metric quantifies the effectiveness of a policy in imitating a target motion. It measures the proportion of motion sequences where the robot successfully tracks the reference motion within a specified deviation threshold over its entire duration. A higher success rate indicates better imitation performance.
- Mathematical Formula: While not explicitly provided in the paper, it is defined conceptually as:
$
\text{Success Rate} = \frac{\text{Number of successfully imitated motions}}{\text{Total number of evaluated motions}} \times 100\%
$
A motion is considered "successfully imitated" if the mean per-joint position error (or pelvis tracking error for path following) remains below a certain threshold for the entire duration of the motion.
- Symbol Explanation:
  - Number of successfully imitated motions: The count of individual motion sequences where the policy achieves the tracking goal within the defined error tolerance.
  - Total number of evaluated motions: The total count of motion sequences used for testing.
- Thresholds:
  - Standard Threshold: The loose deviation threshold conventionally used in prior motion imitation studies. The authors argue it incorrectly classifies scenarios as successful when humanoids remain stationary during jumps or stay upright during squatting motions, failing to capture true imitation quality.
  - Stricter Threshold (Proposed): The authors adopt a stricter threshold for a more meaningful measure of imitation quality and to reveal clearer performance differences. It is applied to the mean per-joint position error for full-state tracking and to pelvis tracking accuracy for path following.
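The metric can be sketched as follows (a minimal sketch; the per-clip error arrays are hypothetical inputs):

```python
import numpy as np

def success_rate(per_frame_errors, threshold):
    # per_frame_errors: list of (T_k,) arrays, one per motion clip, holding the
    # mean per-joint position error (or pelvis error) at each frame.
    # A clip counts as a success only if every frame stays under the threshold.
    successes = [bool(np.all(err < threshold)) for err in per_frame_errors]
    return 100.0 * sum(successes) / len(successes)
```

Tightening the threshold can only remove clips from the success set, which is why the stricter threshold separates the methods more clearly.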
-
Pelvis Path Following Success Rate:
- Conceptual Definition: This metric is a specialized success rate for the path-following task. It measures how effectively the policy can track the pelvis trajectory of a reference motion, even with pelvis-only guidance. A motion is successful if the robot's pelvis stays within the deviation threshold of the target pelvis trajectory throughout the sequence.
- Mathematical Formula: Conceptually similar to the general success rate, but focused on pelvis error: $ \text{Pelvis Path Following Success Rate} = \frac{\text{Number of motions with successful pelvis tracking}}{\text{Total number of evaluated motions}} \times 100\% $
- Symbol Explanation:
  - Number of motions with successful pelvis tracking: Count of motions where the robot's pelvis position stays within the threshold of the target pelvis trajectory.
  - Total number of evaluated motions: Total motions tested.
-
Motion Categories (for evaluation breakdown): To assess how well policies generalize across different types of human movement, evaluations are organized into four categories:
  - Stationary: Motions involving minimal displacement, such as stand and reach.
  - Angular: Motions with significant rotational components or changes in body orientation, like bend, twist, turn, and kick.
  - Vertical: Motions involving significant vertical displacement of the body, such as squat, lunge, and jump.
  - Horizontal: Motions primarily involving horizontal translation, like walk and run.
5.3. Baselines
The paper compares its methodology against several representative baselines:
-
Retargeting Methods (for RQ1):
- IK (Inverse Kinematics): A standard, general-purpose IK solver (specifically, mink by Zakka, 2025). This represents a basic approach to mapping human end-effector positions to robot joint angles, often without explicit consideration of body proportions or physical constraints.
- SINK (Shape-Adaptive Inverse Kinematics): A learning-based IK method that adapts the source human model to match the body shape and limb proportions of the target robot. It preserves motion style better than traditional IK but is physically under-constrained.
-
Prior Datasets (for RQ2 and RQ3):
- LaFAN1 (Harvey et al., 2020): A small-scale, high-quality motion capture dataset. It serves as a baseline for systems trained on limited but clean data.
- AMASS (Mahmood et al., 2019): A medium-scale, moderate-quality motion capture dataset and widely used benchmark. For comparison, AMASS data is retargeted using SINK (the best method prior to PhySINK for preserving style).
- Humanoid-X (Mao et al., 2025): A large-scale, lower-quality dataset derived from internet videos. It represents the trade-off of scale over data fidelity, serving as a direct contrast to PHUMA's physically-grounded approach.
5.4. Training Framework and Robots
-
Training Framework: All policies are trained using the MaskedMimic framework (Tessler et al., 2024).
  - Full-State Motion Tracking (RQ1, RQ2): Policies receive the current proprioceptive state (joint positions, orientations, velocities) and full goal states (target motion trajectories), and output joint angle commands executed via PD controllers. The reward function measures how well the humanoid matches the target motion.
  - Partial-State Protocol / Pelvis-Only Guidance (RQ3): This uses a teacher-student learning approach:
    - A full-state teacher policy is first trained on full-body reference motion data.
    - A student policy is then trained via knowledge distillation to mimic the teacher's actions while receiving only pelvis position and rotation as input. This enables pelvis path-following control while maintaining humanlike movement.
  - Reinforcement Learning Algorithm: PPO (Proximal Policy Optimization) (Schulman et al., 2017) is used for training.
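The pelvis-only distillation step can be illustrated with a toy linear student regressed onto teacher actions (a minimal sketch; the dimensions, the linear teacher, and the least-squares fit are my own stand-ins for the actual networks and iterative distillation):

```python
import numpy as np

rng = np.random.default_rng(0)
FULL_DIM, PELVIS_DIM, ACT_DIM = 12, 3, 4        # toy sizes, not the real robot's

# Stand-in for the trained full-state teacher policy.
W_teacher = rng.normal(size=(FULL_DIM, ACT_DIM))
def teacher(full_state):
    return full_state @ W_teacher

# Collect (pelvis observation, teacher action) pairs, then fit the student
# to imitate the teacher from the reduced observation.
full_states = rng.normal(size=(256, FULL_DIM))
pelvis_obs = full_states[:, :PELVIS_DIM]         # student sees pelvis info only
targets = teacher(full_states)
W_student, *_ = np.linalg.lstsq(pelvis_obs, targets, rcond=None)

student_actions = pelvis_obs @ W_student         # pelvis-conditioned imitation
```

The student can only recover what the pelvis observation explains, which is why the quality and diversity of the teacher's training motions matter so much for RQ3.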
-
Simulation Environment: All experiments are conducted in the
IsaacGym simulator. -
Humanoid Robots:
- Unitree G1: A humanoid robot with 29 Degrees of Freedom (DoF).
- Unitree H1-2: A humanoid robot with 21 DoF (excluding wrist joints). These robots represent common humanoid platforms for locomotion research.
5.5. Observation Space Compositions
The observation space for the RL agent comprises two main components: proprioceptive states (what the robot senses about itself) and goal states (information about the target motion).
The following are the results from Table 8 of the original paper:
| State | Dim (G1) | Dim (H1-2) |
| (a) Proprioceptive State | ||
| Root height | 1 | 1 |
| Body position | 32 × 3 | 24 × 3 |
| Body rotation | 33 × 6 | 25 × 6 |
| Body velocity | 33 × 3 | 25 × 3 |
| Body angular velocity | 33 × 3 | 25 × 3 |
| (b) Goal State | ||
| Relative body position | 33 × 15 × 3 | 25 × 15 × 3 |
| Absolute body position | 33 × 15 × 3 | 25 × 15 × 3 |
| Relative body rotation | 33 × 15 × 6 | 25 × 15 × 6 |
| Absolute body rotation | 33 × 15 × 6 | 25 × 15 × 6 |
| Time | 33 × 15 × 1 | 25 × 15 × 1 |
| Total dim | 9898 | 9898 |
-
Proprioceptive States: These are measurements from the robot's own sensors about its current state.
- Root height: The vertical position of the robot's root (usually the pelvis). Dimension: 1.
- Body position: The 3D positions of the robot's bodies (links), excluding the root. G1 has 32 such bodies (32 × 3 = 96 dimensions); H1-2 has 24 (24 × 3 = 72 dimensions).
- Body rotation: The orientation of all bodies (including the root), using a 6D rotation representation (e.g., two columns of a rotation matrix). G1 has 33 bodies (33 × 6 = 198 dimensions); H1-2 has 25 (25 × 6 = 150 dimensions).
- Body velocity: The linear velocity of all bodies (including the root). G1: 33 × 3 = 99 dimensions; H1-2: 25 × 3 = 75 dimensions.
- Body angular velocity: The angular velocity of all bodies (including the root). G1: 33 × 3 = 99 dimensions; H1-2: 25 × 3 = 75 dimensions.
-
Goal States: Information about the target motion trajectory that the robot needs to imitate, provided over a future horizon of 15 timesteps.
  - Relative body position: The difference between the future 15 timesteps of reference motion states and the robot's current proprioceptive state, indicating where the target body parts should be relative to the robot's current pose. Dimension: 33 × 15 × 3 = 1485 for G1, 25 × 15 × 3 = 1125 for H1-2.
  - Absolute body position: The 3D positions of the target body parts relative to the reference motion's root position, providing a root-relative coordinate frame that makes the goal robust to global shifts. Dimension: 33 × 15 × 3 = 1485 for G1, 25 × 15 × 3 = 1125 for H1-2.
  - Relative body rotation: The difference in orientation between the future 15 timesteps of reference motion states and the robot's current proprioceptive state (6D rotation representation). Dimension: 33 × 15 × 6 = 2970 for G1, 25 × 15 × 6 = 2250 for H1-2.
  - Absolute body rotation: The orientations of the target body parts relative to the reference motion's root orientation (6D rotation representation). Dimension: 33 × 15 × 6 = 2970 for G1, 25 × 15 × 6 = 2250 for H1-2.
  - Time: Time-related information, indicating progress within the motion sequence. Dimension: 33 × 15 × 1 = 495 for G1, 25 × 15 × 1 = 375 for H1-2.

The table reports a total observation dimension of 9898 for both G1 and H1-2, a very high-dimensional state representation.
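The reported total can be checked against the per-term dimensions for the G1 (33 bodies including the root, 32 non-root bodies, and a 15-step goal horizon):

```python
# Proprioceptive state: root height + non-root positions, plus rotations,
# velocities, and angular velocities of all 33 bodies.
proprio = 1 + 32 * 3 + 33 * 6 + 33 * 3 + 33 * 3          # = 493

# Goal state over a 15-step horizon: relative/absolute positions,
# relative/absolute rotations (6D), and a time channel.
goal = 2 * (33 * 15 * 3) + 2 * (33 * 15 * 6) + 33 * 15   # = 9405

total = proprio + goal
print(total)  # 9898, matching Table 8
```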
-
5.6. Reward Function
The reward function guides the reinforcement learning agent during training, shaping its behavior to effectively track the target motion. It comprises motion tracking task rewards and regularization rewards.
The following are the results from Table 9 of the original paper:
| Term | Expression | Weight |
| (a) Task | | |
| Global body position | $\exp(-100 \cdot \Vert p_t - \hat{p}_t \Vert^2)$ | 0.5 |
| Root height | $\exp(-100 \cdot (h_{\mathrm{root}} - \hat{h}_{\mathrm{root}})^2)$ | 0.2 |
| Global body rotation | $\exp(-10 \cdot \Vert \theta_t \ominus \hat{\theta}_t \Vert^2)$ | 0.3 |
| Global body velocity | $\exp(-0.5 \cdot \Vert v_t - \hat{v}_t \Vert^2)$ | 0.1 |
| Global body angular velocity | $\exp(-0.1 \cdot \Vert \omega_t - \hat{\omega}_t \Vert^2)$ | 0.1 |
| (b) Regularization | | |
| Power consumption | $\Vert F \cdot \dot{q} \Vert_1$ | -1e-05 |
| Action rate | $\Vert a_t - a_{t-1} \Vert^2$ | -0.2 |
-
Motion Tracking Task Rewards: These terms encourage the policy to precisely match the reference motion. They are exponentially decaying functions of the error, so smaller errors yield much higher rewards.
  - Global body position: Rewards the policy for aligning the positions of all robot bodies with the targets.
    - Expression: $\exp(-100 \cdot \Vert p_t - \hat{p}_t \Vert^2)$, Weight: 0.5
    - $p_t$: current global body position; $\hat{p}_t$: target global body position; $\Vert\cdot\Vert^2$: squared L2-norm of the position error.
    - The large coefficient (100) inside the exponential heavily penalizes even small position errors, encouraging precise matching.
  - Root height: Rewards the policy for maintaining the correct vertical position of the robot's root (pelvis).
    - Expression: $\exp(-100 \cdot (h_{\mathrm{root}} - \hat{h}_{\mathrm{root}})^2)$, Weight: 0.2
    - $h_{\mathrm{root}}$: current root height; $\hat{h}_{\mathrm{root}}$: target root height. As with body position, height error is heavily penalized.
  - Global body rotation: Rewards the policy for aligning the orientations of all robot bodies with the targets.
    - Expression: $\exp(-10 \cdot \Vert \theta_t \ominus \hat{\theta}_t \Vert^2)$, Weight: 0.3
    - $\theta_t$: current global body rotation; $\hat{\theta}_t$: target global body rotation; the norm measures the rotation error (e.g., in quaternion or 6D rotation form). The smaller coefficient (10) indicates slightly lower sensitivity to rotational errors than to positional ones.
  - Global body velocity: Rewards the policy for matching the linear velocities of all robot bodies.
    - Expression: $\exp(-0.5 \cdot \Vert v_t - \hat{v}_t \Vert^2)$, Weight: 0.1
    - $v_t$: current global body linear velocity; $\hat{v}_t$: target global body linear velocity.
  - Global body angular velocity: Rewards the policy for matching the angular velocities of all robot bodies.
    - Expression: $\exp(-0.1 \cdot \Vert \omega_t - \hat{\omega}_t \Vert^2)$, Weight: 0.1
    - $\omega_t$: current global body angular velocity; $\hat{\omega}_t$: target global body angular velocity.
    - The lower error coefficients for the velocity terms (0.5 and 0.1) suggest that exact velocity matching is less critical than correct positions and orientations.
-
Regularization Rewards: These terms penalize undesirable behaviors to promote smooth and stable motion execution.
  - Power consumption: Penalizes the power exerted by the robot's motors, encouraging energy-efficient movements.
    - Expression: $\Vert F \cdot \dot{q} \Vert_1$, Weight: -1e-05
    - $F$: force/torque applied by the actuators; $\dot{q}$: joint velocity.
    - The small negative weight gently encourages the robot to minimize energy use, but not at the expense of task performance.
  - Action rate: Penalizes large changes between consecutive actions (joint commands), ensuring smooth joint movements and preventing abrupt transitions that can cause instability or hardware wear.
    - Expression: $\Vert a_t - a_{t-1} \Vert^2$, Weight: -0.2
    - $a_t$: current action (joint command); $a_{t-1}$: previous action; $\Vert\cdot\Vert^2$: squared L2-norm of the difference between consecutive actions.

The overall reward is the sum of all these terms, weighted by their respective coefficients.
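The task portion of the reward can be sketched directly from the table (a minimal sketch; the error-dict interface is my own convention):

```python
import numpy as np

def tracking_reward(err):
    # err: dict of squared tracking errors for the current timestep.
    r = 0.5 * np.exp(-100.0 * err["pos"])      # global body position
    r += 0.2 * np.exp(-100.0 * err["height"])  # root height
    r += 0.3 * np.exp(-10.0 * err["rot"])      # global body rotation
    r += 0.1 * np.exp(-0.5 * err["vel"])       # global body velocity
    r += 0.1 * np.exp(-0.1 * err["angvel"])    # global body angular velocity
    return r
```

With perfect tracking (all errors zero) the task reward peaks at 0.5 + 0.2 + 0.3 + 0.1 + 0.1 = 1.2, and each term decays toward zero as its error grows; the regularization penalties are then added with their negative weights.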
-
5.7. PPO Hyperparameter
The Proximal Policy Optimization (PPO) algorithm is used for training the policies. The following are the detailed hyperparameter settings:
The following are the results from Table 10 of the original paper:
| Hyperparameter | Value |
| Optimizer | Adam |
| Num envs | 8192 |
| Mini Batches | 32 |
| Learning epochs | 1 |
| Entropy coefficient | 0.0 |
| Value loss coefficient | 0.5 |
| Clip param | 0.2 |
| Max grad norm | 50.0 |
| Init noise std | -2.9 |
| Actor learning rate | 2e-5 |
| Critic learning rate | 1e-4 |
| GAE decay factor(λ) | 0.95 |
| GAE discount factor(γ) | 0.99 |
| Actor Transformer dimension | 512 |
| Actor layers | 4 |
| Actor heads | 4 |
| Critic MLP size | [1024, 1024, 1024, 1024] |
| Activation | ReLU |
- Optimizer: Adam. A popular deep-learning optimizer known for efficiency and robust performance.
- Num envs: 8192. The number of parallel simulation environments run simultaneously during training; a large count improves data-collection efficiency and exploration in RL.
- Mini Batches: 32. The number of mini-batches the collected data is split into for each network update step.
- Learning epochs: 1. The number of passes over the collected data per PPO iteration.
- Entropy coefficient: 0.0. This term would encourage exploration by rewarding higher policy entropy; a value of 0.0 means this regularization is not used in this setup.
- Value loss coefficient: 0.5. Weights the value-function loss in the overall PPO objective; the value function estimates the expected future reward from a given state.
- Clip param: 0.2. The parameter of PPO's clipped surrogate objective, limiting the policy update size to prevent large, destabilizing changes.
- Max grad norm: 50.0. Gradient clipping scales down gradients whose L2 norm exceeds this threshold, preventing exploding gradients.
- Init noise std: -2.9. Likely the initial log standard deviation of the policy's action noise (i.e., an initial std of about $e^{-2.9} \approx 0.055$).
- Actor learning rate: 2e-5. The learning rate of the actor (policy) network.
- Critic learning rate: 1e-4. The learning rate of the critic (value) network.
- GAE decay factor (λ): 0.95. The lambda parameter of Generalized Advantage Estimation (GAE), balancing bias and variance in advantage estimates.
- GAE discount factor (γ): 0.99. The discount factor for future rewards; a value close to 1 indicates a long planning horizon.
- Actor Transformer dimension: 512. The internal representation size of the Transformer used in the actor policy.
- Actor layers: 4. The number of layers in the actor Transformer.
- Actor heads: 4. The number of attention heads in the actor Transformer.
- Critic MLP size: [1024, 1024, 1024, 1024]. The critic is a Multi-Layer Perceptron (MLP) with four hidden layers of 1024 units each.
- Activation: ReLU (Rectified Linear Unit), used throughout the networks.
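For reference, the advantage computation implied by the GAE parameters (λ = 0.95, γ = 0.99) looks like this (a standard textbook formulation, not code from the paper):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: (T,) per-step rewards; values: (T+1,) value estimates,
    # including a bootstrap value for the state after the last step.
    advantages = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                  # discounted sum
        advantages[t] = running
    return advantages
```

λ = 0.95 keeps some multi-step return information (lower bias) while damping noise from distant rewards (lower variance) relative to λ = 1.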
6. Results & Analysis
The experiments evaluate PhySINK and PHUMA across three research questions using the Unitree G1 and H1-2 humanoids and two test sets: PHUMA Test (10% held-out from PHUMA) and Unseen Video (504 self-collected sequences). The primary metric is success rate with a strict deviation threshold.
6.1. Core Results Analysis
6.1.1. PhySINK Retargeting Method Effectiveness (RQ1)
This section evaluates how PhySINK compares to established retargeting approaches (IK and SINK) in terms of motion imitation performance. The policies are trained on AMASS data, retargeted using each of the three methods.
The following are the results from Table 3 of the original paper:
| PHUMA Test | Unseen Video | |||||||||
| Retarget | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
| (a) G1 | ||||||||||
| IK | 52.8 | 75.3 | 43.9 | 24.3 | 44.2 | 54.0 | 80.3 | 54.6 | 32.7 | 43.3 |
| SINK | 76.2 | 88.5 | 72.1 | 56.8 | 66.8 | 70.2 | 90.7 | 75.0 | 62.7 | 44.1 |
| PhySINK | 79.5 | 89.9 | 76.1 | 61.1 | 69.5 | 72.8 | 93.3 | 78.2 | 65.5 | 47.3 |
| (b) H1-2 | ||||||||||
| IK | 45.3 | 70.9 | 35.7 | 15.2 | 35.0 | 54.2 | 78.0 | 60.7 | 30.1 | 28.6 |
| SINK | 54.4 | 74.9 | 45.9 | 17.2 | 49.6 | 64.3 | 87.3 | 59.7 | 46.0 | 63.9 |
| PhySINK | 64.3 | 83.6 | 57.0 | 27.7 | 55.9 | 72.4 | 99.2 | 66.3 | 57.4 | 63.1 |
Analysis of Table 3:
- PhySINK consistently outperforms others: PhySINK achieves the highest success rates across all motion categories (Stationary, Angular, Vertical, Horizontal) and on both humanoid robots (G1 and H1-2) for both the PHUMA Test and Unseen Video sets. This demonstrates that physically-grounded retargeting directly translates to better imitation performance.
- SINK's improvement over IK: SINK significantly improves on IK for both robots and test sets, highlighting the importance of shape-adaptive retargeting in preserving motion style and making it imitable. For example, on the G1 PHUMA Test, SINK (76.2%) far exceeds IK (52.8%).
- PhySINK's further gains: PhySINK improves further on SINK, most noticeably in dynamic motions (the Vertical and Horizontal categories) where physical constraints matter most. For G1 on PHUMA Test, PhySINK (61.1%) beats SINK (56.8%) in Vertical motions; for H1-2, PhySINK (27.7%) substantially exceeds SINK (17.2%). This validates explicitly addressing joint limits, ground contact, and skating during retargeting.
- Performance on Unseen Video: The same trends hold on the Unseen Video test set, indicating that the benefits of PhySINK generalize to novel motions.
6.1.2. PHUMA Dataset Effectiveness (RQ2)
This section compares the motion tracking performance of policies trained on LaFAN1, AMASS, Humanoid-X, and PHUMA.
The following are the results from Table 4 of the original paper:
| PHUMA Test | Unseen Video | ||||||||||
| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
| (a) G1 | |||||||||||
| LaFAN1 | 2.4 | 46.1 | 66.1 | 36.2 | 24.0 | 42.5 | 28.4 | 46.9 | 28.4 | 19.6 | 10.5 |
| AMASS | 20.9 | 76.2 | 88.5 | 72.1 | 56.8 | 66.8 | 70.2 | 90.7 | 75.0 | 62.7 | 44.1 |
| Humanoid-X | 231.4 | 50.6 | 78.4 | 43.0 | 26.0 | 31.8 | 39.1 | 78.0 | 39.6 | 23.0 | 6.5 |
| PHUMA | 73.0 | 92.7 | 95.6 | 91.7 | 86.0 | 85.6 | 82.9 | 96.7 | 88.0 | 78.0 | 67.1 |
| (b) H1-2 | |||||||||||
| LaFAN1 | 2.4 | 62.0 | 79.3 | 54.7 | 26.6 | 58.9 | 70.8 | 92.4 | 66.7 | 56.4 | 68.2 |
| AMASS | 20.9 | 54.4 | 74.9 | 45.9 | 17.2 | 49.6 | 64.3 | 87.3 | 59.7 | 46.0 | 63.9 |
| Humanoid-X | 231.4 | 49.7 | 74.6 | 40.4 | 17.0 | 37.3 | 60.5 | 88.3 | 60.0 | 48.7 | 39.7 |
| PHUMA | 73.0 | 82.7 | 91.5 | 79.5 | 68.1 | 68.4 | 78.6 | 97.5 | 76.8 | 74.5 | 63.8 |
Analysis of Table 4:
-
PHUMA's Dominance: Policies trained on PHUMA achieve the highest success rates across all motion categories, on both G1 and H1-2, and for both test sets. For G1 on PHUMA Test, PHUMA scores 92.7% total success, significantly outperforming AMASS (76.2%), Humanoid-X (50.6%), and LaFAN1 (46.1%). This strongly validates PHUMA's combined benefits of large scale and high physical quality.
-
Scale Alone is Insufficient: Humanoid-X, despite being the largest dataset (231.4 hours), significantly underperforms AMASS (20.9 hours) and even LaFAN1 in some cases (e.g., G1 Total on PHUMA Test). Scale alone is insufficient when physical reliability is compromised: the physical artifacts in Humanoid-X hinder effective policy learning.
-
Quality Alone is Insufficient: LaFAN1 (2.4 hours) offers high-quality data, but its limited scale and diversity lead to poor generalization, especially on the Unseen Video test set (G1 total 28.4%). Even AMASS (20.9 hours), while better than LaFAN1, fails to cover diverse motion types.
-
Importance of Diversity (Figure 8): The paper notes that LaFAN1 and AMASS have uneven motion distributions, lacking certain categories (e.g., reach, bend, squat) or being dominated by simple ones, while PHUMA provides more balanced, comprehensive coverage, leading to better generalization across diverse behaviors.
Figure 8: Motion Type Distribution per Dataset. This radar chart compares the total duration of each motion type across the PHUMA, AMASS, and LaFAN1 datasets.
Figure 8 visually confirms PHUMA's superior motion type distribution: it shows more balanced coverage and substantially more examples per motion type than AMASS and LaFAN1, which have prominent peaks for specific motions (e.g., walk in AMASS) and gaps in others.
-
Performance on Unseen Motions: PHUMA maintains strong performance on the Unseen Video test set (G1 total 82.9%), demonstrating excellent generalization to motions not encountered during training, which is crucial for real-world application.
6.1.3. Pelvis-Only Path Following Control Performance (RQ3)
This section evaluates PHUMA's effectiveness for pelvis path-following control, a simplified control mode that uses only pelvis position and rotation as input. Policies are trained using a teacher-student distillation approach, with teachers trained on AMASS and PHUMA.
The following are the results from Table 5 of the original paper:
| | PHUMA Test | | | | | Unseen Video | | | | |
| Dataset | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
| (a) G1 | ||||||||||
| AMASS | 60.5 | 85.6 | 60.1 | 51.4 | 66.5 | 54.8 | 83.6 | 66.5 | 33.0 | 27.5 |
| PHUMA | 84.5 | 94.6 | 86.1 | 83.7 | 90.2 | 74.6 | 98.3 | 83.3 | 54.3 | 57.1 |
| (b) H1-2 | | | | | | | | | | |
| AMASS | 60.4 | 84.0 | 62.8 | 43.6 | 78.7 | 72.3 | 96.6 | 77.3 | 52.1 | 72.5 |
| PHUMA | 73.9 | 91.2 | 76.5 | 66.9 | 84.8 | 78.1 | 96.6 | 77.8 | 60.6 | 78.0 |
Analysis of Table 5:

- PHUMA's Superiority in Path Following: Policies trained on PHUMA consistently outperform those trained on AMASS across all motion categories and both humanoids, on both test sets. For G1 on the PHUMA Test set, PHUMA achieves an 84.5% total success rate compared to AMASS's 60.5%.
- Pronounced Gains in Dynamic Motions: The improvement is particularly significant for Vertical and Horizontal motions. For G1 on the PHUMA Test set, PHUMA (83.7% vertical, 90.2% horizontal) shows substantial gains over AMASS (51.4% vertical, 66.5% horizontal), which the paper attributes to AMASS's limited coverage of dynamic movements.
- Running Motion Example (Figure 5): The paper provides a qualitative running-motion example in which AMASS-trained policies frequently fail while PHUMA-trained policies remain robust, visually confirming that PHUMA enables more diverse and dynamic humanoid control.

Figure 5: Path following on running motion. The robot's trajectory is visualized for a running motion, with the target pelvis path drawn as a green line; the top row shows a policy trained on AMASS and the bottom row a policy trained on PHUMA. The figure clearly illustrates the difference: the AMASS-trained robot (top row) deviates significantly from the green target path and often falls, while the PHUMA-trained robot (bottom row) stays close to the path, demonstrating successful and stable running.

- Practical Value: These results validate PHUMA's practical value for complex control, especially when only partial state information (such as the pelvis pose) is available, making it more applicable to real-world control interfaces.
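The pelvis-only setting can be made concrete with a small sketch: from a full-body reference trajectory, only the pelvis position and heading are kept as the policy's guidance signal. The array layout, column indices, and function name below are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def pelvis_only_goal(reference, pos_cols=slice(0, 3), yaw_col=3):
    """Reduce a full-body reference of shape (T, D) to pelvis guidance.

    Keeps pelvis xyz position and heading (yaw) per frame; every other
    channel (joint angles, end-effector targets, ...) is dropped, so the
    policy must reconstruct full-body motion from pelvis guidance alone.
    Column indices here are hypothetical.
    """
    positions = reference[:, pos_cols]               # (T, 3) pelvis xyz
    yaw = reference[:, yaw_col:yaw_col + 1]          # (T, 1) heading
    return np.concatenate([positions, yaw], axis=1)  # (T, 4) goal signal

# Example: a 100-frame reference with 30 channels per frame
reference = np.random.randn(100, 30)
goal = pelvis_only_goal(reference)
assert goal.shape == (100, 4)
```

Because the goal is so low-dimensional, the policy itself must supply plausible full-body coordination, which is exactly where training-data quality shows up.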
6.2. Ablation Studies / Parameter Analysis
The ablation study in Table 2 directly verifies the effectiveness of PhySINK's components.
The following are the results from Table 2 of the original paper:
| | Motion Fidelity (%) | Joint Feasibility (%) | Non-Floating (%) | Non-Penetration (%) | Non-Skating (%) |
| (a) G1 | |||||
| IK | 27.6 | 91.7 | 55.6 | 47.8 | 59.7 |
| SINK | 94.8 | 95.9 | 96.4 | 14.9 | 55.4 |
| + Joint Feasibility Loss | 94.9 | 100.0 | 96.4 | 14.8 | 55.6 |
| + Grounding Loss | 94.9 | 100.0 | 99.9 | 97.2 | 53.6 |
| + Skating Loss = PhySINK | 94.8 | 100.0 | 99.9 | 96.8 | 89.7 |
| (b) H1-2 | | | | | |
| IK | 36.3 | 80.9 | 57.7 | 45.2 | 56.1 |
| SINK | 93.9 | 15.3 | 42.2 | 81.4 | 47.9 |
| + Joint Feasibility Loss | 94.0 | 99.9 | 44.4 | 79.9 | 50.7 |
| + Grounding Loss | 93.9 | 99.9 | 99.8 | 98.1 | 49.3 |
| + Skating Loss = PhySINK | 93.9 | 99.9 | 99.7 | 97.7 | 87.7 |
Analysis of Table 2:

- IK vs. SINK: IK struggles with Motion Fidelity (27.6% for G1, 36.3% for H1-2), indicating that it fails to accurately reproduce human motion style. SINK dramatically improves Motion Fidelity (94.8% for G1, 93.9% for H1-2), but at the cost of physical plausibility: for G1, SINK has poor Non-Penetration (14.9%) and Non-Skating (55.4%), while for H1-2 it has very low Joint Feasibility (15.3%) and poor Non-Floating (42.2%). This confirms SINK's physically under-constrained nature.
- Impact of + Joint Feasibility Loss: Adding this loss term boosts Joint Feasibility to nearly 100% for both robots (from 95.9% to 100% for G1, and dramatically from 15.3% to 99.9% for H1-2), showing that it effectively enforces joint limits without degrading Motion Fidelity.
- Impact of + Grounding Loss: This term drastically improves the ground-contact metrics. For G1, Non-Floating increases to 99.9% and Non-Penetration to 97.2% (from 96.4% and 14.8%, respectively), with similarly large gains for H1-2, confirming its effectiveness in preventing feet from floating above or penetrating the ground.
- Impact of + Skating Loss: The final addition brings Non-Skating performance to nearly 90% (89.7% for G1, 87.7% for H1-2), a substantial improvement over the previous stage (e.g., 53.6% for G1). Crucially, this is achieved while maintaining high Motion Fidelity and the other physical metrics.
- Overall PhySINK Performance: The full PhySINK method (with all losses) preserves high Motion Fidelity while achieving strong results across all physical metrics: near-100% Joint Feasibility, Non-Floating, and Non-Penetration, and nearly 90% Non-Skating. This validates the design of PhySINK and the necessity of each added physical-constraint loss.
6.3. Success Rate Threshold Analysis
The paper performs an additional analysis to justify its choice of a stricter success rate threshold (0.15 m instead of the conventional 0.5 m).
The following are the results from Table 11 of the original paper:
| | | Success Threshold = 0.15 m | | | | | Success Threshold = 0.5 m | | | | |
| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
| (a) G1 | |||||||||||
| LaFAN1 | 2.4 | 46.1 | 66.1 | 36.2 | 24.0 | 42.5 | 74.8 | 87.8 | 69.2 | 47.1 | 72.6 |
| AMASS | 20.9 | 76.2 | 88.5 | 72.1 | 56.8 | 66.8 | 90.2 | 95.0 | 87.9 | 81.1 | 83.7 |
| Humanoid-X | 231.4 | 50.6 | 78.4 | 43.0 | 26.0 | 31.8 | 78.4 | 91.3 | 72.9 | 59.5 | 65.9 |
| PHUMA | 73.0 | 92.7 | 95.6 | 91.7 | 86.0 | 85.6 | 97.1 | 98.7 | 96.5 | 94.4 | 92.5 |
| (b) H1-2 | |||||||||||
| LaFAN1 | 2.4 | 62.0 | 79.3 | 54.7 | 26.6 | 58.9 | 70.8 | 92.4 | 66.7 | 56.4 | 68.2 |
| AMASS | 20.9 | 54.4 | 74.9 | 45.9 | 17.2 | 49.6 | 70.4 | 86.3 | 62.6 | 41.4 | 65.9 |
| Humanoid-X | 231.4 | 49.7 | 74.6 | 40.4 | 17.0 | 37.3 | 54.8 | 78.5 | 45.2 | 22.1 | 43.2 |
| PHUMA | 73.0 | 82.7 | 91.5 | 79.5 | 68.1 | 68.4 | 92.0 | 9.6 | 89.7 | 85.6 | 79.4 |
The following are the results from Table 12 of the original paper:
| | | Success Threshold = 0.15 m | | | | | Success Threshold = 0.5 m | | | | |
| Dataset | Hours | Total | Stationary | Angular | Vertical | Horizontal | Total | Stationary | Angular | Vertical | Horizontal |
| (a) G1 | |||||||||||
| LaFAN1 | 2.4 | 28.4 | 46.9 | 28.4 | 19.6 | 10.5 | 78.2 | 85.5 | 70.8 | 76.3 | 80.8 |
| AMASS | 20.9 | 70.2 | 90.7 | 75.0 | 62.7 | 44.1 | 92.3 | 99.2 | 92.1 | 82.1 | 88.0 |
| Humanoid-X | 231.4 | 39.1 | 78.0 | 39.6 | 23.0 | 6.5 | 84.1 | 98.3 | 79.9 | 76.0 | 76.2 |
| PHUMA | 73.0 | 82.9 | 96.7 | 88.0 | 71.8 | 67.1 | 93.7 | 100.0 | 96.8 | 85.9 | 84.7 |
| (b) H1-2 | |||||||||||
| LaFAN1 | 2.4 | 70.8 | 92.4 | 66.7 | 56.4 | 68.2 | 85.5 | 97.5 | 79.0 | 77.5 | 90.0 |
| AMASS | 20.9 | 64.3 | 87.3 | 59.7 | 46.0 | 63.9 | 80.4 | 93.3 | 69.9 | 72.8 | 89.0 |
| Humanoid-X | 231.4 | 60.5 | 88.3 | 60.0 | 48.7 | 39.7 | 68.7 | 93.3 | 65.1 | 60.2 | 50.5 |
| PHUMA | 73.0 | 78.6 | 97.5 | 76.8 | 74.5 | 63.8 | 89.9 | 99.2 | 89.4 | 84.6 | 83.9 |
Analysis of Tables 11 and 12:

- Masking True Differences: Under the loose 0.5 m threshold, the performance differences between datasets appear relatively modest. For example, on G1 (PHUMA Test), AMASS achieves 90.2% and PHUMA achieves 97.1%, and Humanoid-X (78.4%) also seems competitive. This can mask cases where the robot remains upright but performs the motion inaccurately (e.g., standing still instead of jumping).
- Revealing True Quality with a Stricter Threshold: Under the stricter 0.15 m threshold, the differences become substantially more pronounced and better reflect imitation quality. For G1 (PHUMA Test), PHUMA's lead becomes much clearer (92.7% vs. 76.2% for AMASS and 50.6% for Humanoid-X). This supports the authors' claim that the stricter threshold is a more meaningful measure of imitation quality, showing that PHUMA-trained policies achieve more precise motion tracking.
- Consistency Across Test Sets: The same pattern holds on the Unseen Video test set: PHUMA consistently delivers superior results, especially under the stricter threshold, confirming that it produces highly accurate and generalizable policies.
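The threshold comparison reduces to a simple rule: an episode counts as a success only if its tracking error stays below the threshold for the whole rollout. A minimal sketch with hypothetical error traces:

```python
import numpy as np

def success_rate(errors, threshold):
    """errors: (episodes, T) array of per-frame tracking error in meters.
    An episode succeeds only if its error never exceeds the threshold."""
    return float(np.mean(np.all(errors < threshold, axis=1)))

errors = np.array([[0.05, 0.10, 0.12],   # precise tracking
                   [0.05, 0.20, 0.40],   # upright, but drifting
                   [0.05, 0.60, 0.70]])  # large deviation

loose = success_rate(errors, 0.5)    # 2/3: the drifting episode still "succeeds"
strict = success_rate(errors, 0.15)  # 1/3: only precise tracking passes
```

The loose 0.5 m threshold credits the drifting episode as a success, which is exactly the masking effect the authors describe; the 0.15 m threshold separates precise tracking from merely staying upright.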
6.4. Data Presentation (Tables)
All tables from the original paper are transcribed in full above.
6.5. Overall Performance Visualization (Figure 2)
The following are the results from Figure 2 of the original paper:
Figure 2: Overview of datasets and performance. PHUMA is both large-scale and physically reliable, which translates into higher success rates in motion imitation and pelvis path following. (a) Feasible and infeasible human motion sources in each dataset. (b) Physical reliability, with AMASS retargeted using a standard learning-based inverse kinematics method. (c) Success rate on unseen motions. (d) Success rate in path-following. Results are reported on the Unitree G1 humanoid.
Analysis of Figure 2:

- (a) Feasible and Infeasible Motion Sources: Humanoid-X starts from a huge amount of data, but a significant portion is deemed infeasible; PHUMA curates this down to a smaller but higher-quality set, while AMASS contains less infeasible data but is much smaller overall.
- (b) Physical Reliability: This radar chart visualizes the quantitative improvements from PhySINK and PHUMA. PHUMA shows near-perfect scores for Joint Feasibility, Non-Floating, and Non-Penetration, and very high Non-Skating, significantly outperforming AMASS (even with SINK retargeting) across all physical metrics. Humanoid-X is not shown here but is known to have lower reliability.
- (c) Success Rate on Unseen Motions: PHUMA (dark blue bar) has the highest success rate, followed by AMASS (light blue) and then Humanoid-X (orange), summarizing the Table 4 findings on unseen motions and clearly showing PHUMA's superior imitation performance.
- (d) Success Rate in Path Following: PHUMA (dark blue) again leads AMASS (light blue) across all motion categories, with particularly strong gains for Vertical and Horizontal motions, reinforcing the Table 5 results.

Overall, Figure 2 serves as a concise visual summary, strongly supporting the paper's central claims about PHUMA's scale, physical reliability, and resulting performance gains.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces PHUMA, a Physically-grounded HUMAnoid locomotion dataset, addressing a critical bottleneck in humanoid motion imitation: the lack of large-scale, diverse, and physically reliable motion data. The authors effectively demonstrate that while large-scale video-derived datasets (like Humanoid-X) offer diversity, they are plagued by physical artifacts (floating, penetration, joint violations, skating). Conversely, high-quality motion capture datasets (AMASS, LaFAN1) are limited in scale and diversity.
PHUMA overcomes this dilemma through a two-pronged approach:

- Physics-Aware Motion Curation: A rigorous filtering process refines the raw motion data, removing problematic segments based on metrics such as root jerk, foot contact score, pelvis height, and center-of-mass stability.
- PhySINK Retargeting: A novel physics-constrained retargeting method that extends Shape-Adaptive Inverse Kinematics (SINK) with explicit joint feasibility, grounding, and anti-skating loss terms, ensuring that retargeted motions are not only kinematically faithful but also physically plausible for humanoid robots.

Experimental results consistently show that policies trained on PHUMA significantly outperform those trained on AMASS and Humanoid-X in both motion imitation of unseen videos and pelvis-guided path following, especially for dynamic vertical and horizontal motions. This validates the central thesis that progress in humanoid locomotion requires motion data with both scale and physical reliability.
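A minimal sketch of the chunk-and-filter idea behind the curation step, assuming per-frame root positions and pelvis heights are available. The metric names follow the summary above, but the formulas, window handling, and thresholds are invented for illustration; the paper's actual criteria (Table 7) also include foot contact and center-of-mass checks.

```python
import numpy as np

def root_jerk(root_pos, dt):
    # Jerk magnitude per frame via a third finite difference of root position.
    jerk = np.diff(root_pos, n=3, axis=0) / dt**3
    return np.linalg.norm(jerk, axis=1)

def keep_chunk(root_pos, pelvis_z, dt, jerk_max=500.0, z_min=0.3, z_max=1.5):
    """Accept a motion chunk only if every frame passes every physics check.

    Thresholds here are made-up illustrations, not the paper's values.
    """
    if len(root_pos) < 4:  # need at least 4 frames for a third difference
        return False
    jerk_ok = bool(np.all(root_jerk(root_pos, dt) < jerk_max))
    height_ok = bool(np.all((pelvis_z > z_min) & (pelvis_z < z_max)))
    return jerk_ok and height_ok

# A smooth walking chunk passes; a teleport-like glitch is filtered out.
t = np.linspace(0.0, 1.0, 50)
smooth = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)
glitchy = smooth.copy()
glitchy[25, 0] += 5.0        # sudden 5 m jump in one frame
pelvis_z = np.full(50, 0.9)  # plausible standing pelvis height
assert keep_chunk(smooth, pelvis_z, dt=1 / 49)
assert not keep_chunk(glitchy, pelvis_z, dt=1 / 49)
```

Filtering per chunk rather than per clip is what lets the pipeline salvage clean segments from otherwise noisy video-derived motion.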
7.2. Limitations & Future Work
The authors acknowledge two key areas for future work:

- Sim-to-Real Transfer: The current work trains policies in simulation. A crucial next step is to make PHUMA-trained policies produce physically reliable motions on real humanoid robots, which requires addressing the inevitable reality gap between simulation and the physical world.
- Vision-Based Control: The current policies rely on privileged state inputs (perfect knowledge of joint angles, velocities, etc.). Future work aims to transition to vision-based control, where video observations replace these state inputs, better aligning with real-world perception challenges and allowing robots to interpret human motion directly from visual cues.

While not explicitly stated as limitations by the authors, some further boundaries can be inferred:

- Computational Cost: The physics-aware curation and PhySINK retargeting are computationally intensive. Scaling the pipeline to even larger, more diverse raw video collections may introduce new challenges.
- Dependence on Video-to-Motion Models: PHUMA's quality still fundamentally depends on the initial video-to-motion models (e.g., TRAM). While curation helps, inherent inaccuracies or biases in these models could propagate into the dataset.
7.3. Personal Insights & Critique
This paper offers a highly valuable contribution to the field of humanoid robotics and motion imitation. My key insights are:

- The "Quality over Quantity" (or "Quality with Quantity") Principle: The paper powerfully illustrates that, for robotics, a large amount of data is not enough if that data is unphysical or unreliable. The significant underperformance of Humanoid-X despite its massive scale is a stark reminder. PHUMA's success lies in its disciplined approach to ensuring both scale and physical plausibility, a lesson that extends beyond motion imitation to other data-driven robotics tasks.
- Generalizability of Physics-Constrained Optimization: PhySINK is a robust and generalizable approach. Its loss terms (joint feasibility, grounding, skating) encode fundamental physical constraints that apply to virtually any physically embodied robot. The framework could potentially be adapted to other robot morphologies (e.g., quadrupeds with different foot-contact dynamics) or to tasks beyond locomotion where physical interaction is paramount.
- Importance of Data Curation: The detailed physics-aware motion curation pipeline (Section 3.1) is often overlooked in discussions of large datasets but is critical for high-quality outcomes. The strategic "chunk-and-filter" approach is smart, maximizing the usable segments recovered from noisy raw data.
- Impact on Low-Level Control: The improved performance in pelvis-only path following is particularly inspiring: higher-quality motion data enables more robust and expressive control even under simplified, abstract guidance. This is crucial for real-world applications where full-body state information may not be available or easy to specify.
- Potential for Real-World Deployment: The focus on reducing artifacts like floating, penetration, and skating makes the generated motions inherently safer and more stable for real-robot deployment, even before sim-to-real transfer is fully achieved. Joint-limit enforcement is especially critical for protecting robot hardware.
Potential Issues or Areas for Improvement:

- Human Subjectivity in Curation: While the curation process uses quantitative metrics (Table 7), defining their thresholds (e.g., for root jerk, or 0.6 for the foot contact score) involves some expert judgment or tuning. The robustness of these thresholds across extremely diverse raw video sources could be explored further.
- Computational Cost of PhySINK: The PhySINK optimization sums multiple loss terms over all frames and joints. While effective, this cost may become limiting for extremely long motion sequences or real-time retargeting in some contexts; an analysis of PhySINK's runtime performance would be beneficial.
- Complexity of the Reward Function: The MaskedMimic reward function (Table 9) is quite complex, with many terms and weights, and tuning these weights can be challenging. Future work could investigate how sensitive training is to these weights and whether simpler reward structures could achieve comparable results given PHUMA's high-quality data.
- Long-Term Physical Interactions: The current constraints focus on immediate physical plausibility (joint limits, ground contact, no skating). Complex manipulation or interaction tasks may require more advanced physical modeling, such as stability over longer horizons, contact forces, and object interaction.
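To make the reward-complexity point concrete, here is a schematic of the kind of multi-term tracking reward under discussion. The term names, weights, and scales are invented for illustration and are not the paper's actual MaskedMimic values.

```python
import math

def tracking_reward(errors, weights, scales):
    """Weighted sum of exponentiated tracking errors: each error is mapped
    through exp(-scale * error), so the reward approaches the sum of the
    weights as tracking becomes perfect. All names and numbers below are
    illustrative only."""
    return sum(w * math.exp(-scales[k] * errors[k])
               for k, w in weights.items())

errors = {"joint_pos": 0.10, "root_pos": 0.05, "root_rot": 0.20}
weights = {"joint_pos": 0.5, "root_pos": 0.3, "root_rot": 0.2}
scales = {"joint_pos": 10.0, "root_pos": 20.0, "root_rot": 5.0}

reward = tracking_reward(errors, weights, scales)
assert 0.0 < reward <= 1.0  # weights sum to 1, so the reward is bounded by 1
```

With many such terms, changing one scale reshapes the optimization landscape for every other term, which is why tuning these weights is nontrivial.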
Overall, PHUMA represents a significant step forward, solidifying the foundation for more capable and humanlike humanoid robots by proving that data quality, achieved through meticulous processing, is as important as data quantity.