ExBody2: Advanced Expressive Humanoid Whole-Body Control
TL;DR Summary
The paper presents ExBody2, an advanced control method enabling humanoid robots to perform expressive whole-body movements while maintaining stability. It employs a training approach based on human motion capture and simulations, and examines the trade-off between versatility across many motions and tracking performance on specific ones.
Abstract
This paper tackles the challenge of enabling real-world humanoid robots to perform expressive and dynamic whole-body motions while maintaining overall stability and robustness. We propose Advanced Expressive Whole-Body Control (Exbody2), a method for producing whole-body tracking controllers that are trained on both human motion capture and simulated data and then transferred to the real world. We introduce a technique for decoupling the velocity tracking of the entire body from tracking body landmarks. We use a teacher policy to produce intermediate data that better conforms to the robot's kinematics and to automatically filter away infeasible whole-body motions. This two-step approach enabled us to produce a student policy that can be deployed on the robot that can walk, crouch, and dance. We also provide insight into the trade-off between versatility and the tracking performance on specific motions. We observed significant improvement of tracking performance after fine-tuning on a small amount of data, at the expense of the others.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ExBody2: Advanced Expressive Humanoid Whole-Body Control
1.2. Authors
The paper is authored by Mazeyu Ji*, Xuanbin Peng*, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng†, and Xiaolong Wang†. Their affiliations are:
- UC San Diego (Mazeyu Ji, Xuanbin Peng, Jialong Li, Xuxin Cheng, Xiaolong Wang)
- UC Berkeley (Fangchen Liu)
- MIT (Ge Yang)
The asterisks (*) denote equal contribution, and the daggers (†) denote equal advising.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. While arXiv is not a peer-reviewed journal or conference, it is a highly influential platform for disseminating cutting-edge research quickly in fields like artificial intelligence, robotics, and physics. Papers published here are often submitted to top-tier conferences (e.g., NeurIPS, ICML, ICLR, RSS, ICRA) or journals later. The presence of authors from prestigious institutions like UC San Diego, UC Berkeley, and MIT suggests high-quality research, often aimed at prominent publication venues.
1.4. Publication Year
Published at (UTC): 2024-12-17T18:59:51.000Z. The publication year is 2024.
1.5. Abstract
This paper introduces Advanced Expressive Whole-Body Control (Exbody2), a novel method designed to enable real-world humanoid robots to execute expressive and dynamic full-body movements while ensuring stability and robustness. ExBody2 employs a two-step approach: first, it trains whole-body tracking controllers using both human motion capture data and simulated data. A key innovation is the decoupling of velocity tracking for the entire robot body from the tracking of specific body landmarks. The method utilizes a teacher policy to generate intermediate data that is kinematically feasible for the robot and to automatically filter out unachievable whole-body motions. This refined data then trains a student policy which can be deployed on physical robots, enabling them to perform actions such as walking, crouching, and dancing. The paper also explores the trade-off between the policy's versatility across various motions and its tracking performance on specific tasks, noting that fine-tuning on a small amount of targeted data significantly improves performance for those specific tasks, albeit at the expense of others.
1.6. Original Source Link
Official source and PDF link:
- Original Source Link: https://arxiv.org/abs/2412.13196
- PDF Link: http://arxiv.org/pdf/2412.13196v2

The publication status is a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the challenge of enabling humanoid robots to perform expressive, dynamic, and human-like whole-body motions while simultaneously maintaining stability and robustness in real-world environments. This problem is crucial because humanoid robots are envisioned to operate in human living spaces, requiring them to interact with their environment in a natural and versatile manner.
Existing challenges and gaps in prior research include:
- Dynamic and Kinematic Gap: A fundamental mismatch exists between biological human bodies and mechanical robots. Robots have different degrees of freedom (DoF), joint limits, and dynamic capabilities, making direct imitation of human motion capture data difficult.
- Trade-off between Expressiveness and Stability: Current control methods often struggle to achieve both high expressiveness (e.g., fluid dancing, complex gestures) and robust stability (e.g., maintaining balance, handling perturbations) simultaneously.
- Infeasible Motion Data: Human motion datasets often contain movements that are physically impossible or highly challenging for robots, leading to poor training and performance if used directly. Manual filtering is labor-intensive and prone to error, potentially reducing data diversity.
- Tracking Failures: Many previous whole-body tracking approaches, especially those relying on global keypoint tracking, suffer from cumulative errors and tracking failures when robots cannot perfectly align with desired global positions, limiting their application to highly stationary scenarios.

The paper's entry point is to address these gaps with a novel framework, ExBody2, which combines automated data curation, a generalist-specialist policy training pipeline, and a decoupled motion-velocity control strategy to bridge the sim-to-real gap and achieve both expressiveness and robustness.
2.2. Main Contributions / Findings
The primary contributions of Exbody2 are threefold:
- Generalist Policy with Automated Data Curation:
  - Contribution: ExBody2 introduces an automated method for curating human motion datasets. It uses a teacher policy to evaluate the feasibility of motions for the robot, focusing on lower-body stability while preserving upper-body diversity. This results in a Feasibility-Diversity Principle for dataset construction.
  - Finding: This automated curation process produces a robust generalist policy that significantly outperforms previous methods across diverse motions, both in simulation and real-world deployment, by balancing dataset feasibility and diversity. It learns broad, expressive behaviors without being hindered by impractical movements.
- Specialist Policy with Finetuning for Targeted Motions:
  - Contribution: Building upon the generalist policy, ExBody2 fine-tunes it for specific motion groups or tasks (e.g., dancing, specific locomotion patterns), leveraging the generalist's learned priors for efficient adaptation.
  - Finding: Fine-tuned specialist policies achieve even higher precision and fidelity for targeted behaviors than the generalist policy or policies trained from scratch, demonstrating the effectiveness of a pretrain-finetune paradigm for specialized tasks, improving robustness to disturbances, and enhancing real-world generalization.
- Decoupled Motion-Velocity Control Strategy:
  - Contribution: ExBody2 introduces a novel control strategy that decouples keypoint tracking from velocity control. It converts global keypoints into the robot's local frame and primarily uses velocity-based global tracking to guide movement, while key body tracking focuses on motion imitation.
  - Finding: This decoupled strategy, combined with a teacher-student framework (where the teacher policy uses privileged information in simulation and the student policy is distilled for real-world deployment), improves tracking robustness and stability. It prevents the cumulative errors often seen in global keypoint tracking, allowing expressive motion reproduction even with slight positional deviations.
Key Conclusions: The findings collectively demonstrate ExBody2's potential to bridge the gap between human-level expressiveness and reliable whole-body control in humanoid robots. The method achieves superior tracking accuracy, stability, and adaptability compared to state-of-the-art baselines.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the methodology and contributions of this paper, a beginner should understand the following foundational concepts:
- Humanoid Robots: Robots designed to mimic the human body's shape and movement capabilities, typically having a torso, head, two arms, and two legs. They operate with many degrees of freedom (DoF), the independent parameters that define the configuration of a mechanical system. For example, each joint (like a knee or elbow) has one or more DoF.
- Whole-Body Control (WBC): A control strategy that coordinates all the joints and limbs of a robot simultaneously to achieve a desired task while respecting physical constraints (e.g., balance, joint limits, contact forces). It is a complex problem due to the high DoF and non-linear dynamics.
- Motion Capture (MoCap) Data: Digital recordings of human body movements. Sensors are placed on a human actor, and their positions and orientations are tracked in 3D space. This data provides a rich source of human-like motion for robots to imitate. The CMU MoCap dataset [1] is a common benchmark.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, aiming to maximize its cumulative reward over time.
  - Agent: The entity that makes decisions (e.g., the robot's controller).
  - Environment: The system the agent interacts with (e.g., the simulated or real world).
  - State: A complete description of the environment at a given time.
  - Action: A decision made by the agent that changes the environment's state.
  - Reward: A scalar feedback signal indicating the desirability of an action.
  - Policy (π): A function that maps states to actions, defining the agent's behavior.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by states, actions, transition probabilities between states, and rewards. This is the underlying mathematical model for many RL problems.
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm that is sample-efficient and robust. It is an actor-critic method, meaning it learns both a policy (actor) and a value function (critic) that estimates the expected future reward. PPO aims to take the largest possible step towards a new policy without collapsing performance, often by clipping the policy ratio.
- Sim-to-Real Transfer: The process of training an RL policy in a simulated environment and then deploying it on a physical robot. This is desirable because training in simulation is safer, faster, and cheaper. However, the sim-to-real gap refers to the discrepancies between simulation and reality (e.g., imperfect physics models, sensor noise, latency) that can cause a policy trained in simulation to perform poorly in the real world. Techniques like domain randomization (varying simulation parameters) and privileged information are used to mitigate this gap.
- Teacher-Student Framework (Knowledge Distillation): A learning paradigm where a complex, high-performing model (the teacher) is used to train a simpler, more efficient model (the student). In robotics RL, a teacher policy often has access to privileged information (ground truth data not available in the real world) in simulation, making it easier to train. The student policy then learns to imitate the teacher's actions using only real-world observable information.
- DAgger (Dataset Aggregation): An imitation learning algorithm used in the teacher-student framework. It iteratively collects data by rolling out the student policy in the environment, asks the teacher policy (oracle) to label the correct actions for the collected states, and retrains the student on this aggregated dataset. This helps the student learn to correct its own mistakes and perform well in states it encounters during its own execution.
- Proportional-Derivative (PD) Controllers: A type of feedback control loop commonly used in robotics. A PD controller calculates an output (e.g., motor torque) based on the current error (difference between desired and actual state) and the rate of change of the error. It aims to move the system towards the desired state and dampen oscillations. In this paper, the action space is the target joint positions for PD controllers (see the sketch after this list).
- Keypoints: Specific points on the body (e.g., hands, feet, head, hips) used for tracking motion. Global keypoints refer to their positions in a fixed world coordinate system, while local keypoints are relative to the robot's own coordinate frame.
- Morphology: The shape and structure of a robot or organism. Retargeting human motion data to a robot's morphology involves adapting the human poses to fit the robot's specific joint structure and dimensions.
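To make the PD-controller action space concrete, here is a minimal Python sketch of converting target joint positions produced by a policy into motor torques. The gain values and DoF count are illustrative assumptions of this sketch, not the robot's actual parameters.

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp, kd):
    """Convert target joint positions (the policy's action) into motor torques.

    q_target : desired joint positions output by the policy
    q, q_dot : measured joint positions and velocities
    kp, kd   : proportional and derivative gains (per joint)
    """
    # Proportional term drives the joint toward the target;
    # derivative term damps the motion (target joint velocity is zero).
    return kp * (q_target - q) - kd * q_dot

# Example with hypothetical gains and an illustrative DoF count.
num_dof = 23
kp = np.full(num_dof, 100.0)        # stiffness (illustrative values)
kd = np.full(num_dof, 2.0)          # damping   (illustrative values)
q = np.zeros(num_dof)
q_dot = np.zeros(num_dof)
action = 0.1 * np.ones(num_dof)     # target joint positions from the policy
tau = pd_torque(action, q, q_dot, kp, kd)
```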
3.2. Previous Works
The paper contextualizes its contributions by referencing several prior works in humanoid whole-body control and motion imitation:
- Traditional Dynamics Modeling and Control [40, 62, 22, 41, 8, 25, 61, 27, 20, 6, 7, 9, 42, 49]: These methods rely on precise mathematical models of the robot's physics and dynamics. They require accurate system identification and intensive online computation to handle perturbations and generate stable locomotion. While effective for well-defined tasks, they can be rigid and lack the adaptability needed for diverse, expressive motions.
- RL-based Whole-Body Control [32, 33, 53, 10, 34, 35, 48, 23, 24, 47, 55, 52]: More recent approaches leverage RL to learn complex skills in simulation, which are then transferred to real robots. These methods often use task-specific rewards and environment randomization.
- Motion Imitation for Expressive Control (e.g., ExBody [3, 4], H2O [18], OmniH2O [17], HumanPlus [13]):
  - ExBody [3, 4]: This paper's predecessor. It uses a one-stage RL training pipeline and primarily tracks upper-body movements, focusing on partial body tracking. It does not explicitly follow lower-body step patterns.
  - H2O [18] and OmniH2O [17]: These methods focus on human-to-humanoid teleoperation and learning. They rely on global tracking of keypoint positions. A key limitation, as highlighted by ExBody2, is that this global tracking strategy often leads to tracking failures due to cumulative errors when the robot struggles to align with current global keypoints, limiting real-world application to highly stationary scenarios. OmniH2O also uses a similar observation space.
  - Limitations of existing motion imitation: These methods often rely on manually filtered motion data, which is labor-intensive and may still contain infeasible motions or lack diversity, limiting the robot's full hardware potential.
- Physics-based Character Motion Imitation in Simulation [44, 45, 56, 16, 37, 36, 63, 60, 57]: These works focus on generating realistic character animations in simulation, a related but distinct problem from real-world robot control, since they do not face the same hardware constraints or sim-to-real challenges.
3.3. Technological Evolution
The field of humanoid robot control has evolved from:
- Early Model-Based Control (1970s-1990s): Focused on inverse kinematics (IK) and inverse dynamics (ID) to calculate joint trajectories and torques based on desired end-effector positions or whole-body postures. Examples include early Honda humanoid robots [20] and approaches like the 3D Linear Inverted Pendulum Mode (LIPM) [25] for stable bipedal walking. These methods excel at stability and precision for pre-programmed tasks but struggle with adaptability and expressiveness for diverse human-like motions.
- Optimization-Based Whole-Body Control (2000s-2010s): Advanced WBC frameworks that optimize for multiple objectives (e.g., task achievement, balance, joint limits) simultaneously, often formulated as quadratic programs. This allowed for more complex tasks and better handling of constraints but still relied heavily on accurate models and online computation.
- Reinforcement Learning for Locomotion (2010s-Present): The rise of deep RL, especially PPO, coupled with powerful physics simulators (IsaacGym [39]), enabled training policies for complex, robust locomotion (e.g., quadrupedal robots [29, 28], bipedal robots [32, 33]). This allowed learning behaviors that were difficult to program manually, including dynamic motions and handling perturbations.
- RL for Expressive Motion Imitation (Recent Years): Integrating human motion capture data into RL training shifted the focus from stable locomotion alone to expressive whole-body movements, such as dancing or gesturing. Early works like ExBody, H2O, and OmniH2O demonstrated this potential but faced challenges with data feasibility, tracking robustness, and sim-to-real transfer.

ExBody2 sits at the cutting edge of this evolution, building on RL-based motion imitation but significantly improving its robustness, expressiveness, and real-world applicability by addressing key limitations of prior methods.
3.4. Differentiation Analysis
ExBody2 differentiates itself from previous methods, particularly ExBody, H2O, and OmniH2O, through three key innovations:
- Automated Data Curation (vs. Manual Filtering/Unfiltered Data):
  - Previous: ExBody uses language labels for filtering (which can be ambiguous), while others [18, 17] use SMPL avatars to simulate motions, which can still exceed real robot capabilities. Manual filtering is prone to human error and can reduce diversity.
  - ExBody2 Innovation: Introduces an automated data curation method based on a Feasibility-Diversity Principle. It trains an initial policy, evaluates its lower-body tracking error for each motion, and filters the dataset based on this error to remove infeasible motions. This automatically balances diversity (especially in upper-body movements) with feasibility (in the lower body), leading to a more effective training dataset and a more robust generalist policy.
- Generalist-Specialist Training Pipeline (vs. Single Policy):
  - Previous: Prior methods typically train a single policy, which may struggle to achieve both broad adaptability and high precision for specific, complex tasks.
  - ExBody2 Innovation: Employs a two-step approach. First, a generalist policy is trained on the curated diverse dataset. Second, this generalist policy is fine-tuned to create specialist policies for targeted motion groups (e.g., dancing, kung fu).
  - Differentiation: This allows ExBody2 to achieve high adaptability across a wide range of motions with the generalist, and then even higher fidelity and precision for specific tasks with the specialists, efficiently leveraging the generalist's learned priors.
- Decoupled Motion-Velocity Control Strategy (vs. Global Keypoint Tracking):
  - Previous: Approaches like H2O [18] and OmniH2O [17] predominantly rely on global tracking of keypoint positions. This often leads to cumulative errors and tracking failures because robots struggle to perfectly align with current global keypoints, especially in dynamic scenarios.
  - ExBody2 Innovation: Converts global keypoints to the robot's local frame and decouples keypoint tracking from velocity control. It uses velocity-based global tracking to guide overall movement and local key body tracking for expressive motion imitation. It also incorporates a teacher-student framework in which the teacher uses privileged information (such as ground-truth root velocity) to achieve high performance in simulation, and the student learns to infer this information from historical observations for real-world deployment.
  - Differentiation: This strategy enhances tracking robustness by preventing cumulative errors, allowing the robot to maintain stable movement while still achieving expressive whole-body imitation, even with slight positional deviations.
4. Methodology
4.1. Principles
The core idea behind ExBody2's methodology is to enable expressive and robust whole-body control for humanoid robots by meticulously preparing the training data and structuring the learning process. The key principles are:
- Feasibility-Diversity Principle: This principle guides the creation of an optimal training dataset. It states that the dataset must be diverse enough (especially in upper-body movements) for the robot to learn a broad range of expressive motions, but feasible enough in its lower-body motions to avoid movements that exceed the robot's physical limits or stability envelope. This ensures that the policy learns from achievable actions, preventing training noise and instability.
- Generalist-Specialist Learning: The approach first learns a generalist policy capable of broad motion coverage from a curated dataset. This generalist then serves as the foundation for fine-tuning specialist policies for specific, high-precision tasks. This leverages transfer learning, providing a warm start and enhanced robustness for specialized behaviors.
- Teacher-Student Distillation for Sim-to-Real Transfer: A teacher policy is trained in a high-fidelity simulation with access to privileged information (ground truth data not available in the real world) to achieve optimal performance. This knowledge is then distilled into a student policy that learns to perform the same task using only real-world observable information (e.g., historical observations), making it deployable on the physical robot.
- Decoupled Motion-Velocity Control: Instead of relying solely on tracking global keypoint positions (which can lead to cumulative errors), ExBody2 separates overall robot velocity control from precise local keypoint tracking. This allows for stable global movement while still achieving expressive local pose imitation.
4.2. Core Methodology In-depth (Layer by Layer)
ExBody2 adopts a sim-to-real framework structured around a Generalist-Specialist training pipeline and a teacher-student distillation process, as illustrated in Figure 2 and Figure 3.
As shown in Figure 2, the overall workflow begins with human motion data, which is first retargeted to the robot's specific morphology. This data then feeds into the Data-driven Generalist-specialist Training Pipeline. Within this pipeline, an automated data curation strategy is applied to train a generalist policy. Subsequently, this generalist policy can be fine-tuned to produce specialist policies. Finally, these policies are deployed onto real humanoid robots.
The following figure (Figure 2 from the original paper) shows the overall framework of ExBody2:

Fig. 2: Overall framework of ExBody2. Human motions are retargeted to the robot, the dataset is automatically filtered, a generalist policy is trained and then fine-tuned into specialist policies (for walking, dancing, and other skills), and the resulting controllers are deployed on a real humanoid robot, demonstrating expressive, dynamic, and stable whole-body motions in real-world environments.
4.2.1. Data-driven Generalist-specialist Training Pipeline
This pipeline is designed to balance adaptability and precision in whole-body motion tracking. It is guided by the Feasibility-Diversity Principle.
4.2.1.1. Feasibility-Diversity Principle
This principle dictates the design of the training dataset. It requires:
- Diversity: Sufficient motion diversity, particularly in the upper body, to cover a broad distribution of tasks and expressive movements.
- Feasibility: Maintaining feasibility in the lower body to avoid unachievable or overly dynamic motions that could destabilize training. This primarily involves filtering out extreme lower-body samples while retaining a wide range of upper-body actions.
4.2.1.2. Generalist Policy with Automated Data Curation
The goal is to train a generalist policy that performs well across diverse motion inputs. This is achieved through an automated process:
- Initial Policy Training: An initial base policy is first trained on a comprehensive, unfiltered motion dataset D. This dataset is typically very diverse but may contain many motions that are infeasible for the robot.
- Tracking Error Evaluation: After training the base policy, its tracking accuracy is evaluated for each motion sequence s in D. A tracking error metric e(s) is computed, focusing specifically on the lower body, because the lower body is central to dynamic feasibility and balance. The error metric is defined as:
  $ e(s) = \alpha E_{\mathrm{key}}(s) + \beta E_{\mathrm{dof}}(s) $
  Where:
  - $E_{\mathrm{key}}(s)$: The mean keybody position error for the lower body. This term helps prevent extreme deviations such as flipping or rolling, which indicate severe instability.
  - $E_{\mathrm{dof}}(s)$: The mean joint-angle tracking error for the lower body. This term ensures precise joint-level imitation.
  - $\alpha, \beta$: Coefficients weighting the relative importance of keybody position error and joint-angle error for lower-body stability and precision. In the ablation studies, a heavier weight is assigned to the joint-angle term.
- Motion Ranking and Distribution: Once e(s) is computed for all sequences, the motions are ranked by their tracking errors and the empirical distribution P(e) is derived.
- Optimal Threshold Determination: The objective is to find an error threshold τ such that the subset of sequences with error at most τ, denoted $\mathcal{D}_{\tau}$, enables training a new policy $\pi_{\tau}$ that maximizes performance across the full original dataset D. Formally, the search is for:
  $ \tau^* = \arg\max_{\tau} \; \mathbb{E}_{s \in \mathcal{D}} \left[ \mathrm{Performance}(\pi_{\tau}, s) \right] $
  Where:
  - $\tau^*$: The optimal error threshold.
  - $\mathbb{E}_{s \in \mathcal{D}} [\mathrm{Performance}(\pi_{\tau}, s)]$: The expected performance of policy $\pi_{\tau}$ (trained on $\mathcal{D}_{\tau}$) when evaluated on the full dataset D.
  - $\pi_{\tau}$: The policy trained on the filtered dataset $\mathcal{D}_{\tau}$.

In practice, P(e) is divided into evenly spaced error intervals, and a greedy search is used to identify $\tau^*$. The paper notes that optimal performance is consistently achieved at a moderate $\tau^*$, balancing diversity and feasibility (see the sketch after this list).
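To illustrate the curation step, the following Python sketch scores each clip with the combined lower-body error e(s) and grid-searches candidate thresholds over the error distribution. The function name, the `evaluate` callback, and the number of candidate thresholds are assumptions of this sketch, not the authors' implementation; in the real pipeline the evaluation step retrains a policy on each candidate subset, which is the expensive part.

```python
import numpy as np

def curate_dataset(errors, alpha, beta, evaluate, num_bins=10):
    """Filter motion clips by the base policy's lower-body tracking error.

    errors   : dict {clip_id: (E_key, E_dof)} of lower-body keypoint and
               joint-angle errors measured with the initial policy.
    alpha, beta : weights of the two error terms in e(s).
    evaluate : callable(list_of_clip_ids) -> float; hypothetical hook that
               retrains a policy on the subset and scores it on the full set.
    """
    # Combined error metric e(s) = alpha * E_key(s) + beta * E_dof(s).
    e = {cid: alpha * ek + beta * ed for cid, (ek, ed) in errors.items()}
    sorted_ids = sorted(e, key=e.get)

    # Candidate thresholds at evenly spaced percentiles of the empirical CDF.
    candidates = [e[sorted_ids[int(f * (len(sorted_ids) - 1))]]
                  for f in np.linspace(1.0 / num_bins, 1.0, num_bins)]

    best_tau, best_score = None, -np.inf
    for tau in candidates:
        subset = [cid for cid in sorted_ids if e[cid] <= tau]
        score = evaluate(subset)   # performance of pi_tau on the full dataset
        if score > best_score:
            best_tau, best_score = tau, score
    return [cid for cid in sorted_ids if e[cid] <= best_tau]
```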
The following figure (Figure 8 from the original paper) illustrates the empirical cumulative distribution function (CDF) for error metric e(s), which guides threshold selection:

Fig. 8: Empirical CDF of the base policy's error metric e(s) on the entire dataset. The horizontal axis indicates the percentile of motion sequences, from lowest to highest error, while the vertical axis shows e(s). Dashed horizontal lines mark key thresholds, illustrating how feasible and infeasible motions are determined systematically from the empirical distribution.
4.2.1.3. Specialist Policies with Finetuning
After obtaining the generalist policy, which balances diversity and feasibility, it is further refined into specialist policies for specific, high-precision tasks. This fine-tuning process offers several advantages over training from scratch (a minimal warm-start sketch follows this list):
- Efficiency: Specialist policies track a smaller set of motions, and the generalist policy provides a warm start, leveraging learned priors.
- Robustness: The specialist policy inherits adaptability and robustness from the generalist's exposure to a wider range of motion sequences, improving real-world generalization.
- Reduced Training Time: Fine-tuning an already well-trained model is computationally less demanding than training from scratch.
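As a minimal illustration of the warm-start idea, the sketch below copies a trained generalist network and continues optimizing it on a task-specific subset. The network shape, optimizer, and learning rate are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import copy
import torch
import torch.nn as nn

def make_policy(obs_dim=128, act_dim=23):
    # Hypothetical MLP policy; dimensions are placeholders.
    return nn.Sequential(nn.Linear(obs_dim, 512), nn.ELU(),
                         nn.Linear(512, 512), nn.ELU(),
                         nn.Linear(512, act_dim))

generalist = make_policy()
# ... train `generalist` on the curated, diverse dataset ...

# Warm start: copy the generalist, then continue training only on the
# task-specific motion subset (e.g., dance clips), typically with a
# smaller learning rate than the original training run.
specialist = copy.deepcopy(generalist)
optimizer = torch.optim.Adam(specialist.parameters(), lr=1e-4)
```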
4.2.2. Policy Objective and Architecture
ExBody2 uses a two-stage teacher-student training procedure for sim-to-real transfer, similar to [29, 28]. All policies are trained using IsaacGym [39] for efficient parallel simulation.
The following figure (Figure 3 from the original paper) depicts the teacher-student framework:

Fig. 3: Teacher-student framework for humanoid motion learning, where the teacher uses privileged information, and the student learns from past observations to generate control actions.
4.2.2.1. Teacher Policy Training
The humanoid motion control problem is formulated as a Markov Decision Process (MDP).
- State Space: The state space for the teacher policy consists of three components:
  - Privileged information: Ground-truth states and environmental properties only observable in simulators.
  - Proprioceptive states: The robot's internal sensor readings (e.g., joint positions, velocities).
  - Motion tracking target: The desired motion for the robot to imitate.
  The teacher policy takes all three components as input.
- Action Space: The teacher policy outputs an action for the Unitree G1 robot, which represents the target joint positions for Proportional-Derivative (PD) controllers. These PD controllers then compute the torques applied to the robot's motors.
- RL Algorithm: The Proximal Policy Optimization (PPO) algorithm [51] is used to train the teacher policy. PPO maximizes the expected accumulated future reward, encouraging robust behavior and accurate tracking of demonstrations:
  $ \mathbb{E}_{\hat{\pi}} \left[ \sum_{t=0}^{T} \gamma^t \mathcal{R}(s_t, \hat{a}_t) \right] $
  Where:
  - $\mathbb{E}_{\hat{\pi}}[\cdot]$: Expected value under the teacher policy $\hat{\pi}$.
  - $\sum_{t=0}^{T} \gamma^t \mathcal{R}(s_t, \hat{a}_t)$: Sum of discounted future rewards.
  - $\gamma$: Discount factor, weighting immediate rewards more heavily than future ones.
  - $\mathcal{R}(s_t, \hat{a}_t)$: Reward function, providing feedback for state $s_t$ and action $\hat{a}_t$.
- Privileged Information: This information is crucial for efficient teacher policy training in simulation and includes:
  - Ground-truth root velocity (linear and angular).
  - Real body link positions (accurate positions of all robot parts).
  - Physical properties of the environment and robot (e.g., friction coefficients, motor strength, mass properties).
  This information helps the teacher policy learn quickly and effectively by reducing observation noise and providing direct access to critical dynamics.
- Motion Tracking Target: This specifies the desired human motion to be imitated. It comprises two main parts:
  - Desired joints and 3D keypoints for both the upper and lower body.
  - Target root velocity and root pose (position and orientation).
  This allows the policy to learn to accurately track whole-body motions while also remaining controllable by external commands (e.g., joystick commands for linear velocity and body pose).
- Reward Design: The reward function is carefully designed to promote both tracking accuracy and robot stability. It primarily consists of:
  - Tracking Rewards: Encouraging accurate tracking of the root's velocity, direction, and orientation, and precise tracking of keypoints and joint positions.
  - Regularization Terms: Penalizing undesirable behaviors to boost stability and improve sim-to-real transfer (e.g., joint-limit violations, high accelerations, foot slippage).
  The main elements of the tracking reward are detailed in Table I from the paper:
| Term | Expression | Weight |
| --- | --- | --- |
| Expression Goal G_e: DoF Position | exp(−0.7·\|q_ref − q\|) | 3.0 |
| Expression Goal G_e: Keypoint Position | exp(−\|p_ref − p\|) | 2.0 |
| Root Movement Goal G_m: Linear Velocity | exp(−4.0·\|v_ref − v\|) | 6.0 |
| Root Movement Goal G_m: Velocity Direction | exp(−4.0·cos(v_ref, v)) | 6.0 |
| Root Movement Goal G_m: Roll & Pitch | exp(−\|θ_ref − θ\|) | 1.0 |
| Root Movement Goal G_m: Yaw | exp(−\|Δy\|) | 1.0 |
Where:
- DoF Position:
  - q_ref: Reference joint position from the motion target.
  - q: Actual joint position of the robot.
  - |·|: Absolute difference.
  - Weight: 3.0
  - Purpose: Rewards the robot for keeping its joint positions close to the reference motion.
- Keypoint Position:
  - p_ref: Reference keypoint position from the motion target.
  - p: Actual keypoint position of the robot.
  - Weight: 2.0
  - Purpose: Rewards the robot for keeping its keypoint positions close to the reference motion.
- Linear Velocity:
  - v_ref: Reference linear velocity of the robot's root (base).
  - v: Actual linear velocity of the robot's root.
  - Weight: 6.0
  - Purpose: Rewards accurate tracking of the reference linear velocity.
- Velocity Direction:
  - cos(v_ref, v): Cosine similarity between the reference and actual root velocity vectors, indicating how well their directions align.
  - Weight: 6.0
  - Purpose: Rewards the robot for moving in the desired direction.
- Roll & Pitch:
  - θ: Roll or pitch angle of the robot's root (applied to both roll and pitch).
  - Weight: 1.0 each
  - Purpose: Penalizes deviations from the desired (e.g., upright) roll and pitch orientation, contributing to stability.
- Yaw:
  - Δy: Difference in yaw (heading) angle from the reference.
  - Weight: 1.0
  - Purpose: Rewards the robot for maintaining the desired yaw orientation.
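The sketch below assembles the tracking terms of Table I into a single per-step reward, following the weights listed above. The velocity-direction term is written as exp(−4·(1 − cos)) so that aligned directions are rewarded; this is an interpretation of the table's compressed notation, not the authors' exact expression.

```python
import numpy as np

def tracking_reward(q, q_ref, p, p_ref, v, v_ref, rpy, rpy_ref, yaw_err):
    """Per-step tracking reward assembled from the terms in Table I.

    q, q_ref     : joint positions (rad) and their references
    p, p_ref     : keypoint positions (m) and references, shape (K, 3)
    v, v_ref     : root linear velocity and its reference, shape (3,)
    rpy, rpy_ref : root roll/pitch angles and references, shape (2,)
    yaw_err      : yaw tracking error (rad)
    This is a sketch of the structure, not the authors' code.
    """
    r_dof = 3.0 * np.exp(-0.7 * np.abs(q - q_ref).mean())
    r_key = 2.0 * np.exp(-np.linalg.norm(p - p_ref, axis=-1).mean())
    r_vel = 6.0 * np.exp(-4.0 * np.linalg.norm(v - v_ref))
    cos_sim = np.dot(v, v_ref) / (np.linalg.norm(v) * np.linalg.norm(v_ref) + 1e-8)
    r_dir = 6.0 * np.exp(-4.0 * (1.0 - cos_sim))   # reward aligned velocity direction
    r_rp = 1.0 * np.exp(-np.abs(rpy - rpy_ref).sum())
    r_yaw = 1.0 * np.exp(-np.abs(yaw_err))
    return r_dof + r_key + r_vel + r_dir + r_rp + r_yaw
```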
4.2.2.2. Student Policy Training
In this stage, the student policy is trained to be deployable in the real world.
- Removal of Privileged Information: The privileged information (available to the teacher) is removed from the student's observations.
- Historical Observations: The student policy uses a longer history of observations to infer the information that was privileged for the teacher. Its input includes proprioceptive states and the motion tracking goal, from which it predicts the action $a_t$.
- DAgger-style Distillation: The student policy is supervised with the teacher's oracle actions using a Mean Squared Error (MSE) loss:
  $ l = \lVert a_t - \hat{a}_t \rVert^2 $
  Where:
  - $a_t$: Action predicted by the student policy.
  - $\hat{a}_t$: Oracle action produced by the teacher policy.
  - $\lVert \cdot \rVert^2$: Squared Euclidean norm, representing the squared difference between the student's and teacher's actions.
  The training process uses a DAgger [50]-style strategy: the student policy is rolled out in simulation, and for each visited state the teacher policy computes the oracle action as a supervision signal. The student is then refined by iteratively minimizing the loss on this accumulated data, and training continues until the loss converges. A minimal sketch of this distillation step follows.
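Below is a minimal PyTorch sketch of the distillation update: the student regresses onto the teacher's oracle actions with the MSE loss above. The dimensions, network, and the synthetic batch standing in for rolled-out states are placeholders; the actual simulation rollout and data-aggregation loop are omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, history = 64, 23, 10          # illustrative dimensions
student = nn.Sequential(nn.Linear(obs_dim * (history + 1), 512), nn.ELU(),
                        nn.Linear(512, act_dim))
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

def distill_step(student_obs_hist, teacher_actions):
    """One supervised update on aggregated (state, oracle action) pairs."""
    pred = student(student_obs_hist)               # a_t from the student
    loss = ((pred - teacher_actions) ** 2).mean()  # l = ||a_t - a_hat_t||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Placeholder batch standing in for states visited by the student rollout,
# labeled with actions from the frozen teacher policy.
batch_obs = torch.randn(256, obs_dim * (history + 1))
batch_teacher_act = torch.randn(256, act_dim)
distill_step(batch_obs, batch_teacher_act)
```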
4.2.2.3. Motion-velocity Decoupled Control Strategy
This strategy is crucial for robust tracking.
- Limitations of Global Keypoint Tracking: Previous methods like H2O and OmniH2O learn to follow the trajectory of global keypoints. This can lead to suboptimal or failed tracking because global keypoints may drift over time, causing cumulative errors and hindering learning.
- ExBody2's Approach:
  - Local Coordinate Frame: ExBody2 converts global keypoints into the robot's current coordinate frame, so keypoint tracking is relative to the robot itself and issues from global drift are reduced (a minimal transformation sketch follows this list).
  - Decoupling: Keypoint tracking (for expressive pose imitation) is decoupled from velocity control (for guiding overall movement). Velocity-based global tracking guides the robot's global movement, while key body tracking focuses on precise motion imitation.
  - Robustness Enhancement: During the training stage, a small global drift of keypoints is allowed, and they are periodically corrected to the robot's current coordinate frame. This helps the robot learn to follow challenging keypoint motions without being overly constrained by perfect alignment at every instant.
  - Deployment Strategy: During real-world deployment, ExBody2 strictly employs local keypoint tracking with this motion-velocity decoupled control. By prioritizing overall velocity and local pose, tracking can be completed with maximal expressiveness even if slight positional deviations arise.
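The sketch below shows one way to express reference keypoints in the robot's current root frame, which is the core of the local tracking described above. Using a yaw-only rotation of the root heading is an assumption of this sketch; the paper does not specify the exact transform.

```python
import numpy as np

def keypoints_to_local_frame(keypoints_world, root_pos, root_yaw):
    """Express reference keypoints in the robot's current root frame.

    keypoints_world : (K, 3) reference keypoint positions in the world frame
    root_pos        : (3,) robot root position in the world frame
    root_yaw        : robot heading (rad); roll/pitch are ignored here
    """
    c, s = np.cos(root_yaw), np.sin(root_yaw)
    # Rotate by -yaw so "forward" becomes the robot's own x-axis.
    world_to_local = np.array([[ c,   s, 0.0],
                               [-s,   c, 0.0],
                               [0.0, 0.0, 1.0]])
    return (keypoints_world - root_pos) @ world_to_local.T
```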
5. Experimental Setup
5.1. Datasets
The experiments primarily utilize variations and subsets of a classic motion capture dataset, along with manually curated ones for specific evaluations.
- CMU Dataset [1]: The primary dataset used, referred to as D_CMU, is the full Carnegie Mellon University motion capture repository, comprising 1,919 sequences.
  - Characteristics: It is highly diverse, including a wide variety of action types, but it also contains extreme movements (e.g., push-ups, rolling on the ground, somersaults) that are beyond the robot's physical capabilities.
  - Purpose: Serves as the base for automated data curation and for the evaluation of generalist policies.
- Curated Subsets for Ablation Studies: To investigate the Feasibility-Diversity Principle, the following subsets of the CMU dataset were manually designed:
  - D_50 (50-action dataset): A minimal set containing only fundamental and mostly static actions (e.g., standing, simple walking).
    - Characteristics: Highly feasible but lacking diversity in both upper- and lower-limb motions.
    - Purpose: To evaluate the effect of overly simple datasets on generalization.
  - D_250 (250-action dataset): A moderate-sized set extending D_50 with additional upper-limb variations (e.g., arm gestures, some dance moves) and moderately dynamic lower-body actions (e.g., running, mild jumps).
    - Characteristics: Crucially, it avoids highly extreme motions that are difficult for the robot.
    - Purpose: To find the optimal balance between feasibility and diversity.
- ACCAD Dataset (D_ACCAD):
  - Characteristics: An out-of-distribution (OOD) dataset, meaning it contains motion patterns not present in the CMU training data.
  - Purpose: To evaluate the generalization capability of learned policies to previously unseen motion patterns.
- Task-Specific Datasets for Finetuning: For evaluating specialist policies, the following manually curated datasets were used:
  - D_easy, D_moderate, D_hard: A series of datasets with increasing difficulty levels, categorized by motion dynamics.
    - Characteristics: D_easy contains static or low-movement motions, while D_hard includes more dynamic and high-momentum movements.
    - Purpose: To assess how well policies generalize to increasingly complex motions and to demonstrate the advantage of fine-tuning.
  - Dance subset: A specific set of motions for fine-tuning a specialist policy for dance movements, exemplified by the Cha-Cha dance.
    - Characteristics: Involves dynamic lower-body movements combined with expressive upper-body gestures.
    - Purpose: To showcase the high precision of specialist policies for specific expressive tasks.

Data Sample: The paper does not provide concrete examples of individual data samples (e.g., a specific keypoint trajectory or a frame from a motion clip). However, the dataset descriptions (e.g., "standing, simple walking" for D_50; "push-ups, rolling on the ground" for D_CMU) offer a conceptual understanding of the data's form: sequences of human poses (joint angles, keypoint positions) over time. These human poses are retargeted to the Unitree G1 robot's morphology before training.
5.2. Evaluation Metrics
The policy's performance is evaluated using several quantitative metrics, calculated across all motion sequences in an evaluation dataset.
- Mean Linear Velocity Error ($E_{\mathrm{vel}}$)
  - Conceptual Definition: Quantifies the average absolute difference between the robot's root linear velocity and the reference linear velocity from the demonstration. It indicates how well the robot tracks the speed and direction of global movement.
  - Mathematical Formula: $ E_{\mathrm{vel}} = \frac{1}{N \cdot T} \sum_{i=1}^N \sum_{t=1}^T \lVert v_{i,t}^{\mathrm{robot}} - v_{i,t}^{\mathrm{ref}} \rVert $
  - Symbol Explanation:
    - $N$: Total number of motion sequences.
    - $T$: Number of time steps (frames) in each sequence.
    - $v_{i,t}^{\mathrm{robot}}$: The linear velocity vector of the robot's root at time $t$ in sequence $i$.
    - $v_{i,t}^{\mathrm{ref}}$: The reference linear velocity vector of the root from the demonstration at time $t$ in sequence $i$.
    - $\lVert \cdot \rVert$: Euclidean norm (magnitude of the vector).
- Mean Per Keybody Position Error (MPKPE, $E_{\mathrm{mpkpe}}$)
  - Conceptual Definition: Measures the average positional error of specific key body points (landmarks) on the robot relative to the reference motion. It reflects the overall accuracy of tracking key body parts.
  - Mathematical Formula: $ E_{\mathrm{mpkpe}} = \frac{1}{N \cdot T \cdot K} \sum_{i=1}^N \sum_{t=1}^T \sum_{k=1}^K \lVert p_{i,t,k}^{\mathrm{robot}} - p_{i,t,k}^{\mathrm{ref}} \rVert $
  - Symbol Explanation:
    - $N$: Total number of motion sequences.
    - $T$: Number of time steps (frames) in each sequence.
    - $K$: Total number of key body points tracked.
    - $p_{i,t,k}^{\mathrm{robot}}$: The 3D position vector of robot keypoint $k$ at time $t$ in sequence $i$.
    - $p_{i,t,k}^{\mathrm{ref}}$: The 3D position vector of reference keypoint $k$ at time $t$ in sequence $i$.
    - $\lVert \cdot \rVert$: Euclidean norm.
  - Variants: The paper also reports MPKPE separately for the upper body and the lower body to provide a more granular analysis of tracking performance in different body regions.
- Mean Per Joint Position Error (MPJPE, $E_{\mathrm{mpjpe}}$)
  - Conceptual Definition: Quantifies the average angular error (in radians) between the robot's joint angles and the reference joint angles from the demonstration. It indicates the precision of joint-level pose tracking.
  - Mathematical Formula: $ E_{\mathrm{mpjpe}} = \frac{1}{N \cdot T \cdot J} \sum_{i=1}^N \sum_{t=1}^T \sum_{j=1}^J |q_{i,t,j}^{\mathrm{robot}} - q_{i,t,j}^{\mathrm{ref}}| $
  - Symbol Explanation:
    - $N$: Total number of motion sequences.
    - $T$: Number of time steps (frames) in each sequence.
    - $J$: Total number of joints.
    - $q_{i,t,j}^{\mathrm{robot}}$: The angular position (in radians) of robot joint $j$ at time $t$ in sequence $i$.
    - $q_{i,t,j}^{\mathrm{ref}}$: The reference angular position (in radians) of joint $j$ at time $t$ in sequence $i$.
    - $|\cdot|$: Absolute difference.
  - Variants: As with MPKPE, the paper reports MPJPE separately for the upper body and the lower body. In the result tables below, these appear as upper- and lower-body columns alongside the overall $E_{\mathrm{mpjpe}}$ and $E_{\mathrm{mpkpe}}$.
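For concreteness, here is a small NumPy sketch computing the three metrics for a single sequence; averaging over the N sequences is left to the caller. The array shapes are assumptions for illustration.

```python
import numpy as np

def eval_metrics(v_robot, v_ref, p_robot, p_ref, q_robot, q_ref):
    """Compute the three tracking metrics for one motion sequence.

    v_* : (T, 3) root linear velocities
    p_* : (T, K, 3) keypoint positions
    q_* : (T, J) joint angles in radians
    """
    e_vel = np.linalg.norm(v_robot - v_ref, axis=-1).mean()    # E_vel
    e_mpkpe = np.linalg.norm(p_robot - p_ref, axis=-1).mean()  # E_mpkpe
    e_mpjpe = np.abs(q_robot - q_ref).mean()                   # E_mpjpe
    return e_vel, e_mpkpe, e_mpjpe
```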
5.3. Baselines
ExBody2 is evaluated against four state-of-the-art baselines using the Unitree G1 robot platform:
- Exbody [4]:
  - Description: The predecessor to ExBody2. It uses a one-stage RL training pipeline. Its core design tracks only the upper-body movements from human data, while tracking the root motion of the lower body without explicitly following step patterns.
  - Distinguishing Feature: Focuses on partial body tracking (upper body) and does not use a teacher-student structure. It uses a shorter history length (5) and relies entirely on local keypoints for tracking.
- Exbody†:
  - Description: A whole-body control version of the original Exbody, extended to track full-body movements based on human data.
  - Distinguishing Feature: Attempts comprehensive human motion imitation across the entire body posture while keeping most other aspects of the original Exbody's design (e.g., one-stage RL, no teacher-student).
- OmniH2O* [17]:
  - Description: The authors' reproduction of the OmniH2O method. It uses global keypoint tracking and an observation space consistent with the original paper.
  - Distinguishing Feature: The primary difference from ExBody2 is its reliance on global keypoint tracking and its not using the robot's velocity as privileged information during training. For a fair comparison in evaluation, OmniH2O* is adapted to use local keypoints during testing, but its training method remains true to the original.
- Exbody2-w/o-Filter (ExBody2 without automated data curation): ExBody2 without its core innovation of automated data curation. It serves as an ablation baseline to demonstrate the impact of the filtering strategy.
The choice of these baselines is representative because they encompass different tracking methods (partial vs. whole-body, global vs. local keypoints), training strategies (one-stage RL vs. teacher-student), and data handling (unfiltered vs. partially filtered).
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the effectiveness of ExBody2's innovations across various evaluation scenarios, both in simulation and real-world deployment.
6.1.1. Generalist Policy Performance
The performance of the ExBody2 generalist policy is initially compared against state-of-the-art baselines in simulation (Table II) and then validated in real-world settings (Table III).
The following are the results from Table II of the original paper:
| Method | $E_{\mathrm{vel}}$ ↓ | $E_{\mathrm{mpkpe}}$ ↓ | Upper $E_{\mathrm{mpkpe}}$ ↓ | Lower $E_{\mathrm{mpkpe}}$ ↓ | $E_{\mathrm{mpjpe}}$ ↓ | Upper $E_{\mathrm{mpjpe}}$ ↓ | Lower $E_{\mathrm{mpjpe}}$ ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Exbody | 0.4700 | 0.1339 | 0.1249 | 0.1428 | 0.2020 | 0.1343 | 0.2952 |
| Exbody† | 0.4195 | 0.1150 | 0.1106 | 0.1198 | 0.1496 | 0.1416 | 0.1607 |
| OmniH2O* | 0.3725 | 0.1253 | 0.1266 | 0.1240 | 0.1681 | 0.1564 | 0.1843 |
| Exbody2-w/o-Filter | 0.2787 | 0.1133 | 0.1087 | 0.1182 | 0.1355 | 0.1192 | 0.1579 |
| Exbody2 (Ours) | 0.2930 | 0.1000 | 0.0960 | 0.1040 | 0.1079 | 0.0953 | 0.1253 |
Analysis of Simulation Results (Table II):
- Superiority of ExBody2: ExBody2 (Ours) achieves the lowest errors on almost all metrics: $E_{\mathrm{mpkpe}}$ (0.1000), upper-body keypoint error (0.0960), lower-body keypoint error (0.1040), $E_{\mathrm{mpjpe}}$ (0.1079), upper-body joint error (0.0953), and lower-body joint error (0.1253). This indicates significantly better whole-body (both upper and lower) and joint-level tracking compared to all baselines.
- Impact of Filtering: Comparing ExBody2-w/o-Filter with ExBody2 (Ours) reveals the substantial benefit of the automated data curation. Filtering improves $E_{\mathrm{mpkpe}}$ (from 0.1133 to 0.1000), $E_{\mathrm{mpjpe}}$ (from 0.1355 to 0.1079), and especially the lower-body joint error (from 0.1579 to 0.1253). This highlights that removing infeasible lower-body motions leads to greater global stability and more precise upper-body control.
- Velocity Trade-off: ExBody2 (Ours) shows a slight increase in $E_{\mathrm{vel}}$ (0.2930) compared to ExBody2-w/o-Filter (0.2787). The paper attributes this to the full dataset containing broader velocity patterns, which, while enabling diverse dynamics, also introduce noise. This suggests a minor trade-off in which improved stability and precision (across keypoints and joints) are prioritized over a marginal increase in velocity tracking error.
- Baseline Performance: Exbody (partial body tracking) performs worst, as expected. Exbody† (full-body Exbody) improves over Exbody but is still significantly worse than ExBody2. OmniH2O*, despite using global keypoint tracking, also falls short of ExBody2, underscoring the benefits of ExBody2's decoupled control and data curation.

The following are the results from Table III of the original paper:
| Method | $E_{\mathrm{mpjpe}}$ ↓ | (unlabeled) | Lower $E_{\mathrm{mpjpe}}$ ↓ |
| --- | --- | --- | --- |
| Exbody | 0.2178 | 0.1223 | 0.3239 |
| Exbody† | 0.1465 | 0.1314 | 0.1672 |
| OmniH2O* | 0.1396 | 0.1273 | 0.1533 |
| Exbody2-w/o-Filter | 0.1361 | 0.1254 | 0.1481 |
| Exbody2 (Ours) | 0.1074 | 0.1092 | 0.1054 |
Analysis of Real-World Results (Table III):
- Consistency with Simulation: The real-world experiments, conducted on the Unitree G1 robot using a representative subset of the CMU dataset, closely mirror the simulation results. ExBody2 (Ours) consistently achieves the lowest errors for $E_{\mathrm{mpjpe}}$ (0.1074) and the lower-body joint error (0.1054). The unlabeled column (0.1092) is likely the corresponding upper-body joint error.
- Confirmation of Data Curation: The substantial performance gain of ExBody2 (Ours) over ExBody2-w/o-Filter (e.g., $E_{\mathrm{mpjpe}}$ from 0.1361 to 0.1074, lower-body joint error from 0.1481 to 0.1054) is critical, particularly in real-world environments with unpredictable disturbances. This validates that automated data curation contributes significantly to robust, consistent behavior and enables high-precision tracking.
- Overall Generalist Efficacy: The generalist policy of ExBody2 demonstrates stable and effective tracking in dynamic real-world environments, showing significant improvements in full-body and velocity tracking accuracy.
6.1.2. Impact of Automatic Data Curation
This section evaluates how the selection criteria for dataset construction affect the learning of a generalist policy, directly validating the Feasibility-Diversity Principle.
The following figure (Figure 4 from the original paper) shows the impact of dataset filtering thresholds on policy tracking errors:

Fig. 4: Impact of dataset filtering thresholds on policy tracking errors. The figure shows tracking error trends across different dataset filtering thresholds. Policies trained on datasets whose filtering threshold balances diversity and stability achieve the lowest tracking errors. The base policy exhibits suboptimal performance due to unfiltered data, while overly restrictive and overly lenient thresholds show reduced effectiveness. The error metric is computed with a heavier weight assigned to the joint-angle term.
Analysis of Filtering Thresholds (Figure 4):
- The figure plots tracking error trends against different dataset filtering thresholds.
- Moderate Thresholds Are Optimal: Policies trained on datasets filtered at a moderate threshold achieve the lowest tracking errors. This confirms that a balance between diversity and stability is crucial: such a dataset retains sufficient variability for generalization while excluding excessively difficult or unstable motions.
- Low Thresholds Give Poor Generalization: Overly restrictive thresholds result in policies trained on overly simple motions. While stable, these policies exhibit poor generalization to the full dataset due to a lack of diversity.
- High Thresholds and the Base Policy Give Inconsistent Behavior: Overly lenient thresholds, or no filtering at all (the base policy trained on unfiltered data), lead to the inclusion of highly dynamic and unstable motions. This noise degrades training effectiveness, resulting in inconsistent policy behavior and reduced tracking accuracy.

This visual analysis strongly validates the importance of carefully balancing feasibility and diversity in dataset selection.
The following are the results from Table X of the original paper:
| Training Dataset | In dist. | $E_{\mathrm{vel}}$ ↓ | $E_{\mathrm{mpkpe}}$ ↓ | Upper $E_{\mathrm{mpkpe}}$ ↓ | Lower $E_{\mathrm{mpkpe}}$ ↓ | $E_{\mathrm{mpjpe}}$ ↓ | Upper $E_{\mathrm{mpjpe}}$ ↓ | Lower $E_{\mathrm{mpjpe}}$ ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Eval. on D_50 | | | | | | | | |
| D_50 | ✓ | 0.1375 | 0.0627 | 0.0571 | 0.0682 | 0.0753 | 0.0626 | 0.0928 |
| D_250 | ✓ | 0.1454 | 0.0669 | 0.0600 | 0.0738 | 0.0870 | 0.0689 | 0.1119 |
| D_CMU | ✓ | 0.1543 | 0.0767 | 0.0649 | 0.0885 | 0.1099 | 0.0854 | 0.1437 |
| (b) Eval. on D_CMU | | | | | | | | |
| D_50 | ✗ | 0.3509 | 0.1076 | 0.1074 | 0.1076 | 0.1338 | 0.1285 | 0.1410 |
| D_250 | ✗ | 0.2834 | 0.1048 | 0.1021 | 0.1073 | 0.1148 | 0.1012 | 0.1335 |
| D_CMU | ✓ | 0.2622 | 0.1071 | 0.1036 | 0.1110 | 0.1291 | 0.1129 | 0.1512 |
| (c) Eval. on D_ACCAD | | | | | | | | |
| D_50 | ✗ | 0.4226 | 0.1277 | 0.1210 | 0.1330 | 0.1720 | 0.1618 | 0.1861 |
| D_250 | ✗ | 0.3533 | 0.1234 | 0.1141 | 0.1315 | 0.1421 | 0.1223 | 0.1692 |
| D_CMU | ✗ | 0.3452 | 0.1267 | 0.1146 | 0.1381 | 0.1780 | 0.1635 | 0.1979 |
Analysis of Dataset Ablation Study (Table X):
- Evaluation on D_50 (Part a):
  - The policy trained on D_50 achieves the best performance (lowest errors) when evaluated on D_50 itself (e.g., $E_{\mathrm{mpkpe}}$ 0.0627, $E_{\mathrm{mpjpe}}$ 0.0753). This is expected, as it is an in-distribution evaluation on simple motions.
  - Policies trained on the larger, more diverse datasets (D_250, D_CMU) show slightly higher errors on D_50, suggesting that additional complexity in the training data does not necessarily improve performance on very simple tasks. The D_CMU-trained policy performs notably worse, indicating that noise from infeasible motions can degrade performance even on simple tasks when the dataset is not filtered.
- Evaluation on D_CMU (Part b):
  - The policy trained on D_250 (a subset of CMU with moderate diversity and feasibility) achieves the best performance on the full D_CMU dataset across the key pose metrics (e.g., $E_{\mathrm{mpkpe}}$ 0.1048, $E_{\mathrm{mpjpe}}$ 0.1148).
  - The policy trained on the full, unfiltered D_CMU performs worse than the D_250-trained policy on these metrics, even though it is evaluated on its own training distribution. This crucial finding reinforces that noisy, infeasible motions in a dataset degrade policy performance, as the agent wastes effort on unachievable goals instead of focusing on learnable actions.
  - The D_50-trained policy performs poorly on the broader D_CMU dataset, demonstrating its limited generalization due to insufficient diversity.
- Evaluation on D_ACCAD (Part c):
  - The policy trained on D_250 again outperforms the others on this out-of-distribution (OOD) dataset (e.g., $E_{\mathrm{mpkpe}}$ 0.1234, $E_{\mathrm{mpjpe}}$ 0.1421). This highlights that a clean and balanced dataset (like D_250) leads to better generalization to unseen motion patterns.
  - The D_CMU-trained policy achieves slightly better $E_{\mathrm{vel}}$ (0.3452 vs. 0.3533) but is worse on the other metrics, suggesting that its higher velocity diversity is captured at the cost of pose accuracy.
  - The D_50-trained policy shows substantial tracking errors, confirming its inability to handle diverse or novel data.

Conclusion for Automatic Data Curation: These results conclusively validate the Feasibility-Diversity Principle. A small dataset lacks generalization, while a large, unfiltered dataset introduces detrimental noise. The optimally curated dataset (D_250 in this study, corresponding to the moderate threshold in Figure 4) provides the best balance, leading to robust and expressive whole-body control.
6.1.3. Specialist Policy Finetuning
This section evaluates the effectiveness of the pretrain-finetune paradigm using generalist and specialist policies.
The following are the results from Table IV of the original paper:
| Method | $E_{\mathrm{vel}}$ ↓ | $E_{\mathrm{mpkpe}}$ ↓ | Upper $E_{\mathrm{mpkpe}}$ ↓ | Lower $E_{\mathrm{mpkpe}}$ ↓ | $E_{\mathrm{mpjpe}}$ ↓ | Upper $E_{\mathrm{mpjpe}}$ ↓ | Lower $E_{\mathrm{mpjpe}}$ ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (b) D_easy | | | | | | | |
| Specialist | 0.0828 | 0.0561 | 0.0564 | 0.0558 | 0.0772 | 0.0647 | 0.0944 |
| Scratch | 0.0853 | 0.0608 | 0.0623 | 0.0592 | 0.0843 | 0.0711 | 0.1024 |
| Generalist | 0.0986 | 0.0699 | 0.0708 | 0.0690 | 0.1041 | 0.0882 | 0.1259 |
| (a) D_moderate | | | | | | | |
| Specialist | 0.0991 | 0.0571 | 0.0582 | 0.0559 | 0.0760 | 0.0636 | 0.0930 |
| Scratch | 0.1188 | 0.0676 | 0.0688 | 0.0663 | 0.0924 | 0.0794 | 0.1103 |
| Generalist | 0.1217 | 0.0741 | 0.0727 | 0.0755 | 0.1092 | 0.0914 | 0.1337 |
| (c) D_hard | | | | | | | |
| Specialist | 0.1712 | 0.0827 | 0.0829 | 0.0826 | 0.1047 | 0.0911 | 0.1234 |
| Scratch | 0.1631 | 0.0886 | 0.0898 | 0.0873 | 0.1188 | 0.1067 | 0.1354 |
| Generalist | 0.1452 | 0.0890 | 0.0867 | 0.0912 | 0.1181 | 0.1011 | 0.1414 |
| (d) D_ACCAD | | | | | | | |
| Specialist | 0.4021 | 0.1149 | 0.1079 | 0.1215 | 0.1402 | 0.1290 | 0.1557 |
| Scratch | 0.4153 | 0.1246 | 0.1154 | 0.1332 | 0.1609 | 0.1490 | 0.1771 |
| Generalist | 0.3361 | 0.1268 | 0.1156 | 0.1391 | 0.1716 | 0.1532 | 0.1967 |
Analysis of Finetuning (Table IV):
-
Performance on , , :
- The
Specialistpolicy (fine-tuned from the generalist) consistently achieves the best performance (lowest errors) across all difficulty levels forEmpkpe,Enie,Eloee,Empipe,Eipe, andElowe. This highlights the precision gain from fine-tuning. - The advantage of
SpecialistoverScratch(trained from scratch with matched total iterations) becomes more pronounced with increasing difficulty. For example, on , Specialist has significantly lowerEmpkpe(0.0827 vs. 0.0886) andEmpipe(0.1047 vs. 0.1188). This shows the importance of using a pretrained policy as a foundation for complex tasks. - The
Generalistpolicy, while providing broad coverage, shows higher errors than bothSpecialistandScratchon these specific datasets, confirming that it's designed for versatility, not task-specific precision. - An interesting observation: For and , the
Generalistpolicy exhibits slightly bettervelocity tracking (Evel)than theSpecialistpolicy. This is likely because the generalist is exposed to a broader range of dynamic movements, giving it a slight edge inglobal velocity trackingfor challenging scenarios, even if its pose tracking is less precise.
- The
-
Performance on (Out-of-Distribution):
-
The
Specialistpolicy significantlyoutperforms both Generalist and Scratchon this OOD dataset (e.g.,Empkpe0.1149 vs. 0.1268 (Generalist) vs. 0.1246 (Scratch)). This is a strong indicator of thesuperior generalizability and adaptabilityachieved by fine-tuning from a robust generalist, rather than training from scratch or relying solely on a broad generalist. It demonstrates that the specialist retains the adaptability from the generalist while gaining task-specific robustness.The following figure (Figure 5 from the original paper) visually illustrates the performance of different policies for the Cha-Cha dance:
Fig. 5: A sequence of a robot performing the Cha-Cha dance. From top to bottom: the reference motion represented by an avatar, our algorithm's performance in simulation, and its performance on the real robot. The bottom three rows show the per-frame errors (whole-body joint DoF error, upper-body joint DoF error, and lower-body joint DoF error), with the blue curve representing the Exbody2-Specialist policy fine-tuned on the Cha-Cha motions, orange the Exbody2-Scratch policy trained from scratch on the same motions, and green the Exbody2-Generalist policy trained on the filtered dataset.
Visual Analysis of Cha-Cha Dance (Figure 5):

- The visual comparison of the robot performing the Cha-Cha dance in simulation and in the real world against the reference motion (avatar) qualitatively shows high fidelity.
- The per-frame error plots at the bottom (whole-body, upper-body, and lower-body joint DoF error) quantitatively confirm the superior performance of the Exbody2-Specialist policy (blue curve). It consistently maintains significantly lower tracking errors than Exbody2-Scratch (orange) and Exbody2-Generalist (green). This is particularly evident for the dynamic movements of the Cha-Cha, which involve complex coordination of both the upper and lower body.

Conclusion for Specialist Finetuning: The pretrain-finetune paradigm is highly effective. The pretrained generalist policy provides a strong foundation and robustness, while subsequent fine-tuning allows for task-specific specialization and significantly higher precision, especially in challenging and OOD scenarios. A minimal sketch of the fine-tuning step follows below.
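As a rough illustration of that step, the sketch below initializes a specialist from the generalist's weights and continues training with a reduced learning rate. The network sizes, dimensions, and the use of a generic supervised update in place of the paper's RL fine-tuning are all assumptions made so the snippet runs stand-alone with PyTorch.

```python
# Minimal sketch of the generalist -> specialist fine-tuning step
# (hypothetical setup; the paper fine-tunes with RL on a narrowed motion set,
# approximated here by a generic gradient update on stand-in data).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 96, 23  # placeholder dimensions for a G1-like robot


def make_policy() -> nn.Module:
    return nn.Sequential(
        nn.Linear(OBS_DIM, 256), nn.ELU(),
        nn.Linear(256, 256), nn.ELU(),
        nn.Linear(256, ACT_DIM),
    )


generalist = make_policy()
# generalist.load_state_dict(torch.load("generalist.pt"))  # hypothetical checkpoint

# The specialist starts from the generalist's weights rather than from scratch.
specialist = make_policy()
specialist.load_state_dict(generalist.state_dict())

# Fine-tune with a small learning rate so the broad prior is preserved
# while the policy adapts to the target motion subset (e.g., Cha-Cha clips).
optimizer = torch.optim.Adam(specialist.parameters(), lr=1e-5)

obs = torch.randn(64, OBS_DIM)             # stand-in for rollout observations
target_actions = torch.randn(64, ACT_DIM)  # stand-in for improved actions
loss = nn.functional.mse_loss(specialist(obs), target_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"fine-tuning surrogate loss: {loss.item():.4f}")
```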
6.2. Ablation Studies / Parameter Analysis
The paper conducts ablation studies to verify the effectiveness of key components in ExBody2: history length for the student policy and teacher-student (DAgger) distillation.
The following are the results from Table XI of the original paper:
| Method | E_vel ↓ | E_mpkpe ↓ | E_mpkpe (upper) ↓ | E_mpkpe (lower) ↓ | E_mpjpe ↓ | E_mpjpe (upper) ↓ | E_mpjpe (lower) ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **(a) History Length Ablation** | | | | | | | |
| Exbody2-History10 (Ours) | 0.2930 | 0.1000 | 0.0960 | 0.1040 | 0.1079 | 0.0953 | 0.1253 |
| Exbody2-History0 | 0.4151 | 0.1047 | 0.1010 | 0.1081 | 0.1190 | 0.0986 | 0.1303 |
| Exbody2-History25 | 0.2950 | 0.1032 | 0.0984 | 0.1078 | 0.1128 | 0.0965 | 0.1351 |
| Exbody2-History50 | 0.2648 | 0.1004 | 0.0956 | 0.1051 | 0.1114 | 0.0967 | 0.1317 |
| Exbody2-History100 | 0.3242 | 0.1063 | 0.1001 | 0.1122 | 0.1225 | 0.1050 | 0.1466 |
| **(b) DAgger Ablation** | | | | | | | |
| Exbody2 (Ours) | 0.2930 | 0.1000 | 0.0960 | 0.1040 | 0.1079 | 0.0953 | 0.1253 |
| Exbody2-w/o-DAgger | 0.4195 | 0.1150 | 0.1106 | 0.1198 | 0.1496 | 0.1416 | 0.1607 |
Analysis of History Length Ablation (Table XI (a)):

- The student policy uses historical observations to compensate for the lack of privileged information. Exbody2-History0 (no extra history) performs significantly worse, especially in E_vel (0.4151) and E_mpjpe (0.1190), highlighting that historical context is crucial for the student to infer the missing state information.
- Among the variants with history, Exbody2-History10 (Ours) yields the best overall performance, with the lowest E_mpkpe (0.1000), E_mpjpe (0.1079), and lower-body joint error (0.1253).
- Increasing the history length beyond 10 (e.g., 25, 50, 100) generally does not improve performance and can sometimes degrade it (e.g., History100 has higher errors than History10 on most metrics). The paper suggests that longer histories increase the difficulty of fitting the privileged information, reducing tracking performance, so an intermediate amount of historical context works best. A minimal sketch of such an observation stack follows below.
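The sketch below stacks the last H proprioceptive observations into the student's input; the dimensions and the zero-padding convention are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the student's observation history (hypothetical shapes).
# The student has no privileged state, so it stacks the last H proprioceptive
# observations; the ablation varies H (0, 10, 25, 50, 100).
from collections import deque
import numpy as np


class HistoryBuffer:
    def __init__(self, history_len: int, obs_dim: int):
        self.history_len = history_len
        self.obs_dim = obs_dim
        self.buf = deque(maxlen=max(history_len, 1))

    def reset(self):
        self.buf.clear()

    def push(self, obs: np.ndarray) -> np.ndarray:
        """Append the newest observation and return the stacked policy input."""
        self.buf.append(obs)
        if self.history_len == 0:
            return obs
        # Pad with zeros until the buffer is full (e.g., right after a reset).
        pad = [np.zeros(self.obs_dim)] * (self.history_len - len(self.buf))
        return np.concatenate(list(self.buf) + pad)


if __name__ == "__main__":
    hb = HistoryBuffer(history_len=10, obs_dim=48)
    for t in range(3):
        policy_input = hb.push(np.random.randn(48))
    print(policy_input.shape)  # (480,) = 10 x 48
```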
Analysis of DAgger Ablation (Table XI (b)):

- This compares Exbody2 (Ours), which uses DAgger distillation, against Exbody2-w/o-DAgger.
- Removing DAgger-style distillation (Exbody2-w/o-DAgger) severely degrades performance across all metrics: E_vel increases from 0.2930 to 0.4195, and E_mpjpe jumps from 0.1079 to 0.1496.
- This strong negative impact confirms that DAgger is critical. Without its iterative data collection and teacher supervision, the student policy struggles to learn robust velocity tracking directly from raw observations, making it difficult to follow fast or dynamic motions accurately. The teacher's privileged velocity guidance is effectively transferred through DAgger; a minimal sketch of the distillation loop follows below.
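The sketch below captures the control flow of DAgger-style distillation: the student's own rollouts are labeled by the privileged teacher and then regressed onto. The environment and the linear policies are toy stand-ins, and only the loop structure mirrors the procedure described above.

```python
# Minimal sketch of DAgger-style teacher-student distillation (toy stand-ins).
import numpy as np


class DummyEnv:
    """Stand-in environment exposing proprio obs and privileged state."""
    def __init__(self, obs_dim=48, priv_dim=16, act_dim=23):
        self.obs_dim, self.priv_dim, self.act_dim = obs_dim, priv_dim, act_dim

    def reset(self):
        return np.zeros(self.obs_dim), np.zeros(self.priv_dim)

    def step(self, action):
        return np.random.randn(self.obs_dim), np.random.randn(self.priv_dim)


class LinearPolicy:
    """Tiny linear policy used as both teacher and student stand-ins."""
    def __init__(self, in_dim, act_dim):
        self.W = np.zeros((in_dim, act_dim))

    def act(self, *inputs):
        return np.concatenate(inputs) @ self.W

    def supervised_update(self, obs, target_actions, lr=1e-2):
        pred = obs @ self.W
        grad = obs.T @ (pred - target_actions) / len(obs)
        self.W -= lr * grad


def dagger_distill(env, student, teacher, iters=5, rollout_len=64):
    obs, priv = env.reset()
    for _ in range(iters):
        obs_buf, act_buf = [], []
        for _ in range(rollout_len):
            action = student.act(obs)               # the student drives the rollout
            obs_buf.append(obs)
            act_buf.append(teacher.act(obs, priv))  # the privileged teacher labels each state
            obs, priv = env.step(action)
        student.supervised_update(np.array(obs_buf), np.array(act_buf))
    return student


if __name__ == "__main__":
    env = DummyEnv()
    teacher = LinearPolicy(env.obs_dim + env.priv_dim, env.act_dim)
    student = LinearPolicy(env.obs_dim, env.act_dim)
    dagger_distill(env, student, teacher)
```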
6.3. Data Presentation (Tables)
The tables presented in the paper have been transcribed and integrated into the analysis above. Specifically:
- Table II: Comparisons with baselines on the evaluation dataset for the Unitree G1 (simulation).
- Table III: Comparisons with baselines on selected motions for the Unitree G1 in the real world.
- Table IV: Evaluation of fine-tuned policies on the D_Easy, D_Moderate, D_Hard, and D_Cha-Cha datasets.
- Table X: Dataset ablation study, evaluating policies trained on differently curated datasets against various evaluation sets.
- Table XI: Self-ablation study on history length and DAgger distillation.
6.4. Visual Demonstrations
The paper includes several visual demonstrations:
- Figure 1 and 6 showcase the expressive and dynamic whole-body motions performed by the robot (Unitree G1), including walking, crouching, dancing, and interactions with objects.
- Figure 5 provides a detailed frame-by-frame comparison of the Cha-Cha dance, illustrating the reference motion, simulation, and real-robot performance, along with per-frame error plots for the different policies.
- Figure 7 illustrates how ExBody2 successfully replicates various motions (clapping fists, greeting, punching, crouching, defensive pose) in both simulation and real-world settings, emphasizing the retention of high fidelity to the target motion, including the lower-body poses critical for balance.
These visual results qualitatively support the quantitative findings, demonstrating that ExBody2 enables realistic and robust expressive humanoid control.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces ExBody2, an Advanced Expressive Whole-Body Control framework for humanoid robots, which significantly advances the field of human-like motion imitation. The core contributions include:
- Automated Data Curation: A novel method based on a Feasibility-Diversity Principle that intelligently filters human motion datasets. It ensures that the robot learns from kinematically feasible motions (especially for lower-body stability) while retaining broad expressiveness (particularly in upper-body movements), leading to a robust generalist policy.
- Generalist-Specialist Training Pipeline: A two-stage learning approach in which a generalist policy, trained on the curated diverse data, serves as the foundation for fine-tuning specialist policies. These specialists achieve higher precision on targeted, complex tasks while leveraging the generalist's learned priors.
- Decoupled Motion-Velocity Control Strategy: A robust control scheme that separates global velocity tracking from local keypoint tracking by converting global references into the robot's local frame (a minimal sketch of this frame conversion is given below). Combined with a teacher-student framework (privileged information in simulation, DAgger distillation for real-world deployment), this enhances tracking robustness and stability against cumulative errors.

Experimental results in both simulation and real-world deployment on the Unitree G1 robot demonstrate that ExBody2 consistently outperforms prior methods across various metrics. The automated data curation proves critical for robust performance, and the pretrain-finetune paradigm significantly boosts precision for specialized motions as well as generalization to out-of-distribution tasks.
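The frame conversion mentioned in the third contribution can be pictured with the sketch below, assuming a simple yaw-aligned, root-relative convention; the exact transformation used in the paper may differ.

```python
# Minimal sketch of expressing global keypoint targets in the robot's local
# frame (hypothetical convention: subtract the root position and remove the
# root yaw), one way to avoid accumulating global drift in keypoint tracking.
import numpy as np


def yaw_rotation(yaw: float) -> np.ndarray:
    """Rotation matrix mapping local-frame vectors to world frame (yaw only)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def to_local_frame(keypoints_w: np.ndarray, root_pos_w: np.ndarray,
                   root_yaw: float) -> np.ndarray:
    """Map (K, 3) world-frame keypoints into the root's yaw-aligned frame."""
    # Row-vector form: world_row @ R is equivalent to R^T applied to columns.
    return (keypoints_w - root_pos_w) @ yaw_rotation(root_yaw)


if __name__ == "__main__":
    kps = np.array([[1.0, 0.0, 1.0], [1.0, 1.0, 0.5]])  # world-frame targets
    local = to_local_frame(kps, root_pos_w=np.array([1.0, 0.0, 0.8]),
                           root_yaw=np.pi / 2)
    print(local)  # targets expressed relative to the robot's heading
```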
7.2. Limitations & Future Work
The authors acknowledge a key limitation of the current ExBody2 framework:
- Inability to Seamlessly Recombine Specialist Policies: While specialist policies achieve high accuracy for their targeted motion groups, the current approach lacks a mechanism to dynamically blend or switch between these policies within a single tracking session. This restricts the flexibility to handle complex, multi-modal motion sequences that require transitions between different motion types.

Future work directions suggested by this limitation include:

- Dynamic Policy-Integration Mechanism: Developing a system that can adaptively blend or switch between specialized policies in real time. This would allow a more seamless and efficient execution of complex motion sequences involving transitions between different types of movements, unifying the broad coverage of the generalist with the high accuracy of the individual specialists. Such an advancement would further improve overall tracking precision, adaptability, and robustness.

Additionally, while not explicitly stated as limitations, the appendix hints at further areas for exploration:

- Integration with Real-Time Inputs: The paper shows preliminary work on integrating RGB-based real-time mimicry (using HybrIK) and motion synthesis (using a CVAE). Further development in these areas could enable more interactive and long-horizon tasks, moving beyond offline motion capture data.
- Robustness to Diverse External Perturbations: While the framework improves robustness, further work could explore how to maintain expressiveness and stability under even more extreme external perturbations.
7.3. Personal Insights & Critique
This paper presents a highly significant advancement in humanoid whole-body control. Its strengths lie in its systematic approach to address known limitations in the field.
Strengths:
- Pragmatic Data Curation: The automated data curation method based on the Feasibility-Diversity Principle is a brilliant solution to a long-standing problem in motion imitation. It moves beyond manual, subjective filtering and intelligently optimizes the training data for the robot's capabilities, which is crucial for sim-to-real transfer.
- Effective Generalist-Specialist Paradigm: The pretrain-finetune approach is well justified and empirically validated. It efficiently balances broad applicability with task-specific precision, a design choice often seen in successful large-scale machine learning models and now effectively applied to robotics.
- Robust Control Strategy: The decoupling of velocity and keypoint tracking, combined with local-frame keypoints, directly tackles the cumulative-error issue of global keypoint tracking, leading to more stable and robust real-world performance. The teacher-student framework is also a well-established and effective strategy for bridging the sim-to-real gap.
decoupling of velocity and keypoint tracking, combined withlocal frame keypoints, directly tackles thecumulative errorissue of global keypoint tracking, leading to more stable and robust real-world performance. Theteacher-studentframework is also a well-established and effective strategy for bridging thesim-to-real gap. - Thorough Evaluation: The comprehensive experimental evaluation, including comparisons with multiple baselines, ablation studies, and both simulation and real-world tests across various motion difficulties and OOD datasets, provides strong evidence for the method's effectiveness.
Potential Issues/Areas for Improvement:
- Computational Cost of Optimal Threshold Search: While the automated data curation is powerful, the described greedy search for the optimal threshold \tau^* requires training multiple policies on different data subsets. This could be computationally intensive if the initial dataset is very large and many candidate values of \tau need to be explored. Future work could investigate more efficient ways to estimate \tau^*, perhaps using transfer learning or meta-learning on the error distribution itself.
- Generalizability of \tau^*: The paper states that the optimal threshold \tau^* exhibits generalizability and can be effectively applied to other motion datasets. While promising, this claim would benefit from more extensive cross-dataset validation to confirm its robustness across different robot morphologies and motion domains.
- Lack of Dynamic Policy Integration: The acknowledged limitation of not being able to dynamically blend specialist policies is significant. For real-world humanoid robots, fluent transitions between diverse, fine-tuned behaviors are essential for truly human-like versatility. This is a critical next step.
- Specifics of Reward Weights: The reward weights (e.g., in Table I and Table IX) are given as fixed values. While common in RL, it would be valuable to understand how sensitive the policy's performance is to these specific weights and whether they require extensive tuning for new robots or tasks (a minimal sketch of this kind of weighted reward is given below).
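To illustrate what fixed reward weights look like in practice, here is a minimal sketch of a weighted tracking reward built from exponential kernels; the term names, weights, and scales are placeholders and are not the values from Table I or IX.

```python
# Minimal sketch of a weighted tracking reward (placeholder weights/scales).
import numpy as np

# term -> (weight, scale); reward_term = weight * exp(-scale * error)
REWARD_TERMS = {
    "keypoint_tracking": (1.0, 2.0),
    "joint_tracking":    (0.8, 1.0),
    "velocity_tracking": (0.5, 4.0),
}


def tracking_reward(errors: dict) -> float:
    """Combine per-term tracking errors into a single scalar reward."""
    return sum(w * np.exp(-s * errors[name])
               for name, (w, s) in REWARD_TERMS.items())


if __name__ == "__main__":
    print(tracking_reward({"keypoint_tracking": 0.05,
                           "joint_tracking": 0.10,
                           "velocity_tracking": 0.20}))
```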
Transferability and Applications: The methods and conclusions of ExBody2 are highly transferable.
- Other Humanoid Platforms: The framework's modularity (data curation, teacher-student, decoupled control) makes it adaptable to other humanoid robots with different kinematics and dynamics, requiring only appropriate retargeting and potentially re-tuning of parameters.
- Expressive Tasks Beyond Imitation: The principles of balancing feasibility and diversity and using generalist-specialist policies could be applied to learning other expressive tasks, such as human-robot interaction where natural gestures are critical, or even creative tasks like robot choreography.
- Human-Robot Collaboration: A robot capable of expressive and stable whole-body movements can collaborate with humans more intuitively and safely.
- Teleoperation and Remote Presence: The RGB-based real-time mimicry (as shown in the appendix) has direct applications in enhanced teleoperation and remote presence systems, where users can intuitively control robots with their own body movements.

Overall, ExBody2 represents a robust and well-thought-out solution to a complex problem, setting a new benchmark for expressive whole-body control in humanoid robotics. Its contributions pave the way for more sophisticated and versatile human-robot interactions.