ExBody2: Advanced Expressive Humanoid Whole-Body Control
TL;DR Summary
The paper presents ExBody2, an advanced control method enabling humanoid robots to perform expressive whole-body movements while maintaining stability. It employs a training approach based on human motion capture and simulations, and examines the trade-off between versatility across many motions and tracking performance on specific ones.
Abstract
This paper tackles the challenge of enabling real-world humanoid robots to perform expressive and dynamic whole-body motions while maintaining overall stability and robustness. We propose Advanced Expressive Whole-Body Control (Exbody2), a method for producing whole-body tracking controllers that are trained on both human motion capture and simulated data and then transferred to the real world. We introduce a technique for decoupling the velocity tracking of the entire body from tracking body landmarks. We use a teacher policy to produce intermediate data that better conforms to the robot's kinematics and to automatically filter away infeasible whole-body motions. This two-step approach enabled us to produce a student policy that can be deployed on the robot that can walk, crouch, and dance. We also provide insight into the trade-off between versatility and the tracking performance on specific motions. We observed significant improvement of tracking performance after fine-tuning on a small amount of data, at the expense of the others.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ExBody2: Advanced Expressive Humanoid Whole-Body Control
1.2. Authors
The paper is authored by Mazeyu Ji*, Xuanbin Peng*, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng†, and Xiaolong Wang†. Their affiliations are:
- UC San Diego (Mazeyu Ji, Xuanbin Peng, Jialong Li, Xuxin Cheng, Xiaolong Wang)
- UC Berkeley (Fangchen Liu)
- MIT (Ge Yang)
The asterisks (*) denote equal contribution, and the daggers (†) denote equal advising.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. While arXiv is not a peer-reviewed journal or conference, it is a highly influential platform for disseminating cutting-edge research quickly in fields like artificial intelligence, robotics, and physics. Papers published here are often submitted to top-tier conferences (e.g., NeurIPS, ICML, ICLR, RSS, ICRA) or journals later. The presence of authors from prestigious institutions like UC San Diego, UC Berkeley, and MIT suggests high-quality research, often aimed at prominent publication venues.
1.4. Publication Year
Published at (UTC): 2024-12-17T18:59:51.000Z. The publication year is 2024.
1.5. Abstract
This paper introduces Advanced Expressive Whole-Body Control (Exbody2), a novel method designed to enable real-world humanoid robots to execute expressive and dynamic full-body movements while ensuring stability and robustness. ExBody2 employs a two-step approach: first, it trains whole-body tracking controllers using both human motion capture data and simulated data. A key innovation is the decoupling of velocity tracking for the entire robot body from the tracking of specific body landmarks. The method utilizes a teacher policy to generate intermediate data that is kinematically feasible for the robot and to automatically filter out unachievable whole-body motions. This refined data then trains a student policy which can be deployed on physical robots, enabling them to perform actions such as walking, crouching, and dancing. The paper also explores the trade-off between the policy's versatility across various motions and its tracking performance on specific tasks, noting that fine-tuning on a small amount of targeted data significantly improves performance for those specific tasks, albeit at the expense of others.
1.6. Original Source Link
Official source and PDF link:
- Original Source Link: https://arxiv.org/abs/2412.13196
- PDF Link: http://arxiv.org/pdf/2412.13196v2

The publication status is a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the challenge of enabling humanoid robots to perform expressive, dynamic, and human-like whole-body motions while simultaneously maintaining stability and robustness in real-world environments. This problem is crucial because humanoid robots are envisioned to operate in human living spaces, requiring them to interact with their environment in a natural and versatile manner.
Existing challenges and gaps in prior research include:
- Dynamic and Kinematic Gap: A fundamental mismatch exists between biological human bodies and mechanical robots. Robots have different degrees of freedom (DoF), joint limits, and dynamic capabilities, making direct imitation of human motion capture data difficult.
- Trade-off between Expressiveness and Stability: Current control methods often struggle to achieve both high expressiveness (e.g., fluid dancing, complex gestures) and robust stability (e.g., maintaining balance, handling perturbations) simultaneously.
- Infeasible Motion Data: Human motion datasets often contain movements that are physically impossible or highly challenging for robots, leading to poor training and performance if used directly. Manual filtering is labor-intensive and prone to error, potentially reducing data diversity.
- Tracking Failures: Many previous whole-body tracking approaches, especially those relying on global keypoint tracking, suffer from cumulative errors and tracking failures when robots cannot perfectly align with desired global positions, limiting their application to highly stationary scenarios.

The paper's entry point is to address these gaps with a novel framework, ExBody2, which combines automated data curation, a generalist-specialist policy training pipeline, and a decoupled motion-velocity control strategy to bridge the sim-to-real gap and achieve both expressiveness and robustness.
2.2. Main Contributions / Findings
The primary contributions of Exbody2 are threefold:
- Generalist Policy with Automated Data Curation:
  - Contribution: ExBody2 introduces an automated method for curating human motion datasets. It uses a teacher policy to evaluate the feasibility of motions for the robot, focusing on lower-body stability while preserving upper-body diversity. This results in a Feasibility-Diversity Principle for dataset construction.
  - Finding: This automated curation process produces a robust generalist policy that significantly outperforms previous methods across diverse motions, both in simulation and real-world deployment, by balancing dataset feasibility and diversity. It learns broad, expressive behaviors without being hindered by impractical movements.
- Specialist Policy with Finetuning for Targeted Motions:
  - Contribution: Building upon the generalist policy, ExBody2 fine-tunes it for specific motion groups or tasks (e.g., dancing, specific locomotion patterns), leveraging the generalist's learned priors for efficient adaptation.
  - Finding: Fine-tuned specialist policies achieve even higher precision and fidelity for targeted behaviors than the generalist policy or policies trained from scratch, demonstrating the effectiveness of a pretrain-finetune paradigm for specialized tasks, improving robustness to disturbances, and enhancing real-world generalization.
- Decoupled Motion-Velocity Control Strategy:
  - Contribution: ExBody2 introduces a novel control strategy that decouples keypoint tracking from velocity control. It converts global keypoints into the robot's local frame and primarily uses velocity-based global tracking to guide movement, while key body tracking focuses on motion imitation.
  - Finding: This decoupled strategy, combined with a teacher-student framework (where the teacher policy uses privileged information in simulation and the student policy is distilled for real-world deployment), improves tracking robustness and stability. It prevents the cumulative errors often seen in global keypoint tracking, allowing expressive motion reproduction even with slight positional deviations.
Key Conclusions: The findings collectively demonstrate ExBody2's potential to bridge the gap between human-level expressiveness and reliable whole-body control in humanoid robots. The method achieves superior tracking accuracy, stability, and adaptability compared to state-of-the-art baselines.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the methodology and contributions of this paper, a beginner should understand the following foundational concepts:
- Humanoid Robots: Robots designed to mimic the human body's shape and movement capabilities, typically having a torso, head, two arms, and two legs. They operate with many degrees of freedom (DoF), the independent parameters that define the configuration of a mechanical system. For example, each joint (like a knee or elbow) has one or more DoF.
- Whole-Body Control (WBC): A control strategy that coordinates all the joints and limbs of a robot simultaneously to achieve a desired task while respecting physical constraints (e.g., balance, joint limits, contact forces). It is a complex problem due to the high DoF and non-linear dynamics.
- Motion Capture (MoCap) Data: Digital recordings of human body movements. Sensors are placed on a human actor, and their positions and orientations are tracked in 3D space. This data provides a rich source of human-like motion for robots to imitate. The CMU MoCap dataset [1] is a common benchmark.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, aiming to maximize its cumulative reward over time.
  - Agent: The entity that makes decisions (e.g., the robot's controller).
  - Environment: The system the agent interacts with (e.g., the simulated or real world).
  - State: A complete description of the environment at a given time.
  - Action: A decision made by the agent that changes the environment's state.
  - Reward: A scalar feedback signal indicating the desirability of an action.
  - Policy (π): A function that maps states to actions, defining the agent's behavior.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by states, actions, transition probabilities between states, and rewards. This is the underlying mathematical model for many RL problems.
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm that is sample-efficient and robust. It is an actor-critic method, meaning it learns both a policy (actor) and a value function (critic) that estimates the expected future reward. PPO aims to take the largest possible step towards a new policy without collapsing performance, often by clipping the policy ratio.
- Sim-to-Real Transfer: The process of training an RL policy in a simulated environment and then deploying it on a physical robot. This is desirable because training in simulation is safer, faster, and cheaper. However, the sim-to-real gap refers to the discrepancies between simulation and reality (e.g., imperfect physics models, sensor noise, latency) that can cause a policy trained in simulation to perform poorly in the real world. Techniques like domain randomization (varying simulation parameters) and privileged information are used to mitigate this gap.
- Teacher-Student Framework (Knowledge Distillation): A learning paradigm where a complex, high-performing model (the teacher) is used to train a simpler, more efficient model (the student). In robotics RL, a teacher policy often has access to privileged information (ground truth data not available in the real world) in simulation, making it easier to train. The student policy then learns to imitate the teacher's actions using only real-world observable information.
- DAgger (Dataset Aggregation): An imitation learning algorithm used in the teacher-student framework. It iteratively collects data by rolling out the student policy in the environment, asks the teacher policy (oracle) to label the correct actions for the collected states, and retrains the student on this aggregated dataset. This helps the student learn to correct its own mistakes and perform well in states it encounters during its own execution.
- Proportional-Derivative (PD) Controllers: A type of feedback control loop commonly used in robotics. A PD controller calculates an output (e.g., motor torque) based on the current error (difference between desired and actual state) and the rate of change of the error. It aims to move the system towards the desired state and dampen oscillations. In this paper, the action space is the target joint positions for PD controllers (see the sketch after this list).
- Keypoints: Specific points on the body (e.g., hands, feet, head, hips) used for tracking motion. Global keypoints refer to their positions in a fixed world coordinate system, while local keypoints are relative to the robot's own coordinate frame.
- Morphology: The shape and structure of a robot or organism. Retargeting human motion data to a robot's morphology involves adapting the human poses to fit the robot's specific joint structure and dimensions.
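To make the PD-controller action space concrete, here is a minimal Python sketch of converting target joint positions produced by a policy into motor torques. The gain values and DoF count are illustrative assumptions of this sketch, not the robot's actual parameters.

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp, kd):
    """Convert target joint positions (the policy's action) into motor torques.

    q_target : desired joint positions output by the policy
    q, q_dot : measured joint positions and velocities
    kp, kd   : proportional and derivative gains (per joint)
    """
    # Proportional term drives the joint toward the target;
    # derivative term damps the motion (target joint velocity is zero).
    return kp * (q_target - q) - kd * q_dot

# Example with hypothetical gains and an illustrative DoF count.
num_dof = 23
kp = np.full(num_dof, 100.0)        # stiffness (illustrative values)
kd = np.full(num_dof, 2.0)          # damping   (illustrative values)
q = np.zeros(num_dof)
q_dot = np.zeros(num_dof)
action = 0.1 * np.ones(num_dof)     # target joint positions from the policy
tau = pd_torque(action, q, q_dot, kp, kd)
```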
3.2. Previous Works
The paper contextualizes its contributions by referencing several prior works in humanoid whole-body control and motion imitation:
- Traditional Dynamics Modeling and Control [40, 62, 22, 41, 8, 25, 61, 27, 20, 6, 7, 9, 42, 49]: These methods rely on precise mathematical models of the robot's physics and dynamics. They require accurate system identification and intensive online computation to handle perturbations and generate stable locomotion. While effective for well-defined tasks, they can be rigid and lack the adaptability needed for diverse, expressive motions.
- RL-based Whole-Body Control [32, 33, 53, 10, 34, 35, 48, 23, 24, 47, 55, 52]: More recent approaches leverage RL to learn complex skills in simulation, which are then transferred to real robots. These methods often use task-specific rewards and environment randomization.
- Motion Imitation for Expressive Control (e.g., ExBody [3, 4], H2O [18], OmniH2O [17], HumanPlus [13]):
  - ExBody [3, 4]: This paper's predecessor. It uses a one-stage RL training pipeline and primarily tracks upper-body movements, focusing on partial body tracking. It does not explicitly follow lower-body step patterns.
  - H2O [18] and OmniH2O [17]: These methods focus on human-to-humanoid teleoperation and learning. They rely on global tracking of keypoint positions. A key limitation, as highlighted by ExBody2, is that this global tracking strategy often leads to tracking failures due to cumulative errors when the robot struggles to align with current global keypoints, limiting real-world application to highly stationary scenarios. OmniH2O also uses a similar observation space.
  - Limitations of existing motion imitation: These methods often rely on manually filtered motion data, which is labor-intensive and may still contain infeasible motions or lack diversity, limiting the robot's full hardware potential.
- Physics-based Character Motion Imitation in Simulation [44, 45, 56, 16, 37, 36, 63, 60, 57]: These works focus on generating realistic character animations in simulation, a related but distinct problem from real-world robot control, since they do not face the same hardware constraints or sim-to-real challenges.
3.3. Technological Evolution
The field of humanoid robot control has evolved from:
- Early Model-Based Control (1970s-1990s): Focused on inverse kinematics (IK) and inverse dynamics (ID) to calculate joint trajectories and torques based on desired end-effector positions or whole-body postures. Examples include early Honda humanoid robots [20] and approaches like the 3D Linear Inverted Pendulum Mode (LIPM) [25] for stable bipedal walking. These methods excel at stability and precision for pre-programmed tasks but struggle with adaptability and expressiveness for diverse human-like motions.
- Optimization-Based Whole-Body Control (2000s-2010s): Advanced WBC frameworks that optimize for multiple objectives (e.g., task achievement, balance, joint limits) simultaneously, often formulated as quadratic programs. This allowed for more complex tasks and better handling of constraints but still relied heavily on accurate models and online computation.
- Reinforcement Learning for Locomotion (2010s-Present): The rise of deep RL, especially PPO, coupled with powerful physics simulators (IsaacGym [39]), enabled training policies for complex, robust locomotion (e.g., quadrupedal robots [29, 28], bipedal robots [32, 33]). This allowed learning behaviors that were difficult to program manually, including dynamic motions and handling perturbations.
- RL for Expressive Motion Imitation (Recent Years): Integrating human motion capture data into RL training shifted the focus from stable locomotion alone to expressive whole-body movements, such as dancing or gesturing. Early works like ExBody, H2O, and OmniH2O demonstrated this potential but faced challenges with data feasibility, tracking robustness, and sim-to-real transfer.

ExBody2 sits at the cutting edge of this evolution, building on RL-based motion imitation but significantly improving its robustness, expressiveness, and real-world applicability by addressing key limitations of prior methods.
3.4. Differentiation Analysis
ExBody2 differentiates itself from previous methods, particularly ExBody, H2O, and OmniH2O, through three key innovations:
- Automated Data Curation (vs. Manual Filtering/Unfiltered Data):
  - Previous: ExBody uses language labels for filtering (which can be ambiguous), while others [18, 17] use SMPL avatars to simulate motions, which can still exceed real robot capabilities. Manual filtering is prone to human error and can reduce diversity.
  - ExBody2 Innovation: Introduces an automated data curation method based on a Feasibility-Diversity Principle. It trains an initial policy, evaluates its lower-body tracking error for each motion, and filters the dataset based on this error to remove infeasible motions. This automatically balances diversity (especially in upper-body movements) with feasibility (in the lower body), leading to a more effective training dataset and a more robust generalist policy.
- Generalist-Specialist Training Pipeline (vs. Single Policy):
  - Previous: Prior methods typically train a single policy, which may struggle to achieve both broad adaptability and high precision for specific, complex tasks.
  - ExBody2 Innovation: Employs a two-step approach. First, a generalist policy is trained on the curated diverse dataset. Second, this generalist policy is fine-tuned to create specialist policies for targeted motion groups (e.g., dancing, kung fu).
  - Differentiation: This allows ExBody2 to achieve high adaptability across a wide range of motions with the generalist, and then even higher fidelity and precision for specific tasks with the specialists, efficiently leveraging the generalist's learned priors.
- Decoupled Motion-Velocity Control Strategy (vs. Global Keypoint Tracking):
  - Previous: Approaches like H2O [18] and OmniH2O [17] predominantly rely on global tracking of keypoint positions. This often leads to cumulative errors and tracking failures because robots struggle to perfectly align with current global keypoints, especially in dynamic scenarios.
  - ExBody2 Innovation: Converts global keypoints to the robot's local frame and decouples keypoint tracking from velocity control. It uses velocity-based global tracking to guide overall movement and local key body tracking for expressive motion imitation. It also incorporates a teacher-student framework in which the teacher uses privileged information (such as ground-truth root velocity) to achieve high performance in simulation, and the student learns to infer this information from historical observations for real-world deployment.
  - Differentiation: This strategy enhances tracking robustness by preventing cumulative errors, allowing the robot to maintain stable movement while still achieving expressive whole-body imitation, even with slight positional deviations.
4. Methodology
4.1. Principles
The core idea behind ExBody2's methodology is to enable expressive and robust whole-body control for humanoid robots by meticulously preparing the training data and structuring the learning process. The key principles are:
- Feasibility-Diversity Principle: This principle guides the creation of an optimal training dataset. It states that the dataset must be diverse enough (especially in upper-body movements) for the robot to learn a broad range of expressive motions, but feasible enough in its lower-body motions to avoid movements that exceed the robot's physical limits or stability envelope. This ensures that the policy learns from achievable actions, preventing training noise and instability.
- Generalist-Specialist Learning: The approach first learns a generalist policy capable of broad motion coverage from a curated dataset. This generalist then serves as the foundation for fine-tuning specialist policies for specific, high-precision tasks. This leverages transfer learning, providing a warm start and enhanced robustness for specialized behaviors.
- Teacher-Student Distillation for Sim-to-Real Transfer: A teacher policy is trained in a high-fidelity simulation with access to privileged information (ground truth data not available in the real world) to achieve optimal performance. This knowledge is then distilled into a student policy that learns to perform the same task using only real-world observable information (e.g., historical observations), making it deployable on the physical robot.
- Decoupled Motion-Velocity Control: Instead of relying solely on tracking global keypoint positions (which can lead to cumulative errors), ExBody2 separates overall robot velocity control from precise local keypoint tracking. This allows for stable global movement while still achieving expressive local pose imitation.
4.2. Core Methodology In-depth (Layer by Layer)
ExBody2 adopts a sim-to-real framework structured around a Generalist-Specialist training pipeline and a teacher-student distillation process, as illustrated in Figure 2 and Figure 3.
As shown in Figure 2, the overall workflow begins with human motion data, which is first retargeted to the robot's specific morphology. This data then feeds into the Data-driven Generalist-specialist Training Pipeline. Within this pipeline, an automated data curation strategy is applied to train a generalist policy. Subsequently, this generalist policy can be fine-tuned to produce specialist policies. Finally, these policies are deployed onto real humanoid robots.
The following figure (Figure 2 from the original paper) shows the overall framework of ExBody2:

Fig. 2: Overall framework of ExBody2. Human motions are retargeted to the robot, the dataset is automatically filtered, a generalist policy is trained and then fine-tuned into specialist policies (for walking, dancing, and other skills), and the resulting controllers are deployed on a real humanoid robot, demonstrating expressive, dynamic, and stable whole-body motions in real-world environments.
4.2.1. Data-driven Generalist-specialist Training Pipeline
This pipeline is designed to balance adaptability and precision in whole-body motion tracking. It is guided by the Feasibility-Diversity Principle.
4.2.1.1. Feasibility-Diversity Principle
This principle dictates the design of the training dataset. It requires:
- Diversity: Sufficient motion diversity, particularly in the upper body, to cover a broad distribution of tasks and expressive movements.
- Feasibility: Maintaining feasibility in the lower body to avoid unachievable or overly dynamic motions that could destabilize training. This primarily involves filtering out extreme lower-body samples while retaining a wide range of upper-body actions.
4.2.1.2. Generalist Policy with Automated Data Curation
The goal is to train a generalist policy that performs well across diverse motion inputs. This is achieved through an automated process:
- Initial Policy Training: An initial base policy is first trained on a comprehensive, unfiltered motion dataset D. This dataset is typically very diverse but may contain many motions that are infeasible for the robot.
- Tracking Error Evaluation: After training the base policy, its tracking accuracy is evaluated for each motion sequence s in D. A tracking error metric e(s) is computed, focusing specifically on the lower body, because the lower body is central to dynamic feasibility and balance. The error metric is defined as:
  $ e(s) = \alpha E_{\mathrm{key}}(s) + \beta E_{\mathrm{dof}}(s) $
  Where:
  - $E_{\mathrm{key}}(s)$: The mean keybody position error for the lower body. This term helps prevent extreme deviations such as flipping or rolling, which indicate severe instability.
  - $E_{\mathrm{dof}}(s)$: The mean joint-angle tracking error for the lower body. This term ensures precise joint-level imitation.
  - $\alpha, \beta$: Coefficients weighting the relative importance of keybody position error and joint-angle error for lower-body stability and precision. In the ablation studies, a heavier weight is assigned to the joint-angle term.
- Motion Ranking and Distribution: Once e(s) is computed for all sequences, the motions are ranked by their tracking errors and the empirical distribution P(e) is derived.
- Optimal Threshold Determination: The objective is to find an error threshold τ such that the subset of sequences with error at most τ, denoted $\mathcal{D}_{\tau}$, enables training a new policy $\pi_{\tau}$ that maximizes performance across the full original dataset D. Formally, the search is for:
  $ \tau^* = \arg\max_{\tau} \; \mathbb{E}_{s \in \mathcal{D}} \left[ \mathrm{Performance}(\pi_{\tau}, s) \right] $
  Where:
  - $\tau^*$: The optimal error threshold.
  - $\mathbb{E}_{s \in \mathcal{D}} [\mathrm{Performance}(\pi_{\tau}, s)]$: The expected performance of policy $\pi_{\tau}$ (trained on $\mathcal{D}_{\tau}$) when evaluated on the full dataset D.
  - $\pi_{\tau}$: The policy trained on the filtered dataset $\mathcal{D}_{\tau}$.

In practice, P(e) is divided into evenly spaced error intervals, and a greedy search is used to identify $\tau^*$. The paper notes that optimal performance is consistently achieved at a moderate $\tau^*$, balancing diversity and feasibility (see the sketch after this list).
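To illustrate the curation step, the following Python sketch scores each clip with the combined lower-body error e(s) and grid-searches candidate thresholds over the error distribution. The function name, the `evaluate` callback, and the number of candidate thresholds are assumptions of this sketch, not the authors' implementation; in the real pipeline the evaluation step retrains a policy on each candidate subset, which is the expensive part.

```python
import numpy as np

def curate_dataset(errors, alpha, beta, evaluate, num_bins=10):
    """Filter motion clips by the base policy's lower-body tracking error.

    errors   : dict {clip_id: (E_key, E_dof)} of lower-body keypoint and
               joint-angle errors measured with the initial policy.
    alpha, beta : weights of the two error terms in e(s).
    evaluate : callable(list_of_clip_ids) -> float; hypothetical hook that
               retrains a policy on the subset and scores it on the full set.
    """
    # Combined error metric e(s) = alpha * E_key(s) + beta * E_dof(s).
    e = {cid: alpha * ek + beta * ed for cid, (ek, ed) in errors.items()}
    sorted_ids = sorted(e, key=e.get)

    # Candidate thresholds at evenly spaced percentiles of the empirical CDF.
    candidates = [e[sorted_ids[int(f * (len(sorted_ids) - 1))]]
                  for f in np.linspace(1.0 / num_bins, 1.0, num_bins)]

    best_tau, best_score = None, -np.inf
    for tau in candidates:
        subset = [cid for cid in sorted_ids if e[cid] <= tau]
        score = evaluate(subset)   # performance of pi_tau on the full dataset
        if score > best_score:
            best_tau, best_score = tau, score
    return [cid for cid in sorted_ids if e[cid] <= best_tau]
```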
The following figure (Figure 8 from the original paper) illustrates the empirical cumulative distribution function (CDF) for error metric e(s), which guides threshold selection:

Fig. 8: Empirical CDF of the base policy's error metric e(s) on the entire dataset. The horizontal axis indicates the percentile of motion sequences, from lowest to highest error, while the vertical axis shows e(s). Dashed horizontal lines mark key thresholds, illustrating how feasible and infeasible motions are determined systematically from the empirical distribution.
4.2.1.3. Specialist Policies with Finetuning
After obtaining the generalist policy, which balances diversity and feasibility, it is further refined into specialist policies for specific, high-precision tasks. This fine-tuning process offers several advantages over training from scratch (a minimal warm-start sketch follows this list):
- Efficiency: Specialist policies track a smaller set of motions, and the generalist policy provides a warm start, leveraging learned priors.
- Robustness: The specialist policy inherits adaptability and robustness from the generalist's exposure to a wider range of motion sequences, improving real-world generalization.
- Reduced Training Time: Fine-tuning an already well-trained model is computationally less demanding than training from scratch.
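As a minimal illustration of the warm-start idea, the sketch below copies a trained generalist network and continues optimizing it on a task-specific subset. The network shape, optimizer, and learning rate are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import copy
import torch
import torch.nn as nn

def make_policy(obs_dim=128, act_dim=23):
    # Hypothetical MLP policy; dimensions are placeholders.
    return nn.Sequential(nn.Linear(obs_dim, 512), nn.ELU(),
                         nn.Linear(512, 512), nn.ELU(),
                         nn.Linear(512, act_dim))

generalist = make_policy()
# ... train `generalist` on the curated, diverse dataset ...

# Warm start: copy the generalist, then continue training only on the
# task-specific motion subset (e.g., dance clips), typically with a
# smaller learning rate than the original training run.
specialist = copy.deepcopy(generalist)
optimizer = torch.optim.Adam(specialist.parameters(), lr=1e-4)
```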
4.2.2. Policy Objective and Architecture
ExBody2 uses a two-stage teacher-student training procedure for sim-to-real transfer, similar to [29, 28]. All policies are trained using IsaacGym [39] for efficient parallel simulation.
The following figure (Figure 3 from the original paper) depicts the teacher-student framework:

Fig. 3: Teacher-student framework for humanoid motion learning, where the teacher uses privileged information, and the student learns from past observations to generate control actions.
4.2.2.1. Teacher Policy Training
The humanoid motion control problem is formulated as a Markov Decision Process (MDP).
- State Space: The state space for the teacher policy consists of three components:
  - Privileged information: Ground-truth states and environmental properties only observable in simulators.
  - Proprioceptive states: The robot's internal sensor readings (e.g., joint positions, velocities).
  - Motion tracking target: The desired motion for the robot to imitate.
  The teacher policy takes all three components as input.
- Action Space: The teacher policy outputs an action for the Unitree G1 robot, which represents the target joint positions for Proportional-Derivative (PD) controllers. These PD controllers then compute the torques applied to the robot's motors.
- RL Algorithm: The Proximal Policy Optimization (PPO) algorithm [51] is used to train the teacher policy. PPO maximizes the expected accumulated future reward, encouraging robust behavior and accurate tracking of demonstrations:
  $ \mathbb{E}_{\hat{\pi}} \left[ \sum_{t=0}^{T} \gamma^t \mathcal{R}(s_t, \hat{a}_t) \right] $
  Where:
  - $\mathbb{E}_{\hat{\pi}}[\cdot]$: Expected value under the teacher policy $\hat{\pi}$.
  - $\sum_{t=0}^{T} \gamma^t \mathcal{R}(s_t, \hat{a}_t)$: Sum of discounted future rewards.
  - $\gamma$: Discount factor, weighting immediate rewards more heavily than future ones.
  - $\mathcal{R}(s_t, \hat{a}_t)$: Reward function, providing feedback for state $s_t$ and action $\hat{a}_t$.
- Privileged Information: This information is crucial for efficient teacher policy training in simulation and includes:
  - Ground-truth root velocity (linear and angular).
  - Real body link positions (accurate positions of all robot parts).
  - Physical properties of the environment and robot (e.g., friction coefficients, motor strength, mass properties).
  This information helps the teacher policy learn quickly and effectively by reducing observation noise and providing direct access to critical dynamics.
- Motion Tracking Target: This specifies the desired human motion to be imitated. It comprises two main parts:
  - Desired joints and 3D keypoints for both the upper and lower body.
  - Target root velocity and root pose (position and orientation).
  This allows the policy to learn to accurately track whole-body motions while also remaining controllable by external commands (e.g., joystick commands for linear velocity and body pose).
- Reward Design: The reward function is carefully designed to promote both tracking accuracy and robot stability. It primarily consists of:
  - Tracking Rewards: Encouraging accurate tracking of the root's velocity, direction, and orientation, and precise tracking of keypoints and joint positions.
  - Regularization Terms: Penalizing undesirable behaviors to boost stability and improve sim-to-real transfer (e.g., joint-limit violations, high accelerations, foot slippage).
  The main elements of the tracking reward are detailed in Table I from the paper:
| Term | Expression | Weight |
| --- | --- | --- |
| Expression Goal G_e: DoF Position | exp(−0.7·\|q_ref − q\|) | 3.0 |
| Expression Goal G_e: Keypoint Position | exp(−\|p_ref − p\|) | 2.0 |
| Root Movement Goal G_m: Linear Velocity | exp(−4.0·\|v_ref − v\|) | 6.0 |
| Root Movement Goal G_m: Velocity Direction | exp(−4.0·cos(v_ref, v)) | 6.0 |
| Root Movement Goal G_m: Roll & Pitch | exp(−\|θ_ref − θ\|) | 1.0 |
| Root Movement Goal G_m: Yaw | exp(−\|Δy\|) | 1.0 |
Where:
- DoF Position:
  - q_ref: Reference joint position from the motion target.
  - q: Actual joint position of the robot.
  - |·|: Absolute difference.
  - Weight: 3.0
  - Purpose: Rewards the robot for keeping its joint positions close to the reference motion.
- Keypoint Position:
  - p_ref: Reference keypoint position from the motion target.
  - p: Actual keypoint position of the robot.
  - Weight: 2.0
  - Purpose: Rewards the robot for keeping its keypoint positions close to the reference motion.
- Linear Velocity:
  - v_ref: Reference linear velocity of the robot's root (base).
  - v: Actual linear velocity of the robot's root.
  - Weight: 6.0
  - Purpose: Rewards accurate tracking of the reference linear velocity.
- Velocity Direction:
  - cos(v_ref, v): Cosine similarity between the reference and actual root velocity vectors, indicating how well their directions align.
  - Weight: 6.0
  - Purpose: Rewards the robot for moving in the desired direction.
- Roll & Pitch:
  - θ: Roll or pitch angle of the robot's root (applied to both roll and pitch).
  - Weight: 1.0 each
  - Purpose: Penalizes deviations from the desired (e.g., upright) roll and pitch orientation, contributing to stability.
- Yaw:
  - Δy: Difference in yaw (heading) angle from the reference.
  - Weight: 1.0
  - Purpose: Rewards the robot for maintaining the desired yaw orientation.
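The sketch below assembles the tracking terms of Table I into a single per-step reward, following the weights listed above. The velocity-direction term is written as exp(−4·(1 − cos)) so that aligned directions are rewarded; this is an interpretation of the table's compressed notation, not the authors' exact expression.

```python
import numpy as np

def tracking_reward(q, q_ref, p, p_ref, v, v_ref, rpy, rpy_ref, yaw_err):
    """Per-step tracking reward assembled from the terms in Table I.

    q, q_ref     : joint positions (rad) and their references
    p, p_ref     : keypoint positions (m) and references, shape (K, 3)
    v, v_ref     : root linear velocity and its reference, shape (3,)
    rpy, rpy_ref : root roll/pitch angles and references, shape (2,)
    yaw_err      : yaw tracking error (rad)
    This is a sketch of the structure, not the authors' code.
    """
    r_dof = 3.0 * np.exp(-0.7 * np.abs(q - q_ref).mean())
    r_key = 2.0 * np.exp(-np.linalg.norm(p - p_ref, axis=-1).mean())
    r_vel = 6.0 * np.exp(-4.0 * np.linalg.norm(v - v_ref))
    cos_sim = np.dot(v, v_ref) / (np.linalg.norm(v) * np.linalg.norm(v_ref) + 1e-8)
    r_dir = 6.0 * np.exp(-4.0 * (1.0 - cos_sim))   # reward aligned velocity direction
    r_rp = 1.0 * np.exp(-np.abs(rpy - rpy_ref).sum())
    r_yaw = 1.0 * np.exp(-np.abs(yaw_err))
    return r_dof + r_key + r_vel + r_dir + r_rp + r_yaw
```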
4.2.2.2. Student Policy Training
In this stage, the student policy is trained to be deployable in the real world.
- Removal of Privileged Information: The privileged information (available to the teacher) is removed from the student's observations.
- Historical Observations: The student policy uses a longer history of observations to infer the information that was privileged for the teacher. Its input includes proprioceptive states and the motion tracking goal, from which it predicts the action $a_t$.
- DAgger-style Distillation: The student policy is supervised with the teacher's oracle actions using a Mean Squared Error (MSE) loss:
  $ l = \lVert a_t - \hat{a}_t \rVert^2 $
  Where:
  - $a_t$: Action predicted by the student policy.
  - $\hat{a}_t$: Oracle action produced by the teacher policy.
  - $\lVert \cdot \rVert^2$: Squared Euclidean norm, representing the squared difference between the student's and teacher's actions.
  The training process uses a DAgger [50]-style strategy: the student policy is rolled out in simulation, and for each visited state the teacher policy computes the oracle action as a supervision signal. The student is then refined by iteratively minimizing the loss on this accumulated data, and training continues until the loss converges. A minimal sketch of this distillation step follows.
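Below is a minimal PyTorch sketch of the distillation update: the student regresses onto the teacher's oracle actions with the MSE loss above. The dimensions, network, and the synthetic batch standing in for rolled-out states are placeholders; the actual simulation rollout and data-aggregation loop are omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, history = 64, 23, 10          # illustrative dimensions
student = nn.Sequential(nn.Linear(obs_dim * (history + 1), 512), nn.ELU(),
                        nn.Linear(512, act_dim))
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

def distill_step(student_obs_hist, teacher_actions):
    """One supervised update on aggregated (state, oracle action) pairs."""
    pred = student(student_obs_hist)               # a_t from the student
    loss = ((pred - teacher_actions) ** 2).mean()  # l = ||a_t - a_hat_t||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Placeholder batch standing in for states visited by the student rollout,
# labeled with actions from the frozen teacher policy.
batch_obs = torch.randn(256, obs_dim * (history + 1))
batch_teacher_act = torch.randn(256, act_dim)
distill_step(batch_obs, batch_teacher_act)
```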
4.2.2.3. Motion-velocity Decoupled Control Strategy
This strategy is crucial for robust tracking.
- Limitations of Global Keypoint Tracking: Previous methods like H2O and OmniH2O learn to follow the trajectory of global keypoints. This can lead to suboptimal or failed tracking because global keypoints may drift over time, causing cumulative errors and hindering learning.
- ExBody2's Approach:
  - Local Coordinate Frame: ExBody2 converts global keypoints into the robot's current coordinate frame, so keypoint tracking is relative to the robot itself and issues from global drift are reduced (a minimal transformation sketch follows this list).
  - Decoupling: Keypoint tracking (for expressive pose imitation) is decoupled from velocity control (for guiding overall movement). Velocity-based global tracking guides the robot's global movement, while key body tracking focuses on precise motion imitation.
  - Robustness Enhancement: During the training stage, a small global drift of keypoints is allowed, and they are periodically corrected to the robot's current coordinate frame. This helps the robot learn to follow challenging keypoint motions without being overly constrained by perfect alignment at every instant.
  - Deployment Strategy: During real-world deployment, ExBody2 strictly employs local keypoint tracking with this motion-velocity decoupled control. By prioritizing overall velocity and local pose, tracking can be completed with maximal expressiveness even if slight positional deviations arise.
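The sketch below shows one way to express reference keypoints in the robot's current root frame, which is the core of the local tracking described above. Using a yaw-only rotation of the root heading is an assumption of this sketch; the paper does not specify the exact transform.

```python
import numpy as np

def keypoints_to_local_frame(keypoints_world, root_pos, root_yaw):
    """Express reference keypoints in the robot's current root frame.

    keypoints_world : (K, 3) reference keypoint positions in the world frame
    root_pos        : (3,) robot root position in the world frame
    root_yaw        : robot heading (rad); roll/pitch are ignored here
    """
    c, s = np.cos(root_yaw), np.sin(root_yaw)
    # Rotate by -yaw so "forward" becomes the robot's own x-axis.
    world_to_local = np.array([[ c,   s, 0.0],
                               [-s,   c, 0.0],
                               [0.0, 0.0, 1.0]])
    return (keypoints_world - root_pos) @ world_to_local.T
```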
5. Experimental Setup
5.1. Datasets
The experiments primarily utilize variations and subsets of a classic motion capture dataset, along with manually curated ones for specific evaluations.
- CMU Dataset [1]: The primary dataset used, referred to as D_CMU, is the full Carnegie Mellon University motion capture repository, comprising 1,919 sequences.
  - Characteristics: It is highly diverse, including a wide variety of action types, but it also contains extreme movements (e.g., push-ups, rolling on the ground, somersaults) that are beyond the robot's physical capabilities.
  - Purpose: Serves as the base for automated data curation and for the evaluation of generalist policies.
- Curated Subsets for Ablation Studies: To investigate the Feasibility-Diversity Principle, the following subsets of the CMU dataset were manually designed:
  - D_50 (50-action dataset): A minimal set containing only fundamental and mostly static actions (e.g., standing, simple walking).
    - Characteristics: Highly feasible but lacking diversity in both upper- and lower-limb motions.
    - Purpose: To evaluate the effect of overly simple datasets on generalization.
  - D_250 (250-action dataset): A moderate-sized set extending D_50 with additional upper-limb variations (e.g., arm gestures, some dance moves) and moderately dynamic lower-body actions (e.g., running, mild jumps).
    - Characteristics: Crucially, it avoids highly extreme motions that are difficult for the robot.
    - Purpose: To find the optimal balance between feasibility and diversity.
- ACCAD Dataset (D_ACCAD):
  - Characteristics: An out-of-distribution (OOD) dataset, meaning it contains motion patterns not present in the CMU training data.
  - Purpose: To evaluate the generalization capability of learned policies to previously unseen motion patterns.
- Task-Specific Datasets for Finetuning: For evaluating specialist policies, the following manually curated datasets were used:
  - D_easy, D_moderate, D_hard: A series of datasets with increasing difficulty levels, categorized by motion dynamics.
    - Characteristics: D_easy contains static or low-movement motions, while D_hard includes more dynamic and high-momentum movements.
    - Purpose: To assess how well policies generalize to increasingly complex motions and to demonstrate the advantage of fine-tuning.
  - Dance subset: A specific set of motions for fine-tuning a specialist policy for dance movements, exemplified by the Cha-Cha dance.
    - Characteristics: Involves dynamic lower-body movements combined with expressive upper-body gestures.
    - Purpose: To showcase the high precision of specialist policies for specific expressive tasks.

Data Sample: The paper does not provide concrete examples of individual data samples (e.g., a specific keypoint trajectory or a frame from a motion clip). However, the dataset descriptions (e.g., "standing, simple walking" for D_50; "push-ups, rolling on the ground" for D_CMU) offer a conceptual understanding of the data's form: sequences of human poses (joint angles, keypoint positions) over time. These human poses are retargeted to the Unitree G1 robot's morphology before training.
5.2. Evaluation Metrics
The policy's performance is evaluated using several quantitative metrics, calculated across all motion sequences in an evaluation dataset.
- Mean Linear Velocity Error ($E_{\mathrm{vel}}$)
  - Conceptual Definition: Quantifies the average absolute difference between the robot's root linear velocity and the reference linear velocity from the demonstration. It indicates how well the robot tracks the speed and direction of global movement.
  - Mathematical Formula: $ E_{\mathrm{vel}} = \frac{1}{N \cdot T} \sum_{i=1}^N \sum_{t=1}^T \lVert v_{i,t}^{\mathrm{robot}} - v_{i,t}^{\mathrm{ref}} \rVert $
  - Symbol Explanation:
    - $N$: Total number of motion sequences.
    - $T$: Number of time steps (frames) in each sequence.
    - $v_{i,t}^{\mathrm{robot}}$: The linear velocity vector of the robot's root at time $t$ in sequence $i$.
    - $v_{i,t}^{\mathrm{ref}}$: The reference linear velocity vector of the root from the demonstration at time $t$ in sequence $i$.
    - $\lVert \cdot \rVert$: Euclidean norm (magnitude of the vector).
- Mean Per Keybody Position Error (MPKPE, $E_{\mathrm{mpkpe}}$)
  - Conceptual Definition: Measures the average positional error of specific key body points (landmarks) on the robot relative to the reference motion. It reflects the overall accuracy of tracking key body parts.
  - Mathematical Formula: $ E_{\mathrm{mpkpe}} = \frac{1}{N \cdot T \cdot K} \sum_{i=1}^N \sum_{t=1}^T \sum_{k=1}^K \lVert p_{i,t,k}^{\mathrm{robot}} - p_{i,t,k}^{\mathrm{ref}} \rVert $
  - Symbol Explanation:
    - $N$: Total number of motion sequences.
    - $T$: Number of time steps (frames) in each sequence.
    - $K$: Total number of key body points tracked.
    - $p_{i,t,k}^{\mathrm{robot}}$: The 3D position vector of robot keypoint $k$ at time $t$ in sequence $i$.
    - $p_{i,t,k}^{\mathrm{ref}}$: The 3D position vector of reference keypoint $k$ at time $t$ in sequence $i$.
    - $\lVert \cdot \rVert$: Euclidean norm.
  - Variants: The paper also reports MPKPE separately for the upper body and the lower body to provide a more granular analysis of tracking performance in different body regions.
- Mean Per Joint Position Error (MPJPE, $E_{\mathrm{mpjpe}}$)
  - Conceptual Definition: Quantifies the average angular error (in radians) between the robot's joint angles and the reference joint angles from the demonstration. It indicates the precision of joint-level pose tracking.
  - Mathematical Formula: $ E_{\mathrm{mpjpe}} = \frac{1}{N \cdot T \cdot J} \sum_{i=1}^N \sum_{t=1}^T \sum_{j=1}^J |q_{i,t,j}^{\mathrm{robot}} - q_{i,t,j}^{\mathrm{ref}}| $
  - Symbol Explanation:
    - $N$: Total number of motion sequences.
    - $T$: Number of time steps (frames) in each sequence.
    - $J$: Total number of joints.
    - $q_{i,t,j}^{\mathrm{robot}}$: The angular position (in radians) of robot joint $j$ at time $t$ in sequence $i$.
    - $q_{i,t,j}^{\mathrm{ref}}$: The reference angular position (in radians) of joint $j$ at time $t$ in sequence $i$.
    - $|\cdot|$: Absolute difference.
  - Variants: As with MPKPE, the paper reports MPJPE separately for the upper body and the lower body. In the result tables below, these appear as upper- and lower-body columns alongside the overall $E_{\mathrm{mpjpe}}$ and $E_{\mathrm{mpkpe}}$.
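For concreteness, here is a small NumPy sketch computing the three metrics for a single sequence; averaging over the N sequences is left to the caller. The array shapes are assumptions for illustration.

```python
import numpy as np

def eval_metrics(v_robot, v_ref, p_robot, p_ref, q_robot, q_ref):
    """Compute the three tracking metrics for one motion sequence.

    v_* : (T, 3) root linear velocities
    p_* : (T, K, 3) keypoint positions
    q_* : (T, J) joint angles in radians
    """
    e_vel = np.linalg.norm(v_robot - v_ref, axis=-1).mean()    # E_vel
    e_mpkpe = np.linalg.norm(p_robot - p_ref, axis=-1).mean()  # E_mpkpe
    e_mpjpe = np.abs(q_robot - q_ref).mean()                   # E_mpjpe
    return e_vel, e_mpkpe, e_mpjpe
```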
5.3. Baselines
ExBody2 is evaluated against four state-of-the-art baselines using the Unitree G1 robot platform:
- Exbody [4]:
  - Description: The predecessor to ExBody2. It uses a one-stage RL training pipeline. Its core design tracks only the upper-body movements from human data, while tracking the root motion of the lower body without explicitly following step patterns.
  - Distinguishing Feature: Focuses on partial body tracking (upper body) and does not use a teacher-student structure. It uses a shorter history length (5) and relies entirely on local keypoints for tracking.
- Exbody†:
  - Description: A whole-body control version of the original Exbody, extended to track full-body movements based on human data.
  - Distinguishing Feature: Attempts comprehensive human motion imitation across the entire body posture while keeping most other aspects of the original Exbody's design (e.g., one-stage RL, no teacher-student).
- OmniH2O* [17]:
  - Description: The authors' reproduction of the OmniH2O method. It uses global keypoint tracking and an observation space consistent with the original paper.
  - Distinguishing Feature: The primary difference from ExBody2 is its reliance on global keypoint tracking and its not using the robot's velocity as privileged information during training. For a fair comparison in evaluation, OmniH2O* is adapted to use local keypoints during testing, but its training method remains true to the original.
- Exbody2-w/o-Filter (ExBody2 without automated data curation): ExBody2 without its core innovation of automated data curation. It serves as an ablation baseline to demonstrate the impact of the filtering strategy.
The choice of these baselines is representative because they encompass different tracking methods (partial vs. whole-body, global vs. local keypoints), training strategies (one-stage RL vs. teacher-student), and data handling (unfiltered vs. partially filtered).
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the effectiveness of ExBody2's innovations across various evaluation scenarios, both in simulation and real-world deployment.
6.1.1. Generalist Policy Performance
The performance of the ExBody2 generalist policy is initially compared against state-of-the-art baselines in simulation (Table II) and then validated in real-world settings (Table III).
The following are the results from Table II of the original paper:
| Method | $E_{\mathrm{vel}}$ ↓ | $E_{\mathrm{mpkpe}}$ ↓ | Upper $E_{\mathrm{mpkpe}}$ ↓ | Lower $E_{\mathrm{mpkpe}}$ ↓ | $E_{\mathrm{mpjpe}}$ ↓ | Upper $E_{\mathrm{mpjpe}}$ ↓ | Lower $E_{\mathrm{mpjpe}}$ ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Exbody | 0.4700 | 0.1339 | 0.1249 | 0.1428 | 0.2020 | 0.1343 | 0.2952 |
| Exbody† | 0.4195 | 0.1150 | 0.1106 | 0.1198 | 0.1496 | 0.1416 | 0.1607 |
| OmniH2O* | 0.3725 | 0.1253 | 0.1266 | 0.1240 | 0.1681 | 0.1564 | 0.1843 |
| Exbody2-w/o-Filter | 0.2787 | 0.1133 | 0.1087 | 0.1182 | 0.1355 | 0.1192 | 0.1579 |
| Exbody2 (Ours) | 0.2930 | 0.1000 | 0.0960 | 0.1040 | 0.1079 | 0.0953 | 0.1253 |
Analysis of Simulation Results (Table II):
- Superiority of ExBody2: ExBody2 (Ours) achieves the lowest errors on almost all metrics: $E_{\mathrm{mpkpe}}$ (0.1000), upper-body keypoint error (0.0960), lower-body keypoint error (0.1040), $E_{\mathrm{mpjpe}}$ (0.1079), upper-body joint error (0.0953), and lower-body joint error (0.1253). This indicates significantly better whole-body (both upper and lower) and joint-level tracking compared to all baselines.
- Impact of Filtering: Comparing ExBody2-w/o-Filter with ExBody2 (Ours) reveals the substantial benefit of the automated data curation. Filtering improves $E_{\mathrm{mpkpe}}$ (from 0.1133 to 0.1000), $E_{\mathrm{mpjpe}}$ (from 0.1355 to 0.1079), and especially the lower-body joint error (from 0.1579 to 0.1253). This highlights that removing infeasible lower-body motions leads to greater global stability and more precise upper-body control.
- Velocity Trade-off: ExBody2 (Ours) shows a slight increase in $E_{\mathrm{vel}}$ (0.2930) compared to ExBody2-w/o-Filter (0.2787). The paper attributes this to the full dataset containing broader velocity patterns, which, while enabling diverse dynamics, also introduce noise. This suggests a minor trade-off in which improved stability and precision (across keypoints and joints) are prioritized over a marginal increase in velocity tracking error.
- Baseline Performance: Exbody (partial body tracking) performs worst, as expected. Exbody† (full-body Exbody) improves over Exbody but is still significantly worse than ExBody2. OmniH2O*, despite using global keypoint tracking, also falls short of ExBody2, underscoring the benefits of ExBody2's decoupled control and data curation.

The following are the results from Table III of the original paper:
| Method | $E_{\mathrm{mpjpe}}$ ↓ | (unlabeled) | Lower $E_{\mathrm{mpjpe}}$ ↓ |
| --- | --- | --- | --- |
| Exbody | 0.2178 | 0.1223 | 0.3239 |
| Exbody† | 0.1465 | 0.1314 | 0.1672 |
| OmniH2O* | 0.1396 | 0.1273 | 0.1533 |
| Exbody2-w/o-Filter | 0.1361 | 0.1254 | 0.1481 |
| Exbody2 (Ours) | 0.1074 | 0.1092 | 0.1054 |
Analysis of Real-World Results (Table III):
- Consistency with Simulation: The real-world experiments, conducted on the Unitree G1 robot using a representative subset of the CMU dataset, closely mirror the simulation results. ExBody2 (Ours) consistently achieves the lowest errors for $E_{\mathrm{mpjpe}}$ (0.1074) and the lower-body joint error (0.1054). The unlabeled column (0.1092) is likely the corresponding upper-body joint error.
- Confirmation of Data Curation: The substantial performance gain of ExBody2 (Ours) over ExBody2-w/o-Filter (e.g., $E_{\mathrm{mpjpe}}$ from 0.1361 to 0.1074, lower-body joint error from 0.1481 to 0.1054) is critical, particularly in real-world environments with unpredictable disturbances. This validates that automated data curation contributes significantly to robust, consistent behavior and enables high-precision tracking.
- Overall Generalist Efficacy: The generalist policy of ExBody2 demonstrates stable and effective tracking in dynamic real-world environments, showing significant improvements in full-body and velocity tracking accuracy.
6.1.2. Impact of Automatic Data Curation
This section evaluates how the selection criteria for dataset construction affect the learning of a generalist policy, directly validating the Feasibility-Diversity Principle.
The following figure (Figure 4 from the original paper) shows the impact of dataset filtering thresholds on policy tracking errors:

Fig. 4: Impact of dataset filtering thresholds on policy tracking errors. The figure shows tracking error trends across different dataset filtering thresholds. Policies trained on datasets whose filtering threshold balances diversity and stability achieve the lowest tracking errors. The base policy exhibits suboptimal performance due to unfiltered data, while overly restrictive and overly lenient thresholds show reduced effectiveness. The error metric is computed with a heavier weight assigned to the joint-angle term.
Analysis of Filtering Thresholds (Figure 4):
- The figure plots tracking error trends against different dataset filtering thresholds.
- Moderate Thresholds Are Optimal: Policies trained on datasets filtered at a moderate threshold achieve the lowest tracking errors. This confirms that a balance between diversity and stability is crucial: such a dataset retains sufficient variability for generalization while excluding excessively difficult or unstable motions.
- Low Thresholds Give Poor Generalization: Overly restrictive thresholds result in policies trained on overly simple motions. While stable, these policies exhibit poor generalization to the full dataset due to a lack of diversity.
- High Thresholds and the Base Policy Give Inconsistent Behavior: Overly lenient thresholds, or no filtering at all (the base policy trained on unfiltered data), lead to the inclusion of highly dynamic and unstable motions. This noise degrades training effectiveness, resulting in inconsistent policy behavior and reduced tracking accuracy.

This visual analysis strongly validates the importance of carefully balancing feasibility and diversity in dataset selection.
The following are the results from Table X of the original paper:
| Training Dataset | In dist. | $E_{\mathrm{vel}}$ ↓ | $E_{\mathrm{mpkpe}}$ ↓ | Upper $E_{\mathrm{mpkpe}}$ ↓ | Lower $E_{\mathrm{mpkpe}}$ ↓ | $E_{\mathrm{mpjpe}}$ ↓ | Upper $E_{\mathrm{mpjpe}}$ ↓ | Lower $E_{\mathrm{mpjpe}}$ ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) Eval. on D_50 | | | | | | | | |
| D_50 | ✓ | 0.1375 | 0.0627 | 0.0571 | 0.0682 | 0.0753 | 0.0626 | 0.0928 |
| D_250 | ✓ | 0.1454 | 0.0669 | 0.0600 | 0.0738 | 0.0870 | 0.0689 | 0.1119 |
| D_CMU | ✓ | 0.1543 | 0.0767 | 0.0649 | 0.0885 | 0.1099 | 0.0854 | 0.1437 |
| (b) Eval. on D_CMU | | | | | | | | |
| D_50 | ✗ | 0.3509 | 0.1076 | 0.1074 | 0.1076 | 0.1338 | 0.1285 | 0.1410 |
| D_250 | ✗ | 0.2834 | 0.1048 | 0.1021 | 0.1073 | 0.1148 | 0.1012 | 0.1335 |
| D_CMU | ✓ | 0.2622 | 0.1071 | 0.1036 | 0.1110 | 0.1291 | 0.1129 | 0.1512 |
| (c) Eval. on D_ACCAD | | | | | | | | |
| D_50 | ✗ | 0.4226 | 0.1277 | 0.1210 | 0.1330 | 0.1720 | 0.1618 | 0.1861 |
| D_250 | ✗ | 0.3533 | 0.1234 | 0.1141 | 0.1315 | 0.1421 | 0.1223 | 0.1692 |
| D_CMU | ✗ | 0.3452 | 0.1267 | 0.1146 | 0.1381 | 0.1780 | 0.1635 | 0.1979 |
Analysis of Dataset Ablation Study (Table X):
- Evaluation on D_50 (Part a):
  - The policy trained on D_50 achieves the best performance (lowest errors) when evaluated on D_50 itself (e.g., $E_{\mathrm{mpkpe}}$ 0.0627, $E_{\mathrm{mpjpe}}$ 0.0753). This is expected, as it is an in-distribution evaluation on simple motions.
  - Policies trained on the larger, more diverse datasets (D_250, D_CMU) show slightly higher errors on D_50, suggesting that additional complexity in the training data does not necessarily improve performance on very simple tasks. The D_CMU-trained policy performs notably worse, indicating that noise from infeasible motions can degrade performance even on simple tasks when the dataset is not filtered.
- Evaluation on D_CMU (Part b):
  - The policy trained on D_250 (a subset of CMU with moderate diversity and feasibility) achieves the best performance on the full D_CMU dataset across the key pose metrics (e.g., $E_{\mathrm{mpkpe}}$ 0.1048, $E_{\mathrm{mpjpe}}$ 0.1148).
  - The policy trained on the full, unfiltered D_CMU performs worse than the D_250-trained policy on these metrics, even though it is evaluated on its own training distribution. This crucial finding reinforces that noisy, infeasible motions in a dataset degrade policy performance, as the agent wastes effort on unachievable goals instead of focusing on learnable actions.
  - The D_50-trained policy performs poorly on the broader D_CMU dataset, demonstrating its limited generalization due to insufficient diversity.
- Evaluation on D_ACCAD (Part c):
  - The policy trained on D_250 again outperforms the others on this out-of-distribution (OOD) dataset (e.g., $E_{\mathrm{mpkpe}}$ 0.1234, $E_{\mathrm{mpjpe}}$ 0.1421). This highlights that a clean and balanced dataset (like D_250) leads to better generalization to unseen motion patterns.
  - The D_CMU-trained policy achieves slightly better $E_{\mathrm{vel}}$ (0.3452 vs. 0.3533) but is worse on the other metrics, suggesting that its higher velocity diversity is captured at the cost of pose accuracy.
  - The D_50-trained policy shows substantial tracking errors, confirming its inability to handle diverse or novel data.

Conclusion for Automatic Data Curation: These results conclusively validate the Feasibility-Diversity Principle. A small dataset lacks generalization, while a large, unfiltered dataset introduces detrimental noise. The optimally curated dataset (D_250 in this study, corresponding to the moderate threshold in Figure 4) provides the best balance, leading to robust and expressive whole-body control.
6.1.3. Specialist Policy Finetuning
This section evaluates the effectiveness of the pretrain-finetune paradigm using generalist and specialist policies.
The following are the results from Table IV of the original paper:
| Method | $E_{\mathrm{vel}}$ ↓ | $E_{\mathrm{mpkpe}}$ ↓ | Upper $E_{\mathrm{mpkpe}}$ ↓ | Lower $E_{\mathrm{mpkpe}}$ ↓ | $E_{\mathrm{mpjpe}}$ ↓ | Upper $E_{\mathrm{mpjpe}}$ ↓ | Lower $E_{\mathrm{mpjpe}}$ ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (b) D_easy | | | | | | | |
| Specialist | 0.0828 | 0.0561 | 0.0564 | 0.0558 | 0.0772 | 0.0647 | 0.0944 |
| Scratch | 0.0853 | 0.0608 | 0.0623 | 0.0592 | 0.0843 | 0.0711 | 0.1024 |
| Generalist | 0.0986 | 0.0699 | 0.0708 | 0.0690 | 0.1041 | 0.0882 | 0.1259 |
| (a) D_moderate | | | | | | | |
| Specialist | 0.0991 | 0.0571 | 0.0582 | 0.0559 | 0.0760 | 0.0636 | 0.0930 |
| Scratch | 0.1188 | 0.0676 | 0.0688 | 0.0663 | 0.0924 | 0.0794 | 0.1103 |
| Generalist | 0.1217 | 0.0741 | 0.0727 | 0.0755 | 0.1092 | 0.0914 | 0.1337 |
| (c) D_hard | | | | | | | |
| Specialist | 0.1712 | 0.0827 | 0.0829 | 0.0826 | 0.1047 | 0.0911 | 0.1234 |
| Scratch | 0.1631 | 0.0886 | 0.0898 | 0.0873 | 0.1188 | 0.1067 | 0.1354 |
| Generalist | 0.1452 | 0.0890 | 0.0867 | 0.0912 | 0.1181 | 0.1011 | 0.1414 |
| (d) D_ACCAD | | | | | | | |
| Specialist | 0.4021 | 0.1149 | 0.1079 | 0.1215 | 0.1402 | 0.1290 | 0.1557 |
| Scratch | 0.4153 | 0.1246 | 0.1154 | 0.1332 | 0.1609 | 0.1490 | 0.1771 |
| Generalist | 0.3361 | 0.1268 | 0.1156 | 0.1391 | 0.1716 | 0.1532 | 0.1967 |
Analysis of Finetuning (Table IV):
-
Performance on , , :
- The
Specialistpolicy (fine-tuned from the generalist) consistently achieves the best performance (lowest errors) across all difficulty levels forEmpkpe,Enie,Eloee,Empipe,Eipe, andElowe. This highlights the precision gain from fine-tuning. - The advantage of
SpecialistoverScratch(trained from scratch with matched total iterations) becomes more pronounced with increasing difficulty. For example, on , Specialist has significantly lowerEmpkpe(0.0827 vs. 0.0886) andEmpipe(0.1047 vs. 0.1188). This shows the importance of using a pretrained policy as a foundation for complex tasks. - The
Generalistpolicy, while providing broad coverage, shows higher errors than bothSpecialistandScratchon these specific datasets, confirming that it's designed for versatility, not task-specific precision. - An interesting observation: For and , the
Generalistpolicy exhibits slightly bettervelocity tracking (Evel)than theSpecialistpolicy. This is likely because the generalist is exposed to a broader range of dynamic movements, giving it a slight edge inglobal velocity trackingfor challenging scenarios, even if its pose tracking is less precise.
- The
-
Performance on (Out-of-Distribution):
-
The
Specialistpolicy significantlyoutperforms both Generalist and Scratchon this OOD dataset (e.g.,Empkpe0.1149 vs. 0.1268 (Generalist) vs. 0.1246 (Scratch)). This is a strong indicator of thesuperior generalizability and adaptabilityachieved by fine-tuning from a robust generalist, rather than training from scratch or relying solely on a broad generalist. It demonstrates that the specialist retains the adaptability from the generalist while gaining task-specific robustness.The following figure (Figure 5 from the original paper) visually illustrates the performance of different policies for the Cha-Cha dance:
Fig. 5: A sequence of a robot performing the Cha-Cha dance. From top to bottom: the reference motion represented by an avatar, our algorithm's performance in simulation, and its performance on the real robot. The bottom three rows show the per-frame errors (whole-body joint DoF error, upper-body joint DoF error, and lower-body joint DoF error), with the blue curve representing the Exbody2-Specialist policy fine-tuned on the Cha-Cha motions, orange the Exbody2-Scratch policy trained from scratch on the same motions, and green the Exbody2-Generalist policy trained on the filtered dataset.
Visual Analysis of Cha-Cha Dance (Figure 5):

- The visual comparison of the robot performing the Cha-Cha dance in simulation and in the real world against the reference motion (avatar) qualitatively shows high fidelity.
- The per-frame error plots at the bottom (whole-body, upper-body, and lower-body joint DoF error) quantitatively confirm the superior performance of the Exbody2-Specialist policy (blue curve). It consistently maintains significantly lower tracking errors than Exbody2-Scratch (orange) and Exbody2-Generalist (green). This is particularly evident for the dynamic movements of the Cha-Cha, which involve complex coordination of both the upper and lower body.

Conclusion for Specialist Finetuning: The pretrain-finetune paradigm is highly effective. The pretrained generalist policy provides a strong foundation and robustness, while subsequent fine-tuning allows for task-specific specialization and significantly higher precision, especially in challenging and OOD scenarios. A minimal sketch of the fine-tuning step follows below.
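As a rough illustration of that step, the sketch below initializes a specialist from the generalist's weights and continues training with a reduced learning rate. The network sizes, dimensions, and the use of a generic supervised update in place of the paper's RL fine-tuning are all assumptions made so the snippet runs stand-alone with PyTorch.

```python
# Minimal sketch of the generalist -> specialist fine-tuning step
# (hypothetical setup; the paper fine-tunes with RL on a narrowed motion set,
# approximated here by a generic gradient update on stand-in data).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 96, 23  # placeholder dimensions for a G1-like robot


def make_policy() -> nn.Module:
    return nn.Sequential(
        nn.Linear(OBS_DIM, 256), nn.ELU(),
        nn.Linear(256, 256), nn.ELU(),
        nn.Linear(256, ACT_DIM),
    )


generalist = make_policy()
# generalist.load_state_dict(torch.load("generalist.pt"))  # hypothetical checkpoint

# The specialist starts from the generalist's weights rather than from scratch.
specialist = make_policy()
specialist.load_state_dict(generalist.state_dict())

# Fine-tune with a small learning rate so the broad prior is preserved
# while the policy adapts to the target motion subset (e.g., Cha-Cha clips).
optimizer = torch.optim.Adam(specialist.parameters(), lr=1e-5)

obs = torch.randn(64, OBS_DIM)             # stand-in for rollout observations
target_actions = torch.randn(64, ACT_DIM)  # stand-in for improved actions
loss = nn.functional.mse_loss(specialist(obs), target_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"fine-tuning surrogate loss: {loss.item():.4f}")
```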
6.2. Ablation Studies / Parameter Analysis
The paper conducts ablation studies to verify the effectiveness of key components in ExBody2: history length for the student policy and teacher-student (DAgger) distillation.
The following are the results from Table XI of the original paper:
| Method | E_vel ↓ | E_mpkpe ↓ | E_mpkpe (upper) ↓ | E_mpkpe (lower) ↓ | E_mpjpe ↓ | E_mpjpe (upper) ↓ | E_mpjpe (lower) ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **(a) History Length Ablation** | | | | | | | |
| Exbody2-History10 (Ours) | 0.2930 | 0.1000 | 0.0960 | 0.1040 | 0.1079 | 0.0953 | 0.1253 |
| Exbody2-History0 | 0.4151 | 0.1047 | 0.1010 | 0.1081 | 0.1190 | 0.0986 | 0.1303 |
| Exbody2-History25 | 0.2950 | 0.1032 | 0.0984 | 0.1078 | 0.1128 | 0.0965 | 0.1351 |
| Exbody2-History50 | 0.2648 | 0.1004 | 0.0956 | 0.1051 | 0.1114 | 0.0967 | 0.1317 |
| Exbody2-History100 | 0.3242 | 0.1063 | 0.1001 | 0.1122 | 0.1225 | 0.1050 | 0.1466 |
| **(b) DAgger Ablation** | | | | | | | |
| Exbody2 (Ours) | 0.2930 | 0.1000 | 0.0960 | 0.1040 | 0.1079 | 0.0953 | 0.1253 |
| Exbody2-w/o-DAgger | 0.4195 | 0.1150 | 0.1106 | 0.1198 | 0.1496 | 0.1416 | 0.1607 |
Analysis of History Length Ablation (Table XI (a)):

- The student policy uses historical observations to compensate for the lack of privileged information. Exbody2-History0 (no extra history) performs significantly worse, especially in E_vel (0.4151) and E_mpjpe (0.1190), highlighting that historical context is crucial for the student to infer the missing state information.
- Among the variants with history, Exbody2-History10 (Ours) yields the best overall performance, with the lowest E_mpkpe (0.1000), E_mpjpe (0.1079), and lower-body joint error (0.1253).
- Increasing the history length beyond 10 (e.g., 25, 50, 100) generally does not improve performance and can sometimes degrade it (e.g., History100 has higher errors than History10 on most metrics). The paper suggests that longer histories increase the difficulty of fitting the privileged information, reducing tracking performance, so an intermediate amount of historical context works best. A minimal sketch of such an observation stack follows below.
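The sketch below stacks the last H proprioceptive observations into the student's input; the dimensions and the zero-padding convention are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the student's observation history (hypothetical shapes).
# The student has no privileged state, so it stacks the last H proprioceptive
# observations; the ablation varies H (0, 10, 25, 50, 100).
from collections import deque
import numpy as np


class HistoryBuffer:
    def __init__(self, history_len: int, obs_dim: int):
        self.history_len = history_len
        self.obs_dim = obs_dim
        self.buf = deque(maxlen=max(history_len, 1))

    def reset(self):
        self.buf.clear()

    def push(self, obs: np.ndarray) -> np.ndarray:
        """Append the newest observation and return the stacked policy input."""
        self.buf.append(obs)
        if self.history_len == 0:
            return obs
        # Pad with zeros until the buffer is full (e.g., right after a reset).
        pad = [np.zeros(self.obs_dim)] * (self.history_len - len(self.buf))
        return np.concatenate(list(self.buf) + pad)


if __name__ == "__main__":
    hb = HistoryBuffer(history_len=10, obs_dim=48)
    for t in range(3):
        policy_input = hb.push(np.random.randn(48))
    print(policy_input.shape)  # (480,) = 10 x 48
```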
Analysis of DAgger Ablation (Table XI (b)):

- This compares Exbody2 (Ours), which uses DAgger distillation, against Exbody2-w/o-DAgger.
- Removing DAgger-style distillation (Exbody2-w/o-DAgger) severely degrades performance across all metrics: E_vel increases from 0.2930 to 0.4195, and E_mpjpe jumps from 0.1079 to 0.1496.
- This strong negative impact confirms that DAgger is critical. Without its iterative data collection and teacher supervision, the student policy struggles to learn robust velocity tracking directly from raw observations, making it difficult to follow fast or dynamic motions accurately. The teacher's privileged velocity guidance is effectively transferred through DAgger; a minimal sketch of the distillation loop follows below.
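The sketch below captures the control flow of DAgger-style distillation: the student's own rollouts are labeled by the privileged teacher and then regressed onto. The environment and the linear policies are toy stand-ins, and only the loop structure mirrors the procedure described above.

```python
# Minimal sketch of DAgger-style teacher-student distillation (toy stand-ins).
import numpy as np


class DummyEnv:
    """Stand-in environment exposing proprio obs and privileged state."""
    def __init__(self, obs_dim=48, priv_dim=16, act_dim=23):
        self.obs_dim, self.priv_dim, self.act_dim = obs_dim, priv_dim, act_dim

    def reset(self):
        return np.zeros(self.obs_dim), np.zeros(self.priv_dim)

    def step(self, action):
        return np.random.randn(self.obs_dim), np.random.randn(self.priv_dim)


class LinearPolicy:
    """Tiny linear policy used as both teacher and student stand-ins."""
    def __init__(self, in_dim, act_dim):
        self.W = np.zeros((in_dim, act_dim))

    def act(self, *inputs):
        return np.concatenate(inputs) @ self.W

    def supervised_update(self, obs, target_actions, lr=1e-2):
        pred = obs @ self.W
        grad = obs.T @ (pred - target_actions) / len(obs)
        self.W -= lr * grad


def dagger_distill(env, student, teacher, iters=5, rollout_len=64):
    obs, priv = env.reset()
    for _ in range(iters):
        obs_buf, act_buf = [], []
        for _ in range(rollout_len):
            action = student.act(obs)               # the student drives the rollout
            obs_buf.append(obs)
            act_buf.append(teacher.act(obs, priv))  # the privileged teacher labels each state
            obs, priv = env.step(action)
        student.supervised_update(np.array(obs_buf), np.array(act_buf))
    return student


if __name__ == "__main__":
    env = DummyEnv()
    teacher = LinearPolicy(env.obs_dim + env.priv_dim, env.act_dim)
    student = LinearPolicy(env.obs_dim, env.act_dim)
    dagger_distill(env, student, teacher)
```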
6.3. Data Presentation (Tables)
The tables presented in the paper have been transcribed and integrated into the analysis above. Specifically:
- Table II: Comparisons with baselines on the evaluation dataset for the Unitree G1 (simulation).
- Table III: Comparisons with baselines on selected motions for the Unitree G1 in the real world.
- Table IV: Evaluation of fine-tuned policies on the D_Easy, D_Moderate, D_Hard, and D_Cha-Cha datasets.
- Table X: Dataset ablation study, evaluating policies trained on differently curated datasets against various evaluation sets.
- Table XI: Self-ablation study on history length and DAgger distillation.
6.4. Visual Demonstrations
The paper includes several visual demonstrations:
- Figure 1 and 6 showcase the expressive and dynamic whole-body motions performed by the robot (Unitree G1), including walking, crouching, dancing, and interactions with objects.
- Figure 5 provides a detailed frame-by-frame comparison of the Cha-Cha dance, illustrating the reference motion, simulation, and real-robot performance, along with per-frame error plots for the different policies.
- Figure 7 illustrates how ExBody2 successfully replicates various motions (clapping fists, greeting, punching, crouching, defensive pose) in both simulation and real-world settings, emphasizing the retention of high fidelity to the target motion, including the lower-body poses critical for balance.
These visual results qualitatively support the quantitative findings, demonstrating that ExBody2 enables realistic and robust expressive humanoid control.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces ExBody2, an Advanced Expressive Whole-Body Control framework for humanoid robots, which significantly advances the field of human-like motion imitation. The core contributions include:
- Automated Data Curation: A novel method based on a Feasibility-Diversity Principle that intelligently filters human motion datasets. It ensures that the robot learns from kinematically feasible motions (especially for lower-body stability) while retaining broad expressiveness (particularly in upper-body movements), leading to a robust generalist policy.
- Generalist-Specialist Training Pipeline: A two-stage learning approach in which a generalist policy, trained on the curated diverse data, serves as the foundation for fine-tuning specialist policies. These specialists achieve higher precision on targeted, complex tasks while leveraging the generalist's learned priors.
- Decoupled Motion-Velocity Control Strategy: A robust control scheme that separates global velocity tracking from local keypoint tracking by converting global references into the robot's local frame (a minimal sketch of this frame conversion is given below). Combined with a teacher-student framework (privileged information in simulation, DAgger distillation for real-world deployment), this enhances tracking robustness and stability against cumulative errors.

Experimental results in both simulation and real-world deployment on the Unitree G1 robot demonstrate that ExBody2 consistently outperforms prior methods across various metrics. The automated data curation proves critical for robust performance, and the pretrain-finetune paradigm significantly boosts precision for specialized motions as well as generalization to out-of-distribution tasks.
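The frame conversion mentioned in the third contribution can be pictured with the sketch below, assuming a simple yaw-aligned, root-relative convention; the exact transformation used in the paper may differ.

```python
# Minimal sketch of expressing global keypoint targets in the robot's local
# frame (hypothetical convention: subtract the root position and remove the
# root yaw), one way to avoid accumulating global drift in keypoint tracking.
import numpy as np


def yaw_rotation(yaw: float) -> np.ndarray:
    """Rotation matrix mapping local-frame vectors to world frame (yaw only)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def to_local_frame(keypoints_w: np.ndarray, root_pos_w: np.ndarray,
                   root_yaw: float) -> np.ndarray:
    """Map (K, 3) world-frame keypoints into the root's yaw-aligned frame."""
    # Row-vector form: world_row @ R is equivalent to R^T applied to columns.
    return (keypoints_w - root_pos_w) @ yaw_rotation(root_yaw)


if __name__ == "__main__":
    kps = np.array([[1.0, 0.0, 1.0], [1.0, 1.0, 0.5]])  # world-frame targets
    local = to_local_frame(kps, root_pos_w=np.array([1.0, 0.0, 0.8]),
                           root_yaw=np.pi / 2)
    print(local)  # targets expressed relative to the robot's heading
```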
7.2. Limitations & Future Work
The authors acknowledge a key limitation of the current ExBody2 framework:
- Inability to Seamlessly Recombine Specialist Policies: While specialist policies achieve high accuracy for their targeted motion groups, the current approach lacks a mechanism to dynamically blend or switch between these policies within a single tracking session. This restricts the flexibility to handle complex, multi-modal motion sequences that require transitions between different motion types.

Future work directions suggested by this limitation include:

- Dynamic Policy-Integration Mechanism: Developing a system that can adaptively blend or switch between specialized policies in real time. This would allow a more seamless and efficient execution of complex motion sequences involving transitions between different types of movements, unifying the broad coverage of the generalist with the high accuracy of the individual specialists. Such an advancement would further improve overall tracking precision, adaptability, and robustness.

Additionally, while not explicitly stated as limitations, the appendix hints at further areas for exploration:

- Integration with Real-Time Inputs: The paper shows preliminary work on integrating RGB-based real-time mimicry (using HybrIK) and motion synthesis (using a CVAE). Further development in these areas could enable more interactive and long-horizon tasks, moving beyond offline motion capture data.
- Robustness to Diverse External Perturbations: While the framework improves robustness, further work could explore how to maintain expressiveness and stability under even more extreme external perturbations.
7.3. Personal Insights & Critique
This paper presents a highly significant advancement in humanoid whole-body control. Its strengths lie in its systematic approach to address known limitations in the field.
Strengths:
- Pragmatic Data Curation: The automated data curation method based on the Feasibility-Diversity Principle is a brilliant solution to a long-standing problem in motion imitation. It moves beyond manual, subjective filtering and intelligently optimizes the training data for the robot's capabilities, which is crucial for sim-to-real transfer.
- Effective Generalist-Specialist Paradigm: The pretrain-finetune approach is well justified and empirically validated. It efficiently balances broad applicability with task-specific precision, a design choice often seen in successful large-scale machine learning models and now effectively applied to robotics.
- Robust Control Strategy: The decoupling of velocity and keypoint tracking, combined with local-frame keypoints, directly tackles the cumulative-error issue of global keypoint tracking, leading to more stable and robust real-world performance. The teacher-student framework is also a well-established and effective strategy for bridging the sim-to-real gap.
decoupling of velocity and keypoint tracking, combined withlocal frame keypoints, directly tackles thecumulative errorissue of global keypoint tracking, leading to more stable and robust real-world performance. Theteacher-studentframework is also a well-established and effective strategy for bridging thesim-to-real gap. - Thorough Evaluation: The comprehensive experimental evaluation, including comparisons with multiple baselines, ablation studies, and both simulation and real-world tests across various motion difficulties and OOD datasets, provides strong evidence for the method's effectiveness.
Potential Issues/Areas for Improvement:
- Computational Cost of Optimal Threshold Search: While the automated data curation is powerful, the described greedy search for the optimal threshold \tau^* requires training multiple policies on different data subsets. This could be computationally intensive if the initial dataset is very large and many candidate values of \tau need to be explored. Future work could investigate more efficient ways to estimate \tau^*, perhaps using transfer learning or meta-learning on the error distribution itself.
- Generalizability of \tau^*: The paper states that the optimal threshold \tau^* exhibits generalizability and can be effectively applied to other motion datasets. While promising, this claim would benefit from more extensive cross-dataset validation to confirm its robustness across different robot morphologies and motion domains.
- Lack of Dynamic Policy Integration: The acknowledged limitation of not being able to dynamically blend specialist policies is significant. For real-world humanoid robots, fluent transitions between diverse, fine-tuned behaviors are essential for truly human-like versatility. This is a critical next step.
- Specifics of Reward Weights: The reward weights (e.g., in Table I and Table IX) are given as fixed values. While common in RL, it would be valuable to understand how sensitive the policy's performance is to these specific weights and whether they require extensive tuning for new robots or tasks (a minimal sketch of this kind of weighted reward is given below).
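To illustrate what fixed reward weights look like in practice, here is a minimal sketch of a weighted tracking reward built from exponential kernels; the term names, weights, and scales are placeholders and are not the values from Table I or IX.

```python
# Minimal sketch of a weighted tracking reward (placeholder weights/scales).
import numpy as np

# term -> (weight, scale); reward_term = weight * exp(-scale * error)
REWARD_TERMS = {
    "keypoint_tracking": (1.0, 2.0),
    "joint_tracking":    (0.8, 1.0),
    "velocity_tracking": (0.5, 4.0),
}


def tracking_reward(errors: dict) -> float:
    """Combine per-term tracking errors into a single scalar reward."""
    return sum(w * np.exp(-s * errors[name])
               for name, (w, s) in REWARD_TERMS.items())


if __name__ == "__main__":
    print(tracking_reward({"keypoint_tracking": 0.05,
                           "joint_tracking": 0.10,
                           "velocity_tracking": 0.20}))
```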
Transferability and Applications: The methods and conclusions of ExBody2 are highly transferable.
- Other Humanoid Platforms: The framework's modularity (data curation, teacher-student, decoupled control) makes it adaptable to other humanoid robots with different kinematics and dynamics, requiring only appropriate retargeting and potentially re-tuning of parameters.
- Expressive Tasks Beyond Imitation: The principles of balancing feasibility and diversity and using generalist-specialist policies could be applied to learning other expressive tasks, such as human-robot interaction where natural gestures are critical, or even creative tasks like robot choreography.
- Human-Robot Collaboration: A robot capable of expressive and stable whole-body movements can collaborate with humans more intuitively and safely.
- Teleoperation and Remote Presence: The RGB-based real-time mimicry (as shown in the appendix) has direct applications in enhanced teleoperation and remote presence systems, where users can intuitively control robots with their own body movements.

Overall, ExBody2 represents a robust and well-thought-out solution to a complex problem, setting a new benchmark for expressive whole-body control in humanoid robotics. Its contributions pave the way for more sophisticated and versatile human-robot interactions.