
GMT: General Motion Tracking for Humanoid Whole-Body Control

Published: 06/18/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents GMT, a motion tracking framework enabling humanoid robots to track diverse full-body motions in real-world settings. It features an Adaptive Sampling strategy and a Motion Mixture-of-Experts architecture, and demonstrates state-of-the-art performance through extensive experiments in both simulation and the real world.

Abstract

The ability to track general whole-body motions in the real world is a useful way to build general-purpose humanoid robots. However, achieving this can be challenging due to the temporal and kinematic diversity of the motions, the policy's capability, and the difficulty of coordination of the upper and lower bodies. To address these issues, we propose GMT, a general and scalable motion-tracking framework that trains a single unified policy to enable humanoid robots to track diverse motions in the real world. GMT is built upon two core components: an Adaptive Sampling strategy and a Motion Mixture-of-Experts (MoE) architecture. The Adaptive Sampling automatically balances easy and difficult motions during training. The MoE ensures better specialization of different regions of the motion manifold. We show through extensive experiments in both simulation and the real world the effectiveness of GMT, achieving state-of-the-art performance across a broad spectrum of motions using a unified general policy. Videos and additional information can be found at https://gmt-humanoid.github.io.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is "GMT: General Motion Tracking for Humanoid Whole-Body Control". The central topic is the development of a unified and scalable framework for humanoid robots to track diverse whole-body motions in the real world.

1.2. Authors

The authors are Zixuan Chen, Mazeyu Ji, Xue Bin Peng, Xuxin Cheng, Xuanbin Peng, and Xiaolong Wang. Their affiliations are:

  • UC San Diego

  • Simon Fraser University

    The paper notes equal contribution from Zixuan Chen, Mazeyu Ji, and Xue Bin Peng, as well as equal advising.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server. While not a peer-reviewed journal or conference in its current form, arXiv is a widely recognized platform for disseminating early research in various fields, including robotics and AI. Papers posted here are often submitted to prestigious conferences or journals later.

1.4. Publication Year

The paper was published at 2025-06-17T17:59:33.000Z, which translates to June 17, 2025.

1.5. Abstract

The paper addresses the challenge of building general-purpose humanoid robots capable of tracking diverse whole-body motions in the real world. This task is difficult due to the temporal and kinematic variability of motions, the limitations of current control policies, and the complex coordination required for upper and lower bodies. To tackle these issues, the authors propose GMT (General Motion Tracking), a scalable framework that trains a single, unified policy. GMT features two main components: an Adaptive Sampling strategy to balance the training on easy and difficult motions, and a Motion Mixture-of-Experts (MoE) architecture to enable specialization across different regions of the motion manifold. Through extensive experiments in both simulation and real-world deployments, GMT demonstrates state-of-the-art performance across a wide spectrum of motions using a single general policy.

The official source link is https://arxiv.org/abs/2506.14770. The PDF link is https://arxiv.org/pdf/2506.14770v2.pdf. This paper is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is enabling humanoid robots to perform a wide range of general whole-body motions in the real world using a single, unified controller. This is a crucial step towards building general-purpose humanoid robots that can operate effectively in human environments.

This problem is important because manually designing controllers for high-degree-of-freedom (DoF) humanoid systems is extremely challenging and labor-intensive. While learning-based methods have shown promise in simulated environments, transferring these capabilities to real-world robots faces several significant hurdles:

  • Partial Observability: Real robots lack full state information (e.g., linear velocities, global root positions) that is often assumed in simulations, making policy learning more difficult.

  • Hardware Limitations: Many human motions are infeasible for robots due to physical constraints, and even feasible motions may require more torque or speed than robots can provide, leading to mismatches.

  • Unbalanced Data Distribution: Large human motion datasets (like AMASS) are often dominated by common, simple motions (e.g., walking), with a scarcity of complex or dynamic skills, making it hard for policies to learn these less frequent but important movements.

  • Model Expressiveness: Simple neural network architectures (like MLPs) struggle to capture the complex temporal dependencies and diverse motion categories present in large motion datasets, limiting tracking performance and generalization.

    Existing works have addressed some of these individual issues (e.g., teacher-student training for partial observability, specialized policies for different motion categories, transformer models for expressiveness), but developing a unified general motion tracking controller that handles all these challenges simultaneously remains an open problem.

The paper's entry point is to address the data distribution and model expressiveness problems jointly, combined with careful design decisions for partial observability and hardware issues, to create an effective system for training general motion tracking controllers for real humanoid robots.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  • GMT Framework: Proposing GMT, a general and scalable motion-tracking framework that trains a single unified policy for real-world humanoid robots to track diverse motions.

  • Adaptive Sampling Strategy: Introducing a novel Adaptive Sampling strategy that mitigates issues arising from uneven motion category distributions in large datasets by dynamically balancing easy and difficult motions during training. This ensures that the policy dedicates more learning effort to challenging segments.

  • Motion Mixture-of-Experts (MoE) Architecture: Integrating a Motion Mixture-of-Experts architecture into the policy network to enhance model expressiveness and generalizability. This allows different "experts" to specialize in different types of motions, improving performance across the entire motion manifold.

  • Comprehensive Motion Input Design: Developing an effective motion input design that combines immediate next motion frames with compressed future motion sequences, enabling the policy to capture both long-term trends and immediate tracking targets.

  • State-of-the-Art Performance: Demonstrating through extensive simulation and real-world experiments (on a Unitree G1 robot) that GMT achieves state-of-the-art performance across a broad spectrum of motions (e.g., stylized walking, high kicking, dancing, boxing, running) using a single, unified policy.

  • Generalizability to MDM-Generated Motions: Showing that the trained policy can effectively track motions generated by Motion Diffusion Models (MDM), indicating its potential for broader applications in downstream tasks.

    The key conclusions are that by jointly addressing data distribution and model expressiveness challenges, along with robust design choices for real-world deployment, it is possible to create a highly effective general motion tracking controller for humanoid robots. This unified controller can serve as a foundational component for future whole-body algorithm development.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following foundational concepts:

  • Humanoid Robots: Robots designed to resemble the human body, typically having two legs, a torso, two arms, and a head. They are characterized by a high number of Degrees of Freedom (DoFs), making their control complex.
  • Whole-Body Control: A control strategy that coordinates all joints and limbs of a robot simultaneously to achieve a desired task, maintaining balance, and interacting with the environment. It's crucial for dynamic and agile movements in humanoids.
  • Motion Tracking/Imitation: The task of making a robot follow a reference motion (often human motion data) as closely as possible. This involves minimizing the difference between the robot's pose and the target pose over time.
  • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent learns a policy—a mapping from states to actions.
    • Agent: The entity that learns and makes decisions. In this paper, it's the humanoid robot's controller.
    • Environment: The physical or simulated world in which the agent operates.
    • State (s): A representation of the environment at a given time. For a robot, this includes joint positions, velocities, root orientation, etc.
    • Action (a): A command issued by the agent to the environment. For a robot, this could be target joint positions or torques.
    • Reward (r): A scalar feedback signal from the environment indicating how good or bad the agent's action was. The goal is to maximize cumulative reward.
    • Policy (π): The strategy that the agent uses to decide what action to take in a given state, often represented by a neural network.
  • Deep Reinforcement Learning (DRL): RL combined with deep neural networks to approximate policies and value functions, enabling learning in high-dimensional state and action spaces.
  • Proximal Policy Optimization (PPO) [38]: A popular DRL algorithm used for training RL agents. It's an on-policy algorithm that tries to take the largest possible step towards updating the policy without causing the new policy to perform too differently from the old policy, using a clipped objective function.
  • Mixture-of-Experts (MoE) [e.g., originally in 1991 by Jacobs et al.]: A neural network architecture that consists of multiple "expert" sub-networks and a "gating" network. The gating network learns to choose (or combine the outputs of) which expert(s) to use for a given input. This allows the model to specialize different parts of the network for different types of inputs or tasks, improving expressiveness and capacity without a proportional increase in computational cost for each input.
  • Teacher-Student Training Framework: A common approach in robotics and RL for sim-to-real transfer. A teacher policy (often with access to privileged information in simulation) is first trained. Then, a student policy (which only uses real-world observable information) is trained to imitate the teacher's actions. This can be done via behavioral cloning or Distillation.
  • DAgger (Dataset Aggregation) [39]: An Imitation Learning algorithm used in the teacher-student framework. It iteratively collects data by running the student policy, querying the teacher for optimal actions on the student's encountered states, and then aggregating this new data into the training dataset. This helps address the covariate shift problem (where the student policy encounters states not seen during initial training on expert data).
  • Domain Randomization [42]: A sim-to-real transfer technique where various physical parameters (e.g., friction, mass, sensor noise) in the simulation are randomly varied during training. This forces the policy to be robust to a wide range of conditions, making it more likely to generalize to the real world, which has inherent uncertainties.
  • Convolutional Neural Network (CNN) [41]: A type of deep neural network commonly used for processing grid-like data, such as images. They are effective at extracting hierarchical features. In this paper, a CNN is used as an encoder to compress sequences of motion frames into a latent vector.
  • Latent Vector: A low-dimensional representation of higher-dimensional data, capturing its essential features. Encoders are used to generate latent vectors.

3.2. Previous Works

The paper discusses related works in two main categories: Learning-based Humanoid Whole-Body Control and Humanoid Motion Imitation.

3.2.1. Learning-based Humanoid Whole-Body Control

Traditional model-based methods for humanoid control, like those for gait planning [12, 13, 14], are robust but labor-intensive due to complex dynamics. Recent learning-based approaches have used either hand-designed task rewards or human motions as reference.

  • Task-specific controllers: Many RL approaches focus on specific tasks, such as walking [15, 16, 17, 18, 19], jumping [25, 26, 27], or fall recovery [28, 29]. These policies are generally specialized and not easily transferable across tasks. For example, a policy trained for walking might not be able to jump. This highlights the need for general-purpose controllers.

3.2.2. Humanoid Motion Imitation

Leveraging human motion data is a promising path for general-purpose controllers, especially in character animation:

  • Simulated Characters: Works like DeepMimic [1], ASE [2], Perpetual Humanoid Control [3], PDP [4], and MaskedMimic [5] have achieved high-quality and general motion tracking for simulated characters. They can reproduce a wide variety of motions and perform diverse skills with human-like behaviors.
  • Real Robots Challenges: Transferring these techniques to real robots [10, 34, 9, 8, 7, 11, 35, 36, 37] is challenging due to partial observability and hardware limitations.
    • Decoupled Control: Some approaches decouple upper-body and lower-body control (e.g., ExBody [10], Mobiletelevision [34]) to manage the trade-off between expressiveness and stability. This often means sacrificing some whole-body coordination.
    • Full-sized Robot Imitation: HumanPlus [9] and OmniH2O [8] achieved whole-body motion imitation on full-sized robots, but often with unnatural movements in the lower body, indicating a lack of fine-tuned control or balance.
    • Specialist Policies: ExBody2 [7] achieved better whole-body tracking but required several separate specialist policies, contradicting the goal of a single, unified controller.
    • Mocap Dependency: VMP [35] showed high-fidelity reproduction but depended on a motion capture system during deployment, limiting its real-world applicability.
    • Sim-to-Real Alignment: ASAP [11] focused on aligning simulation and real-world physics for agile whole-body skills.

3.2.3. Differentiation Analysis

The core innovations and differences of GMT compared to the main methods in related work are:

  • Single Unified Policy: Unlike ExBody2 [7] which uses separate specialist policies for different skills, GMT trains a single unified policy capable of tracking a broad range of diverse motions (upper-body, lower-body, dynamic, static) with high fidelity.
  • Addressing Data Imbalance: GMT introduces Adaptive Sampling to specifically tackle the unbalanced data distribution problem prevalent in large mocap datasets (like AMASS), where simpler motions dominate. This is distinct from prior works that might just filter or augment data.
  • Enhanced Model Expressiveness: GMT uses a Motion Mixture-of-Experts (MoE) architecture to improve the policy's capacity and specialization across the motion manifold. While transformer models (like in HumanPlus [9]) are also used for expressiveness, MoE provides a different mechanism for specialization, allowing different "experts" to handle distinct motion types.
  • Robust Motion Input: GMT utilizes a more sophisticated motion input representation that combines immediate next frames with a CNN-encoded latent vector of future motion sequences. This provides the policy with both short-term targets and long-term context, which is shown to be crucial for high-quality tracking, unlike prior works that might only use the immediate next frame [7, 10, 8].
  • Real-World Deployment with Stability: GMT successfully deploys its unified policy on a real-world humanoid robot (Unitree G1), demonstrating stable and generalized performance across diverse skills, which has been a persistent challenge for other whole-body imitation methods that often show unnatural movements or require external systems.

3.2.4. Technological Evolution

The field has evolved from manual controller design to task-specific learning-based controllers, then to general learning-based controllers in simulation, and now towards robust, general controllers for real-world humanoid robots. GMT fits into the latest stage by addressing key challenges in sim-to-real transfer, data handling, and model capacity to enable a single policy for diverse real-robot motions. It builds upon teacher-student frameworks and domain randomization but innovates in how it handles motion data complexity and policy specialization.

4. Methodology

4.1. Principles

The core idea behind GMT is to train a single, unified policy that can track a wide variety of whole-body human motions on a real humanoid robot. The theoretical basis or intuition behind it is that by effectively managing the diversity and imbalance of large motion datasets and enhancing the policy network's capacity to specialize, a general controller can emerge. This is achieved by combining insights from reinforcement learning and imitation learning, specifically using a teacher-student framework, adaptive sampling for data efficiency, and a Mixture-of-Experts architecture for model expressiveness.

The motion tracking problem is formulated as a goal-conditioned Reinforcement Learning (RL) problem. In RL, an agent learns to make decisions in an environment to maximize a cumulative reward. In goal-conditioned RL, the agent is given a specific goal to achieve.

At each timestep t, the agent's policy π takes the current state s_t and the goal g_t as input and outputs an action a_t, i.e., π(a_t | s_t, g_t). When this action a_t is applied to the environment, it leads to a new state s_{t+1} according to the environment dynamics p(s_{t+1} | s_t, a_t), and the agent receives a reward r(s_{t+1}, s_t, a_t). The ultimate goal of the agent is to maximize the expected return, which is the sum of discounted future rewards:

$$ J(\pi) = \mathbb{E}_{p(\tau \mid \pi)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right] $$

where:

  • J(π) is the objective function to maximize for policy π.
  • E_{p(τ|π)} denotes the expectation over trajectories τ sampled according to policy π. A trajectory τ is a sequence of states, actions, and rewards over time.
  • T is the time horizon, representing the total number of timesteps in an episode.
  • γ is the discount factor (a value between 0 and 1, typically close to 1), which determines the present value of future rewards. A higher γ means future rewards are considered more important.
  • r_t is the reward received at timestep t.
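To make the objective concrete, here is a minimal NumPy sketch (illustrative only, not from the paper) of how the discounted return of a single trajectory is computed from per-step rewards:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of discounted rewards: R = sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: rewards [1.0, 0.5, 0.25] with gamma = 0.99
# -> 1.0 + 0.99 * 0.5 + 0.99**2 * 0.25 ≈ 1.740
print(discounted_return([1.0, 0.5, 0.25]))
```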

4.2. Core Methodology In-depth (Layer by Layer)

GMT employs a two-stage teacher-student training framework for sim-to-real transfer, similar to prior works.

Stage 1: Training the Privileged Teacher Policy A privileged teacher policy is first trained in simulation. This policy has access to both proprioceptive observations (information a real robot can sense, like joint angles and velocities) and privileged information (information typically only available in simulation, like true linear velocities, contact forces, or internal model parameters). This teacher policy is trained using the PPO (Proximal Policy Optimization) algorithm. The output of the teacher policy is joint target actions.

Stage 2: Training the Deployable Student Policy A student policy is then trained to imitate the output of the teacher policy. This student policy is designed to be deployable on a real robot, meaning it only takes proprioceptive observations and their history as input. This training is done using the DAgger (Dataset Aggregation) algorithm. The student policy is optimized by minimizing the ℓ2 loss between its output and the teacher's output. The loss function is defined as:

$$ \mathcal{L} = \| \hat{\mathbf{a}}_t - \mathbf{a}_t \|_2^2 $$

where:

  • ℒ is the loss function to be minimized.

  • â_t is the joint target action output by the teacher policy at timestep t.

  • a_t is the joint target action output by the student policy at timestep t.

  • ‖·‖₂² denotes the squared Euclidean norm, which measures the squared difference between the teacher's and student's actions.
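The distillation step itself is simple. Below is a minimal PyTorch-style sketch of one such update; `student`, `teacher`, and their call signatures are hypothetical stand-ins, not the paper's actual code:

```python
import torch

def dagger_student_update(student, teacher, obs, obs_history, privileged, motion_target, optimizer):
    """One distillation step under assumed interfaces:
    - teacher(obs, privileged, motion_target)  -> joint target actions (labels, no grad)
    - student(obs_history, motion_target)      -> joint target actions (being trained)
    The student minimizes the squared L2 distance to the teacher's action."""
    with torch.no_grad():
        a_teacher = teacher(obs, privileged, motion_target)
    a_student = student(obs_history, motion_target)
    loss = ((a_teacher - a_student) ** 2).sum(dim=-1).mean()   # ||a_hat - a||_2^2, batch-averaged
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a DAgger loop, the states fed to this update are collected by rolling out the student itself, so the teacher labels exactly the states the student actually visits.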

    The GMT framework introduces two core components to enhance this process: Adaptive Sampling and Motion Mixture-of-Experts.

4.2.1. Adaptive Sampling

Large motion datasets, such as AMASS, suffer from category imbalance, meaning some motion types (e.g., walking, in-place activities) are much more frequent than others (e.g., complex dynamic skills). This imbalance hinders the learning of less frequent but often more challenging motions. Traditional sampling strategies often sample entire motion sequences, which are usually dominated by easier segments, leading to an inefficient learning signal for difficult parts.

Adaptive Sampling addresses this with two components:

  1. Random Clipping:

    • Motions longer than 10 seconds are clipped into several sub-clips, each with a maximum length of 10 seconds.
    • To avoid artifacts at the boundaries of these clipped segments, a random offset of up to 2 seconds is introduced during clipping.
    • All motions are re-clipped periodically during training to ensure diversity in the sampled sub-clips, preventing the policy from overfitting to specific segments.
  2. Tracking Performance-based Probabilities:

    • During training, the completion level c_i is tracked for each motion clip i.
    • Initially, c_i is set to 10.
    • Each time a motion is successfully completed (i.e., tracked without exceeding a certain error threshold), c_i is multiplied by 0.99, effectively decaying its value. The minimum value c_i can reach is 1.
    • An error threshold E_i is dynamically set for each motion, determining when tracking is considered to have failed:

      $$ E_i = 0.25 \exp \left( \frac{c_i - 1}{10 - 1} \times \log \left( \frac{0.6}{0.25} \right) \right) $$

      where:
      • E_i is the error threshold for motion i.
      • 0.25 is the base error threshold (the value of E_i when c_i = 1).
      • 0.6 is the maximum error threshold (the value of E_i when c_i = 10).
      • c_i is the completion level for motion i, ranging from 1 to 10.
      • As c_i decays (meaning the motion is successfully completed more often), E_i also decreases, tightening the error tolerance for that motion.
    • The sampling level s_i for each motion i is then defined and used to determine the probability of sampling that motion for training:

      $$ s_i = \begin{cases} \left( \min \left( E_{\text{max key body error}} / 0.15, \ 1 \right) \right)^5, & c_i = 1 \\ c_i, & c_i > 1 \end{cases} $$

      where:
      • s_i is the sampling level for motion i.
      • E_max key body error is the maximum key body position error encountered during the latest attempt to track motion i.
      • 0.15 is a normalization constant for the error, and the min(·, 1) clamp keeps the ratio at or below 1.
      • The exponent 5 amplifies the effect of large errors, so that motions with significant errors are much more likely to be sampled when c_i = 1.
      • If c_i > 1, the sampling level is simply c_i, meaning motions that are still being learned (higher c_i) are sampled in proportion to their completion level. Once c_i = 1 (the motion is considered "mastered" in terms of completion), its sampling probability is instead driven by how well it was tracked in terms of maximum error, ensuring continued refinement of challenging parts.
    • Finally, the actual sampling probability of each motion is obtained by normalizing the s_i values across all motions. This mechanism prioritizes motions that the policy currently struggles with (high error) or has not yet mastered (high c_i), dynamically adjusting the training focus; a compact code sketch of this bookkeeping follows below.
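The bookkeeping described above can be summarized in a short NumPy sketch; the constants (initial level 10, decay 0.99, thresholds 0.25 and 0.6, normalizer 0.15, exponent 5) come from the text, while the class and variable names are our own:

```python
import numpy as np

class AdaptiveSampler:
    """Tracks per-clip completion levels c_i and turns them into sampling probabilities."""
    def __init__(self, num_clips):
        self.c = np.full(num_clips, 10.0)                 # completion levels, initialized to 10
        self.last_max_key_err = np.full(num_clips, 1.0)   # latest max key-body error per clip

    def error_threshold(self, i):
        # E_i = 0.25 * exp((c_i - 1)/(10 - 1) * log(0.6/0.25)): 0.25 at c_i = 1, 0.6 at c_i = 10
        return 0.25 * np.exp((self.c[i] - 1.0) / 9.0 * np.log(0.6 / 0.25))

    def report_attempt(self, i, max_key_body_error, completed):
        # `completed` means the clip was tracked without exceeding error_threshold(i)
        self.last_max_key_err[i] = max_key_body_error
        if completed:
            self.c[i] = max(1.0, self.c[i] * 0.99)        # decay toward the minimum of 1

    def sampling_probs(self):
        s = np.where(
            self.c > 1.0,
            self.c,                                             # still learning: weight by c_i
            np.minimum(self.last_max_key_err / 0.15, 1.0) ** 5  # "mastered": weight by latest error
        )
        return s / s.sum()                                      # normalize across all clips

# Usage: sample the next clip index proportionally to the adaptive probabilities
sampler = AdaptiveSampler(num_clips=4)
sampler.report_attempt(0, max_key_body_error=0.05, completed=True)
next_clip = np.random.choice(4, p=sampler.sampling_probs())
```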

4.2.2. Motion Mixture-of-Experts (MoE)

To address the model expressiveness issue for large and diverse motion datasets, GMT incorporates a soft MoE module into the teacher policy network.

The policy network consists of:

  • Expert Networks: A group of n neural networks, each referred to as an "expert". Each expert network takes the robot's state observation and motion targets as input, and expert i outputs an action a_i.

  • Gating Network: Another neural network that takes the same input observations (robot state and motion targets) as the expert networks. Its role is to output a probability distribution p_1, p_2, ..., p_n over all experts, where p_i is the probability assigned to expert i.

    The final action a of the MoE policy is a weighted sum of the actions from all experts, where the weights are given by the probabilities from the gating network:

    $$ \mathbf{a} = \sum_{i=1}^{n} p_i \mathbf{a}_i $$

    where:

  • a is the final action produced by the MoE policy.

  • n is the total number of expert networks.

  • p_i is the probability assigned to expert i by the gating network.

  • a_i is the action output by expert network i.

    This architecture allows different experts to specialize in different regions of the motion manifold (e.g., one expert for walking, another for kicking, another for dancing), and the gating network learns to smoothly transition between or combine these experts based on the current motion context. This significantly enhances the model's capacity and ability to generalize across diverse motions without requiring a massive single network.
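A minimal PyTorch sketch of such a soft MoE action head is shown below; the number of experts, hidden sizes, and activation functions are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SoftMoEPolicy(nn.Module):
    """Soft mixture-of-experts head: n expert MLPs plus a gating network.
    The final action is the gating-probability-weighted sum of the expert actions."""
    def __init__(self, obs_dim, action_dim, num_experts=4, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                          nn.Linear(hidden, action_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                                  nn.Linear(hidden, num_experts))

    def forward(self, x):
        # x: concatenated robot state and motion-target features, shape (B, obs_dim)
        p = torch.softmax(self.gate(x), dim=-1)                     # (B, n) gating weights
        actions = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n, action_dim)
        return (p.unsqueeze(-1) * actions).sum(dim=1)               # a = sum_i p_i * a_i

# Example forward pass: a batch of 8 inputs producing 23 target joint positions each
policy = SoftMoEPolicy(obs_dim=512, action_dim=23)
a = policy(torch.randn(8, 512))   # shape (8, 23)
```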

The following figure (Figure 3 from the original paper) provides an overview of the GMT framework:

Figure 3: An overview of GMT, showing the motion target frame, the proprioceptive observation o_t, and the privileged information e_t. In stage one, the mocap dataset is processed with Adaptive Sampling and the Mixture-of-Experts teacher policy is trained; in stage two, the student policy is distilled via behavior cloning, with tracking errors evaluated in simulation.

4.2.3. Dataset Curation

The training dataset is a combination of AMASS [6] and LAFAN1 [40]. Since these raw datasets contain motions that are infeasible or dangerous for real robots (e.g., crawling, fallen states, extremely dynamic maneuvers), a two-stage data curation process is adopted:

  1. Rule-based Filtering (First Stage): Motions that violate basic physical constraints or robot capabilities are removed. Examples include:
    • Motions where the robot's root roll or pitch angles exceed predefined thresholds.
    • Motions where the robot's root height is abnormally high or low.
    • Motions that involve physical capabilities beyond the robot's hardware limits.
  2. Performance-based Filtering (Second Stage): After the initial rule-based filtering, a preliminary policy is trained on the filtered dataset (with approximately 5 billion samples). Based on the completion rates (how often the policy successfully tracks a motion) achieved by this preliminary policy, motions that consistently fail are further filtered out. This results in a curated dataset of 8925 clips totaling 33.12 hours, which is more appropriate for training a high-quality robot controller.
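As a rough illustration of the first, rule-based stage, such a filter could look like the sketch below; the threshold values are hypothetical placeholders, since the paper does not list its exact limits:

```python
import numpy as np

# Hypothetical thresholds; the paper does not publish its exact values.
MAX_ROOT_ROLL_PITCH = np.deg2rad(60.0)   # rad
ROOT_HEIGHT_RANGE = (0.3, 1.2)           # m, a plausible range for a 1.32 m robot

def passes_rule_based_filter(motion):
    """motion: dict of per-frame arrays 'root_roll', 'root_pitch', 'root_height'."""
    roll_ok = np.all(np.abs(motion["root_roll"]) < MAX_ROOT_ROLL_PITCH)
    pitch_ok = np.all(np.abs(motion["root_pitch"]) < MAX_ROOT_ROLL_PITCH)
    height_ok = np.all((motion["root_height"] > ROOT_HEIGHT_RANGE[0]) &
                       (motion["root_height"] < ROOT_HEIGHT_RANGE[1]))
    return roll_ok and pitch_ok and height_ok
```

The second, performance-based stage then simply drops clips whose completion rate under the preliminary policy remains below a chosen cutoff.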

4.2.4. Motion Inputs

The goal of motion tracking is to make the robot follow a specific target pose in each motion frame. The motion target g_t at timestep t is represented as a vector:

$$ \mathbf{g}_t = [ \mathbf{q}_t, \ \mathbf{v}_t^{\mathrm{base}}, \ \mathbf{r}_t^{\mathrm{base}}, \ \bar{\mathbf{p}}_t^{\mathrm{key}}, \ h_t^{\mathrm{root}} ] $$

where:

  • q_t ∈ R^23 represents the joint positions (angles) for the robot's 23 degrees of freedom.

  • v_t^base ∈ R^6 denotes the base linear and angular velocities of the robot's root (pelvis).

  • r_t^base ∈ R^2 represents the base roll and pitch angles of the robot's root.

  • p̄_t^key ∈ R^(3 × num_keybody) corresponds to the local key body positions. Unlike some prior works that use global key body positions [8, 11], GMT uses local key body positions, similar to ExBody2 [7]. As a further refinement, these local key bodies are aligned with the robot's heading direction, which provides a more robust and invariant representation.

  • h_t^root represents the root height.

    To improve tracking performance, GMT moves beyond using only the immediate next motion frame as input. Instead, it incorporates future motion context:

  • Stacked Future Frames: A sequence of multiple consecutive frames, [g_t, ..., g_{t+100}], covering approximately two seconds of future motion, is stacked. This sequence provides information about long-term trends and upcoming movements.

  • Convolutional Encoder: These stacked frames are then processed by a convolutional encoder [41] to compress them into a latent vector z_t ∈ R^128, which effectively summarizes the temporal structure of the future motion.

  • Combined Input: The latent vector z_t is combined with the immediate next frame g_{t+1} (which represents the explicit, short-term tracking target). This combined input is then fed into the policy network.

    This design allows the policy to:

  • Capture long-term trends of the motion sequence through the latent vector (context).

  • Explicitly recognize the immediate tracking target through g_{t+1} (precision).

    The paper emphasizes that this combined input design is essential for achieving high-quality motion tracking.
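A rough PyTorch sketch of this input construction is given below. The 101-frame window (about two seconds) and the 128-dimensional latent follow the text; the 1D-convolutional layer layout, kernel sizes, and the per-frame feature dimension used in the example are assumptions:

```python
import torch
import torch.nn as nn

class FutureMotionEncoder(nn.Module):
    """Compresses a stacked window of future motion frames [g_t, ..., g_{t+100}]
    into a latent z_t and concatenates it with the immediate next target g_{t+1}."""
    def __init__(self, frame_dim, num_frames=101, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(frame_dim, 64, kernel_size=5, stride=2), nn.ELU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ELU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size for this window length
            flat = self.conv(torch.zeros(1, frame_dim, num_frames)).shape[-1]
        self.proj = nn.Linear(flat, latent_dim)

    def forward(self, future_frames, next_frame):
        # future_frames: (B, num_frames, frame_dim), ~2 s of future motion targets
        # next_frame:    (B, frame_dim), the explicit short-term target g_{t+1}
        z = self.proj(self.conv(future_frames.transpose(1, 2)))  # (B, latent_dim)
        return torch.cat([z, next_frame], dim=-1)                # fed to the policy network

# Example with an assumed per-frame feature dimension of 63
enc = FutureMotionEncoder(frame_dim=63)
policy_input = enc(torch.randn(4, 101, 63), torch.randn(4, 63))  # (4, 128 + 63)
```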

4.2.5. Sim-to-Real Transfer Details

To ensure successful transfer of the learned policy from simulation to the real robot, GMT applies several techniques:

  1. Domain Randomization [42, 43]: Various physical parameters in the simulation are randomized during training for both the teacher and student policies. This makes the policy robust to discrepancies between simulation and reality. Details of these randomizations are provided in the Experimental Setup.
  2. Action Delay: This technique introduces a delay in the execution of actions in simulation, mimicking the inherent latency in real robot control systems.
  3. Modeling Reduction Drive Inertia: The effect of the reduction drive's moment of inertia is explicitly modeled. For a reduction ratio k and a reduction drive moment of inertia I, the armature parameter in the simulator is configured as armature = k² · I. This approximates the effective inertia introduced by the reduction drive, helping to align the simulated dynamics more closely with the real robot's dynamics.

4.2.6. Observations and Actions

  • Teacher Policy Observations:

    • Proprioception (o_t): Root angular velocity (3 dimensions), root roll and pitch (2 dimensions), joint positions (23 dimensions), joint velocities (23 dimensions), and the last action (23 dimensions).
    • Privileged Information (e_t): Root linear velocity (3 dimensions), root height (1 dimension), key body positions, feet contact mask (2 dimensions), mass randomization parameters (6 dimensions), and motor strength (46 dimensions).
    • Motion Targets (g_t): As described in Section 4.2.4.
  • Student Policy Observations:

    • Proprioception (o_t): Same as for the teacher.
    • Proprioception History: A sequence of past proprioceptive observations, o_{t-20}, ..., o_t. This history helps the student policy infer unobservable states (such as linear velocity) from its own sensor readings over time.
    • Motion Targets (g_t): Same as for the teacher.
  • Action Space: The output action for both policies consists of target joint positions. The robot's low-level controller then works to achieve these target positions.
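A small sketch of how these observation vectors could be assembled is shown below; the `state` dictionary and its keys are hypothetical, the dimensions follow the lists above, and the number of key bodies is left symbolic:

```python
import numpy as np

def build_teacher_obs(state, num_keybodies):
    """Assemble proprioception o_t and privileged information e_t from a
    hypothetical dict of raw simulator quantities (NumPy arrays / floats)."""
    o_t = np.concatenate([
        state["root_ang_vel"],              # 3
        state["root_roll_pitch"],           # 2
        state["joint_pos"],                 # 23
        state["joint_vel"],                 # 23
        state["last_action"],               # 23
    ])                                      # -> 74-dim proprioception
    assert state["key_body_pos"].shape == (num_keybodies, 3)
    e_t = np.concatenate([
        state["root_lin_vel"],              # 3
        [state["root_height"]],             # 1
        state["key_body_pos"].reshape(-1),  # 3 * num_keybodies
        state["feet_contact"],              # 2
        state["mass_rand_params"],          # 6
        state["motor_strength"],            # 46
    ])
    return o_t, e_t
```

The student replaces e_t with a stack of the last 21 proprioceptive observations (o_{t-20}, ..., o_t).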

4.2.7. Reward Functions

The reward functions used in the first-stage training (for the teacher policy) guide the robot to track the motion targets. The overall reward is a sum of individual reward terms, each designed to penalize deviations from the target motion or undesired behaviors. Here are the definitions from Table 3 of the paper:

Name                         Definition
tracking joint positions     exp(−k_q · ‖q_ref − q_t‖²)
tracking joint velocities    exp(−k_q̇ · ‖q̇_ref − q̇_t‖²)
tracking root pose           exp(−k_r · ‖r_ref − r_t‖² − k_h · ‖h_ref − h_t‖²)
tracking root vel            exp(−k_v · ‖v_ref − v_t‖²)
tracking key body positions  exp(−k_p · ‖p_ref − p_t‖²)
alive                        1.0
foot slip                    −k_foot · ‖v_foot ∘ contact‖²
joint velocities             −k_joint_vel · ‖q̇_t‖²
joint accelerations          −k_joint_acc · ‖q̈_t‖²
action rate                  −k_action_rate · ‖a_t − a_{t−1}‖²

Where:

  • exp(x) denotes the exponential function e^x.

  • The coefficients k_q, k_q̇, k_r, k_h, k_v, k_p, k_foot, k_joint_vel, k_joint_acc, and k_action_rate are scaling factors that balance the contribution of each term to the total reward.

  • q_ref and q_t are the reference and current joint positions, respectively.

  • q̇_ref and q̇_t are the reference and current joint velocities, respectively.

  • r_ref and r_t are the reference and current root rotations (roll and pitch), respectively.

  • h_ref and h_t are the reference and current root heights, respectively.

  • v_ref and v_t are the reference and current root velocities, respectively.

  • p_ref and p_t are the reference and current key body positions, respectively.

  • v_foot is the foot velocity.

  • contact is a mask indicating whether a foot is in contact with the ground.

  • ∘ denotes the element-wise (Hadamard) product.

  • q̈_t is the current joint acceleration.

  • a_t and a_{t−1} are the current and previous actions, respectively.

  • ‖·‖² denotes the squared Euclidean norm.

    The "tracking" rewards are designed to maximize similarity to the reference motion, using exponential functions so that smaller errors yield higher rewards. The "alive" reward is a constant positive reward for simply staying active. The "foot slip", "joint velocities", "joint accelerations", and "action rate" terms are regularization penalties to encourage stable, smooth, and energy-efficient movements.

5. Experimental Setup

5.1. Datasets

The policies were trained using a combination of two well-known human motion datasets:

  1. AMASS (Archive of Motion Capture as Surface Shapes) [6]: This is a large, public dataset of human motion capture data.
    • Source: It aggregates various existing mocap datasets into a common framework, providing body shape and pose information.

    • Characteristics: Contains a vast array of human motions, including locomotion, interactions, sports, and daily activities. However, it exhibits significant category imbalance, with a large proportion of motions involving walking or in-place activities, as illustrated in Figure 2.

    • Example of data distribution (Figure 2 from the original paper):

      Figure 2: Distribution of motion categories in the AMASS dataset. The figure shows the proportion of the total motion duration corresponding to each category; "Walk" and "Inplace" account for the largest shares, while the remaining categories are comparatively small.

      As can be seen from Figure 2, categories like walk and in-place constitute a very large portion of the dataset's duration, while more complex or dynamic motions like dance, kick, jump, and run are comparatively rare.

    • Domain: Human motion data.

  2. LAFAN1 [40]: A dataset specifically designed for learning character animation, often used for tasks like motion in-betweening and tracking.
    • Source: A collection of motions with a focus on detailed foot contact information and diverse locomotion.

    • Characteristics: Known for clean data and suitability for physics-based character control.

    • Domain: Human motion data.

      Dataset Curation: As described in the Methodology section, the raw AMASS and LAFAN1 datasets underwent a two-stage filtering process (rule-based and performance-based) to remove infeasible or problematic motions for humanoid robots. This resulted in a curated training dataset of 8925 clips, totaling 33.12 hours of motion.

Test Sets:

  • AMASS-test: A separate test set derived from the AMASS dataset for evaluating generalization to unseen AMASS motions.

  • LAFAN1: Used as a test set to evaluate performance on the LAFAN1 dataset, even though it was part of the training data after filtering. This allows assessing how well the policy learned to track the motions present in LAFAN1.

    Why these datasets were chosen: These datasets represent a broad spectrum of human movements, making them ideal for training a general motion tracking policy. The curation process ensures that the motions are physically plausible for a robot.

5.2. Evaluation Metrics

The policy tracking performance is quantitatively evaluated using four metrics:

  1. Mean Per Keybody Position Error (E_mpkpe):

    • Conceptual Definition: This metric quantifies the average spatial discrepancy between the positions of specific key bodies (e.g., hands, feet, head, pelvis) on the robot and their corresponding target positions in the reference motion. It measures how well the robot's overall body shape and limb positions match the target. A lower value indicates better tracking.
    • Mathematical Formula: Not explicitly provided in the paper, but typically calculated as:

      $$ E_{\mathrm{mpkpe}} = \frac{1}{N_k T} \sum_{t=1}^{T} \sum_{j=1}^{N_k} \| \mathbf{p}_{t,j}^{\mathrm{robot}} - \mathbf{p}_{t,j}^{\mathrm{ref}} \|_2 $$

    • Symbol Explanation:
      • N_k: The number of key bodies being tracked.
      • T: The total number of timesteps in the motion clip.
      • p_{t,j}^robot: The 3D position of the j-th key body on the robot at timestep t.
      • p_{t,j}^ref: The 3D position of the j-th key body in the reference motion at timestep t.
      • ‖·‖_2: The Euclidean distance (L2 norm) in 3D space.
      • The unit of E_mpkpe is millimeters (mm).
  2. Mean Per Joint Position Error (E_mpjpe):

    • Conceptual Definition: This metric measures the average angular difference between the robot's joint angles and the target joint angles specified by the reference motion. It assesses the fidelity of the robot's internal pose and joint configuration compared to the desired motion. A lower value indicates better joint-level tracking.
    • Mathematical Formula: Not explicitly provided in the paper, but typically calculated as:

      $$ E_{\mathrm{mpjpe}} = \frac{1}{N_j T} \sum_{t=1}^{T} \sum_{i=1}^{N_j} | \theta_{t,i}^{\mathrm{robot}} - \theta_{t,i}^{\mathrm{ref}} | $$

    • Symbol Explanation:
      • N_j: The number of joints (degrees of freedom) of the robot.
      • T: The total number of timesteps.
      • θ_{t,i}^robot: The angular position of the i-th joint on the robot at timestep t.
      • θ_{t,i}^ref: The angular position of the i-th joint in the reference motion at timestep t.
      • |·|: The absolute difference.
      • The unit of E_mpjpe is radians (rad).
  3. Linear Velocity Error (E_vel):

    • Conceptual Definition: This metric quantifies the difference between the robot's root linear velocity (how fast and in what direction its pelvis is moving) and the target linear velocity from the reference motion. It evaluates the robot's ability to match the translational speed and direction of the reference motion. A lower value indicates better global locomotion tracking.
    • Mathematical Formula: Not explicitly provided in the paper, but typically calculated as:

      $$ E_{\mathrm{vel}} = \frac{1}{T} \sum_{t=1}^{T} \| \mathbf{v}_{t}^{\mathrm{robot}} - \mathbf{v}_{t}^{\mathrm{ref}} \|_2 $$

    • Symbol Explanation:
      • T: The total number of timesteps.
      • v_t^robot: The 3D linear velocity of the robot's root at timestep t.
      • v_t^ref: The 3D linear velocity of the reference motion's root at timestep t.
      • ‖·‖_2: The Euclidean distance.
      • The unit of E_vel is meters per second (m/s).
  4. Yaw Velocity Error (E_yaw vel):

    • Conceptual Definition: This metric measures the difference between the robot's root yaw velocity (rate of rotation around its vertical axis) and the target yaw velocity from the reference motion. It assesses the robot's ability to match the rotational speed around its vertical axis, which is crucial for turning and orientation changes. A lower value indicates better rotational tracking.
    • Mathematical Formula: Not explicitly provided in the paper, but typically calculated as:

      $$ E_{\mathrm{yaw~vel}} = \frac{1}{T} \sum_{t=1}^{T} | \omega_{t}^{\mathrm{robot}} - \omega_{t}^{\mathrm{ref}} | $$

    • Symbol Explanation:
      • T: The total number of timesteps.
      • ω_t^robot: The scalar yaw angular velocity of the robot's root at timestep t.
      • ω_t^ref: The scalar yaw angular velocity of the reference motion's root at timestep t.
      • |·|: The absolute difference.
      • The unit of E_yaw vel is radians per second (rad/s).
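Given robot and reference trajectories, the four metrics above can be computed in a few lines of NumPy (following the reconstructed formulas, which are typical definitions rather than formulas quoted from the paper):

```python
import numpy as np

def mpkpe(p_robot, p_ref):
    """Mean per key-body position error (mm). Inputs: (T, N_k, 3) positions in mm."""
    return np.linalg.norm(p_robot - p_ref, axis=-1).mean()

def mpjpe(q_robot, q_ref):
    """Mean per joint position error (rad). Inputs: (T, N_j) joint angles."""
    return np.abs(q_robot - q_ref).mean()

def vel_error(v_robot, v_ref):
    """Mean root linear-velocity error (m/s). Inputs: (T, 3) velocities."""
    return np.linalg.norm(v_robot - v_ref, axis=-1).mean()

def yaw_vel_error(w_robot, w_ref):
    """Mean root yaw-velocity error (rad/s). Inputs: (T,) scalar yaw rates."""
    return np.abs(w_robot - w_ref).mean()

# Example with random placeholder trajectories (T = 100 steps, 6 key bodies, 23 joints)
T = 100
print(mpkpe(np.random.rand(T, 6, 3), np.random.rand(T, 6, 3)))
print(mpjpe(np.random.rand(T, 23), np.random.rand(T, 23)))
```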

5.3. Baselines

GMT's performance is compared against ExBody2 [7] in simulation.

  • ExBody2 [7]: This prior work also focuses on expressive humanoid whole-body control and achieved good tracking performance, but with several separate specialist policies. The authors re-implemented ExBody2 and trained it on their filtered AMASS + LAFAN1 dataset to ensure a fair comparison under the same data conditions.

5.4. Training Details

  • Simulator: IsaacGym [44] was used as the physics simulator. IsaacGym is known for its ability to run thousands of simulation environments in parallel on a single GPU, significantly accelerating RL training.
  • Parallel Environments: 4096 parallel environments were used, enabling massive parallelization.
  • GPU: Training was performed on an RTX4090 GPU.
  • Training Time:
    • Privileged Teacher Policy: Approximately 3 days.
    • Deployable Student Policy: Approximately 1 day.
  • Simulation Frequency: 500 Hz. This means the physics simulation updates 500 times per second.
  • Control Frequency: 50 Hz. This means the policy outputs new actions 50 times per second. The ratio (500/50 = 10) implies that for every policy action, the simulator takes 10 physics steps.
  • Samples: Each policy was trained using approximately 6.8 billion samples.
  • Validation: The trained policy was validated in MuJoCo [47] (a physics engine often used for robotics research) before deployment onto the real robot. This provides an additional verification step.
  • Robot: For real-world experiments, the policy was deployed on a Unitree G1 [45], a medium-sized humanoid robot with 23 DoFs and a height of 1.32 meters.

5.5. Domain Randomizations

To mitigate the sim-to-real gap, extensive domain randomizations were applied during training. The details are shown in Table 4 from the original paper:

Name                     Range
Terrain Height           [0, 0.02] m
Gravity                  [−0.1, 0.1]
Friction                 [0.1, 2.0]
Robot Base Mass          [−3, 3] kg
Robot Base Mass Center   [−0.05, 0.05] m
Push Velocity            [0.0, 1.0] m/s
Motor Strength

Where:

  • Terrain Height: Randomizes the height variations in the terrain, making the robot robust to uneven surfaces.
  • Gravity: Randomly perturbs the gravity vector, helping the robot learn to compensate for slight variations in perceived gravity or inertial forces.
  • Friction: Randomizes the friction coefficients of the robot's contacts, making the policy robust to varying surface slipperiness.
  • Robot Base Mass: Randomizes the mass of the robot's base (torso/pelvis), forcing the policy to adapt to different inertial properties. The range [−3, 3] kg is an additive offset applied to the nominal base mass.
  • Robot Base Mass Center: Randomizes the position of the center of mass of the robot's base, which affects its balance and dynamics.
  • Push Velocity: Applies random external pushes to the robot, characterized by a velocity range, to improve its ability to recover from disturbances and maintain balance.
  • Motor Strength: Randomizes the strength (torque output) of the robot's motors, making the policy robust to variations in motor performance or wear.
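A sketch of drawing one set of randomization parameters at an environment reset, using the ranges recovered from Table 4 (the motor-strength range is not legible in the source table and is therefore omitted here):

```python
import numpy as np

RANDOMIZATION_RANGES = {
    "terrain_height_m":     (0.0, 0.02),
    "gravity_offset":       (-0.1, 0.1),
    "friction":             (0.1, 2.0),
    "base_mass_offset_kg":  (-3.0, 3.0),
    "base_com_offset_m":    (-0.05, 0.05),
    "push_velocity_mps":    (0.0, 1.0),
}

def sample_randomization(rng=np.random):
    """Draw one value per parameter, uniformly within its range, at environment reset."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

print(sample_randomization())
```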

6. Results & Analysis

6.1. Core Results Analysis

The experiments evaluate GMT's performance in both simulation and real-world settings, focusing on the contributions of Adaptive Sampling, Motion Mixture-of-Experts, and Motion Inputs.

The following are the results from Table 2 of the original paper:

                       AMASS-Test                                              LAFAN1
Method                 Empkpe ↓     Empjpe ↓      Evel ↓        Eyaw vel ↓    Empkpe ↓    Empjpe ↓      Evel ↓        Eyaw vel ↓

Teacher & Student
Privileged Policy      42.07        0.0834        0.1747        0.2238        45.16       0.0975        0.2837        0.3314
Student Policy         4.01         0.0807        0.1943        0.2211        46.14       0.1060        0.3009        0.3489

Baseline
ExBody2 [7]            50.28±0.28   0.0925±0.001  0.1875±0.001  0.3402±0.004  58.36±0.48  0.1378±0.002  0.3461±0.005  0.4260±0.006
GMT (ours)             42.07±0.17   0.0834±0.001  0.1747±0.001  0.2238±0.002  45.16±0.35  0.0975±0.001  0.2837±0.004  0.3314±0.003

(a) Ablations
GMT w.o. MoE           43.54±0.23   0.0874±0.00   0.1902±0.002  0.2483±0.001  48.26±0.29  0.1041±0.002  0.3111±0.003  0.3795±0.005
GMT w.o. A.S.          42.53±0.19   0.0872±0.001  0.2064±0.001  0.2593±0.001  49.61±0.30  0.1019±0.001  0.3019±0.003  0.3574±0.003
GMT w.o. A.S. & MoE    44.34±0.21   0.0920±0.001  0.2121±0.001  0.2534±0.001  52.34±0.33  0.1110±0.002  0.3263±0.003  0.3584±0.007
GMT (ours)             42.07±0.17   0.0834±0.001  0.1747±0.001  0.2238±0.002  45.16±0.35  0.0975±0.001  0.2837±0.004  0.3314±0.003

(b) Motion Inputs
GMT-M                  46.02±0.25   0.0942±0.001  0.2282±0.001  0.3311±0.003  51.16±0.34  0.1069±0.002  0.3476±0.001  0.4890±0.007
GMT-L0.5-M             43.64±0.19   0.0855±0.001  0.2051±0.001  0.2439±0.001  49.87±0.32  0.1032±0.002  0.3346±0.005  0.3648±0.003
GMT-L1-M               43.15±0.22   0.0867±0.001  0.1989±0.002  0.2465±0.001  47.41±0.35  0.1007±0.002  0.3047±0.003  0.3513±0.002
GMT-L2                 49.52±0.27   0.1016±0.002  0.2201±0.001  0.2888±0.003  61.24±0.42  0.1368±0.002  0.3925±0.008  0.5558±0.009
GMT-L2-M (ours)        42.07±0.17   0.0834±0.001  0.1747±0.001  0.2238±0.002  45.16±0.35  0.0975±0.001  0.2837±0.004  0.3314±0.003

6.1.1. Baseline Comparison

The row labeled GMT (ours) in the "Baseline" section of Table 2 represents the full GMT model. It consistently outperforms ExBody2 [7] across all metrics on both the AMASS-Test and LAFAN1 datasets.

  • AMASS-Test: GMT achieves Empkpe of 42.07 mm compared to ExBody2's 50.28 mm, and Empjpe of 0.0834 rad compared to 0.0925 rad. Similar improvements are observed in Evel (0.1747 m/s vs 0.1875 m/s) and Eyaw vel (0.2238 rad/s vs 0.3402 rad/s).

  • LAFAN1: The performance gap is even larger on LAFAN1, with GMT scoring 45.16 mm Empkpe vs ExBody2's 58.36 mm, and 0.0975 rad Empjpe vs 0.1378 rad.

    This strong performance demonstrates that GMT significantly improves both local tracking (key body and joint positions) and global tracking (linear and yaw velocities) compared to a state-of-the-art baseline, achieving better fidelity and generalization with a single policy.

6.1.2. Teacher & Student Performance

The "Teacher & Student" section shows the performance of the Privileged Policy (teacher) and tutu Policy (student). The term "tutu Policy" seems to be a typo or placeholder, as its Empkpe is suspiciously low (4.01), and Empipe is 0807 (likely 0.0807). If we assume the "tutu Policy" is the student policy, its performance is slightly worse than the privileged policy, which is expected as the student lacks privileged information. However, the student's performance is still quite strong and close to the teacher's, indicating successful knowledge transfer. The table entry "G (ours)" is identical to "Privileged Policy", confirming that the main comparisons and ablations are on the privileged teacher policy.

6.2. Ablation Studies

The "Ablations" section (a) in Table 2 investigates the contribution of Motion MoE and Adaptive Sampling.

  • GMT w.o. MoE (without Mixture-of-Experts):

    • Empkpe increases to 48.26 mm (from 45.16 mm) on LAFAN1.
    • Empjpe increases to 0.0874 rad (from 0.0834 rad) on AMASS-Test.
    • Evel and Eyaw vel also show degradation.
    • This indicates that the MoE architecture is crucial for maintaining high tracking accuracy, especially for capturing the diversity and complexity of motions.
  • GMT w.o. A.S. (without Adaptive Sampling):

    • Empkpe on AMASS-Test worsens to 42.53 mm (from 42.07 mm).
    • Empkpe on LAFAN1 worsens to 49.61 mm (from 45.16 mm).
    • All other metrics also show noticeable degradation.
    • This confirms that Adaptive Sampling is effective in improving tracking performance, particularly by focusing training on more challenging motion segments and ensuring balanced learning across motion categories.
  • GMT W.O. A.S. & MoE (without both Adaptive Sampling and Mixture-of-Experts):

    • This variant shows the worst performance across all metrics and datasets, with Empkpe of 44.34 mm on AMASS-Test and 52.34 mm on LAFAN1, and similarly poor results for other metrics.
    • This clearly demonstrates the synergistic effect of Adaptive Sampling and MoE. When both are removed, the policy's ability to track diverse motions significantly deteriorates, highlighting their combined importance for GMT's state-of-the-art performance.

6.2.1. Motion MoE Analysis

The MoE (Mixture-of-Experts) architecture specifically helps in improving performance on more challenging motions.

  • Quantitative Evidence: As shown in Figure 5, the MoE helps significantly reduce tracking errors, especially in the top percentiles (i.e., for the hardest-to-track motions) for Empkpe, Empjpe, Evel, and Eyaw vel. This suggests that the specialization of experts allows the policy to better handle complex and diverse movements that a single, monolithic network might struggle with.

  • Qualitative Evidence: Figure 4 visualizes the output of the gating network on a composite motion sequence (standing, kicking, walking backward, standing again). It shows clear transitions in expert activation across different phases of the motion. For example, one expert might be active during standing, another during kicking, and yet another during walking. This validates the intended role of MoE—individual experts specialize in different types of motion, allowing for better overall generalization and performance on a wide range of skills.

    The following figure (Figure 4 from the original paper) plots the output of the gating network:

    Figure 4: Plot of the output of the gating network over time on a motion clip composed of a sequence of skills. The top row shows snapshots of the robot during the motion; the bottom plot shows how the gating weights of the individual experts evolve over the timesteps.

The following figure (Figure 5 from the original paper) shows top percentile tracking errors:

Figure 5: Top percentile tracking errors on the whole AMASS dataset. Blue bars: GMT; orange bars: GMT without Adaptive Sampling; green bars: GMT without the Motion Mixture-of-Experts. Tracking errors for key body positions, joint positions, linear velocity, and yaw velocity are grouped by percentile range.

6.2.2. Adaptive Sampling Analysis

Similar to MoE, Adaptive Sampling also significantly improves performance, especially on challenging motions.

  • Quantitative Evidence: Table 2 (a) shows that GMT w.o. A.S. performs worse than the full GMT. Figure 5 also illustrates that Adaptive Sampling (blue vs. orange bars) leads to lower tracking errors, particularly in the higher percentiles, confirming its effectiveness on difficult motions.
  • Qualitative Evidence: Figure 6 compares policies trained with and without Adaptive Sampling on a difficult segment extracted from a long motion clip.
    • Visualization (Figure 6a): Without Adaptive Sampling, the policy struggles to learn this clip, often failing to balance and exhibiting unstable behaviors. With Adaptive Sampling, the policy tracks the motion with high quality and stability.

    • Torque Outputs (Figure 6b): The torque outputs for key joints (knee and hip roll) also differ. Without Adaptive Sampling, the torque profiles might be erratic or insufficient, leading to instability. With Adaptive Sampling, the torques are likely more controlled and effective in executing the motion. This shows that Adaptive Sampling is critical not only for simulation performance but also for practical real-world deployment.

      The following figure (Figure 6 from the original paper) illustrates the performance of policies with and without Adaptive Sampling:

      Figure 6: The performance of policies with and without Adaptive Sampling on one segment extracted from a long motion clip. (a) Visualization of policy performance in the simulator. (b) Torque outputs of several joints corresponding to this segment. The comparison illustrates how Adaptive Sampling improves the robot's stability and overall performance.

6.2.3. Motion Inputs Analysis

The "Motion Inputs" section (b) in Table 2 investigates how different configurations of motion inputs affect tracking performance.

  • GMT-M: Only the immediate next frame of motion is provided as input. This results in significantly worse performance across all metrics compared to the full GMT model (e.g., Empkpe 46.02 mm on AMASS-Test vs 42.07 mm for GMT). This confirms that immediate next frame alone is insufficient for high-quality tracking.

  • GMT-Lx-M (e.g., GMT-L0.5-M, GMT-L1-M): Both the immediate next frame and a window of x seconds of future motion frames are provided as input.

    • GMT-L0.5-M (0.5 seconds of future motion + immediate frame) shows improvement over GMT-M, but still worse than GMT.
    • GMT-L1-M (1 second of future motion + immediate frame) further improves, getting closer to the full GMT performance. This indicates that providing more future context is beneficial.
  • GMT-L2 (2 seconds of future motion, without immediate next frame): This variant shows a significant degradation in performance compared to GMT-L1-M and even GMT-M on LAFAN1. For instance, Empkpe is 49.52 mm on AMASS-Test and 61.24 mm on LAFAN1, which is much worse than the full GMT.

    • Explanation: The authors explain that while a sequence of future frames captures the overall tendency of upcoming motions, it might lose some detailed information when compressed into a latent vector. The immediate next frame provides the nearest relevant information and the explicit tracking target. Therefore, combining the long-term context (latent vector from future frames) with the immediate target (next frame) is crucial for precision and high-quality tracking.
  • GMT-L2-M (ours): This is the full GMT model, which uses an encoded window of 2 seconds of future motion plus the immediate next frame. It consistently achieves the best performance, validating the chosen motion input design; a minimal sketch of this input assembly is shown after this list.
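
As a rough illustration of the GMT-L2-M input design discussed above, the following PyTorch sketch compresses a window of future reference frames into a latent vector and concatenates it with the uncompressed next-frame target and proprioception. The module name, the MLP encoder, and all dimensions are assumptions for illustration; the paper's actual encoder and observation layout may differ.

```python
import torch
import torch.nn as nn

class MotionInputEncoder(nn.Module):
    """Hypothetical sketch: encode a future motion window, keep the next frame explicit."""

    def __init__(self, frame_dim=64, window_len=100, latent_dim=32, proprio_dim=48):
        super().__init__()
        # Simple MLP over the flattened future window (e.g., 2 s at 50 Hz -> 100 frames).
        self.future_encoder = nn.Sequential(
            nn.Linear(frame_dim * window_len, 256), nn.ELU(),
            nn.Linear(256, latent_dim),
        )
        self.obs_dim = latent_dim + frame_dim + proprio_dim

    def forward(self, future_frames, next_frame, proprio):
        # future_frames: (B, window_len, frame_dim); next_frame: (B, frame_dim); proprio: (B, proprio_dim)
        z = self.future_encoder(future_frames.flatten(1))   # compressed long-horizon context
        return torch.cat([z, next_frame, proprio], dim=-1)  # explicit target kept uncompressed
```

Keeping the next frame uncompressed alongside the latent mirrors the finding above: the latent carries long-horizon context, while the explicit next-frame target preserves the fine detail needed for precise tracking.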

6.3. Real-World Deployment

The efficacy of GMT is further validated through real-world deployment on a Unitree G1 humanoid robot. As shown in Figure 1, the robot is able to reproduce a wide array of human motions, including:

  • Stylized walking

  • High kicking

  • Dancing

  • Spinning

  • Crouch walking

  • Soccer kicking

  • Stretching

  • Kungfu

  • Boxing, running, side stepping, squatting

    The deployment results demonstrate that the single, unified GMT policy achieves high fidelity and state-of-the-art performance in a real-world setting, which is a significant achievement given the challenges of sim-to-real transfer and the diversity of motions.

The following figure (Figure 1 from the original paper) showcases the real-world deployment:

Figure 1: We deploy the general unified motion tracking policy on a medium-sized humanoid robot. GMT can perform a wide range of motion skills with good stability and generalizability, including (a) stretching, (b) kicking-ball, (c) dancing, (d) high kicking, (e) kungfu, and (f) other dynamic skills such as boxing, running, side stepping, and squatting.

6.4. Applications - Tracking MDM-Generated Motions

To further assess the generalizability of GMT, the trained policy was tested with motions generated by Motion Diffusion Models (MDM) [46] in MuJoCo sim-to-sim settings.

  • Results (Figure 7): GMT performs well on motions generated from text prompts by MDM, such as bowing, crouching, drinking while walking, sitting while pouring water, stretching arms, and various walking styles.

  • Significance: This demonstrates GMT's potential to be applied to downstream tasks where motions are synthesized rather than drawn from traditional mocap datasets, opening avenues for humanoids to execute a broader range of AI-generated behaviors. A hedged sketch of such a sim-to-sim evaluation loop is given after Figure 7 below.

    The following figure (Figure 7 from the original paper) demonstrates motion tracking on MDM-generated motions:

    Figure 7: Motion tracking on MDM-generated motions. The figure depicts tracked scenarios including bowing, crouching, drinking while walking, sitting while pouring water, stretching the arms, and walking.
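
For context, the following is a hedged sketch of what a MuJoCo sim-to-sim evaluation of a generated motion could look like. Only the MuJoCo calls (`MjModel.from_xml_path`, `MjData`, `mj_step`) are real API; the `reference` array of retargeted MDM frames, the `policy` callable, and the `build_observation` helper are hypothetical placeholders, since the paper does not specify this interface.

```python
import mujoco

def track_generated_motion(model_path, reference, policy, build_observation, control_dt=0.02):
    """Roll out a tracking policy on a retargeted, generated reference motion (sketch).

    reference: (T, frame_dim) array of retargeted reference frames (hypothetical format).
    policy: callable mapping an observation vector to actuator commands for data.ctrl.
    build_observation: callable assembling proprioception plus reference targets at step t.
    """
    model = mujoco.MjModel.from_xml_path(model_path)
    data = mujoco.MjData(model)

    n_substeps = int(round(control_dt / model.opt.timestep))
    for t in range(len(reference) - 1):
        obs = build_observation(model, data, reference, t)
        data.ctrl[:] = policy(obs)           # e.g., PD targets for the actuated joints
        for _ in range(n_substeps):
            mujoco.mj_step(model, data)      # advance physics at the simulator timestep
    return data
```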

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces GMT, a novel, general, and scalable framework for training a single unified policy that enables humanoid robots to track diverse whole-body motions in the real world. The core innovations are the Adaptive Sampling strategy, which efficiently addresses dataset imbalance by prioritizing difficult motions, and the Motion Mixture-of-Experts (MoE) architecture, which enhances model expressiveness and specialization across the motion manifold. Extensive experiments in simulation and real-world deployment on a Unitree G1 robot demonstrate that GMT achieves state-of-the-art performance across a wide spectrum of complex and dynamic motions. Furthermore, GMT shows strong generalizability by effectively tracking motions generated by Motion Diffusion Models. The authors conclude that GMT provides a robust foundation for future whole-body control development in humanoid robotics.

7.2. Limitations & Future Work

The authors acknowledge several limitations of GMT:

  • Lack of Contact-Rich Skills: Due to the significant complexity in simulating contact-rich behaviors (e.g., precise friction, collision response, impact forces) and hardware limitations of current robots, GMT does not currently support skills like getting up from a fallen state or rolling on the ground. These skills involve intricate interactions with the environment that are hard to model and execute robustly.

  • Limitations on Challenging Terrains: The current policy is trained without any terrain observations and is not designed for imitation on challenging terrains such as slopes or stairs. Its performance is primarily validated on flat surfaces.

    Based on these limitations, the authors suggest the following future work:

  • Extending the framework to develop a general and robust controller capable of operating across both flat and challenging terrains. This would likely involve incorporating terrain information into observations and potentially adapting the reward functions or policy architecture for such environments.

  • Future work could also explore methods to handle contact-rich skills by integrating more advanced physics modeling or specialized RL techniques for robust contact control.

7.3. Personal Insights & Critique

This paper presents a significant step forward in general motion tracking for humanoid robots. The Adaptive Sampling and MoE combination is an elegant solution to the perennial problems of data imbalance and model capacity in large-scale RL training, especially for complex real-world robotics. The demonstration on a real robot is particularly compelling, showcasing the practical utility of the proposed methods.

Inspirations and Applications:

  • The Adaptive Sampling strategy could be broadly applied to other RL tasks where training data is highly imbalanced or where certain "hard" examples are critical for robust performance. This goes beyond motion tracking and could be useful in areas like manipulation, navigation, or robot learning from demonstrations.
  • The MoE architecture, especially with its demonstrated ability to specialize for different motion phases, offers a powerful paradigm for designing highly expressive and flexible policies for diverse robot behaviors. It suggests a path towards policies that can seamlessly blend different skill sets.
  • The successful tracking of MDM-generated motions opens up exciting possibilities for content creation and personalization in robotics. Robots could potentially execute motions derived from natural language commands or artistic expressions, blurring the lines between human intent and robot execution.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Scaling of MoE: While MoE improves expressiveness, the computational overhead might increase with a larger number of experts or for very high-dimensional inputs. The paper does not deeply elaborate on the specific number of experts used or the computational cost trade-offs. Further analysis on the optimal number of experts and their computational efficiency could be valuable.

  • Generalizability of Adaptive Sampling: The current Adaptive Sampling relies on tracking performance-based probabilities and specific error thresholds. While effective for motion tracking, adapting this strategy to other RL tasks might require careful definition of "completion level" and "tracking error" for those specific tasks. The specific constants (e.g., 0.25, 0.6, 0.15, exponent 5) might need fine-tuning for different domains.

  • Teacher-Student Gap: Although the student policy performs well, there is still a performance gap compared to the privileged teacher. Investigating methods to further close this gap, perhaps through more advanced distillation techniques or meta-learning for sim-to-real transfer, could lead to even more robust real-world performance.

  • Energy Efficiency/Smoothness: While the reward function includes terms for joint velocities and action rate to promote smoothness, a more explicit focus on energy efficiency or motor wear in the reward formulation might be beneficial for long-term real-world deployment. The torque plots in Figure 6b are good qualitative indicators, but quantitative metrics for these aspects could be useful.

  • Robustness to Perception Errors: The student policy relies on proprioceptive observations and their history. Real-world sensor noise and potential drift can introduce errors. The paper mentions domain randomization, which helps, but specific robustness against various types of perception errors could be further explored.

  • Interaction with Objects: The current focus is on tracking motions. Extending GMT to handle dynamic interaction with objects (e.g., picking up, manipulating, pushing) while tracking complex motions would be a challenging but critical next step for truly general-purpose humanoids. This would require integrating force/torque sensing and reactive control.

    Overall, GMT provides a powerful and practical framework for advancing humanoid robot control, setting a new benchmark for unified policy performance on diverse whole-body motions in the real world.
