GMT: General Motion Tracking for Humanoid Whole-Body Control
TL;DR Summary
The paper presents GMT, a motion tracking framework enabling humanoid robots to track diverse full-body motions in real-world settings. It features an Adaptive Sampling strategy and a Motion Mixture-of-Experts architecture, demonstrating state-of-the-art performance through extensive experiments in simulation and on a real Unitree G1 robot.
Abstract
The ability to track general whole-body motions in the real world is a useful way to build general-purpose humanoid robots. However, achieving this can be challenging due to the temporal and kinematic diversity of the motions, the policy's capability, and the difficulty of coordination of the upper and lower bodies. To address these issues, we propose GMT, a general and scalable motion-tracking framework that trains a single unified policy to enable humanoid robots to track diverse motions in the real world. GMT is built upon two core components: an Adaptive Sampling strategy and a Motion Mixture-of-Experts (MoE) architecture. The Adaptive Sampling automatically balances easy and difficult motions during training. The MoE ensures better specialization of different regions of the motion manifold. We show through extensive experiments in both simulation and the real world the effectiveness of GMT, achieving state-of-the-art performance across a broad spectrum of motions using a unified general policy. Videos and additional information can be found at https://gmt-humanoid.github.io.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "GMT: General Motion Tracking for Humanoid Whole-Body Control". The central topic is the development of a unified and scalable framework for humanoid robots to track diverse whole-body motions in the real world.
1.2. Authors
The authors are Zixuan Chen, Mazeyu Ji, Xue Bin Peng, Xuxin Cheng, Xuanbin Peng, and Xiaolong Wang. Their affiliations are:
- UC San Diego
- Simon Fraser University
The paper notes equal contribution among the lead authors and equal advising by the senior authors.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server. While not a peer-reviewed journal or conference in its current form, arXiv is a widely recognized platform for disseminating early research in various fields, including robotics and AI. Papers posted here are often submitted to prestigious conferences or journals later.
1.4. Publication Year
The paper was published on June 17, 2025 (arXiv timestamp 2025-06-17T17:59:33 UTC).
1.5. Abstract
The paper addresses the challenge of building general-purpose humanoid robots capable of tracking diverse whole-body motions in the real world. This task is difficult due to the temporal and kinematic variability of motions, the limitations of current control policies, and the complex coordination required for upper and lower bodies. To tackle these issues, the authors propose GMT (General Motion Tracking), a scalable framework that trains a single, unified policy. GMT features two main components: an Adaptive Sampling strategy to balance the training on easy and difficult motions, and a Motion Mixture-of-Experts (MoE) architecture to enable specialization across different regions of the motion manifold. Through extensive experiments in both simulation and real-world deployments, GMT demonstrates state-of-the-art performance across a wide spectrum of motions using a single general policy.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2506.14770. The PDF link is https://arxiv.org/pdf/2506.14770v2.pdf. This paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is enabling humanoid robots to perform a wide range of general whole-body motions in the real world using a single, unified controller. This is a crucial step towards building general-purpose humanoid robots that can operate effectively in human environments.
This problem is important because manually designing controllers for high-degree-of-freedom (DoF) humanoid systems is extremely challenging and labor-intensive. While learning-based methods have shown promise in simulated environments, transferring these capabilities to real-world robots faces several significant hurdles:
- Partial Observability: Real robots lack full state information (e.g., linear velocities, global root positions) that is often assumed in simulations, making policy learning more difficult.
- Hardware Limitations: Many human motions are infeasible for robots due to physical constraints, and even feasible motions may require more torque or speed than robots can provide, leading to mismatches.
- Unbalanced Data Distribution: Large human motion datasets (like AMASS) are often dominated by common, simple motions (e.g., walking), with a scarcity of complex or dynamic skills, making it hard for policies to learn these less frequent but important movements.
- Model Expressiveness: Simple neural network architectures (like MLPs) struggle to capture the complex temporal dependencies and diverse motion categories present in large motion datasets, limiting tracking performance and generalization.

Existing works have addressed some of these individual issues (e.g., teacher-student training for partial observability, specialized policies for different motion categories, transformer models for expressiveness), but developing a unified general motion tracking controller that handles all these challenges simultaneously remains an open problem.
The paper's entry point is to address the data distribution and model expressiveness problems jointly, combined with careful design decisions for partial observability and hardware issues, to create an effective system for training general motion tracking controllers for real humanoid robots.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- GMT Framework: Proposing GMT, a general and scalable motion-tracking framework that trains a single unified policy for real-world humanoid robots to track diverse motions.
- Adaptive Sampling Strategy: Introducing a novel Adaptive Sampling strategy that mitigates issues arising from uneven motion category distributions in large datasets by dynamically balancing easy and difficult motions during training. This ensures that the policy dedicates more learning effort to challenging segments.
- Motion Mixture-of-Experts (MoE) Architecture: Integrating a Motion Mixture-of-Experts architecture into the policy network to enhance model expressiveness and generalizability. This allows different "experts" to specialize in different types of motions, improving performance across the entire motion manifold.
- Comprehensive Motion Input Design: Developing an effective motion input design that combines the immediate next motion frame with a compressed future motion sequence, enabling the policy to capture both long-term trends and immediate tracking targets.
- State-of-the-Art Performance: Demonstrating through extensive simulation and real-world experiments (on a Unitree G1 robot) that GMT achieves state-of-the-art performance across a broad spectrum of motions (e.g., stylized walking, high kicking, dancing, boxing, running) using a single, unified policy.
- Generalizability to MDM-Generated Motions: Showing that the trained policy can effectively track motions generated by Motion Diffusion Models (MDM), indicating its potential for broader applications in downstream tasks.

The key conclusions are that by jointly addressing data distribution and model expressiveness challenges, along with robust design choices for real-world deployment, it is possible to create a highly effective general motion tracking controller for humanoid robots. This unified controller can serve as a foundational component for future whole-body algorithm development.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following foundational concepts:
- Humanoid Robots: Robots designed to resemble the human body, typically having two legs, a torso, two arms, and a head. They are characterized by a high number of Degrees of Freedom (DoFs), making their control complex.
- Whole-Body Control: A control strategy that coordinates all joints and limbs of a robot simultaneously to achieve a desired task, maintain balance, and interact with the environment. It is crucial for dynamic and agile movements in humanoids.
- Motion Tracking/Imitation: The task of making a robot follow a reference motion (often human motion data) as closely as possible. This involves minimizing the difference between the robot's pose and the target pose over time.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent learns a policy—a mapping from states to actions.
  - Agent: The entity that learns and makes decisions. In this paper, it is the humanoid robot's controller.
  - Environment: The physical or simulated world in which the agent operates.
  - State ($s_t$): A representation of the environment at a given time. For a robot, this includes joint positions, velocities, root orientation, etc.
  - Action ($a_t$): A command issued by the agent to the environment. For a robot, this could be target joint positions or torques.
  - Reward ($r_t$): A scalar feedback signal from the environment indicating how good or bad the agent's action was. The goal is to maximize cumulative reward.
  - Policy ($\pi$): The strategy the agent uses to decide what action to take in a given state, often represented by a neural network.
- Deep Reinforcement Learning (DRL): RL combined with deep neural networks to approximate policies and value functions, enabling learning in high-dimensional state and action spaces.
- Proximal Policy Optimization (PPO) [38]: A popular DRL algorithm used for training RL agents. It is an on-policy algorithm that takes the largest possible policy-update step without letting the new policy deviate too far from the old one, using a clipped objective function.
- Mixture-of-Experts (MoE) [originally introduced by Jacobs et al., 1991]: A neural network architecture consisting of multiple "expert" sub-networks and a "gating" network. The gating network learns to choose (or combine the outputs of) which expert(s) to use for a given input. This allows the model to specialize different parts of the network for different types of inputs or tasks, improving expressiveness and capacity without a proportional increase in per-input computational cost.
- Teacher-Student Training Framework: A common approach in robotics and RL for sim-to-real transfer. A teacher policy (often with access to privileged information in simulation) is first trained. Then, a student policy (which only uses real-world observable information) is trained to imitate the teacher's actions, e.g., via behavioral cloning or distillation.
- DAgger (Dataset Aggregation) [39]: An imitation learning algorithm used in the teacher-student framework. It iteratively collects data by running the student policy, querying the teacher for optimal actions on the states the student encounters, and aggregating this new data into the training dataset. This addresses the covariate shift problem (where the student policy encounters states not seen during initial training on expert data).
- Domain Randomization [42]: A sim-to-real transfer technique where various physical parameters (e.g., friction, mass, sensor noise) in the simulation are randomly varied during training. This forces the policy to be robust to a wide range of conditions, making it more likely to generalize to the real world, which has inherent uncertainties.
- Convolutional Neural Network (CNN) [41]: A type of deep neural network commonly used for processing grid-like data, such as images, and effective at extracting hierarchical features. In this paper, a CNN is used as an encoder to compress sequences of motion frames into a latent vector.
- Latent Vector: A low-dimensional representation of higher-dimensional data, capturing its essential features. Encoders are used to generate latent vectors.
3.2. Previous Works
The paper discusses related works in two main categories: Learning-based Humanoid Whole-Body Control and Humanoid Motion Imitation.
3.2.1. Learning-based Humanoid Whole-Body Control
Traditional model-based methods for humanoid control, like those for gait planning [12, 13, 14], are robust but labor-intensive due to complex dynamics. Recent learning-based approaches have used either hand-designed task rewards or human motions as reference.
- Task-specific controllers: Many RL approaches focus on specific tasks, such as walking [15, 16, 17, 18, 19], jumping [25, 26, 27], or fall recovery [28, 29]. These policies are generally specialized and not easily transferable across tasks; for example, a policy trained for walking may not be able to jump. This highlights the need for general-purpose controllers.
3.2.2. Humanoid Motion Imitation
Leveraging human motion data is a promising path for general-purpose controllers, especially in character animation:
- Simulated Characters: Works like DeepMimic [1], ASE [2], Perpetual Humanoid Control [3], PDP [4], and MaskedMimic [5] have achieved high-quality and general motion tracking for simulated characters. They can reproduce a wide variety of motions and perform diverse skills with human-like behaviors.
- Real-Robot Challenges: Transferring these techniques to real robots [10, 34, 9, 8, 7, 11, 35, 36, 37] is challenging due to partial observability and hardware limitations.
  - Decoupled Control: Some approaches decouple upper-body and lower-body control (e.g., ExBody [10], Mobile-TeleVision [34]) to manage the trade-off between expressiveness and stability. This often means sacrificing some whole-body coordination.
  - Full-sized Robot Imitation: HumanPlus [9] and OmniH2O [8] achieved whole-body motion imitation on full-sized robots, but often with unnatural lower-body movements, indicating a lack of fine-tuned control or balance.
  - Specialist Policies: ExBody2 [7] achieved better whole-body tracking but required several separate specialist policies, contradicting the goal of a single, unified controller.
  - Mocap Dependency: VMP [35] showed high-fidelity reproduction but depended on a motion capture system during deployment, limiting its real-world applicability.
  - Sim-to-Real Alignment: ASAP [11] focused on aligning simulation and real-world physics for agile whole-body skills.
3.2.3. Differentiation Analysis
The core innovations and differences of GMT compared to the main methods in related work are:
- Single Unified Policy: Unlike ExBody2 [7], which uses separate specialist policies for different skills, GMT trains a single unified policy capable of tracking a broad range of diverse motions (upper-body, lower-body, dynamic, static) with high fidelity.
- Addressing Data Imbalance: GMT introduces Adaptive Sampling to specifically tackle the unbalanced data distribution problem prevalent in large mocap datasets (like AMASS), where simpler motions dominate. This is distinct from prior works that might just filter or augment data.
- Enhanced Model Expressiveness: GMT uses a Motion Mixture-of-Experts (MoE) architecture to improve the policy's capacity and specialization across the motion manifold. While transformer models (like in HumanPlus [9]) are also used for expressiveness, MoE provides a different mechanism for specialization, allowing different "experts" to handle distinct motion types.
- Robust Motion Input: GMT utilizes a more sophisticated motion input representation that combines the immediate next frame with a CNN-encoded latent vector of future motion sequences. This provides the policy with both short-term targets and long-term context, which is shown to be crucial for high-quality tracking, unlike prior works that only use the immediate next frame [7, 10, 8].
- Real-World Deployment with Stability: GMT successfully deploys its unified policy on a real-world humanoid robot (Unitree G1), demonstrating stable and generalized performance across diverse skills, which has been a persistent challenge for other whole-body imitation methods that often show unnatural movements or require external systems.
3.2.4. Technological Evolution
The field has evolved from manual controller design to task-specific learning-based controllers, then to general learning-based controllers in simulation, and now towards robust, general controllers for real-world humanoid robots. GMT fits into the latest stage by addressing key challenges in sim-to-real transfer, data handling, and model capacity to enable a single policy for diverse real-robot motions. It builds upon teacher-student frameworks and domain randomization but innovates in how it handles motion data complexity and policy specialization.
4. Methodology
4.1. Principles
The core idea behind GMT is to train a single, unified policy that can track a wide variety of whole-body human motions on a real humanoid robot. The theoretical basis or intuition behind it is that by effectively managing the diversity and imbalance of large motion datasets and enhancing the policy network's capacity to specialize, a general controller can emerge. This is achieved by combining insights from reinforcement learning and imitation learning, specifically using a teacher-student framework, adaptive sampling for data efficiency, and a Mixture-of-Experts architecture for model expressiveness.
The motion tracking problem is formulated as a goal-conditioned Reinforcement Learning (RL) problem. In RL, an agent learns to make decisions in an environment to maximize a cumulative reward. In goal-conditioned RL, the agent is given a specific goal to achieve.
At each timestep $t$, the agent's policy $\pi$ takes the current state $s_t$ and the goal $g_t$ as input and outputs an action $a_t$, formulated as $a_t \sim \pi(a_t \mid s_t, g_t)$.
When this action is applied to the environment, it leads to a new state $s_{t+1}$ according to the environment dynamics $p(s_{t+1} \mid s_t, a_t)$, and the agent receives a reward $r_t$.
The ultimate goal of the agent is to maximize the expected return, which is the sum of discounted future rewards:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \gamma^{t} r_t\right]$$
where:
- $J(\pi)$ is the objective function to maximize for policy $\pi$.
- $\mathbb{E}_{\tau \sim \pi}$ denotes the expectation over trajectories $\tau$ sampled according to policy $\pi$. A trajectory is a sequence of states, actions, and rewards over time.
- $T$ is the time horizon, representing the total number of timesteps in an episode.
- $\gamma$ is the discount factor (a value between 0 and 1, typically close to 1), which determines the present value of future rewards. A higher $\gamma$ means future rewards are considered more important.
- $r_t$ is the reward received at timestep $t$.
4.2. Core Methodology In-depth (Layer by Layer)
GMT employs a two-stage teacher-student training framework for sim-to-real transfer, similar to prior works.
Stage 1: Training the Privileged Teacher Policy
A privileged teacher policy is first trained in simulation. This policy has access to both proprioceptive observations (information a real robot can sense, like joint angles and velocities) and privileged information (information typically only available in simulation, like true linear velocities, contact forces, or internal model parameters). This teacher policy is trained using the PPO (Proximal Policy Optimization) algorithm. The output of the teacher policy is joint target actions.
Stage 2: Training the Deployable Student Policy
A student policy is then trained to imitate the output of the teacher policy. This student policy is designed to be deployable on a real robot, meaning it only takes proprioceptive observations and their history as input. This training is done using the DAgger (Dataset Aggregation) algorithm. The student policy is optimized by minimizing the loss between its output and the teacher's output. The loss function is defined as:
$$\mathcal{L} = \left\| a_t^{\text{teacher}} - a_t^{\text{student}} \right\|_2^2$$
where:
- $\mathcal{L}$ is the loss function to be minimized.
- $a_t^{\text{teacher}}$ is the joint target action output by the teacher policy at timestep $t$.
- $a_t^{\text{student}}$ is the joint target action output by the student policy at timestep $t$.
- $\|\cdot\|_2^2$ denotes the squared Euclidean norm, which measures the squared difference between the teacher's and student's actions.
The GMT framework introduces two core components to enhance this process: Adaptive Sampling and Motion Mixture-of-Experts.
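As a rough illustration of this second stage, the sketch below shows one DAgger-style distillation update in PyTorch. The policy interfaces and batch construction are illustrative assumptions, not the authors' code; states are assumed to be collected by rolling out the student in simulation, with the teacher queried for action labels.

```python
import torch

def distill_step(teacher, student, optimizer, obs_prop, obs_hist, obs_priv, motion_target):
    """One hypothetical DAgger-style distillation update (interfaces are illustrative).

    The observation batch comes from rolling out the *student* in simulation;
    the teacher, which also sees privileged information, provides the action labels.
    """
    with torch.no_grad():
        a_teacher = teacher(obs_prop, obs_priv, motion_target)   # supervision target
    a_student = student(obs_prop, obs_hist, motion_target)       # deployable policy output

    # Squared Euclidean distance between teacher and student joint-target actions
    loss = ((a_teacher - a_student) ** 2).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```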
4.2.1. Adaptive Sampling
Large motion datasets, such as AMASS, suffer from category imbalance, meaning some motion types (e.g., walking, in-place activities) are much more frequent than others (e.g., complex dynamic skills). This imbalance hinders the learning of less frequent but often more challenging motions. Traditional sampling strategies often sample entire motion sequences, which are usually dominated by easier segments, leading to an inefficient learning signal for difficult parts.
Adaptive Sampling addresses this with two components:
- Random Clipping:
  - Motions longer than 10 seconds are clipped into several sub-clips, each with a maximum length of 10 seconds.
  - To avoid artifacts at the boundaries of these clipped segments, a random offset of up to 2 seconds is introduced during clipping.
  - All motions are re-clipped periodically during training to ensure diversity in the sampled sub-clips, preventing the policy from overfitting to specific segments.
- Tracking Performance-based Probabilities:
  - During training, a completion level $c_i$ is tracked for each motion clip $i$.
  - Initially, $c_i$ is set to 10.
  - Each time a motion is successfully completed (i.e., tracked without exceeding a certain error threshold), $c_i$ is multiplied by 0.99, effectively decaying its value. The minimum value $c_i$ can reach is 1.
  - An error threshold $\epsilon_i$ is dynamically set for each motion, determining when tracking is considered to have failed. It scales between a base value and a maximum value according to the completion level, e.g.,
    $$\epsilon_i = 0.25 + (0.6 - 0.25)\,\frac{c_i - 1}{9}$$
    where:
    - $\epsilon_i$ is the error threshold for motion $i$.
    - 0.25 is the base error threshold, and 0.6 is the maximum error threshold reached when $c_i = 10$.
    - $c_i$ is the completion level for motion $i$, ranging from 1 to 10.
    - As $c_i$ decreases (meaning the motion is successfully completed more often), $\epsilon_i$ also decreases, making the motion "harder" to complete in terms of error tolerance.
  - A sampling level $l_i$ is then defined for each motion, based on the completion level and the latest tracking error. This level is used to determine the probability of sampling a given motion for training:
    $$l_i = \begin{cases} c_i, & c_i > 1 \\ \left(\min\!\left(\dfrac{e_i}{0.15},\, 1\right)\right)^{5}, & c_i = 1 \end{cases}$$
    where:
    - $l_i$ is the sampling level for motion $i$.
    - $e_i$ is the maximum key body position error encountered during the latest attempt to track motion $i$.
    - 0.15 is a normalization constant for the error.
    - The $\min(\cdot, 1)$ ensures the ratio does not exceed 1.
    - The exponent 5 amplifies the effect of high errors, making motions with significant errors much more likely to be sampled when $c_i = 1$.
    - If $c_i > 1$, the sampling level is simply $c_i$. This means motions that are still being learned (higher $c_i$) are sampled proportionally to their completion level. If $c_i = 1$, the motion is considered "mastered" in terms of completion, but its sampling probability is then driven by how well it was tracked in terms of maximum error, ensuring continued refinement on challenging parts.
  - Finally, the actual sampling probability for each motion is obtained by normalizing these sampling levels across all motions. This mechanism prioritizes motions that the policy currently struggles with (high error) or has not yet mastered (high $c_i$), dynamically adjusting the training focus.
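The sketch below (plain Python/NumPy, not the authors' implementation) illustrates how the quantities described above could be combined into sampling probabilities. The constants follow the description in the text; the exact functional forms are reconstructions of the described behavior and may differ from the paper's implementation.

```python
import numpy as np

def update_completion(c, completed):
    """Completion level c starts at 10 and decays by x0.99 per successful completion (min 1)."""
    return max(1.0, c * 0.99) if completed else c

def error_threshold(c, base=0.25, max_thr=0.6):
    """Per-motion failure threshold: interpolates between `base` (c = 1) and `max_thr` (c = 10)."""
    return base + (max_thr - base) * (c - 1.0) / 9.0

def sampling_level(c, max_key_body_err, err_norm=0.15, power=5):
    """Motions not yet mastered (c > 1) are weighted by c; mastered motions (c = 1)
    are weighted by how large their worst key-body error still is."""
    if c > 1.0:
        return c
    return min(max_key_body_err / err_norm, 1.0) ** power

def sampling_probs(levels):
    levels = np.asarray(levels, dtype=np.float64)
    return levels / levels.sum()

# Example: three motions with different completion levels and recent tracking errors
c_levels = [10.0, 3.2, 1.0]
max_errs = [0.30, 0.20, 0.05]   # worst key-body error (meters) in the latest attempt
probs = sampling_probs([sampling_level(c, e) for c, e in zip(c_levels, max_errs)])
print(probs)  # hard or unfinished motions receive most of the probability mass
```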
4.2.2. Motion Mixture-of-Experts (MoE)
To address the model expressiveness issue for large and diverse motion datasets, GMT incorporates a soft MoE module into the teacher policy network.
The policy network consists of:
- Expert Networks: A group of $N$ neural networks, each referred to as an "expert". Each expert network takes the robot's state observation and motion targets as input and outputs an action $a_i$.
- Gating Network: Another neural network that takes the same input observations (robot state and motion targets) as the expert networks. Its role is to output a probability distribution over all experts, denoted as $p = (p_1, \dots, p_N)$, where $p_i$ is the probability assigned to expert $i$.

The final action output of the MoE policy is a weighted sum of the actions from each expert, where the weights are given by the probabilities from the gating network:
$$a = \sum_{i=1}^{N} p_i \, a_i$$
where:
- $a$ is the final action produced by the MoE policy.
- $N$ is the total number of expert networks.
- $p_i$ is the probability assigned to expert $i$ by the gating network.
- $a_i$ is the action output by expert network $i$.

This architecture allows different experts to specialize in different regions of the motion manifold (e.g., one expert for walking, another for kicking, another for dancing), and the gating network learns to smoothly transition between or combine these experts based on the current motion context. This significantly enhances the model's capacity and ability to generalize across diverse motions without requiring a massive single network.
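A minimal PyTorch sketch of such a soft MoE policy head is shown below. Layer sizes, the number of experts, and activation choices are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SoftMoEPolicy(nn.Module):
    """Minimal soft Mixture-of-Experts policy head (illustrative sizes, not the paper's)."""

    def __init__(self, obs_dim, act_dim, num_experts=4, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                          nn.Linear(hidden, act_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                                  nn.Linear(hidden, num_experts))

    def forward(self, x):
        # Each expert proposes an action; the gate outputs one probability per expert.
        expert_actions = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, act_dim)
        weights = torch.softmax(self.gate(x), dim=-1).unsqueeze(-1)        # (B, E, 1)
        return (weights * expert_actions).sum(dim=1)                       # weighted sum of experts

# Usage: observation = concatenated [proprioception, motion-target features]
policy = SoftMoEPolicy(obs_dim=128, act_dim=23)
action = policy(torch.randn(8, 128))   # batch of 8 -> (8, 23) joint targets
```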
The following figure (Figure 3 from the original paper) provides an overview of the GMT framework:
Figure 3 (translated caption): A schematic of the GMT (General Motion Tracking) framework. It contains two main stages: in the first stage, the mocap dataset is processed with Adaptive Sampling and combined with the Mixture-of-Experts model; the second stage shows how the student policy is trained via behavior cloning, alongside the tracking error in simulation. Key components include the motion target frames, proprioceptive observations, and privileged information.
4.2.3. Dataset Curation
The training dataset is a combination of AMASS [6] and LAFAN1 [40]. Since these raw datasets contain motions that are infeasible or dangerous for real robots (e.g., crawling, fallen states, extremely dynamic maneuvers), a two-stage data curation process is adopted:
- Rule-based Filtering (First Stage): Motions that violate basic physical constraints or robot capabilities are removed. Examples include:
  - Motions where the robot's root roll or pitch angles exceed predefined thresholds.
  - Motions where the robot's root height is abnormally high or low.
  - Motions that involve physical capabilities beyond the robot's hardware limits.
- Performance-based Filtering (Second Stage): After the initial rule-based filtering, a preliminary policy is trained on the filtered dataset (using approximately 5 billion samples). Based on the completion rates (how often the policy successfully tracks a motion) achieved by this preliminary policy, motions that consistently fail are further filtered out. This results in a curated dataset of 8925 clips totaling 33.12 hours, which is more appropriate for training a high-quality robot controller.
4.2.4. Motion Inputs
The goal of motion tracking is to make the robot follow a specific target pose in each motion frame. The motion target at timestep $t$ is represented as a vector:
$$g_t = \left[\, q^{ref}_t,\ v^{ref}_t,\ r^{ref}_t,\ p^{ref}_t,\ h^{ref}_t \,\right]$$
where:
- $q^{ref}_t$ represents the joint positions (angles) for the robot's 23 degrees of freedom.
- $v^{ref}_t$ denotes the base linear and angular velocities of the robot's root (pelvis).
- $r^{ref}_t$ represents the base roll and pitch angles of the robot's root.
- $p^{ref}_t$ corresponds to the local key body positions. Unlike some prior works that use global key body positions [8, 11], GMT uses local key body positions, similar to ExBody2 [7]. A further refinement is aligning these local key bodies relative to the robot's heading direction, which provides a more robust and invariant representation.
- $h^{ref}_t$ represents the root height.

To improve tracking performance, GMT moves beyond using only the immediate next motion frame as input. Instead, it incorporates future motion context:
- Stacked Future Frames: A sequence of multiple consecutive future frames, covering approximately two seconds of future motion, is stacked. This sequence provides information about long-term trends and upcoming movements.
- Convolutional Encoder: These stacked frames are then processed by a convolutional encoder [41] to compress them into a latent vector. This latent vector effectively summarizes the complex temporal information of the future motion.
- Combined Input: The latent vector is combined with the immediate next frame (which represents the explicit, short-term tracking target). This combined input is then fed into the policy network.

This design allows the policy to:
- Capture long-term trends of the motion sequence through the latent vector (context).
- Explicitly recognize the immediate tracking target through the next frame (precision).

The paper emphasizes that this combined input design is essential for achieving high-quality motion tracking.
4.2.5. Sim-to-Real Transfer Details
To ensure successful transfer of the learned policy from simulation to the real robot, GMT applies several techniques:
- Domain Randomization [42, 43]: Various physical parameters in the simulation are randomized during training for both the teacher and student policies. This makes the policy robust to discrepancies between simulation and reality. Details of these randomizations are provided in the Experimental Setup.
- Action Delay: This technique introduces a delay in the execution of actions in simulation, mimicking the inherent latency in real robot control systems.
- Modeling Reduction Drive Inertia: The effect of the reduction drive's moment of inertia is explicitly modeled. For a reduction ratio $n$ and a reduction-drive moment of inertia $I$, the armature parameter in the simulator is configured as
  $$\text{armature} = n^{2} I,$$
  which approximates the effective (reflected) inertia introduced by the reduction drive, helping to align the simulated dynamics more closely with the real robot's dynamics.
4.2.6. Observations and Actions
- Teacher Policy Observations:
  - Proprioception: root angular velocity (3 dimensions), root roll and pitch (2 dimensions), joint positions (23 dimensions), joint velocities (23 dimensions), and the last action (23 dimensions).
  - Privileged Information: root linear velocity (3 dimensions), root height (1 dimension), key body positions, feet contact mask (2 dimensions), mass randomization parameters (6 dimensions), and motor strength (46 dimensions).
  - Motion Targets: as described in Section 4.2.4.
- Student Policy Observations:
  - Proprioception: same as for the teacher.
  - Proprioception History: a sequence of past proprioceptive observations. This history helps the student policy infer unobservable states (like linear velocity) from its own sensor readings over time.
  - Motion Targets: same as for the teacher.
- Action Space: The output action for both policies consists of target joint positions. The robot's low-level controller then works to achieve these target positions.
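Tying Sections 4.2.4 and 4.2.6 together, the sketch below shows how the future-motion window could be encoded with a small 1D CNN and concatenated with the immediate next target frame before entering the policy. Channel sizes, the window length, and the frame dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FutureMotionEncoder(nn.Module):
    """Compresses a window of future motion frames into a latent vector with a 1D CNN,
    then concatenates it with the immediate next target frame (illustrative sizes)."""

    def __init__(self, frame_dim, latent_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(frame_dim, 64, kernel_size=5, stride=2), nn.ELU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ELU(),
            nn.AdaptiveAvgPool1d(1),            # collapse the time axis
        )
        self.proj = nn.Linear(64, latent_dim)

    def forward(self, future_frames, next_frame):
        # future_frames: (B, window, frame_dim); next_frame: (B, frame_dim)
        z = self.conv(future_frames.transpose(1, 2)).squeeze(-1)  # (B, 64)
        z = self.proj(z)                                          # long-term context
        return torch.cat([next_frame, z], dim=-1)                 # context + explicit target

# Example: a ~2 s window at 50 Hz (100 frames), hypothetical 60-dim motion frames
encoder = FutureMotionEncoder(frame_dim=60)
goal_input = encoder(torch.randn(4, 100, 60), torch.randn(4, 60))  # shape (4, 60 + 64)
```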
4.2.7. Reward Functions
The reward functions used in the first-stage training (for the teacher policy) guide the robot to track the motion targets. The overall reward is a sum of individual reward terms, each designed to penalize deviations from the target motion or undesired behaviors. Here are the definitions from Table 3 of the paper:
| Name | Definitions |
| tracking joint positions | exp(−kq∥qref − qt∥2) |
| tracking joint velocities | exp(−kq̇∥q̇ref − q̇t∥2) |
| tracking root pose | exp(−kr∥rref − rt∥2 − kh∥href − ht∥2) |
| tracking root vel | exp(−kv∥vref − vt∥2) |
| tracking key body positions | exp(−kp∥pref − pt∥2) |
| alive | 1.0 |
| foot slip | −kfoot∥vfoot ∘ contact∥2 |
| joint velocities | −kjoint_vel∥q̇t∥2 |
| joint accelerations | −kjoint_acc∥q̈t∥2 |
| action rate | −kaction_rate∥at − at−1∥2 |
Where:
- exp(x) denotes the exponential function $e^{x}$.
- $k$ with various subscripts (e.g., $k_q$, $k_{\dot q}$, $k_r$, $k_h$, $k_v$, $k_p$) represents scaling coefficients for each reward term, balancing their contribution to the total reward.
- $q^{ref}$ and $q_t$ are the reference and current joint positions, respectively.
- $\dot q^{ref}$ and $\dot q_t$ are the reference and current joint velocities, respectively.
- $r^{ref}$ and $r_t$ are the reference and current root rotations (e.g., roll and pitch), respectively.
- $h^{ref}$ and $h_t$ are the reference and current root heights, respectively.
- $v^{ref}$ and $v_t$ are the reference and current root velocities, respectively.
- $p^{ref}$ and $p_t$ are the reference and current key body positions, respectively.
- $v_{foot}$ is the foot velocity.
- contact is a mask indicating whether a foot is in contact with the ground.
- $\circ$ denotes the element-wise (Hadamard) product.
- $\ddot q_t$ is the current joint acceleration.
- $a_t$ and $a_{t-1}$ are the current and previous actions, respectively.
- $\|\cdot\|^2$ denotes the squared Euclidean norm.
The "tracking" rewards are designed to maximize similarity to the reference motion, using exponential functions so that smaller errors yield higher rewards. The "alive" reward is a constant positive reward for simply staying active. The "foot slip", "joint velocities", "joint accelerations", and "action rate" terms are regularization penalties to encourage stable, smooth, and energy-efficient movements.
5. Experimental Setup
5.1. Datasets
The policies were trained using a combination of two well-known human motion datasets:
- AMASS (Archive of Motion Capture as Surface Shapes) [6]: A large, public dataset of human motion capture data.
  - Source: It aggregates various existing mocap datasets into a common framework, providing body shape and pose information.
  - Characteristics: Contains a vast array of human motions, including locomotion, interactions, sports, and daily activities. However, it exhibits significant category imbalance, with a large proportion of motions involving walking or in-place activities, as illustrated in Figure 2.
  - Example of data distribution (Figure 2 from the original paper, translated caption): The chart shows the distribution of motion categories in AMASS as the proportion of total motion duration per category; "Walk" and "Inplace" account for the largest shares, while the other categories are comparatively small. As can be seen from Figure 2, categories like walk and in-place constitute a very large portion of the dataset's duration, while more complex or dynamic motions like dance, kick, jump, and run are comparatively rare.
  - Domain: Human motion data.
- LAFAN1 [40]: A dataset specifically designed for learning character animation, often used for tasks like motion in-betweening and tracking.
  - Source: A collection of motions with a focus on detailed foot contact information and diverse locomotion.
  - Characteristics: Known for clean data and suitability for physics-based character control.
  - Domain: Human motion data.

Dataset Curation: As described in the Methodology section, the raw AMASS and LAFAN1 datasets underwent a two-stage filtering process (rule-based and performance-based) to remove motions that are infeasible or problematic for humanoid robots. This resulted in a curated training dataset of 8925 clips, totaling 33.12 hours of motion.

Test Sets:
- AMASS-test: A separate test set derived from the AMASS dataset for evaluating generalization to unseen AMASS motions.
- LAFAN1: Used as a test set to evaluate performance on the LAFAN1 dataset, even though it was part of the training data after filtering. This allows assessing how well the policy learned to track the motions present in LAFAN1.

Why these datasets were chosen: These datasets represent a broad spectrum of human movements, making them ideal for training a general motion tracking policy. The curation process ensures that the motions are physically plausible for a robot.
5.2. Evaluation Metrics
The policy tracking performance is quantitatively evaluated using four metrics:
- Mean Per Keybody Position Error ($E_{mpkpe}$):
  - Conceptual Definition: Quantifies the average spatial discrepancy between the positions of specific key bodies (e.g., hands, feet, head, pelvis) on the robot and their corresponding target positions in the reference motion. It measures how well the robot's overall body shape and limb positions match the target. A lower value indicates better tracking.
  - Mathematical Formula (not given explicitly in the paper; typically computed as):
    $$E_{mpkpe} = \frac{1}{K T}\sum_{t=1}^{T}\sum_{k=1}^{K}\left\| p_{k,t} - p^{ref}_{k,t} \right\|_2$$
  - Symbol Explanation:
    - $K$: the number of key bodies being tracked.
    - $T$: the total number of timesteps in the motion clip.
    - $p_{k,t}$: the 3D position of the $k$-th key body on the robot at timestep $t$.
    - $p^{ref}_{k,t}$: the 3D position of the $k$-th key body in the reference motion at timestep $t$.
    - $\|\cdot\|_2$: the Euclidean distance (L2 norm) in 3D space.
    - The unit of $E_{mpkpe}$ is millimeters (mm).
- Mean Per Joint Position Error ($E_{mpjpe}$):
  - Conceptual Definition: Measures the average angular difference between the robot's joint angles and the target joint angles specified by the reference motion. It assesses the fidelity of the robot's internal pose and joint configuration compared to the desired motion. A lower value indicates better joint-level tracking.
  - Mathematical Formula (not given explicitly in the paper; typically computed as):
    $$E_{mpjpe} = \frac{1}{J T}\sum_{t=1}^{T}\sum_{j=1}^{J}\left| q_{j,t} - q^{ref}_{j,t} \right|$$
  - Symbol Explanation:
    - $J$: the number of joints (degrees of freedom) of the robot.
    - $T$: the total number of timesteps.
    - $q_{j,t}$: the angular position of the $j$-th joint on the robot at timestep $t$.
    - $q^{ref}_{j,t}$: the angular position of the $j$-th joint in the reference motion at timestep $t$.
    - $|\cdot|$: the absolute difference.
    - The unit of $E_{mpjpe}$ is radians (rad).
- Linear Velocity Error ($E_{vel}$):
  - Conceptual Definition: Quantifies the difference between the robot's root linear velocity (how fast and in what direction its pelvis is moving) and the target linear velocity from the reference motion. It evaluates the robot's ability to match the translational speed and direction of the reference motion. A lower value indicates better global locomotion tracking.
  - Mathematical Formula (not given explicitly in the paper; typically computed as):
    $$E_{vel} = \frac{1}{T}\sum_{t=1}^{T}\left\| v_t - v^{ref}_t \right\|_2$$
  - Symbol Explanation:
    - $T$: the total number of timesteps.
    - $v_t$: the 3D linear velocity of the robot's root at timestep $t$.
    - $v^{ref}_t$: the 3D linear velocity of the reference motion's root at timestep $t$.
    - $\|\cdot\|_2$: the Euclidean distance.
    - The unit of $E_{vel}$ is meters per second (m/s).
- Yaw Velocity Error ($E_{yaw\,vel}$):
  - Conceptual Definition: Measures the difference between the robot's root yaw velocity (rate of rotation around its vertical axis) and the target yaw velocity from the reference motion. It assesses the robot's ability to match the rotational speed around its vertical axis, which is crucial for turning and orientation changes. A lower value indicates better rotational tracking.
  - Mathematical Formula (not given explicitly in the paper; typically computed as):
    $$E_{yaw\,vel} = \frac{1}{T}\sum_{t=1}^{T}\left| \omega_t - \omega^{ref}_t \right|$$
  - Symbol Explanation:
    - $T$: the total number of timesteps.
    - $\omega_t$: the scalar yaw angular velocity of the robot's root at timestep $t$.
    - $\omega^{ref}_t$: the scalar yaw angular velocity of the reference motion's root at timestep $t$.
    - $|\cdot|$: the absolute difference.
    - The unit of $E_{yaw\,vel}$ is radians per second (rad/s).
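The sketch below (NumPy) shows how these four metrics could be computed from aligned robot and reference trajectories, following the formulas above; the array layouts are assumptions for illustration.

```python
import numpy as np

def tracking_metrics(keypos, keypos_ref, q, q_ref, v_root, v_root_ref, yaw_rate, yaw_rate_ref):
    """keypos*: (T, K, 3) meters; q*: (T, J) rad; v_root*: (T, 3) m/s; yaw_rate*: (T,) rad/s."""
    mpkpe = np.mean(np.linalg.norm(keypos - keypos_ref, axis=-1)) * 1000.0  # mm
    mpjpe = np.mean(np.abs(q - q_ref))                                      # rad
    e_vel = np.mean(np.linalg.norm(v_root - v_root_ref, axis=-1))           # m/s
    e_yaw = np.mean(np.abs(yaw_rate - yaw_rate_ref))                        # rad/s
    return {"Empkpe": mpkpe, "Empjpe": mpjpe, "Evel": e_vel, "Eyaw_vel": e_yaw}
```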
5.3. Baselines
GMT's performance is compared against ExBody2 [7] in simulation.
- ExBody2 [7]: This prior work also focuses on expressive humanoid whole-body control and achieved good tracking performance, but with several separate specialist policies. The authors re-implemented ExBody2 and trained it on their filtered dataset to ensure a fair comparison under the same data conditions.
5.4. Training Details
- Simulator: IsaacGym [44] was used as the physics simulator. IsaacGym is known for its ability to run thousands of simulation environments in parallel on a single GPU, significantly accelerating RL training.
- Parallel Environments: 4096 parallel environments were used, enabling massive parallelization.
- GPU: Training was performed on an RTX 4090 GPU.
- Training Time:
  - Privileged Teacher Policy: approximately 3 days.
  - Deployable Student Policy: approximately 1 day.
- Simulation Frequency: 500 Hz, i.e., the physics simulation updates 500 times per second.
- Control Frequency: 50 Hz, i.e., the policy outputs new actions 50 times per second. The ratio (500/50 = 10) implies that for every policy action, the simulator takes 10 physics steps.
- Samples: Each policy was trained using approximately 6.8 billion samples.
- Validation: The trained policy was validated in MuJoCo [47] (a physics engine often used for robotics research) before deployment onto the real robot. This provides an additional verification step.
- Robot: For real-world experiments, the policy was deployed on a Unitree G1 [45], a medium-sized humanoid robot with 23 DoFs and a height of 1.32 meters.
5.5. Domain Randomizations
To mitigate the sim-to-real gap, extensive domain randomizations were applied during training. The details are shown in Table 4 from the original paper:
| Name | Range |
| Terrain Height | [0, 0.02] m |
| Gravity | [−0.1, 0.1] |
| Friction | [0.1, 2.0] |
| Robot Base Mass | [−3, 3] kg |
| Robot Base Mass Center | [−0.05, 0.05] m |
| Push Velocity | [0.0, 1.0] m/s |
| Motor Strength | — |
Where:
- Terrain Height: Randomizes the height variations in the terrain, making the robot robust to uneven surfaces.
- Gravity: Randomly perturbs the gravity vector, helping the robot learn to compensate for slight variations in perceived gravity or inertial forces.
- Friction: Randomizes the friction coefficients of the robot's contacts, making the policy robust to varying surface slipperiness.
- Robot Base Mass: Randomizes the mass of the robot's base (torso/pelvis), forcing the policy to adapt to different inertial properties. The range [−3, 3] kg is likely an additive offset applied to the nominal base mass.
- Robot Base Mass Center: Randomizes the position of the center of mass of the robot's base, which affects its balance and dynamics.
- Push Velocity: Applies random external pushes to the robot, characterized by a velocity range, to improve its ability to recover from disturbances and maintain balance.
- Motor Strength: Randomizes the strength (torque output) of the robot's motors, making the policy robust to variations in motor performance or wear.
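A minimal sketch of how such ranges might be sampled at the start of each episode is shown below. The ranges are copied from Table 4 where recoverable, and the parameter names and reset interface are hypothetical.

```python
import numpy as np

DOMAIN_RAND_RANGES = {
    "terrain_height_m":     (0.0, 0.02),
    "gravity_offset":       (-0.1, 0.1),
    "friction":             (0.1, 2.0),
    "base_mass_offset_kg":  (-3.0, 3.0),
    "base_com_offset_m":    (-0.05, 0.05),
    "push_velocity_mps":    (0.0, 1.0),
}

def sample_domain_randomization(rng=np.random.default_rng()):
    """Draw one set of randomized physical parameters for an episode reset."""
    return {name: float(rng.uniform(lo, hi)) for name, (lo, hi) in DOMAIN_RAND_RANGES.items()}

print(sample_domain_randomization())
```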
6. Results & Analysis
6.1. Core Results Analysis
The experiments evaluate GMT's performance in both simulation and real-world settings, focusing on the contributions of Adaptive Sampling, Motion Mixture-of-Experts, and Motion Inputs.
The following are the results from Table 2 of the original paper (the first four metric columns are on AMASS-Test, the last four on LAFAN1):
| Method | Empkpe ↓ (AMASS-Test) | Empjpe ↓ (AMASS-Test) | Evel ↓ (AMASS-Test) | Eyaw vel ↓ (AMASS-Test) | Empkpe ↓ (LAFAN1) | Empjpe ↓ (LAFAN1) | Evel ↓ (LAFAN1) | Eyaw vel ↓ (LAFAN1) |
| Teacher & Student | | | | | | | | |
| Privileged Policy | 42.07 | 0.0834 | 0.1747 | 0.2238 | 45.16 | 0.0975 | 0.2837 | 0.3314 |
| Student Policy | 4.01 | 0.0807 | 0.1943 | 0.2211 | 46.14 | 0.1060 | 0.3009 | 0.3489 |
| Baseline | | | | | | | | |
| ExBody2 [7] | 50.28±0.28 | 0.0925±0.001 | 0.1875±0.001 | 0.3402±0.004 | 58.36±0.48 | 0.1378±0.002 | 0.3461±0.005 | 0.4260±0.006 |
| GMT (ours) | 42.07±0.17 | 0.0834±0.001 | 0.1747±0.001 | 0.2238±0.002 | 45.16±0.35 | 0.0975±0.001 | 0.2837±0.004 | 0.3314±0.003 |
| (a) Ablations | | | | | | | | |
| GMT w.o. MoE | 43.54±0.23 | 0.0874±0.00 | 0.1902±0.002 | 0.2483±0.001 | 48.26±0.29 | 0.1041±0.002 | 0.3111±0.003 | 0.3795±0.005 |
| GMT w.o. A.S. | 42.53±0.19 | 0.0872±0.001 | 0.2064±0.001 | 0.2593±0.001 | 49.61±0.30 | 0.1019±0.001 | 0.3019±0.003 | 0.3574±0.003 |
| GMT w.o. A.S. & MoE | 44.34±0.21 | 0.0920±0.001 | 0.2121±0.001 | 0.2534±0.001 | 52.34±0.33 | 0.1110±0.002 | 0.3263±0.003 | 0.3584±0.007 |
| GMT (ours) | 42.07±0.17 | 0.0834±0.001 | 0.1747±0.001 | 0.2238±0.002 | 45.16±0.35 | 0.0975±0.001 | 0.2837±0.004 | 0.3314±0.003 |
| (b) Motion Inputs | | | | | | | | |
| GMT-M | 46.02±0.25 | 0.0942±0.001 | 0.2282±0.001 | 0.3311±0.003 | 51.16±0.34 | 0.1069±0.002 | 0.3476±0.001 | 0.4890±0.007 |
| GMT-L0.5-M | 43.64±0.19 | 0.0855±0.001 | 0.2051±0.001 | 0.2439±0.001 | 49.87±0.32 | 0.1032±0.002 | 0.3346±0.005 | 0.3648±0.003 |
| GMT-L1-M | 43.15±0.22 | 0.0867±0.001 | 0.1989±0.002 | 0.2465±0.001 | 47.41±0.35 | 0.1007±0.002 | 0.3047±0.003 | 0.3513±0.002 |
| GMT-L2 | 49.52±0.27 | 0.1016±0.002 | 0.2201±0.001 | 0.2888±0.003 | 61.24±0.42 | 0.1368±0.002 | 0.3925±0.008 | 0.5558±0.009 |
| GMT-L2-M (ours) | 42.07±0.17 | 0.0834±0.001 | 0.1747±0.001 | 0.2238±0.002 | 45.16±0.35 | 0.0975±0.001 | 0.2837±0.004 | 0.3314±0.003 |
6.1.1. Baseline Comparison
The row labeled GMT (ours) in the "Baseline" section of Table 2 represents the full GMT model. It consistently outperforms ExBody2 [7] across all metrics on both the AMASS-Test and LAFAN1 datasets.
- AMASS-Test: GMT achieves an Empkpe of 42.07 mm compared to ExBody2's 50.28 mm, and an Empjpe of 0.0834 rad compared to 0.0925 rad. Similar improvements are observed in Evel (0.1747 m/s vs 0.1875 m/s) and Eyaw vel (0.2238 rad/s vs 0.3402 rad/s).
- LAFAN1: The performance gap is even larger on LAFAN1, with GMT scoring 45.16 mm Empkpe vs ExBody2's 58.36 mm, and 0.0975 rad Empjpe vs 0.1378 rad.

This strong performance demonstrates that GMT significantly improves both local tracking (key body and joint positions) and global tracking (linear and yaw velocities) compared to a state-of-the-art baseline, achieving better fidelity and generalization with a single policy.
6.1.2. Teacher & Student Performance
The "Teacher & Student" section shows the performance of the Privileged Policy (teacher) and tutu Policy (student). The term "tutu Policy" seems to be a typo or placeholder, as its Empkpe is suspiciously low (4.01), and Empipe is 0807 (likely 0.0807). If we assume the "tutu Policy" is the student policy, its performance is slightly worse than the privileged policy, which is expected as the student lacks privileged information. However, the student's performance is still quite strong and close to the teacher's, indicating successful knowledge transfer. The table entry "G (ours)" is identical to "Privileged Policy", confirming that the main comparisons and ablations are on the privileged teacher policy.
6.2. Ablation Studies
The "Ablations" section (a) in Table 2 investigates the contribution of Motion MoE and Adaptive Sampling.
-
GMT w.o. MoE(without Mixture-of-Experts):Empkpeincreases to 48.26 mm (from 42.07 mm) onLAFAN1.Empipeincreases to 0.0874 rad (from 0.0834 rad) onAMASS-Test.EvelandEyaw velalso show degradation.- This indicates that the
MoEarchitecture is crucial for maintaining high tracking accuracy, especially for capturing the diversity and complexity of motions.
-
GMT w.o. A.S.(without Adaptive Sampling):EmpkpeonAMASS-Testworsens to 42.53 mm (from 42.07 mm).EmpkpeonLAFAN1worsens to 49.61 mm (from 45.16 mm).- All other metrics also show noticeable degradation.
- This confirms that
Adaptive Samplingis effective in improving tracking performance, particularly by focusing training on more challenging motion segments and ensuring balanced learning across motion categories.
-
GMT W.O. A.S. & MoE(without both Adaptive Sampling and Mixture-of-Experts):- This variant shows the worst performance across all metrics and datasets, with
Empkpeof 44.34 mm onAMASS-Testand 52.34 mm onLAFAN1, and similarly poor results for other metrics. - This clearly demonstrates the synergistic effect of
Adaptive SamplingandMoE. When both are removed, the policy's ability to track diverse motions significantly deteriorates, highlighting their combined importance forGMT's state-of-the-art performance.
- This variant shows the worst performance across all metrics and datasets, with
6.2.1. Motion MoE Analysis
The MoE (Mixture-of-Experts) architecture specifically helps in improving performance on more challenging motions.
- Quantitative Evidence: As shown in Figure 5, the MoE helps significantly reduce tracking errors, especially in the top percentiles (i.e., for the hardest-to-track motions) for Empkpe, Empjpe, Evel, and Eyaw vel. This suggests that the specialization of experts allows the policy to better handle complex and diverse movements that a single, monolithic network might struggle with.
- Qualitative Evidence: Figure 4 visualizes the output of the gating network on a composite motion sequence (standing, kicking, walking backward, standing again). It shows clear transitions in expert activation across different phases of the motion. For example, one expert might be active during standing, another during kicking, and yet another during walking. This validates the intended role of MoE: individual experts specialize in different types of motion, allowing for better overall generalization and performance across a wide range of skills.

The following figure (Figure 4 from the original paper) plots the output of the gating network:
Figure 4 (translated caption): The gating network output over time for one motion clip. The upper part shows snapshots of the robot during the motion, and the lower part shows how the gating weights of the different experts evolve over the timesteps.
The following figure (Figure 5 from the original paper) shows top percentile tracking errors:
Figure 5 (translated caption): Tracking errors of GMT and its variants over the full AMASS dataset. Blue bars denote GMT, orange bars the variant without Adaptive Sampling, and green bars the variant without the Motion Mixture-of-Experts. The plot reports key body position, joint position, linear velocity, and yaw velocity tracking errors, grouped by percentile ranges.
6.2.2. Adaptive Sampling Analysis
Similar to MoE, Adaptive Sampling also significantly improves performance, especially on challenging motions.
- Quantitative Evidence: Table 2 (a) shows that GMT w.o. A.S. performs worse than the full GMT. Figure 5 also illustrates that Adaptive Sampling (blue vs. orange bars) leads to lower tracking errors, particularly in the higher percentiles, confirming its effectiveness on difficult motions.
- Qualitative Evidence: Figure 6 compares policies trained with and without Adaptive Sampling on a difficult segment extracted from a long motion clip.
  - Visualization (Figure 6a): Without Adaptive Sampling, the policy struggles to learn this clip, often failing to balance and exhibiting unstable behaviors. With Adaptive Sampling, the policy tracks the motion with high quality and stability.
  - Torque Outputs (Figure 6b): The torque outputs for key joints (knee and hip roll) also differ. Without Adaptive Sampling, the torque profiles might be erratic or insufficient, leading to instability. With Adaptive Sampling, the torques are likely more controlled and effective in executing the motion. This shows that Adaptive Sampling is critical not only for simulation performance but also for practical real-world deployment.

The following figure (Figure 6 from the original paper) illustrates the performance of policies with and without Adaptive Sampling:
Figure 6 (translated caption): Performance of policies trained with and without Adaptive Sampling on a long motion clip. The upper part shows the robot's execution in the simulator, and the lower part shows the torque outputs of several joints, illustrating how Adaptive Sampling improves the robot's stability and overall performance.
6.2.3. Motion Inputs Analysis
The "Motion Inputs" section (b) in Table 2 investigates how different configurations of motion inputs affect tracking performance.
-
GMT-M: Only the immediate next frame of motion is provided as input. This results in significantly worse performance across all metrics compared to the fullGMTmodel (e.g.,Empkpe46.02 mm onAMASS-Testvs 42.07 mm forGMT). This confirms that immediate next frame alone is insufficient for high-quality tracking. -
GMT-Lx-M(e.g.,GMT-L0.5-M,GMT-L1-which is likelyGMT-L1-M): Both the immediate next frame and a window of seconds of future motion frames are input.GMT-L0.5-M(0.5 seconds of future motion + immediate frame) shows improvement overGMT-M, but still worse thanGMT.GMT-L1-(1 second of future motion + immediate frame) further improves, getting closer to the fullGMTperformance. This indicates that providing more future context is beneficial.
-
GMT-L2(2 seconds of future motion, without immediate next frame): This variant shows a significant degradation in performance compared toGMT-L1-Mand evenGMT-MonLAFAN1. For instance,Empkpeis 49.52 mm onAMASS-Testand 61.24 mm onLAFAN1, which is much worse than the fullGMT.- Explanation: The authors explain that while a sequence of future frames captures the
overall tendencyof upcoming motions, it mightlose some detailed informationwhen compressed into alatent vector. Theimmediate next frameprovides thenearest relevant informationand theexplicit tracking target. Therefore, combining the long-term context (latent vector from future frames) with the immediate target (next frame) is crucial for precision and high-quality tracking.
- Explanation: The authors explain that while a sequence of future frames captures the
-
GMT-L2-M (ours): This is the fullGMTmodel, which uses 2 seconds of future motion (encoded) plus the immediate next frame. It consistently achieves the best performance, validating the chosenmotion inputdesign.
6.3. Real-World Deployment
The efficacy of GMT is further validated through real-world deployment on a Unitree G1 humanoid robot. As shown in Figure 1, the robot is able to reproduce a wide array of human motions, including:
- Stylized walking
- High kicking
- Dancing
- Spinning
- Crouch walking
- Soccer kicking
- Stretching
- Kungfu
- Boxing, running, side stepping, squatting

The deployment results demonstrate that the single, unified GMT policy achieves high fidelity and state-of-the-art performance in a real-world setting, which is a significant achievement given the challenges of sim-to-real transfer and the diversity of motions.
The following figure (Figure 1 from the original paper) showcases the real-world deployment:
Figure 1 (translated caption): A medium-sized humanoid robot performing various motor skills. The robot stably executes a range of motions, including (a) stretching, (b) soccer kicking, (c) dancing, (d) high kicking, (e) kungfu, and (f) other dynamic skills such as boxing, running, side stepping, and squatting.
6.4. Applications - Tracking MDM-Generated Motions
To further assess the generalizability of GMT, the trained policy was tested with motions generated by Motion Diffusion Models (MDM) [46] in MuJoCo sim-to-sim settings.
- Results (Figure 7): GMT performs well on motions generated from text prompts by MDM, such as bowing, crouching, drinking while walking, sitting while pouring water, stretching arms, and various walking styles.
- Significance: This demonstrates GMT's potential to be applied to other downstream tasks where motions might be synthesized rather than coming from traditional mocap datasets. It opens avenues for humanoids to execute a broader range of AI-generated behaviors.

The following figure (Figure 7 from the original paper) demonstrates motion tracking on MDM-generated motions:
Figure 7 (translated caption): Different motion tracking scenarios, including bowing, crouching, drinking, sitting while pouring water, stretching the arms, and walking. Each motion is shown with a corresponding description, demonstrating effective tracking of diverse motions.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces GMT, a novel, general, and scalable framework for training a single unified policy that enables humanoid robots to track diverse whole-body motions in the real world. The core innovations are the Adaptive Sampling strategy, which efficiently addresses dataset imbalance by prioritizing difficult motions, and the Motion Mixture-of-Experts (MoE) architecture, which enhances model expressiveness and specialization across the motion manifold. Extensive experiments in simulation and real-world deployment on a Unitree G1 robot demonstrate that GMT achieves state-of-the-art performance across a wide spectrum of complex and dynamic motions. Furthermore, GMT shows strong generalizability by effectively tracking motions generated by Motion Diffusion Models. The authors conclude that GMT provides a robust foundation for future whole-body control development in humanoid robotics.
7.2. Limitations & Future Work
The authors acknowledge several limitations of GMT:
- Lack of Contact-Rich Skills: Due to the significant complexity of simulating contact-rich behaviors (e.g., precise friction, collision response, impact forces) and the hardware limitations of current robots, GMT does not currently support skills like getting up from a fallen state or rolling on the ground. These skills involve intricate interactions with the environment that are hard to model and execute robustly.
- Limitations on Challenging Terrains: The current policy is trained without any terrain observations and is not designed for imitation on challenging terrains such as slopes or stairs. Its performance is primarily validated on flat surfaces.

Based on these limitations, the authors suggest the following future work:
- Extending the framework to develop a general and robust controller capable of operating across both flat and challenging terrains. This would likely involve incorporating terrain information into observations and potentially adapting the reward functions or policy architecture for such environments.
- Exploring methods to handle contact-rich skills by integrating more advanced physics modeling or specialized RL techniques for robust contact control.
7.3. Personal Insights & Critique
This paper presents a significant step forward in general motion tracking for humanoid robots. The Adaptive Sampling and MoE combination is an elegant solution to the perennial problems of data imbalance and model capacity in large-scale RL training, especially for complex real-world robotics. The demonstration on a real robot is particularly compelling, showcasing the practical utility of the proposed methods.
Inspirations and Applications:
- The Adaptive Sampling strategy could be broadly applied to other RL tasks where training data is highly imbalanced or where certain "hard" examples are critical for robust performance. This goes beyond motion tracking and could be useful in areas like manipulation, navigation, or robot learning from demonstrations.
- The MoE architecture, especially with its demonstrated ability to specialize for different motion phases, offers a powerful paradigm for designing highly expressive and flexible policies for diverse robot behaviors. It suggests a path towards policies that can seamlessly blend different skill sets.
- The successful tracking of MDM-generated motions opens up exciting possibilities for content creation and personalization in robotics. Robots could potentially execute motions derived from natural language commands or artistic expressions, blurring the lines between human intent and robot execution.

Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Scaling of MoE: While MoE improves expressiveness, the computational overhead might increase with a larger number of experts or for very high-dimensional inputs. The paper does not deeply elaborate on the specific number of experts used or the computational cost trade-offs. Further analysis of the optimal number of experts and their computational efficiency could be valuable.
- Generalizability of Adaptive Sampling: The current Adaptive Sampling relies on tracking performance-based probabilities and specific error thresholds. While effective for motion tracking, adapting this strategy to other RL tasks might require careful definition of "completion level" and "tracking error" for those tasks. The specific constants (e.g., 0.25, 0.6, 0.15, the exponent 5) might need fine-tuning for different domains.
- Teacher-Student Gap: Although the student policy performs well, there is still a performance gap compared to the privileged teacher. Investigating methods to further close this gap, perhaps through more advanced distillation techniques or meta-learning for sim-to-real transfer, could lead to even more robust real-world performance.
- Energy Efficiency/Smoothness: While the reward function includes terms for joint velocities and action rate to promote smoothness, a more explicit focus on energy efficiency or motor wear in the reward formulation might be beneficial for long-term real-world deployment. The torque plots in Figure 6b are good qualitative indicators, but quantitative metrics for these aspects could be useful.
- Robustness to Perception Errors: The student policy relies on proprioceptive observations and their history. Real-world sensor noise and potential drift can introduce errors. The paper mentions domain randomization, which helps, but specific robustness against various types of perception errors could be further explored.
- Interaction with Objects: The current focus is on tracking motions. Extending GMT to handle dynamic interaction with objects (e.g., picking up, manipulating, pushing) while tracking complex motions would be a challenging but critical next step for truly general-purpose humanoids. This would require integrating force/torque sensing and reactive control.

Overall, GMT provides a powerful and practical framework for advancing humanoid robot control, setting a new benchmark for unified policy performance on diverse whole-body motions in the real world.