BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion
TL;DR Summary
BeyondMimic uses guided diffusion to robustly track human motions and compose versatile skills, enabling zero-shot task control on real humanoid robots and bridging the sim-to-real gap for dynamic, generalizable motion control.
Abstract
Learning skills from human motions offers a promising path toward generalizable policies for versatile humanoid whole-body control, yet two key cornerstones are missing: (1) a high-quality motion tracking framework that faithfully transforms large-scale kinematic references into robust and extremely dynamic motions on real hardware, and (2) a distillation approach that can effectively learn these motion primitives and compose them to solve downstream tasks. We address these gaps with BeyondMimic, a real-world framework to learn from human motions for versatile and naturalistic humanoid control via guided diffusion. Our framework provides a motion tracking pipeline capable of challenging skills such as jumping spins, sprinting, and cartwheels with state-of-the-art motion quality. Moving beyond simply mimicking existing motions, we further introduce a unified diffusion policy that enables zero-shot task-specific control at test time using simple cost functions. Deployed on hardware, BeyondMimic performs diverse tasks at test time, including waypoint navigation, joystick teleoperation, and obstacle avoidance, bridging sim-to-real motion tracking and flexible synthesis of human motion primitives for whole-body control. https://beyondmimic.github.io/.
In-depth Reading
1. Bibliographic Information
- Title: BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion
- Authors: Qiayuan Liao, Takara E. Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C. Karen Liu.
- Affiliations: The authors are affiliated with two institutions, denoted as 1 and 2 in the paper. Based on the authors' public profiles, these correspond to the University of California, Berkeley, and Stanford University, respectively.
- Journal/Conference: The paper is available on arXiv, a preprint server for academic articles. It is a preprint, meaning it has not yet completed a formal peer-review process for a conference or journal.
- Publication Year: The paper's arXiv identifier (2508.08241v3) indicates a submission in August 2025, making this a very recent work.
- Abstract: The authors tackle two primary challenges in humanoid robotics: (1) creating a high-quality motion tracking framework to transfer a wide range of human motions, including highly dynamic ones, onto real hardware, and (2) developing a method to distill these learned skills into a single, versatile policy that can solve new tasks. They introduce BeyondMimic, a framework that first trains robust motion tracking policies for skills like sprinting and cartwheels. It then uses a novel guided diffusion policy to compose these learned motion primitives in a zero-shot manner to solve downstream tasks like waypoint navigation and joystick teleoperation at test time, using only simple cost functions without any retraining.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2508.08241
- PDF Link: https://arxiv.org/pdf/2508.08241v3.pdf
2. Executive Summary
- Background & Motivation (Why): The ultimate goal of humanoid robotics is to create general-purpose robots that can operate in human environments. A promising approach is to teach them from the vast diversity of human motions. However, existing methods have major shortcomings. They either work only in simulation, produce jittery or low-quality motions on real robots, are limited to simple movements like walking, or require extensive, task-specific engineering for each new skill. A key missing piece is a unified framework that can both learn high-quality dynamic skills from human data and flexibly compose them to solve new, unseen tasks on a physical robot.
- Main Contributions / Findings (What): The paper makes three core contributions to address these gaps:
- Scalable Motion Tracking: An open-source framework for training robust, high-quality tracking policies for highly dynamic motions (e.g., jumping spins, cartwheels) on a real humanoid robot. A single set of hyperparameters is used for all motions, demonstrating scalability.
- Guided Diffusion for Humanoids: The first successful demonstration of a loss-guided diffusion policy for real-world humanoid whole-body control. This allows the robot to perform diverse new tasks at test time (e.g., obstacle avoidance, teleoperation) by simply providing a corresponding cost function, without needing to retrain the policy.
- End-to-End Framework: A complete, practical pipeline from raw motion capture data to hardware deployment, including motion tracking policy training, distillation into a diffusion model, and real-world deployment on a Unitree G1 humanoid.
Figure: Overview of the BeyondMimic method, showing the pipeline of training motion-tracking policies, performing robust offline distillation to obtain a state-action diffusion model, and then guiding that model with navigation and velocity-control objectives to accomplish diverse tasks.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Motion Tracking: A technique in robotics and animation where a simulated or physical character is trained to imitate a sequence of poses from a reference motion (e.g., from a motion capture dataset). This is typically framed as a Reinforcement Learning (RL) problem, where the agent receives rewards for minimizing the difference between its state and the reference state. DeepMimic is a foundational paper in this area.
- Sim-to-Real Transfer: The process of transferring a control policy trained in a physics simulator to a physical robot. This is a major challenge because simulators are imperfect approximations of reality; a "sim-to-real gap" exists due to unmodeled dynamics, sensor noise, and hardware limitations.
- Diffusion Models: A class of generative models that learn to create data by reversing a noise-addition process. They start with random noise and iteratively "denoise" it over several steps to produce a clean, structured output (like an image or, in this case, a trajectory of robot states and actions).
- Guided Diffusion: A powerful feature of diffusion models where the generation (denoising) process can be steered or "guided" at test time towards a desired outcome. This is done by incorporating an external signal, such as a cost function gradient, into the update step, pushing the model's output to satisfy certain constraints or goals (e.g., "move towards this waypoint").
- Asymmetric Actor-Critic: An RL technique where the actor (the policy) has access only to information available on the real robot, while the critic (which estimates the value of states) can access privileged information from the simulator (e.g., true body positions). This helps the critic learn more accurately, which in turn improves the training of the actor, without making the final policy dependent on information unavailable in the real world. A minimal sketch of this split appears at the end of this section.
- Previous Works & Differentiation:
- On Motion Tracking: Early works required extensive reward engineering for each specific task. DeepMimic showed that learning from reference motions could produce more natural behaviors. However, subsequent robotics works like ASAP and KungfuBot focused on single, short, specialized motions, requiring motion-specific tuning. More recent "multi-motion" trackers (OmniH2O, HumanPlus, GMT) aimed to learn from diverse datasets but often suffered from degraded motion quality, jitter, or were limited to low-dynamic motions like walking. BeyondMimic differentiates itself by achieving state-of-the-art motion quality for highly dynamic skills on a real humanoid using a single, unified training pipeline.
- On Diffusion in Robotics: Diffusion models have been explored in three main ways:
- Kinematic Planners: Plan motions as kinematic sequences, which are then passed to a separate low-level controller. This often fails due to a "planning-control gap," where the planned motion is physically impossible for the controller to execute.
- End-to-End Policies (Diffusion Policy): Learn a direct state-to-action mapping. While robust, these policies cannot be adapted to new tasks at test time without being retrained.
- Joint State-Action Diffusion: Model a joint distribution over future states and actions. This allows for guidance, but previous attempts (Diffuser, Decision Diffuser) showed limited robustness or were confined to simulation. Diffuse-CLoC showed strong performance in simulation. BeyondMimic is the first to successfully apply this guided, joint state-action diffusion approach to a complex, real-world humanoid, demonstrating robust and flexible control on hardware.
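To make the asymmetric actor-critic split concrete, here is a minimal PyTorch sketch. This is an illustration only, not the paper's architecture; all dimensions, layer sizes, and names are placeholders. The actor sees only observations deployable on hardware, while the critic additionally consumes privileged simulator state.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only.
PROPRIO_DIM = 64      # observations available on the real robot
PRIVILEGED_DIM = 32   # simulator-only signals (e.g., true body poses)

class Actor(nn.Module):
    """Policy network: consumes only deployable observations."""
    def __init__(self, act_dim: int = 29):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PROPRIO_DIM, 256), nn.ELU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, proprio: torch.Tensor) -> torch.Tensor:
        return self.net(proprio)

class Critic(nn.Module):
    """Value network: may additionally see privileged simulator state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PROPRIO_DIM + PRIVILEGED_DIM, 256), nn.ELU(),
            nn.Linear(256, 1),
        )

    def forward(self, proprio: torch.Tensor, privileged: torch.Tensor) -> torch.Tensor:
        # Privileged inputs sharpen value estimates during training only;
        # the deployed actor never depends on them.
        return self.net(torch.cat([proprio, privileged], dim=-1))
```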
4. Methodology (Core Technology & Implementation)
The BeyondMimic framework is composed of two main stages: (1) training robust motion tracking policies for a wide variety of skills, and (2) distilling these skills into a single guided diffusion policy for versatile task execution.
Part 1: Scalable Motion Tracking
The goal is to create policies that can robustly track long and diverse human motion capture clips on a physical robot.
- Tracking Objective: To avoid issues with global drift, the system does not track the absolute world-frame position of the robot. Instead, it selects an anchor body (e.g., the torso) and tracks the reference motion relative to it. This allows the robot to maintain the style of the motion even if its global position deviates from the reference. The desired twists (velocities) of all bodies are tracked directly from the reference.
- Observations: The policy's input (observation) is a vector containing:
  - Reference Phase: Joint positions and velocities from the future reference motion, providing context on what is coming next.
  - Anchor Pose-Tracking Error: The position and orientation error of the anchor body relative to the reference. This provides a signal to correct drift and maintain balance.
  - Proprioception: The robot's own state, including root velocity, joint positions, joint velocities, and the previously executed action (to encourage smoothness).
- Joint Impedance and Actions: A key insight for successful sim-to-real transfer is the use of low joint impedance (low stiffness and damping gains). This contrasts with animation work that uses high gains for precise tracking. Low gains make the robot more compliant, better at absorbing impacts, and less sensitive to sensor noise. The policy action $a_t$ is a normalized offset from a nominal joint configuration, yielding the target $q^{\text{target}} = q^{\text{nominal}} + \alpha \, a_t$ (with action scale $\alpha$). This target position is then used by a low-level PD controller to compute the final joint torques (see the first sketch after this list).
- Rewards: The reward function is simple and general.
  - Task Reward: Encourages tracking of body poses and velocities. For each body, position, rotation, linear-velocity, and angular-velocity errors are calculated. These errors are squared, averaged across all target bodies, and passed through an exponential function to create a reward that is high when the error is low. The total task reward is the sum of these four components:
    $$r_{\text{task}} = \sum_{\star} \exp\left(-\bar{e}_{\star} / \sigma_{\star}\right),$$
    where $\star$ represents position ($p$), rotation ($\theta$), linear velocity ($v$), or angular velocity ($\omega$), $\bar{e}_{\star}$ is the corresponding mean squared tracking error, and $\sigma_{\star}$ is a tracking tolerance.
  - Regularization Penalties: Small penalties are added to discourage undesirable behaviors:
    - Joint limit penalty: keeps joints within their safe operating range.
    - Action rate penalty: encourages smooth actions to prevent jitter.
    - Self-collision penalty: discourages the robot's limbs from colliding with each other.
- Adaptive Sampling: To train efficiently on long motion sequences with varying difficulty, the framework uses adaptive sampling. The motion trajectory is divided into bins, and bins where the robot is more likely to fail are sampled more frequently. This focuses the training effort on the most challenging parts of the motion, leading to faster convergence and more robust policies. The sampling probability for a bin is calculated based on smoothed failure rates, mixed with a uniform distribution to ensure all parts of the motion are still visited (see the second sketch below).
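As referenced above, here is a minimal sketch of the low-impedance action-to-torque mapping. The gains, action scale, and nominal configuration are illustrative placeholders, not the paper's values.

```python
import numpy as np

# Illustrative values only, not the paper's actual gains.
KP = 40.0                 # low stiffness gain
KD = 2.0                  # low damping gain
ACTION_SCALE = 0.5        # maps normalized actions to radians
Q_NOMINAL = np.zeros(29)  # nominal joint configuration (placeholder)

def action_to_torque(action, q, dq):
    """Map a normalized policy action to joint torques via a low-gain PD loop.

    action : normalized offset in [-1, 1] per joint
    q, dq  : measured joint positions and velocities
    """
    q_target = Q_NOMINAL + ACTION_SCALE * action  # target joint positions
    tau = KP * (q_target - q) - KD * dq           # PD law, zero velocity target
    return tau
```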
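And a sketch of the adaptive sampling computation described in the last item: smoothed per-bin failure rates are normalized into sampling probabilities, then mixed with a uniform floor. The window size and mixing weight here are assumptions.

```python
import numpy as np

def bin_sampling_probs(failure_rates, smooth_window=5, uniform_mix=0.1):
    """Turn per-bin failure rates into start-state sampling probabilities."""
    rates = np.asarray(failure_rates, dtype=float)
    kernel = np.ones(smooth_window) / smooth_window
    smoothed = np.convolve(rates, kernel, mode="same")  # smooth failure rates
    uniform = np.full(len(rates), 1.0 / len(rates))
    if smoothed.sum() == 0.0:                           # no failures observed yet
        return uniform
    probs = smoothed / smoothed.sum()                   # normalize to a distribution
    # Mix in a uniform floor so every bin keeps a nonzero visit probability.
    return (1.0 - uniform_mix) * probs + uniform_mix * uniform

# Example: bins covering a cartwheel fail more often and get sampled more.
probs = bin_sampling_probs([0.05, 0.10, 0.60, 0.70, 0.10])
start_bin = np.random.choice(len(probs), p=probs)
```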
Part 2: Trajectory Synthesis via Guided Diffusion
After training expert tracking policies, their behavior is distilled into a single, unified policy that can be guided at test time.
- Training: The framework uses Diffuse-CLoC, which models a joint distribution over future states and actions. The model is trained to predict a future trajectory $\tau$ over a horizon $H$, conditioned on a history $O$ of past states and actions. The training follows a standard denoising diffusion process, where the model learns to predict a clean trajectory from a noisy one by minimizing the Mean Squared Error (MSE) loss
  $$\mathcal{L} = \mathbb{E}_{\tau, k}\left[\left\| \hat{\tau}_{\theta}(\tau_k, k, O) - \tau \right\|^2\right],$$
  where $\hat{\tau}_{\theta}$ is the network that predicts the clean trajectory from a noised version $\tau_k$ at noise level $k$ (a training-step sketch follows this list).
- Guidance: This is the core mechanism for zero-shot control. At inference time, the denoising process is "guided" by a task-specific cost function $\mathcal{J}(\tau)$. The gradient of this cost function, $\nabla_{\tau} \mathcal{J}(\tau)$, is added to the update step, pushing the generated trajectory towards one that minimizes the cost. This allows a single trained model to solve many different tasks without retraining (see the guided-update sketch after this list).
- Downstream Tasks: The paper demonstrates this with three example cost functions:
- Joystick Steering: The cost penalizes the difference between the robot's predicted planar velocity and the goal velocity from a joystick.
- Waypoint Navigation: The cost encourages the robot to move towards a goal position while also penalizing its velocity as it gets closer, ensuring it stops upon arrival; the velocity penalty is weighted by a function of the distance $d$ to the goal so that it dominates near the target.
- Obstacle Avoidance: The cost uses a Signed Distance Field (SDF) to measure the distance to obstacles and penalizes the robot's body parts for getting too close, using a relaxed barrier function to create a smooth repulsive force.
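As a rough illustration of the denoising objective above, the following sketch shows one training step under standard DDPM-style forward noising with a network that directly predicts the clean trajectory. The batching, tensor layout, and `model` interface are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def diffusion_mse_loss(model, traj_clean, history, noise_schedule):
    """One training step of the denoising objective (clean-trajectory prediction).

    traj_clean     : ground-truth future state-action trajectory [B, H, D]
    history        : past states and actions the model conditions on
    noise_schedule : per-level cumulative alpha-bar values, shape [K]
    """
    B = traj_clean.shape[0]
    k = torch.randint(0, len(noise_schedule), (B,))      # random noise level
    abar = noise_schedule[k].view(B, 1, 1)
    eps = torch.randn_like(traj_clean)
    # Forward noising: x_k = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
    traj_k = abar.sqrt() * traj_clean + (1 - abar).sqrt() * eps
    traj_pred = model(traj_k, k, history)                # predict clean trajectory
    return F.mse_loss(traj_pred, traj_clean)
```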
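And a schematic of a cost-guided update, in the spirit of loss-guided diffusion: the gradient of a task cost, evaluated on the predicted clean trajectory, nudges the sample toward low-cost plans. This omits the reverse-process noise schedule and assumes a tensor layout (planar root position and velocity in the first four channels), so it is a sketch of the mechanism rather than the paper's implementation.

```python
import torch

def waypoint_cost(traj, goal_xy, vel_weight=1.0):
    """Illustrative waypoint cost: distance to goal plus a velocity penalty
    that grows as the robot nears the goal, so it stops on arrival."""
    pos = traj[..., :2]             # assumed planar root positions in the plan
    vel = traj[..., 2:4]            # assumed planar root velocities
    dist = torch.norm(pos - goal_xy, dim=-1)
    near = torch.exp(-dist)         # ~1 near the goal, ~0 far away
    return (dist + vel_weight * near * vel.norm(dim=-1)).sum()

def guided_denoise_step(model, traj_k, k, history, goal_xy, guide_scale=0.1):
    """One cost-guided denoising update.

    The model predicts the clean trajectory from the noisy one; the cost
    gradient then steers that prediction before the next noise level."""
    traj_k = traj_k.detach().requires_grad_(True)
    traj_0 = model(traj_k, k, history)           # predicted clean trajectory
    cost = waypoint_cost(traj_0, goal_xy)
    grad = torch.autograd.grad(cost, traj_k)[0]  # cost gradient w.r.t. noisy input
    return traj_0 - guide_scale * grad           # nudge toward lower cost
```

Swapping `waypoint_cost` for a joystick-velocity or SDF-based obstacle cost changes the task without touching the trained model, which is the point of the guidance mechanism.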
5. Experimental Setup
- Hardware: All real-world experiments are conducted on a Unitree G1 humanoid robot.
- Datasets:
- For motion tracking, the LAFAN1 dataset, a large-scale motion capture dataset with diverse human movements, was used.
- For training the diffusion policy, a subset of AMASS and LAFAN1 containing various walking motions was used to generate an offline dataset by running the expert tracking policies.
- Evaluation Metrics:
- Motion Tracking: Performance is mainly evaluated qualitatively through successful real-world execution of challenging motion clips. Table I lists the specific segments tested.
- Diffusion Policy: Performance is measured quantitatively using Success Rate.
- Conceptual Definition: This metric represents the percentage of trials in which the robot successfully completes a task without falling. A fall is defined as the robot's head height dropping below a threshold (0.2m).
- Mathematical Formula: $\text{Success Rate} = \frac{N_{\text{successful trials}}}{N_{\text{total trials}}} \times 100\%$
- Baselines: The primary comparison in the paper is an internal ablation study for the diffusion policy, comparing two different state representations:
  - Body-Pos State: Represents the robot's state using the Cartesian positions and velocities of its body parts.
  - Joint-Pos State: Represents the robot's state using its joint angles and velocities.
6. Results & Analysis
Core Results: Motion Tracking
- The motion tracking framework demonstrates exceptional performance. In simulation, it successfully learns to track all 25 long, diverse motion sequences selected from the LAFAN1 dataset.
- On the real Unitree G1 robot, the policies show remarkable sim-to-real transfer, successfully executing a wide array of motions that are categorized as:
- Short Dynamic Motions: Successfully replicates motions like the "Cristiano Ronaldo celebration" and a side kick, even repeating them in succession.
- Static Balance Motions: Achieves challenging balance poses like a single-leg stand.
- Extremely Dynamic Motions: Pushes the state-of-the-art by executing motions previously undemonstrated on similar hardware, including cartwheels, forward sprints, and jumping 360° spins.
- Stylized Motions: Faithfully reproduces the stylistic and expressive elements of motions like the Moonwalk, Charleston dance, and elderly-style walking.
- The results from Table I (transcribed below) show the breadth of motions tested, with many full sequences or long clips successfully executed on the real robot.
Manual Transcription of Table I: Motion Segments Tested in Sim and Real.
| Name | Sim | Real [s] |
| --- | --- | --- |
| **Short Sequences** | | |
| Cristiano Ronaldo [14] | Full | Full |
| Side Kick [15] | Full | Full |
| Single Leg Balance [16] | Full | Full |
| Swallow Balance [16] | Full | Full |
| **LAFAN1 [43] (about 3 minutes each)** | | |
| walk1_subject1 | Full | [0.0, 33.0], [81.2, 86.7] |
| walk1_subject2 | Full | |
| walk1_subject5 | Full | [146.7, 159.0], [206.7, 263.7] |
| walk2_subject1 | Full | |
| walk2_subject3 | Full | [42.7, 75.7], [217.6, 230.6] |
| walk2_subject4 | Full | [154.4, 164.4], [218.6, 238.6] |
| dance1_subject1 | Full | [0.0, 118.0] |
| dance1_subject2 | Full | Full |
| dance1_subject3 | Full | - |
| dance2_subject1 | Full | |
| dance2_subject2 | Full | |
| dance2_subject3 | Full | [43.1, 163.1], [164.3, 184.3] |
| dance2_subject4 | Full | [156.3, end] |
| fallAndGetUp1_subject4 | Full | |
| fallAndGetUp2_subject2 | Full | [0.0, 21.0], [74.0, 91.2] |
| fallAndGetUp2_subject3 | Full | [94.0, 109.0], [26.5, 46.5] |
| run1_subject2 | Full | [0.0, 50.0] |
| run1_subject4 | Full | |
| run1_subject5 | Full | |
| run2_subject1 | Full | [0.0, 11.0], [167.4, 204.4] |
| jumps1_subject1 | Full | [24.3, 42.3], [71.6, 81.6], [205.5, 226.5] |
| jumps1_subject2 | Full | |
| jumps1_subject5 | Full | |
| fightAndSports1_subject1 | Full | [16.8, 25.4] |
| fightAndSports1_subject4 | Full | [201.6, end] |
| fight1_subject2 | Full | |
| fight1_subject3 | Full | |
| fight1_subject5 | Full | |
Ablations / Parameter Sensitivity
- Adaptive Sampling: The ablation study in Table II shows that adaptive sampling is critical. Without it, training on long sequences with difficult parts (like cartwheels or jump-spins) fails to converge; with it, these sequences converge successfully. It also significantly accelerates training for shorter motions.

  Manual Transcription of Table II: Ablation on Iterations to Convergence with Adaptive Sampling (AS).

  | Motion | w/o AS | w/ AS |
  | --- | --- | --- |
  | Cristiano Ronaldo [14] | 3k | 1.5k |
  | Swallow Balance [16] | 2.8k | 1.8k |
  | dance1_subject1 | Failed (cartwheel) | 8k |
  | dance2_subject1 | Failed (jump-spinning) | 9k |
  | fightAndSports1_subject1 | Failed (balance) | 10k |
Diffusion State Representation: The ablation in Table III is crucial for the guided diffusion policy. The results show that representing the state with Cartesian body positions (
Body-Pos State) significantly outperforms using joint angles (Joint-Pos State). TheJoint-Posrepresentation achieved only 72% success on a walking-with-perturbations task and completely failed (0% success) at joystick control. The authors hypothesize this is because small errors in joint angle predictions can accumulate through the kinematic chain, leading to large end-effector errors, making the policy less robust.Manual Transcription of Table III: Success Rate on Diffusion State Representation Ablation.
State Representation Walk + Perturb Joystick Control Body-Pos State 100% 80% Joint-Rot State 72% 0%
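The error-accumulation hypothesis is easy to visualize with a toy planar kinematic chain: small, independent joint-angle errors compound along the chain into a much larger end-effector offset. The chain, link length, and noise magnitude below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def end_effector_xy(joint_angles, link_len=0.3):
    """Planar forward kinematics for a serial chain of equal-length links."""
    theta = np.cumsum(joint_angles)       # absolute orientation of each link
    x = np.sum(link_len * np.cos(theta))
    y = np.sum(link_len * np.sin(theta))
    return np.array([x, y])

rng = np.random.default_rng(0)
q = rng.uniform(-0.5, 0.5, size=6)        # a 6-link chain pose
noise = rng.normal(0.0, 0.02, size=6)     # ~1 degree of error per joint

ee_clean = end_effector_xy(q)
ee_noisy = end_effector_xy(q + noise)
# Per-joint errors of ~0.02 rad yield a centimeter-scale end-effector shift,
# which a direct Cartesian (Body-Pos) representation avoids by construction.
print(np.linalg.norm(ee_noisy - ee_clean))
```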
7. Conclusion & Reflections
- Conclusion Summary: BeyondMimic presents a significant step forward for versatile humanoid control. It successfully bridges the gap between high-fidelity motion tracking and flexible, goal-directed behavior on real hardware. The framework's two key components—a scalable and robust motion tracking pipeline and a guided diffusion policy for zero-shot task composition—provide a practical and powerful foundation for creating general-purpose humanoids that can learn from human demonstration.
- Limitations & Future Work: The authors acknowledge several areas for future work:
- State Estimation: The system still relies on state estimators that can fail or drift during complex contact scenarios (like getting up from the ground), highlighting the need for more robust, all-terrain state estimation.
- Skill Transitions: The guided diffusion policy struggles to transition between very distinct skills (e.g., from walking to crawling). This is a fundamental limitation of current approaches, and improving the ability to smoothly compose distant skills on the learned manifold is a key future direction.
- Out-of-Distribution Behavior: An interesting observation is that the diffusion policy exhibits "inert" and safe behavior when it fails (e.g., it tends to freeze rather than flail), which is a desirable property for real-world deployment.
- Personal Insights & Critique:
- Significance: The primary achievement of this paper is the successful integration of multiple cutting-edge techniques into a single, cohesive framework that works on a real, complex humanoid robot. The quality and dynamism of the demonstrated motions (especially cartwheels and spin jumps) are state-of-the-art for learning-based control on this class of hardware. The open-sourcing of the tracking pipeline is a major contribution to the community.
- Critique and Open Questions:
- The diffusion policy still requires an offboard GPU and has a 20 ms inference latency. While fast for a diffusion model, this could be a limitation for tasks requiring even faster reactions.
- The system still depends on a curated, offline dataset of expert trajectories generated from motion capture data. The next frontier is to learn these skills more directly, perhaps from video or with less reliance on perfect expert data.
- The "skill transition" problem is a fundamental challenge for generative models in control. It raises the question of whether a single, monolithic model is the right approach, or if a hierarchical or modular architecture would be better for composing a truly vast library of skills.