
KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

Published: 06/15/2025

TL;DR Summary

This paper presents KungfuBot, a physics-based humanoid control framework that learns highly-dynamic human behaviors such as Kungfu and dance through multi-step motion processing and adaptive motion tracking, achieving significantly lower tracking errors than prior methods and deploying successfully on a real robot.

Abstract

Humanoid robots are promising to acquire various skills by imitating human behaviors. However, existing algorithms are only capable of tracking smooth, low-speed human motions, even with delicate reward and curriculum design. This paper presents a physics-based humanoid control framework, aiming to master highly-dynamic human behaviors such as Kungfu and dancing through multi-steps motion processing and adaptive motion tracking. For motion processing, we design a pipeline to extract, filter out, correct, and retarget motions, while ensuring compliance with physical constraints to the maximum extent. For motion imitation, we formulate a bi-level optimization problem to dynamically adjust the tracking accuracy tolerance based on the current tracking error, creating an adaptive curriculum mechanism. We further construct an asymmetric actor-critic framework for policy training. In experiments, we train whole-body control policies to imitate a set of highly-dynamic motions. Our method achieves significantly lower tracking errors than existing approaches and is successfully deployed on the Unitree G1 robot, demonstrating stable and expressive behaviors. The project page is https://kungfu-bot.github.io.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

1.2. Authors

Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, Xuelong Li

  • Affiliations:
    • Institute of Artificial Intelligence (TeleAI), China Telecom (Weiji Xie, Jinrui Han, Chenjia Bai, Xuelong Li, Jiyuan Shi, Huanyu Li, Xinzhe Liu, Jiakun Zheng)
    • Shanghai Jiao Tong University (Weiji Xie, Jinrui Han, Weinan Zhang)
    • East China University of Science and Technology (Jiakun Zheng)
    • Harbin Institute of Technology (Huanyu Li)
    • ShanghaiTech University (Xinzhe Liu)
  • Research Background: The authors come from prominent academic institutions and a corporate AI institute, indicating a strong background in artificial intelligence, robotics, and potentially telecommunications, with a focus on areas like reinforcement learning, motion control, and humanoid robotics.

1.3. Journal/Conference

The paper is published as a preprint on arXiv. Given its scope and rigor, it is likely submitted to a top-tier conference or journal in robotics, AI, or machine learning; the inclusion of the NeurIPS Paper Checklist suggests NeurIPS, a highly reputable machine-learning conference, as the target venue.

1.4. Publication Year

2025 (posted on arXiv on June 15, 2025)

1.5. Abstract

This paper introduces KungfuBot, a physics-based humanoid control framework designed to enable humanoid robots to master highly-dynamic human behaviors, such as Kungfu and dancing. Current algorithms struggle with such dynamic motions, typically only tracking smooth, low-speed movements. KungfuBot addresses this through a two-stage approach: a multi-step motion processing pipeline and adaptive motion tracking. The motion processing pipeline extracts, filters, corrects, and retargets human motions, ensuring physical constraint compliance. For motion imitation, it formulates a bi-level optimization problem to dynamically adjust a tracking accuracy tolerance (called the tracking factor) based on the current tracking error, creating an adaptive curriculum. An asymmetric actor-critic framework is used for policy training. Experimental results show that KungfuBot significantly reduces tracking errors compared to existing methods and demonstrates stable and expressive behaviors when deployed on a Unitree G1 robot, including complex Kungfu and dancing movements.

https://arxiv.org/abs/2506.12851

  • Publication Status: Preprint (on arXiv).

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: Humanoid robots face significant challenges in accurately and stably imitating highly-dynamic human behaviors like Kungfu and dancing. Existing Reinforcement Learning (RL)-based whole-body control algorithms are generally limited to tracking smooth, low-speed human motions, even with sophisticated reward and curriculum design.
  • Importance: Enabling humanoid robots to mimic diverse and dynamic human skills is crucial for their potential applications in various tasks, from physical assistance and rehabilitation to entertainment and education. The human-like morphology of these robots makes them ideal candidates for human-robot interaction (HRI) and tasks requiring human-like dexterity and movement.
  • Challenges/Gaps:
    1. Physical Feasibility: Human motion capture (MoCap) data often contains movements that are physically impossible or unsafe for a robot due to differences in kinematics, dynamics, and joint limits. Directly using this data for RL training can lead to policies that are not feasible.
    2. Tracking Accuracy for Dynamics: Existing methods lack robust mechanisms to handle the inherent difficulty and variability of highly-dynamic motions, often leading to poor tracking performance or instability.
    3. Sim-to-Real Transfer: Bridging the gap between simulated training environments and real-world robot deployment remains a significant hurdle.
  • Paper's Entry Point/Innovative Idea: The paper proposes a comprehensive physics-based control framework that integrates multi-step motion processing to ensure physical plausibility of reference motions and an adaptive motion tracking mechanism to dynamically adjust the reward tolerance for varying motion difficulties, enabling the learning of highly-dynamic skills.

2.2. Main Contributions / Findings

  • Primary Contributions:
    1. Physics-Based Motion Processing Pipeline: Design and implementation of a pipeline to extract, filter out, correct, and retarget human motions from videos. This pipeline ensures that reference motions comply with the robot's physical constraints and includes novel physics-based metrics for filtering and contact-aware motion correction.
    2. Adaptive Motion Tracking Mechanism: Formulation of a bi-level optimization problem to derive an optimal tracking factor ($\sigma^*$) and development of an adaptive mechanism that dynamically adjusts this factor during RL training based on tracking error. This creates an adaptive curriculum that tightens the tracking accuracy tolerance as the policy improves.
    3. Asymmetric Actor-Critic Framework: Construction of an asymmetric actor-critic architecture that utilizes reward vectorization and privileged information for the critic to enhance value estimation, while the actor relies on local observations.
    4. Demonstrated Real-World Capabilities: Successful deployment of learned policies on a Unitree G1 robot, showcasing stable and expressive execution of complex, highly-dynamic skills like Kungfu and dancing.
  • Key Conclusions/Findings:
    1. The physics-based motion filtering effectively identifies and excludes untrackable motions from the dataset, leading to more efficient and effective policy learning.
    2. The proposed method (PBHC) significantly outperforms existing approaches (e.g., OmniH2O, ExBody2) in simulation across various tracking error metrics for easy, medium, and hard motions.
    3. The adaptive motion tracking mechanism is crucial for achieving superior tracking precision, dynamically adjusting to motion characteristics where fixed tracking factors would lead to suboptimal performance.
    4. The trained policies exhibit robust sim-to-real transfer, allowing for direct deployment on the Unitree G1 robot with performance metrics closely aligned with simulation results.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Humanoid Robots: Robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Their human-like morphology enables them to interact with human-designed environments and perform tasks that humans do. The Unitree G1 is an example of such a robot.
  • Degrees of Freedom (DoFs): In robotics, this refers to the number of independent parameters that define the configuration of a mechanical system. For a humanoid robot, this includes translational DoFs for the base (e.g., x, y, z position) and rotational DoFs for the base and each joint (e.g., pitch, roll, yaw). The Unitree G1 has 23 DoFs for control, excluding its hands.
  • Whole-Body Control (WBC): A control strategy that coordinates all DoFs of a robot (including the base, torso, arms, and legs) simultaneously to achieve complex tasks while respecting physical constraints (e.g., joint limits, torque limits, balance, contact forces).
  • Motion Capture (MoCap): Technology used to digitally record the movement of people or objects. MoCap systems produce motion data that can be used to animate digital models or serve as reference motions for robots to imitate.
  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns a policy—a mapping from observed states to actions.
    • Markov Decision Process (MDP): A mathematical framework for modeling decision-making in RL. An MDP is defined by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$, where:
      • $\mathcal{S}$ is the set of possible states.
      • $\mathcal{A}$ is the set of possible actions.
      • $\mathcal{R}$ is the reward function, defining the immediate reward an agent receives for transitioning from one state to another.
      • $\mathcal{P}$ is the transition function (or dynamics), defining the probability of moving to a new state given a current state and action.
      • $\gamma$ is the discount factor, which determines the present value of future rewards.
    • Actor-Critic Methods: A class of RL algorithms that combine two components: an actor (which learns the policy directly) and a critic (which estimates the value function to evaluate the actor's actions).
      • Asymmetric Actor-Critic: A variation where the actor and critic have different observation spaces. Typically, the critic has access to privileged information (e.g., environmental parameters, internal states not directly observable by the robot) to learn a better value function, which then guides the actor (who only sees proprioceptive and exteroceptive observations) in learning a robust policy. This helps in sim-to-real transfer.
    • Proximal Policy Optimization (PPO): A popular on-policy RL algorithm that aims to improve sample efficiency and stability compared to earlier methods. It updates the policy by taking multiple small steps, ensuring that the new policy does not deviate too much from the old one, which helps prevent destructive updates.
  • Skinned Multi-Person Linear (SMPL) Model: A widely used statistical 3D human body model that represents body shape and pose with a low-dimensional parameter space.
    • Parameters: SMPL uses:
      • $\beta \in \mathbb{R}^{10}$ for body shapes (e.g., height, weight, build).
      • $\pmb{\theta} \in \mathbb{R}^{24 \times 3}$ for joint rotations (represented as axis-angle or rotation matrices).
      • $\boldsymbol{\psi} \in \mathbb{R}^{3}$ for global translation (position in 3D space).
    • Mapping to Mesh: These parameters are mapped to a 3D mesh (a collection of 3D vertices) via a differentiable skinning function $M(\cdot)$, producing $\mathcal{V} = M(\beta, \pmb{\theta}, \boldsymbol{\psi}) \in \mathbb{R}^{6890 \times 3}$.
  • Inverse Kinematics (IK): A method used in robotics and computer graphics to determine the joint parameters (angles) required to achieve a desired position and orientation for an end-effector (e.g., hand, foot). Differential IK uses derivatives to solve this problem iteratively.
  • Center of Mass (CoM) and Center of Pressure (CoP):
    • CoM: The unique point where the weighted relative positions of the distributed mass sum to zero. It's the average position of all the mass that makes up the object.
    • CoP: The point on the ground where the total ground reaction force acts. For stable standing or walking, the CoM projection should ideally stay within the base of support (the area enclosed by the CoP). The proximity of CoM and CoP indicates stability.
  • Exponential Moving Average (EMA): A type of moving average that places a greater weight and significance on the most recent data points. It is often used to smooth noisy data or to create a dynamically updating estimate of a parameter.

3.2. Previous Works

The paper contextualizes its work by discussing limitations of previous RL-based whole-body control frameworks for motion tracking, particularly their inability to handle highly-dynamic motions.

  • H2O and OmniH2O [9, 10]: These methods attempt to address physical feasibility issues by removing infeasible motions from datasets using a trained privileged imitation policy. While they contribute to cleaning motion data, they often still struggle with highly dynamic movements due to inherent difficulties in tracking and a lack of suitable tolerance mechanisms. OmniH2O specifically focuses on universal and dexterous human-to-humanoid whole-body teleoperation and learning.
  • ExBody [7] and ExBody2 [5]:
    • ExBody constructs a feasible motion dataset by filtering via language labels.
    • ExBody2 trains an initial policy on all motions and uses the tracking error to measure motion difficulty, aiming to optimize the dataset. However, this process can be costly and may not find an optimal dataset, and also lacks tolerance mechanisms for difficult motions.
  • ASAP [6]: Aligning Simulation and Real-world Physics for Learning Agile Humanoid Whole-body Skills. This method introduces a multi-stage mechanism and learns a residual policy to compensate for the sim-to-real gap, which helps in tracking agile motions. Unlike ASAP which focuses on the sim-to-real gap, KungfuBot focuses on improving motion feasibility and agility entirely within simulation.
  • MaskedMimic [21, 23]: Unified Physics-based Character Control through Masked Motion Inpainting. This method focuses on character animation and directly optimizes pose-level accuracy without explicit regularization of physical plausibility. While it performs well in character animation contexts, it is not directly deployable for robot control because it does not account for robot-specific constraints like partial observability and action smoothness. The paper uses MaskedMimic as a baseline for comparison but notes its limitations for direct robot deployment.

3.3. Technological Evolution

Early humanoid motion imitation efforts often relied on direct kinematic mapping or Inverse Kinematics (IK) controllers, which struggled with dynamic stability and physical constraints. The advent of Reinforcement Learning (RL) brought about physics-based controllers that could learn stable locomotion and basic movements (e.g., DeepMimic [21]). However, directly imitating complex human motions remained challenging due to the sim-to-real gap and the physical differences between humans and robots. Subsequent works focused on motion filtering (H2O, ExBody) to make human MoCap data more robot-feasible. More recently, approaches like ASAP have tackled agile motions by addressing the sim-to-real gap through residual policies.

KungfuBot represents an evolution by refining both the data preparation (physics-based processing) and the RL training loop (adaptive reward tolerance), allowing for the imitation of more extreme and dynamic motions that previous methods struggled with. It moves beyond simply filtering data to actively adapt the learning process to the inherent difficulty of dynamic movements.

3.4. Differentiation Analysis

KungfuBot differentiates itself from previous methods primarily in two key areas:

  1. Comprehensive Physics-Based Motion Processing:

    • Prior Work: Methods like H2O and ExBody primarily focus on filtering infeasible motions or constructing datasets based on labels. ASAP addresses sim-to-real gap for agile motions.
    • KungfuBot's Innovation: KungfuBot introduces a more holistic multi-step pipeline that not only filters (physics-based motion filtering using CoM-CoP stability criteria) but also corrects (contact-aware motion correction with EMA smoothing) and retargets motions. This ensures a higher degree of physical plausibility and quality for the reference motions before RL training, directly tackling out-of-distribution issues from HMR models and floating artifacts. This processing is done entirely in simulation, contrasting with ASAP's focus on sim-to-real compensation.
  2. Adaptive Motion Tracking with Optimal Factor Derivation:

    • Prior Work: Most RL-based imitation learning methods use fixed reward functions and tracking factors (e.g., OmniH2O, ExBody2). While ExBody2 attempts to measure motion difficulty, it lacks a dynamic tolerance mechanism. The MaskedMimic approach, while effective for character animation, does not prioritize physical plausibility for robot control and thus uses different optimization objectives.

    • KungfuBot's Innovation: KungfuBot introduces a novel adaptive mechanism for dynamically adjusting the tracking factor ($\sigma$) within the exponential reward function. This is grounded in a bi-level optimization problem that theoretically derives the optimal tracking factor as the average of the optimal tracking error. By using an Exponential Moving Average (EMA) to estimate this error online and iteratively tightening $\sigma$, KungfuBot creates an adaptive curriculum that allows the policy to progressively improve its precision for motions of varying difficulty. This mechanism is a significant departure from fixed reward weighting or tolerance parameters, enabling superior performance on highly-dynamic skills.

      In summary, KungfuBot combines sophisticated physics-based motion preprocessing with an intelligent, adaptive RL reward mechanism, explicitly designed to overcome the limitations of prior work in handling the physical and dynamic complexities of highly-dynamic human behaviors for humanoid robots.

4. Methodology

The Physics-Based Humanoid motion Control (PBHC) framework is designed to enable humanoid robots to master highly-dynamic human behaviors. It operates through a two-stage process: motion processing and motion imitation.

The following figure (Figure 1 from the original paper) provides an overview of PBHC:

Figure 1: An overview of PBHC that includes three core components: (a) motion extraction from videos and multi-step motion processing, (b) adaptive motion tracking based on the optimal tracking factor, (c) the RL training framework and sim-to-real deployment.

As illustrated, the process begins with raw human videos (a). These videos undergo Human Motion Recovery (HMR) to produce SMPL-format motion sequences. These sequences are then refined through physics-based motion filtering and contact-aware motion correction, ensuring physical plausibility. The refined motions are then retargeted to the G1 robot to serve as reference motions. In the second stage (b), an adaptive motion tracking mechanism dynamically adjusts the tracking factor based on an optimal tracking factor derived from a bi-level optimization problem. Finally, the policies are trained using an RL training framework (c) and deployed on the real Unitree G1 robot.

4.1. Principles

The core idea behind PBHC is to systematically address the challenges of dynamic motion imitation in humanoid robots by:

  1. Ensuring Physical Feasibility: Pre-processing human MoCap data to guarantee that the reference motions are physically executable by the robot, considering its kinematics, dynamics, and contact interactions.
  2. Adaptive Learning: Dynamically adjusting the reward tolerance during Reinforcement Learning based on the agent's current performance and the inherent difficulty of the motion. This allows the agent to gradually refine its tracking precision without being prematurely penalized for small errors in complex movements, effectively creating an adaptive curriculum.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Motion Processing Pipeline

This pipeline converts raw human video into physically plausible and robot-executable reference motions for the G1 robot.

4.2.1.1. Motion Estimation from Videos

The process begins by extracting human motion from monocular videos.

  • Model Used: GVHMR [15] (Gravity-View Human Motion Recovery) is employed.
  • Key Features of GVHMR:
    • It estimates SMPL-format motions.
    • It introduces a gravity-view coordinate system, which naturally aligns motions with gravity, addressing body tilt issues that can arise from camera-centric reconstructions.
    • It mitigates foot sliding artifacts by predicting non-stationary probabilities, improving motion quality.
  • Output: SMPL parameters $(\beta, \pmb{\theta}, \boldsymbol{\psi})$, where $\beta \in \mathbb{R}^{10}$ represents body shapes, $\pmb{\theta} \in \mathbb{R}^{24 \times 3}$ represents joint axis-angle rotations, and $\boldsymbol{\psi} \in \mathbb{R}^{3}$ represents global translation. These parameters can be mapped to a 3D mesh $\mathcal{V} = M(\beta, \pmb{\theta}, \boldsymbol{\psi}) \in \mathbb{R}^{6890 \times 3}$ via a differentiable skinning function $M(\cdot)$.

4.2.1.2. Physics-based Motion Filtering

Motions extracted by HMR models can still contain physical and biomechanical constraint violations due to reconstruction inaccuracies and out-of-distribution issues. This step filters out such motions.

  • Stability Criterion: The stability of a motion frame is assessed based on the proximity of the Center of Mass (CoM) and Center of Pressure (CoP).
    • Let $\bar{\pmb{p}}_t^{\mathrm{CoM}} = (p_{t,x}^{\mathrm{CoM}}, p_{t,y}^{\mathrm{CoM}})$ and $\bar{\pmb{p}}_t^{\mathrm{CoP}} = (p_{t,x}^{\mathrm{CoP}}, p_{t,y}^{\mathrm{CoP}})$ be the projected coordinates of the CoM and CoP on the ground at frame $t$, respectively.
    • The distance between these projections is denoted $\Delta d_t$.
    • The stability criterion for a frame $t$ is defined as: $$\Delta d_t = \lVert \bar{\pmb{p}}_t^{\mathrm{CoM}} - \bar{\pmb{p}}_t^{\mathrm{CoP}} \rVert_2 < \epsilon_{\mathrm{stab}}$$
      • $\Delta d_t$: The Euclidean distance between the 2D projections of the CoM and CoP on the ground at time $t$.
      • $\bar{\pmb{p}}_t^{\mathrm{CoM}}$: The 2D projection of the Center of Mass on the ground at time $t$.
      • $\bar{\pmb{p}}_t^{\mathrm{CoP}}$: The 2D projection of the Center of Pressure on the ground at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm (distance).
      • $\epsilon_{\mathrm{stab}}$: A stability threshold, empirically chosen (e.g., 0.1 as per Table 3). If the distance is below this threshold, the frame is considered stable.
  • Motion Sequence Stability: For an $N$-frame motion sequence, let $B = [t_0, t_1, \dots, t_K]$ be the increasingly sorted list of frame indices that satisfy the stability criterion (Eq. 1). A motion sequence is considered stable if it meets two conditions:
    1. Boundary-frame stability: The first frame ($1 \in B$) and the last frame ($N \in B$) must be stable.
    2. Maximum instability gap: The maximum length of consecutive unstable frames must be less than a threshold $\epsilon_N$, i.e., $\max_k \,(t_{k+1} - t_k) < \epsilon_N$ (e.g., 100 as per Table 3).
  • Benefit: This filtering effectively removes motions that are inherently untrackable or dynamically unstable for a robot.
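
To make the filtering rule concrete, here is a minimal Python sketch of the two acceptance checks, assuming the per-frame 2D ground projections of the CoM and CoP have already been computed from the motion (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def frame_is_stable(com_xy, cop_xy, eps_stab=0.1):
    """Stability criterion: ground-plane distance between the CoM and CoP projections."""
    return np.linalg.norm(np.asarray(com_xy) - np.asarray(cop_xy)) < eps_stab

def sequence_is_stable(com_xy_seq, cop_xy_seq, eps_stab=0.1, eps_n=100):
    """Accept a motion if (1) its first and last frames are stable and
    (2) the largest gap between consecutive stable frames is below eps_n."""
    stable = np.array([frame_is_stable(c, p, eps_stab)
                       for c, p in zip(com_xy_seq, cop_xy_seq)])
    if not (stable[0] and stable[-1]):               # boundary-frame stability
        return False
    stable_idx = np.flatnonzero(stable)              # sorted indices t_0 < ... < t_K
    max_gap = np.max(np.diff(stable_idx), initial=0) # maximum instability gap
    return max_gap < eps_n
```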

4.2.1.3. Motion Correction based on Contact Mask

This step refines motions by explicitly accounting for foot-ground contact.

  • Contact Mask Estimation: Contact masks are estimated by analyzing ankle displacement across consecutive frames, based on the zero-velocity assumption for feet in contact.
    • Let $\pmb{p}_t^{\text{l-ankle}} \in \mathbb{R}^3$ be the position of the left ankle at time $t$, and $c_t^{\mathrm{left}} \in \{0, 1\}$ be its corresponding contact mask (1 for contact, 0 for no contact).
    • The contact mask is estimated as: $$c_t^{\mathrm{left}} = \mathbb{I}\big[\|\pmb{p}_{t+1}^{\text{l-ankle}} - \pmb{p}_t^{\text{l-ankle}}\|_2^2 < \epsilon_{\mathrm{vel}}\big] \cdot \mathbb{I}\big[p_{t,z}^{\text{l-ankle}} < \epsilon_{\mathrm{height}}\big]$$
      • $c_t^{\mathrm{left}}$: The binary contact mask for the left foot at time $t$.
      • $\mathbb{I}[\cdot]$: The indicator function, which equals 1 if the condition inside is true, and 0 otherwise.
      • $\|\pmb{p}_{t+1}^{\text{l-ankle}} - \pmb{p}_t^{\text{l-ankle}}\|_2^2$: The squared Euclidean distance of the left ankle's displacement between frames $t$ and $t+1$. This checks for near-zero velocity.
      • $\epsilon_{\mathrm{vel}}$: An empirically chosen velocity threshold (e.g., 0.002 as per Table 3). If the displacement is below this, the velocity is considered zero.
      • $p_{t,z}^{\text{l-ankle}}$: The vertical ($z$) coordinate of the left ankle at time $t$.
      • $\epsilon_{\mathrm{height}}$: An empirically chosen height threshold (e.g., 0.2 as per Table 3). This ensures the foot is near the ground.
    • A similar process is applied for the right foot.
  • Correction Step: To address minor floating artifacts (where the feet appear to hover above the ground), a vertical offset is applied to the global translation if a foot is in contact.
    • Let $\psi_t$ denote the global translation of the pose at time $t$.
    • The corrected vertical position is: $$\psi_{t,z}^{\mathrm{corr}} = \psi_{t,z} - \Delta h_t$$
      • $\psi_{t,z}^{\mathrm{corr}}$: The corrected vertical component of the global translation at time $t$.
      • $\psi_{t,z}$: The original vertical component of the global translation at time $t$.
      • $\Delta h_t = \operatorname*{min}_{v \in \mathcal{V}_t} p_{t,z}^v$: The lowest $z$-coordinate among all SMPL mesh vertices $\mathcal{V}_t$ at frame $t$. This effectively brings the lowest point of the SMPL mesh to the ground.
  • Smoothing: This correction can introduce frame-to-frame jitter. To counter this, Exponential Moving Average (EMA) is applied to smooth the motion.
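
A rough Python sketch of the contact-mask heuristic and the floating-artifact correction is given below; the array layout and the exact placement of the EMA smoothing are assumptions for illustration (the paper only states that EMA is applied to counter the jitter introduced by the correction):

```python
import numpy as np

def estimate_contact_mask(ankle_pos, eps_vel=0.002, eps_height=0.2):
    """Zero-velocity heuristic for one foot; ankle_pos has shape (T, 3)."""
    disp_sq = np.sum(np.diff(ankle_pos, axis=0) ** 2, axis=1)  # squared per-frame displacement
    near_static = disp_sq < eps_vel
    near_ground = ankle_pos[:-1, 2] < eps_height
    contact = (near_static & near_ground).astype(np.int64)
    return np.append(contact, contact[-1])                     # repeat the last value for the final frame

def correct_floating(trans_z, lowest_vertex_z, contact_any_foot, ema_alpha=0.9):
    """Subtract the lowest SMPL-vertex height on contact frames, smoothed with an
    EMA to avoid frame-to-frame jitter (the EMA placement here is an assumption)."""
    corrected = np.array(trans_z, dtype=float)
    smoothed = 0.0
    for t in range(len(trans_z)):
        offset = lowest_vertex_z[t] if contact_any_foot[t] else smoothed
        smoothed = ema_alpha * smoothed + (1 - ema_alpha) * offset
        corrected[t] = trans_z[t] - smoothed
    return corrected
```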

4.2.1.4. Motion Retargeting

The processed SMPL-format motions are then adapted to the target robot's kinematics.

  • Method: An Inverse Kinematics (IK)-based method [19] is used.
  • Process: This involves formulating a differentiable optimization problem that aligns the end-effector trajectories of the SMPL model with the robot's end-effectors while respecting the robot's joint limits.
  • Data Augmentation: To enhance diversity, additional motions from open-source datasets like AMASS [4] and LAFAN [20] are also processed through this pipeline.

4.2.2. Adaptive Motion Tracking

This mechanism dynamically adjusts how strictly the robot must adhere to the reference motion during RL training.

4.2.2.1. Exponential Form Tracking Reward

The reward function in PBHC consists of two parts: task-specific rewards (for motion tracking) and regularization rewards (for stability and smoothness).

  • Task-Specific Rewards: These include terms for aligning joint states, rigid body states, and foot contact masks.

  • Exponential Form: Most task-specific rewards (except foot contact tracking) follow an exponential form: $$r(x) = \exp(-x / \sigma)$$

    • $r(x)$: The reward value for a given tracking error.
    • $x$: The tracking error, typically measured as the Mean Squared Error (MSE) of quantities like joint angles. A lower $x$ means better tracking.
    • $\sigma$: The tracking factor, which controls the tolerance of the error. A larger $\sigma$ means more tolerance (rewards remain high even for larger errors), while a smaller $\sigma$ means less tolerance (rewards drop sharply with increasing error).
  • Why Exponential Form? It is preferred over a simple negative error because it is bounded (the maximum reward is 1), helps stabilize training, and offers an intuitive way to adjust reward weighting via $\sigma$.

    The following figure (Figure 2 from the original paper) illustrates the effect of the tracking factor $\sigma$ on the reward value:

    Figure 2: Illustration of the effect of the tracking factor $\sigma$ on the reward value. The horizontal axis is the tracking error $x$ and the vertical axis is the reward $r(x) = \exp(-x/\sigma)$, with curves for $\sigma = 0.2$, $1.0$, and $5.0$.

As seen in the graph, for a fixed tracking error $x$, a larger $\sigma$ (blue curve, $\sigma = 5.0$) yields a higher reward, indicating more tolerance. A smaller $\sigma$ (red curve, $\sigma = 0.2$) causes the reward to drop more sharply for the same error, requiring higher precision.
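
As a quick numerical illustration of this tolerance effect (a small sketch, not from the paper's codebase):

```python
import numpy as np

def tracking_reward(error, sigma):
    """Exponential tracking reward r(x) = exp(-x / sigma)."""
    return np.exp(-error / sigma)

# The same tracking error of 0.5 is rewarded very differently under the
# three tracking factors plotted in Figure 2.
for sigma in (0.2, 1.0, 5.0):
    print(f"sigma={sigma:>3}: r(0.5) = {tracking_reward(0.5, sigma):.3f}")
# sigma=0.2: r(0.5) = 0.082   (tight tolerance, sharp penalty)
# sigma=1.0: r(0.5) = 0.607
# sigma=5.0: r(0.5) = 0.905   (loose tolerance, reward stays high)
```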

4.2.2.2. Optimal Tracking Factor

To determine the ideal $\sigma$, the problem of motion tracking is modeled as a bi-level optimization (BLO) problem [11].

  • Intuition: The goal is to find a $\sigma$ that leads to the minimum accumulated tracking error from the converged policy. This mimics how a human engineer might iteratively tune $\sigma$ and observe the results.

  • Problem Formulation:

    • Let $\pi$ be a policy and $\pmb{x} \in \mathbb{R}_+^N$ be the sequence of expected tracking errors over $N$ steps of an episode rollout.
    • The inner (lower-level) optimization represents the RL training procedure: given a fixed $\sigma$, the policy aims to maximize its accumulated reward: $$\operatorname*{max}_{\pmb{x} \in \mathbb{R}_+^N} J^{\mathrm{in}}(\pmb{x}, \sigma) + R(\pmb{x})$$
      • $J^{\mathrm{in}}(\pmb{x}, \sigma) = \sum_{i=1}^N \exp(-x_i / \sigma)$: The simplified accumulated reward from the exponential tracking term.
      • $R(\pmb{x})$: Represents other reward components and environment dynamics, including regularization rewards and external policy objectives.
    • The outer (upper-level) optimization then selects the optimal $\sigma$ to minimize the total tracking error of the final converged policy: $$\operatorname*{max}_{\sigma \in \mathbb{R}_+} \; J^{\mathrm{ex}}(\pmb{x}^*) \quad \mathrm{s.t.} \quad \pmb{x}^* \in \arg \operatorname*{max}_{\pmb{x} \in \mathbb{R}_+^N} J^{\mathrm{in}}(\pmb{x}, \sigma) + R(\pmb{x})$$
      • $J^{\mathrm{ex}}(\pmb{x}^*) = \sum_{i=1}^N -x_i^*$: The negative of the accumulated tracking error of the optimal error sequence $\pmb{x}^*$ from the inner problem. Maximizing this is equivalent to minimizing the accumulated tracking error.
  • Derivation of Optimal $\sigma^*$ (from Appendix A): Assuming $R(\pmb{x})$ is linear, $J^{\mathrm{in}}$ and $J^{\mathrm{ex}}$ are twice continuously differentiable, and the lower-level problem has a unique solution $\pmb{x}^*(\sigma)$, an implicit gradient approach can be used.

    1. Gradient of $J^{\mathrm{ex}}$ w.r.t. $\sigma$: $$\frac{d J^{\mathrm{ex}}}{d \sigma} = \frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top \nabla_{\pmb{x}} J^{\mathrm{ex}}(\pmb{x}^*(\sigma))$$

      • $\frac{d J^{\mathrm{ex}}}{d \sigma}$: The total derivative of the external objective with respect to $\sigma$.
      • $\frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top$: The transpose of the derivative of the optimal error sequence with respect to $\sigma$.
      • $\nabla_{\pmb{x}} J^{\mathrm{ex}}(\pmb{x}^*(\sigma))$: The gradient of the external objective with respect to the error sequence, evaluated at the optimal error sequence.
    2. Using the KKT condition for the lower-level problem: Since $\pmb{x}^*(\sigma)$ is the solution to the inner maximization problem, its gradient must be zero: $$\nabla_{\pmb{x}} \big( J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma) + R(\pmb{x}) \big) = 0$$

    3. Taking the first-order derivative of the KKT condition w.r.t. $\sigma$: $$\frac{d}{d \sigma} \Big( \nabla_{\pmb{x}} \big( J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma) + R(\pmb{x}) \big) \Big) = \nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}} + \frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top \nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}} = 0$$ This equation relates the mixed partial derivatives of $J^{\mathrm{in}}$ to the derivative of $\pmb{x}^*(\sigma)$ with respect to $\sigma$.

    4. Solving for $\frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top$: $$\frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top = - \nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma) \, \nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma)^{-1}$$

      • $\nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}}$: The mixed second-order partial derivative of $J^{\mathrm{in}}$ with respect to $\sigma$ and $\pmb{x}$.
      • $\nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}}$: The second-order partial derivative (Hessian) of $J^{\mathrm{in}}$ with respect to $\pmb{x}$.
    5. Substituting back into the gradient of $J^{\mathrm{ex}}$: $$\frac{d J^{\mathrm{ex}}}{d \sigma} = - \nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma) \, \nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma)^{-1} \nabla_{\pmb{x}} J^{\mathrm{ex}}(\pmb{x}^*(\sigma))$$

    6. Explicit forms of $J^{\mathrm{ex}}$ and $J^{\mathrm{in}}$: $$J^{\mathrm{ex}}(\pmb{x}) = \sum_{i=1}^N -x_i, \qquad J^{\mathrm{in}}(\pmb{x}, \sigma) = \sum_{i=1}^N \exp(-x_i / \sigma).$$

    7. Compute first- and second-order gradients: $$\begin{aligned} \nabla_{\pmb{x}} J^{\mathrm{in}}(\pmb{x}, \sigma) &= -\tfrac{1}{\sigma}\exp(-\pmb{x} / \sigma), \\ \nabla_{\pmb{x}} J^{\mathrm{ex}}(\pmb{x}) &= -\mathbf{1}, \\ \nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}, \sigma) &= \tfrac{\sigma - \pmb{x}}{\sigma^3} \odot \exp(-\pmb{x} / \sigma), \\ \nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}, \sigma) &= \mathrm{diag}(\exp(-\pmb{x} / \sigma)) / \sigma^2, \end{aligned}$$

      • $\exp(-\pmb{x} / \sigma)$: An element-wise exponential function applied to the vector $-\pmb{x} / \sigma$.
      • $-\tfrac{1}{\sigma}$: A scalar multiplier.
      • $\mathbf{1}$: A vector of ones; since $J^{\mathrm{ex}}$ sums the negated errors, its gradient is $-\mathbf{1}$.
      • $\odot$: Element-wise multiplication.
      • $\mathrm{diag}(\cdot)$: Creates a diagonal matrix from a vector.
    8. Setting the gradient to zero ($\frac{d J^{\mathrm{ex}}}{d \sigma} = 0$) and simplifying: This leads to the conclusion that the optimal tracking factor $\sigma^*$ is the average of the optimal tracking errors: $$\sigma^* = \frac{1}{N} \sum_{i=1}^N x_i^*$$

      • $\sigma^*$: The optimal tracking factor.
      • $N$: The number of steps in the motion sequence.
      • $x_i^*$: The optimal tracking error at the $i$-th step.
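
For clarity, step 8 can be expanded using the gradients from step 7 (the diagonal Hessian makes the matrix inverse element-wise): $$\frac{d J^{\mathrm{ex}}}{d \sigma} = -\sum_{i=1}^N \frac{\sigma - x_i^*}{\sigma^3} e^{-x_i^*/\sigma} \cdot \frac{\sigma^2}{e^{-x_i^*/\sigma}} \cdot (-1) = \sum_{i=1}^N \frac{\sigma - x_i^*}{\sigma} = N - \frac{1}{\sigma} \sum_{i=1}^N x_i^*,$$ and setting this to zero yields $\sigma^* = \frac{1}{N}\sum_{i=1}^N x_i^*$.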

4.2.2.3. Adaptive Mechanism

Since $\sigma^*$ and $\pmb{x}^*$ are inter-dependent, direct computation of $\sigma^*$ is not possible. Also, a single fixed $\sigma$ is impractical for diverse motions. An adaptive mechanism is designed to dynamically adjust $\sigma$ during training.

  • Online Estimation: An Exponential Moving Average (EMA) $\hat{x}$ of the instantaneous tracking error is maintained over environment steps. This $\hat{x}$ serves as an online estimate of the expected tracking error for the current policy.

  • Feedback Loop: During training, PBHC updates $\sigma$ to the current value of $\hat{x}$. This creates a closed-loop feedback system: as the tracking error decreases, $\hat{x}$ decreases, which in turn leads to a tightening of $\sigma$. This process drives further policy refinement.

  • Update Rule: To ensure stability, $\sigma$ is constrained to be non-increasing and initialized with a relatively large value ($\sigma^{\mathrm{init}}$): $$\sigma \gets \operatorname*{min}(\sigma, \hat{x})$$

    • $\sigma$: The tracking factor that is being adapted.
    • $\hat{x}$: The Exponential Moving Average (EMA) of the instantaneous tracking error.
    • $\operatorname*{min}(\cdot, \cdot)$: The minimum function, ensuring $\sigma$ only decreases or stays the same.
  • Benefit: This adaptive process allows the policy to progressively improve its tracking precision over time, as illustrated in the following figure (Figure 4 from the original paper).

    Figure 4: Example of the right hand $y$-position for 'Horse-stance punch'. The adaptive $\sigma$ can progressively improve the tracking precision. $\sigma_{\mathrm{pos\_vr}}$ is used for tracking the head and hands.

This graph shows the right hand $y$-position for the 'Horse-stance punch' motion. The adaptive $\sigma$ (blue line) starts high and gradually decreases, leading to tighter tracking of the reference (black line) over more training steps. In contrast, a fixed $\sigma$ (red line) does not allow for this progressive improvement.

The following figure (Figure 3 from the original paper) depicts the closed-loop adjustment of the tracking factor in the adaptive mechanism:

Figure 3: Closed-loop adjustment of the tracking factor in the proposed adaptive mechanism.

The diagram shows a continuous loop: the tracking error $\hat{x}$ (estimated online) determines the tracking factor $\sigma$. A reduction in $\hat{x}$ leads to a tightening of $\sigma$ (via $\sigma \gets \min(\sigma, \hat{x})$). The tightened $\sigma$ then reshapes the reward function $r(x)$, which in turn drives policy optimization (to maximize rewards), leading to a further reduction in the tracking error $\hat{x}$. This feedback loop allows the system to converge toward the optimal $\sigma$.
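
The update logic can be summarized in a few lines of Python; the initial value and the EMA decay used here are placeholder values, not the paper's settings:

```python
import numpy as np

class AdaptiveTrackingFactor:
    """EMA estimate of the tracking error drives a non-increasing sigma,
    following the rule sigma <- min(sigma, x_hat)."""

    def __init__(self, sigma_init=1.0, ema_decay=0.999):
        self.sigma = sigma_init    # start loose (sigma_init is an illustrative value)
        self.x_hat = sigma_init    # EMA of the instantaneous tracking error
        self.ema_decay = ema_decay

    def update(self, tracking_error):
        # Online EMA of the per-step tracking error (e.g., joint-position MSE).
        self.x_hat = self.ema_decay * self.x_hat + (1 - self.ema_decay) * tracking_error
        # Tighten the tolerance only when the policy has actually improved.
        self.sigma = min(self.sigma, self.x_hat)
        return self.sigma

    def reward(self, tracking_error):
        return np.exp(-tracking_error / self.sigma)
```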

4.2.3. RL Training Framework

4.2.3.1. Asymmetric Actor-Critic

The policy optimization uses an asymmetric actor-critic architecture, typical in RL for sim-to-real transfer.

  • Time Phase Variable: A time phase variable $\phi_t \in [0, 1]$ is introduced, representing the linear progress of the reference motion (0 at start, 1 at end).
  • Actor's Observation Space ($s_t^{\mathrm{actor}}$): The actor operates with local observations. It receives:
    • Proprioceptive state ($s_t^{\mathrm{prop}}$) of the robot: This includes historical information (last 5 steps) of joint positions ($\pmb{q}_t$), joint velocities ($\dot{\pmb{q}}_t$), root angular velocity ($\boldsymbol{\omega}_t^{\mathrm{root}}$), root projected gravity ($\pmb{g}_t^{\mathrm{proj}}$), and the action from the previous step ($\pmb{a}_{t-1}$).
    • The time phase variable $\phi_t$.
  • Critic's Observation Space ($s_t^{\mathrm{critic}}$): The critic uses privileged information to learn a better value function. Its observations include:
    • All components of the actor's observation ($s_t^{\mathrm{prop}}$ and the time phase).
    • Additionally, reference motion positions, root linear velocity, and a set of randomized physical parameters (e.g., base CoM offset, link mass, stiffness, damping, friction coefficient, control delay). These privileged parameters are crucial for learning a robust value function that can generalize across different physical conditions, aiding sim-to-real transfer.

4.2.3.2. Reward Vectorization

  • To facilitate learning for multiple reward components, rewards and value functions are vectorized.
  • Instead of summing all reward components into a single scalar, each component $r_i$ is assigned to a separate value function $V_i(s)$.
  • The critic network has multiple output heads, each estimating the return for a specific reward component.
  • All value functions are then aggregated to compute the action advantage. This design enhances value estimation and stability in policy optimization.
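
A plausible sketch of this aggregation, assuming per-component GAE with the advantages summed before the PPO update (the paper only states that the value functions are aggregated to compute the action advantage):

```python
import numpy as np

def aggregate_advantages(rewards, values, next_values, gamma=0.99, lam=0.95):
    """Vectorized-reward advantage estimation: each of the K reward components has
    its own value head; per-component GAE advantages are summed into one scalar
    advantage per step. Shapes: rewards, values, next_values are all (T, K).
    Episode-termination masks are omitted for brevity."""
    T, K = rewards.shape
    adv = np.zeros((T, K))
    gae = np.zeros(K)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_values[t] - values[t]  # per-head TD error
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv.sum(axis=1)                                       # scalar advantage per step
```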

4.2.3.3. Reference State Initialization (RSI)

  • The robot's state is initialized from reference motion states at randomly sampled time phases.
  • This technique allows for parallel learning across different segments (phases) of a motion, significantly improving training efficiency by avoiding repetitive learning from the very beginning of a motion every time.
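
A minimal sketch of RSI, with a hypothetical reference-motion dictionary layout (the keys are illustrative, not the paper's data format):

```python
import numpy as np

def reference_state_init(ref_motion, rng=np.random.default_rng()):
    """Sample a random time phase and initialize the robot from that reference frame."""
    num_frames = len(ref_motion["qpos"])
    phase = rng.uniform(0.0, 1.0)              # time phase in [0, 1]
    frame = int(phase * (num_frames - 1))
    init_state = {
        "qpos": ref_motion["qpos"][frame],     # joint positions
        "qvel": ref_motion["qvel"][frame],     # joint velocities
        "root_pos": ref_motion["root_pos"][frame],
        "root_rot": ref_motion["root_rot"][frame],
    }
    return phase, init_state
```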

4.2.3.4. Sim-to-Real Transfer

  • Domain Randomization: To bridge the sim-to-real gap, domain randomization is applied during training. This involves varying physical parameters of the simulated environment and humanoid model (e.g., friction, mass, CoM offset, control delay, external perturbations). By training policies that are robust to these variations, the policies become more generalized and perform well on the real robot, even with slight mismatches between simulation and reality.
  • Zero-Shot Transfer: The policies trained with domain randomization are directly deployed to real robots without any fine-tuning, achieving zero-shot sim-to-real transfer.

5. Experimental Setup

5.1. Datasets

The experiments utilize a highly-dynamic motion dataset constructed using PBHC's motion processing pipeline.

  • Sources:

    1. Video-based sources: Motions extracted from videos and fully processed by the PBHC pipeline.
    2. Open-source datasets: Selected motions from AMASS [4] and LAFAN [20], partially processed through the PBHC pipeline (contact mask estimation, motion correction, retargeting).
  • Characteristics: The dataset comprises 13 distinct motions, categorized into three difficulty levels: easy, medium, and hard, based on their agility requirements.

  • Transition Smoothing: To ensure smooth transitions, linear interpolation is applied at the beginning and end of each sequence to move from a default pose to the reference motion and back.

    The following figure (Figure 5 from the original paper) shows examples of motions in the constructed dataset:

    Figure 5: Example motions in our constructed dataset. Darker opacity indicates later timestamps.

As shown, motions range from relatively simple horse-stance punch to more complex stretch leg, jump kick, and 360-degree spin, with darker opacity indicating later timestamps in the motion trajectory.

The following are the details of the highly-dynamic motion dataset (Table 4 from the original paper):

Motion name Motion frames Source
Easy
Jabs punch 285 video
Hooks punch 175 video
Horse-stance pose 210 LAFAN
Horse-stance punch 200 video
Medium
Stretch leg 320 video
Tai Chi 500 video
Jump kick 145 video
Charleston dance 610 LAFAN
Bruce Lee's pose 330 AMASS
Hard
Roundhouse kick 158 AMASS
360-degree spin 180 video
Front kick 155 video
Side kick 179 AMASS

5.2. Evaluation Metrics

The tracking performance of policies is quantified using various error metrics, focusing on position, velocity, and acceleration errors across different body parts and joints. For a motion with $N$ frames, the expectation $\mathbb{E}[\cdot]$ is taken over all body parts/joints and all frames where the robot is active.

  1. Global Mean Per Body Position Error ($E_{\mathrm{g-mpbpe}}$, mm)

    • Conceptual Definition: This metric quantifies the average position error of all body parts (links) of the robot in global coordinates relative to the reference motion. It measures how far, on average, the robot's body is from the target positions in the global frame.
    • Mathematical Formula: $$E_{\mathrm{g-mpbpe}} = \mathbb{E} \left[ \left\| \pmb{p}_t - \pmb{p}_t^{\mathrm{ref}} \right\|_2 \right]$$
    • Symbol Explanation:
      • $E_{\mathrm{g-mpbpe}}$: Global Mean Per Body Position Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all body parts and time steps.
      • $\pmb{p}_t$: The 3D position vector of a specific body part of the robot at time $t$.
      • $\pmb{p}_t^{\mathrm{ref}}$: The 3D position vector of the corresponding body part in the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm (L2 norm), representing the distance between the robot's body part and the reference.
  2. Root-Relative Mean Per Body Position Error ($E_{\mathrm{mpbpe}}$, mm)

    • Conceptual Definition: This metric measures the average position error of body parts relative to the robot's root (e.g., pelvis or base). It focuses on the internal posture and configuration error, decoupling it from any global drift of the entire robot.
    • Mathematical Formula: $$E_{\mathrm{mpbpe}} = \mathbb{E} \left[ \left\| \left( \pmb{p}_t - \pmb{p}_{\mathrm{root},t} \right) - \left( \pmb{p}_t^{\mathrm{ref}} - \pmb{p}_{\mathrm{root},t}^{\mathrm{ref}} \right) \right\|_2 \right]$$
    • Symbol Explanation:
      • $E_{\mathrm{mpbpe}}$: Root-relative Mean Per Body Position Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all body parts and time steps.
      • $\pmb{p}_t$: The 3D position vector of a specific body part of the robot at time $t$.
      • $\pmb{p}_{\mathrm{root},t}$: The 3D position vector of the robot's root at time $t$.
      • $\pmb{p}_t^{\mathrm{ref}}$: The 3D position vector of the corresponding body part in the reference motion at time $t$.
      • $\pmb{p}_{\mathrm{root},t}^{\mathrm{ref}}$: The 3D position vector of the reference motion's root at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  3. Mean Per Joint Position Error ($E_{\mathrm{mpjpe}}$, $10^{-3}$ rad)

    • Conceptual Definition: This metric quantifies the average angular error for each joint's position (angle) across the robot's Degrees of Freedom. It directly measures how closely the robot's joint configurations match the reference.
    • Mathematical Formula: $$E_{\mathrm{mpjpe}} = \mathbb{E} \left[ \left\| \pmb{q}_t - \pmb{q}_t^{\mathrm{ref}} \right\|_2 \right]$$
    • Symbol Explanation:
      • $E_{\mathrm{mpjpe}}$: Mean Per Joint Position Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all joints and time steps.
      • $\pmb{q}_t$: The vector of joint angles for the robot at time $t$.
      • $\pmb{q}_t^{\mathrm{ref}}$: The vector of joint angles for the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  4. Mean Per Joint Velocity Error ($E_{\mathrm{mpjve}}$, $10^{-3}$ rad/frame)

    • Conceptual Definition: This metric measures the average error in the angular velocities of the robot's joints compared to the reference motion. It indicates how well the robot matches the speed of joint movements.
    • Mathematical Formula: $$E_{\mathrm{mpjve}} = \mathbb{E} \left[ \left\| \Delta \pmb{q}_t - \Delta \pmb{q}_t^{\mathrm{ref}} \right\|_2 \right], \quad \text{where } \Delta \pmb{q}_t = \pmb{q}_t - \pmb{q}_{t-1}$$
    • Symbol Explanation:
      • $E_{\mathrm{mpjve}}$: Mean Per Joint Velocity Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all joints and time steps.
      • $\Delta \pmb{q}_t$: The vector of joint angular velocities for the robot at time $t$, approximated as the difference in joint positions between $t$ and $t-1$.
      • $\Delta \pmb{q}_t^{\mathrm{ref}}$: The vector of joint angular velocities for the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  5. Mean Per Body Velocity Error ($E_{\mathrm{mpbve}}$, mm/frame)

    • Conceptual Definition: This metric measures the average error in the linear velocities of the robot's body parts compared to the reference motion. It assesses how accurately the robot matches the speed of its body segments' movements.
    • Mathematical Formula: $$E_{\mathrm{mpbve}} = \mathbb{E} \left[ \left\| \Delta \pmb{p}_t - \Delta \pmb{p}_t^{\mathrm{ref}} \right\|_2 \right], \quad \text{where } \Delta \pmb{p}_t = \pmb{p}_t - \pmb{p}_{t-1}$$
    • Symbol Explanation:
      • $E_{\mathrm{mpbve}}$: Mean Per Body Velocity Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all body parts and time steps.
      • $\Delta \pmb{p}_t$: The vector of linear velocities for a body part of the robot at time $t$, approximated as the difference in positions between $t$ and $t-1$.
      • $\Delta \pmb{p}_t^{\mathrm{ref}}$: The vector of linear velocities for the corresponding body part in the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  6. Mean Per Body Acceleration Error ($E_{\mathrm{mpbae}}$, mm/frame$^2$)

    • Conceptual Definition: This metric measures the average error in the linear accelerations of the robot's body parts compared to the reference motion. It is particularly important for dynamic movements, indicating how well the robot matches the rates of change of velocities.
    • Mathematical Formula: $$E_{\mathrm{mpbae}} = \mathbb{E} \left[ \left\| \Delta^2 \pmb{p}_t - \Delta^2 \pmb{p}_t^{\mathrm{ref}} \right\|_2 \right], \quad \text{where } \Delta^2 \pmb{p}_t = \Delta \pmb{p}_t - \Delta \pmb{p}_{t-1}$$
    • Symbol Explanation:
      • $E_{\mathrm{mpbae}}$: Mean Per Body Acceleration Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all body parts and time steps.
      • $\Delta^2 \pmb{p}_t$: The vector of linear accelerations for a body part of the robot at time $t$, approximated as the difference in velocities ($\Delta \pmb{p}_t$) between $t$ and $t-1$.
      • $\Delta^2 \pmb{p}_t^{\mathrm{ref}}$: The vector of linear accelerations for the corresponding body part in the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  7. Mean Foot Contact Mask Error ($E_{\mathrm{contact-mask}}$) (Introduced in Appendix E.3)

    • Conceptual Definition: This metric quantifies the average discrepancy between the robot's estimated foot contact state and the reference motion's foot contact state. It directly measures how accurately the robot's foot contacts (or lack thereof) match the intended contacts in the reference.
    • Mathematical Formula: $$E_{\mathrm{contact-mask}} = \mathbb{E} \left[ \| c_t - \hat{c}_t \|_1 \right]$$
    • Symbol Explanation:
      • $E_{\mathrm{contact-mask}}$: Mean Foot Contact Mask Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all feet and time steps.
      • $c_t$: The actual (or estimated) binary contact mask for the robot's foot (1 for contact, 0 for no contact) at time $t$.
      • $\hat{c}_t$: The binary contact mask for the reference motion's foot at time $t$.
      • $\lVert \cdot \rVert_1$: The Manhattan norm (L1 norm), summing the absolute differences.
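
For the position-based metrics, a small reference sketch is given below; the array shapes, the assumption that positions are in meters and joint angles in radians, and the averaging convention (which follows the formulas above literally) are illustrative choices, not the paper's evaluation code:

```python
import numpy as np

def position_tracking_metrics(body_pos, body_pos_ref, root_pos, root_pos_ref, q, q_ref):
    """body_pos, body_pos_ref: (T, B, 3) link positions in meters;
    root_pos, root_pos_ref: (T, 3); q, q_ref: (T, J) joint angles in radians."""
    # Global per-body position error, converted to mm.
    e_g_mpbpe = np.linalg.norm(body_pos - body_pos_ref, axis=-1).mean() * 1e3
    # Root-relative per-body position error, converted to mm.
    rel = body_pos - root_pos[:, None, :]
    rel_ref = body_pos_ref - root_pos_ref[:, None, :]
    e_mpbpe = np.linalg.norm(rel - rel_ref, axis=-1).mean() * 1e3
    # Joint-position error: per-frame L2 norm over the joint vector,
    # averaged over frames, reported in 1e-3 rad.
    e_mpjpe = np.linalg.norm(q - q_ref, axis=-1).mean() * 1e3
    return {"E_g-mpbpe": e_g_mpbpe, "E_mpbpe": e_mpbpe, "E_mpjpe": e_mpjpe}
```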

5.3. Baselines

PBHC is compared against three baseline methods, all of which use an exponential form for the tracking reward function (similar to PBHC's design in Section 3.2.1), but with empirically tuned and fixed parameters rather than adaptive ones.

  1. OmniH2O [10]:

    • Description: This method adopts a teacher-student training paradigm. The teacher policy learns in a privileged setting, and the student policy learns from the teacher's outputs.
    • Implementation: The authors moderately increased the tracking reward weights to better match the G1 robot. The teacher and student policies were trained for 20 and 10 hours, respectively.
  2. ExBody2 [5]:

    • Description: This method utilizes a decoupled keypoint-velocity tracking mechanism. It focuses on expressive whole-body control.
    • Implementation: Similar to OmniH2O, teacher and student policies were trained for 20 and 10 hours, respectively.
  3. MaskedMimic [2]:

    • Description: This method is primarily designed for character animation and focuses on optimizing pose-level accuracy. It comprises three sequential training phases.
    • Implementation: The authors utilized only the first phase of MaskedMimic as the full method is not pertinent to robot control tasks (lacks constraints like partial observability and action smoothness). Each policy was trained for 18 hours.
    • Note: The paper also trains an "Oracle" version of PBHC that, like MaskedMimic, overlooks robot-specific constraints for a fair comparison of pose-level accuracy in a less constrained setting.

5.4. Experiment Setup Details (from Appendix D.1, C.3, C.4, C.5, C.6)

5.4.1. Compute Platform

  • Hardware: Each experiment was conducted on a machine equipped with:
    • CPU: 24-core Intel i7-13700 running at 5.2GHz.
    • RAM: 32 GB.
    • GPU: Single NVIDIA GeForce RTX 4090.
  • Operating System: Ubuntu 20.04.
  • Training Time: Each model was trained for 27 hours.

5.4.2. Real Robot Setup

  • Robot: Unitree G1 humanoid robot.
  • System Architecture:
    • Onboard motion control board: Collects sensor data.
    • External PC: Connected via Ethernet, receives sensor data (using DDS protocol), maintains observation history, performs policy inference, and sends target joint angles back to the control board.
    • Control board: Issues motor commands based on received target joint angles.

5.4.3. Domain Randomization (from Appendix C.3)

To improve sim-to-real transfer and robustness, domain randomization is incorporated. The following are the domain randomization settings (Table 7 from the original paper):

Term Value
Dynamics Randomization
Friction U(0.2, 1.2)
PD gain U(0.9, 1.1)
Link mass(kg) U(0.9, 1.1)× default
Ankle inertia(kg.m2) U(0.9, 1.1)× default
Base CoM offset(m) U(-0.05, 0.05)
ERFI[58](N·m/kg) 0.05× torque limit
Control delay(ms) U(0, 40)
External Perturbation
Random push interval(s) [5, 10]
Random push velocity(m/s) 0.1
  • $U(a, b)$: Uniform distribution between $a$ and $b$. "default" refers to the nominal value for the Unitree G1 robot.
  • ERFI: External Resistive Force Impulse, a type of perturbation applied to the robot.
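
A simple sampler over these ranges might look like the following sketch (whether the base CoM offset is drawn per axis is an assumption; the dictionary keys are illustrative):

```python
import numpy as np

def sample_domain_randomization(rng=np.random.default_rng()):
    """Draw one set of randomized dynamics parameters from the ranges in Table 7."""
    return {
        "friction": rng.uniform(0.2, 1.2),
        "pd_gain_scale": rng.uniform(0.9, 1.1),
        "link_mass_scale": rng.uniform(0.9, 1.1),      # multiplies the default link masses
        "ankle_inertia_scale": rng.uniform(0.9, 1.1),  # multiplies the default ankle inertia
        "base_com_offset_m": rng.uniform(-0.05, 0.05, size=3),
        "control_delay_ms": rng.uniform(0.0, 40.0),
    }
```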

5.4.4. PPO Hyperparameters (from Appendix C.4)

The following are the PPO hyperparameters used for policy optimization (Table 8 from the original paper):

Hyperparameter Value
Optimizer Adam
Batch size 4096
Mini Batches 4
Learning epochs 5
Entropy coefficient 0.01
Value loss coefficient 1.0
Clip param 0.2
Max grad norm 1.0
Init noise std 0.8
Learning rate 1e-3
Desired KL 0.01
GAE decay factor ($\lambda$) 0.95
GAE discount factor ($\gamma$) 0.99
Actor MLP size [512, 256, 128]
Critic MLP size [768, 512, 128]
MLP Activation ELU

5.4.5. Curriculum Learning (from Appendix C.5)

Two curriculum mechanisms are introduced to facilitate learning high-dynamic motions:

  1. Termination Curriculum:

    • Mechanism: The episode terminates early if the humanoid's motion deviates from the reference beyond a termination threshold $\theta$.
    • Progression: During training, this threshold is gradually decreased, making the task more difficult and requiring higher precision.
    • Update Rule: $$\theta \gets \mathrm{clip} \left( \theta \cdot (1 - \delta), \theta_{\mathrm{min}}, \theta_{\mathrm{max}} \right)$$
      • $\theta$: The termination threshold.
      • $\mathrm{clip}(\cdot, \cdot, \cdot)$: Clips the value within the specified minimum and maximum bounds.
      • $\theta_{\mathrm{init}} = 1.5$: Initial threshold.
      • $\theta_{\mathrm{min}} = 0.3$: Minimum threshold.
      • $\theta_{\mathrm{max}} = 2.0$: Maximum threshold.
      • $\delta = 2.5 \times 10^{-5}$: Decay rate, causing $\theta$ to decrease over time.
  2. Penalty Curriculum:

    • Mechanism: A scaling factor α\alpha modulates the influence of regularization terms (penalties).
    • Progression: α\alpha progressively increases, gradually enforcing stronger regularization and promoting more stable and physically plausible behaviors. This helps in early training stages by being less strict.
    • Update Rule: αclip(α(1+δ),αmin,αmax),r^penaltyαrpenalty \alpha \gets \mathrm{clip}(\alpha \cdot (1 + \delta), \alpha_{\mathrm{min}}, \alpha_{\mathrm{max}}), \quad \hat{r}_{\mathrm{penalty}} \gets \alpha \cdot r_{\mathrm{penalty}}
      • α\alpha: The penalty scaling factor.
      • clip(,,)\mathrm{clip}(\cdot, \cdot, \cdot): Clips the value within the specified minimum and maximum bounds.
      • αinit=0.1\alpha_{\mathrm{init}} = 0.1: Initial penalty scale.
      • αmin=0.0\alpha_{\mathrm{min}} = 0.0: Minimum penalty scale.
      • αmax=1.0\alpha_{\mathrm{max}} = 1.0: Maximum penalty scale.
      • δ=1.0×104\delta = 1.0 \times 10^{-4}: Growth rate, causing α\alpha to increase over time.
      • r^penalty\hat{r}_{\mathrm{penalty}}: The scaled penalty reward.
      • rpenaltyr_{\mathrm{penalty}}: The original penalty reward.
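The two update rules above translate directly into code. The following is a minimal sketch using the listed constants; how often the update is applied (e.g., per environment step or per policy iteration) is an assumption not specified here.

```python
import numpy as np

# Termination curriculum: threshold shrinks toward theta_min over training.
THETA_MIN, THETA_MAX, THETA_DECAY = 0.3, 2.0, 2.5e-5
# Penalty curriculum: penalty scale grows toward alpha_max over training.
ALPHA_MIN, ALPHA_MAX, ALPHA_GROWTH = 0.0, 1.0, 1.0e-4

theta = 1.5   # initial termination threshold
alpha = 0.1   # initial penalty scale

def curriculum_step(theta, alpha, r_penalty):
    """Apply one update of both curricula and return the scaled penalty reward."""
    theta = np.clip(theta * (1.0 - THETA_DECAY), THETA_MIN, THETA_MAX)
    alpha = np.clip(alpha * (1.0 + ALPHA_GROWTH), ALPHA_MIN, ALPHA_MAX)
    r_penalty_scaled = alpha * r_penalty
    return theta, alpha, r_penalty_scaled
```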

5.4.6. PD Controller Parameter (from Appendix C.6)

  • Mechanism: A Proportional-Derivative (PD) controller is used at the joint level to convert desired joint positions into motor torques.

  • Gains: The stiffness (kpk_p) and damping (kdk_d) gains are specified for different joint groups.

  • Numerical Stability: To improve numerical stability and fidelity in the simulator, the inertia of the ankle links is manually set to a fixed value of $5 \times 10^{-3}$ kg·m².

    The following are the PD controller gains (Table 9 from the original paper):

| Joint name | Stiffness ($k_p$) | Damping ($k_d$) |
| --- | --- | --- |
| Left/right shoulder pitch/roll | 100 | 2.0 |
| Left/right shoulder yaw | 50 | 2.0 |
| Left/right elbow | 50 | 2.0 |
| Waist pitch/roll/yaw | 400 | 5.0 |
| Left/right hip pitch/roll/yaw | 100 | 2.0 |
| Left/right knee | 150 | 4.0 |
| Left/right ankle pitch/roll | 40 | 2.0 |
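At the joint level, the conversion from desired positions to torques follows the standard PD law. A minimal sketch is shown below; the torque limit and the example numbers are placeholders, not values from the paper.

```python
import numpy as np

def pd_torque(q_target, q, qd, kp, kd, torque_limit):
    """tau = kp * (q_target - q) - kd * qd, clipped to the actuator limits."""
    tau = kp * (q_target - q) - kd * qd
    return np.clip(tau, -torque_limit, torque_limit)

# Example with the knee gains from Table 9 (torque limit is a placeholder value).
tau_knee = pd_torque(q_target=0.4, q=0.35, qd=0.1, kp=150.0, kd=4.0, torque_limit=90.0)
```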

6. Results & Analysis

6.1. Motion Filtering

This section addresses Q1: "Can our physics-based motion filtering effectively filter out untrackable motions?"

  • Methodology: The physics-based motion filtering method (Section 3.1) was applied to 10 motion sequences. Policies were then trained for each motion, and the Episode Length Ratio (ELR) was computed.

  • ELR Definition: The ratio of average episode length to reference motion length. A high ELR indicates that the policy successfully tracks the motion for a longer duration without early termination (e.g., due to falling or exceeding error thresholds).

  • Findings:

    • 4 sequences were rejected by the filter, and 6 were accepted.
    • Accepted motions consistently achieved high ELRs.
    • Rejected motions achieved a maximum ELR of only 54%, indicating frequent violations of the termination conditions.
  • Conclusion: The filtering method is effective in excluding inherently untrackable motions, thereby improving training efficiency by focusing on viable candidates.

    The following figure (Figure 6 from the original paper) shows the distribution of ELR for accepted and rejected motions:

    Figure 6: The distribution of ELR of accepted and rejected motions.

The plot clearly shows a distinct separation: accepted motions (blue dots) have ELRs mostly above 80%, while rejected motions (orange dots) have ELRs mostly below 60%, confirming the filter's effectiveness.
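For clarity, ELR itself is just a ratio. A minimal sketch follows; whether lengths are counted in control steps or motion frames is an assumption.

```python
import numpy as np

def episode_length_ratio(episode_lengths, reference_length):
    """ELR (%): mean achieved episode length divided by the reference motion length."""
    return 100.0 * np.mean(episode_lengths) / reference_length
```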

6.2. Main Result

This section addresses Q2: "Does PBHC achieve superior tracking performance compared to prior methods in simulation?"

  • Comparison: PBHC is compared against OmniH2O, ExBody2, and MaskedMimic.

  • Evaluation: Policies are trained in IsaacGym [29] with three random seeds and evaluated over 1,000 rollout episodes. Motions are categorized into easy, medium, and hard difficulty levels.

  • Findings:

    • Superiority of PBHC: PBHC consistently outperforms OmniH2O and ExBody2 across all evaluation metrics (position, velocity, acceleration errors) and all difficulty levels. This is indicated by significantly lower error values.

    • Adaptive Mechanism's Role: The paper attributes PBHC's improvements to its adaptive motion tracking mechanism, which dynamically adjusts tracking factors. Baselines with fixed, empirically tuned parameters struggle to generalize across diverse motions.

    • MaskedMimic Context: MaskedMimic sometimes performs well on certain metrics but is designed for character animation and is not suitable for robot control due to neglecting physical constraints like partial observability and action smoothness.

    • Oracle Comparison: An oracle version of PBHC (which, like MaskedMimic, ignores robot constraints) also shows strong performance, suggesting that PBHC's core tracking capabilities are robust even when constraints are relaxed.

      The following are the main results comparing different methods across difficulty levels (Table 1 from the original paper):

| Method | Eg-mpbpe ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Easy | | | | | | |
| OmniH2O | 233.54±4.013* | 103.67±1.912* | 1805.10±12.33* | 8.54±0.125* | 8.46±0.081* | 224.70±2.043 |
| ExBody2 | 588.22±11.43* | 332.50±3.584* | 4014.40±21.50* | 14.29±0.172* | 9.80±0.157* | 206.01±1.346* |
| Ours | 53.25±17.60 | 28.16±6.127 | 725.62±16.20 | 4.41±0.312 | 4.65±0.140 | 81.28±2.052 |
| MaskedMimic (Oracle) | 41.79±1715 | 21.86±2.030 | 739.96±19.94* | 5.20±0.245 | 7.40±0.3 | 132.01±8.941 |
| Ours (Oracle) | 45.02±6.760 | 22.95±15.22 | 710.30±16.66 | 4.63±1.580 | 4.89±0.960 | 73.44±12.42 |
| Medium | | | | | | |
| OmniH2O | 433.64±16.22* | 151.42±7.340* | 2333.90±49.50* | 10.85±0.300 | 10.54±0.152 | 204.36±4.473 |
| ExBody2 | 619.84±26.16* | 261.01±1.592* | 3738.70±26.90* | 14.48±0.160* | 11.25±0.173 | 204.33±2.172* |
| Ours | 126.48±27.01 | 48.87±7.550 | 1043.30±104.4 | 6.62±0.412 | 7.19±0.254 | 105.30±5.941 |
| MaskedMimic (Oracle) | 150.92±33.4 | 61.69±46.01 | 934.25±155.0 | 8.16±1.974 | 10.01±0.83 | 176.84±26.1 |
| Ours (Oracle) | 66.85±50.29 | 29.56±14.53 | 753.69±100.2 | 5.34±0.425 | 6.58±0.291 | 82.73±3.108 |
| Hard | | | | | | |
| OmniH2O | 446.17±12.84 | 147.88±4.142 | 1939.50±23.90 | 14.98±0.643 | 14.40±0.580 | 190.13±8.211 |
| ExBody2 | 689.68±11.80 | 246.40±1252* | 4037.40±16.70* | 19.90±0.210 | 16.72±0.160 | 254.76±3.409* |
| Ours | 290.36±139.1 | 124.61±53.54 | 1326.60±378.9 | 11.93±2.622 | 12.36±2.401 | 135.05±16.43 |
| MaskedMimic (Oracle) | 47.74±2.762 | 27.2±1.615 | 829.02±15.41 | 8.33±0.194 | 10.60±0.420* | 146.90±13.32* |
| Ours (Oracle) | 79.25±69.4 | 34.74±22.6 | 734.90±155.9 | 7.04±1.420 | 8.34±1.140 | 93.79±17.36 |
  • Interpretation: For all three difficulty levels (Easy, Medium, Hard), the Ours method (PBHC) consistently shows the lowest error values among the deployable methods across almost all metrics (Eg-mpbpe, Empbpe, Empjpe, Empbve, Empbae, Empjve). For example, in the Easy category, PBHC achieves an Eg-mpbpe of 53.25 ± 17.60, which is significantly lower than OmniH2O (233.54 ± 4.013) and ExBody2 (588.22 ± 11.43). Similar trends are observed for Medium and Hard motions, where PBHC maintains a substantial lead over the deployable baselines. The Oracle versions (which are not constrained by real-world robot physics and observability) naturally achieve lower errors in some cases, but PBHC still performs very competitively, especially when considering its real-world deployability. The asterisks (*) denote significant improvements (p < 0.05) of PBHC over the baselines, further solidifying its advantage.
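For reference, the position-error metrics in Table 1 are averages of per-body (or per-keypoint) Euclidean distances between the policy rollout and the reference motion. The following is a minimal sketch of a mean per-body position error, assuming positions in metres and reporting in millimetres; the paper's exact body set and scaling may differ.

```python
import numpy as np

def mean_per_body_position_error(body_pos, ref_body_pos):
    """Mean Euclidean distance between tracked and reference body positions.

    body_pos, ref_body_pos: arrays of shape (T, num_bodies, 3) in metres.
    Returns the average error in millimetres (assumed reporting unit).
    """
    per_body_dist = np.linalg.norm(body_pos - ref_body_pos, axis=-1)  # (T, num_bodies)
    return 1000.0 * per_body_dist.mean()
```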

6.3. Impact of Adaptive Motion Tracking Mechanism

This section addresses Q3: "Does the adaptive motion tracking mechanism improve tracking precision?"

  • Methodology: An ablation study was conducted, comparing PBHC's adaptive mechanism with four fixed tracking factor configurations: Coarse, Medium, UpperBound, and LowerBound. These fixed configurations represent different levels of reward tolerance, from very loose to very strict.
  • Findings:
    • Inconsistency of Fixed Factors: The performance of fixed tracking factor configurations varied significantly across different motion types. A setting that performed well for one motion might be suboptimal for another. This highlights the difficulty of finding a universal fixed σ\sigma.

    • Adaptive Mechanism's Robustness: PBHC's adaptive motion tracking mechanism consistently achieved near-optimal performance across all motion types. This demonstrates its effectiveness in dynamically adjusting the tracking factor to suit the unique characteristics and difficulty of each motion.

      The following figure (Figure 7 from the original paper) illustrates the results of this ablation study:

      Figure 7: Ablation study comparing the adaptive motion tracking mechanism with fixed tracking factor variants. The adaptive mechanism consistently achieves near-optimal performance across all motions, whereas fixed variants exhibit varying performance depending on motions.

The plot shows that for different motions (e.g., Jabs punch, Charleston dance, Roundhouse kick, Bruce Lee's pose), the Ours method (blue curve, representing the adaptive mechanism) consistently achieves the lowest or near-lowest error across various tracking factors. In contrast, fixed tracking factor variants (green, red, purple, and yellow curves) exhibit high variance in performance, sometimes doing well, sometimes poorly, depending on the specific motion.
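To make the adaptive mechanism more concrete, the sketch below illustrates the general idea of an EMA-driven tracking factor paired with an exponential tracking reward of the form $\exp(-e/\sigma)$. This is an illustrative reconstruction, not the paper's exact update rule, and the coefficients are placeholders.

```python
import numpy as np

class AdaptiveTrackingFactor:
    """Illustrative EMA-based adjustment of the tracking factor sigma.

    The reward tolerance tightens as the running tracking error improves,
    mimicking the adaptive curriculum described in the paper (coefficients
    and the exact update rule here are assumptions).
    """

    def __init__(self, sigma_init=1.0, ema_beta=0.99, sigma_min=1e-3):
        self.sigma = sigma_init
        self.ema_error = None
        self.beta = ema_beta
        self.sigma_min = sigma_min

    def reward(self, tracking_error):
        # Exponential tracking reward: looser sigma tolerates larger errors.
        return float(np.exp(-tracking_error / self.sigma))

    def update(self, tracking_error):
        # Maintain a smoothed estimate of the current tracking error ...
        if self.ema_error is None:
            self.ema_error = tracking_error
        else:
            self.ema_error = self.beta * self.ema_error + (1.0 - self.beta) * tracking_error
        # ... and tighten the tolerance toward it so the reward stays discriminative.
        self.sigma = max(self.sigma_min, min(self.sigma, self.ema_error))
```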

The following are the ablation study results evaluating the impact of different tracking factors on four motion tasks (Table 12 from the original paper):

| Method | Eg-mpbpe ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Jabs punch | | | | | | |
| Ours | 44.38±7.118 | 28.00±3.533 | 783.36±11.73 | 5.52±0.156 | 6.23±0.063 | 88.01±2.465 |
| Coarse | 63.95±6.680 | 36.76±2.743 | 921.50±16.70 | 6.16±0.011 | 6.46±0.042 | 91.46±0.465 |
| Medium | 51.07±2.635 | 30.93±2.635 | 790.54±22.82 | 5.68±0.140 | 6.31±0.057 | 90.19±1.821 |
| Upperbound | 45.74±1.702 | 28.72±1.702 | 793.52±8.888 | 5.43±0.066 | 6.29±0.085 | 88.68±0.727 |
| Lowerbound | 48.66±0.488 | 28.97±0.487 | 781.73±16.72 | 5.61±0.079 | 6.31±0.06 | 88.44±1.397 |
| Charleston dance | | | | | | |
| Ours | 94.81±14.18 | 43.09±5.748 | 886.91±74.76 | 6.83±0.346 | 7.26±0.034 | 162.70±7.133 |
| Coarse | 119.24±4.501 | 55.80±1.324 | 1288.02±3.807 | 7.54±0.180 | 7.28±0.021 | 178.61±3.304 |
| Medium | 83.63±3.159 | 41.02±1.743 | 933.33±38.23 | 6.89±0.185 | 7.22±0.011 | 164.92±4.380 |
| Upperbound | 86.90±8.651 | 41.92±2.632 | 917.64±14.85 | 7.02±0.103 | 7.22±0.041 | 167.64±1.089 |
| Lowerbound | 358.82±10.35 | 145.42±1.109 | 1199.21±12.78 | 8.99±0.050 | 8.48±0.033 | 167.25±0.783 |
| Roundhouse kick | | | | | | |
| Ours | 52.53±2.106 | 28.39±1.400 | 708.55±16.04 | 6.85±0.196 | 7.13±0.046 | 106.22±0.715 |
| Coarse | 76.81±2.863 | 38.98±2.230 | 1008.32±29.74 | 7.49±0.234 | 7.57±0.044 | 108.40±0.010 |
| Medium | 63.12±5.178 | 33.74±2.336 | 806.84±66.23 | 7.03±0.125 | 7.32±0.046 | 104.77±1.319 |
| Upperbound | 54.95±2.164 | 31.31±0.344 | 766.32±12.92 | 6.93±0.013 | 7.19±0.012 | 105.64±1.911 |
| Lowerbound | 70.10±2.674 | 36.29±1.475 | 715.01±34.01 | 7.08±0.102 | 7.32±0.067 | 102.50±4.650 |
| Bruce Lee's pose | | | | | | |
| Ours | 196.22±17.03 | 69.12±2.392 | 972.04±49.27 | 7.57±0.214 | 8.54±0.198 | 94.36±3.750 |
| Coarse | 239.06±51.74 | 80.78±15.81 | 1678.34±394.3 | 8.42±0.525 | 8.93±0.422 | 112.30±10.87 |
| Medium | 470.24±249.2 | 206.92±116.1 | 4490.80±105.1 | 9.58±0.085 | 9.61±0.080 | 99.65±2.441 |
| Upperbound | 250.64±178.6 | 93.70±65.09 | 1358.02±561.6 | 8.31±2.160 | 8.94±1.384 | 106.30±23.06 |
| Lowerbound | 158.12±2.934 | 60.54±1.54 | 955.10±37.04 | 7.05±0.040 | 7.94±0.051 | 81.60±1.277 |
  • Interpretation: This detailed table confirms that Ours (PBHC with adaptive $\sigma$) is consistently at or near the best performance (lowest error values) across metrics and motions. For 'Jabs punch', Ours reports an Eg-mpbpe of 44.38, lower than Coarse (63.95), Medium (51.07), Upperbound (45.74), and Lowerbound (48.66), and the same trend holds for 'Roundhouse kick'. For 'Charleston dance', Ours is close to the Medium and Upperbound settings, while Lowerbound collapses (Eg-mpbpe of 358.82). Conversely, for 'Bruce Lee's pose' the Lowerbound setting edges out Ours on several metrics, yet that same setting is the worst choice for 'Charleston dance'. This is precisely the point of the ablation: no single fixed tracking factor works well across motions, whereas the adaptive mechanism reaches near-optimal performance for every motion without per-motion tuning.

6.4. Real-World Deployment

This section addresses Q4: "How well does PBHC perform in real-world deployment?"

  • Methodology: Policies trained in simulation were directly deployed on the Unitree G1 robot without any fine-tuning (zero-shot sim-to-real transfer). Quantitative assessment was done by conducting 10 trials of the Tai Chi motion and computing evaluation metrics based on onboard sensor readings.

  • Findings:

    • Stable and Expressive Behaviors: The Unitree G1 robot successfully demonstrated a diverse range of highly-dynamic skills, including martial arts techniques (punches, kicks), acrobatic movements (360-degree spins), flexible motions (squats, stretches), and artistic performances (dance, Tai Chi).
    • Quantitative Alignment: The metrics obtained in the real world were closely aligned with those from the sim-to-sim platform (MuJoCo).
  • Conclusion: The policies robustly transfer from simulation to real-world deployment, maintaining high-performance control and showcasing PBHC's practical applicability.

    The following figure (Figure 8 from the original paper) shows the robot mastering highly-dynamic skills in the real world:

    Figure 8: Our robot masters highly-dynamic skills in the real world. Time flows left to right.

The image sequence displays various dynamic poses, such as horse-stance punch, stretch leg, jabs punch, Tai Chi, jump kick, Bruce Lee's pose, roundhouse kick, 360-degree spin, front kick, and Charleston dance, demonstrating the robot's versatility.

The following are the comparison of tracking performance of Tai Chi between real-world and simulation (Table 2 from the original paper):

| Platform | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- |
| MuJoCo | 33.18±2.720 | 1061.24±83.27 | 2.96±0.342 | 2.90±0.498 | 67.71±6.747 |
| Real | 36.64±2.592 | 1130.05±9.478 | 3.01±0.126 | 3.12±0.056 | 65.68±1.972 |
  • Interpretation: The table shows very similar error values between the MuJoCo simulation platform and the Real robot for the Tai Chi motion. For example, Empbpe is 33.18 ± 2.720 in MuJoCo and 36.64 ± 2.592 on the real robot, indicating a close match. This quantitative comparison strongly supports the claim of successful zero-shot sim-to-real transfer. For this comparison the robot's root is fixed to the origin, since accurate root position and velocity measurements are difficult to obtain on the real robot.

    The following figure (Figure 12 from the original paper) presents additional real-world results of the robot mastering more dynamic skills:

    Figure 12: Our robot masters more dynamic skills in the real world. Time flows left to right.

This figure further showcases the robot's capabilities in hooks punch, horse-stance pose, back kick, side kick, five stance form, fighting combo, and tap dance, reinforcing the qualitative demonstration of PBHC's effectiveness in diverse dynamic movements.

6.5. Learning Curves

  • Methodology: Learning curves for mean episode length and mean reward are presented for three representative motions: Jabs Punch, Tai Chi, and Roundhouse Kick.

  • Findings: The curves show that training gradually stabilizes and converges after approximately 20,000 steps.

  • Conclusion: This demonstrates the reliability and efficiency of the PBHC approach in learning complex motion behaviors within a reasonable training duration.

    The following figure (Figure 9 from the original paper) shows the mean episode length and mean reward across three motions:

    Figure 9: Mean episode length and mean reward across three motions. Both curves indicate that training gradually stabilizes after 20k steps.

The top row (a) shows the Mean Episode Length for the three motions, all plateauing around 20k steps. The bottom row (b) shows the Mean Reward, which also stabilizes and converges around the same training steps. This indicates that the RL agent successfully learns to maintain the reference motion for longer durations and achieves high cumulative rewards as training progresses.

6.6. Ablation Study of Contact Mask (from Appendix E.3)

  • Methodology: An ablation study was conducted to evaluate the effectiveness of the contact mask in PBHC. It compared the full Ours method with Ours w/o contact mask (without the contact mask component) for motions with distinct foot contact patterns (Charleston Dance, Jump Kick, Roundhouse Kick).

  • Metric: Mean Foot Contact Mask Error (EcontactmaskE_{\mathrm{contact-mask}}) was introduced.

  • Findings:

    • The Ours method significantly reduced foot contact errors (EcontactmaskE_{\mathrm{contact-mask}}) compared to the baseline without the contact mask.
    • This also led to noticeable improvements in other tracking metrics.
  • Conclusion: The contact-aware design is effective in improving tracking accuracy, especially for foot contacts.

    The following figure (Figure 10 from the original paper) shows the accuracy of contact mask estimation across different methods:

    Figure 10: Accuracy of contact mask estimation across different methods.

The bar chart shows that the proposed Ours method achieves 91.4% accuracy, outperforming the Height (84.2%) and Velocity (85.6%) based methods.
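As a rough illustration of the kind of estimator Figure 10 compares, a foot-contact mask can be derived by thresholding foot height and foot velocity. The sketch below is not the paper's exact estimator, and the threshold values are placeholders.

```python
import numpy as np

def estimate_foot_contact(foot_height, foot_velocity, h_thresh=0.05, v_thresh=0.3):
    """Binary contact mask from foot kinematics (illustrative thresholds).

    foot_height: (T,) height of the foot above the ground in metres.
    foot_velocity: (T,) foot speed in m/s.
    A foot is marked "in contact" when it is both low and nearly stationary.
    """
    return (foot_height < h_thresh) & (np.abs(foot_velocity) < v_thresh)
```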

The following figure (Figure 11 from the original paper) presents a visual comparison of the efficacy of the proposed motion correction technique:

Figure 11: Visualization of motion correction effectiveness in mitigating floating artifacts.

The image visually demonstrates that motion correction effectively mitigates floating artifacts. The before correction image shows the human model's feet hovering above the ground, while the after correction image shows the feet accurately placed on the ground, highlighting the importance of the correction step.

The following are the ablation results of the contact mask (Table 13 from the original paper):

| Method | Econtact-mask ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ |
| --- | --- | --- | --- | --- | --- |
| Charleston dance | | | | | |
| Ours | 217.82±47.97 | 43.09±5.748 | 886.91±74.76 | 6.83±0.346 | 7.26±0.034 |
| Ours w/o contact mask | 633.91±49.74 | 76.13±53.01 | 980.40±222.0 | 7.72±1.439 | 7.64±0.594 |
| Jump kick | | | | | |
| Ours | 294.22±6.037 | 42.58±8.126 | 840.33±97.76 | 9.48±0.717 | 10.21±10.21 |
| Ours w/o contact mask | 386.75±6.036 | 170.28±97.29 | 1259.21±423.9 | 16.92±0.012 | 16.57±5.810 |
| Roundhouse kick | | | | | |
| Ours | 243.16±1.778 | 28.39±1.400 | 708.55±16.04 | 6.85±0.196 | 7.33±0.046 |
| Ours w/o contact mask | 250.10±6.123 | 36.76±2.743 | 921.52±16.70 | 6.16±0.012 | 6.46±0.042 |
  • Interpretation: For all three motions, Ours (with contact mask) achieves significantly lower Econtact-mask than Ours w/o contact mask. For example, in 'Charleston dance', Econtact-mask drops from 633.91 to 217.82. This reduction in contact mask error also translates to lower Empbpe and Empjpe for 'Charleston dance' and 'Jump kick', indicating that accurate contact handling contributes to overall motion tracking fidelity. For 'Roundhouse kick', the gap in Econtact-mask is smaller (243.16 vs. 250.10), yet the position errors Empbpe and Empjpe remain notably lower with the contact mask, reinforcing the benefit of the contact-aware design.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces PBHC, a novel Reinforcement Learning framework for humanoid whole-body motion control that successfully enables robots to learn highly-dynamic human behaviors. Its key innovations lie in a physics-based multi-steps motion processing pipeline that ensures physical plausibility of reference motions and an adaptive motion tracking mechanism that dynamically adjusts reward tolerance during training. Experimental results demonstrate that PBHC achieves significantly lower tracking errors than existing baselines in simulation and exhibits robust zero-shot sim-to-real transfer to a Unitree G1 robot, performing complex Kungfu and dancing skills stably and expressively. The motion filtering metric efficiently identifies untrackable motions, and the adaptive reward mechanism consistently outperforms fixed-factor approaches.

7.2. Limitations & Future Work

The authors acknowledge the following limitations:

  1. Lack of Environment Awareness: The current method does not incorporate terrain perception or obstacle avoidance, which limits its deployment in unstructured real-world settings.

  2. Limited Skill Generalization: While it excels at highly-dynamic motions, the method's ability to generalize to diverse motion repertoires (i.e., a wide range of different skills) needs further exploration.

    Based on these limitations, the authors suggest future research directions:

  • Integrating environment awareness capabilities (e.g., perception of terrain, obstacles).
  • Research into maintaining high dynamic performance while simultaneously enabling broader skill generalization.

7.3. Personal Insights & Critique

This paper presents a significant step forward in humanoid motion imitation, particularly for highly-dynamic and expressive skills. The two-pronged approach of meticulous physics-based motion processing and an intelligent adaptive reward curriculum is a powerful combination.

  • Innovation of Adaptive Tracking: The bi-level optimization formulation and the EMA-driven adaptive tracking factor are particularly insightful. This moves beyond heuristics for reward shaping to a more principled, data-driven adjustment of learning difficulty. It addresses a fundamental challenge in imitation learning: how to reward partial progress on hard tasks without overly penalizing early, imperfect attempts, while still driving towards high precision.
  • Rigorous Motion Processing: The detailed motion processing pipeline, including CoM-CoP stability filtering and contact-aware correction, is crucial. Many RL approaches often overlook the quality of reference data, assuming it's perfect. This paper demonstrates the immense value of cleaning and validating input data against physical constraints.
  • Real-World Validation: The successful zero-shot sim-to-real transfer on the Unitree G1 robot with complex motions is compelling. This showcases the robustness of domain randomization and the efficacy of the proposed control framework.

Potential Issues/Areas for Improvement:

  • Computational Cost: Training RL policies, especially for complex humanoid robots, is computationally intensive. The 27-hour training time per policy on a powerful GPU suggests that scaling to an even broader range of skills or more complex environments might require significant computational resources. The bi-level optimization itself, while theoretically sound, adds a layer of complexity.
  • Generalization to Novel Motions: While the adaptive mechanism helps with different difficulty levels within a known set of motions, the question of truly novel, unseen motion types remains. How well would the adaptive factor generalize to motions structurally very different from the training data?
  • Reactive Behavior: The current framework focuses on imitating pre-defined reference motions. For real-world deployment, robots often need to react dynamically to unexpected external forces or changes in the environment, which is not directly addressed by motion imitation alone. The lack of environment awareness is a significant limitation for practical applications in unstructured settings.
  • Hyperparameter Sensitivity: Even with adaptive mechanisms, the initial tracking factor and the curriculum learning parameters (e.g., decay rates for θ\theta and α\alpha) might still require careful tuning and could influence the final performance.

Transferability: The principles of adaptive reward shaping based on online performance metrics and physics-informed data preprocessing are highly transferable.

  • Other Robotic Systems: This approach could be applied to other complex robotic systems (e.g., quadrupeds, manipulators) learning dynamic tasks.

  • Skill Learning: The adaptive curriculum concept could generalize to other RL skill learning problems where tasks have varying difficulties or require progressive precision.

  • Human-Robot Collaboration: The ability to accurately imitate human movements opens doors for more intuitive and effective human-robot collaboration and physical assistance.

    Overall, KungfuBot provides a robust and innovative framework that pushes the boundaries of humanoid whole-body control, particularly for dynamic and expressive human skills, laying important groundwork for future humanoid intelligence and dexterity.
