
KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

Published: 06/15/2025

TL;DR Summary

This paper presents KungfuBot, a physics-based humanoid control framework that learns highly-dynamic human behaviors such as Kungfu and dance through multi-step motion processing and adaptive motion tracking, achieving significantly lower tracking errors than prior methods and deploying successfully on a real robot.

Abstract

Humanoid robots are promising to acquire various skills by imitating human behaviors. However, existing algorithms are only capable of tracking smooth, low-speed human motions, even with delicate reward and curriculum design. This paper presents a physics-based humanoid control framework, aiming to master highly-dynamic human behaviors such as Kungfu and dancing through multi-steps motion processing and adaptive motion tracking. For motion processing, we design a pipeline to extract, filter out, correct, and retarget motions, while ensuring compliance with physical constraints to the maximum extent. For motion imitation, we formulate a bi-level optimization problem to dynamically adjust the tracking accuracy tolerance based on the current tracking error, creating an adaptive curriculum mechanism. We further construct an asymmetric actor-critic framework for policy training. In experiments, we train whole-body control policies to imitate a set of highly-dynamic motions. Our method achieves significantly lower tracking errors than existing approaches and is successfully deployed on the Unitree G1 robot, demonstrating stable and expressive behaviors. The project page is https://kungfu-bot.github.io.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

1.2. Authors

Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, Xuelong Li

  • Affiliations:
    • Institute of Artificial Intelligence (TeleAI), China Telecom (Weiji Xie, Jinrui Han, Chenjia Bai, Xuelong Li, Jiyuan Shi, Huanyu Li, Xinzhe Liu, Jiakun Zheng)
    • Shanghai Jiao Tong University (Weiji Xie, Jinrui Han, Weinan Zhang)
    • East China University of Science and Technology (Jiakun Zheng)
    • Harbin Institute of Technology (Huanyu Li)
    • ShanghaiTech University (Xinzhe Liu)
  • Research Background: The authors come from prominent academic institutions and a corporate AI institute, indicating a strong background in artificial intelligence, robotics, and potentially telecommunications, with a focus on areas like reinforcement learning, motion control, and humanoid robotics.

1.3. Journal/Conference

The paper is published as a preprint on arXiv. Given its scope and rigor, it is likely submitted to a top-tier conference or journal in robotics, AI, or machine learning; the inclusion of the NeurIPS Paper Checklist suggests NeurIPS, a highly reputable machine-learning conference, as the target venue.

1.4. Publication Year

2025 (posted on arXiv on June 15, 2025)

1.5. Abstract

This paper introduces KungfuBot, a physics-based humanoid control framework designed to enable humanoid robots to master highly-dynamic human behaviors, such as Kungfu and dancing. Current algorithms struggle with such dynamic motions, typically only tracking smooth, low-speed movements. KungfuBot addresses this through a two-stage approach: a multi-step motion processing pipeline and adaptive motion tracking. The motion processing pipeline extracts, filters, corrects, and retargets human motions, ensuring physical constraint compliance. For motion imitation, it formulates a bi-level optimization problem to dynamically adjust a tracking accuracy tolerance (called the tracking factor) based on the current tracking error, creating an adaptive curriculum. An asymmetric actor-critic framework is used for policy training. Experimental results show that KungfuBot significantly reduces tracking errors compared to existing methods and demonstrates stable and expressive behaviors when deployed on a Unitree G1 robot, including complex Kungfu and dancing movements.

https://arxiv.org/abs/2506.12851

  • Publication Status: Preprint (on arXiv).

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: Humanoid robots face significant challenges in accurately and stably imitating highly-dynamic human behaviors like Kungfu and dancing. Existing Reinforcement Learning (RL)-based whole-body control algorithms are generally limited to tracking smooth, low-speed human motions, even with sophisticated reward and curriculum design.
  • Importance: Enabling humanoid robots to mimic diverse and dynamic human skills is crucial for their potential applications in various tasks, from physical assistance and rehabilitation to entertainment and education. The human-like morphology of these robots makes them ideal candidates for human-robot interaction (HRI) and tasks requiring human-like dexterity and movement.
  • Challenges/Gaps:
    1. Physical Feasibility: Human motion capture (MoCap) data often contains movements that are physically impossible or unsafe for a robot due to differences in kinematics, dynamics, and joint limits. Directly using this data for RL training can lead to policies that are not feasible.
    2. Tracking Accuracy for Dynamics: Existing methods lack robust mechanisms to handle the inherent difficulty and variability of highly-dynamic motions, often leading to poor tracking performance or instability.
    3. Sim-to-Real Transfer: Bridging the gap between simulated training environments and real-world robot deployment remains a significant hurdle.
  • Paper's Entry Point/Innovative Idea: The paper proposes a comprehensive physics-based control framework that integrates multi-step motion processing to ensure physical plausibility of reference motions and an adaptive motion tracking mechanism to dynamically adjust the reward tolerance for varying motion difficulties, enabling the learning of highly-dynamic skills.

2.2. Main Contributions / Findings

  • Primary Contributions:
    1. Physics-Based Motion Processing Pipeline: Design and implementation of a pipeline to extract, filter out, correct, and retarget human motions from videos. This pipeline ensures that reference motions comply with the robot's physical constraints and includes novel physics-based metrics for filtering and contact-aware motion correction.
    2. Adaptive Motion Tracking Mechanism: Formulation of a bi-level optimization problem to derive an optimal tracking factor ($\sigma^*$) and development of an adaptive mechanism that dynamically adjusts this factor during RL training based on tracking error. This creates an adaptive curriculum that tightens the tracking accuracy tolerance as the policy improves.
    3. Asymmetric Actor-Critic Framework: Construction of an asymmetric actor-critic architecture that utilizes reward vectorization and privileged information for the critic to enhance value estimation, while the actor relies on local observations.
    4. Demonstrated Real-World Capabilities: Successful deployment of learned policies on a Unitree G1 robot, showcasing stable and expressive execution of complex, highly-dynamic skills like Kungfu and dancing.
  • Key Conclusions/Findings:
    1. The physics-based motion filtering effectively identifies and excludes untrackable motions from the dataset, leading to more efficient and effective policy learning.
    2. The proposed method (PBHC) significantly outperforms existing approaches (e.g., OmniH2O, ExBody2) in simulation across various tracking error metrics for easy, medium, and hard motions.
    3. The adaptive motion tracking mechanism is crucial for achieving superior tracking precision, dynamically adjusting to motion characteristics where fixed tracking factors would lead to suboptimal performance.
    4. The trained policies exhibit robust sim-to-real transfer, allowing for direct deployment on the Unitree G1 robot with performance metrics closely aligned with simulation results.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Humanoid Robots: Robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Their human-like morphology enables them to interact with human-designed environments and perform tasks that humans do. The Unitree G1 is an example of such a robot.
  • Degrees of Freedom (DoFs): In robotics, this refers to the number of independent parameters that define the configuration of a mechanical system. For a humanoid robot, this includes translational DoFs for the base (e.g., x, y, z position) and rotational DoFs for the base and each joint (e.g., pitch, roll, yaw). The Unitree G1 has 23 DoFs for control, excluding its hands.
  • Whole-Body Control (WBC): A control strategy that coordinates all DoFs of a robot (including the base, torso, arms, and legs) simultaneously to achieve complex tasks while respecting physical constraints (e.g., joint limits, torque limits, balance, contact forces).
  • Motion Capture (MoCap): Technology used to digitally record the movement of people or objects. MoCap systems produce motion data that can be used to animate digital models or serve as reference motions for robots to imitate.
  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns a policy—a mapping from observed states to actions.
    • Markov Decision Process (MDP): A mathematical framework for modeling decision-making in RL. An MDP is defined by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$, where:
      • $\mathcal{S}$ is the set of possible states.
      • $\mathcal{A}$ is the set of possible actions.
      • $\mathcal{R}$ is the reward function, defining the immediate reward an agent receives for transitioning from one state to another.
      • $\mathcal{P}$ is the transition function (or dynamics), defining the probability of moving to a new state given a current state and action.
      • $\gamma$ is the discount factor, which determines the present value of future rewards.
    • Actor-Critic Methods: A class of RL algorithms that combine two components: an actor (which learns the policy directly) and a critic (which estimates the value function to evaluate the actor's actions).
      • Asymmetric Actor-Critic: A variation where the actor and critic have different observation spaces. Typically, the critic has access to privileged information (e.g., environmental parameters, internal states not directly observable by the robot) to learn a better value function, which then guides the actor (who only sees proprioceptive and exteroceptive observations) in learning a robust policy. This helps in sim-to-real transfer.
    • Proximal Policy Optimization (PPO): A popular on-policy RL algorithm that aims to improve sample efficiency and stability compared to earlier methods. It updates the policy by taking multiple small steps, ensuring that the new policy does not deviate too much from the old one, which helps prevent destructive updates.
  • Skinned Multi-Person Linear (SMPL) Model: A widely used statistical 3D human body model that represents body shape and pose with a low-dimensional parameter space.
    • Parameters: SMPL uses:
      • $\beta \in \mathbb{R}^{10}$ for body shapes (e.g., height, weight, build).
      • $\pmb{\theta} \in \mathbb{R}^{24 \times 3}$ for joint rotations (represented as axis-angle or rotation matrices).
      • $\boldsymbol{\psi} \in \mathbb{R}^{3}$ for global translation (position in 3D space).
    • Mapping to Mesh: These parameters are mapped to a 3D mesh (a collection of 3D vertices) via a differentiable skinning function $M(\cdot)$, producing $\mathcal{V} = M(\beta, \pmb{\theta}, \boldsymbol{\psi}) \in \mathbb{R}^{6890 \times 3}$.
  • Inverse Kinematics (IK): A method used in robotics and computer graphics to determine the joint parameters (angles) required to achieve a desired position and orientation for an end-effector (e.g., hand, foot). Differential IK uses derivatives to solve this problem iteratively.
  • Center of Mass (CoM) and Center of Pressure (CoP):
    • CoM: The unique point where the weighted relative positions of the distributed mass sum to zero. It's the average position of all the mass that makes up the object.
    • CoP: The point on the ground where the total ground reaction force acts. For stable standing or walking, the CoM projection should ideally stay within the base of support (the area enclosed by the CoP). The proximity of CoM and CoP indicates stability.
  • Exponential Moving Average (EMA): A type of moving average that places a greater weight and significance on the most recent data points. It is often used to smooth noisy data or to create a dynamically updating estimate of a parameter.

3.2. Previous Works

The paper contextualizes its work by discussing limitations of previous RL-based whole-body control frameworks for motion tracking, particularly their inability to handle highly-dynamic motions.

  • H2O and OmniH2O [9, 10]: These methods attempt to address physical feasibility issues by removing infeasible motions from datasets using a trained privileged imitation policy. While they contribute to cleaning motion data, they often still struggle with highly dynamic movements due to inherent difficulties in tracking and a lack of suitable tolerance mechanisms. OmniH2O specifically focuses on universal and dexterous human-to-humanoid whole-body teleoperation and learning.
  • ExBody [7] and ExBody2 [5]:
    • ExBody constructs a feasible motion dataset by filtering via language labels.
    • ExBody2 trains an initial policy on all motions and uses the tracking error to measure motion difficulty, aiming to optimize the dataset. However, this process can be costly and may not find an optimal dataset, and also lacks tolerance mechanisms for difficult motions.
  • ASAP [6]: Aligning Simulation and Real-world Physics for Learning Agile Humanoid Whole-body Skills. This method introduces a multi-stage mechanism and learns a residual policy to compensate for the sim-to-real gap, which helps in tracking agile motions. Unlike ASAP which focuses on the sim-to-real gap, KungfuBot focuses on improving motion feasibility and agility entirely within simulation.
  • MaskedMimic [21, 23]: Unified Physics-based Character Control through Masked Motion Inpainting. This method focuses on character animation and directly optimizes pose-level accuracy without explicit regularization of physical plausibility. While it performs well in character animation contexts, it is not directly deployable for robot control because it does not account for robot-specific constraints like partial observability and action smoothness. The paper uses MaskedMimic as a baseline for comparison but notes its limitations for direct robot deployment.

3.3. Technological Evolution

Early humanoid motion imitation efforts often relied on direct kinematic mapping or Inverse Kinematics (IK) controllers, which struggled with dynamic stability and physical constraints. The advent of Reinforcement Learning (RL) brought about physics-based controllers that could learn stable locomotion and basic movements (e.g., DeepMimic [21]). However, directly imitating complex human motions remained challenging due to the sim-to-real gap and the physical differences between humans and robots. Subsequent works focused on motion filtering (H2O, ExBody) to make human MoCap data more robot-feasible. More recently, approaches like ASAP have tackled agile motions by addressing the sim-to-real gap through residual policies.

KungfuBot represents an evolution by refining both the data preparation (physics-based processing) and the RL training loop (adaptive reward tolerance), allowing for the imitation of more extreme and dynamic motions that previous methods struggled with. It moves beyond simply filtering data to actively adapt the learning process to the inherent difficulty of dynamic movements.

3.4. Differentiation Analysis

KungfuBot differentiates itself from previous methods primarily in two key areas:

  1. Comprehensive Physics-Based Motion Processing:

    • Prior Work: Methods like H2O and ExBody primarily focus on filtering infeasible motions or constructing datasets based on labels. ASAP addresses sim-to-real gap for agile motions.
    • KungfuBot's Innovation: KungfuBot introduces a more holistic multi-step pipeline that not only filters (physics-based motion filtering using CoM-CoP stability criteria) but also corrects (contact-aware motion correction with EMA smoothing) and retargets motions. This ensures a higher degree of physical plausibility and quality for the reference motions before RL training, directly tackling out-of-distribution issues from HMR models and floating artifacts. This processing is done entirely in simulation, contrasting with ASAP's focus on sim-to-real compensation.
  2. Adaptive Motion Tracking with Optimal Factor Derivation:

    • Prior Work: Most RL-based imitation learning methods use fixed reward functions and tracking factors (e.g., OmniH2O, ExBody2). While ExBody2 attempts to measure motion difficulty, it lacks a dynamic tolerance mechanism. The MaskedMimic approach, while effective for character animation, does not prioritize physical plausibility for robot control and thus uses different optimization objectives.

    • KungfuBot's Innovation: KungfuBot introduces a novel adaptive mechanism for dynamically adjusting the tracking factor ($\sigma$) within the exponential reward function. This is grounded in a bi-level optimization problem that theoretically derives the optimal tracking factor as the average of the optimal tracking error. By using an Exponential Moving Average (EMA) to estimate this error online and iteratively tightening $\sigma$, KungfuBot creates an adaptive curriculum that allows the policy to progressively improve its precision for motions of varying difficulty. This mechanism is a significant departure from fixed reward weighting or tolerance parameters, enabling superior performance on highly-dynamic skills.

      In summary, KungfuBot combines sophisticated physics-based motion preprocessing with an intelligent, adaptive RL reward mechanism, explicitly designed to overcome the limitations of prior work in handling the physical and dynamic complexities of highly-dynamic human behaviors for humanoid robots.

4. Methodology

The Physics-Based Humanoid motion Control (PBHC) framework is designed to enable humanoid robots to master highly-dynamic human behaviors. It operates through a two-stage process: motion processing and motion imitation.

The following figure (Figure 1 from the original paper) provides an overview of PBHC:

Figure 1: An overview of PBHC that includes three core components: (a) motion extraction from videos and multi-step motion processing, (b) adaptive motion tracking based on the optimal tracking factor, (c) the RL training framework and sim-to-real deployment.

As illustrated, the process begins with raw human videos (a). These videos undergo Human Motion Recovery (HMR) to produce SMPL-format motion sequences. These sequences are then refined through physics-based motion filtering and contact-aware motion correction, ensuring physical plausibility. The refined motions are then retargeted to the G1 robot to serve as reference motions. In the second stage (b), an adaptive motion tracking mechanism dynamically adjusts the tracking factor based on an optimal tracking factor derived from a bi-level optimization problem. Finally, the policies are trained using an RL training framework (c) and deployed on the real Unitree G1 robot.

4.1. Principles

The core idea behind PBHC is to systematically address the challenges of dynamic motion imitation in humanoid robots by:

  1. Ensuring Physical Feasibility: Pre-processing human MoCap data to guarantee that the reference motions are physically executable by the robot, considering its kinematics, dynamics, and contact interactions.
  2. Adaptive Learning: Dynamically adjusting the reward tolerance during Reinforcement Learning based on the agent's current performance and the inherent difficulty of the motion. This allows the agent to gradually refine its tracking precision without being prematurely penalized for small errors in complex movements, effectively creating an adaptive curriculum.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Motion Processing Pipeline

This pipeline converts raw human video into physically plausible and robot-executable reference motions for the G1 robot.

4.2.1.1. Motion Estimation from Videos

The process begins by extracting human motion from monocular videos.

  • Model Used: GVHMR [15] (Gravity-View Human Motion Recovery) is employed.
  • Key Features of GVHMR:
    • It estimates SMPL-format motions.
    • It introduces a gravity-view coordinate system, which naturally aligns motions with gravity, addressing body tilt issues that can arise from camera-centric reconstructions.
    • It mitigates foot sliding artifacts by predicting non-stationary probabilities, improving motion quality.
  • Output: SMPL parameters $(\beta, \pmb{\theta}, \boldsymbol{\psi})$, where $\beta \in \mathbb{R}^{10}$ represents body shapes, $\pmb{\theta} \in \mathbb{R}^{24 \times 3}$ represents joint axis-angle rotations, and $\boldsymbol{\psi} \in \mathbb{R}^{3}$ represents global translation. These parameters can be mapped to a 3D mesh $\mathcal{V} = M(\beta, \pmb{\theta}, \boldsymbol{\psi}) \in \mathbb{R}^{6890 \times 3}$ via a differentiable skinning function $M(\cdot)$.

4.2.1.2. Physics-based Motion Filtering

Motions extracted by HMR models can still contain physical and biomechanical constraint violations due to reconstruction inaccuracies and out-of-distribution issues. This step filters out such motions.

  • Stability Criterion: The stability of a motion frame is assessed based on the proximity of the Center of Mass (CoM) and Center of Pressure (CoP).
    • Let $\bar{\pmb{p}}_t^{\mathrm{CoM}} = (p_{t,x}^{\mathrm{CoM}}, p_{t,y}^{\mathrm{CoM}})$ and $\bar{\pmb{p}}_t^{\mathrm{CoP}} = (p_{t,x}^{\mathrm{CoP}}, p_{t,y}^{\mathrm{CoP}})$ be the projected coordinates of the CoM and CoP on the ground at frame $t$, respectively.
    • The distance between these projections is denoted $\Delta d_t$.
    • The stability criterion for a frame $t$ is defined as: $$\Delta d_t = \lVert \bar{\pmb{p}}_t^{\mathrm{CoM}} - \bar{\pmb{p}}_t^{\mathrm{CoP}} \rVert_2 < \epsilon_{\mathrm{stab}}$$
      • $\Delta d_t$: The Euclidean distance between the 2D projections of the CoM and CoP on the ground at time $t$.
      • $\bar{\pmb{p}}_t^{\mathrm{CoM}}$: The 2D projection of the Center of Mass on the ground at time $t$.
      • $\bar{\pmb{p}}_t^{\mathrm{CoP}}$: The 2D projection of the Center of Pressure on the ground at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm (distance).
      • $\epsilon_{\mathrm{stab}}$: A stability threshold, empirically chosen (e.g., 0.1 as per Table 3). If the distance is below this threshold, the frame is considered stable.
  • Motion Sequence Stability: For an $N$-frame motion sequence, let $B = [t_0, t_1, \dots, t_K]$ be the increasingly sorted list of frame indices that satisfy the stability criterion (Eq. 1). A motion sequence is considered stable if it meets two conditions:
    1. Boundary-frame stability: The first frame ($1 \in B$) and the last frame ($N \in B$) must be stable.
    2. Maximum instability gap: The maximum length of consecutive unstable frames must be less than a threshold $\epsilon_N$, i.e., $\max_k \,(t_{k+1} - t_k) < \epsilon_N$ (e.g., 100 as per Table 3).
  • Benefit: This filtering effectively removes motions that are inherently untrackable or dynamically unstable for a robot.
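
To make the filtering rule concrete, here is a minimal Python sketch of the two acceptance checks, assuming the per-frame 2D ground projections of the CoM and CoP have already been computed from the motion (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def frame_is_stable(com_xy, cop_xy, eps_stab=0.1):
    """Stability criterion: ground-plane distance between the CoM and CoP projections."""
    return np.linalg.norm(np.asarray(com_xy) - np.asarray(cop_xy)) < eps_stab

def sequence_is_stable(com_xy_seq, cop_xy_seq, eps_stab=0.1, eps_n=100):
    """Accept a motion if (1) its first and last frames are stable and
    (2) the largest gap between consecutive stable frames is below eps_n."""
    stable = np.array([frame_is_stable(c, p, eps_stab)
                       for c, p in zip(com_xy_seq, cop_xy_seq)])
    if not (stable[0] and stable[-1]):               # boundary-frame stability
        return False
    stable_idx = np.flatnonzero(stable)              # sorted indices t_0 < ... < t_K
    max_gap = np.max(np.diff(stable_idx), initial=0) # maximum instability gap
    return max_gap < eps_n
```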

4.2.1.3. Motion Correction based on Contact Mask

This step refines motions by explicitly accounting for foot-ground contact.

  • Contact Mask Estimation: Contact masks are estimated by analyzing ankle displacement across consecutive frames, based on the zero-velocity assumption for feet in contact.
    • Let $\pmb{p}_t^{\text{l-ankle}} \in \mathbb{R}^3$ be the position of the left ankle at time $t$, and $c_t^{\mathrm{left}} \in \{0, 1\}$ be its corresponding contact mask (1 for contact, 0 for no contact).
    • The contact mask is estimated as: $$c_t^{\mathrm{left}} = \mathbb{I}\big[\|\pmb{p}_{t+1}^{\text{l-ankle}} - \pmb{p}_t^{\text{l-ankle}}\|_2^2 < \epsilon_{\mathrm{vel}}\big] \cdot \mathbb{I}\big[p_{t,z}^{\text{l-ankle}} < \epsilon_{\mathrm{height}}\big]$$
      • $c_t^{\mathrm{left}}$: The binary contact mask for the left foot at time $t$.
      • $\mathbb{I}[\cdot]$: The indicator function, which equals 1 if the condition inside is true, and 0 otherwise.
      • $\|\pmb{p}_{t+1}^{\text{l-ankle}} - \pmb{p}_t^{\text{l-ankle}}\|_2^2$: The squared Euclidean distance of the left ankle's displacement between frames $t$ and $t+1$. This checks for near-zero velocity.
      • $\epsilon_{\mathrm{vel}}$: An empirically chosen velocity threshold (e.g., 0.002 as per Table 3). If the displacement is below this, the velocity is considered zero.
      • $p_{t,z}^{\text{l-ankle}}$: The vertical ($z$) coordinate of the left ankle at time $t$.
      • $\epsilon_{\mathrm{height}}$: An empirically chosen height threshold (e.g., 0.2 as per Table 3). This ensures the foot is near the ground.
    • A similar process is applied for the right foot.
  • Correction Step: To address minor floating artifacts (where the feet appear to hover above the ground), a vertical offset is applied to the global translation if a foot is in contact.
    • Let $\psi_t$ denote the global translation of the pose at time $t$.
    • The corrected vertical position is: $$\psi_{t,z}^{\mathrm{corr}} = \psi_{t,z} - \Delta h_t$$
      • $\psi_{t,z}^{\mathrm{corr}}$: The corrected vertical component of the global translation at time $t$.
      • $\psi_{t,z}$: The original vertical component of the global translation at time $t$.
      • $\Delta h_t = \operatorname*{min}_{v \in \mathcal{V}_t} p_{t,z}^v$: The lowest $z$-coordinate among all SMPL mesh vertices $\mathcal{V}_t$ at frame $t$. This effectively brings the lowest point of the SMPL mesh to the ground.
  • Smoothing: This correction can introduce frame-to-frame jitter. To counter this, Exponential Moving Average (EMA) is applied to smooth the motion.
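
A rough Python sketch of the contact-mask heuristic and the floating-artifact correction is given below; the array layout and the exact placement of the EMA smoothing are assumptions for illustration (the paper only states that EMA is applied to counter the jitter introduced by the correction):

```python
import numpy as np

def estimate_contact_mask(ankle_pos, eps_vel=0.002, eps_height=0.2):
    """Zero-velocity heuristic for one foot; ankle_pos has shape (T, 3)."""
    disp_sq = np.sum(np.diff(ankle_pos, axis=0) ** 2, axis=1)  # squared per-frame displacement
    near_static = disp_sq < eps_vel
    near_ground = ankle_pos[:-1, 2] < eps_height
    contact = (near_static & near_ground).astype(np.int64)
    return np.append(contact, contact[-1])                     # repeat the last value for the final frame

def correct_floating(trans_z, lowest_vertex_z, contact_any_foot, ema_alpha=0.9):
    """Subtract the lowest SMPL-vertex height on contact frames, smoothed with an
    EMA to avoid frame-to-frame jitter (the EMA placement here is an assumption)."""
    corrected = np.array(trans_z, dtype=float)
    smoothed = 0.0
    for t in range(len(trans_z)):
        offset = lowest_vertex_z[t] if contact_any_foot[t] else smoothed
        smoothed = ema_alpha * smoothed + (1 - ema_alpha) * offset
        corrected[t] = trans_z[t] - smoothed
    return corrected
```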

4.2.1.4. Motion Retargeting

The processed SMPL-format motions are then adapted to the target robot's kinematics.

  • Method: An Inverse Kinematics (IK)-based method [19] is used.
  • Process: This involves formulating a differentiable optimization problem that aligns the end-effector trajectories of the SMPL model with the robot's end-effectors while respecting the robot's joint limits.
  • Data Augmentation: To enhance diversity, additional motions from open-source datasets like AMASS [4] and LAFAN [20] are also processed through this pipeline.

4.2.2. Adaptive Motion Tracking

This mechanism dynamically adjusts how strictly the robot must adhere to the reference motion during RL training.

4.2.2.1. Exponential Form Tracking Reward

The reward function in PBHC consists of two parts: task-specific rewards (for motion tracking) and regularization rewards (for stability and smoothness).

  • Task-Specific Rewards: These include terms for aligning joint states, rigid body states, and foot contact masks.

  • Exponential Form: Most task-specific rewards (except foot contact tracking) follow an exponential form: $$r(x) = \exp(-x / \sigma)$$

    • $r(x)$: The reward value for a given tracking error.
    • $x$: The tracking error, typically measured as the Mean Squared Error (MSE) of quantities like joint angles. A lower $x$ means better tracking.
    • $\sigma$: The tracking factor, which controls the tolerance of the error. A larger $\sigma$ means more tolerance (rewards remain high even for larger errors), while a smaller $\sigma$ means less tolerance (rewards drop sharply with increasing error).
  • Why Exponential Form? It is preferred over a simple negative error because it is bounded (the maximum reward is 1), helps stabilize training, and offers an intuitive way to adjust reward weighting via $\sigma$.

    The following figure (Figure 2 from the original paper) illustrates the effect of the tracking factor $\sigma$ on the reward value:

    Figure 2: Illustration of the effect of the tracking factor $\sigma$ on the reward value. The horizontal axis is the tracking error $x$ and the vertical axis is the reward $r(x) = \exp(-x/\sigma)$, with curves for $\sigma = 0.2$, $1.0$, and $5.0$.

As seen in the graph, for a fixed tracking error $x$, a larger $\sigma$ (blue curve, $\sigma = 5.0$) yields a higher reward, indicating more tolerance. A smaller $\sigma$ (red curve, $\sigma = 0.2$) causes the reward to drop more sharply for the same error, requiring higher precision.
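
As a quick numerical illustration of this tolerance effect (a small sketch, not from the paper's codebase):

```python
import numpy as np

def tracking_reward(error, sigma):
    """Exponential tracking reward r(x) = exp(-x / sigma)."""
    return np.exp(-error / sigma)

# The same tracking error of 0.5 is rewarded very differently under the
# three tracking factors plotted in Figure 2.
for sigma in (0.2, 1.0, 5.0):
    print(f"sigma={sigma:>3}: r(0.5) = {tracking_reward(0.5, sigma):.3f}")
# sigma=0.2: r(0.5) = 0.082   (tight tolerance, sharp penalty)
# sigma=1.0: r(0.5) = 0.607
# sigma=5.0: r(0.5) = 0.905   (loose tolerance, reward stays high)
```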

4.2.2.2. Optimal Tracking Factor

To determine the ideal $\sigma$, the problem of motion tracking is modeled as a bi-level optimization (BLO) problem [11].

  • Intuition: The goal is to find a $\sigma$ that leads to the minimum accumulated tracking error from the converged policy. This mimics how a human engineer might iteratively tune $\sigma$ and observe the results.

  • Problem Formulation:

    • Let $\pi$ be a policy and $\pmb{x} \in \mathbb{R}_+^N$ be the sequence of expected tracking errors over $N$ steps of an episode rollout.
    • The inner (lower-level) optimization represents the RL training procedure: given a fixed $\sigma$, the policy aims to maximize its accumulated reward: $$\operatorname*{max}_{\pmb{x} \in \mathbb{R}_+^N} J^{\mathrm{in}}(\pmb{x}, \sigma) + R(\pmb{x})$$
      • $J^{\mathrm{in}}(\pmb{x}, \sigma) = \sum_{i=1}^N \exp(-x_i / \sigma)$: The simplified accumulated reward from the exponential tracking term.
      • $R(\pmb{x})$: Represents other reward components and environment dynamics, including regularization rewards and external policy objectives.
    • The outer (upper-level) optimization then selects the optimal $\sigma$ to minimize the total tracking error of the final converged policy: $$\operatorname*{max}_{\sigma \in \mathbb{R}_+} \; J^{\mathrm{ex}}(\pmb{x}^*) \quad \mathrm{s.t.} \quad \pmb{x}^* \in \arg \operatorname*{max}_{\pmb{x} \in \mathbb{R}_+^N} J^{\mathrm{in}}(\pmb{x}, \sigma) + R(\pmb{x})$$
      • $J^{\mathrm{ex}}(\pmb{x}^*) = \sum_{i=1}^N -x_i^*$: The negative of the accumulated tracking error of the optimal error sequence $\pmb{x}^*$ from the inner problem. Maximizing this is equivalent to minimizing the accumulated tracking error.
  • Derivation of Optimal $\sigma^*$ (from Appendix A): Assuming $R(\pmb{x})$ is linear, $J^{\mathrm{in}}$ and $J^{\mathrm{ex}}$ are twice continuously differentiable, and the lower-level problem has a unique solution $\pmb{x}^*(\sigma)$, an implicit gradient approach can be used.

    1. Gradient of $J^{\mathrm{ex}}$ w.r.t. $\sigma$: $$\frac{d J^{\mathrm{ex}}}{d \sigma} = \frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top \nabla_{\pmb{x}} J^{\mathrm{ex}}(\pmb{x}^*(\sigma))$$

      • $\frac{d J^{\mathrm{ex}}}{d \sigma}$: The total derivative of the external objective with respect to $\sigma$.
      • $\frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top$: The transpose of the derivative of the optimal error sequence with respect to $\sigma$.
      • $\nabla_{\pmb{x}} J^{\mathrm{ex}}(\pmb{x}^*(\sigma))$: The gradient of the external objective with respect to the error sequence, evaluated at the optimal error sequence.
    2. Using the KKT condition for the lower-level problem: Since $\pmb{x}^*(\sigma)$ is the solution to the inner maximization problem, its gradient must be zero: $$\nabla_{\pmb{x}} \big( J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma) + R(\pmb{x}) \big) = 0$$

    3. Taking the first-order derivative of the KKT condition w.r.t. $\sigma$: $$\frac{d}{d \sigma} \Big( \nabla_{\pmb{x}} \big( J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma) + R(\pmb{x}) \big) \Big) = \nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}} + \frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top \nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}} = 0$$ This equation relates the mixed partial derivatives of $J^{\mathrm{in}}$ to the derivative of $\pmb{x}^*(\sigma)$ with respect to $\sigma$.

    4. Solving for $\frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top$: $$\frac{d \pmb{x}^*(\sigma)}{d \sigma}^\top = - \nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma) \, \nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma)^{-1}$$

      • $\nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}}$: The mixed second-order partial derivative of $J^{\mathrm{in}}$ with respect to $\sigma$ and $\pmb{x}$.
      • $\nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}}$: The second-order partial derivative (Hessian) of $J^{\mathrm{in}}$ with respect to $\pmb{x}$.
    5. Substituting back into the gradient of $J^{\mathrm{ex}}$: $$\frac{d J^{\mathrm{ex}}}{d \sigma} = - \nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma) \, \nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}^*(\sigma), \sigma)^{-1} \nabla_{\pmb{x}} J^{\mathrm{ex}}(\pmb{x}^*(\sigma))$$

    6. Explicit forms of $J^{\mathrm{ex}}$ and $J^{\mathrm{in}}$: $$J^{\mathrm{ex}}(\pmb{x}) = \sum_{i=1}^N -x_i, \qquad J^{\mathrm{in}}(\pmb{x}, \sigma) = \sum_{i=1}^N \exp(-x_i / \sigma).$$

    7. Compute first- and second-order gradients: $$\begin{aligned} \nabla_{\pmb{x}} J^{\mathrm{in}}(\pmb{x}, \sigma) &= -\tfrac{1}{\sigma}\exp(-\pmb{x} / \sigma), \\ \nabla_{\pmb{x}} J^{\mathrm{ex}}(\pmb{x}) &= -\mathbf{1}, \\ \nabla_{\sigma, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}, \sigma) &= \tfrac{\sigma - \pmb{x}}{\sigma^3} \odot \exp(-\pmb{x} / \sigma), \\ \nabla_{\pmb{x}, \pmb{x}}^2 J^{\mathrm{in}}(\pmb{x}, \sigma) &= \mathrm{diag}(\exp(-\pmb{x} / \sigma)) / \sigma^2, \end{aligned}$$

      • $\exp(-\pmb{x} / \sigma)$: An element-wise exponential function applied to the vector $-\pmb{x} / \sigma$.
      • $-\tfrac{1}{\sigma}$: A scalar multiplier.
      • $\mathbf{1}$: A vector of ones; since $J^{\mathrm{ex}}$ sums the negated errors, its gradient is $-\mathbf{1}$.
      • $\odot$: Element-wise multiplication.
      • $\mathrm{diag}(\cdot)$: Creates a diagonal matrix from a vector.
    8. Setting the gradient to zero ($\frac{d J^{\mathrm{ex}}}{d \sigma} = 0$) and simplifying: This leads to the conclusion that the optimal tracking factor $\sigma^*$ is the average of the optimal tracking errors: $$\sigma^* = \frac{1}{N} \sum_{i=1}^N x_i^*$$

      • $\sigma^*$: The optimal tracking factor.
      • $N$: The number of steps in the motion sequence.
      • $x_i^*$: The optimal tracking error at the $i$-th step.
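
For clarity, step 8 can be expanded using the gradients from step 7 (the diagonal Hessian makes the matrix inverse element-wise): $$\frac{d J^{\mathrm{ex}}}{d \sigma} = -\sum_{i=1}^N \frac{\sigma - x_i^*}{\sigma^3} e^{-x_i^*/\sigma} \cdot \frac{\sigma^2}{e^{-x_i^*/\sigma}} \cdot (-1) = \sum_{i=1}^N \frac{\sigma - x_i^*}{\sigma} = N - \frac{1}{\sigma} \sum_{i=1}^N x_i^*,$$ and setting this to zero yields $\sigma^* = \frac{1}{N}\sum_{i=1}^N x_i^*$.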

4.2.2.3. Adaptive Mechanism

Since $\sigma^*$ and $\pmb{x}^*$ are inter-dependent, direct computation of $\sigma^*$ is not possible. Also, a single fixed $\sigma$ is impractical for diverse motions. An adaptive mechanism is designed to dynamically adjust $\sigma$ during training.

  • Online Estimation: An Exponential Moving Average (EMA) $\hat{x}$ of the instantaneous tracking error is maintained over environment steps. This $\hat{x}$ serves as an online estimate of the expected tracking error for the current policy.

  • Feedback Loop: During training, PBHC updates $\sigma$ to the current value of $\hat{x}$. This creates a closed-loop feedback system: as the tracking error decreases, $\hat{x}$ decreases, which in turn leads to a tightening of $\sigma$. This process drives further policy refinement.

  • Update Rule: To ensure stability, $\sigma$ is constrained to be non-increasing and initialized with a relatively large value ($\sigma^{\mathrm{init}}$): $$\sigma \gets \operatorname*{min}(\sigma, \hat{x})$$

    • $\sigma$: The tracking factor that is being adapted.
    • $\hat{x}$: The Exponential Moving Average (EMA) of the instantaneous tracking error.
    • $\operatorname*{min}(\cdot, \cdot)$: The minimum function, ensuring $\sigma$ only decreases or stays the same.
  • Benefit: This adaptive process allows the policy to progressively improve its tracking precision over time, as illustrated in the following figure (Figure 4 from the original paper).

    Figure 4: Example of the right hand $y$-position for 'Horse-stance punch'. The adaptive $\sigma$ can progressively improve the tracking precision. $\sigma_{\mathrm{pos\_vr}}$ is used for tracking the head and hands.

This graph shows the right hand $y$-position for the 'Horse-stance punch' motion. The adaptive $\sigma$ (blue line) starts high and gradually decreases, leading to tighter tracking of the reference (black line) over more training steps. In contrast, a fixed $\sigma$ (red line) does not allow for this progressive improvement.

The following figure (Figure 3 from the original paper) depicts the closed-loop adjustment of the tracking factor in the adaptive mechanism:

Figure 3: Closed-loop adjustment of the tracking factor in the proposed adaptive mechanism.

The diagram shows a continuous loop: the tracking error $\hat{x}$ (estimated online) determines the tracking factor $\sigma$. A reduction in $\hat{x}$ leads to a tightening of $\sigma$ (via $\sigma \gets \min(\sigma, \hat{x})$). The tightened $\sigma$ then reshapes the reward function $r(x)$, which in turn drives policy optimization (to maximize rewards), leading to a further reduction in the tracking error $\hat{x}$. This feedback loop allows the system to converge toward the optimal $\sigma$.
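
The update logic can be summarized in a few lines of Python; the initial value and the EMA decay used here are placeholder values, not the paper's settings:

```python
import numpy as np

class AdaptiveTrackingFactor:
    """EMA estimate of the tracking error drives a non-increasing sigma,
    following the rule sigma <- min(sigma, x_hat)."""

    def __init__(self, sigma_init=1.0, ema_decay=0.999):
        self.sigma = sigma_init    # start loose (sigma_init is an illustrative value)
        self.x_hat = sigma_init    # EMA of the instantaneous tracking error
        self.ema_decay = ema_decay

    def update(self, tracking_error):
        # Online EMA of the per-step tracking error (e.g., joint-position MSE).
        self.x_hat = self.ema_decay * self.x_hat + (1 - self.ema_decay) * tracking_error
        # Tighten the tolerance only when the policy has actually improved.
        self.sigma = min(self.sigma, self.x_hat)
        return self.sigma

    def reward(self, tracking_error):
        return np.exp(-tracking_error / self.sigma)
```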

4.2.3. RL Training Framework

4.2.3.1. Asymmetric Actor-Critic

The policy optimization uses an asymmetric actor-critic architecture, typical in RL for sim-to-real transfer.

  • Time Phase Variable: A time phase variable $\phi_t \in [0, 1]$ is introduced, representing the linear progress of the reference motion (0 at start, 1 at end).
  • Actor's Observation Space ($s_t^{\mathrm{actor}}$): The actor operates with local observations. It receives:
    • Proprioceptive state ($s_t^{\mathrm{prop}}$) of the robot: This includes historical information (last 5 steps) of joint positions ($\pmb{q}_t$), joint velocities ($\dot{\pmb{q}}_t$), root angular velocity ($\boldsymbol{\omega}_t^{\mathrm{root}}$), root projected gravity ($\pmb{g}_t^{\mathrm{proj}}$), and the action from the previous step ($\pmb{a}_{t-1}$).
    • The time phase variable $\phi_t$.
  • Critic's Observation Space ($s_t^{\mathrm{critic}}$): The critic uses privileged information to learn a better value function. Its observations include:
    • All components of the actor's observation ($s_t^{\mathrm{prop}}$ and the time phase).
    • Additionally, reference motion positions, root linear velocity, and a set of randomized physical parameters (e.g., base CoM offset, link mass, stiffness, damping, friction coefficient, control delay). These privileged parameters are crucial for learning a robust value function that can generalize across different physical conditions, aiding sim-to-real transfer.

4.2.3.2. Reward Vectorization

  • To facilitate learning for multiple reward components, rewards and value functions are vectorized.
  • Instead of summing all reward components into a single scalar, each component $r_i$ is assigned to a separate value function $V_i(s)$.
  • The critic network has multiple output heads, each estimating the return for a specific reward component.
  • All value functions are then aggregated to compute the action advantage. This design enhances value estimation and stability in policy optimization.
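
A plausible sketch of this aggregation, assuming per-component GAE with the advantages summed before the PPO update (the paper only states that the value functions are aggregated to compute the action advantage):

```python
import numpy as np

def aggregate_advantages(rewards, values, next_values, gamma=0.99, lam=0.95):
    """Vectorized-reward advantage estimation: each of the K reward components has
    its own value head; per-component GAE advantages are summed into one scalar
    advantage per step. Shapes: rewards, values, next_values are all (T, K).
    Episode-termination masks are omitted for brevity."""
    T, K = rewards.shape
    adv = np.zeros((T, K))
    gae = np.zeros(K)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_values[t] - values[t]  # per-head TD error
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv.sum(axis=1)                                       # scalar advantage per step
```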

4.2.3.3. Reference State Initialization (RSI)

  • The robot's state is initialized from reference motion states at randomly sampled time phases.
  • This technique allows for parallel learning across different segments (phases) of a motion, significantly improving training efficiency by avoiding repetitive learning from the very beginning of a motion every time.
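
A minimal sketch of RSI, with a hypothetical reference-motion dictionary layout (the keys are illustrative, not the paper's data format):

```python
import numpy as np

def reference_state_init(ref_motion, rng=np.random.default_rng()):
    """Sample a random time phase and initialize the robot from that reference frame."""
    num_frames = len(ref_motion["qpos"])
    phase = rng.uniform(0.0, 1.0)              # time phase in [0, 1]
    frame = int(phase * (num_frames - 1))
    init_state = {
        "qpos": ref_motion["qpos"][frame],     # joint positions
        "qvel": ref_motion["qvel"][frame],     # joint velocities
        "root_pos": ref_motion["root_pos"][frame],
        "root_rot": ref_motion["root_rot"][frame],
    }
    return phase, init_state
```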

4.2.3.4. Sim-to-Real Transfer

  • Domain Randomization: To bridge the sim-to-real gap, domain randomization is applied during training. This involves varying physical parameters of the simulated environment and humanoid model (e.g., friction, mass, CoM offset, control delay, external perturbations). By training policies that are robust to these variations, the policies become more generalized and perform well on the real robot, even with slight mismatches between simulation and reality.
  • Zero-Shot Transfer: The policies trained with domain randomization are directly deployed to real robots without any fine-tuning, achieving zero-shot sim-to-real transfer.

5. Experimental Setup

5.1. Datasets

The experiments utilize a highly-dynamic motion dataset constructed using PBHC's motion processing pipeline.

  • Sources:

    1. Video-based sources: Motions extracted from videos and fully processed by the PBHC pipeline.
    2. Open-source datasets: Selected motions from AMASS [4] and LAFAN [20], partially processed through the PBHC pipeline (contact mask estimation, motion correction, retargeting).
  • Characteristics: The dataset comprises 13 distinct motions, categorized into three difficulty levels: easy, medium, and hard, based on their agility requirements.

  • Transition Smoothing: To ensure smooth transitions, linear interpolation is applied at the beginning and end of each sequence to move from a default pose to the reference motion and back.

    The following figure (Figure 5 from the original paper) shows examples of motions in the constructed dataset:

    Figure 5: Example motions in our constructed dataset. Darker opacity indicates later timestamps.

As shown, motions range from relatively simple horse-stance punch to more complex stretch leg, jump kick, and 360-degree spin, with darker opacity indicating later timestamps in the motion trajectory.

The following are the details of the highly-dynamic motion dataset (Table 4 from the original paper):

Motion name Motion frames Source
Easy
Jabs punch 285 video
Hooks punch 175 video
Horse-stance pose 210 LAFAN
Horse-stance punch 200 video
Medium
Stretch leg 320 video
Tai Chi 500 video
Jump kick 145 video
Charleston dance 610 LAFAN
Bruce Lee's pose 330 AMASS
Hard
Roundhouse kick 158 AMASS
360-degree spin 180 video
Front kick 155 video
Side kick 179 AMASS

5.2. Evaluation Metrics

The tracking performance of policies is quantified using various error metrics, focusing on position, velocity, and acceleration errors across different body parts and joints. For a motion with $N$ frames, the expectation $\mathbb{E}[\cdot]$ is taken over all body parts/joints and all frames where the robot is active.

  1. Global Mean Per Body Position Error ($E_{\mathrm{g-mpbpe}}$, mm)

    • Conceptual Definition: This metric quantifies the average position error of all body parts (links) of the robot in global coordinates relative to the reference motion. It measures how far, on average, the robot's body is from the target positions in the global frame.
    • Mathematical Formula: $$E_{\mathrm{g-mpbpe}} = \mathbb{E} \left[ \left\| \pmb{p}_t - \pmb{p}_t^{\mathrm{ref}} \right\|_2 \right]$$
    • Symbol Explanation:
      • $E_{\mathrm{g-mpbpe}}$: Global Mean Per Body Position Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all body parts and time steps.
      • $\pmb{p}_t$: The 3D position vector of a specific body part of the robot at time $t$.
      • $\pmb{p}_t^{\mathrm{ref}}$: The 3D position vector of the corresponding body part in the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm (L2 norm), representing the distance between the robot's body part and the reference.
  2. Root-Relative Mean Per Body Position Error ($E_{\mathrm{mpbpe}}$, mm)

    • Conceptual Definition: This metric measures the average position error of body parts relative to the robot's root (e.g., pelvis or base). It focuses on the internal posture and configuration error, decoupling it from any global drift of the entire robot.
    • Mathematical Formula: $$E_{\mathrm{mpbpe}} = \mathbb{E} \left[ \left\| \left( \pmb{p}_t - \pmb{p}_{\mathrm{root},t} \right) - \left( \pmb{p}_t^{\mathrm{ref}} - \pmb{p}_{\mathrm{root},t}^{\mathrm{ref}} \right) \right\|_2 \right]$$
    • Symbol Explanation:
      • $E_{\mathrm{mpbpe}}$: Root-relative Mean Per Body Position Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all body parts and time steps.
      • $\pmb{p}_t$: The 3D position vector of a specific body part of the robot at time $t$.
      • $\pmb{p}_{\mathrm{root},t}$: The 3D position vector of the robot's root at time $t$.
      • $\pmb{p}_t^{\mathrm{ref}}$: The 3D position vector of the corresponding body part in the reference motion at time $t$.
      • $\pmb{p}_{\mathrm{root},t}^{\mathrm{ref}}$: The 3D position vector of the reference motion's root at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  3. Mean Per Joint Position Error ($E_{\mathrm{mpjpe}}$, $10^{-3}$ rad)

    • Conceptual Definition: This metric quantifies the average angular error for each joint's position (angle) across the robot's Degrees of Freedom. It directly measures how closely the robot's joint configurations match the reference.
    • Mathematical Formula: $$E_{\mathrm{mpjpe}} = \mathbb{E} \left[ \left\| \pmb{q}_t - \pmb{q}_t^{\mathrm{ref}} \right\|_2 \right]$$
    • Symbol Explanation:
      • $E_{\mathrm{mpjpe}}$: Mean Per Joint Position Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all joints and time steps.
      • $\pmb{q}_t$: The vector of joint angles for the robot at time $t$.
      • $\pmb{q}_t^{\mathrm{ref}}$: The vector of joint angles for the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  4. Mean Per Joint Velocity Error ($E_{\mathrm{mpjve}}$, $10^{-3}$ rad/frame)

    • Conceptual Definition: This metric measures the average error in the angular velocities of the robot's joints compared to the reference motion. It indicates how well the robot matches the speed of joint movements.
    • Mathematical Formula: $$E_{\mathrm{mpjve}} = \mathbb{E} \left[ \left\| \Delta \pmb{q}_t - \Delta \pmb{q}_t^{\mathrm{ref}} \right\|_2 \right], \quad \text{where } \Delta \pmb{q}_t = \pmb{q}_t - \pmb{q}_{t-1}$$
    • Symbol Explanation:
      • $E_{\mathrm{mpjve}}$: Mean Per Joint Velocity Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all joints and time steps.
      • $\Delta \pmb{q}_t$: The vector of joint angular velocities for the robot at time $t$, approximated as the difference in joint positions between $t$ and $t-1$.
      • $\Delta \pmb{q}_t^{\mathrm{ref}}$: The vector of joint angular velocities for the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  5. Mean Per Body Velocity Error ($E_{\mathrm{mpbve}}$, mm/frame)

    • Conceptual Definition: This metric measures the average error in the linear velocities of the robot's body parts compared to the reference motion. It assesses how accurately the robot matches the speed of its body segments' movements.
    • Mathematical Formula: $$E_{\mathrm{mpbve}} = \mathbb{E} \left[ \left\| \Delta \pmb{p}_t - \Delta \pmb{p}_t^{\mathrm{ref}} \right\|_2 \right], \quad \text{where } \Delta \pmb{p}_t = \pmb{p}_t - \pmb{p}_{t-1}$$
    • Symbol Explanation:
      • $E_{\mathrm{mpbve}}$: Mean Per Body Velocity Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all body parts and time steps.
      • $\Delta \pmb{p}_t$: The vector of linear velocities for a body part of the robot at time $t$, approximated as the difference in positions between $t$ and $t-1$.
      • $\Delta \pmb{p}_t^{\mathrm{ref}}$: The vector of linear velocities for the corresponding body part in the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  6. Mean Per Body Acceleration Error ($E_{\mathrm{mpbae}}$, mm/frame$^2$)

    • Conceptual Definition: This metric measures the average error in the linear accelerations of the robot's body parts compared to the reference motion. It is particularly important for dynamic movements, indicating how well the robot matches the rates of change of velocities.
    • Mathematical Formula: $$E_{\mathrm{mpbae}} = \mathbb{E} \left[ \left\| \Delta^2 \pmb{p}_t - \Delta^2 \pmb{p}_t^{\mathrm{ref}} \right\|_2 \right], \quad \text{where } \Delta^2 \pmb{p}_t = \Delta \pmb{p}_t - \Delta \pmb{p}_{t-1}$$
    • Symbol Explanation:
      • $E_{\mathrm{mpbae}}$: Mean Per Body Acceleration Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all body parts and time steps.
      • $\Delta^2 \pmb{p}_t$: The vector of linear accelerations for a body part of the robot at time $t$, approximated as the difference in velocities ($\Delta \pmb{p}_t$) between $t$ and $t-1$.
      • $\Delta^2 \pmb{p}_t^{\mathrm{ref}}$: The vector of linear accelerations for the corresponding body part in the reference motion at time $t$.
      • $\lVert \cdot \rVert_2$: The Euclidean norm.
  7. Mean Foot Contact Mask Error ($E_{\mathrm{contact-mask}}$) (Introduced in Appendix E.3)

    • Conceptual Definition: This metric quantifies the average discrepancy between the robot's estimated foot contact state and the reference motion's foot contact state. It directly measures how accurately the robot's foot contacts (or lack thereof) match the intended contacts in the reference.
    • Mathematical Formula: $$E_{\mathrm{contact-mask}} = \mathbb{E} \left[ \| c_t - \hat{c}_t \|_1 \right]$$
    • Symbol Explanation:
      • $E_{\mathrm{contact-mask}}$: Mean Foot Contact Mask Error.
      • $\mathbb{E}[\cdot]$: Expectation, averaged over all feet and time steps.
      • $c_t$: The actual (or estimated) binary contact mask for the robot's foot (1 for contact, 0 for no contact) at time $t$.
      • $\hat{c}_t$: The binary contact mask for the reference motion's foot at time $t$.
      • $\lVert \cdot \rVert_1$: The Manhattan norm (L1 norm), summing the absolute differences.
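
For the position-based metrics, a small reference sketch is given below; the array shapes, the assumption that positions are in meters and joint angles in radians, and the averaging convention (which follows the formulas above literally) are illustrative choices, not the paper's evaluation code:

```python
import numpy as np

def position_tracking_metrics(body_pos, body_pos_ref, root_pos, root_pos_ref, q, q_ref):
    """body_pos, body_pos_ref: (T, B, 3) link positions in meters;
    root_pos, root_pos_ref: (T, 3); q, q_ref: (T, J) joint angles in radians."""
    # Global per-body position error, converted to mm.
    e_g_mpbpe = np.linalg.norm(body_pos - body_pos_ref, axis=-1).mean() * 1e3
    # Root-relative per-body position error, converted to mm.
    rel = body_pos - root_pos[:, None, :]
    rel_ref = body_pos_ref - root_pos_ref[:, None, :]
    e_mpbpe = np.linalg.norm(rel - rel_ref, axis=-1).mean() * 1e3
    # Joint-position error: per-frame L2 norm over the joint vector,
    # averaged over frames, reported in 1e-3 rad.
    e_mpjpe = np.linalg.norm(q - q_ref, axis=-1).mean() * 1e3
    return {"E_g-mpbpe": e_g_mpbpe, "E_mpbpe": e_mpbpe, "E_mpjpe": e_mpjpe}
```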

5.3. Baselines

PBHC is compared against three baseline methods, all of which use an exponential form for the tracking reward function (similar to PBHC's design in Section 3.2.1), but with empirically tuned and fixed parameters rather than adaptive ones.

  1. OmniH2O [10]:

    • Description: This method adopts a teacher-student training paradigm. The teacher policy learns in a privileged setting, and the student policy learns from the teacher's outputs.
    • Implementation: The authors moderately increased the tracking reward weights to better match the G1 robot. The teacher and student policies were trained for 20 and 10 hours, respectively.
  2. ExBody2 [5]:

    • Description: This method utilizes a decoupled keypoint-velocity tracking mechanism. It focuses on expressive whole-body control.
    • Implementation: Similar to OmniH2O, teacher and student policies were trained for 20 and 10 hours, respectively.
  3. MaskedMimic [2]:

    • Description: This method is primarily designed for character animation and focuses on optimizing pose-level accuracy. It comprises three sequential training phases.
    • Implementation: The authors utilized only the first phase of MaskedMimic as the full method is not pertinent to robot control tasks (lacks constraints like partial observability and action smoothness). Each policy was trained for 18 hours.
    • Note: The paper also trains an "Oracle" version of PBHC that, like MaskedMimic, overlooks robot-specific constraints for a fair comparison of pose-level accuracy in a less constrained setting.

5.4. Experiment Setup Details (from Appendix D.1, C.3, C.4, C.5, C.6)

5.4.1. Compute Platform

  • Hardware: Each experiment was conducted on a machine equipped with:
    • CPU: 24-core Intel i7-13700 running at 5.2GHz.
    • RAM: 32 GB.
    • GPU: Single NVIDIA GeForce RTX 4090.
  • Operating System: Ubuntu 20.04.
  • Training Time: Each model was trained for 27 hours.

5.4.2. Real Robot Setup

  • Robot: Unitree G1 humanoid robot.
  • System Architecture:
    • Onboard motion control board: Collects sensor data.
    • External PC: Connected via Ethernet, receives sensor data (using DDS protocol), maintains observation history, performs policy inference, and sends target joint angles back to the control board.
    • Control board: Issues motor commands based on received target joint angles.

5.4.3. Domain Randomization (from Appendix C.3)

To improve sim-to-real transfer and robustness, domain randomization is incorporated. The following are the domain randomization settings (Table 7 from the original paper):

Term Value
Dynamics Randomization
Friction U(0.2, 1.2)
PD gain U(0.9, 1.1)
Link mass(kg) U(0.9, 1.1)× default
Ankle inertia(kg.m2) U(0.9, 1.1)× default
Base CoM offset(m) U(-0.05, 0.05)
ERFI[58](N·m/kg) 0.05× torque limit
Control delay(ms) U(0, 40)
External Perturbation
Random push interval(s) [5, 10]
Random push velocity(m/s) 0.1
  • $U(a, b)$: Uniform distribution between $a$ and $b$. "default" refers to the nominal value for the Unitree G1 robot.
  • ERFI: External Resistive Force Impulse, a type of perturbation applied to the robot.
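
A simple sampler over these ranges might look like the following sketch (whether the base CoM offset is drawn per axis is an assumption; the dictionary keys are illustrative):

```python
import numpy as np

def sample_domain_randomization(rng=np.random.default_rng()):
    """Draw one set of randomized dynamics parameters from the ranges in Table 7."""
    return {
        "friction": rng.uniform(0.2, 1.2),
        "pd_gain_scale": rng.uniform(0.9, 1.1),
        "link_mass_scale": rng.uniform(0.9, 1.1),      # multiplies the default link masses
        "ankle_inertia_scale": rng.uniform(0.9, 1.1),  # multiplies the default ankle inertia
        "base_com_offset_m": rng.uniform(-0.05, 0.05, size=3),
        "control_delay_ms": rng.uniform(0.0, 40.0),
    }
```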

5.4.4. PPO Hyperparameters (from Appendix C.4)

The following are the PPO hyperparameters used for policy optimization (Table 8 from the original paper):

Hyperparameter Value
Optimizer Adam
Batch size 4096
Mini Batches 4
Learning epochs 5
Entropy coefficient 0.01
Value loss coefficient 1.0
Clip param 0.2
Max grad norm 1.0
Init noise std 0.8
Learning rate 1e-3
Desired KL 0.01
GAE decay factor ($\lambda$) 0.95
GAE discount factor ($\gamma$) 0.99
Actor MLP size [512, 256, 128]
Critic MLP size [768, 512, 128]
MLP Activation ELU

5.4.5. Curriculum Learning (from Appendix C.5)

Two curriculum mechanisms are introduced to facilitate learning high-dynamic motions:

  1. Termination Curriculum:

    • Mechanism: The episode terminates early if the humanoid's motion deviates from the reference beyond a termination threshold $\theta$.
    • Progression: During training, this threshold is gradually decreased, making the task more difficult and requiring higher precision.
    • Update Rule: $$\theta \gets \mathrm{clip} \left( \theta \cdot (1 - \delta), \theta_{\mathrm{min}}, \theta_{\mathrm{max}} \right)$$
      • $\theta$: The termination threshold.
      • $\mathrm{clip}(\cdot, \cdot, \cdot)$: Clips the value within the specified minimum and maximum bounds.
      • $\theta_{\mathrm{init}} = 1.5$: Initial threshold.
      • $\theta_{\mathrm{min}} = 0.3$: Minimum threshold.
      • $\theta_{\mathrm{max}} = 2.0$: Maximum threshold.
      • $\delta = 2.5 \times 10^{-5}$: Decay rate, causing $\theta$ to decrease over time.
  2. Penalty Curriculum:

    • Mechanism: A scaling factor α\alpha modulates the influence of regularization terms (penalties).
    • Progression: α\alpha progressively increases, gradually enforcing stronger regularization and promoting more stable and physically plausible behaviors. This helps in early training stages by being less strict.
    • Update Rule: αclip(α(1+δ),αmin,αmax),r^penaltyαrpenalty \alpha \gets \mathrm{clip}(\alpha \cdot (1 + \delta), \alpha_{\mathrm{min}}, \alpha_{\mathrm{max}}), \quad \hat{r}_{\mathrm{penalty}} \gets \alpha \cdot r_{\mathrm{penalty}}
      • α\alpha: The penalty scaling factor.
      • clip(,,)\mathrm{clip}(\cdot, \cdot, \cdot): Clips the value within the specified minimum and maximum bounds.
      • αinit=0.1\alpha_{\mathrm{init}} = 0.1: Initial penalty scale.
      • αmin=0.0\alpha_{\mathrm{min}} = 0.0: Minimum penalty scale.
      • αmax=1.0\alpha_{\mathrm{max}} = 1.0: Maximum penalty scale.
      • δ=1.0×104\delta = 1.0 \times 10^{-4}: Growth rate, causing α\alpha to increase over time.
      • r^penalty\hat{r}_{\mathrm{penalty}}: The scaled penalty reward.
      • rpenaltyr_{\mathrm{penalty}}: The original penalty reward.
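The two update rules above translate directly into code. The following is a minimal sketch using the listed constants; how often the update is applied (e.g., per environment step or per policy iteration) is an assumption not specified here.

```python
import numpy as np

# Termination curriculum: threshold shrinks toward theta_min over training.
THETA_MIN, THETA_MAX, THETA_DECAY = 0.3, 2.0, 2.5e-5
# Penalty curriculum: penalty scale grows toward alpha_max over training.
ALPHA_MIN, ALPHA_MAX, ALPHA_GROWTH = 0.0, 1.0, 1.0e-4

theta = 1.5   # initial termination threshold
alpha = 0.1   # initial penalty scale

def curriculum_step(theta, alpha, r_penalty):
    """Apply one update of both curricula and return the scaled penalty reward."""
    theta = np.clip(theta * (1.0 - THETA_DECAY), THETA_MIN, THETA_MAX)
    alpha = np.clip(alpha * (1.0 + ALPHA_GROWTH), ALPHA_MIN, ALPHA_MAX)
    r_penalty_scaled = alpha * r_penalty
    return theta, alpha, r_penalty_scaled
```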

5.4.6. PD Controller Parameter (from Appendix C.6)

  • Mechanism: A Proportional-Derivative (PD) controller is used at the joint level to convert desired joint positions into motor torques.

  • Gains: The stiffness (kpk_p) and damping (kdk_d) gains are specified for different joint groups.

  • Numerical Stability: To improve numerical stability and fidelity in the simulator, the inertia of the ankle links is manually set to a fixed value of $5 \times 10^{-3}$ kg·m².

    The following are the PD controller gains (Table 9 from the original paper):

| Joint name | Stiffness ($k_p$) | Damping ($k_d$) |
| --- | --- | --- |
| Left/right shoulder pitch/roll | 100 | 2.0 |
| Left/right shoulder yaw | 50 | 2.0 |
| Left/right elbow | 50 | 2.0 |
| Waist pitch/roll/yaw | 400 | 5.0 |
| Left/right hip pitch/roll/yaw | 100 | 2.0 |
| Left/right knee | 150 | 4.0 |
| Left/right ankle pitch/roll | 40 | 2.0 |
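At the joint level, the conversion from desired positions to torques follows the standard PD law. A minimal sketch is shown below; the torque limit and the example numbers are placeholders, not values from the paper.

```python
import numpy as np

def pd_torque(q_target, q, qd, kp, kd, torque_limit):
    """tau = kp * (q_target - q) - kd * qd, clipped to the actuator limits."""
    tau = kp * (q_target - q) - kd * qd
    return np.clip(tau, -torque_limit, torque_limit)

# Example with the knee gains from Table 9 (torque limit is a placeholder value).
tau_knee = pd_torque(q_target=0.4, q=0.35, qd=0.1, kp=150.0, kd=4.0, torque_limit=90.0)
```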

6. Results & Analysis

6.1. Motion Filtering

This section addresses Q1: "Can our physics-based motion filtering effectively filter out untrackable motions?"

  • Methodology: The physics-based motion filtering method (Section 3.1) was applied to 10 motion sequences. Policies were then trained for each motion, and the Episode Length Ratio (ELR) was computed.

  • ELR Definition: The ratio of average episode length to reference motion length. A high ELR indicates that the policy successfully tracks the motion for a longer duration without early termination (e.g., due to falling or exceeding error thresholds).

  • Findings:

    • 4 sequences were rejected by the filter, and 6 were accepted.
    • Accepted motions consistently achieved high ELRs.
    • Rejected motions achieved a maximum ELR of only 54%, indicating frequent violations of the termination conditions.
  • Conclusion: The filtering method is effective in excluding inherently untrackable motions, thereby improving training efficiency by focusing on viable candidates.

    The following figure (Figure 6 from the original paper) shows the distribution of ELR for accepted and rejected motions:

    Figure 6: The distribution of ELR of accepted and rejected motions.

The plot clearly shows a distinct separation: accepted motions (blue dots) have ELRs mostly above 80%, while rejected motions (orange dots) have ELRs mostly below 60%, confirming the filter's effectiveness.
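For clarity, ELR itself is just a ratio. A minimal sketch follows; whether lengths are counted in control steps or motion frames is an assumption.

```python
import numpy as np

def episode_length_ratio(episode_lengths, reference_length):
    """ELR (%): mean achieved episode length divided by the reference motion length."""
    return 100.0 * np.mean(episode_lengths) / reference_length
```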

6.2. Main Result

This section addresses Q2: "Does PBHC achieve superior tracking performance compared to prior methods in simulation?"

  • Comparison: PBHC is compared against OmniH2O, ExBody2, and MaskedMimic.

  • Evaluation: Policies are trained in IsaacGym [29] with three random seeds and evaluated over 1,000 rollout episodes. Motions are categorized into easy, medium, and hard difficulty levels.

  • Findings:

    • Superiority of PBHC: PBHC consistently outperforms OmniH2O and ExBody2 across all evaluation metrics (position, velocity, acceleration errors) and all difficulty levels. This is indicated by significantly lower error values.

    • Adaptive Mechanism's Role: The paper attributes PBHC's improvements to its adaptive motion tracking mechanism, which dynamically adjusts tracking factors. Baselines with fixed, empirically tuned parameters struggle to generalize across diverse motions.

    • MaskedMimic Context: MaskedMimic sometimes performs well on certain metrics but is designed for character animation and is not suitable for robot control due to neglecting physical constraints like partial observability and action smoothness.

    • Oracle Comparison: An oracle version of PBHC (which, like MaskedMimic, ignores robot constraints) also shows strong performance, suggesting that PBHC's core tracking capabilities are robust even when constraints are relaxed.

      The following are the main results comparing different methods across difficulty levels (Table 1 from the original paper):

| Method | Eg-mpbpe ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Easy | | | | | | |
| OmniH2O | 233.54±4.013* | 103.67±1.912* | 1805.10±12.33* | 8.54±0.125* | 8.46±0.081* | 224.70±2.043 |
| ExBody2 | 588.22±11.43* | 332.50±3.584* | 4014.40±21.50* | 14.29±0.172* | 9.80±0.157* | 206.01±1.346* |
| Ours | 53.25±17.60 | 28.16±6.127 | 725.62±16.20 | 4.41±0.312 | 4.65±0.140 | 81.28±2.052 |
| MaskedMimic (Oracle) | 41.79±1715 | 21.86±2.030 | 739.96±19.94* | 5.20±0.245 | 7.40±0.3 | 132.01±8.941 |
| Ours (Oracle) | 45.02±6.760 | 22.95±15.22 | 710.30±16.66 | 4.63±1.580 | 4.89±0.960 | 73.44±12.42 |
| Medium | | | | | | |
| OmniH2O | 433.64±16.22* | 151.42±7.340* | 2333.90±49.50* | 10.85±0.300 | 10.54±0.152 | 204.36±4.473 |
| ExBody2 | 619.84±26.16* | 261.01±1.592* | 3738.70±26.90* | 14.48±0.160* | 11.25±0.173 | 204.33±2.172* |
| Ours | 126.48±27.01 | 48.87±7.550 | 1043.30±104.4 | 6.62±0.412 | 7.19±0.254 | 105.30±5.941 |
| MaskedMimic (Oracle) | 150.92±33.4 | 61.69±46.01 | 934.25±155.0 | 8.16±1.974 | 10.01±0.83 | 176.84±26.1 |
| Ours (Oracle) | 66.85±50.29 | 29.56±14.53 | 753.69±100.2 | 5.34±0.425 | 6.58±0.291 | 82.73±3.108 |
| Hard | | | | | | |
| OmniH2O | 446.17±12.84 | 147.88±4.142 | 1939.50±23.90 | 14.98±0.643 | 14.40±0.580 | 190.13±8.211 |
| ExBody2 | 689.68±11.80 | 246.40±1252* | 4037.40±16.70* | 19.90±0.210 | 16.72±0.160 | 254.76±3.409* |
| Ours | 290.36±139.1 | 124.61±53.54 | 1326.60±378.9 | 11.93±2.622 | 12.36±2.401 | 135.05±16.43 |
| MaskedMimic (Oracle) | 47.74±2.762 | 27.2±1.615 | 829.02±15.41 | 8.33±0.194 | 10.60±0.420* | 146.90±13.32* |
| Ours (Oracle) | 79.25±69.4 | 34.74±22.6 | 734.90±155.9 | 7.04±1.420 | 8.34±1.140 | 93.79±17.36 |
  • Interpretation: For all three difficulty levels (Easy, Medium, Hard), the Ours method (PBHC) consistently shows the lowest error values among the deployable methods across almost all metrics (Eg-mpbpe, Empbpe, Empjpe, Empbve, Empbae, Empjve). For example, in the Easy category, PBHC achieves an Eg-mpbpe of 53.25 ± 17.60, which is significantly lower than OmniH2O (233.54 ± 4.013) and ExBody2 (588.22 ± 11.43). Similar trends are observed for Medium and Hard motions, where PBHC maintains a substantial lead over the deployable baselines. The Oracle versions (which are not constrained by real-world robot physics and observability) naturally achieve lower errors in some cases, but PBHC still performs very competitively, especially when considering its real-world deployability. The asterisks (*) denote significant improvements (p < 0.05) of PBHC over the baselines, further solidifying its advantage.
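For reference, the position-error metrics in Table 1 are averages of per-body (or per-keypoint) Euclidean distances between the policy rollout and the reference motion. The following is a minimal sketch of a mean per-body position error, assuming positions in metres and reporting in millimetres; the paper's exact body set and scaling may differ.

```python
import numpy as np

def mean_per_body_position_error(body_pos, ref_body_pos):
    """Mean Euclidean distance between tracked and reference body positions.

    body_pos, ref_body_pos: arrays of shape (T, num_bodies, 3) in metres.
    Returns the average error in millimetres (assumed reporting unit).
    """
    per_body_dist = np.linalg.norm(body_pos - ref_body_pos, axis=-1)  # (T, num_bodies)
    return 1000.0 * per_body_dist.mean()
```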

6.3. Impact of Adaptive Motion Tracking Mechanism

This section addresses Q3: "Does the adaptive motion tracking mechanism improve tracking precision?"

  • Methodology: An ablation study was conducted, comparing PBHC's adaptive mechanism with four fixed tracking factor configurations: Coarse, Medium, UpperBound, and LowerBound. These fixed configurations represent different levels of reward tolerance, from very loose to very strict.
  • Findings:
    • Inconsistency of Fixed Factors: The performance of fixed tracking factor configurations varied significantly across different motion types. A setting that performed well for one motion might be suboptimal for another. This highlights the difficulty of finding a universal fixed σ\sigma.

    • Adaptive Mechanism's Robustness: PBHC's adaptive motion tracking mechanism consistently achieved near-optimal performance across all motion types. This demonstrates its effectiveness in dynamically adjusting the tracking factor to suit the unique characteristics and difficulty of each motion.

      The following figure (Figure 7 from the original paper) illustrates the results of this ablation study:

      Figure 7: Ablation study comparing the adaptive motion tracking mechanism with fixed tracking factor variants. The adaptive mechanism consistently achieves near-optimal performance across all motions, whereas fixed variants exhibit varying performance depending on motions.

The plot shows that for different motions (e.g., Jabs punch, Charleston dance, Roundhouse kick, Bruce Lee's pose), the Ours method (blue curve, representing the adaptive mechanism) consistently achieves the lowest or near-lowest error across various tracking factors. In contrast, fixed tracking factor variants (green, red, purple, and yellow curves) exhibit high variance in performance, sometimes doing well, sometimes poorly, depending on the specific motion.
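To make the adaptive mechanism more concrete, the sketch below illustrates the general idea of an EMA-driven tracking factor paired with an exponential tracking reward of the form $\exp(-e/\sigma)$. This is an illustrative reconstruction, not the paper's exact update rule, and the coefficients are placeholders.

```python
import numpy as np

class AdaptiveTrackingFactor:
    """Illustrative EMA-based adjustment of the tracking factor sigma.

    The reward tolerance tightens as the running tracking error improves,
    mimicking the adaptive curriculum described in the paper (coefficients
    and the exact update rule here are assumptions).
    """

    def __init__(self, sigma_init=1.0, ema_beta=0.99, sigma_min=1e-3):
        self.sigma = sigma_init
        self.ema_error = None
        self.beta = ema_beta
        self.sigma_min = sigma_min

    def reward(self, tracking_error):
        # Exponential tracking reward: looser sigma tolerates larger errors.
        return float(np.exp(-tracking_error / self.sigma))

    def update(self, tracking_error):
        # Maintain a smoothed estimate of the current tracking error ...
        if self.ema_error is None:
            self.ema_error = tracking_error
        else:
            self.ema_error = self.beta * self.ema_error + (1.0 - self.beta) * tracking_error
        # ... and tighten the tolerance toward it so the reward stays discriminative.
        self.sigma = max(self.sigma_min, min(self.sigma, self.ema_error))
```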

The following are the ablation study results evaluating the impact of different tracking factors on four motion tasks (Table 12 from the original paper):

| Method | Eg-mpbpe ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Jabs punch | | | | | | |
| Ours | 44.38±7.118 | 28.00±3.533 | 783.36±11.73 | 5.52±0.156 | 6.23±0.063 | 88.01±2.465 |
| Coarse | 63.95±6.680 | 36.76±2.743 | 921.50±16.70 | 6.16±0.011 | 6.46±0.042 | 91.46±0.465 |
| Medium | 51.07±2.635 | 30.93±2.635 | 790.54±22.82 | 5.68±0.140 | 6.31±0.057 | 90.19±1.821 |
| Upperbound | 45.74±1.702 | 28.72±1.702 | 793.52±8.888 | 5.43±0.066 | 6.29±0.085 | 88.68±0.727 |
| Lowerbound | 48.66±0.488 | 28.97±0.487 | 781.73±16.72 | 5.61±0.079 | 6.31±0.06 | 88.44±1.397 |
| Charleston dance | | | | | | |
| Ours | 94.81±14.18 | 43.09±5.748 | 886.91±74.76 | 6.83±0.346 | 7.26±0.034 | 162.70±7.133 |
| Coarse | 119.24±4.501 | 55.80±1.324 | 1288.02±3.807 | 7.54±0.180 | 7.28±0.021 | 178.61±3.304 |
| Medium | 83.63±3.159 | 41.02±1.743 | 933.33±38.23 | 6.89±0.185 | 7.22±0.011 | 164.92±4.380 |
| Upperbound | 86.90±8.651 | 41.92±2.632 | 917.64±14.85 | 7.02±0.103 | 7.22±0.041 | 167.64±1.089 |
| Lowerbound | 358.82±10.35 | 145.42±1.109 | 1199.21±12.78 | 8.99±0.050 | 8.48±0.033 | 167.25±0.783 |
| Roundhouse kick | | | | | | |
| Ours | 52.53±2.106 | 28.39±1.400 | 708.55±16.04 | 6.85±0.196 | 7.13±0.046 | 106.22±0.715 |
| Coarse | 76.81±2.863 | 38.98±2.230 | 1008.32±29.74 | 7.49±0.234 | 7.57±0.044 | 108.40±0.010 |
| Medium | 63.12±5.178 | 33.74±2.336 | 806.84±66.23 | 7.03±0.125 | 7.32±0.046 | 104.77±1.319 |
| Upperbound | 54.95±2.164 | 31.31±0.344 | 766.32±12.92 | 6.93±0.013 | 7.19±0.012 | 105.64±1.911 |
| Lowerbound | 70.10±2.674 | 36.29±1.475 | 715.01±34.01 | 7.08±0.102 | 7.32±0.067 | 102.50±4.650 |
| Bruce Lee's pose | | | | | | |
| Ours | 196.22±17.03 | 69.12±2.392 | 972.04±49.27 | 7.57±0.214 | 8.54±0.198 | 94.36±3.750 |
| Coarse | 239.06±51.74 | 80.78±15.81 | 1678.34±394.3 | 8.42±0.525 | 8.93±0.422 | 112.30±10.87 |
| Medium | 470.24±249.2 | 206.92±116.1 | 4490.80±105.1 | 9.58±0.085 | 9.61±0.080 | 99.65±2.441 |
| Upperbound | 250.64±178.6 | 93.70±65.09 | 1358.02±561.6 | 8.31±2.160 | 8.94±1.384 | 106.30±23.06 |
| Lowerbound | 158.12±2.934 | 60.54±1.54 | 955.10±37.04 | 7.05±0.040 | 7.94±0.051 | 81.60±1.277 |
  • Interpretation: This detailed table confirms that Ours (PBHC with adaptive $\sigma$) is consistently at or near the best performance (lowest error values) across metrics and motions. For 'Jabs punch', Ours reports an Eg-mpbpe of 44.38, lower than Coarse (63.95), Medium (51.07), Upperbound (45.74), and Lowerbound (48.66), and the same trend holds for 'Roundhouse kick'. For 'Charleston dance', Ours is close to the Medium and Upperbound settings, while Lowerbound collapses (Eg-mpbpe of 358.82). Conversely, for 'Bruce Lee's pose' the Lowerbound setting edges out Ours on several metrics, yet that same setting is the worst choice for 'Charleston dance'. This is precisely the point of the ablation: no single fixed tracking factor works well across motions, whereas the adaptive mechanism reaches near-optimal performance for every motion without per-motion tuning.

6.4. Real-World Deployment

This section addresses Q4: "How well does PBHC perform in real-world deployment?"

  • Methodology: Policies trained in simulation were directly deployed on the Unitree G1 robot without any fine-tuning (zero-shot sim-to-real transfer). Quantitative assessment was done by conducting 10 trials of the Tai Chi motion and computing evaluation metrics based on onboard sensor readings.

  • Findings:

    • Stable and Expressive Behaviors: The Unitree G1 robot successfully demonstrated a diverse range of highly-dynamic skills, including martial arts techniques (punches, kicks), acrobatic movements (360-degree spins), flexible motions (squats, stretches), and artistic performances (dance, Tai Chi).
    • Quantitative Alignment: The metrics obtained in the real world were closely aligned with those from the sim-to-sim platform (MuJoCo).
  • Conclusion: The policies robustly transfer from simulation to real-world deployment, maintaining high-performance control and showcasing PBHC's practical applicability.

    The following figure (Figure 8 from the original paper) shows the robot mastering highly-dynamic skills in the real world:

    Figure 8: Our robot masters highly-dynamic skills in the real world. Time flows left to right.

The image sequence displays various dynamic poses, such as horse-stance punch, stretch leg, jabs punch, Tai Chi, jump kick, Bruce Lee's pose, roundhouse kick, 360-degree spin, front kick, and Charleston dance, demonstrating the robot's versatility.

The following are the comparison of tracking performance of Tai Chi between real-world and simulation (Table 2 from the original paper):

| Platform | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- |
| MuJoCo | 33.18±2.720 | 1061.24±83.27 | 2.96±0.342 | 2.90±0.498 | 67.71±6.747 |
| Real | 36.64±2.592 | 1130.05±9.478 | 3.01±0.126 | 3.12±0.056 | 65.68±1.972 |
  • Interpretation: The table shows very similar error values between the MuJoCo simulation platform and the Real robot for the Tai Chi motion. For example, Empbpe is 33.18 ± 2.720 in MuJoCo and 36.64 ± 2.592 on the real robot, indicating a close match. This quantitative comparison strongly supports the claim of successful zero-shot sim-to-real transfer. For this comparison the robot's root is fixed to the origin, since accurate root position and velocity measurements are difficult to obtain on the real robot.

    The following figure (Figure 12 from the original paper) presents additional real-world results of the robot mastering more dynamic skills:

    Figure 12: Our robot masters more dynamic skills in the real world. Time flows left to right.

This figure further showcases the robot's capabilities in hooks punch, horse-stance pose, back kick, side kick, five stance form, fighting combo, and tap dance, reinforcing the qualitative demonstration of PBHC's effectiveness in diverse dynamic movements.

6.5. Learning Curves

  • Methodology: Learning curves for mean episode length and mean reward are presented for three representative motions: Jabs Punch, Tai Chi, and Roundhouse Kick.

  • Findings: The curves show that training gradually stabilizes and converges after approximately 20,000 steps.

  • Conclusion: This demonstrates the reliability and efficiency of the PBHC approach in learning complex motion behaviors within a reasonable training duration.

    The following figure (Figure 9 from the original paper) shows the mean episode length and mean reward across three motions:

    Figure 9: Mean episode length and mean reward across three motions. Both curves indicate that training gradually stabilizes after 20k steps.

The top row (a) shows the Mean Episode Length for the three motions, all plateauing around 20k steps. The bottom row (b) shows the Mean Reward, which also stabilizes and converges around the same training steps. This indicates that the RL agent successfully learns to maintain the reference motion for longer durations and achieves high cumulative rewards as training progresses.

6.6. Ablation Study of Contact Mask (from Appendix E.3)

  • Methodology: An ablation study was conducted to evaluate the effectiveness of the contact mask in PBHC. It compared the full Ours method with Ours w/o contact mask (without the contact mask component) for motions with distinct foot contact patterns (Charleston Dance, Jump Kick, Roundhouse Kick).

  • Metric: Mean Foot Contact Mask Error (EcontactmaskE_{\mathrm{contact-mask}}) was introduced.

  • Findings:

    • The Ours method significantly reduced foot contact errors (EcontactmaskE_{\mathrm{contact-mask}}) compared to the baseline without the contact mask.
    • This also led to noticeable improvements in other tracking metrics.
  • Conclusion: The contact-aware design is effective in improving tracking accuracy, especially for foot contacts.

    The following figure (Figure 10 from the original paper) shows the accuracy of contact mask estimation across different methods:

    Figure 10: Accuracy of contact mask estimation across different methods.

The bar chart shows that the proposed Ours method achieves 91.4% accuracy, outperforming the Height (84.2%) and Velocity (85.6%) based methods.
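As a rough illustration of the kind of estimator Figure 10 compares, a foot-contact mask can be derived by thresholding foot height and foot velocity. The sketch below is not the paper's exact estimator, and the threshold values are placeholders.

```python
import numpy as np

def estimate_foot_contact(foot_height, foot_velocity, h_thresh=0.05, v_thresh=0.3):
    """Binary contact mask from foot kinematics (illustrative thresholds).

    foot_height: (T,) height of the foot above the ground in metres.
    foot_velocity: (T,) foot speed in m/s.
    A foot is marked "in contact" when it is both low and nearly stationary.
    """
    return (foot_height < h_thresh) & (np.abs(foot_velocity) < v_thresh)
```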

The following figure (Figure 11 from the original paper) presents a visual comparison of the efficacy of the proposed motion correction technique:

Figure 11: Visualization of motion correction effectiveness in mitigating floating artifacts.

The image visually demonstrates that motion correction effectively mitigates floating artifacts. The before correction image shows the human model's feet hovering above the ground, while the after correction image shows the feet accurately placed on the ground, highlighting the importance of the correction step.

The following are the ablation results of the contact mask (Table 13 from the original paper):

| Method | Econtact-mask ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ |
| --- | --- | --- | --- | --- | --- |
| Charleston dance | | | | | |
| Ours | 217.82±47.97 | 43.09±5.748 | 886.91±74.76 | 6.83±0.346 | 7.26±0.034 |
| Ours w/o contact mask | 633.91±49.74 | 76.13±53.01 | 980.40±222.0 | 7.72±1.439 | 7.64±0.594 |
| Jump kick | | | | | |
| Ours | 294.22±6.037 | 42.58±8.126 | 840.33±97.76 | 9.48±0.717 | 10.21±10.21 |
| Ours w/o contact mask | 386.75±6.036 | 170.28±97.29 | 1259.21±423.9 | 16.92±0.012 | 16.57±5.810 |
| Roundhouse kick | | | | | |
| Ours | 243.16±1.778 | 28.39±1.400 | 708.55±16.04 | 6.85±0.196 | 7.33±0.046 |
| Ours w/o contact mask | 250.10±6.123 | 36.76±2.743 | 921.52±16.70 | 6.16±0.012 | 6.46±0.042 |
  • Interpretation: For all three motions, Ours (with contact mask) achieves significantly lower Econtact-mask than Ours w/o contact mask. For example, in 'Charleston dance', Econtact-mask drops from 633.91 to 217.82. This reduction in contact mask error also translates to lower Empbpe and Empjpe for 'Charleston dance' and 'Jump kick', indicating that accurate contact handling contributes to overall motion tracking fidelity. For 'Roundhouse kick', the gap in Econtact-mask is smaller (243.16 vs. 250.10), yet the position errors Empbpe and Empjpe remain notably lower with the contact mask, reinforcing the benefit of the contact-aware design.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces PBHC, a novel Reinforcement Learning framework for humanoid whole-body motion control that successfully enables robots to learn highly-dynamic human behaviors. Its key innovations lie in a physics-based multi-steps motion processing pipeline that ensures physical plausibility of reference motions and an adaptive motion tracking mechanism that dynamically adjusts reward tolerance during training. Experimental results demonstrate that PBHC achieves significantly lower tracking errors than existing baselines in simulation and exhibits robust zero-shot sim-to-real transfer to a Unitree G1 robot, performing complex Kungfu and dancing skills stably and expressively. The motion filtering metric efficiently identifies untrackable motions, and the adaptive reward mechanism consistently outperforms fixed-factor approaches.

7.2. Limitations & Future Work

The authors acknowledge the following limitations:

  1. Lack of Environment Awareness: The current method does not incorporate terrain perception or obstacle avoidance, which limits its deployment in unstructured real-world settings.

  2. Limited Skill Generalization: While it excels at highly-dynamic motions, the method's ability to generalize to diverse motion repertoires (i.e., a wide range of different skills) needs further exploration.

    Based on these limitations, the authors suggest future research directions:

  • Integrating environment awareness capabilities (e.g., perception of terrain, obstacles).
  • Research into maintaining high dynamic performance while simultaneously enabling broader skill generalization.

7.3. Personal Insights & Critique

This paper presents a significant step forward in humanoid motion imitation, particularly for highly-dynamic and expressive skills. The two-pronged approach of meticulous physics-based motion processing and an intelligent adaptive reward curriculum is a powerful combination.

  • Innovation of Adaptive Tracking: The bi-level optimization formulation and the EMA-driven adaptive tracking factor are particularly insightful. This moves beyond heuristics for reward shaping to a more principled, data-driven adjustment of learning difficulty. It addresses a fundamental challenge in imitation learning: how to reward partial progress on hard tasks without overly penalizing early, imperfect attempts, while still driving towards high precision.
  • Rigorous Motion Processing: The detailed motion processing pipeline, including CoM-CoP stability filtering and contact-aware correction, is crucial. Many RL approaches often overlook the quality of reference data, assuming it's perfect. This paper demonstrates the immense value of cleaning and validating input data against physical constraints.
  • Real-World Validation: The successful zero-shot sim-to-real transfer on the Unitree G1 robot with complex motions is compelling. This showcases the robustness of domain randomization and the efficacy of the proposed control framework.

Potential Issues/Areas for Improvement:

  • Computational Cost: Training RL policies, especially for complex humanoid robots, is computationally intensive. The 27-hour training time per policy on a powerful GPU suggests that scaling to an even broader range of skills or more complex environments might require significant computational resources. The bi-level optimization itself, while theoretically sound, adds a layer of complexity.
  • Generalization to Novel Motions: While the adaptive mechanism helps with different difficulty levels within a known set of motions, the question of truly novel, unseen motion types remains. How well would the adaptive factor generalize to motions structurally very different from the training data?
  • Reactive Behavior: The current framework focuses on imitating pre-defined reference motions. For real-world deployment, robots often need to react dynamically to unexpected external forces or changes in the environment, which is not directly addressed by motion imitation alone. The lack of environment awareness is a significant limitation for practical applications in unstructured settings.
  • Hyperparameter Sensitivity: Even with adaptive mechanisms, the initial tracking factor and the curriculum learning parameters (e.g., decay rates for θ\theta and α\alpha) might still require careful tuning and could influence the final performance.

Transferability: The principles of adaptive reward shaping based on online performance metrics and physics-informed data preprocessing are highly transferable.

  • Other Robotic Systems: This approach could be applied to other complex robotic systems (e.g., quadrupeds, manipulators) learning dynamic tasks.

  • Skill Learning: The adaptive curriculum concept could generalize to other RL skill learning problems where tasks have varying difficulties or require progressive precision.

  • Human-Robot Collaboration: The ability to accurately imitate human movements opens doors for more intuitive and effective human-robot collaboration and physical assistance.

    Overall, KungfuBot provides a robust and innovative framework that pushes the boundaries of humanoid whole-body control, particularly for dynamic and expressive human skills, laying important groundwork for future humanoid intelligence and dexterity.
