KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills
TL;DR Summary
This paper presents KungfuBot, a physics-based humanoid control framework that learns highly-dynamic human behaviors such as Kungfu and dance through multi-step motion processing and adaptive motion tracking, achieving significantly lower tracking errors than existing approaches and successful deployment on a real robot.
Abstract
Humanoid robots show promise in acquiring various skills by imitating human behaviors. However, existing algorithms are only capable of tracking smooth, low-speed human motions, even with delicate reward and curriculum design. This paper presents a physics-based humanoid control framework, aiming to master highly-dynamic human behaviors such as Kungfu and dancing through multi-step motion processing and adaptive motion tracking. For motion processing, we design a pipeline to extract, filter out, correct, and retarget motions, while ensuring compliance with physical constraints to the maximum extent. For motion imitation, we formulate a bi-level optimization problem to dynamically adjust the tracking accuracy tolerance based on the current tracking error, creating an adaptive curriculum mechanism. We further construct an asymmetric actor-critic framework for policy training. In experiments, we train whole-body control policies to imitate a set of highly-dynamic motions. Our method achieves significantly lower tracking errors than existing approaches and is successfully deployed on the Unitree G1 robot, demonstrating stable and expressive behaviors. The project page is https://kungfu-bot.github.io.
In-depth Reading
1. Bibliographic Information
1.1. Title
KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills
1.2. Authors
Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, Xuelong Li
- Affiliations:
- Institute of Artificial Intelligence (TeleAI), China Telecom (Weiji Xie, Jinrui Han, Chenjia Bai, Xuelong Li, Jiyuan Shi, Huanyu Li, Xinzhe Liu, Jiakun Zheng)
- Shanghai Jiao Tong University (Weiji Xie, Jinrui Han, Weinan Zhang)
- East China University of Science and Technology (Jiakun Zheng)
- Harbin Institute of Technology (Huanyu Li)
- ShanghaiTech University (Xinzhe Liu)
- Research Backgrounds: The authors come from prominent academic institutions and a corporate AI institute, indicating a strong background in artificial intelligence, robotics, and potentially telecommunications, with a focus on areas like reinforcement learning, motion control, and humanoid robotics.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. Given its scope and rigor, it is likely targeting a top-tier venue in robotics or machine learning; the included NeurIPS Paper Checklist suggests a submission to NeurIPS, a highly reputable machine learning conference.
1.4. Publication Year
2025 (first posted to arXiv on 2025-06-15).
1.5. Abstract
This paper introduces KungfuBot, a physics-based humanoid control framework designed to enable humanoid robots to master highly-dynamic human behaviors, such as Kungfu and dancing. Current algorithms struggle with such dynamic motions, typically only tracking smooth, low-speed movements. KungfuBot addresses this through a two-stage approach: a multi-step motion processing pipeline and adaptive motion tracking. The motion processing pipeline extracts, filters, corrects, and retargets human motions, ensuring physical constraint compliance. For motion imitation, it formulates a bi-level optimization problem to dynamically adjust a tracking accuracy tolerance (the tracking factor) based on the current tracking error, creating an adaptive curriculum. An asymmetric actor-critic framework is used for policy training. Experimental results show that KungfuBot significantly reduces tracking errors compared to existing methods and demonstrates stable and expressive behaviors when deployed on a Unitree G1 robot, including complex Kungfu and dancing movements.
1.6. Original Source Link
https://arxiv.org/abs/2506.12851
- Publication Status: Preprint (on arXiv).
2. Executive Summary
2.1. Background & Motivation
- Core Problem: Humanoid robots face significant challenges in accurately and stably imitating highly-dynamic human behaviors like Kungfu and dancing. Existing Reinforcement Learning (RL)-based whole-body control algorithms are generally limited to tracking smooth, low-speed human motions, even with sophisticated reward and curriculum design.
- Importance: Enabling humanoid robots to mimic diverse and dynamic human skills is crucial for their potential applications in various tasks, from physical assistance and rehabilitation to entertainment and education. The human-like morphology of these robots makes them ideal candidates for human-robot interaction (HRI) and tasks requiring human-like dexterity and movement.
- Challenges/Gaps:
  - Physical Feasibility: Human motion capture (MoCap) data often contains movements that are physically impossible or unsafe for a robot due to differences in kinematics, dynamics, and joint limits. Directly using this data for RL training can lead to policies that are not feasible.
  - Tracking Accuracy for Dynamics: Existing methods lack robust mechanisms to handle the inherent difficulty and variability of highly-dynamic motions, often leading to poor tracking performance or instability.
  - Sim-to-Real Transfer: Bridging the gap between simulated training environments and real-world robot deployment remains a significant hurdle.
- Paper's Entry Point/Innovative Idea: The paper proposes a comprehensive physics-based control framework that integrates multi-step motion processing to ensure physical plausibility of reference motions and an adaptive motion tracking mechanism to dynamically adjust the reward tolerance for varying motion difficulties, enabling the learning of highly-dynamic skills.
2.2. Main Contributions / Findings
- Primary Contributions:
  - Physics-Based Motion Processing Pipeline: Design and implementation of a pipeline to extract, filter, correct, and retarget human motions from videos. This pipeline ensures that reference motions comply with the robot's physical constraints, and includes novel physics-based metrics for filtering and contact-aware motion correction.
  - Adaptive Motion Tracking Mechanism: Formulation of a bi-level optimization problem to derive an optimal tracking factor ($\sigma^*$) and development of an adaptive mechanism that dynamically adjusts this factor during RL training based on tracking error. This creates an adaptive curriculum that tightens the tracking accuracy tolerance as the policy improves.
  - Asymmetric Actor-Critic Framework: Construction of an asymmetric actor-critic architecture that utilizes reward vectorization and privileged information for the critic to enhance value estimation, while the actor relies on local observations.
  - Demonstrated Real-World Capabilities: Successful deployment of learned policies on a Unitree G1 robot, showcasing stable and expressive execution of complex, highly-dynamic skills like Kungfu and dancing.
- Key Conclusions/Findings:
  - The physics-based motion filtering effectively identifies and excludes untrackable motions from the dataset, leading to more efficient and effective policy learning.
  - The proposed method (PBHC) significantly outperforms existing approaches (e.g., OmniH2O, ExBody2) in simulation across various tracking error metrics for easy, medium, and hard motions.
  - The adaptive motion tracking mechanism is crucial for achieving superior tracking precision, dynamically adjusting to motion characteristics where fixed tracking factors would lead to suboptimal performance.
  - The trained policies exhibit robust sim-to-real transfer, allowing for direct deployment on the Unitree G1 robot with performance metrics closely aligned with simulation results.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Humanoid Robots: Robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Their human-like morphology enables them to interact with human-designed environments and perform tasks that humans do. The Unitree G1 is an example of such a robot.
- Degrees of Freedom (DoFs): In robotics, the number of independent parameters that define the configuration of a mechanical system. For a humanoid robot, this includes translational DoFs for the base (e.g., x, y, z position) and rotational DoFs for the base and each joint (e.g., pitch, roll, yaw). The Unitree G1 has 23 DoFs for control, excluding its hands.
- Whole-Body Control (WBC): A control strategy that coordinates all DoFs of a robot (base, torso, arms, and legs) simultaneously to achieve complex tasks while respecting physical constraints (e.g., joint limits, torque limits, balance, contact forces).
- Motion Capture (MoCap): Technology used to digitally record the movement of people or objects. MoCap systems produce motion data that can be used to animate digital models or serve as reference motions for robots to imitate.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. The agent learns a policy, a mapping from observed states to actions.
  - Markov Decision Process (MDP): A mathematical framework for modeling decision-making in RL. An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$, where:
    - $\mathcal{S}$ is the set of possible states.
    - $\mathcal{A}$ is the set of possible actions.
    - $\mathcal{R}$ is the reward function, defining the immediate reward an agent receives for transitioning from one state to another.
    - $\mathcal{P}$ is the transition function (or dynamics), defining the probability of moving to a new state given a current state and action.
    - $\gamma$ is the discount factor, which determines the present value of future rewards.
  - Actor-Critic Methods: A class of RL algorithms that combine two components: an actor (which learns the policy directly) and a critic (which estimates the value function to evaluate the actor's actions).
    - Asymmetric Actor-Critic: A variation where the actor and critic have different observation spaces. Typically, the critic has access to privileged information (e.g., environmental parameters, internal states not directly observable by the robot) to learn a better value function, which then guides the actor (who only sees proprioceptive and exteroceptive observations) in learning a robust policy. This helps in sim-to-real transfer.
  - Proximal Policy Optimization (PPO): A popular on-policy RL algorithm that aims to improve sample efficiency and stability compared to earlier methods. It updates the policy in multiple small steps, ensuring that the new policy does not deviate too much from the old one, which helps prevent destructive updates.
- Skinned Multi-Person Linear (SMPL) Model: A widely used statistical 3D human body model that represents body shape and pose with a low-dimensional parameter space.
  - Parameters: SMPL uses:
    - $\beta$ for body shapes (e.g., height, weight, build).
    - $\theta$ for joint rotations (represented as axis-angle or rotation matrices).
    - $t$ for global translation (position in 3D space).
  - Mapping to Mesh: These parameters are mapped to a 3D mesh (a collection of 3D vertices) via a differentiable skinning function, producing the mesh $M(\beta, \theta, t)$.
- Inverse Kinematics (IK): A method used in robotics and computer graphics to determine the joint parameters (angles) required to achieve a desired position and orientation for an end-effector (e.g., hand, foot). Differential IK uses derivatives to solve this problem iteratively.
- Center of Mass (CoM) and Center of Pressure (CoP):
  - CoM: The unique point where the weighted relative positions of the distributed mass sum to zero; the average position of all the mass that makes up the object.
  - CoP: The point on the ground where the total ground reaction force acts. For stable standing or walking, the CoM projection should ideally stay within the base of support (the area enclosed by the foot-ground contact points); the CoP always lies within this region, and the proximity of CoM and CoP indicates stability.
- Exponential Moving Average (EMA): A type of moving average that places greater weight and significance on the most recent data points. It is often used to smooth noisy data or to maintain a dynamically updating estimate of a parameter.
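As a concrete form of this update (using generic symbols, not the paper's notation), with smoothing coefficient $\alpha \in (0, 1]$:

$$\hat{x}_t = \alpha \, x_t + (1 - \alpha) \, \hat{x}_{t-1}$$

A larger $\alpha$ weights recent samples more heavily, while a smaller $\alpha$ yields smoother, slower-moving estimates.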
3.2. Previous Works
The paper contextualizes its work by discussing limitations of previous RL-based whole-body control frameworks for motion tracking, particularly their inability to handle highly-dynamic motions.
- H2O and OmniH2O [9, 10]: These methods attempt to address physical feasibility issues by removing infeasible motions from datasets using a trained privileged imitation policy. While they contribute to cleaning motion data, they often still struggle with highly dynamic movements due to inherent tracking difficulty and a lack of suitable tolerance mechanisms. OmniH2O specifically focuses on universal and dexterous human-to-humanoid whole-body teleoperation and learning.
- ExBody [7] and ExBody2 [5]: ExBody constructs a feasible motion dataset by filtering via language labels. ExBody2 trains an initial policy on all motions and uses the tracking error to measure motion difficulty, aiming to optimize the dataset. However, this process can be costly, may not find an optimal dataset, and also lacks tolerance mechanisms for difficult motions.
- ASAP [6]: Aligning Simulation and Real-world Physics for Learning Agile Humanoid Whole-body Skills. This method introduces a multi-stage mechanism and learns a residual policy to compensate for the sim-to-real gap, which helps in tracking agile motions. Unlike ASAP, which focuses on the sim-to-real gap, KungfuBot focuses on improving motion feasibility and agility entirely within simulation.
- MaskedMimic [21, 23]: Unified Physics-based Character Control through Masked Motion Inpainting. This method focuses on character animation and directly optimizes pose-level accuracy without explicit regularization of physical plausibility. While it performs well in character-animation contexts, it is not directly deployable for robot control because it does not account for robot-specific constraints like partial observability and action smoothness. The paper uses MaskedMimic as a baseline for comparison but notes its limitations for direct robot deployment.
3.3. Technological Evolution
Early humanoid motion imitation efforts often relied on direct kinematic mapping or Inverse Kinematics (IK) controllers, which struggled with dynamic stability and physical constraints. The advent of Reinforcement Learning (RL) brought about physics-based controllers that could learn stable locomotion and basic movements (e.g., DeepMimic [21]). However, directly imitating complex human motions remained challenging due to the sim-to-real gap and the physical differences between humans and robots. Subsequent works focused on motion filtering (H2O, ExBody) to make human MoCap data more robot-feasible. More recently, approaches like ASAP have tackled agile motions by addressing the sim-to-real gap through residual policies.
KungfuBot represents an evolution by refining both the data preparation (physics-based processing) and the RL training loop (adaptive reward tolerance), allowing for the imitation of more extreme and dynamic motions that previous methods struggled with. It moves beyond simply filtering data to actively adapt the learning process to the inherent difficulty of dynamic movements.
3.4. Differentiation Analysis
KungfuBot differentiates itself from previous methods primarily in two key areas:

- Comprehensive Physics-Based Motion Processing:
  - Prior Work: Methods like H2O and ExBody primarily focus on filtering infeasible motions or constructing datasets based on labels. ASAP addresses the sim-to-real gap for agile motions.
  - KungfuBot's Innovation: KungfuBot introduces a more holistic multi-step pipeline that not only filters (physics-based motion filtering using CoM-CoP stability criteria) but also corrects (contact-aware motion correction with EMA smoothing) and retargets motions. This ensures a higher degree of physical plausibility and quality for the reference motions before RL training, directly tackling out-of-distribution issues from HMR models and floating artifacts. This processing is done entirely in simulation, contrasting with ASAP's focus on sim-to-real compensation.
- Adaptive Motion Tracking with Optimal Factor Derivation:
  - Prior Work: Most RL-based imitation learning methods use fixed reward functions and tracking factors (e.g., OmniH2O, ExBody2). While ExBody2 attempts to measure motion difficulty, it lacks a dynamic tolerance mechanism. The MaskedMimic approach, while effective for character animation, does not prioritize physical plausibility for robot control and thus uses different optimization objectives.
  - KungfuBot's Innovation: KungfuBot introduces a novel adaptive mechanism for dynamically adjusting the tracking factor ($\sigma$) within the exponential reward function. This is grounded in a bi-level optimization problem that theoretically derives the optimal tracking factor as the average of the optimal tracking errors. By using an Exponential Moving Average (EMA) to estimate this error online and iteratively tightening $\sigma$, KungfuBot creates an adaptive curriculum that allows the policy to progressively improve its precision for motions of varying difficulty. This mechanism is a significant departure from fixed reward weighting or tolerance parameters, enabling superior performance on highly-dynamic skills.

In summary, KungfuBot combines sophisticated physics-based motion preprocessing with an intelligent, adaptive RL reward mechanism, explicitly designed to overcome the limitations of prior work in handling the physical and dynamic complexities of highly-dynamic human behaviors for humanoid robots.
4. Methodology
The Physics-Based Humanoid motion Control (PBHC) framework is designed to enable humanoid robots to master highly-dynamic human behaviors. It operates through a two-stage process: motion processing and motion imitation.
The following figure (Figure 1 from the original paper) provides an overview of PBHC:
Figure 1: Overview of the three core components of PBHC: (a) multi-step motion processing, including motion extraction from videos, reference trajectories, and contact-mask generation; (b) adaptive motion tracking, with dynamic adjustment based on the optimal tracking factor; (c) the RL training framework, showing the observation-to-action training process and deployment.
As illustrated, the process begins with raw human videos (a). These videos undergo Human Motion Recovery (HMR) to produce SMPL-format motion sequences. These sequences are then refined through physics-based motion filtering and contact-aware motion correction, ensuring physical plausibility. The refined motions are then retargeted to the G1 robot to serve as reference motions. In the second stage (b), an adaptive motion tracking mechanism dynamically adjusts the tracking factor based on an optimal tracking factor derived from a bi-level optimization problem. Finally, the policies are trained using an RL training framework (c) and deployed on the real Unitree G1 robot.
4.1. Principles
The core idea behind PBHC is to systematically address the challenges of dynamic motion imitation in humanoid robots by:
- Ensuring Physical Feasibility: Pre-processing human MoCap data to guarantee that the reference motions are physically executable by the robot, considering its kinematics, dynamics, and contact interactions.
- Adaptive Learning: Dynamically adjusting the reward tolerance during Reinforcement Learning based on the agent's current performance and the inherent difficulty of the motion. This allows the agent to gradually refine its tracking precision without being prematurely penalized for small errors in complex movements, effectively creating an adaptive curriculum.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Motion Processing Pipeline
This pipeline converts raw human video into physically plausible and robot-executable reference motions for the G1 robot.
4.2.1.1. Motion Estimation from Videos
The process begins by extracting human motion from monocular videos.
- Model Used: GVHMR [15] (Gravity-View Human Motion Recovery) is employed.
- Key Features of GVHMR:
  - It estimates SMPL-format motions.
  - It introduces a gravity-view coordinate system, which naturally aligns motions with gravity, addressing body-tilt issues that can arise from camera-centric reconstructions.
  - It mitigates foot-sliding artifacts by predicting non-stationary probabilities, improving motion quality.
- Output: SMPL parameters $(\beta, \theta, t)$, where $\beta$ represents body shapes, $\theta$ represents joint axis-angle rotations, and $t$ represents global translation. These parameters can be mapped to a 3D mesh via a differentiable skinning function.
4.2.1.2. Physics-based Motion Filtering
Motions extracted by HMR models can still contain physical and biomechanical constraint violations due to reconstruction inaccuracies and out-of-distribution issues. This step filters out such motions.
- Stability Criterion: The stability of a motion frame is assessed based on the proximity of the Center of Mass (CoM) and Center of Pressure (CoP).
  - Let $p_t^{\mathrm{CoM}}$ and $p_t^{\mathrm{CoP}}$ be the projected coordinates of the CoM and CoP on the ground at frame $t$, respectively.
  - The stability criterion for a frame is defined as:
    $$d_t = \left\| p_t^{\mathrm{CoM}} - p_t^{\mathrm{CoP}} \right\|_2 < \epsilon \tag{1}$$
    - $d_t$: the Euclidean distance between the 2D ground projections of the CoM and CoP at time $t$.
    - $\|\cdot\|_2$: the Euclidean norm (distance).
    - $\epsilon$: a stability threshold, empirically chosen (e.g., 0.1 as per Table 3). If the distance is below this threshold, the frame is considered stable.
- Motion Sequence Stability: For an $N$-frame motion sequence, let $\{s_1, \dots, s_K\}$ be the increasingly sorted list of frame indices that satisfy the stability criterion (Eq. 1). A motion sequence is considered stable if it meets two conditions:
  - Boundary-frame stability: the first frame ($s_1 = 1$) and the last frame ($s_K = N$) must be stable.
  - Maximum instability gap: the maximum length of consecutive unstable frames must be less than a threshold $L$, i.e., $\max_i (s_{i+1} - s_i) < L$ (e.g., $L = 100$ as per Table 3).
- Benefit: This filtering effectively removes motions that are inherently untrackable or dynamically unstable for a robot. A minimal code sketch of this check follows.
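The sketch below assumes the CoM and CoP ground projections have already been computed per frame (which itself requires the mass distribution and contact forces); thresholds follow Table 3, and the function names are our own:

```python
import numpy as np

def frame_is_stable(com_xy, cop_xy, eps=0.1):
    """Stability criterion (Eq. 1): CoM and CoP ground projections are close."""
    return np.linalg.norm(np.asarray(com_xy) - np.asarray(cop_xy)) < eps

def sequence_is_stable(com_xy_seq, cop_xy_seq, eps=0.1, max_gap=100):
    """Accept a motion if both boundary frames are stable and no run of
    consecutive unstable frames exceeds max_gap."""
    n = len(com_xy_seq)
    stable = [t for t in range(n)
              if frame_is_stable(com_xy_seq[t], cop_xy_seq[t], eps)]
    if not stable or stable[0] != 0 or stable[-1] != n - 1:
        return False  # boundary-frame stability violated
    if len(stable) < 2:
        return True
    unstable_runs = np.diff(stable) - 1  # unstable frames between stable ones
    return int(unstable_runs.max()) < max_gap
```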
4.2.1.3. Motion Correction based on Contact Mask
This step refines motions by explicitly accounting for foot-ground contact.
- Contact Mask Estimation: Contact masks are estimated by analyzing ankle displacement across consecutive frames, based on the zero-velocity assumption for feet in contact.
  - Let $p_t^{\mathrm{la}}$ be the position of the left ankle at time $t$, and $c_t^{\mathrm{left}}$ its corresponding contact mask (1 for contact, 0 for no contact).
  - The contact mask is estimated as:
    $$c_t^{\mathrm{left}} = \mathbb{I}\!\left[ \left\| p_{t+1}^{\mathrm{la}} - p_t^{\mathrm{la}} \right\|^2 < \epsilon_v \right] \cdot \mathbb{I}\!\left[ z_t^{\mathrm{la}} < \epsilon_h \right]$$
    - $\mathbb{I}[\cdot]$: the indicator function, which equals 1 if the condition inside is true and 0 otherwise.
    - $\| p_{t+1}^{\mathrm{la}} - p_t^{\mathrm{la}} \|^2$: the squared Euclidean displacement of the left ankle between frames $t$ and $t+1$, used to check for near-zero velocity.
    - $\epsilon_v$: an empirically chosen velocity threshold (e.g., 0.002 as per Table 3); displacement below this is treated as zero velocity.
    - $z_t^{\mathrm{la}}$: the vertical ($z$) coordinate of the left ankle at time $t$.
    - $\epsilon_h$: an empirically chosen height threshold (e.g., 0.2 as per Table 3), ensuring the foot is near the ground.
  - A similar process is applied for the right foot.
- Correction Step: To address minor floating artifacts (where the feet appear to hover above the ground), a vertical offset is applied to the global translation if a foot is in contact.
  - Let $z_t$ denote the vertical component of the global translation of the pose at time $t$. The corrected vertical position is:
    $$z_t' = z_t - \min_v z_{t,v}^{\mathrm{mesh}}$$
    - $z_t'$: the corrected vertical component of the global translation at time $t$.
    - $z_t$: the original vertical component of the global translation at time $t$.
    - $\min_v z_{t,v}^{\mathrm{mesh}}$: the lowest $z$-coordinate among all SMPL mesh vertices at frame $t$; subtracting it brings the lowest point of the SMPL mesh to the ground.
- Smoothing: This correction can introduce frame-to-frame jitter. To counter this, an Exponential Moving Average (EMA) is applied to smooth the motion, as in the sketch below.
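A minimal sketch of the contact-mask estimation and height correction (thresholds from Table 3; the EMA coefficient, array layout, and function names are our own assumptions):

```python
import numpy as np

def estimate_contact_mask(ankle_pos, eps_v=0.002, eps_h=0.2):
    """Binary contact mask from the zero-velocity assumption.
    ankle_pos: (T, 3) positions of one ankle; z is the vertical axis."""
    disp_sq = np.sum(np.diff(ankle_pos, axis=0) ** 2, axis=-1)   # (T-1,)
    mask = (disp_sq < eps_v) & (ankle_pos[:-1, 2] < eps_h)
    return np.append(mask, mask[-1]).astype(int)  # repeat last frame's label

def correct_and_smooth_height(root_z, mesh_min_z, contact, alpha=0.9):
    """Ground the lowest mesh vertex on contact frames, then EMA-smooth
    to suppress the frame-to-frame jitter the correction can introduce."""
    z = np.where(contact == 1, root_z - mesh_min_z, root_z).astype(float)
    out = np.empty_like(z)
    out[0] = z[0]
    for t in range(1, len(z)):
        out[t] = alpha * z[t] + (1 - alpha) * out[t - 1]
    return out
```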
4.2.1.4. Motion Retargeting
The processed SMPL-format motions are then adapted to the target robot's kinematics.
- Method: An Inverse Kinematics (IK)-based method [19] is used.
- Process: A differentiable optimization problem is formulated that aligns the end-effector trajectories of the SMPL model with the robot's end-effectors while respecting the robot's joint limits.
- Data Augmentation: To enhance diversity, additional motions from open-source datasets such as AMASS [4] and LAFAN [20] are also processed through this pipeline.
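To illustrate the differentiable-IK idea (this is a toy sketch, not the paper's actual retargeting code), the example below fits a 2-link planar arm to an end-effector target by gradient descent, with joint limits enforced through a soft penalty; all names and constants are illustrative:

```python
import torch

def fk_two_link(q, lengths=(0.4, 0.4)):
    """Toy forward kinematics: end-effector position of a 2-link planar arm."""
    x = lengths[0] * torch.cos(q[0]) + lengths[1] * torch.cos(q[0] + q[1])
    y = lengths[0] * torch.sin(q[0]) + lengths[1] * torch.sin(q[0] + q[1])
    return torch.stack([x, y])

def retarget_frame(target_xy, q_min, q_max, steps=200, lr=0.1):
    """Differentiable-IK sketch: match an end-effector target subject to
    joint limits, enforced here by a quadratic penalty."""
    q = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([q], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ee_err = torch.sum((fk_two_link(q) - target_xy) ** 2)
        limit_pen = torch.sum(torch.relu(q - q_max) ** 2
                              + torch.relu(q_min - q) ** 2)
        (ee_err + 10.0 * limit_pen).backward()
        opt.step()
    return q.detach()

q = retarget_frame(torch.tensor([0.5, 0.3]),
                   q_min=torch.tensor([-3.0, 0.0]),
                   q_max=torch.tensor([3.0, 2.5]))
```

The real pipeline solves the analogous problem over the robot's full kinematic chain and whole trajectories rather than a single 2-DoF frame.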
4.2.2. Adaptive Motion Tracking
This mechanism dynamically adjusts how strictly the robot must adhere to the reference motion during RL training.
4.2.2.1. Exponential Form Tracking Reward
The reward function in PBHC consists of two parts: task-specific rewards (for motion tracking) and regularization rewards (for stability and smoothness).

- Task-Specific Rewards: These include terms for aligning joint states, rigid-body states, and foot contact masks.
- Exponential Form: Most task-specific rewards (except foot contact tracking) follow an exponential form:
  $$r(x) = \exp(-x / \sigma)$$
  - $r(x)$: the reward value for a given tracking error.
  - $x$: the tracking error, typically measured as a Mean Squared Error (MSE) of quantities like joint angles. A lower $x$ means better tracking.
  - $\sigma$: the tracking factor, which controls the tolerance of the error. A larger $\sigma$ means more tolerance (rewards remain high even for larger errors), while a smaller $\sigma$ means less tolerance (rewards drop sharply with increasing error).
- Why the Exponential Form? It is preferred over a simple negative error because it is bounded (the maximum reward is 1), helps stabilize training, and offers an intuitive way to adjust reward weighting via $\sigma$.

The following figure (Figure 2 from the original paper) illustrates the effect of the tracking factor $\sigma$ on the reward value:

Figure 2: The effect of the tracking factor $\sigma$ on the reward value. The horizontal axis is the tracking error $x$ and the vertical axis the reward $r(x)$; curves are shown for three values of $\sigma$ (red: smallest, green: intermediate, blue: largest).

As seen in the graph, for a fixed tracking error $x$, a larger $\sigma$ (blue curve) yields a higher reward, indicating more tolerance. A smaller $\sigma$ (red curve) causes the reward to drop more sharply for the same error, requiring higher precision.
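A minimal sketch of this reward, $r(x) = \exp(-x/\sigma)$ (the sample $\sigma$ values below are illustrative, not those used in Figure 2):

```python
import numpy as np

def tracking_reward(error: float, sigma: float) -> float:
    """Exponential tracking reward: bounded in (0, 1], with sigma as tolerance."""
    return float(np.exp(-error / sigma))

# Same error, different tolerance: a larger sigma is more forgiving.
for sigma in (0.1, 0.5, 1.0):
    print(sigma, tracking_reward(0.2, sigma))
```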
4.2.2.2. Optimal Tracking Factor
To determine the ideal $\sigma$, the problem of motion tracking is modeled as a bi-level optimization (BLO) problem [11].

- Intuition: The goal is to find a $\sigma$ that leads to the minimum accumulated tracking error from the converged policy. This mimics how a human engineer might iteratively tune $\sigma$ and observe results.
- Problem Formulation:
  - Let $\pi$ be a policy and $x(\sigma) = (x_1, \dots, x_T)$ be the sequence of expected tracking errors over $T$ steps of an episode rollout.
  - The inner (lower-level) optimization represents the RL training procedure: given a fixed $\sigma$, the policy aims to maximize its accumulated reward:
    $$x^*(\sigma) = \arg\max_x \; G(x, \sigma), \qquad G(x, \sigma) = \sum_{t=1}^{T} \exp(-x_t / \sigma) + J(x)$$
    - $\sum_t \exp(-x_t/\sigma)$: the simplified accumulated reward from the exponential tracking term.
    - $J(x)$: represents other reward components and environment dynamics, including regularization rewards and external policy objectives.
  - The outer (upper-level) optimization then selects the optimal $\sigma$ to minimize the total tracking error of the final converged policy:
    $$\max_{\sigma} \; F\big(x^*(\sigma)\big), \qquad F(x) = -\sum_{t=1}^{T} x_t$$
    - $F(x)$: the negative of the accumulated tracking error of the optimal error sequence from the inner problem; maximizing it is equivalent to minimizing the accumulated tracking error.
- Derivation of the Optimal $\sigma$ (from Appendix A): Assuming $J(x)$ is linear, $F$ and $G$ are twice continuously differentiable, and the lower-level problem has a unique solution $x^*(\sigma)$, an implicit gradient approach can be used.
  - Gradient of $F$ w.r.t. $\sigma$:
    $$\frac{dF}{d\sigma} = \left( \frac{d x^*(\sigma)}{d\sigma} \right)^{\!\top} \nabla_x F\big(x^*(\sigma)\big)$$
    - $dF/d\sigma$: the total derivative of the external objective with respect to $\sigma$.
    - $(dx^*/d\sigma)^\top$: the transpose of the derivative of the optimal error sequence with respect to $\sigma$.
    - $\nabla_x F(x^*)$: the gradient of the external objective with respect to the error sequence, evaluated at the optimal error sequence.
  - Using the KKT condition for the lower-level problem: since $x^*(\sigma)$ is the solution to the inner maximization problem, its gradient must be zero:
    $$\nabla_x G\big(x^*(\sigma), \sigma\big) = 0$$
  - Taking the first-order derivative of the KKT condition w.r.t. $\sigma$:
    $$\nabla^2_{x\sigma} G + \nabla^2_{xx} G \, \frac{d x^*(\sigma)}{d\sigma} = 0$$
    This equation relates the mixed partial derivatives of $G$ to the derivative of $x^*$ with respect to $\sigma$.
  - Solving for $dx^*/d\sigma$:
    $$\frac{d x^*(\sigma)}{d\sigma} = -\left( \nabla^2_{xx} G \right)^{-1} \nabla^2_{x\sigma} G$$
    - $\nabla^2_{x\sigma} G$: the mixed second-order partial derivative (Hessian) of $G$ with respect to $x$ and $\sigma$.
    - $\nabla^2_{xx} G$: the second-order partial derivative (Hessian) of $G$ with respect to $x$.
  - Substituting back into the gradient of $F$:
    $$\frac{dF}{d\sigma} = -\left( \nabla^2_{x\sigma} G \right)^{\!\top} \left( \nabla^2_{xx} G \right)^{-1} \nabla_x F\big(x^*(\sigma)\big)$$
  - Explicit forms of $F$ and $G$: with $F(x) = -\mathbf{1}^\top x$ and $G(x, \sigma) = \sum_t \exp(-x_t/\sigma) + J(x)$, we have $\nabla_x F = -\mathbf{1}$.
  - Compute first- and second-order gradients:
    $$\nabla^2_{xx} G = \frac{1}{\sigma^2} \operatorname{diag}\!\big( e^{-x/\sigma} \big), \qquad \nabla^2_{x\sigma} G = \frac{1}{\sigma^2} \, e^{-x/\sigma} \odot \left( \mathbf{1} - \frac{x}{\sigma} \right)$$
    - $e^{-x/\sigma}$: an element-wise exponential function applied to the vector $x$.
    - $1/\sigma^2$: a scalar multiplier.
    - $\mathbf{1}$: a vector of ones.
    - $\odot$: element-wise multiplication.
    - $\operatorname{diag}(\cdot)$: creates a diagonal matrix from a vector.
  - Setting the gradient to zero ($dF/d\sigma = 0$) and simplifying: substituting the expressions above yields $\frac{dF}{d\sigma} = T - \frac{1}{\sigma} \sum_{t=1}^{T} x^*_t$, which leads to the conclusion that the optimal tracking factor is the average of the optimal tracking errors:
    $$\sigma^* = \frac{1}{T} \sum_{t=1}^{T} x^*_t$$
    - $\sigma^*$: the optimal tracking factor.
    - $T$: the number of steps in the motion sequence.
    - $x^*_t$: the optimal tracking error at the $t$-th step.
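The conclusion $\sigma^* = \frac{1}{T}\sum_t x^*_t$ can be sanity-checked numerically. The sketch below assumes a toy linear term $J(x) = c^\top x$ with illustrative coefficients $c$, so the inner problem's stationarity condition has the closed form $x^*_t(\sigma) = -\sigma \ln(\sigma c_t)$; solving $dF/d\sigma = 0$ then recovers a $\sigma^*$ equal to the average optimal error:

```python
import numpy as np

# Toy numerical check of sigma* = mean(x*), assuming a linear J(x) = c^T x
# with illustrative coefficients c (chosen so that 0 < sigma*c_t < 1).
c = np.array([0.9, 0.6, 0.3])
T = len(c)

def x_star(sigma):
    # Stationarity of the inner problem: -(1/sigma)*exp(-x_t/sigma) + c_t = 0
    return -sigma * np.log(sigma * c)

# dF/dsigma = T + sum(log(sigma * c_t)) = 0 has the closed-form root:
sigma_star = np.exp(-(T + np.log(c).sum()) / T)
print(sigma_star, x_star(sigma_star).mean())  # both print ~0.675
```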
4.2.2.3. Adaptive Mechanism
Since $\sigma^*$ and $x^*$ are inter-dependent, direct computation of $\sigma^*$ is not possible. Also, a single fixed $\sigma$ is impractical for diverse motions. An adaptive mechanism is therefore designed to dynamically adjust $\sigma$ during training.

- Online Estimation: An Exponential Moving Average (EMA) $\hat{x}$ of the instantaneous tracking error is maintained over environment steps. This serves as an online estimate of the expected tracking error for the current policy.
- Feedback Loop: During training, PBHC updates $\sigma$ to the current value of $\hat{x}$. This creates a closed-loop feedback system: as the tracking error decreases, $\hat{x}$ decreases, which in turn leads to a tightening of $\sigma$. This process drives further policy refinement.
- Update Rule: To ensure stability, $\sigma$ is constrained to be non-increasing and is initialized with a relatively large value:
  $$\sigma \leftarrow \min(\sigma, \hat{x})$$
  - $\sigma$: the tracking factor that is being adapted.
  - $\hat{x}$: the Exponential Moving Average (EMA) of the instantaneous tracking error.
  - $\min(\cdot, \cdot)$: the minimum function, ensuring $\sigma$ only decreases or stays the same.
- Benefit: This adaptive process allows the policy to progressively improve its tracking precision over time, as illustrated in the following figure (Figure 4 from the original paper).

Figure 4: Hand position along one axis over time for the 'Horse-stance punch' motion, annotated with the 'Horse Stance' and 'Quick Punch' phases. Curves at different training steps show tracking precision improving as the adaptive $\sigma$ tightens.

This graph shows the hand trajectory converging toward the reference as training proceeds under the adaptive $\sigma$; a fixed $\sigma$ might not allow for this progressive improvement.
The following figure (Figure 3 from the original paper) depicts the closed-loop adjustment of the tracking factor in the adaptive mechanism:
Figure 3: The closed-loop adjustment of the tracking factor in the adaptive mechanism, comprising four key parts: the reward shape $r(x)$, policy optimization, tightening of the tracking factor $\sigma$, and reduction of the tracking error $x$, with arrows showing how they interact during optimization.

The diagram shows a continuous loop: the tracking error $x$ (estimated online as $\hat{x}$) influences the tracking factor; a reduction in $\hat{x}$ leads to a tightening of $\sigma$ (via $\sigma \leftarrow \min(\sigma, \hat{x})$). The tightened $\sigma$ then reshapes the reward function $r(x)$, which in turn drives policy optimization (to maximize rewards), leading to a further reduction in tracking error $x$. This self-reinforcing loop allows the system to converge toward the optimal $\sigma^*$.
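A minimal sketch of the adaptive update; the EMA coefficient and class name are illustrative assumptions, not the paper's constants:

```python
class AdaptiveTrackingFactor:
    """EMA-track the instantaneous tracking error, then tighten sigma
    monotonically via sigma <- min(sigma, ema_error)."""

    def __init__(self, sigma_init: float, alpha: float = 0.001):
        self.sigma = sigma_init      # initialized relatively large
        self.ema_error = None
        self.alpha = alpha           # EMA smoothing coefficient (assumed)

    def update(self, error: float) -> float:
        if self.ema_error is None:
            self.ema_error = error
        else:
            self.ema_error = self.alpha * error + (1 - self.alpha) * self.ema_error
        self.sigma = min(self.sigma, self.ema_error)  # non-increasing by design
        return self.sigma
```

Because the update only ever lowers $\sigma$, transient spikes in the error estimate cannot loosen the tolerance, which keeps the curriculum monotone.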
4.2.3. RL Training Framework
4.2.3.1. Asymmetric Actor-Critic
The policy optimization uses an asymmetric actor-critic architecture, typical in RL for sim-to-real transfer.
- Time Phase Variable: A time phase variable $\phi \in [0, 1]$ is introduced, representing the linear progress of the reference motion (0 at the start, 1 at the end).
- Actor's Observation Space: The actor operates with local observations. It receives:
  - The proprioceptive state of the robot, including historical information (last 5 steps) of joint positions, joint velocities, root angular velocity, root projected gravity, and the action from the previous step.
  - The time phase variable $\phi$.
- Critic's Observation Space: The critic uses privileged information to learn a better value function. Its observations include:
  - All components of the actor's observation (proprioceptive history and time phase).
  - Additionally, reference motion positions, root linear velocity, and a set of randomized physical parameters (e.g., base CoM offset, link mass, stiffness, damping, friction coefficient, control delay). These privileged parameters are crucial for learning a robust value function that generalizes across different physical conditions, aiding sim-to-real transfer.
4.2.3.2. Reward Vectorization
- To facilitate learning for multiple reward components, rewards and value functions are vectorized.
- Instead of summing all reward components into a single scalar, each component is assigned to a separate value function.
- The critic network has multiple output heads, each estimating the return for a specific reward component.
- All value functions are then aggregated to compute the action advantage. This design enhances value estimation and stability in policy optimization.
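A minimal sketch of the aggregation step, assuming per-component returns are already computed (the tensor shapes and function name are our own):

```python
import torch

def aggregate_advantage(returns: torch.Tensor, value_heads: torch.Tensor):
    """Reward vectorization sketch: the critic outputs one value estimate per
    reward component, shape (batch, n_components). Per-component advantages
    are summed into the scalar advantage used by the actor update."""
    return (returns - value_heads).sum(dim=-1)

adv = aggregate_advantage(torch.rand(4096, 8), torch.rand(4096, 8))
```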
4.2.3.3. Reference State Initialization (RSI)
- The robot's state is initialized from reference motion states at randomly sampled time phases.
- This technique allows for parallel learning across different segments (phases) of a motion, significantly improving training efficiency by avoiding repetitive learning from the very beginning of a motion every time.
4.2.3.4. Sim-to-Real Transfer
- Domain Randomization: To bridge the sim-to-real gap, domain randomization is applied during training. This involves varying physical parameters of the simulated environment and humanoid model (e.g., friction, mass, CoM offset, control delay, external perturbations). Training policies that are robust to these variations makes them more generalized, so they perform well on the real robot even with slight mismatches between simulation and reality.
- Zero-Shot Transfer: The policies trained with domain randomization are directly deployed to real robots without any fine-tuning, achieving zero-shot sim-to-real transfer.
5. Experimental Setup
5.1. Datasets
The experiments utilize a highly-dynamic motion dataset constructed using PBHC's motion processing pipeline.

- Sources:
  - Video-based sources: motions extracted from videos and fully processed by the PBHC pipeline.
  - Open-source datasets: selected motions from AMASS [4] and LAFAN [20], partially processed through the PBHC pipeline (contact mask estimation, motion correction, retargeting).
- Characteristics: The dataset comprises 13 distinct motions, categorized into three difficulty levels (easy, medium, and hard) based on their agility requirements.
- Transition Smoothing: To ensure smooth transitions, linear interpolation is applied at the beginning and end of each sequence to move from a default pose to the reference motion and back.

The following figure (Figure 5 from the original paper) shows examples of motions in the constructed dataset:
Figure 5: Examples of motions of different difficulty in the constructed dataset, including horse-stance punch (easy), stretch leg (medium), jump kick (medium), and 360-degree spin (hard). Blue trajectories indicate the motion path, with darker opacity indicating later timestamps.
As shown, motions range from relatively simple horse-stance punch to more complex stretch leg, jump kick, and 360-degree spin, with darker opacity indicating later timestamps in the motion trajectory.
The following are the details of the highly-dynamic motion dataset (Table 4 from the original paper):
| Motion name | Motion frames | Source |
| --- | --- | --- |
| **Easy** | | |
| Jabs punch | 285 | video |
| Hooks punch | 175 | video |
| Horse-stance pose | 210 | LAFAN |
| Horse-stance punch | 200 | video |
| **Medium** | | |
| Stretch leg | 320 | video |
| Tai Chi | 500 | video |
| Jump kick | 145 | video |
| Charleston dance | 610 | LAFAN |
| Bruce Lee's pose | 330 | AMASS |
| **Hard** | | |
| Roundhouse kick | 158 | AMASS |
| 360-degree spin | 180 | video |
| Front kick | 155 | video |
| Side kick | 179 | AMASS |
5.2. Evaluation Metrics
The tracking performance of policies is quantified using various error metrics, focusing on position, velocity, and acceleration errors across different body parts and joints. For a motion with $T$ frames, the expectation is taken over all body parts/joints and all frames where the robot is active. A combined code sketch for these metrics follows the list.

- Global Mean Per Body Position Error ($E_{\text{g-mpbpe}}$, mm)
  - Conceptual Definition: quantifies the average position error of all body parts (links) of the robot in global coordinates relative to the reference motion. It measures how far, on average, the robot's body is from the target positions in the global frame.
  - Mathematical Formula:
    $$E_{\text{g-mpbpe}} = \mathbb{E}_{b,t} \left[ \left\| p^b_t - \hat{p}^b_t \right\|_2 \right]$$
  - Symbol Explanation:
    - $\mathbb{E}_{b,t}$: expectation, averaged over all body parts $b$ and time steps $t$.
    - $p^b_t$: the 3D position vector of body part $b$ of the robot at time $t$.
    - $\hat{p}^b_t$: the 3D position vector of the corresponding body part in the reference motion at time $t$.
    - $\|\cdot\|_2$: the Euclidean norm (L2 norm).
- Root-Relative Mean Per Body Position Error ($E_{\text{mpbpe}}$, mm)
  - Conceptual Definition: measures the average position error of body parts relative to the robot's root (e.g., pelvis or base). It focuses on the internal posture and configuration error, decoupling it from any global drift of the entire robot.
  - Mathematical Formula:
    $$E_{\text{mpbpe}} = \mathbb{E}_{b,t} \left[ \left\| \big( p^b_t - p^{\text{root}}_t \big) - \big( \hat{p}^b_t - \hat{p}^{\text{root}}_t \big) \right\|_2 \right]$$
  - Symbol Explanation:
    - $p^{\text{root}}_t$: the 3D position vector of the robot's root at time $t$.
    - $\hat{p}^{\text{root}}_t$: the 3D position vector of the reference motion's root at time $t$.
    - Other symbols as above.
- Mean Per Joint Position Error ($E_{\text{mpjpe}}$, rad)
  - Conceptual Definition: quantifies the average angular error for each joint's position (angle) across the robot's Degrees of Freedom. It directly measures how closely the robot's joint configurations match the reference.
  - Mathematical Formula:
    $$E_{\text{mpjpe}} = \mathbb{E}_{t} \left[ \left\| q_t - \hat{q}_t \right\|_2 \right]$$
  - Symbol Explanation:
    - $q_t$: the vector of joint angles for the robot at time $t$.
    - $\hat{q}_t$: the vector of joint angles for the reference motion at time $t$.
- Mean Per Joint Velocity Error ($E_{\text{mpjve}}$, rad/frame)
  - Conceptual Definition: measures the average error in the angular velocities of the robot's joints compared to the reference motion, indicating how well the robot matches the speed of joint movements.
  - Mathematical Formula:
    $$E_{\text{mpjve}} = \mathbb{E}_{t} \left[ \left\| \dot{q}_t - \dot{\hat{q}}_t \right\|_2 \right], \quad \text{where } \dot{q}_t = q_t - q_{t-1}$$
  - Symbol Explanation:
    - $\dot{q}_t$: the joint angular velocities of the robot at time $t$, approximated as the difference in joint positions between $t$ and $t-1$.
    - $\dot{\hat{q}}_t$: the corresponding quantity for the reference motion.
- Mean Per Body Velocity Error ($E_{\text{mpbve}}$, mm/frame)
  - Conceptual Definition: measures the average error in the linear velocities of the robot's body parts compared to the reference motion, assessing how accurately the robot matches the speed of its body segments' movements.
  - Mathematical Formula:
    $$E_{\text{mpbve}} = \mathbb{E}_{b,t} \left[ \left\| v^b_t - \hat{v}^b_t \right\|_2 \right], \quad \text{where } v^b_t = p^b_t - p^b_{t-1}$$
  - Symbol Explanation:
    - $v^b_t$: the linear velocity of body part $b$ of the robot at time $t$, approximated as the difference in positions between $t$ and $t-1$; $\hat{v}^b_t$ is the reference counterpart.
- Mean Per Body Acceleration Error ($E_{\text{mpbae}}$, mm/frame²)
  - Conceptual Definition: measures the average error in the linear accelerations of the robot's body parts compared to the reference motion. It is particularly important for dynamic movements, indicating how well the robot matches the rates of change of velocities.
  - Mathematical Formula:
    $$E_{\text{mpbae}} = \mathbb{E}_{b,t} \left[ \left\| a^b_t - \hat{a}^b_t \right\|_2 \right], \quad \text{where } a^b_t = v^b_t - v^b_{t-1}$$
  - Symbol Explanation:
    - $a^b_t$: the linear acceleration of body part $b$ at time $t$, approximated as the difference in velocities between $t$ and $t-1$; $\hat{a}^b_t$ is the reference counterpart.
- Mean Foot Contact Mask Error ($E_{\text{mfcme}}$) (introduced in Appendix E.3)
  - Conceptual Definition: quantifies the average discrepancy between the robot's estimated foot contact state and the reference motion's foot contact state. It directly measures how accurately the robot's foot contacts (or lack thereof) match the intended contacts in the reference.
  - Mathematical Formula:
    $$E_{\text{mfcme}} = \mathbb{E}_{f,t} \left[ \left| c^f_t - \hat{c}^f_t \right| \right]$$
  - Symbol Explanation:
    - $c^f_t$: the actual (or estimated) binary contact mask for the robot's foot $f$ (1 for contact, 0 for no contact) at time $t$.
    - $\hat{c}^f_t$: the binary contact mask for the reference motion's foot at time $t$.
    - $|\cdot|$: the absolute difference (an L1 norm over feet and frames).
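A compact sketch computing several of these metrics from trajectories; the array shapes, the root-body convention, and the toy usage data are illustrative assumptions:

```python
import numpy as np

def tracking_metrics(robot_pos, ref_pos, robot_q, ref_q):
    """Position/velocity tracking errors from trajectories.
    robot_pos, ref_pos: (T, B, 3) global body positions (robot vs. reference).
    robot_q, ref_q: (T, J) joint angles. Body index 0 is taken as the root."""
    # Global mean per-body position error
    eg_mpbpe = np.linalg.norm(robot_pos - ref_pos, axis=-1).mean()
    # Root-relative variant: subtract each trajectory's own root position
    rel = (robot_pos - robot_pos[:, :1]) - (ref_pos - ref_pos[:, :1])
    e_mpbpe = np.linalg.norm(rel, axis=-1).mean()
    # Joint-space position error
    e_mpjpe = np.linalg.norm(robot_q - ref_q, axis=-1).mean()
    # Finite-difference joint velocities and their error
    dq, dq_ref = np.diff(robot_q, axis=0), np.diff(ref_q, axis=0)
    e_mpjve = np.linalg.norm(dq - dq_ref, axis=-1).mean()
    return dict(Eg_mpbpe=eg_mpbpe, Empbpe=e_mpbpe,
                Empjpe=e_mpjpe, Empjve=e_mpjve)

# Toy usage: T=50 frames, B=10 bodies, J=23 joints, small constant offset
rng = np.random.default_rng(0)
ref_p, ref_j = rng.normal(size=(50, 10, 3)), rng.normal(size=(50, 23))
print(tracking_metrics(ref_p + 0.01, ref_p, ref_j + 0.01, ref_j))
```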
5.3. Baselines
PBHC is compared against three baseline methods, all of which use an exponential form for the tracking reward function (similar to PBHC's design in Section 3.2.1), but with empirically tuned, fixed parameters rather than adaptive ones.

- OmniH2O [10]:
  - Description: adopts a teacher-student training paradigm; the teacher policy learns in a privileged setting, and the student policy learns from the teacher's outputs.
  - Implementation: the authors moderately increased the tracking reward weights to better match the G1 robot. The teacher and student policies were trained for 20 and 10 hours, respectively.
- ExBody2 [5]:
  - Description: utilizes a decoupled keypoint-velocity tracking mechanism, focusing on expressive whole-body control.
  - Implementation: as with OmniH2O, the teacher and student policies were trained for 20 and 10 hours, respectively.
- MaskedMimic [2]:
  - Description: primarily designed for character animation and focused on optimizing pose-level accuracy; it comprises three sequential training phases.
  - Implementation: the authors utilized only the first phase of MaskedMimic, as the full method is not pertinent to robot control tasks (it lacks constraints like partial observability and action smoothness). Each policy was trained for 18 hours.
  - Note: the paper also trains an "Oracle" version of PBHC that, like MaskedMimic, overlooks robot-specific constraints, enabling a fair comparison of pose-level accuracy in a less constrained setting.
5.4. Experiment Setup Details (from Appendix D.1, C.3, C.4, C.5, C.6)
5.4.1. Compute Platform
- Hardware: Each experiment was conducted on a machine equipped with:
  - CPU: 24-core Intel i7-13700 running at 5.2 GHz.
  - RAM: 32 GB.
  - GPU: a single NVIDIA GeForce RTX 4090.
- Operating System: Ubuntu 20.04.
- Training Time: Each model was trained for 27 hours.
5.4.2. Real Robot Setup
- Robot: Unitree G1 humanoid robot.
- System Architecture:
  - Onboard motion control board: collects sensor data.
  - External PC: connected via Ethernet; receives sensor data (using the DDS protocol), maintains the observation history, performs policy inference, and sends target joint angles back to the control board.
  - Control board: issues motor commands based on the received target joint angles.
5.4.3. Domain Randomization (from Appendix C.3)
To improve sim-to-real transfer and robustness, domain randomization is incorporated.
The following are the domain randomization settings (Table 7 from the original paper):
| Term | Value |
| --- | --- |
| **Dynamics Randomization** | |
| Friction | U(0.2, 1.2) |
| PD gain | U(0.9, 1.1) |
| Link mass (kg) | U(0.9, 1.1) × default |
| Ankle inertia (kg·m²) | U(0.9, 1.1) × default |
| Base CoM offset (m) | U(-0.05, 0.05) |
| ERFI [58] (N·m/kg) | 0.05 × torque limit |
| Control delay (ms) | U(0, 40) |
| **External Perturbation** | |
| Random push interval (s) | [5, 10] |
| Random push velocity (m/s) | 0.1 |

- U(a, b): uniform distribution between $a$ and $b$.
- "default" refers to the nominal value for the Unitree G1 robot.
- ERFI: External Resistive Force Impulse, a type of perturbation applied to the robot.
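A sketch of how one environment's parameters could be sampled per Table 7 (the dictionary keys and 3-D CoM-offset choice are our own assumptions):

```python
import numpy as np

def sample_domain_randomization(rng: np.random.Generator) -> dict:
    """Draw one set of randomized physics parameters following Table 7."""
    return {
        "friction": rng.uniform(0.2, 1.2),
        "pd_gain_scale": rng.uniform(0.9, 1.1),
        "link_mass_scale": rng.uniform(0.9, 1.1),      # x default link masses
        "ankle_inertia_scale": rng.uniform(0.9, 1.1),  # x default inertia
        "base_com_offset_m": rng.uniform(-0.05, 0.05, size=3),
        "control_delay_ms": rng.uniform(0.0, 40.0),
        "push_interval_s": rng.uniform(5.0, 10.0),
    }

params = sample_domain_randomization(np.random.default_rng(0))
```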
5.4.4. PPO Hyperparameters (from Appendix C.4)
The following are the PPO hyperparameters used for policy optimization (Table 8 from the original paper):
| Hyperparameter | Value |
| --- | --- |
| Optimizer | Adam |
| Batch size | 4096 |
| Mini batches | 4 |
| Learning epochs | 5 |
| Entropy coefficient | 0.01 |
| Value loss coefficient | 1.0 |
| Clip param | 0.2 |
| Max grad norm | 1.0 |
| Init noise std | 0.8 |
| Learning rate | 1e-3 |
| Desired KL | 0.01 |
| GAE decay factor ($\lambda$) | 0.95 |
| GAE discount factor ($\gamma$) | 0.99 |
| Actor MLP size | [512, 256, 128] |
| Critic MLP size | [768, 512, 128] |
| MLP activation | ELU |
5.4.5. Curriculum Learning (from Appendix C.5)
Two curriculum mechanisms are introduced to facilitate learning high-dynamic motions (a code sketch of both update rules follows this list):

- Termination Curriculum:
  - Mechanism: the episode terminates early if the humanoid's motion deviates from the reference beyond a termination threshold $\theta$.
  - Progression: during training, this threshold is gradually decreased, making the task more difficult and requiring higher precision.
  - Update Rule:
    $$\theta \leftarrow \operatorname{clip}\left( \theta - \delta_\theta, \; \theta_{\min}, \; \theta_{\max} \right)$$
    - $\theta$: the termination threshold, initialized at $\theta_0$.
    - $\operatorname{clip}(\cdot, \theta_{\min}, \theta_{\max})$: clips the value within the specified minimum and maximum bounds.
    - $\theta_0$: initial threshold; $\theta_{\min}$ / $\theta_{\max}$: minimum / maximum thresholds.
    - $\delta_\theta$: decay rate, causing $\theta$ to decrease over time.
- Penalty Curriculum:
  - Mechanism: a scaling factor $\alpha$ modulates the influence of regularization terms (penalties).
  - Progression: $\alpha$ progressively increases, gradually enforcing stronger regularization and promoting more stable and physically plausible behaviors. This keeps early training stages less strict.
  - Update Rule:
    $$\alpha \leftarrow \operatorname{clip}\left( \alpha \cdot (1 + \delta_\alpha), \; \alpha_{\min}, \; \alpha_{\max} \right), \qquad r'_{\mathrm{pen}} = \alpha \, r_{\mathrm{pen}}$$
    - $\alpha$: the penalty scaling factor, initialized at $\alpha_0$.
    - $\operatorname{clip}(\cdot, \alpha_{\min}, \alpha_{\max})$: clips the value within the specified minimum and maximum bounds.
    - $\alpha_0$: initial penalty scale; $\alpha_{\min}$ / $\alpha_{\max}$: minimum / maximum penalty scales.
    - $\delta_\alpha$: growth rate, causing $\alpha$ to increase over time.
    - $r'_{\mathrm{pen}}$: the scaled penalty reward; $r_{\mathrm{pen}}$: the original penalty reward.
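A minimal sketch of both curricula. All constants are illustrative, and the multiplicative form of the penalty growth is an assumption consistent with the stated "growth rate" (the paper's exact update constants are not reproduced here):

```python
def update_curricula(theta, alpha,
                     theta_bounds=(0.3, 1.5), alpha_bounds=(0.1, 1.0),
                     theta_decay=1e-4, alpha_growth=1e-4):
    """One per-step update: the termination threshold theta decays toward its
    lower bound (stricter termination), while the penalty scale alpha grows
    toward its upper bound (stronger regularization)."""
    clip = lambda v, lo, hi: max(lo, min(hi, v))
    theta = clip(theta - theta_decay, *theta_bounds)
    alpha = clip(alpha * (1.0 + alpha_growth), *alpha_bounds)
    return theta, alpha

def scaled_penalty(alpha, penalty_reward):
    """Apply the penalty curriculum's scaling to a regularization reward."""
    return alpha * penalty_reward
```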
5.4.6. PD Controller Parameter (from Appendix C.6)
- Mechanism: a Proportional-Derivative (PD) controller is used at the joint level to convert desired joint positions into motor torques.
- Gains: the stiffness ($k_p$) and damping ($k_d$) gains are specified for different joint groups.
- Numerical Stability: to improve numerical stability and fidelity in the simulator, the inertia of the ankle links is manually set to a fixed value.

The following are the PD controller gains (Table 9 from the original paper):

| Joint name | Stiffness ($k_p$) | Damping ($k_d$) |
| --- | --- | --- |
| Left/right shoulder pitch/roll | 100 | 2.0 |
| Left/right shoulder yaw | 50 | 2.0 |
| Left/right elbow | 50 | 2.0 |
| Waist pitch/roll/yaw | 400 | 5.0 |
| Left/right hip pitch/roll/yaw | 100 | 2.0 |
| Left/right knee | 150 | 4.0 |
| Left/right ankle pitch/roll | 40 | 2.0 |
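The joint-level control law is a standard PD rule with a zero desired velocity; a minimal sketch (the example gain values come from the knee row of Table 9, while the function name is our own):

```python
def pd_torque(q_des, q, q_dot, kp, kd):
    """Joint-level PD control: desired joint position -> motor torque,
    assuming a zero desired joint velocity."""
    return kp * (q_des - q) - kd * q_dot

# Example: a knee joint (kp=150, kd=4.0) slightly away from its target
print(pd_torque(q_des=0.5, q=0.45, q_dot=0.1, kp=150.0, kd=4.0))
```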
6. Results & Analysis
6.1. Motion Filtering
This section addresses Q1: "Can our physics-based motion filtering effectively filter out untrackable motions?"

- Methodology: the physics-based motion filtering method (Section 3.1) was applied to 10 motion sequences. Policies were then trained for each motion, and the Episode Length Ratio (ELR) was computed.
- ELR Definition: the ratio of average episode length to reference motion length. A high ELR indicates that the policy successfully tracks the motion for a long duration without early termination (e.g., due to falling or exceeding error thresholds).
- Findings:
  - 4 sequences were rejected by the filter, and 6 were accepted.
  - Accepted motions consistently achieved high ELRs.
  - Rejected motions achieved a maximum ELR of only 54%, indicating frequent violations of the termination conditions.
- Conclusion: the filtering method is effective in excluding inherently untrackable motions, thereby improving training efficiency by focusing on viable candidates.

The following figure (Figure 6 from the original paper) shows the distribution of ELR for accepted and rejected motions:

Figure 6: Distribution of episode length ratios for accepted and rejected motions. Blue dots represent accepted motions and orange dots rejected motions; the vertical axis is the episode length ratio (%), and the central dashed line marks the 54% boundary.

The plot clearly shows a distinct separation: accepted motions (blue dots) have ELRs mostly above 80%, while rejected motions (orange dots) have ELRs mostly below 60%, confirming the filter's effectiveness.
6.2. Main Result
This section addresses Q2: "Does PBHC achieve superior tracking performance compared to prior methods in simulation?"

- Comparison: PBHC is compared against OmniH2O, ExBody2, and MaskedMimic.
- Evaluation: policies are trained in IsaacGym [29] with three random seeds and evaluated over 1,000 rollout episodes. Motions are categorized into easy, medium, and hard difficulty levels.
- Findings:
  - Superiority of PBHC: PBHC consistently outperforms OmniH2O and ExBody2 across all evaluation metrics (position, velocity, and acceleration errors) and all difficulty levels, as indicated by significantly lower error values.
  - Adaptive Mechanism's Role: the paper attributes PBHC's improvements to its adaptive motion tracking mechanism, which dynamically adjusts tracking factors. Baselines with fixed, empirically tuned parameters struggle to generalize across diverse motions.
  - MaskedMimic Context: MaskedMimic sometimes performs well on certain metrics but is designed for character animation and is not suitable for robot control, since it neglects constraints like partial observability and action smoothness.
  - Oracle Comparison: an oracle version of PBHC (which, like MaskedMimic, ignores robot constraints) also shows strong performance, suggesting that PBHC's core tracking capabilities are robust even when constraints are relaxed.

The following are the main results comparing different methods across difficulty levels (Table 1 from the original paper):
| Method | Eg-mpbpe ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| **Easy** | | | | | | |
| OmniH2O | 233.54±4.013* | 103.67±1.912* | 1805.10±12.33* | 8.54±0.125* | 8.46±0.081* | 224.70±2.043 |
| ExBody2 | 588.22±11.43* | 332.50±3.584* | 4014.40±21.50* | 14.29±0.172* | 9.80±0.157* | 206.01±1.346* |
| Ours | 53.25±17.60 | 28.16±6.127 | 725.62±16.20 | 4.41±0.312 | 4.65±0.140 | 81.28±2.052 |
| MaskedMimic (Oracle) | 41.79±17.15 | 21.86±2.030 | 739.96±19.94* | 5.20±0.245 | 7.40±0.3 | 132.01±8.941 |
| Ours (Oracle) | 45.02±6.760 | 22.95±15.22 | 710.30±16.66 | 4.63±1.580 | 4.89±0.960 | 73.44±12.42 |
| **Medium** | | | | | | |
| OmniH2O | 433.64±16.22* | 151.42±7.340* | 2333.90±49.50* | 10.85±0.300 | 10.54±0.152 | 204.36±4.473 |
| ExBody2 | 619.84±26.16* | 261.01±1.592* | 3738.70±26.90* | 14.48±0.160* | 11.25±0.173 | 204.33±2.172* |
| Ours | 126.48±27.01 | 48.87±7.550 | 1043.30±104.4 | 6.62±0.412 | 7.19±0.254 | 105.30±5.941 |
| MaskedMimic (Oracle) | 150.92±33.4 | 61.69±46.01 | 934.25±155.0 | 8.16±1.974 | 10.01±0.83 | 176.84±26.1 |
| Ours (Oracle) | 66.85±50.29 | 29.56±14.53 | 753.69±100.2 | 5.34±0.425 | 6.58±0.291 | 82.73±3.108 |
| **Hard** | | | | | | |
| OmniH2O | 446.17±12.84 | 147.88±4.142 | 1939.50±23.90 | 14.98±0.643 | 14.40±0.580 | 190.13±8.211 |
| ExBody2 | 689.68±11.80 | 246.40±12.52* | 4037.40±16.70* | 19.90±0.210 | 16.72±0.160 | 254.76±3.409* |
| Ours | 290.36±139.1 | 124.61±53.54 | 1326.60±378.9 | 11.93±2.622 | 12.36±2.401 | 135.05±16.43 |
| MaskedMimic (Oracle) | 47.74±2.762 | 27.2±1.615 | 829.02±15.41 | 8.33±0.194 | 10.60±0.420* | 146.90±13.32* |
| Ours (Oracle) | 79.25±69.4 | 34.74±22.6 | 734.90±155.9 | 7.04±1.420 | 8.34±1.140 | 93.79±17.36 |
- Interpretation: for all three difficulty levels (Easy, Medium, Hard), the Ours method (PBHC) consistently shows the lowest error values among deployable methods across almost all metrics (Eg-mpbpe, Empbpe, Empjpe, Empbve, Empbae, Empjve). For example, in the Easy category, PBHC achieves an Eg-mpbpe of 53.25, significantly lower than OmniH2O (233.54) and ExBody2 (588.22). Similar trends are observed for Medium and Hard motions, where PBHC maintains a substantial lead over the deployable baselines. The Oracle versions (not constrained by real-world robot physics and observability) naturally achieve lower errors in some cases, but PBHC still performs very competitively, especially considering its real-world deployability. The asterisks denote significant improvements of PBHC over the baselines, further solidifying its advantage.
6.3. Impact of Adaptive Motion Tracking Mechanism
This section addresses Q3: "Does the adaptive motion tracking mechanism improve tracking precision?"

- Methodology: an ablation study was conducted, comparing PBHC's adaptive mechanism with four fixed tracking factor configurations: Coarse, Medium, UpperBound, and LowerBound. These fixed configurations represent different levels of reward tolerance, from very loose to very strict.
- Findings:
  - Inconsistency of Fixed Factors: the performance of fixed tracking factor configurations varied significantly across motion types; a setting that performed well for one motion might be suboptimal for another. This highlights the difficulty of finding a universal fixed $\sigma$.
  - Adaptive Mechanism's Robustness: PBHC's adaptive motion tracking mechanism consistently achieved near-optimal performance across all motion types, demonstrating its effectiveness in dynamically adjusting the tracking factor to suit the unique characteristics and difficulty of each motion.

The following figure (Figure 7 from the original paper) illustrates the results of this ablation study:

Figure 7: Comparison of the adaptive motion tracking mechanism against fixed tracking factors on different motions (Jabs punch, Charleston dance, Roundhouse kick, and Bruce Lee's pose). The blue curve represents our method, which achieves near-optimal performance on every motion.

The plot shows that for the different motions, the Ours method (blue curve, representing the adaptive mechanism) consistently achieves the lowest or near-lowest error, whereas the fixed tracking factor variants exhibit high variance in performance, sometimes doing well and sometimes poorly depending on the specific motion.
The following are the ablation study results evaluating the impact of different tracking factors on four motion tasks (Table 12 from the original paper):
| Method | Eg-mpbpe ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| **Jabs punch** | | | | | | |
| Ours | 44.38±7.118 | 28.00±3.533 | 783.36±11.73 | 5.52±0.156 | 6.23±0.063 | 88.01±2.465 |
| Coarse | 63.95±6.680 | 36.76±2.743 | 921.50±16.70 | 6.16±0.011 | 6.46±0.042 | 91.46±0.465 |
| Medium | 51.07±2.635 | 30.93±2.635 | 790.54±22.82 | 5.68±0.140 | 6.31±0.057 | 90.19±1.821 |
| Upperbound | 45.74±1.702 | 28.72±1.702 | 793.52±8.888 | 5.43±0.066 | 6.29±0.085 | 88.68±0.727 |
| Lowerbound | 48.66±0.488 | 28.97±0.487 | 781.73±16.72 | 5.61±0.079 | 6.31±0.06 | 88.44±1.397 |
| **Charleston dance** | | | | | | |
| Ours | 94.81±14.18 | 43.09±5.748 | 886.91±74.76 | 6.83±0.346 | 7.26±0.034 | 162.70±7.133 |
| Coarse | 119.24±4.501 | 55.80±1.324 | 1288.02±3.807 | 7.54±0.180 | 7.28±0.021 | 178.61±3.304 |
| Medium | 83.63±3.159 | 41.02±1.743 | 933.33±38.23 | 6.89±0.185 | 7.22±0.011 | 164.92±4.380 |
| Upperbound | 86.90±8.651 | 41.92±2.632 | 917.64±14.85 | 7.02±0.103 | 7.22±0.041 | 167.64±1.089 |
| Lowerbound | 358.82±10.35 | 145.42±1.109 | 1199.21±12.78 | 8.99±0.050 | 8.48±0.033 | 167.25±0.783 |
| **Roundhouse kick** | | | | | | |
| Ours | 52.53±2.106 | 28.39±1.400 | 708.55±16.04 | 6.85±0.196 | 7.13±0.046 | 106.22±0.715 |
| Coarse | 76.81±2.863 | 38.98±2.230 | 1008.32±29.74 | 7.49±0.234 | 7.57±0.044 | 108.40±0.010 |
| Medium | 63.12±5.178 | 33.74±2.336 | 806.84±66.23 | 7.03±0.125 | 7.32±0.046 | 104.77±1.319 |
| Upperbound | 54.95±2.164 | 31.31±0.344 | 766.32±12.92 | 6.93±0.013 | 7.19±0.012 | 105.64±1.911 |
| Lowerbound | 70.10±2.674 | 36.29±1.475 | 715.01±34.01 | 7.08±0.102 | 7.32±0.067 | 102.50±4.650 |
| **Bruce Lee's pose** | | | | | | |
| Ours | 196.22±17.03 | 69.12±2.392 | 972.04±49.27 | 7.57±0.214 | 8.54±0.198 | 94.36±3.750 |
| Coarse | 239.06±51.74 | 80.78±15.81 | 1678.34±394.3 | 8.42±0.525 | 8.93±0.422 | 112.30±10.87 |
| Medium | 470.24±249.2 | 206.92±116.1 | 4490.80±105.1 | 9.58±0.085 | 9.61±0.080 | 99.65±2.441 |
| Upperbound | 250.64±178.6 | 93.70±65.09 | 1358.02±561.6 | 8.31±2.160 | 8.94±1.384 | 106.30±23.06 |
| Lowerbound | 158.12±2.934 | 60.54±1.54 | 955.10±37.04 | 7.05±0.040 | 7.94±0.051 | 81.60±1.277 |
- Interpretation: this detailed table confirms that Ours (PBHC with adaptive $\sigma$) generally achieves the best performance (lowest error values) across the different metrics and motions. For example, for 'Jabs punch', Ours reports an Eg-mpbpe of 44.38, better than Coarse (63.95), Medium (51.07), Upperbound (45.74), and Lowerbound (48.66). This trend holds for 'Charleston dance' and 'Roundhouse kick'. For 'Bruce Lee's pose', the Lowerbound configuration attains somewhat lower errors on several metrics, illustrating that a single fixed setting can occasionally win on a specific motion, but Ours remains competitive there without any per-motion tuning. Overall, the results demonstrate the adaptive mechanism's capability to tune the tracking factor effectively, yielding robust, superior performance across varied dynamic motions compared to any single fixed setting.
6.4. Real-World Deployment
This section addresses Q4: "How well does PBHC perform in real-world deployment?"
- Methodology: Policies trained in simulation were deployed directly on the Unitree G1 robot without any fine-tuning (zero-shot sim-to-real transfer). Quantitative assessment was done by conducting 10 trials of the Tai Chi motion and computing evaluation metrics from onboard sensor readings.
- Findings:
  - Stable and Expressive Behaviors: The Unitree G1 robot successfully demonstrated a diverse range of highly-dynamic skills, including martial arts techniques (punches, kicks), acrobatic movements (360-degree spins), flexible motions (squats, stretches), and artistic performances (dance, Tai Chi).
  - Quantitative Alignment: The metrics obtained in the real world closely aligned with those from the sim-to-sim platform (MuJoCo).
- Conclusion: The policies transfer robustly from simulation to real-world deployment, maintaining high-performance control and showcasing PBHC's practical applicability.

The following figure (Figure 8 from the original paper) shows the robot mastering highly-dynamic skills in the real world:
The image is an illustration showing the robot mastering various highly-dynamic skills in the real world, including horse-stance punch, stretch leg, jabs punch, Tai Chi, jump kick, Bruce Lee's pose, roundhouse kick, 360-degree spin, front kick, and Charleston dance. Time flows from left to right, reflecting the continuity of the robot's dynamic movements.
The image sequence displays various dynamic poses, such as horse-stance punch, stretch leg, jabs punch, Tai Chi, jump kick, Bruce Lee's pose, roundhouse kick, 360-degree spin, front kick, and Charleston dance, demonstrating the robot's versatility.
The following is a comparison of the Tai Chi tracking performance between the real world and simulation (Table 2 from the original paper):
| Platform | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ | Empjve ↓ |
| --- | --- | --- | --- | --- | --- |
| MuJoCo | 33.18±2.720 | 1061.24±83.27 | 2.96±0.342 | 2.90±0.498 | 67.71±6.747 |
| Real | 36.64±2.592 | 1130.05±9.478 | 3.01±0.126 | 3.12±0.056 | 65.68±1.972 |
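As a rough illustration of how a position-tracking metric such as Empbpe can be computed from logged trajectories, consider the hedged sketch below; the paper's exact body set, reference frame, and normalization are assumptions (millimeters are inferred from the magnitudes in the tables):

```python
import numpy as np

def mean_per_body_position_error(pred_pos, ref_pos):
    """Empbpe-style error: mean Euclidean distance between tracked and
    reference body positions, assuming arrays of shape (T, num_bodies, 3)
    in meters and reporting millimeters. Illustrative only."""
    per_body_err = np.linalg.norm(pred_pos - ref_pos, axis=-1)  # (T, num_bodies)
    return 1000.0 * per_body_err.mean()

# Toy usage with 100 frames and 13 tracked bodies.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 13, 3))
pred = ref + 0.03 * rng.normal(size=ref.shape)  # ~3 cm of tracking noise
print(f"Empbpe ~= {mean_per_body_position_error(pred, ref):.1f} mm")
```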
- Interpretation: The table shows closely matched error values between the MuJoCo simulation platform and the real robot for the Tai Chi motion. For example, Empbpe is 33.18 in MuJoCo versus 36.64 in Real, a close match that strongly supports the claim of successful zero-shot sim-to-real transfer. The root of the robot is fixed to the origin for this comparison, since direct root position/velocity measurements are hard to obtain accurately on real-world robots.

The following figure (Figure 12 from the original paper) presents additional real-world results of the robot mastering more dynamic skills:
The image shows the robot imitating multiple dynamic skills, including: a) hooks punch, b) horse-stance pose, c) back kick, d) side kick, e) five stance form, f) fighting combo, and g) tap dance. Time flows from left to right, showing the dynamic skills the robot masters in the real world.
This figure further showcases the robot's capabilities in hooks punch, horse-stance pose, back kick, side kick, five stance form, fighting combo, and tap dance, reinforcing the qualitative demonstration of PBHC's effectiveness in diverse dynamic movements.
6.5. Learning Curves
- Methodology: Learning curves for mean episode length and mean reward are presented for three representative motions: Jabs Punch, Tai Chi, and Roundhouse Kick.
- Findings: The curves show that training gradually stabilizes and converges after approximately 20,000 steps.
- Conclusion: This demonstrates the reliability and efficiency of the PBHC approach in learning complex motion behaviors within a reasonable training duration.

The following figure (Figure 9 from the original paper) shows the mean episode length and mean reward across the three motions:
The figure shows the mean episode length and mean reward during training for three motions (Jabs punch, Tai Chi, Roundhouse kick). The curves indicate that training gradually stabilizes after 20k steps.
The top row (a) shows the Mean Episode Length for the three motions, all plateauing around 20k steps. The bottom row (b) shows the Mean Reward, which also stabilizes and converges around the same training steps. This indicates that the RL agent successfully learns to maintain the reference motion for longer durations and achieves high cumulative rewards as training progresses.
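For readers reproducing such plots, a generic way to smooth and visualize logged training statistics is sketched below; this is illustrative tooling, not the authors' pipeline, and the synthetic curve merely mimics the plateauing behavior described above:

```python
import numpy as np
import matplotlib.pyplot as plt

def smooth(values, window=50):
    # Simple moving average, commonly used when plotting noisy RL curves.
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Synthetic stand-in for a logged mean-reward curve that plateaus near 20k steps.
steps = np.arange(30_000)
reward = 1.0 - np.exp(-steps / 8_000) + 0.05 * np.random.default_rng(0).normal(size=steps.size)

sm = smooth(reward)
plt.plot(steps[: sm.size], sm)
plt.xlabel("training steps")
plt.ylabel("mean reward (smoothed)")
plt.show()
```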
6.6. Ablation Study of Contact Mask (from Appendix E.3)
- Methodology: An ablation study was conducted to evaluate the effectiveness of the contact mask in PBHC. It compared the full Ours method with Ours w/o contact mask (the same method without the contact mask component) on motions with distinct foot contact patterns (Charleston Dance, Jump Kick, Roundhouse Kick).
- Metric: The Mean Foot Contact Mask Error (Econtact-mask) was introduced.
- Findings:
  - The Ours method significantly reduced the foot contact error (Econtact-mask) compared to the baseline without the contact mask.
  - This also led to noticeable improvements in other tracking metrics.
- Conclusion: The contact-aware design is effective in improving tracking accuracy, especially for foot contacts.

The following figure (Figure 10 from the original paper) shows the accuracy of contact mask estimation across different methods:
The image is a bar chart showing the accuracy of different methods for contact mask estimation: 'Height' achieves 84.2%, 'Velocity' achieves 85.6%, and 'Ours' reaches 91.4%.
The bar chart shows that the proposed Ours method achieves 91.4% accuracy, outperforming the Height-based (84.2%) and Velocity-based (85.6%) estimators; a rough sketch of such threshold-based estimators follows.
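For intuition, here is a hedged Python sketch of the kind of threshold heuristics the Height and Velocity baselines represent, plus a simple combined rule; the thresholds and the combination are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def contact_from_height(foot_height, h_thresh=0.05):
    # Height baseline: a foot counts as "in contact" when near the ground.
    return foot_height < h_thresh

def contact_from_velocity(foot_velocity, v_thresh=0.30):
    # Velocity baseline: a foot counts as "in contact" when nearly static.
    return np.linalg.norm(foot_velocity, axis=-1) < v_thresh

def contact_combined(foot_height, foot_velocity, h_thresh=0.05, v_thresh=0.30):
    # Combined rule (illustrative): require both low height and low speed,
    # which rejects fast low swings as well as slow elevated holds.
    return contact_from_height(foot_height, h_thresh) & contact_from_velocity(
        foot_velocity, v_thresh
    )
```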
The following figure (Figure 11 from the original paper) presents a visual comparison of the efficacy of the proposed motion correction technique:
The image is a diagram demonstrating the effectiveness of motion correction. It shows the human pose before and after correction; after correction, the pose aligns visibly better with the ground plane, reducing floating artifacts.
The image visually demonstrates that motion correction effectively mitigates floating artifacts: before correction, the human model's feet hover above the ground, while after correction the feet are placed accurately on the ground, highlighting the importance of the correction step. A hedged sketch of this kind of correction is given below.
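As a minimal illustration of such a correction (assuming the pipeline vertically shifts the whole body on contact frames; the authors' exact procedure may differ), consider:

```python
import numpy as np

def correct_floating(root_pos, foot_heights, contact_mask, ground_z=0.0):
    """Shift the body vertically so that, on frames where a foot is flagged
    as in contact, the lowest contacting foot rests on the ground plane.
    Shapes: root_pos (T, 3), foot_heights (T, num_feet), contact_mask
    (T, num_feet) boolean. Illustrative only; in a full pipeline the root
    shift propagates to all bodies via forward kinematics."""
    corrected = root_pos.copy()
    for t in range(len(corrected)):
        if contact_mask[t].any():
            offset = foot_heights[t][contact_mask[t]].min() - ground_z
            corrected[t, 2] -= offset  # assume z is the vertical axis
    return corrected
```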
The following are the ablation results of the contact mask (Table 13 from the original paper):
| Method | Econtact-mask ↓ | Empbpe ↓ | Empjpe ↓ | Empbve ↓ | Empbae ↓ |
| --- | --- | --- | --- | --- | --- |
| Charleston dance | | | | | |
| Ours | 217.82±47.97 | 43.09±5.748 | 886.91±74.76 | 6.83±0.346 | 7.26±0.034 |
| Ours w/o contact mask | 633.91±49.74 | 76.13±53.01 | 980.40±222.0 | 7.72±1.439 | 7.64±0.594 |
| Jump kick | | | | | |
| Ours | 294.22±6.037 | 42.58±8.126 | 840.33±97.76 | 9.48±0.717 | 10.21±10.21 |
| Ours w/o contact mask | 386.75±6.036 | 170.28±97.29 | 1259.21±423.9 | 16.92±0.012 | 16.57±5.810 |
| Roundhouse kick | | | | | |
| Ours | 243.16±1.778 | 28.39±1.400 | 708.55±16.04 | 6.85±0.196 | 7.33±0.046 |
| Ours w/o contact mask | 250.10±6.123 | 36.76±2.743 | 921.52±16.70 | 6.16±0.012 | 6.46±0.042 |
- Interpretation: For all three motions, Ours (with the contact mask) achieves a significantly lower Econtact-mask than Ours w/o contact mask. For example, in 'Charleston dance', Econtact-mask drops from 633.91 to 217.82. This reduction in contact error also translates to lower Empbpe and Empjpe for 'Charleston dance' and 'Jump kick', indicating that accurate contact handling contributes to overall motion tracking fidelity. For 'Roundhouse kick', Econtact-mask is only slightly lower for Ours and the velocity/acceleration errors slightly favor the ablation, but the position errors (Empbpe, Empjpe) are notably better with the contact mask, reinforcing the benefit of the contact-aware design. A hedged sketch of the contact-error metric follows.
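The exact definition of Econtact-mask is not reproduced in this summary; as a hedged guess at its general form, a mean mismatch between measured and reference contact masks might look like:

```python
import numpy as np

def mean_foot_contact_mask_error(measured, reference, scale=1000.0):
    """Average mismatch between the robot's measured foot contacts and the
    reference contact mask, both boolean arrays of shape (T, num_feet).
    The scale factor is a guess chosen to match the tables' magnitudes;
    the paper's exact normalization is unknown."""
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return scale * np.mean(np.abs(measured - reference))
```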
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces PBHC, a novel reinforcement learning framework for humanoid whole-body motion control that enables robots to learn highly-dynamic human behaviors. Its key innovations are a physics-based multi-step motion processing pipeline that ensures the physical plausibility of reference motions and an adaptive motion tracking mechanism that dynamically adjusts the reward tolerance during training. Experiments show that PBHC achieves significantly lower tracking errors than existing baselines in simulation and exhibits robust zero-shot sim-to-real transfer to a Unitree G1 robot, performing complex Kungfu and dancing skills stably and expressively. The motion filtering metric efficiently identifies untrackable motions, and the adaptive reward mechanism consistently outperforms fixed-factor approaches.
7.2. Limitations & Future Work
The authors acknowledge the following limitations:
- Lack of Environment Awareness: The current method does not incorporate terrain perception or obstacle avoidance, which limits its deployment in unstructured real-world settings.
- Limited Skill Generalization: While it excels at highly-dynamic motions, the method's ability to generalize to diverse motion repertoires (i.e., a wide range of different skills) needs further exploration.

Based on these limitations, the authors suggest future research directions:

- Integrating environment awareness capabilities (e.g., perception of terrain and obstacles).
- Maintaining high dynamic performance while simultaneously enabling broader skill generalization.
7.3. Personal Insights & Critique
This paper presents a significant step forward in humanoid motion imitation, particularly for highly-dynamic and expressive skills. The two-pronged approach of meticulous physics-based motion processing and an intelligent adaptive reward curriculum is a powerful combination.
- Innovation of Adaptive Tracking: The bi-level optimization formulation and the EMA-driven adaptive tracking factor are particularly insightful. This moves beyond heuristic reward shaping to a more principled, data-driven adjustment of learning difficulty. It addresses a fundamental challenge in imitation learning: how to reward partial progress on hard tasks without overly penalizing early, imperfect attempts, while still driving toward high precision.
- Rigorous Motion Processing: The detailed motion processing pipeline, including CoM-CoP stability filtering and contact-aware correction, is crucial. Many RL approaches overlook the quality of reference data, assuming it is perfect. This paper demonstrates the immense value of cleaning and validating input data against physical constraints.
- Real-World Validation: The successful zero-shot sim-to-real transfer of complex motions on the Unitree G1 robot is compelling. It showcases the robustness of domain randomization and the efficacy of the proposed control framework.
Potential Issues/Areas for Improvement:
- Computational Cost: Training RL policies for complex humanoid robots is computationally intensive. The 27-hour training time per policy on a powerful GPU suggests that scaling to a broader range of skills or more complex environments would require significant computational resources, and the bi-level optimization, while theoretically sound, adds a layer of complexity.
- Generalization to Novel Motions: The adaptive mechanism handles varying difficulty within a known set of motions, but the question of truly novel, unseen motion types remains. How well would the adaptive factor generalize to motions structurally very different from the training data?
- Reactive Behavior: The framework focuses on imitating pre-defined reference motions. For real-world deployment, robots often need to react dynamically to unexpected external forces or environmental changes, which motion imitation alone does not address; the lack of environment awareness is a significant limitation in unstructured settings.
- Hyperparameter Sensitivity: Even with the adaptive mechanism, the initial tracking factor and the curriculum learning parameters (e.g., their decay rates) may still require careful tuning and could influence final performance.
Transferability:
The principles of adaptive reward shaping based on online performance metrics and physics-informed data preprocessing are highly transferable.
- Other Robotic Systems: This approach could be applied to other complex robotic systems (e.g., quadrupeds, manipulators) learning dynamic tasks.
- Skill Learning: The adaptive curriculum concept could generalize to other RL skill learning problems where tasks have varying difficulties or require progressive precision.
- Human-Robot Collaboration: The ability to accurately imitate human movements opens doors for more intuitive and effective human-robot collaboration and physical assistance.

Overall, KungfuBot provides a robust and innovative framework that pushes the boundaries of humanoid whole-body control, particularly for dynamic and expressive human skills, laying important groundwork for future humanoid intelligence and dexterity.