Perpetual Humanoid Control for Real-time Simulated Avatars
TL;DR Summary
This paper presents a physics-based humanoid controller for high-fidelity motion imitation and fault tolerance. The progressive multiplicative control policy (PMCP) dynamically allocates new network capacity, facilitating large-scale learning and task expansion without catastrophic forgetting.
Abstract
We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g. pose estimates from video or generated from language) and unexpected falls. Our controller scales up to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-state. Given reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn harder and harder motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live and real-time multi-person avatar use case.
In-depth Reading
1. Bibliographic Information
1.1. Title
Perpetual Humanoid Control for Real-time Simulated Avatars
1.2. Authors
The authors of this paper are Zhengyi Luo, Jinkun Cao, Alexander Winkler, Kris Kitani, and Weipeng Xu.
- Zhengyi Luo: Affiliated with Reality Labs Research, Meta, and Carnegie Mellon University.
- Jinkun Cao: Affiliated with Carnegie Mellon University.
- Alexander Winkler: Affiliated with Reality Labs Research, Meta.
- Kris Kitani: Affiliated with Reality Labs Research, Meta, and Carnegie Mellon University.
- Weipeng Xu: Affiliated with Reality Labs Research, Meta.
Their affiliations suggest a background in computer vision, robotics, reinforcement learning, and computer graphics, with a focus on real-world applications in virtual reality and simulated environments, given their connection to Meta's Reality Labs Research.
1.3. Journal/Conference
The paper was published at ICCV 2023.
ICCV (International Conference on Computer Vision) is one of the top-tier conferences in the field of computer vision, highly regarded for publishing cutting-edge research. Its influence is substantial, attracting a global audience of researchers and practitioners. Publication at ICCV indicates a high level of rigor, novelty, and impact in the research presented.
1.4. Publication Year
2023
1.5. Abstract
This paper introduces a physics-based humanoid controller designed for real-time simulated avatars, aiming for high-fidelity motion imitation and fault-tolerant behavior. The controller can handle noisy input, such as pose estimates from video or language-generated motions, and naturally recover from unexpected falls. It scales to learning ten thousand motion clips without relying on external stabilizing forces and can perpetually control simulated avatars without requiring resets. The core innovation is the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn increasingly difficult motion sequences. PMCP enables efficient scaling for large motion databases and the integration of new tasks, like fail-state recovery, without suffering from catastrophic forgetting. The controller's effectiveness is demonstrated by its ability to imitate noisy poses from video-based pose estimators and language-based motion generators in live, real-time multi-person avatar scenarios.
1.6. Original Source Link
https://openaccess.thecvf.com/content/ICCV2023/papers/Luo_Perpetual_Humanoid_Control_for_Real-time_Simulated_Avatars_ICCV_2023_paper.pdf
This paper is officially published at ICCV 2023.
2. Executive Summary
2.1. Background & Motivation
The creation of realistic and interactive human motion in simulated environments is a long-standing goal in computer graphics and robotics, with significant potential for virtual avatars and human-computer interaction. However, controlling high-degree-of-freedom (DOF) humanoids in physics-based simulations presents substantial challenges.
Core Problem: Existing physics-based controllers struggle with maintaining stability and imitating motion faithfully, especially under two critical real-world conditions:
- Noisy Input: When the reference motion comes from imperfect sources like video-based pose estimators or language-based motion generators, it often contains artifacts such as floating, foot sliding, or physically impossible poses. Current controllers tend to fall or deviate significantly.
- Unexpected Falls and Failures: Simulated humanoids can easily lose balance. Most prior methods resort to resetting the humanoid to a kinematic pose upon failure. This leads to teleportation artifacts and is highly undesirable for real-time virtual avatar applications where perpetual control is needed. Moreover, resetting to a noisy reference pose can create a vicious cycle of falling and re-resetting.
Challenges/Gaps in Prior Research:
- Scalability: Learning to imitate large-scale motion datasets (e.g., AMASS with tens of thousands of clips) with a single policy is largely unachieved due to the diversity and complexity of human motion.
- Physical Realism vs. Stability: Many successful motion imitators, like UHC, rely on residual force control (RFC), which applies non-physical "stabilizing forces." While effective for stability, RFC compromises physical realism and can introduce artifacts like flying or floating.
- Catastrophic Forgetting: As Reinforcement Learning (RL) policies are trained on diverse or sequential tasks, they often forget previously learned skills when acquiring new ones. This is a major hurdle for scaling to large datasets and integrating multiple capabilities (like imitation and recovery).
- Natural Recovery: Existing fail-safe mechanisms often result in unnatural teleportation or require distinct recovery policies that might not track the original motion smoothly.

Paper's Entry Point / Innovative Idea: The paper addresses these limitations by aiming to create a single, robust, physics-based humanoid controller, the Perpetual Humanoid Controller (PHC), that operates without external forces, handles noisy inputs, and naturally recovers from fail-states to resume imitation. The core innovation for scalability and multi-task learning is the progressive multiplicative control policy (PMCP).
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Perpetual Humanoid Controller (PHC): Introduction of a novel physics-based humanoid controller that successfully imitates 98.9% of the AMASS dataset without employing any external forces, thus maintaining physical realism. This controller can perpetually operate simulated avatars in real-time without requiring manual resets, even when facing noisy input or unexpected falls.
- Progressive Multiplicative Control Policy (PMCP): Proposal of PMCP, a novel RL training strategy that allows the controller to learn from large motion datasets and integrate new capabilities (like fail-state recovery) without suffering from catastrophic forgetting. PMCP dynamically allocates new network capacity (primitives) to progressively learn harder motion sequences and additional tasks, making the learning process efficient and scalable.
- Robustness and Task-Agnostic Design: Demonstration that PHC is robust to noisy inputs from off-the-shelf video-based pose estimators (e.g., HybrIK, MeTRAbs) and language-based motion generators (e.g., MDM). It serves as a drop-in solution for real-time virtual avatar applications, supporting both rotation-based and keypoint-based imitation, with the latter even outperforming the former in noisy conditions.

Key Conclusions/Findings:
- PHC achieves state-of-the-art motion imitation success rates on large MoCap datasets without compromising physical realism, by avoiding external forces.
- PMCP effectively mitigates catastrophic forgetting, enabling a single policy to learn an extensive range of motions and fail-state recovery behaviors.
- The controller is highly fault-tolerant, capable of naturally recovering from fallen or far-away states and seamlessly re-engaging with the reference motion.
- PHC can directly drive real-time simulated avatars using noisy input from live video streams or language-based motion generation, demonstrating its practical applicability.
- The keypoint-based imitation variant of PHC shows surprising robustness and performance, especially with noisy inputs, suggesting a simpler and potentially more robust input modality for certain applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the Perpetual Humanoid Controller (PHC) and its underlying progressive multiplicative control policy (PMCP), familiarity with several core concepts in robotics, computer graphics, and machine learning is beneficial.
- Humanoid Control: This refers to the methods and algorithms used to make virtual or physical human-like robots move and interact with their environment. The goal is often to produce realistic, stable, and responsive behaviors.
  - High-Degree-of-Freedom (DOF) Humanoids: A degree of freedom is an independent parameter that defines the state of a physical system. For a humanoid, each joint (e.g., shoulder, elbow, knee) can have multiple DOFs (e.g., pitch, yaw, roll rotations). A high-DOF humanoid has many such joints and rotation/translation parameters, making its control space very complex.
  - Physics-based Simulation: Instead of simply playing back pre-recorded motions (kinematic control), physics-based simulation involves modeling the humanoid's mass, inertia, joints, and forces (gravity, friction, joint torques). A physics engine then calculates how the humanoid moves and interacts with its environment according to the laws of physics. This leads to more realistic and robust behaviors but is also much harder to control.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal.
  - Agent: The decision-making entity (in this paper, the humanoid controller).
  - Environment: The simulated world where the humanoid operates (including its physics, the ground, etc.).
  - State ($s_t$): A complete description of the environment at a given time step (e.g., the humanoid's joint positions, velocities, orientation, position, and goal information).
  - Action ($a_t$): The output of the agent that influences the environment (in this paper, the target joint positions for the PD controllers).
  - Reward ($r_t$): A scalar feedback signal from the environment indicating how good or bad the agent's last action was (e.g., a high reward for imitating motion closely, a penalty for falling).
  - Policy ($\pi$): The agent's strategy for choosing actions given a state. The goal of RL is to learn an optimal policy.
  - Markov Decision Process (MDP): The mathematical framework for modeling RL problems. An MDP is defined by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$:
    - $\mathcal{S}$: A set of possible states.
    - $\mathcal{A}$: A set of possible actions.
    - $\mathcal{T}$: The transition dynamics, specifying the probability of reaching state $s_{t+1}$ after taking action $a_t$ in state $s_t$.
    - $\mathcal{R}$: The reward function, defining the immediate reward $R(s, a)$ for taking action $a$ in state $s$.
    - $\gamma$: The discount factor ($0 \le \gamma \le 1$), which determines the present value of future rewards.
  - Proximal Policy Optimization (PPO): A popular RL algorithm used in this paper. PPO is an on-policy algorithm that strikes a balance between ease of implementation, sample efficiency, and good performance. It makes small updates to the policy to avoid large changes that could destabilize training, ensuring that the new policy does not stray too far from the old one, often via a clipped objective function.
- Neural Networks:
  - Multi-layer Perceptron (MLP): A fundamental type of artificial neural network consisting of at least three layers: an input layer, one or more hidden layers, and an output layer, with each layer fully connected to the next. MLPs are general-purpose function approximators (a minimal sketch follows below).
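To make this concrete, here is a minimal PyTorch sketch of the kind of two-hidden-layer MLP (1024 and 512 units, as the paper reports for its networks) used as a function approximator. The class name, activation choice, and input size are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class TwoLayerMLP(nn.Module):
    """Generic [1024, 512] MLP function approximator (illustrative sketch)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),   # hidden layer 1
            nn.Linear(1024, 512), nn.ReLU(),      # hidden layer 2
            nn.Linear(512, out_dim),              # output layer (e.g., action mean)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example: map a state vector to a 69-D action mean (23 joints x 3 PD targets).
policy_mean = TwoLayerMLP(in_dim=934, out_dim=69)  # input size is an assumption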
- Control Systems:
  - Proportional-Derivative (PD) Controller: A classic feedback control loop. For a joint, a PD controller calculates the torque required to move the joint to a target position by considering two terms:
    - Proportional (P) Term: Proportional to the difference (error) between the target position and the current position. This term drives the joint towards the target.
    - Derivative (D) Term: Proportional to the rate of change of the error (i.e., the current velocity difference). This term damps oscillations and reduces overshoot.
    The RL policy in this paper outputs the target positions for these PD controllers, which then generate the actual joint torques.
- Human Body Models:
  - SMPL (Skinned Multi-Person Linear) Model: A widely used 3D statistical human body model. SMPL can represent a wide variety of human body shapes and poses using a small number of parameters. It provides a mesh (a 3D surface model) that can be skinned (attached to a skeleton) and deformed based on pose and shape parameters, making it suitable for animating realistic human motion.
- Adversarial Learning:
  - Adversarial Motion Prior (AMP): A technique that uses an adversarial discriminator to encourage generated motions to look human-like and natural. Instead of just matching a reference motion (which might be noisy or incomplete), AMP trains a discriminator to distinguish between real MoCap data and simulated motions. The RL agent receives a reward from this discriminator for producing motions that the discriminator classifies as "real." This helps the agent learn natural motion styles and recover gracefully, especially in scenarios where direct motion imitation might lead to unnatural artifacts.
- Challenges in Sequential Learning:
  - Catastrophic Forgetting: A significant problem in artificial neural networks where, when a network is sequentially trained on new tasks, its performance on previously learned tasks drastically degrades. The weights adapted for earlier tasks are overwritten by the new training, causing the network to "forget" what it learned before.
3.2. Previous Works
The paper extensively discusses prior research in physics-based motion imitation, fail-state recovery, and progressive reinforcement learning.
3.2.1. Physics-based Motion Imitation
Early work in physics-based characters focused on specific, small-scale tasks or required significant manual tuning. The advent of Reinforcement Learning (RL) brought about methods capable of learning complex locomotion and interaction skills.
- Small-scale use cases: Many RL approaches focused on interactive control based on user input [43, 2, 33, 32], specific sports actions [46, 18, 26], or modular tasks like reaching goals [47] or dribbling [33]. These often involved training on limited motion data or for specific skills.
- Scaling to datasets: Imitating large motion datasets has been a harder challenge. DeepMimic [29] was a pioneering work that learned to imitate single motion clips, demonstrating RL's potential for realistic physics-based motion. ScaDiver [45] attempted to scale to larger datasets (like the CMU MoCap dataset) by using a mixture of expert policies, achieving around an 80% success rate (measured by time to failure). A mixture of experts (MOE) typically involves multiple expert networks, with a gating network deciding which expert or combination of experts to use for a given input. The paper notes that ScaDiver had some success but did not scale to the largest datasets. Unicon [43] showed qualitative results in imitation and transfer but did not quantify performance on large datasets. MoCapAct [42] learned single-clip experts on CMU MoCap and then distilled them into a single policy, achieving 80% of the experts' performance.
- UHC (Universal Humanoid Control) [20]: This is the closest and most relevant prior work to PHC. UHC successfully imitated 97% of the AMASS dataset. However, its success heavily relied on residual force control (RFC) [50].
  - Residual Force Control (RFC) [50, 49]: A technique where an external force is applied at the root of the humanoid to assist in balancing and tracking the reference motion. This non-physical force acts like a "hand of God," making it easier for the RL agent to maintain stability, especially for challenging or noisy motions.
  - Drawbacks of RFC: As highlighted by the authors, RFC compromises physical realism and can introduce artifacts like floating or swinging, particularly with complex motions. The PHC paper explicitly aims to overcome this limitation by not using external forces.
  - RFC has been applied in human pose estimation from video [52, 21, 11] and language-based motion generation [51], showing its effectiveness for stabilization but also its inherent trade-off with realism.
3.2.2. Fail-state Recovery for Simulated Characters
Maintaining balance is a constant challenge for physics-based characters. Prior approaches to fail-state recovery include:
- Ignoring Physics/Compromising Realism: PhysCap [37] used a floating-base humanoid that did not require balancing, thereby compromising physical realism.
- Resetting Mechanisms: Egopose [49] designed a fail-safe mechanism to reset the humanoid to the kinematic pose when it was about to fall. The paper notes this can lead to teleportation behavior if the humanoid keeps resetting to unreliable kinematic poses (e.g., from noisy input).
- Sampling-based Control: NeuroMoCon [13] utilized sampling-based control and reran the sampling process if the humanoid fell. While effective, this approach does not guarantee success and may be too slow for real-time use cases.
- Additional Recovery Policies: Some methods [6] use an additional recovery policy when the humanoid deviates. However, without access to the original reference motion, such policies can produce unnatural behavior like high-frequency jitter. ASE [32] demonstrated the ability to rise naturally from the ground for a sword-swinging policy. The PHC paper points out that for motion imitation, the policy not only needs to get up but also seamlessly track the reference motion again.
- PHC's Approach: PHC proposes a comprehensive solution where the controller can rise from a fallen state, naturally walk back to the reference motion, and resume imitation, integrating this capability into a single, perpetual policy.
3.2.3. Progressive Reinforcement Learning
When learning from diverse data or multiple tasks, catastrophic forgetting [8, 25] is a major issue. Various methods have been proposed:
- Combating Catastrophic Forgetting:
  - Regularizing network weights [16]: Penalizing changes to important weights.
  - Learning multiple experts [15] or increasing capacity with a mixture of experts [54, 36, 45]: Specialized networks for different tasks or data subsets.
  - Multiplicative control [31]: A mechanism to combine the outputs of multiple policies.
- Progressive Learning Paradigms:
  - Progressive learning [5, 4] or curriculum learning [1]: Training models by gradually introducing more complex data or tasks.
  - Progressive Reinforcement Learning [3]: Distilling skills from multiple expert policies to find a single policy that matches their action distribution.
- Progressive Neural Networks (PNN) [34]: A direct predecessor to PMCP. PNNs avoid catastrophic forgetting by freezing the weights of previously learned subnetworks (for older tasks) and initializing additional subnetworks for new tasks. Lateral connections feed experiences from older subnetworks to newer ones.
  - Limitation of PNN for Motion Imitation: The paper notes that PNN requires manually choosing which subnetwork to use based on task labels. In motion imitation, the boundary between "hard" and "easy" motion sequences is blurred, and there are no clear task labels for different motion types, making direct application difficult.
3.3. Technological Evolution
The field has evolved from purely kinematic animation (pre-recorded, inflexible motion) to physics-based character control enabled by RL. Early RL solutions were often limited to specific, short tasks or required external forces for stability. The scaling challenge for physics-based motion imitation across vast, diverse MoCap datasets emerged as a key hurdle, leading to mixture of experts approaches. Simultaneously, the need for robust fail-state recovery became apparent, moving from simple resets to more natural, physics-informed recovery strategies. This paper's work represents a step forward by combining scalable motion imitation with natural fail-state recovery and robustness to noisy real-world inputs, all while maintaining physical realism by abstaining from external forces. It builds upon PNN and multiplicative control concepts to achieve this multi-faceted goal.
3.4. Differentiation Analysis
The Perpetual Humanoid Controller (PHC) differentiates itself from prior work in several key aspects:
- Elimination of External Stabilizing Forces: Unlike UHC [20], which relies on residual force control (RFC) to achieve high motion imitation success rates, PHC operates entirely without external forces. This ensures physical realism and prevents artifacts like floating or swinging that can arise from RFC, especially during challenging motions. This is a crucial distinction, as it makes the simulated avatar's behavior more physically plausible.
- Comprehensive and Natural Fail-state Recovery: While fail-state recovery has been addressed by methods like Egopose [49] (resetting) or ASE [32] (getting up), PHC provides a more integrated and natural solution. It not only enables the humanoid to rise from a fallen state but also to approach the reference motion naturally and seamlessly resume imitation. The policy is trained to handle fallen, faraway, or combined fail-states, ensuring continuous, perpetual control without disruptive resets.
- Scalability and Catastrophic Forgetting Mitigation via PMCP:
  - Progressive Learning: PHC employs the progressive multiplicative control policy (PMCP) to scale learning to the entire AMASS dataset (ten thousand clips). Unlike standard RL training on large datasets, which often leads to catastrophic forgetting, PMCP dynamically allocates new network capacity (primitives) to learn increasingly harder motion sequences and new tasks like fail-state recovery.
  - Dynamic Composition: PMCP builds upon Progressive Neural Networks (PNN) [34] but adapts them for motion imitation. While PNN requires explicit task labels and manual subnetwork selection, PMCP trains a composer to dynamically combine pretrained primitives based on the current state, without relying on predefined task boundaries.
  - Multiplicative Control: PMCP uses multiplicative control [31] rather than a typical Mixture of Experts (MOE) [45]. Instead of activating only one expert (top-1 MOE), MCP combines the distributions of all primitives, allowing for a richer and more nuanced control policy that benefits from the collective experience of its components.
- Robustness to Noisy and Diverse Inputs: PHC is explicitly designed to be robust to noisy input from video-based pose estimators (e.g., HybrIK, MeTRAbs) and language-based motion generators (e.g., MDM). This makes it a task-agnostic solution directly applicable to real-time virtual avatar applications, a critical feature for practical deployment. The paper also demonstrates the effectiveness of a keypoint-based controller, which is often more robust to noise than rotation-based methods.

In essence, PHC distinguishes itself by offering a robust, physically realistic, scalable, and perpetually operating humanoid controller that can seamlessly handle diverse, noisy inputs and fail-states, addressing fundamental limitations of prior physics-based motion imitation methods.
4. Methodology
The paper proposes the Perpetual Humanoid Controller (PHC), a physics-based humanoid controller designed for real-time simulated avatars. It achieves high-fidelity motion imitation, fault-tolerant behavior, and scalability through a novel progressive multiplicative control policy (PMCP).
4.1. Principles
The core idea behind PHC is to learn a single, robust RL policy that can imitate a vast range of human motions while inherently possessing the ability to recover from unexpected fail-states (like falling) without external intervention or manual resets. This is achieved by combining several principles:
- Goal-Conditioned Reinforcement Learning: The humanoid learns to perform actions to match a specified reference motion or keypoints as its goal.
- Adversarial Motion Prior (AMP): To ensure that the humanoid's movements are natural and human-like, even during recovery or when dealing with noisy input, an AMP discriminator is used to provide a style reward.
- No External Forces: Unlike previous approaches that relied on residual forces for stability, PHC is designed to be physically realistic by controlling the humanoid solely through joint torques generated by PD controllers.
- Progressive Learning for Scalability and Multi-tasking: The progressive multiplicative control policy (PMCP) is introduced to overcome catastrophic forgetting when learning from large, diverse motion datasets and when adding new, distinct tasks (like fail-state recovery). It does this by dynamically expanding network capacity (primitives) and then composing them.
- Robustness to Noise: The controller is designed to handle imperfections in reference motion data, such as those arising from video-based pose estimators or language-based generators.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Reference Motion Representation
The paper defines the reference pose at time $t$ as $\hat{q}_t \triangleq (\hat{\theta}_t, \hat{p}_t)$.
- $\hat{\theta}_t$: The 3D joint rotations for all links of the humanoid. The 6-DoF rotation representation [53] is used, which encodes each rotation with two 3D vectors (a 6D continuous representation).
- $\hat{p}_t$: The 3D positions of all links of the humanoid.

From a sequence of reference poses $\hat{q}_{1:T}$, the reference velocities $\hat{\dot{q}}_{1:T}$ can be computed using finite differences.
- $\hat{\dot{q}}_t \triangleq (\hat{\omega}_t, \hat{v}_t)$: The angular and linear velocities.
  - $\hat{\omega}_t$: Angular velocities for all links.
  - $\hat{v}_t$: Linear velocities for all links.

The paper distinguishes between two types of motion imitation based on input:
- Rotation-based imitation: Requires full reference poses (both rotations and keypoints).
- Keypoint-based imitation: Requires only the 3D keypoints.

A notation convention is established:
- $\hat{\cdot}$ (hat): Kinematic quantities (without physics simulation) obtained from pose estimators or keypoint detectors.
- $\tilde{\cdot}$ (tilde): Ground truth quantities from Motion Capture (MoCap).
- Normal symbols (without accents): Values from the physics simulation.
4.2.2. Goal-Conditioned Motion Imitation with Adversarial Motion Prior
The PHC controller is built upon a general goal-conditioned Reinforcement Learning (RL) framework (visualized in Figure 3). The goal-conditioned policy $\pi_{\mathrm{PHC}}$ is tasked with imitating reference motion $\hat{q}_{1:T}$ or keypoints $\hat{p}_{1:T}$.

The task is formulated as a Markov Decision Process (MDP): $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$.
- Physics Simulation: Determines the state $s_t \in \mathcal{S}$ and the transition dynamics $\mathcal{T}$.
- Policy: The policy computes the per-step action $a_t \in \mathcal{A}$.
- Reward Function: Based on the simulation state and reference motion, the reward function $\mathcal{R}$ computes a reward $r_t$ as the learning signal.
- Objective: The policy aims to maximize the discounted reward
  $$ \mathbb{E} \left[ \sum_{t=1}^T \gamma^{t-1} r_t \right], $$
  where $\mathbb{E}$ denotes the expectation over trajectories, $\gamma$ is the discount factor, and $r_t$ is the reward at time step $t$ (a toy sketch of this objective follows Figure 3 below).
- Learning Algorithm: Proximal Policy Optimization (PPO) [35] is used to learn $\pi_{\mathrm{PHC}}$.

The following figure (Figure 3 from the original paper) shows the goal-conditioned RL framework with Adversarial Motion Prior. The schematic illustrates the training of the policy $\pi_{\mathrm{PHC}}$, including the action $a_t$ and state $s_t$, the roles of the motion data and the discriminator $\mathcal{D}$, and how learning proceeds from the reference state and task reward, all within a physics simulation environment (Isaac Gym).

Figure 3: Goal-conditioned RL framework with Adversarial Motion Prior. Each primitive and composer is trained using the same procedure; here we visualize the final product $\pi_{\mathrm{PHC}}$.
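To make the objective above concrete, here is a hedged Python sketch that rolls a policy out in a generic simulated environment and accumulates the discounted return $\mathbb{E}[\sum_t \gamma^{t-1} r_t]$ that PPO maximizes. The `env` and `policy` interfaces are assumptions for illustration only, not the paper's implementation.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^(t-1) * r_t for one episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

def rollout(env, policy, ref_motion, max_steps=1000):
    """Collect one goal-conditioned episode (hypothetical env/policy API)."""
    s = env.reset(ref_motion)          # state = (proprioception, goal)
    rewards = []
    for t in range(max_steps):
        a = policy.sample(s)           # a_t ~ pi(a_t | s_t)
        s, r, done = env.step(a)       # physics simulation advances one step
        rewards.append(r)              # r_t combines task, AMP style, energy terms
        if done:                       # e.g., joints deviate > 0.5 m on average
            break
    return discounted_return(rewards)
```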
4.2.2.1. State
The simulation state $s_t \triangleq (s_t^{\mathrm{p}}, s_t^{\mathrm{g}})$ consists of two main parts:
- Humanoid Proprioception ($s_t^{\mathrm{p}}$): This describes the humanoid's internal state.
  - $\theta_t$ and $p_t$: The 3D body pose (joint rotations and positions) of the simulated humanoid.
  - $v_t$ and $\omega_t$: The linear and angular velocities of all joints of the simulated humanoid.
  - $\beta$: (Optionally) body shape parameters. When trained with different body shapes, $\beta$ provides information about the length of each limb [22].
- Goal State ($s_t^{\mathrm{g}}$): This describes the target motion to be imitated. It is defined as the difference between the next time step's reference quantities and their simulated counterparts; these differences serve as the goal for the policy.
  - For rotation-based motion imitation:
    $$ s_t^{\mathrm{g\text{-}rot}} \triangleq \big( \hat{\theta}_{t+1} \ominus \theta_t,\ \hat{p}_{t+1} - p_t,\ \hat{v}_{t+1} - v_t,\ \hat{\omega}_t - \omega_t,\ \hat{\theta}_{t+1},\ \hat{p}_{t+1} \big) $$
    - $\hat{\theta}_{t+1} \ominus \theta_t$: The rotation difference between the next reference joint rotation and the current simulated joint rotation; the $\ominus$ operator computes this difference in a way that respects the geometry of rotations (e.g., a quaternion difference).
    - $\hat{p}_{t+1} - p_t$: The position difference between the next reference joint position and the current simulated joint position.
    - $\hat{v}_{t+1} - v_t$: The linear velocity difference between the next reference linear velocity and the current simulated linear velocity.
    - $\hat{\omega}_t - \omega_t$: The angular velocity difference between the reference angular velocity and the current simulated angular velocity.
    - $\hat{\theta}_{t+1}$ and $\hat{p}_{t+1}$: The next reference joint rotation and position themselves.
  - For keypoint-only motion imitation, a simplified goal state uses only 3D keypoints:
    $$ s_t^{\mathrm{g\text{-}kp}} \triangleq \big( \hat{p}_{t+1} - p_t,\ \hat{v}_{t+1} - v_t,\ \hat{p}_{t+1} \big) $$
    - $\hat{p}_{t+1} - p_t$: The position difference between the next reference keypoint position and the current simulated keypoint position.
    - $\hat{v}_{t+1} - v_t$: The linear velocity difference between the next reference linear velocity and the current simulated linear velocity.
    - $\hat{p}_{t+1}$: The next reference keypoint position itself.
  - Normalization: All quantities in $s_t^{\mathrm{p}}$ and $s_t^{\mathrm{g}}$ are normalized with respect to the humanoid's current facing direction and root position [47, 20]. This makes the state representation invariant to global translation and heading, allowing the policy to learn more general behaviors.
4.2.2.2. Reward
The reward function provides the learning signal for the RL policy. Unlike prior methods that rely solely on motion imitation rewards, PHC incorporates an Adversarial Motion Prior (AMP) reward for more natural motion and an energy penalty.

The total reward is defined as
$$ r_t = 0.5\, r_t^{\mathrm{g}} + 0.5\, r_t^{\mathrm{amp}} + r_t^{\mathrm{energy}}. $$
- $r_t^{\mathrm{g}}$: The task reward, which changes based on the current objective (motion imitation or fail-state recovery).
- $r_t^{\mathrm{amp}}$: The style reward from the AMP discriminator. This term encourages the simulated motion to look natural and human-like by rewarding motions that are indistinguishable from real MoCap data, and it is crucial for stable and natural fail-state recovery.
- $r_t^{\mathrm{energy}}$: An energy penalty term [29]. This penalty helps regularize the policy and prevent high-frequency jitter (especially in the feet), which can occur in policies trained without external forces:
  $$ r_t^{\mathrm{energy}} = -0.0005 \cdot \sum_{j \in \mathrm{joints}} \left| \mu_j \omega_j \right|^2 $$
  - $\mu_j$: The joint torque applied at joint $j$.
  - $\omega_j$: The joint angular velocity of joint $j$.
  The penalty is proportional to the square of the product of torque and angular velocity, effectively penalizing power consumption and encouraging more energy-efficient, smoother movements.

The task reward $r_t^{\mathrm{g}}$ is defined as follows:
- For Motion Imitation:
  $$ r_t^{\mathrm{g\text{-}imitation}} = \mathcal{R}^{\mathrm{imitation}}(s_t, \hat{q}_t) = w_{\mathrm{jp}}\, e^{-100 \|\hat{p}_t - p_t\|} + w_{\mathrm{jr}}\, e^{-10 \|\hat{q}_t \ominus q_t\|} + w_{\mathrm{jv}}\, e^{-0.1 \|\hat{v}_t - v_t\|} + w_{\mathrm{j\omega}}\, e^{-0.1 \|\hat{\omega}_t - \omega_t\|} $$
  This reward encourages the simulated humanoid to match the reference motion across several key quantities:
  - $w_{\mathrm{jp}}\, e^{-100 \|\hat{p}_t - p_t\|}$: Rewards joint position matching; $w_{\mathrm{jp}}$ is a weight. The exponential kernel penalizes even small deviations heavily, and the reward quickly drops toward zero if positions diverge.
  - $w_{\mathrm{jr}}\, e^{-10 \|\hat{q}_t \ominus q_t\|}$: Rewards joint rotation matching; $w_{\mathrm{jr}}$ is a weight and $\ominus$ is the rotation difference.
  - $w_{\mathrm{jv}}\, e^{-0.1 \|\hat{v}_t - v_t\|}$: Rewards linear velocity matching; $w_{\mathrm{jv}}$ is a weight.
  - $w_{\mathrm{j\omega}}\, e^{-0.1 \|\hat{\omega}_t - \omega_t\|}$: Rewards angular velocity matching; $w_{\mathrm{j\omega}}$ is a weight.
  All norms are L2 norms (Euclidean distances), and the scaling factors (100, 10, 0.1) set the sensitivity to deviations of each component.
- For Fail-state Recovery: The reward is defined in Eq. 3, detailed in the Fail-state Recovery section below.
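For illustration, a minimal NumPy sketch of how the per-step reward could be assembled from the stated weights and exponential kernels. The per-term weights and array shapes are placeholders (assumptions), and the AMP style reward is taken as a given scalar since the discriminator itself is a learned network.

```python
import numpy as np

def imitation_reward(p_ref, p, q_diff, v_ref, v, w_ref, w,
                     weights=(0.5, 0.3, 0.1, 0.1)):
    """r_g = w_jp*exp(-100|dp|) + w_jr*exp(-10|dq|) + w_jv*exp(-0.1|dv|) + w_jw*exp(-0.1|dw|)."""
    w_jp, w_jr, w_jv, w_jw = weights          # example weights (assumption)
    return (w_jp * np.exp(-100 * np.linalg.norm(p_ref - p))
            + w_jr * np.exp(-10 * np.linalg.norm(q_diff))
            + w_jv * np.exp(-0.1 * np.linalg.norm(v_ref - v))
            + w_jw * np.exp(-0.1 * np.linalg.norm(w_ref - w)))

def energy_penalty(torques, joint_ang_vel):
    """r_energy = -0.0005 * sum_j |tau_j * omega_j|^2 (per-joint power penalty)."""
    return -0.0005 * np.sum((torques * joint_ang_vel) ** 2)

def total_reward(r_task, r_amp, r_energy):
    """r_t = 0.5*r_g + 0.5*r_amp + r_energy, as in the paper's total reward."""
    return 0.5 * r_task + 0.5 * r_amp + r_energy
```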
4.2.2.3. Action
The policy outputs actions that control the humanoid.
- PD Controller: A proportional-derivative (PD) controller is used at each degree of freedom (DoF) of the humanoid, and the action $a_t$ specifies the PD target.
- Target Joint Set: With the target joint set as $a_t$, the torque applied at joint $i$ is calculated as
  $$ \tau^i = k^p \circ (a_t - q_t) - k^d \circ \dot{q}_t $$
  - $\tau^i$: The torque applied at joint $i$.
  - $k^p$: The proportional gain vector (element-wise).
  - $k^d$: The derivative gain vector (element-wise).
  - $a_t$: The PD target output by the RL policy, i.e., the desired joint position.
  - $q_t$: The current joint position of the simulated humanoid.
  - $\dot{q}_t$: The current joint velocity of the simulated humanoid.
  - $\circ$: Element-wise multiplication (Hadamard product).
- Distinction from Residual Action: This formulation differs from the residual action representation [50, 20, 28] used in prior work, where the action is added to the reference pose. By removing this dependency on the reference motion in the action space, PHC gains robustness to noisy and ill-posed reference motions.
- No External Forces: Crucially, the controller does not use any external forces [50] or meta-PD control [52].
4.2.2.4. Control Policy and Discriminator
- Control Policy ($\pi_{\mathrm{PHC}}$): Represents a Gaussian distribution over actions:
  $$ \pi_{\mathrm{PHC}}(a_t \mid s_t) = \mathcal{N}\big(\mu(s_t), \sigma\big) $$
  - $\mathcal{N}$: A Gaussian (normal) distribution.
  - $\mu(s_t)$: The mean of the Gaussian, computed from the state $s_t$ by the neural network policy.
  - $\sigma$: The standard deviation of the Gaussian; here it is a fixed diagonal covariance.
- AMP Discriminator ($\mathcal{D}$):
  $$ \mathcal{D}\big(s_{t-10:t}^{\mathrm{p}}\big) $$
  The discriminator computes a real-or-fake value based on a history of the humanoid's proprioception $s_{t-10:t}^{\mathrm{p}}$, i.e., it observes the humanoid's body state over the last 10 time steps to judge whether the motion looks natural. It uses the same observations, loss formulation, and gradient penalty as the original AMP paper [33].
- Network Architecture: All neural networks in the framework (the discriminator, each primitive policy, the value function, and the composer) are two-layer multilayer perceptrons (MLPs) with dimensions [1024, 512], i.e., an input layer, two hidden layers with 1024 and 512 units, and an output layer.
4.2.2.5. Humanoid
The humanoid controller is designed to be flexible and can support any human kinematic structure.
- SMPL Model: The paper uses the SMPL [19] kinematic structure, following previous works [52, 20, 21].
- Degrees of Freedom: The SMPL body consists of 24 rigid bodies, of which 23 are actuated. This results in an action space of $a_t \in \mathbb{R}^{23 \times 3}$: for each of the 23 actuated joints, the policy outputs 3 PD targets (e.g., for roll, pitch, and yaw).
- Body Shape: The body proportions can vary based on a body shape parameter $\beta$, which allows controlling avatars with different physical builds.
4.2.2.6. Initialization and Relaxed Early Termination (RET)
- Reference State Initialization (RSI) [29]: During training, RSI is used: the humanoid's initial state is set to match a randomly selected starting point within a motion clip. This helps explore the state space efficiently and learn motion imitation.
- Early Termination: An episode terminates if the humanoid's joints deviate by more than 0.5 meters on average (globally) from the reference motion.
- Relaxed Early Termination (RET): A key modification proposed by the paper. Unlike UHC [20], PHC removes the ankle and toe joints from the termination condition.
  - Rationale: Simulated humanoids often have a dynamics mismatch with real humans, especially concerning the multi-segment foot [27]. Blindly following MoCap foot movements can cause the simulated humanoid to lose balance. RET allows these joints to deviate slightly from the MoCap motion to maintain balance.
  - Prevention of Unnatural Movement: Despite RET, the humanoid still receives imitation and discriminator rewards for these body parts, preventing them from moving in a non-human manner. This is a small but crucial detail for achieving a high motion imitation success rate.
4.2.2.7. Hard Negative Mining
To efficiently train on large motion datasets, it's important to focus on challenging sequences as training progresses.
- Procedure: Similar to UHC [20], the paper employs hard negative mining.
- Definition of Hard Sequences: Hard sequences are those which the current controller fails to imitate successfully.
- Training Strategy: By evaluating the model over the entire dataset and selecting the failed sequences, training can be biased towards these harder examples, making the learning process more informative. The paper notes that hard negative mining alone can still suffer from catastrophic forgetting, which PMCP aims to address.
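The hard-negative selection step can be summarized by the hedged sketch below: the current controller is evaluated on every clip, and the clips it fails to imitate form the next training subset. `evaluate_clip` is a hypothetical helper standing in for a full simulation rollout, not the paper's code.

```python
def mine_hard_sequences(policy, dataset, evaluate_clip):
    """Return the subset of clips the current policy fails to imitate.

    evaluate_clip(policy, clip) -> bool is assumed to run the clip in
    simulation and report success (e.g., joints stay within 0.5 m on average).
    """
    return [clip for clip in dataset if not evaluate_clip(policy, clip)]

# Usage: bias the next training stage toward the failed (harder) clips.
# Q_hard_next = mine_hard_sequences(P_k, Q_hard_k, evaluate_clip)
```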
4.2.3. Progressive Multiplicative Control Policy (PMCP)
The PMCP is the core innovation for enabling PHC to scale to large datasets and learn new tasks without catastrophic forgetting. The observation is that model performance plateaus and forgets older sequences when learning new ones.
The following figure (Figure 2 from the original paper) illustrates the training progress and the role of primitives:
Figure 2: The image is a diagram illustrating the training progress based on reference motion and simulated motion, involving primitive action sequences of varying difficulty and detailing the fail recovery strategy along with the full dataset. The strategies mentioned in the image include hard mining and composer C.
4.2.3.1. Progressive Neural Networks (PNN) Foundation
PMCP is inspired by Progressive Neural Networks (PNN) [34].
- PNN Mechanism: A PNN starts with a single primitive network $\mathcal{P}^{(1)}$ trained on the full dataset $\hat{Q}$. Once $\mathcal{P}^{(1)}$ has converged on the imitation task, a subset of hard motions $\hat{Q}_{\mathrm{hard}}$ is identified by evaluating $\mathcal{P}^{(1)}$ on $\hat{Q}$.
  - The parameters of $\mathcal{P}^{(1)}$ are then frozen.
  - A new primitive $\mathcal{P}^{(2)}$ (randomly initialized) is created.
  - Lateral connections are established, connecting each layer of $\mathcal{P}^{(1)}$ to $\mathcal{P}^{(2)}$; this allows $\mathcal{P}^{(2)}$ to leverage features learned by $\mathcal{P}^{(1)}$.
- PMCP Adaptation: In PMCP, each new primitive $\mathcal{P}^{(k)}$ is responsible for learning a new and harder subset of motion sequences. The paper also considers a variant where new primitives are initialized from the weights of the prior primitive (a weight-sharing scheme), similar to fine-tuning on harder sequences while preserving the previous primitive's ability to imitate already-learned sequences.
4.2.3.2. Fail-state Recovery as a New Task
Learning fail-state recovery is treated as a new task within the PMCP framework.
- Types of Fail-states:
  - Fallen on the ground: The humanoid has lost balance and is lying down.
  - Faraway from the reference motion (> 0.5 m): The humanoid has significantly deviated from the target path.
  - Combination: Both fallen and faraway.
- Recovery Objective: In these situations, the humanoid should get up, approach the reference motion naturally, and resume imitation.
- Dedicated Primitive ($\mathcal{P}^{(F)}$): A new primitive $\mathcal{P}^{(F)}$ is initialized at the end of the primitive stack specifically for fail-state recovery.
- Modified State Space for Recovery: During fail-state recovery, the reference motion itself is not useful (a fallen humanoid should not imitate a standing reference). The state space therefore removes all reference motion information except the root's reference.
  - Reference Joint Rotation Modification: For the reference joint rotation $\hat{\theta}_t = [\hat{\theta}_t^0, \hat{\theta}_t^1, \ldots, \hat{\theta}_t^j]$ (where $\hat{\theta}_t^j$ is joint $j$'s rotation), a modified reference is constructed:
    $$ \hat{\theta}_t' = [\hat{\theta}_t^0, \theta_t^1, \cdots, \theta_t^j] $$
    Here, all non-root joint rotations are replaced with the simulated values, effectively setting the non-root joint goals to identity (no change from the current simulated pose) during recovery; only the root joint keeps a reference.
  - Modified Goal State for Fail-state ($s_t^{\mathrm{g\text{-}fail}}$): This goal state reflects the recovery objective:
    $$ s_t^{\mathrm{g\text{-}fail}} \triangleq \big( \hat{\theta}_t' \ominus \theta_t,\ \hat{p}_t' - p_t,\ \hat{v}_t' - v_t,\ \hat{\omega}_t' - \omega_t,\ \hat{\theta}_t',\ \hat{p}_t' \big) $$
    The only reference information guiding this state is the relative position and orientation of the target root.
  - Clamping the Goal Position: If the reference root is too far away, the root goal position difference is normalized and clamped to
    $$ \frac{5 \times (\hat{p}_t' - p_t)}{\|\hat{p}_t' - p_t\|_2} $$
    to prevent excessively large goal signals.
  - Switching Condition: The goal state seamlessly switches between fail-state recovery and full-motion imitation based on the distance between the simulated root and the reference root:
    $$ s_t^{\mathrm{g}} = \begin{cases} s_t^{\mathrm{g}} & \text{if } \|\hat{p}_t^0 - p_t^0\|_2 \le 0.5 \\ s_t^{\mathrm{g\text{-}fail}} & \text{otherwise.} \end{cases} $$
    If the simulated root position is within 0.5 m of the reference root position, the normal imitation goal state is used; otherwise, the fail-state goal state is used.
- Creating Fail-states for Training:
  - Fallen states: The humanoid is randomly dropped on the ground, and random joint torques are applied for 150 time steps at the beginning of an episode (similar to ASE [32]).
  - Far states: The humanoid is initialized some distance away from the reference motion.
- Reward for Fail-state Recovery:
  $$ r_t^{\mathrm{g\text{-}recover}} = \mathcal{R}^{\mathrm{recover}}(s_t, \hat{q}_t) = 0.5\, r_t^{\mathrm{g\text{-}point}} + 0.5\, r_t^{\mathrm{amp}} + 0.1\, r_t^{\mathrm{energy}} $$
  - $r_t^{\mathrm{g\text{-}point}}$: A point-based reward that encourages the humanoid to reduce the distance between its root and the reference root [47].
  - The AMP style reward and energy penalty are also included to ensure natural and stable recovery.
- Training Data for Recovery: $\mathcal{P}^{(F)}$ is trained using a handpicked subset of the AMASS dataset, $\hat{Q}_{\mathrm{loco}}$, which contains mainly walking and running sequences. This biases the discriminator and AMP reward towards simple locomotion, which is appropriate for basic recovery.
- Value Function and Discriminator: The existing value function and discriminator are continuously fine-tuned; no new ones are initialized for recovery.
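A small sketch of the distance-based switch and root-goal clamping described above; the goal-construction helpers are assumptions used only to show the control flow, not the paper's implementation.

```python
import numpy as np

def select_goal_state(root_ref, root_sim, imitation_goal, fail_goal, threshold=0.5):
    """Use the imitation goal when the simulated root is within 0.5 m of the
    reference root; otherwise fall back to the fail-state recovery goal."""
    if np.linalg.norm(root_ref - root_sim) <= threshold:
        return imitation_goal
    return fail_goal

def clamp_root_goal(delta_root, max_norm=5.0):
    """Clamp an overly large root goal to 5 * (p_ref' - p) / ||p_ref' - p||."""
    norm = np.linalg.norm(delta_root)
    if norm > max_norm:
        return max_norm * delta_root / norm
    return delta_root
```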
4.2.3.3. Multiplicative Control
Once each primitive $\mathcal{P}^{(i)}$ has been trained to imitate a subset of the dataset or to perform fail-state recovery, a composer $\mathcal{C}$ is trained to dynamically combine the learned primitives.
- Composer ($\mathcal{C}$): The composer takes the same input state as the primitives and outputs weights that activate the primitives; these weights determine how much each primitive's output contributes to the final action.
- PHC's Output Distribution: The final PHC policy's action distribution is a multiplicative combination of the individual primitive distributions:
  $$ \pi_{\mathrm{PHC}}(a_t \mid s_t) = \frac{1}{Z(s_t)} \prod_{i=1}^{k} \mathcal{P}^{(i)}(a_t \mid s_t)^{\,\mathcal{C}_i(s_t)}, \qquad \mathcal{C}_i(s_t) \ge 0. $$
  - $\mathcal{P}^{(i)}(a_t \mid s_t)$: The action distribution predicted by the $i$-th primitive.
  - $\mathcal{C}_i(s_t)$: The composer's activation weight for primitive $i$.
  - $Z(s_t)$: A normalization term that makes the weighted product a valid distribution.
- Combined Action Distribution (for Gaussian Primitives): Since each $\mathcal{P}^{(i)}$ is an independent Gaussian policy, their multiplicative combination is also a Gaussian, with the following parameters for each action dimension $j$:
  $$ \mathcal{N}\left( \frac{1}{\sum_{l=1}^{k} \frac{\mathcal{C}_l(s_t)}{\sigma_l^j(s_t)}} \sum_{i=1}^{k} \frac{\mathcal{C}_i(s_t)}{\sigma_i^j(s_t)}\, \mu_i^j(s_t),\quad \sigma^j(s_t) = \left( \sum_{i=1}^{k} \frac{\mathcal{C}_i(s_t)}{\sigma_i^j(s_t)} \right)^{-1} \right) $$
  - $\mu_i^j(s_t)$: The mean output by primitive $i$ for action dimension $j$.
  - $\sigma_i^j(s_t)$: The standard deviation output by primitive $i$ for action dimension $j$.
  - $\mathcal{C}_i(s_t)$: The weight (activation) for primitive $i$ given state $s_t$, produced by the composer.
  The mean of the combined policy is thus a weighted average of the primitive means, and the combined scale is the inverse of the weighted sum of inverse primitive scales.
- MOE vs. MCP: Unlike a Mixture of Experts (MOE) policy (which typically uses top-1 gating and activates only one expert at a time), the Multiplicative Control Policy (MCP) combines the distributions of all primitives (akin to a "top-inf" MOE), allowing for a richer and more nuanced control policy that benefits from the collective experience of its components.
- Training Process (Alg. 1): The primitives are trained progressively, and the composer is then trained to combine them, ensuring that the composite policy leverages the specialized knowledge of each primitive. During composer training, fail-state recovery training is interleaved.
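The Gaussian product above has a closed form; the following NumPy sketch combines per-primitive means and standard deviations with the composer weights. Shapes and variable names are illustrative assumptions.

```python
import numpy as np

def compose_gaussians(means, stds, weights, eps=1e-8):
    """Multiplicatively combine k Gaussian primitives per action dimension.

    means, stds: arrays of shape (k, action_dim) from the primitives.
    weights:     array of shape (k,) of non-negative composer activations.
    Returns the mean and std of the composite Gaussian policy.
    """
    w = weights[:, None] / (stds + eps)          # C_i(s) / sigma_i^j
    denom = w.sum(axis=0) + eps                  # sum_i C_i / sigma_i^j
    mean = (w * means).sum(axis=0) / denom       # weighted average of means
    std = 1.0 / denom                            # inverse of the weighted sum
    return mean, std

# Example with 3 primitives and a 69-D action space:
# mean, std = compose_gaussians(mu, sigma, composer_weights)
# a_t = np.random.normal(mean, std)              # sample the composite action
```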
Algorithm 1: PMCP Training Procedure
The following is the Progressive Multiplicative Control Policy (PMCP) training procedure as described in Algorithm 1 of the original paper:
1 Function TrainPPO (π, Q(k), D, V, R):
2 while not converged do
3 M ← ∅ initialize sampling memory ;
4 while M not full do
5 q_1:T ← sample motion from Q ;
6 for t = 1 . . . T do
7 s_t ← (s_t^p, s_t^g) ;
8 a_t ← π(a_t | s_t) ;
9 s_{t+1} ← T(s_{t+1} | s_t, a_t) // simulation;
10 r_t ← R(s_t, q_{t+1}) ;
11 store (s_t, a_t, r_t, s_{t+1}) into memory M ;
12 P^(k), V ← PPO update using experiences collected in M ;
13 D ← Discriminator update using experiences collected in M ;
14 return π;
15 Input: Ground truth motion dataset Q ;
16 D, V, Q_hard^(1) ← Q . // Initialize discriminator, value
function, and dataset;
17 for k ← 1 . . . K do
18 Initialize P^(k) // with lateral connection/weight sharing;
19 Q_hard^(k+1) ← eval(P^(k), Q_hard^(k)) // Collect new hard sequences from Q_hard^(k) that P^(k) fails on;
20 P^(k) ← TrainPPO(P^(k), Q_hard^(k+1), D, V, R_imitation) ;
21 freeze P^(k) ;
22 P^(F) ← TrainPPO(P^(F), Q_loco, D, V, R_recover) // Fail-state Recovery;
23 π_PHC ← {P^(1) ... P^(K), P^(F), c} // The final PHC is composed of all primitives and a composer c;
24 PHC ← TrainPPO(π_PHC, Q, D, V, {R_imitation, R_recover}) // Train Composer;
Explanation of Algorithm 1:
TrainPPO Function (Lines 1-14): This is a generic PPO training loop used for training individual policies (primitives) and the final composer.
- Line 1: Defines the TrainPPO function, taking a policy $\pi$, a motion dataset $\hat{Q}^{(k)}$ (the full set or a subset), a discriminator $\mathcal{D}$, a value function $\mathcal{V}$, and a reward function $\mathcal{R}$ as inputs.
- Line 2: Training continues until a convergence criterion is met.
- Line 3: Initializes an empty sampling memory $M$ that will store experience tuples collected during policy rollouts.
- Line 4: Loops until the sampling memory $M$ is full.
- Line 5: Samples a motion sequence $\hat{q}_{1:T}$ from the provided motion dataset.
- Line 6: Iterates through each time step $t$ of the sampled motion sequence.
- Line 7: Constructs the current state $s_t$ by combining proprioception $s_t^{\mathrm{p}}$ and goal state $s_t^{\mathrm{g}}$.
- Line 8: The policy $\pi$ takes the current state $s_t$ and outputs an action $a_t$.
- Line 9: The physics simulation (transition dynamics $\mathcal{T}$) advances the environment from $s_t$ to $s_{t+1}$ based on the action $a_t$.
- Line 10: The reward function $\mathcal{R}$ computes the reward $r_t$ from the current state $s_t$ and the next reference pose $\hat{q}_{t+1}$.
- Line 11: The experience tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the sampling memory $M$.
- Line 12: Once $M$ is full, PPO is used to update the policy $\mathcal{P}^{(k)}$ and the value function $\mathcal{V}$ using the collected experiences.
- Line 13: The discriminator $\mathcal{D}$ is also updated using the experiences collected in $M$.
- Line 14: Returns the trained policy $\pi$.

Main Training Loop (Lines 15-24): This describes the progressive training of the primitives and the composer.
- Line 15: The input is the full ground truth motion dataset $\hat{Q}$.
- Line 16: Initializes the discriminator, the value function, and the initial hard-sequence dataset $\hat{Q}_{\mathrm{hard}}^{(1)}$ as the entire ground truth motion dataset $\hat{Q}$.
- Line 17: Loops for $k$ from 1 to $K$, where $K$ is the total number of motion imitation primitives; this gives the progressive training stages.
- Line 18: Initializes the primitive $\mathcal{P}^{(k)}$, setting up lateral connections to previous primitives or weight sharing with the previous primitive (as discussed for PNN).
- Line 19: Evaluates the current primitive $\mathcal{P}^{(k)}$ on the previous set of hard sequences $\hat{Q}_{\mathrm{hard}}^{(k)}$ and designates the sequences it still fails to imitate as the new, harder set $\hat{Q}_{\mathrm{hard}}^{(k+1)}$ for the next stage.
- Line 20: Trains the primitive $\mathcal{P}^{(k)}$ with TrainPPO on the newly identified hard sequences, using the imitation reward $\mathcal{R}^{\mathrm{imitation}}$.
- Line 21: After $\mathcal{P}^{(k)}$ is trained on its hard sequences, its parameters are frozen to prevent catastrophic forgetting.
- Line 22: After all motion imitation primitives are trained and frozen, a dedicated fail-state recovery primitive $\mathcal{P}^{(F)}$ is trained with TrainPPO on the locomotion dataset $\hat{Q}_{\mathrm{loco}}$, using the fail-state recovery reward $\mathcal{R}^{\mathrm{recover}}$.
- Line 23: The final Perpetual Humanoid Controller $\pi_{\mathrm{PHC}}$ is formed by combining all trained primitives $\mathcal{P}^{(1)} \ldots \mathcal{P}^{(K)}, \mathcal{P}^{(F)}$ with a composer $\mathcal{C}$.
- Line 24: The composer (as part of $\pi_{\mathrm{PHC}}$) is trained with TrainPPO on the full ground truth motion dataset $\hat{Q}$, using a combined reward structure that includes both imitation and recovery rewards. This final stage teaches the composer how to dynamically select and blend the primitives depending on the current situation (imitation or recovery).
4.2.4. Connecting with Motion Estimators
The PHC is designed to be task-agnostic, meaning it only requires the next-timestep reference pose or keypoint for motion tracking. This modular design allows it to be integrated with various off-the-shelf motion estimators and generators.
- Video-based Pose Estimators:
  - HybrIK [17]: Used to estimate joint rotations.
  - MeTRAbs [39, 38]: Used to estimate global 3D keypoints.
  - Distinction: HybrIK provides rotations, while MeTRAbs provides keypoints, matching PHC's support for both rotation-based and keypoint-based imitation.
- Language-based Motion Generation:
  - Motion Diffusion Model (MDM) [41]: A model that generates disjoint motion sequences from text prompts. PHC's recovery ability is crucial here to achieve in-betweening (smoothly transitioning between disconnected generated clips).

The following figure (Figure 4 from the original paper) shows examples of noisy motion imitation from video and language:

Figure 4: Noisy motion imitation (video: real-time and live, pose estimation from MeTRAbs) from a webcam stream driving a real-time simulated avatar.
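As a rough illustration of the drop-in usage described above, the sketch below feeds per-frame 3D keypoints from an off-the-shelf estimator into the keypoint goal of a PHC-style controller. All interfaces here (`estimate_keypoints`, `controller.step`) are hypothetical placeholders, not the released API.

```python
def drive_avatar_from_video(frames, estimate_keypoints, controller, dt=1.0 / 30):
    """Drive a simulated avatar from a video stream (hypothetical interfaces).

    estimate_keypoints(frame) -> (J, 3) array of global 3D keypoints, e.g.
    from a MeTRAbs-style estimator; controller.step consumes the next-frame
    reference keypoints and advances the physics simulation by one step.
    """
    prev_kp = None
    for frame in frames:
        kp = estimate_keypoints(frame)                       # noisy kinematic estimate
        vel = (kp - prev_kp) / dt if prev_kp is not None else 0.0 * kp
        controller.step(ref_keypoints=kp, ref_velocity=vel)  # keypoint-based goal
        prev_kp = kp
```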
5. Experimental Setup
The paper evaluates the Perpetual Humanoid Controller (PHC) on its ability to imitate high-quality MoCap sequences, noisy motion sequences estimated from videos, and its capacity for fail-state recovery.
5.1. Datasets
- AMASS [23]:
  - Source: A large-scale Motion Capture (MoCap) dataset that aggregates diverse MoCap data from various labs.
  - Characteristics: Contains tens of thousands of clips (40 hours of motion) with corresponding surface shapes (via SMPL parameters).
  - Usage: PHC is primarily trained on the training split of AMASS.
  - Filtering: Following UHC [20], sequences that are noisy or involve human-object interactions are removed. This results in 11,313 high-quality training sequences and 140 test sequences used for evaluation.
  - $\hat{Q}_{\mathrm{loco}}$: A handpicked subset of the AMASS dataset containing mainly walking and running sequences, used to train the fail-state recovery primitive $\mathcal{P}^{(F)}$.
- H36M (Human3.6M) [14]:
  - Source: A popular human motion dataset containing 3.6 million 3D human poses captured in various scenarios.
  - Usage: Used to evaluate the policy's ability to handle unseen MoCap sequences and noisy pose estimates from vision-based pose estimation methods.
  - Subsets Derived:
    - H36M-Motion*: Contains 140 high-quality MoCap sequences from the entire H36M dataset; this set tests PHC's generalization to unseen MoCap data.
    - H36M-Test-Video*: Contains 160 sequences of noisy poses estimated from videos in the H36M test split; this is crucial for evaluating PHC's robustness to real-world, noisy video input. (The * indicates removal of human-chair interaction sequences.)
5.2. Evaluation Metrics
The paper uses a comprehensive set of pose-based and physics-based metrics to evaluate motion imitation performance.
-
Success Rate (Succ):
- Conceptual Definition: This metric quantifies the percentage of
motion clipsthat thehumanoid controllercan successfully imitate without major failures. Animitation episodeis deemedunsuccessfulif, at any point, theaverage global deviationof thehumanoid's jointsfrom thereference motionexceeds0.5 meters. - Purpose:
Succmeasures thehumanoid's abilitytotrack the reference motioncontinuously without losing balance or significantly lagging behind. - Mathematical Formula: The paper does not provide a specific formula, but it implies a binary outcome per clip: 1 if successful, 0 if failed. The success rate is then the average over all clips.
$
\mathrm{Succ} = \frac{1}{N_{clips}} \sum_{i=1}^{N_{clips}} \mathbf{1}(\mathrm{episode}_i \text{ is successful}) \times 100%
$
- : Total number of
motion clipsevaluated. - : Indicator function, which is 1 if the condition is true, and 0 otherwise.
- An episode is successful if , where and are the
simulatedandreference 3D positionsof joint at time , and is the number of joints.
- : Total number of
- Conceptual Definition: This metric quantifies the percentage of
- Root-relative Mean Per-Joint Position Error (MPJPE):
    - Conceptual Definition: Measures the average Euclidean distance between the simulated 3D joint positions and the ground truth 3D joint positions, after aligning the root joints (e.g., pelvis) of both the simulated and reference poses. This metric focuses on the relative accuracy of joint positions within the body, effectively removing global translation differences.
    - Purpose: MPJPE assesses the local fidelity of the imitated motion's pose structure.
    - Mathematical Formula:
        $
        E_{\mathrm{mpjpe}} = \frac{1}{N_F \cdot J} \sum_{k=1}^{N_F} \sum_{j=1}^J \| (\hat{P}_{k,j} - \hat{P}_{k,root}) - (P_{k,j} - P_{k,root}) \|_2
        $
        - $N_F$: The total number of frames evaluated across all successful episodes.
        - $J$: The total number of joints in the human body model.
        - $\hat{P}_{k,j}$: The ground truth 3D position vector of joint $j$ at frame $k$.
        - $P_{k,j}$: The simulated 3D position vector of joint $j$ at frame $k$.
        - $\hat{P}_{k,root}$: The ground truth 3D position vector of the root joint (e.g., pelvis) at frame $k$.
        - $P_{k,root}$: The simulated 3D position vector of the root joint at frame $k$.
        - $\| \cdot \|_2$: The Euclidean (L2) norm, representing the distance between two vectors.
        - $(\hat{P}_{k,j} - \hat{P}_{k,root})$: The relative position of joint $j$ to the root in the ground truth pose.
        - $(P_{k,j} - P_{k,root})$: The relative position of joint $j$ to the root in the simulated pose.
- Global MPJPE ($E_{\mathrm{g-mpjpe}}$):
    - Conceptual Definition: Measures the average Euclidean distance between the simulated 3D joint positions and the ground truth 3D joint positions without any root alignment.
    - Purpose: Global MPJPE considers both the relative joint accuracy and the global position accuracy of the entire pose.
    - Mathematical Formula:
        $
        E_{\mathrm{g-mpjpe}} = \frac{1}{N_F \cdot J} \sum_{k=1}^{N_F} \sum_{j=1}^J \| \hat{P}_{k,j} - P_{k,j} \|_2
        $
        - $N_F$: The total number of frames evaluated across all successful episodes.
        - $J$: The total number of joints in the human body model.
        - $\hat{P}_{k,j}$: The ground truth 3D position vector of joint $j$ at frame $k$.
        - $P_{k,j}$: The simulated 3D position vector of joint $j$ at frame $k$.
        - $\| \cdot \|_2$: The Euclidean (L2) norm, representing the distance between two vectors.
- Acceleration Error ($E_{\mathrm{acc}}$):
    - Conceptual Definition: Measures the difference in acceleration between the simulated motion and the reference MoCap motion.
    - Purpose: This physics-based metric indicates the smoothness and physical plausibility of the simulated motion. High acceleration error can suggest jerky or unnatural movements.
    - Mathematical Formula: The paper does not provide a specific formula. Generally, the acceleration of a joint position can be approximated by finite differences, e.g., $A_{k,j} \approx (P_{k+1,j} - 2 P_{k,j} + P_{k-1,j}) / (\Delta t)^2$. The error is the average Euclidean distance between simulated and reference accelerations.
        $
        E_{\mathrm{acc}} = \frac{1}{N_F \cdot J} \sum_{k=1}^{N_F} \sum_{j=1}^J \| A_{k,j} - \hat{A}_{k,j} \|_2
        $
        - $N_F$: Total number of frames.
        - $J$: Total number of joints.
        - $A_{k,j}$: Simulated acceleration of joint $j$ at frame $k$.
        - $\hat{A}_{k,j}$: Reference acceleration of joint $j$ at frame $k$.
- Velocity Error ($E_{\mathrm{vel}}$):
    - Conceptual Definition: Measures the difference in velocity between the simulated motion and the reference MoCap motion.
    - Purpose: Another physics-based metric that, similar to acceleration error, reflects the physical realism and smoothness of the imitated motion.
    - Mathematical Formula: The paper does not provide a specific formula. Generally, the velocity of a joint position can be approximated by finite differences, e.g., $V_{k,j} \approx (P_{k+1,j} - P_{k,j}) / \Delta t$. The error is the average Euclidean distance between simulated and reference velocities.
        $
        E_{\mathrm{vel}} = \frac{1}{N_F \cdot J} \sum_{k=1}^{N_F} \sum_{j=1}^J \| V_{k,j} - \hat{V}_{k,j} \|_2
        $
        - $N_F$: Total number of frames.
        - $J$: Total number of joints.
        - $V_{k,j}$: Simulated velocity of joint $j$ at frame $k$.
        - $\hat{V}_{k,j}$: Reference velocity of joint $j$ at frame $k$.
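To make the five metric definitions above concrete, here is a minimal NumPy sketch (not from the paper's codebase) that evaluates a single clip. The `(T, J, 3)` array layout, the root-joint index, and the use of the 30 Hz frame rate as the finite-difference spacing are illustrative assumptions.

```python
import numpy as np

# sim_pos / ref_pos: simulated and reference global joint positions, shape (T, J, 3).

def episode_success(sim_pos, ref_pos, threshold_m=0.5):
    """Succ criterion: fail if the per-frame mean joint deviation ever exceeds 0.5 m."""
    mean_dev = np.linalg.norm(sim_pos - ref_pos, axis=-1).mean(axis=-1)  # shape (T,)
    return bool(np.all(mean_dev < threshold_m))

def global_mpjpe(sim_pos, ref_pos):
    """E_g-mpjpe: mean per-joint position error without root alignment."""
    return float(np.linalg.norm(sim_pos - ref_pos, axis=-1).mean())

def root_relative_mpjpe(sim_pos, ref_pos, root_idx=0):
    """E_mpjpe: error after subtracting each pose's own root position."""
    sim_rel = sim_pos - sim_pos[:, root_idx:root_idx + 1]
    ref_rel = ref_pos - ref_pos[:, root_idx:root_idx + 1]
    return float(np.linalg.norm(sim_rel - ref_rel, axis=-1).mean())

def velocity_error(sim_pos, ref_pos, fps=30.0):
    """E_vel: mean L2 difference of finite-difference joint velocities."""
    dt = 1.0 / fps
    sim_vel = np.diff(sim_pos, axis=0) / dt
    ref_vel = np.diff(ref_pos, axis=0) / dt
    return float(np.linalg.norm(sim_vel - ref_vel, axis=-1).mean())

def acceleration_error(sim_pos, ref_pos, fps=30.0):
    """E_acc: mean L2 difference of second finite-difference joint accelerations."""
    dt = 1.0 / fps
    sim_acc = np.diff(sim_pos, n=2, axis=0) / dt ** 2
    ref_acc = np.diff(ref_pos, n=2, axis=0) / dt ** 2
    return float(np.linalg.norm(sim_acc - ref_acc, axis=-1).mean())
```

The reported success rate is then the percentage of clips for which `episode_success` returns True, while the error metrics are averaged over frames of successful episodes.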
5.3. Baselines
The main baseline for comparison is the state-of-the-art (SOTA) motion imitator UHC [20].
- UHC [20]: The Universal Humanoid Controller, which previously achieved a 97% success rate on the AMASS dataset.
    - Comparison modes: PHC is compared against UHC in two configurations:
        - UHC with Residual Force Control (RFC): This is the original, high-performing UHC setup that uses external forces for stabilization.
        - UHC without Residual Force Control (RFC): This configuration removes the external stabilizing forces from UHC to provide a fair comparison on physical realism and inherent stability. This is crucial for evaluating PHC's ability to perform without such a "hand of God."
5.4. Implementation Details
- Primitives: The PHC uses four primitives in total, including the fail-state recovery primitive.
- Training Time: Training all primitives and the composer takes approximately one week on a single NVIDIA A100 GPU.
- Real-time Performance: Once trained, the composite policy runs at real-time frame rates, sufficient for interactive applications (the live demo in Section 6.3 runs at 30 fps).
- Physics Simulator: NVIDIA's Isaac Gym [24] is used for physics simulation. Isaac Gym is known for its high-performance, GPU-accelerated physics simulation, which is critical for RL training efficiency.
- Control Frequency: The control policy runs at 30 Hz.
- Simulation Frequency: The physics simulation runs at 60 Hz. This means the physics engine takes two simulation steps for each control step (see the sketch after this list).
- Body Shape: For evaluation purposes, body shape variation is not considered, and the mean SMPL body shape is used. This simplifies the evaluation and focuses on motion imitation and recovery capabilities.
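The 30 Hz control / 60 Hz simulation split corresponds to a decimation of two physics steps per policy step. The sketch below is only a schematic of that loop; `policy`, `env`, and `pd_controller` are placeholders, not the paper's or Isaac Gym's actual API.

```python
CONTROL_HZ = 30
SIM_HZ = 60
DECIMATION = SIM_HZ // CONTROL_HZ  # 2 physics steps per control step

def rollout_step(policy, env, obs):
    """One 30 Hz control step: query the policy once, step physics twice."""
    # The policy outputs PD targets for the humanoid's actuated joints.
    pd_targets = policy.act(obs)
    for _ in range(DECIMATION):
        # A PD controller converts the targets to joint torques at the 60 Hz sim rate.
        torques = env.pd_controller(pd_targets)
        env.step_physics(torques, dt=1.0 / SIM_HZ)
    return env.get_observation()
```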
6. Results & Analysis
The experimental results demonstrate the PHC's superior performance in motion imitation on both high-quality and noisy data, as well as its robust fail-state recovery capabilities.
6.1. Motion Imitation
6.1.1. Motion Imitation on High-quality MoCap
The following are the results from Table 1 of the original paper:
| Method | RFC | AMASS-Train* Succ ↑ | Eg-mpjpe ↓ | Empjpe ↓ | Eacc ↓ | Evel ↓ | AMASS-Test* Succ ↑ | Eg-mpjpe ↓ | Empjpe ↓ | Eacc ↓ | Evel ↓ | H36M-Motion* Succ ↑ | Eg-mpjpe ↓ | Empjpe ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UHC | ✓ | 97.0% | 36.4 | 25.1 | 4.4 | 5.9 | 96.4% | 50.0 | 31.2 | 9.7 | 12.1 | 87.0% | 59.7 | 35.4 | 4.9 | 7.4 |
| UHC | ✗ | 84.5% | 62.7 | 39.6 | 10.9 | 10.9 | 62.6% | 58.2 | 98.1 | 22.8 | 21.9 | 23.6% | 133.14 | 67.4 | 14.9 | 17.2 |
| Ours | ✗ | 98.9% | 37.5 | 26.9 | 3.3 | 4.9 | 96.4% | 47.4 | 30.9 | 6.8 | 9.1 | 92.9% | 50.3 | 33.3 | 3.7 | 5.5 |
| Ours-kp | ✗ | 98.7% | 40.7 | 32.3 | 3.5 | 5.5 | 97.1% | 53.1 | 39.5 | 7.5 | 10.4 | 95.7% | 49.5 | 39.2 | 3.7 | 5.8 |

Table 1: Quantitative results on imitating MoCap motion sequences (* indicates removing sequences containing human-object interaction). AMASS-Train*, AMASS-Test*, and H36M-Motion* are high-quality MoCap datasets. RFC: Residual Force Control. Succ ↑: higher is better. Eg-mpjpe, Empjpe, Eacc, Evel ↓: lower is better.
Analysis:
- PHC (Ours) vs. UHC with RFC: The PHC (Ours, no RFC) generally outperforms UHC with RFC on almost all metrics across all datasets.
    - On AMASS-Train*, PHC achieves a success rate of 98.9% (vs. 97.0% for UHC with RFC), while having comparable or slightly higher MPJPEs (Eg-mpjpe 37.5 vs. 36.4, Empjpe 26.9 vs. 25.1) but significantly lower acceleration (3.3 vs. 4.4) and velocity (4.9 vs. 5.9) errors. This indicates PHC can imitate training sequences with higher physical realism and smoothness while achieving a higher success rate.
    - On AMASS-Test* and H36M-Motion* (unseen data), PHC maintains a high success rate (96.4% and 92.9%, respectively), which is comparable to or better than UHC with RFC (96.4% and 87.0%). Crucially, PHC consistently shows lower acceleration and velocity errors, indicating more natural and physically plausible motion even on unseen data.
- PHC (Ours) vs. UHC without RFC: UHC trained without RFC performs significantly worse, especially on the test sets. Its success rate drops dramatically (AMASS-Test*: 62.6%, H36M-Motion*: 23.6%). It also exhibits much higher acceleration and velocity errors, suggesting it struggles to stay balanced and resorts to high-frequency, unnatural movements. This highlights the effectiveness of PHC's design in achieving stability without external aids.
- Keypoint-based Controller (Ours-kp): Surprisingly, the keypoint-based controller (Ours-kp) is competitive with, and in some cases even outperforms, the rotation-based controller (Ours). For instance, on H36M-Motion*, Ours-kp achieves the highest success rate of 95.7% with comparable error metrics. This suggests that providing only 3D keypoints can be a strong and potentially simpler input modality. The authors hypothesize this is because the keypoint-based controller has more freedom in joint configuration to match the keypoints, making it more robust.

Overall: PHC demonstrates superior motion imitation capabilities compared to the RFC-enabled UHC while entirely abstaining from external forces, leading to more physically realistic and natural motions. Its progressive training and AMP integration are likely key contributors to this performance.
6.1.2. Motion Imitation on Noisy Input from Video
The following are the results from Table 2 of the original paper:
| Method | RFC | Pose Estimate | Succ ↑ | Eg-mpjpe ↓ | Empjpe ↓ |
| --- | --- | --- | --- | --- | --- |
| UHC | ✓ | HybrIK + MeTRAbs (root) | 58.1% | 75.5 | 49.3 |
| UHC | ✗ | HybrIK + MeTRAbs (root) | 18.1% | 126.1 | 67.1 |
| Ours | ✗ | HybrIK + MeTRAbs (root) | 88.7% | 55.4 | 34.7 |
| Ours-kp | ✗ | HybrIK + MeTRAbs (root) | 90.0% | 55.8 | 41.0 |
| Ours-kp | ✗ | MeTRAbs (all keypoints) | 91.9% | 55.7 | 41.1 |

Table 2: Motion imitation on noisy motion from the H36M-Test-Video* split. HybrIK [17] is used to estimate joint rotations and MeTRAbs [39] to estimate global 3D keypoints. HybrIK + MeTRAbs (root): using joint rotations from HybrIK and the root position from MeTRAbs. MeTRAbs (all keypoints): using all keypoints from MeTRAbs, only applicable to our keypoint-based controller.
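To clarify what the "HybrIK + MeTRAbs (root)" input means in practice, here is a hypothetical sketch of how the two estimates could be merged into one reference frame for the rotation-based controller. The rotation format and the dictionary layout are illustrative assumptions, not the paper's actual data pipeline.

```python
import numpy as np

def build_reference_frame(hybrik_rotations: np.ndarray,
                          metrabs_keypoints: np.ndarray,
                          root_idx: int = 0) -> dict:
    """Combine per-joint rotations from HybrIK with the global root from MeTRAbs.

    hybrik_rotations: (J, 3, 3) rotation matrices (format is an assumption).
    metrabs_keypoints: (J, 3) global 3D keypoints; only the root is used here.
    """
    return {
        "joint_rotations": hybrik_rotations,           # body pose from HybrIK
        "root_position": metrabs_keypoints[root_idx],  # global translation from MeTRAbs
    }
```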
Analysis:
- Robustness to Noisy Input: This table evaluates performance on H36M-Test-Video*, which contains noisy pose estimates from video. This is a highly challenging scenario due to depth ambiguity, monocular global pose estimation, and depth-wise jitter.
- PHC (Ours) vs. UHC: PHC (Ours, no RFC) significantly outperforms both UHC variants. With HybrIK + MeTRAbs (root) input, PHC achieves an 88.7% success rate, far surpassing UHC with RFC (58.1%) and UHC without RFC (18.1%).
    - The MPJPEs for PHC are also substantially lower (Eg-mpjpe 55.4, Empjpe 34.7) compared to UHC with RFC (Eg-mpjpe 75.5, Empjpe 49.3) and UHC without RFC (Eg-mpjpe 126.1, Empjpe 67.1). This demonstrates PHC's exceptional robustness to noisy motion, making it practical for video-driven avatar control.
- Keypoint-based Controller (Ours-kp) Advantage:
    - The keypoint-based controller (Ours-kp) consistently performs best. When using HybrIK + MeTRAbs (root) input, it reaches a 90.0% success rate.
    - When provided with MeTRAbs (all keypoints), Ours-kp achieves the highest success rate of 91.9%.
    - Explanation: The authors give two reasons for Ours-kp's superior performance:
        - Estimating 3D keypoints directly from images may be an easier task than estimating joint rotations, leading to higher-quality MeTRAbs keypoints compared to HybrIK rotations.
        - The keypoint-based controller has more freedom to find a joint configuration that matches the given keypoints, which makes it inherently more robust to noisy input that might specify inconsistent rotations.
6.2. Ablation Studies
The following are the results from Table 3 of the original paper:
| RET | MCP | PNN | Rotation | Fail-Recover | Succ ↑ | Eg-mpjpe ↓ | Empjpe ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✗ | ✗ | ✗ | ✓ | ✗ | 51.2% | 56.2 | 34.4 |
| ✓ | ✗ | ✗ | ✓ | ✗ | 59.4% | 60.2 | 37.2 |
| ✓ | ✓ | ✗ | ✓ | ✗ | 66.2% | 59.0 | 38.3 |
| ✓ | ✓ | ✓ | ✓ | ✗ | 86.9% | 53.1 | 33.7 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 88.7% | 55.4 | 34.7 |
| ✓ | ✓ | ✓ | ✗ | ✓ | 90.0% | 55.8 | 41.0 |

Table 3: Ablation on components of our pipeline, performed using noisy pose estimates from HybrIK + MeTRAbs (root) on the H36M-Test-Video* data. RET: relaxed early termination. MCP: multiplicative control policy. PNN: progressive neural networks.
Analysis:
The ablation study is performed on the H36M-Test-Video* dataset with noisy pose estimates to highlight the impact of each component on robustness.
- Impact of Relaxed Early Termination (RET) (R1 vs. R2):
    - R1 (no RET) achieves a 51.2% success rate.
    - R2 (with RET) improves to 59.4%.
    - Finding: RET significantly boosts the success rate by allowing the ankle and toe joints to slightly deviate for better balance, confirming its importance. The Eg-mpjpe and Empjpe metrics are slightly higher for R2, suggesting a minor trade-off in exact pose matching for increased stability.
- Impact of Multiplicative Control Policy (MCP) without Progressive Training (R2 vs. R3):
    - R2 (no MCP, no PNN) has 59.4% success.
    - R3 (with MCP but no PNN, i.e., MCP trained without progressive primitive learning) improves to 66.2% success.
    - Finding: Even without progressive training, MCP provides a boost in performance, likely due to its enlarged network capacity and its ability to blend the outputs of multiple experts (a schematic sketch of this multiplicative blending follows this list).
- Impact of Progressive Neural Networks (PNN) / PMCP Pipeline (R3 vs. R4):
    - R3 (with MCP, no PNN) has 66.2% success.
    - R4 (with MCP and PNN, representing the PMCP pipeline for imitation only) jumps to 86.9% success.
    - Finding: The full PMCP pipeline (which combines PNN's progressive learning with MCP's composition) significantly boosts robustness and imitation performance. This validates that progressively learning with new network capacity is crucial for handling diverse and harder motion sequences effectively and for mitigating catastrophic forgetting.
- Impact of Fail-state Recovery (R4 vs. R5):
    - R4 (full PMCP for imitation, no fail-state recovery) achieves 86.9% success.
    - R5 (full PMCP for imitation and fail-state recovery) achieves 88.7% success.
    - Finding: Adding fail-state recovery capability improves the success rate without compromising motion imitation. This is a strong result, demonstrating that PMCP is effective in adding new tasks without catastrophic forgetting, and that the recovery primitive contributes positively to overall robustness, even for typical imitation tasks (by providing graceful handling of failures).
- Impact of Keypoint-based vs. Rotation-based (R5 vs. R6):
    - R5 (rotation-based with full PMCP and fail-state recovery) has 88.7% success.
    - R6 (keypoint-based with full PMCP and fail-state recovery) has 90.0% success.
    - Finding: The keypoint-based controller (Ours-kp) outperforms the rotation-based one on noisy video input. This reinforces the idea that keypoint-based imitation can be a simpler and more robust alternative, especially when input quality is compromised.
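The multiplicative blending referenced above can be pictured as a precision-weighted product of Gaussian primitives. The sketch below is a generic illustration in the spirit of the MCP formulation the paper builds on; the exact per-dimension weighting used in the paper's implementation may differ.

```python
import numpy as np

def compose_primitives(means, stds, weights, eps: float = 1e-8):
    """Multiplicatively compose Gaussian primitive action distributions.

    means, stds: arrays of shape (num_primitives, action_dim).
    weights: non-negative composer weights, shape (num_primitives,).
    Returns the mean and scale of the composite (precision-weighted) Gaussian.
    """
    w = np.asarray(weights)[:, None]            # (P, 1), broadcast over action dims
    precision = (w / (stds + eps)).sum(axis=0)  # higher weight => more influence
    comp_std = 1.0 / (precision + eps)
    comp_mean = comp_std * (w / (stds + eps) * means).sum(axis=0)
    return comp_mean, comp_std
```

Because the composition happens in distribution space, a primitive with near-zero weight contributes almost nothing, which is what lets the composer "switch" between imitation and recovery behaviors smoothly.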
6.3. Real-time Simulated Avatars
The paper demonstrates PHC's capability for real-time simulated avatars.
- A live demo (30 fps) is shown where a webcam video is used as input.
- The keypoint-based controller and MeTRAbs-estimated keypoints are used in a streaming fashion (a schematic loop is sketched below).
- The humanoid can perform various motions like posing and jumping while remaining stable.
- The controller can also imitate reference motion generated by a motion language model like MDM [41]. The recovery ability of PHC is crucial for in-betweening (smoothly connecting) the disjoint motion sequences generated by MDM.
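A schematic of the streaming setup described above, assuming hypothetical `keypoint_estimator` and `controller.step` interfaces (neither is the actual MeTRAbs or PHC API); only the OpenCV webcam capture calls are real library functions.

```python
import cv2  # OpenCV, used here only for webcam capture

def run_live_avatar(controller, keypoint_estimator, camera_id: int = 0):
    """Drive the simulated avatar from webcam frames in a streaming fashion."""
    cap = cv2.VideoCapture(camera_id)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Placeholder: a MeTRAbs-style estimator returning (J, 3) global keypoints.
        keypoints_3d = keypoint_estimator(frame)
        # Placeholder: the keypoint-based controller consumes the keypoints as the
        # next-frame reference and advances the simulation one control step.
        controller.step(keypoints_3d)
    cap.release()
```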
6.4. Fail-state Recovery
The following are the results from Table 4 of the original paper:
| Method | Fallen-State Succ-5s ↑ | Succ-10s ↑ | Far-State Succ-5s ↑ | Succ-10s ↑ | Fallen + Far-State Succ-5s ↑ | Succ-10s ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Ours | 95.0% | 98.8% | 83.7% | 99.5% | 93.4% | 98.8% |
| Ours-kp | 92.5% | 94.6% | 95.1% | 96.0% | 79.4% | 93.2% |

Table 4: We measure whether our controller can recover from fail-states by generating these scenarios (dropping the humanoid on the ground and/or placing it far from the reference motion) and measuring the time it takes to resume tracking. Succ-5s ↑: percentage of successful recoveries within 5 seconds. Succ-10s ↑: percentage of successful recoveries within 10 seconds.
Analysis:
This table evaluates the humanoid's ability to recover from various fail-states and resume tracking a standing-still reference motion. Recovery is considered successful if the humanoid reaches the reference motion within a given time frame (5 or 10 seconds).
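As a rough illustration of this recovery-success check, here is a hedged sketch. The `controller.step` interface, the assumption that the environment starts in a fail-state, and the 0.5 m distance threshold are illustrative choices, not the paper's published protocol.

```python
import numpy as np

def recovers_within(controller, env, ref_pose, time_limit_s: float,
                    fps: int = 30, dist_threshold_m: float = 0.5) -> bool:
    """Return True if the humanoid gets close to the reference within the time limit.

    `env` is assumed to already be initialized in a fail-state (fallen and/or far
    from the reference); `ref_pose` holds the standing-still reference joint positions.
    """
    for _ in range(int(time_limit_s * fps)):
        sim_pose = controller.step(env, ref_pose)  # placeholder control step, (J, 3)
        mean_dev = np.linalg.norm(sim_pose - ref_pose, axis=-1).mean()
        if mean_dev < dist_threshold_m:
            return True
    return False
```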
- High Success Rates: Both the rotation-based (Ours) and keypoint-based (Ours-kp) controllers demonstrate very high success rates for fail-state recovery.
    - For Fallen-State and Far-State, both methods achieve success rates well above 90% within 10 seconds, and often above 80-90% within 5 seconds. This shows excellent capability in getting up and moving back to the reference.
    - Even in the most challenging Fallen + Far-State scenario (where the humanoid is both lying down and far from the target), Ours achieves 93.4% within 5 s and 98.8% within 10 s, while Ours-kp achieves 79.4% and 93.2%, respectively.
- Minor Differences between Rotation-based and Keypoint-based:
    - Ours (rotation-based) shows slightly higher success rates for Fallen-State and Fallen + Far-State in the 5 s window, suggesting it might be slightly quicker to recover in some complex scenarios.
    - Ours-kp (keypoint-based) performs exceptionally well in the Far-State (95.1% in 5 s), even outperforming Ours. This indicates it might be particularly effective at navigating towards a distant target.
- Conclusion: The results robustly confirm that PHC can effectively and rapidly recover from diverse fail-states, ensuring perpetual control without the need for manual resets. The AMP reward and dedicated recovery primitive are key to this natural and stable recovery.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces the Perpetual Humanoid Controller (PHC), a significant advancement in physics-based humanoid control for real-time simulated avatars. The PHC achieves high-fidelity motion imitation of large and diverse MoCap datasets (98.9% on AMASS-Train*) while rigorously adhering to physical realism by eliminating the need for external stabilizing forces. A core innovation, the progressive multiplicative control policy (PMCP), enables efficient scalability to tens of thousands of motion clips and the seamless integration of new tasks, such as natural fail-state recovery, without suffering from catastrophic forgetting. The PHC is demonstrated to be highly robust to noisy input originating from video-based pose estimators and language-based motion generators, making it a practical and task-agnostic solution for live, real-time multi-person avatar applications. Its ability to recover gracefully from various fail-states (fallen, faraway, or combined) and resume motion imitation ensures continuous, perpetual control without disruptive resets.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Dynamic Motion Imitation: While PHC achieves high success rates, it doesn't reach 100% on the training set. Highly dynamic motions such as high jumping and backflipping remain challenging. The authors hypothesize that learning such motions, especially when combined with simpler movements, requires more planning and intent than what is conveyed by a single next-frame pose target.
- Training Time: The progressive training procedure of PMCP results in a relatively long training time (around a week on an A100 GPU).
- Disjoint Process for Downstream Tasks: The current setup involves a disjoint process where video pose estimators or motion generators operate independently of the physics simulation. For enhanced downstream tasks, tighter integration between pose estimation [52, 21], language-based motion generation [51], and the physics-based controller is needed.

Based on these limitations, the authors suggest the following future work:

- Improved Imitation Capability: Focus on learning to imitate 100% of the motion sequences in the training set, particularly challenging dynamic movements. This might involve incorporating longer-term planning or intent into the controller.
- Terrain and Scene Awareness: Integrate terrain and scene awareness into the controller to enable more complex human-object interactions and navigation in varied environments.
- Tighter Integration with Downstream Tasks: Develop more integrated systems where the motion estimator or generator and the physics-based controller can co-adapt or provide feedback to each other.
7.3. Personal Insights & Critique
The Perpetual Humanoid Controller (PHC) paper is highly impactful, addressing several critical challenges in physics-based character control.
Inspirations and Strengths:
- Perpetual Control: The concept of "perpetual control" without resets is a game-changer for real-time avatar applications. It makes simulated characters feel more alive and responsive to continuous, real-world input, which is crucial for VR/AR and metaverse applications.
- Physical Realism without Compromise: The commitment to avoiding external forces is commendable. While RFC is effective for stabilization, it is an artificial crutch. PHC demonstrates that high-performance motion imitation is achievable within true physical realism, setting a new bar for plausibility.
- PMCP for Scalability: The progressive multiplicative control policy is an elegant solution to catastrophic forgetting and the scalability problem in RL. By dynamically expanding capacity and composing primitives, it allows a single policy to master a vast motion repertoire and new tasks efficiently. This architecture could inspire similar solutions in other multi-task or curriculum learning RL domains.
- Robustness to Noisy Input: The PHC's demonstrated robustness to noisy pose estimates from video is highly practical. Real-world pose estimation is inherently imperfect, and a controller that can gracefully handle these imperfections is invaluable for real-world deployment. The finding that keypoint-based control can be superior in noisy conditions is a valuable insight.
- Natural Fail-state Recovery: The seamless and natural recovery from falls, without jerky movements or teleportation, adds a significant layer of realism and usability. The integration of this into the overall PMCP framework, rather than as a separate, ad-hoc module, is well-designed.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Computational Cost of PMCP: While PMCP is efficient in terms of learning, maintaining multiple primitive networks and a composer, even with frozen weights, adds to the memory footprint and computational complexity during inference compared to a single monolithic policy. The paper reports that the trained policy is fast enough for real-time use, which is good, but for extremely resource-constrained devices this could be a factor.
- Planning Horizon for Dynamic Motions: The limitation regarding highly dynamic motions (jumping, backflipping) is insightful. It highlights a common challenge in imitation learning: faithfully replicating momentary targets might not capture the long-term planning and intent required for such actions. Future work might explore hierarchical RL or latent skill discovery to address this, where higher-level policies dictate intent over longer horizons.
- Generalizability to Diverse Environments: The current work focuses on flat ground. Incorporating terrain and scene awareness is listed as future work, but this is a substantial challenge. Handling uneven terrain, stairs, or complex object interactions (beyond just imitating MoCap) requires significant extensions to the state space and reward functions.
- AMP Discriminator Bias: While AMP is powerful for producing natural motion, its discriminator is trained on MoCap data. If the MoCap itself has certain biases or limitations (e.g., a lack of extreme motions or specific interaction types), the AMP reward might inadvertently limit the policy's ability to explore truly novel or very energetic behaviors beyond its training distribution.
- Hyperparameter Tuning: RL systems with multiple reward terms, multiple policies, and progressive training stages often have a large number of hyperparameters (e.g., reward weights, PPO parameters, PD gains, PMCP primitive count, freeze points). The effort required to tune these for optimal performance can be substantial.

Overall, PHC makes a substantial contribution towards realizing robust and realistic real-time avatars. Its PMCP architecture for scalable multi-task learning in RL is particularly noteworthy and has broader implications beyond humanoid control.