Paper status: completed

Perpetual Humanoid Control for Real-time Simulated Avatars

Published: 10/01/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper presents a physics-based humanoid controller for high-fidelity motion imitation and fault tolerance. The progressive multiplicative control policy (PMCP) enables dynamic resource allocation, facilitating large-scale learning and task expansion without catastrophic forgetting.

Abstract

We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g. pose estimates from video or generated from language) and unexpected falls. Our controller scales up to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-state. Given reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn harder and harder motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live and real-time multi-person avatar use case.


In-depth Reading


1. Bibliographic Information

1.1. Title

Perpetual Humanoid Control for Real-time Simulated Avatars

1.2. Authors

The authors of this paper are Zhengyi Luo, Jinkun Cao, Alexander Winkler, Kris Kitani, and Weipeng Xu.

  • Zhengyi Luo: Affiliated with Reality Labs Research, Meta, and Carnegie Mellon University.

  • Jinkun Cao: Affiliated with Carnegie Mellon University.

  • Alexander Winkler: Affiliated with Reality Labs Research, Meta.

  • Kris Kitani: Affiliated with Reality Labs Research, Meta, and Carnegie Mellon University.

  • Weipeng Xu: Affiliated with Reality Labs Research, Meta.

    Their affiliations suggest a background in computer vision, robotics, reinforcement learning, and computer graphics, with a focus on real-world applications in virtual reality and simulated environments, given their connection to Meta's Reality Labs Research.

1.3. Journal/Conference

The paper was published at ICCV 2023. ICCV (International Conference on Computer Vision) is one of the top-tier conferences in the field of computer vision, highly regarded for publishing cutting-edge research. Its influence is substantial, attracting a global audience of researchers and practitioners. Publication at ICCV indicates a high level of rigor, novelty, and impact in the research presented.

1.4. Publication Year

2023

1.5. Abstract

This paper introduces a physics-based humanoid controller designed for real-time simulated avatars, aiming for high-fidelity motion imitation and fault-tolerant behavior. The controller can handle noisy input, such as pose estimates from video or language-generated motions, and naturally recover from unexpected falls. It scales to learning ten thousand motion clips without relying on external stabilizing forces and can perpetually control simulated avatars without requiring resets. The core innovation is the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn increasingly difficult motion sequences. PMCP enables efficient scaling for large motion databases and the integration of new tasks, like fail-state recovery, without suffering from catastrophic forgetting. The controller's effectiveness is demonstrated by its ability to imitate noisy poses from video-based pose estimators and language-based motion generators in live, real-time multi-person avatar scenarios.

Official version: https://openaccess.thecvf.com/content/ICCV2023/papers/Luo_Perpetual_Humanoid_Control_for_Real-time_Simulated_Avatars_ICCV_2023_paper.pdf (officially published at ICCV 2023).

2. Executive Summary

2.1. Background & Motivation

The creation of realistic and interactive human motion in simulated environments is a long-standing goal in computer graphics and robotics, with significant potential for virtual avatars and human-computer interaction. However, controlling high-degree-of-freedom (DOF) humanoids in physics-based simulations presents substantial challenges.

Core Problem: Existing physics-based controllers struggle with maintaining stability and imitating motion faithfully, especially under two critical real-world conditions:

  1. Noisy Input: When the reference motion comes from imperfect sources like video-based pose estimators or language-based motion generators, it often contains artifacts such as floating, foot sliding, or physically impossible poses. Current controllers tend to fall or deviate significantly.
  2. Unexpected Falls and Failures: Simulated humanoids can easily lose balance. Most prior methods resort to resetting the humanoid to a kinematic pose upon failure. This leads to teleportation artifacts and is highly undesirable for real-time virtual avatar applications where perpetual control is needed. Moreover, resetting to a noisy reference pose can create a vicious cycle of falling and re-resetting.

Challenges/Gaps in Prior Research:

  • Scalability: Learning to imitate large-scale motion datasets (e.g., AMASS with tens of thousands of clips) with a single policy is largely unachieved due to the diversity and complexity of human motion.

  • Physical Realism vs. Stability: Many successful motion imitators, like UHC, rely on residual force control (RFC), which applies non-physical "stabilizing forces." While effective for stability, RFC compromises physical realism and can introduce artifacts like flying or floating.

  • Catastrophic Forgetting: As Reinforcement Learning (RL) policies are trained on diverse or sequential tasks, they often forget previously learned skills when acquiring new ones. This is a major hurdle for scaling to large datasets and integrating multiple capabilities (like imitation and recovery).

  • Natural Recovery: Existing fail-safe mechanisms often result in unnatural teleportation or require distinct recovery policies that might not track the original motion smoothly.

    Paper's Entry Point / Innovative Idea: The paper addresses these limitations by aiming to create a single, robust, physics-based humanoid controller called Perpetual Humanoid Controller (PHC) that operates without external forces, can handle noisy inputs, and naturally recovers from fail-states to resume imitation. The core innovation for scalability and multi-task learning is the progressive multiplicative control policy (PMCP).

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  1. Perpetual Humanoid Controller (PHC): Introduction of a novel physics-based humanoid controller that successfully imitates 98.9% of the AMASS dataset without employing any external forces, thus maintaining physical realism. This controller can perpetually operate simulated avatars in real-time without requiring manual resets, even when facing noisy input or unexpected falls.
  2. Progressive Multiplicative Control Policy (PMCP): Proposal of PMCP, a novel RL training strategy that allows the controller to learn from large motion datasets and integrate new capabilities (like fail-state recovery) without suffering from catastrophic forgetting. PMCP dynamically allocates new network capacity (primitives) to progressively learn harder motion sequences and additional tasks, making the learning process efficient and scalable.
  3. Robustness and Task-Agnostic Design: Demonstration that PHC is robust to noisy inputs from off-the-shelf video-based pose estimators (e.g., HybrIK, MeTRAbs) and language-based motion generators (e.g., MDM). It serves as a drop-in solution for real-time virtual avatar applications, supporting both rotation-based and keypoint-based imitation, with the latter even outperforming the former in noisy conditions.

Key Conclusions/Findings:

  • The PHC achieves state-of-the-art motion imitation success rates on large MoCap datasets without compromising physical realism by avoiding external forces.
  • PMCP effectively mitigates catastrophic forgetting, enabling a single policy to learn an extensive range of motions and fail-state recovery behaviors.
  • The controller is highly fault-tolerant, capable of naturally recovering from fallen or far-away states and seamlessly re-engaging with the reference motion.
  • PHC can directly drive real-time simulated avatars using noisy input from live video streams or language-based motion generation, demonstrating its practical applicability.
  • The keypoint-based imitation variant of PHC shows surprising robustness and performance, especially with noisy inputs, suggesting a simpler and potentially more robust input modality for certain applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the Perpetual Humanoid Controller (PHC) and its underlying progressive multiplicative control policy (PMCP), familiarity with several core concepts in robotics, computer graphics, and machine learning is beneficial.

  • Humanoid Control: This refers to the methods and algorithms used to make virtual or physical human-like robots move and interact with their environment. The goal is often to produce realistic, stable, and responsive behaviors.

    • High-Degree-of-Freedom (DOF) Humanoids: A degree of freedom is an independent parameter that defines the state of a physical system. For a humanoid, each joint (e.g., shoulder, elbow, knee) can have multiple DOFs (e.g., pitch, yaw, roll rotations). A high-DOF humanoid means it has many such joints and rotation/translation parameters, making its control space very complex.
    • Physics-based Simulation: Instead of simply playing back pre-recorded motions (kinematic control), physics-based simulation involves modeling the humanoid's mass, inertia, joints, and forces (gravity, friction, joint torques). A physics engine then calculates how the humanoid moves and interacts with its environment according to the laws of physics. This leads to more realistic and robust behaviors but is also much harder to control.
  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal.

    • Agent: The decision-making entity (in this paper, the humanoid controller).
    • Environment: The simulated world where the humanoid operates (including its physics, the ground, etc.).
    • State ($\mathcal{S}$): A complete description of the environment at a given time step (e.g., the humanoid's joint positions, velocities, orientation, root position, and goal information).
    • Action ($\mathcal{A}$): The output of the agent that influences the environment (in this paper, the PD targets that the per-joint PD controllers convert into torques).
    • Reward ($\mathcal{R}$): A scalar feedback signal from the environment indicating how good or bad the agent's last action was (e.g., a high reward for imitating motion closely, a penalty for falling).
    • Policy ($\pi$): The agent's strategy for choosing actions given a state. The goal of RL is to learn an optimal policy.
    • Markov Decision Process (MDP): The mathematical framework for modeling RL problems. An MDP is defined by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$:
      • $\mathcal{S}$: A set of possible states.
      • $\mathcal{A}$: A set of possible actions.
      • $\mathcal{T}$: The transition dynamics, specifying the probability $P(s' \mid s, a)$ of reaching state $s'$ after taking action $a$ in state $s$.
      • $\mathcal{R}$: The reward function, defining the immediate reward $R(s, a)$ for taking action $a$ in state $s$.
      • $\gamma$: The discount factor ($0 \le \gamma \le 1$), which determines the present value of future rewards.
    • Proximal Policy Optimization (PPO): A popular RL algorithm used in this paper. PPO is an on-policy algorithm that strikes a balance between ease of implementation, sample efficiency, and good performance. It works by making small updates to the policy to avoid large changes that could destabilize training, ensuring that the new policy does not stray too far from the old one, often via a clipped objective function.
  • Neural Networks:

    • Multi-layer Perceptron (MLP): A fundamental type of artificial neural network consisting of at least three layers: an input layer, one or more hidden layers, and an output layer. Each layer is fully connected to the next. MLPs are general-purpose function approximators.
  • Control Systems:

    • Proportional-Derivative (PD) Controller: A classic feedback control loop mechanism. For a joint, a PD controller calculates the torque required to move the joint to a target position by considering two terms:
      1. Proportional (P) Term: Proportional to the difference (error) between the target position and the current position. This term drives the joint towards the target.
      2. Derivative (D) Term: Proportional to the rate of change of the error (i.e., the current velocity difference). This term damps oscillations and reduces overshoot. The RL policy in this paper outputs the target positions for these PD controllers, which then generate the actual joint torques.
  • Human Body Models:

    • SMPL (Skinned Multi-Person Linear) Model: A widely used 3D statistical human body model. SMPL can represent a wide variety of human body shapes and poses using a small number of parameters. It provides a mesh (a 3D surface model) that can be skinned (attached to a skeleton) and deformed based on pose and shape parameters. This makes it suitable for animating realistic human motion.
  • Adversarial Learning:

    • Adversarial Motion Prior (AMP): A technique that uses an adversarial discriminator to encourage generated motions to look "human-like" and natural. Instead of just matching a reference motion (which might be noisy or incomplete), AMP trains a discriminator to distinguish between real MoCap data and simulated motions. The RL agent receives a reward from this discriminator for producing motions that the discriminator classifies as "real." This helps the agent learn natural motion styles and recover gracefully, especially in scenarios where direct motion imitation might lead to unnatural artifacts.
  • Challenges in Sequential Learning:

    • Catastrophic Forgetting: A significant problem in artificial neural networks where, when a network is sequentially trained on new tasks, its performance on previously learned tasks drastically degrades. The weights in the network that were adapted for earlier tasks are overwritten by the new training, causing the network to "forget" what it learned before.

3.2. Previous Works

The paper extensively discusses prior research in physics-based motion imitation, fail-state recovery, and progressive reinforcement learning.

3.2.1. Physics-based Motion Imitation

Early work in physics-based characters focused on specific, small-scale tasks or required significant manual tuning. The advent of Reinforcement Learning (RL) brought about methods capable of learning complex locomotion and interaction skills.

  • Small-scale use cases: Many RL approaches focused on interactive control based on user input [43, 2, 33, 32], specific sports actions [46, 18, 26], or modular tasks like reaching goals [47] or dribbling [33]. These often involved training on limited motion data or for specific skills.
  • Scaling to datasets: Imitating large motion datasets has been a harder challenge.
    • DeepMimic [29] was a pioneering work that learned to imitate single motion clips, demonstrating RL's potential for realistic physics-based motion.
    • ScaDiver [45] attempted to scale to larger datasets (like the CMU MoCap dataset) by using a mixture of expert policies, achieving around 80% success rate (measured by time to failure). A mixture of experts (MOE) typically involves multiple expert networks, with a gating network deciding which expert or combination of experts to use for a given input. The paper notes that ScaDiver had some success but did not scale to the largest datasets.
    • Unicon [43] showed qualitative results in imitation and transfer but did not quantify performance on large datasets.
    • MoCapAct [42] learned single-clip experts on CMU MoCap and then distilled them into a single policy, achieving 80% of the experts' performance.
  • UHC (Universal Humanoid Control) [20]: This is the closest and most relevant prior work to PHC. UHC successfully imitated 97% of the AMASS dataset. However, its success heavily relied on residual force control (RFC) [50].
    • Residual Force Control (RFC) [50, 49]: A technique where an external force is applied at the root of the humanoid to assist in balancing and tracking reference motion. This non-physical force acts like a "hand of God," making it easier for the RL agent to maintain stability, especially for challenging or noisy motions.
      • Drawbacks of RFC: As highlighted by the authors, RFC compromises physical realism and can introduce artifacts like floating or swinging, particularly with complex motions. The PHC paper explicitly aims to overcome this limitation by not using external forces.
    • RFC has been applied in human pose estimation from video [52, 21, 11] and language-based motion generation [51], showing its effectiveness for stabilization, but also its inherent trade-off with realism.

3.2.2. Fail-state Recovery for Simulated Characters

Maintaining balance is a constant challenge for physics-based characters. Prior approaches to fail-state recovery include:

  • Ignoring Physics/Compromising Realism:
    • PhysCap [37]: Used a floating-base humanoid that did not require balancing, thereby compromising physical realism.
  • Resetting Mechanisms:
    • Egopose [49]: Designed a fail-safe mechanism to reset the humanoid to the kinematic pose when it was about to fall. The paper notes this can lead to teleportation behavior if the humanoid keeps resetting to unreliable kinematic poses (e.g., from noisy input).
  • Sampling-based Control:
    • NeuroMoCon [13]: Utilized sampling-based control and reran the sampling process if the humanoid fell. While effective, this approach doesn't guarantee success and might be too slow for real-time use cases.
  • Additional Recovery Policies:
    • Some methods [6] use an additional recovery policy when the humanoid deviates. However, without access to the original reference motion, such policies can produce unnatural behavior like high-frequency jitters.
    • ASE [32]: Demonstrated the ability to rise naturally from the ground for a sword-swinging policy. The PHC paper points out that for motion imitation, the policy not only needs to get up but also seamlessly track the reference motion again.
  • PHC's Approach: The PHC proposes a comprehensive solution where the controller can rise from a fallen state, naturally walk back to the reference motion, and resume imitation, integrating this capability into a single, perpetual policy.

3.2.3. Progressive Reinforcement Learning

When learning from diverse data or multiple tasks, catastrophic forgetting [8, 25] is a major issue. Various methods have been proposed:

  • Combating Catastrophic Forgetting:
    • Regularizing network weights [16]: Penalizing changes to important weights.
    • Learning multiple experts [15] or increasing capacity using mixture of experts [54, 36, 45]: These involve specialized networks for different tasks or data subsets.
    • Multiplicative control [31]: A mechanism to combine the outputs of multiple policies.
  • Progressive Learning Paradigms:
    • Progressive learning [5, 4] or curriculum learning [1]: Training models by gradually introducing more complex data or tasks.
    • Progressive Reinforcement Learning [3]: Distilling skills from multiple expert policies to find a single policy that matches their action distribution.
  • Progressive Neural Networks (PNN) [34]: A direct predecessor to PMCP. PNNs avoid catastrophic forgetting by freezing the weights of previously learned subnetworks (for older tasks) and initializing additional subnetworks for new tasks. Lateral connections are used to feed experiences from older subnetworks to newer ones.
    • Limitation of PNN for Motion Imitation: The paper notes that PNN requires manually choosing which subnetwork to use based on task labels. In motion imitation, the boundary between "hard" and "easy" motion sequences is blurred, and there are no clear task labels for different motion types, making direct application difficult.

3.3. Technological Evolution

The field has evolved from purely kinematic animation (pre-recorded, inflexible motion) to physics-based character control enabled by RL. Early RL solutions were often limited to specific, short tasks or required external forces for stability. The scaling challenge for physics-based motion imitation across vast, diverse MoCap datasets emerged as a key hurdle, leading to mixture of experts approaches. Simultaneously, the need for robust fail-state recovery became apparent, moving from simple resets to more natural, physics-informed recovery strategies. This paper's work represents a step forward by combining scalable motion imitation with natural fail-state recovery and robustness to noisy real-world inputs, all while maintaining physical realism by abstaining from external forces. It builds upon PNN and multiplicative control concepts to achieve this multi-faceted goal.

3.4. Differentiation Analysis

The Perpetual Humanoid Controller (PHC) differentiates itself from prior work in several key aspects:

  • Elimination of External Stabilizing Forces: Unlike UHC [20], which relies on residual force control (RFC) to achieve high motion imitation success rates, PHC operates entirely without any external forces. This ensures physical realism and prevents artifacts like floating or swinging that can arise from RFC, especially during challenging motions. This is a crucial distinction as it makes the simulated avatar's behavior more physically plausible.

  • Comprehensive and Natural Fail-state Recovery: While fail-state recovery has been addressed by methods like Egopose [49] (resetting) or ASE [32] (getting up), PHC provides a more integrated and natural solution. It not only enables the humanoid to rise from a fallen state but also to approach the reference motion naturally and seamlessly resume imitation. The policy is trained to handle fallen, faraway, or combined fail-states, ensuring continuous, perpetual control without disruptive resets.

  • Scalability and Catastrophic Forgetting Mitigation via PMCP:

    • Progressive Learning: PHC employs progressive multiplicative control policy (PMCP) to scale learning to the entire AMASS dataset (ten thousand clips). Unlike standard RL training on large datasets which often leads to catastrophic forgetting, PMCP dynamically allocates new network capacity (primitives) to learn increasingly harder motion sequences and new tasks like fail-state recovery.
    • Dynamic Composition: PMCP builds upon Progressive Neural Networks (PNN) [34] but adapts it for motion imitation. While PNN requires explicit task labels and manual subnetwork selection, PMCP trains a composer to dynamically combine pretrained primitives based on the current state, without relying on predefined task boundaries.
    • Multiplicative Control: PMCP uses multiplicative control [31] rather than a typical Mixture of Experts (MOE) [45]. Instead of activating only one expert (top-1 MOE), MCP combines the distributions of all primitives, allowing for a richer and more nuanced control policy that benefits from the collective experience of its components.
  • Robustness to Noisy and Diverse Inputs: PHC is explicitly designed to be robust to noisy input from video-based pose estimators (e.g., HybrIK, MeTRAbs) and language-based motion generators (e.g., MDM). This makes it a task-agnostic solution directly applicable to real-time virtual avatar applications, a critical feature for practical deployment. The paper also demonstrates the effectiveness of a keypoint-based controller, which is often more robust to noise than rotation-based methods.

    In essence, PHC distinguishes itself by offering a robust, physically realistic, scalable, and perpetually operating humanoid controller that can seamlessly handle diverse, noisy inputs and fail-states, addressing fundamental limitations of prior physics-based motion imitation methods.

4. Methodology

The paper proposes the Perpetual Humanoid Controller (PHC), a physics-based humanoid controller designed for real-time simulated avatars. It achieves high-fidelity motion imitation, fault-tolerant behavior, and scalability through a novel progressive multiplicative control policy (PMCP).

4.1. Principles

The core idea behind PHC is to learn a single, robust RL policy that can imitate a vast range of human motions while inherently possessing the ability to recover from unexpected fail-states (like falling) without external intervention or manual resets. This is achieved by combining several principles:

  1. Goal-Conditioned Reinforcement Learning: The humanoid learns to perform actions to match a specified reference motion or keypoints as its goal.
  2. Adversarial Motion Prior (AMP): To ensure that the humanoid's movements are natural and human-like, even during recovery or when dealing with noisy input, an AMP discriminator is used to provide a style reward.
  3. No External Forces: Unlike previous approaches that relied on residual forces for stability, PHC is designed to be physically realistic by controlling the humanoid solely through joint torques generated by PD controllers.
  4. Progressive Learning for Scalability and Multi-tasking: The progressive multiplicative control policy (PMCP) is introduced to overcome catastrophic forgetting when learning from large, diverse motion datasets and when adding new, distinct tasks (like fail-state recovery). It does this by dynamically expanding network capacity (primitives) and then composing them.
  5. Robustness to Noise: The controller is designed to handle imperfections in reference motion data, such as those arising from video-based pose estimators or language-based generators.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Reference Motion Representation

The paper defines the reference pose at time $t$ as $\hat{\pmb{q}}_t \triangleq (\hat{\pmb{\theta}}_t, \hat{\pmb{p}}_t)$.

  • $\hat{\pmb{\theta}}_t \in \mathbb{R}^{J \times 6}$: The 3D joint rotations of all $J$ links of the humanoid, using the 6D continuous rotation representation [53], which encodes each rotation with two 3D vectors.

  • $\hat{\pmb{p}}_t \in \mathbb{R}^{J \times 3}$: The 3D positions of all $J$ links of the humanoid.

    From a sequence of reference poses $\hat{\pmb{q}}_{1:T}$, the reference velocities $\hat{\dot{\pmb{q}}}_{1:T}$ can be computed by finite differences (see the sketch at the end of this subsection).

  • $\hat{\dot{\pmb{q}}}_t \triangleq (\hat{\pmb{\omega}}_t, \hat{\pmb{v}}_t)$: The angular and linear velocities.

    • $\hat{\pmb{\omega}}_t \in \mathbb{R}^{J \times 3}$: Angular velocities of all links.

    • $\hat{\pmb{v}}_t \in \mathbb{R}^{J \times 3}$: Linear velocities of all links.

      The paper distinguishes between two types of motion imitation based on the input:

  • Rotation-based imitation: Requires full reference poses $\hat{\pmb{q}}_{1:T}$ (both rotations and keypoints).

  • Keypoint-based imitation: Requires only the 3D keypoints $\hat{\pmb{p}}_{1:T}$.

    A notation convention is established:

  • $\tilde{\cdot}$: Kinematic quantities (without physics simulation) obtained from pose estimators or keypoint detectors.

  • $\hat{\cdot}$: Ground-truth quantities from motion capture (MoCap).

  • Plain symbols (without accents): Values from the physics simulation.
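
The finite-difference step is straightforward; below is a minimal NumPy sketch (not from the paper's code release) that derives per-link linear reference velocities from a keypoint sequence. The frame rate `dt` and array shapes are illustrative assumptions; angular velocities would additionally require differencing rotations.

```python
import numpy as np

def reference_linear_velocities(p_ref: np.ndarray, dt: float = 1.0 / 30.0) -> np.ndarray:
    """Approximate reference velocities v_{1:T} from keypoints p_{1:T} of
    shape (T, J, 3) using forward finite differences."""
    v_ref = np.zeros_like(p_ref)
    v_ref[:-1] = (p_ref[1:] - p_ref[:-1]) / dt  # v_t ~ (p_{t+1} - p_t) / dt
    v_ref[-1] = v_ref[-2]                       # repeat the last velocity
    return v_ref

# usage: v_ref = reference_linear_velocities(p_ref)  # p_ref: (T, 24, 3) for SMPL
```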

4.2.2. Goal-Conditioned Motion Imitation with Adversarial Motion Prior

The PHC controller is built upon a general goal-conditioned Reinforcement Learning (RL) framework (visualized in Figure 3). The goal-conditioned policy $\pi_{\mathrm{PHC}}$ is tasked with imitating reference motion $\hat{\pmb{q}}_{1:T}$ or keypoints $\hat{\pmb{p}}_{1:T}$.

The task is formulated as a Markov Decision Process (MDP): $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$.

  • Physics Simulation: Determines the state $\pmb{s}_t \in \mathcal{S}$ and the transition dynamics $\mathcal{T}$.

  • Policy: The policy $\pi_{\mathrm{PHC}}$ computes the per-step action $\pmb{a}_t \in \mathcal{A}$.

  • Reward Function: Based on the simulation state $\pmb{s}_t$ and reference motion $\hat{\pmb{q}}_t$, the reward function $\mathcal{R}$ computes a reward $r_t = \mathcal{R}(s_t, \hat{q}_t)$ as the learning signal.

  • Objective: The policy aims to maximize the discounted return $ \mathbb{E} \left[ \sum_{t=1}^T \gamma^{t-1} r_t \right] $, where $\mathbb{E}$ denotes the expectation over trajectories, $\gamma$ is the discount factor, and $r_t$ is the reward at time step $t$.

  • Learning Algorithm: Proximal Policy Optimization (PPO) [35] is used to learn $\pi_{\mathrm{PHC}}$; its standard clipped objective is recalled below Figure 3.

    The following figure (Figure 3 from the original paper) shows the goal-conditioned RL framework with Adversarial Motion Prior:

    Figure 3: Goal-conditioned RL framework with Adversarial Motion Prior. Each primitive $\mathcal{P}^{(k)}$ and composer $c$ is trained using the same procedure; the figure visualizes the final product $\pi_{\mathrm{PHC}}$. The diagram depicts the training loop of $\pi_{\mathrm{PHC}}$, including the action $a_t$ and state $s_t$, the roles of the motion data $\hat{Q}$ and the discriminator $D$, and how learning feedback flows through the reference state $\hat{s}_t$ and the goal reward $r_t^{\mathrm{g}}$, all within a physics simulator (Isaac Gym).
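
For readers unfamiliar with PPO, its standard clipped surrogate objective (the general formulation, not a detail specific to this paper) is: $ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right], \quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} $ where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range; maximizing this objective keeps each policy update close to the previous policy.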

4.2.2.1. State

The simulation state $\pmb{s}_t \triangleq (\pmb{s}_t^{\mathrm{p}}, \pmb{s}_t^{\mathrm{g}})$ consists of two main parts:

  • Humanoid Proprioception ($\pmb{s}_t^{\mathrm{p}}$): Describes the humanoid's internal state, $\pmb{s}_t^{\mathrm{p}} \triangleq (\pmb{q}_t, \dot{\pmb{q}}_t, \beta)$.

    • $\pmb{q}_t$: The 3D body pose (joint rotations and root position/orientation) of the simulated humanoid.
    • $\dot{\pmb{q}}_t$: The velocities (linear and angular velocities of all joints) of the simulated humanoid.
    • $\beta$: (Optionally) body shape parameters. When trained with different body shapes, $\beta$ provides information about the length of each limb [22].
  • Goal State ($\pmb{s}_t^{\mathrm{g}}$): Describes the target motion to be imitated. It is defined as the difference between the next time step's reference quantities and their simulated counterparts; these differences serve as the goal for the policy.

    • For Rotation-based Motion Imitation: $ s_t^{\mathrm{g-rot}} \triangleq \big( \hat{\pmb{\theta}}_{t+1} \ominus \pmb{\theta}_t, \hat{\pmb{p}}_{t+1} - \pmb{p}_t, \hat{\pmb{v}}_{t+1} - \pmb{v}_t, \hat{\pmb{\omega}}_{t+1} - \pmb{\omega}_t, \hat{\pmb{\theta}}_{t+1}, \hat{\pmb{p}}_{t+1} \big) $

      • $\hat{\pmb{\theta}}_{t+1} \ominus \pmb{\theta}_t$: The rotation difference between the next reference joint rotation and the current simulated joint rotation. The $\ominus$ operator computes this difference in a way that respects the geometry of rotations (e.g., a quaternion difference).
      • $\hat{\pmb{p}}_{t+1} - \pmb{p}_t$: The position difference between the next reference joint position and the current simulated joint position.
      • $\hat{\pmb{v}}_{t+1} - \pmb{v}_t$: The difference between the next reference linear velocity and the current simulated linear velocity.
      • $\hat{\pmb{\omega}}_{t+1} - \pmb{\omega}_t$: The difference between the next reference angular velocity and the current simulated angular velocity.
      • $\hat{\pmb{\theta}}_{t+1}$: The next reference joint rotation itself.
      • $\hat{\pmb{p}}_{t+1}$: The next reference joint position itself.
    • For Keypoint-only Motion Imitation: A simplified goal state that uses only 3D keypoints (see the sketch below): $ s_t^{\mathrm{g-kp}} \triangleq \big( \hat{p}_{t+1} - p_t, \hat{v}_{t+1} - v_t, \hat{p}_{t+1} \big) $

      • $\hat{p}_{t+1} - p_t$: The position difference between the next reference keypoint position and the current simulated keypoint position.

      • $\hat{v}_{t+1} - v_t$: The difference between the next reference linear velocity and the current simulated linear velocity.

      • $\hat{p}_{t+1}$: The next reference keypoint position itself.

        Normalization: All quantities in $\pmb{s}_t^{\mathrm{g}}$ and $\pmb{s}_t^{\mathrm{p}}$ are normalized with respect to the humanoid's current facing direction and root position [47, 20]. This makes the state representation invariant to global translation and heading rotation, allowing the policy to learn more general behaviors.
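
As an illustration of how the keypoint goal state $s_t^{\mathrm{g-kp}}$ might be assembled, here is a minimal NumPy sketch under assumed array shapes; the heading normalization is reduced to a yaw rotation about the root, which is our simplification rather than the paper's exact implementation.

```python
import numpy as np

def yaw_rotation(heading: float) -> np.ndarray:
    """Rotation matrix that removes the humanoid's heading (yaw) angle."""
    c, s = np.cos(-heading), np.sin(-heading)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def keypoint_goal_state(p_ref_next, v_ref_next, p_sim, v_sim, root_pos, heading):
    """Build s_t^{g-kp} = (p_ref_{t+1} - p_t, v_ref_{t+1} - v_t, p_ref_{t+1}),
    expressed in the humanoid's heading-aligned, root-centered frame.
    All keypoint arrays have shape (J, 3)."""
    R = yaw_rotation(heading)
    dp = (p_ref_next - p_sim) @ R.T           # position differences
    dv = (v_ref_next - v_sim) @ R.T           # linear-velocity differences
    p_local = (p_ref_next - root_pos) @ R.T   # reference keypoints in the local frame
    return np.concatenate([dp.ravel(), dv.ravel(), p_local.ravel()])
```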

4.2.2.2. Reward

The reward function $\mathcal{R}$ provides the learning signal for the RL policy. Unlike prior methods that focus solely on motion imitation rewards, PHC adds an Adversarial Motion Prior (AMP) reward for more natural motion and an energy penalty. The total reward is: $ r_t = 0.5 r_t^{\mathrm{g}} + 0.5 r_t^{\mathrm{amp}} + r_t^{\mathrm{energy}} $

  • $r_t^{\mathrm{g}}$: The task reward, which changes based on the current objective (motion imitation or fail-state recovery).
  • $r_t^{\mathrm{amp}}$: The style reward from the AMP discriminator. This term encourages the simulated motion to look natural and human-like by rewarding motions that are indistinguishable from real MoCap data. It is crucial for stable and natural fail-state recovery.
  • $r_t^{\mathrm{energy}}$: An energy penalty [29]. This penalty regularizes the policy and suppresses high-frequency jitter (especially in the feet), which can occur in policies trained without external forces: $ r_t^{\mathrm{energy}} = -0.0005 \cdot \sum_{j \in \mathrm{joints}} \left| \mu_j \omega_j \right|^2 $
    • $\mu_j$: The torque applied at joint $j$.
    • $\omega_j$: The angular velocity of joint $j$. The penalty is proportional to the square of the product of torque and angular velocity, effectively penalizing mechanical power; this encourages more energy-efficient and smoother movements.

The task reward $r_t^{\mathrm{g}}$ is defined as follows:

  • For Motion Imitation: $ r_t^{\mathrm{g-imitation}} = \mathcal{R}^{\mathrm{imitation}}(s_t, \hat{q}_t) = w_{\mathrm{jp}} e^{-100 \| \hat{p}_t - p_t \|} + w_{\mathrm{jr}} e^{-10 \| \hat{q}_t \ominus q_t \|} + w_{\mathrm{jv}} e^{-0.1 \| \hat{v}_t - v_t \|} + w_{\mathrm{j\omega}} e^{-0.1 \| \hat{\omega}_t - \omega_t \|} $ This reward encourages the simulated humanoid to match the reference motion across several components:

    • $w_{\mathrm{jp}} e^{-100 \| \hat{p}_t - p_t \|}$: Rewards joint position matching, with weight $w_{\mathrm{jp}}$. The exponential ensures that even small deviations are heavily penalized, and the reward quickly drops toward zero as positions diverge.
    • $w_{\mathrm{jr}} e^{-10 \| \hat{q}_t \ominus q_t \|}$: Rewards joint rotation matching, with weight $w_{\mathrm{jr}}$ and rotation difference $\ominus$.
    • $w_{\mathrm{jv}} e^{-0.1 \| \hat{v}_t - v_t \|}$: Rewards linear velocity matching, with weight $w_{\mathrm{jv}}$.
    • $w_{\mathrm{j\omega}} e^{-0.1 \| \hat{\omega}_t - \omega_t \|}$: Rewards angular velocity matching, with weight $w_{\mathrm{j\omega}}$. All norms are L2 norms (Euclidean distances), and the scaling factors (100, 10, 0.1) set the sensitivity to deviations of each component (see the sketch after this list).
  • For Fail-state Recovery: The reward $r_t^{\mathrm{g-recover}}$ is defined in Eq. 3 and detailed in the Fail-state Recovery section.
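
To make the reward terms concrete, the sketch below evaluates the imitation reward, the energy penalty, and the weighted total with NumPy. The weights `w_jp`, `w_jr`, `w_jv`, `w_jw` are placeholder values (the paper's exact weights are not reproduced here), and the rotation error is passed in precomputed as `rot_err`.

```python
import numpy as np

def imitation_reward(p_ref, p_sim, rot_err, v_ref, v_sim, w_ref, w_sim,
                     w_jp=0.5, w_jr=0.3, w_jv=0.1, w_jw=0.1):
    """r_t^{g-imitation}: exponentials of position, rotation, and velocity errors."""
    r_jp = w_jp * np.exp(-100.0 * np.linalg.norm(p_ref - p_sim))
    r_jr = w_jr * np.exp(-10.0 * rot_err)                      # rotation error norm
    r_jv = w_jv * np.exp(-0.1 * np.linalg.norm(v_ref - v_sim))
    r_jw = w_jw * np.exp(-0.1 * np.linalg.norm(w_ref - w_sim))
    return r_jp + r_jr + r_jv + r_jw

def energy_penalty(torques, ang_vels):
    """r_t^{energy} = -0.0005 * sum_j |mu_j * omega_j|^2 (penalizes mechanical power)."""
    return -0.0005 * np.sum((torques * ang_vels) ** 2)

def total_reward(r_task, r_amp, r_energy):
    """r_t = 0.5 * r_t^g + 0.5 * r_t^amp + r_t^energy."""
    return 0.5 * r_task + 0.5 * r_amp + r_energy
```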

4.2.2.3. Action

The policy $\pi_{\mathrm{PHC}}$ outputs actions that control the humanoid.

  • PD Controller: A proportional-derivative (PD) controller drives each degree of freedom (DoF) of the humanoid. The action $\pmb{a}_t$ specifies the PD targets.
  • Target Joint Positions: With the PD targets set as $\pmb{q}_t^d = \pmb{a}_t$, the torque applied at joint $i$ is computed as follows (see the sketch after this list): $ \pmb{\tau}^i = \pmb{k}^p \circ (\pmb{a}_t - \pmb{q}_t) - \pmb{k}^d \circ \dot{\pmb{q}}_t $
    • $\pmb{\tau}^i$: The torque vector applied at joint $i$.
    • $\pmb{k}^p$: The proportional gain vector (element-wise).
    • $\pmb{k}^d$: The derivative gain vector (element-wise).
    • $\pmb{a}_t$: The PD target output by the RL policy, i.e., the desired joint position for the PD controller.
    • $\pmb{q}_t$: The current joint position of the simulated humanoid.
    • $\dot{\pmb{q}}_t$: The current joint velocity of the simulated humanoid.
    • $\circ$: Element-wise multiplication (Hadamard product).
  • Distinction from Residual Action: This differs from the residual action representation [50, 20, 28] used in prior work, where the action is added to the reference pose ($\pmb{q}_t^d = \hat{\pmb{q}}_t + \pmb{a}_t$). By removing this dependency on the reference motion in the action space, PHC gains robustness to noisy and ill-posed reference motions.
  • No External Forces: Crucially, the controller does not use any external forces [50] or meta-PD control [52].
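
A minimal sketch of this non-residual PD step, assuming per-DoF gain vectors `kp` and `kd` (the actual gains are set in the simulator and are not stated here):

```python
import numpy as np

def pd_torques(action, q, qdot, kp, kd):
    """tau = kp * (a_t - q_t) - kd * qdot_t, computed element-wise per DoF.
    `action` is the PD target output by the policy; it is NOT added to the
    reference pose, unlike residual-action formulations."""
    return kp * (action - q) - kd * qdot

# usage with SMPL's 23 actuated joints x 3 DoF = 69 targets:
# tau = pd_torques(a_t, q_t, qdot_t, kp, kd)   # all arrays of shape (69,)
```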

4.2.2.4. Control Policy and Discriminator

  • Control Policy ($\pi_{\mathrm{PHC}}$): A Gaussian distribution over actions: $ \pi_{\mathrm{PHC}}(\mathbf{a}_t \mid \mathbf{s}_t) = \mathcal{N}(\mu(\mathbf{s}_t), \boldsymbol{\sigma}) $
    • $\mathcal{N}$: A Gaussian (normal) distribution.
    • $\mu(\mathbf{s}_t)$: The mean of the Gaussian, computed from the state $\mathbf{s}_t$ by the neural network policy.
    • $\boldsymbol{\sigma}$: The standard deviation of the Gaussian; here, a fixed diagonal covariance.
  • AMP Discriminator ($\mathcal{D}$): The discriminator $\mathcal{D}(s_{t-10:t}^{\mathrm{p}})$ outputs a real-or-fake score from a history of the humanoid's proprioception over the last 10 time steps, judging whether the motion looks natural. It uses the same observations, loss formulation, and gradient penalty as the original AMP paper [33].
  • Network Architecture: All neural networks in the framework (the discriminator, each primitive policy, the value function, and the composer) are two-layer multilayer perceptrons (MLPs) with hidden dimensions [1024, 512]: an input layer, two hidden layers with 1024 and 512 units, and an output layer (see the sketch below).
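
A PyTorch sketch of one such network used as a Gaussian policy head; the [1024, 512] hidden sizes follow the paper, while the activation function and the fixed log-std value are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Two-layer MLP ([1024, 512]) that outputs the mean of a Gaussian over
    PD targets; the diagonal standard deviation is fixed (state-independent)."""
    def __init__(self, state_dim: int, action_dim: int, log_std: float = -2.0):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )
        self.log_std = nn.Parameter(torch.full((action_dim,), log_std),
                                    requires_grad=False)

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mlp(state)
        return torch.distributions.Normal(mu, self.log_std.exp())

# usage: dist = GaussianMLPPolicy(state_dim, 69)(s_t); a_t = dist.sample()
```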

4.2.2.5. Humanoid

The humanoid controller is designed to be flexible and can support any human kinematic structure.

  • SMPL Model: The paper uses the SMPL [19] kinematic structure, following previous works [52, 20, 21].
  • Degrees of Freedom: The SMPL body consists of 24 rigid bodies, of which 23 are actuated. This yields an action space $\pmb{a}_t \in \mathbb{R}^{23 \times 3}$: for each of the 23 actuated joints, the policy outputs 3 PD targets (e.g., roll, pitch, yaw).
  • Body Shape: The body proportions can vary according to a shape parameter $\beta \in \mathbb{R}^{10}$, which allows controlling avatars with different physical builds.

4.2.2.6. Initialization and Relaxed Early Termination (RET)

  • Reference State Initialization (RSI) [29]: During training, RSI is used, where the humanoid's initial state is set to match a randomly selected starting point within a motion clip. This helps to efficiently explore the state space and learn motion imitation.
  • Early Termination: An episode terminates if the humanoid's joints deviate by more than 0.5 meters on average globally from the reference motion.
  • Relaxed Early Termination (RET): A key modification proposed by the paper. Unlike UHC [20], PHC removes the ankle and toe joints from the termination condition.
    • Rationale: Simulated humanoids often have a dynamics mismatch with real humans, especially concerning the multi-segment foot [27]. Blindly following MoCap foot movements can cause the simulated humanoid to lose balance. RET allows these joints to slightly deviate from the MoCap motion to maintain balance.
    • Prevention of Unnatural Movement: Despite RET, the humanoid still receives imitation and discriminator rewards for these body parts, preventing them from moving in a non-human manner. This is deemed a small but crucial detail for achieving a high motion imitation success rate.
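
A sketch of what the relaxed early-termination check could look like: the average global joint deviation is computed over all joints except the ankles and toes (the joint indices used here are placeholders, not the paper's):

```python
import numpy as np

def should_terminate(p_sim, p_ref, excluded=(7, 8, 10, 11), threshold=0.5):
    """Terminate if the tracked joints deviate from the reference by more than
    `threshold` meters on average (global positions, shape (J, 3)).
    Ankle/toe indices listed in `excluded` are ignored (Relaxed Early Termination)."""
    tracked = [j for j in range(p_sim.shape[0]) if j not in excluded]
    mean_dev = np.linalg.norm(p_sim[tracked] - p_ref[tracked], axis=-1).mean()
    return mean_dev > threshold
```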

4.2.2.7. Hard Negative Mining

To efficiently train on large motion datasets, it's important to focus on challenging sequences as training progresses.

  • Procedure: Similar to UHC [20], the paper employs hard negative mining.
  • Definition of Hard Sequences: Hard sequences $\hat{Q}_{\mathrm{hard}} \subseteq \hat{Q}$ are defined as those the current controller fails to imitate successfully.
  • Training Strategy: By evaluating the model over the entire dataset and selecting failed sequences, the training can be biased towards these harder examples, making the learning process more informative. The paper notes that even hard negative mining alone can suffer from catastrophic forgetting, which PMCP aims to address.
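
The mining loop itself is simple; in the sketch below, `rollout_success` is a stand-in for rolling out the policy in the simulator with early termination and reporting whether imitation succeeded:

```python
def mine_hard_sequences(policy, dataset, rollout_success):
    """Return the subset Q_hard of motion clips the current policy fails on."""
    return [clip for clip in dataset if not rollout_success(policy, clip)]

# Subsequent PPO batches are then sampled from Q_hard instead of the full
# dataset Q, biasing training toward the harder examples.
```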

4.2.3. Progressive Multiplicative Control Policy (PMCP)

The PMCP is the core innovation for enabling PHC to scale to large datasets and learn new tasks without catastrophic forgetting. The observation is that model performance plateaus and forgets older sequences when learning new ones.

The following figure (Figure 2 from the original paper) illustrates the training progress and the role of primitives:

Figure 2: A schematic of the training progression based on reference and simulated motion: primitives are trained on motion subsets of increasing difficulty, a dedicated primitive handles fail-state recovery, and training finishes on the full dataset. The strategies shown include hard negative mining and the composer $\mathcal{C}$.

4.2.3.1. Progressive Neural Networks (PNN) Foundation

PMCP is inspired by Progressive Neural Networks (PNN) [34].

  • PNN Mechanism: A PNN starts with a single primitive network $\mathcal{P}^{(1)}$ trained on the full dataset $\hat{Q}$. Once $\mathcal{P}^{(1)}$ has converged on $\hat{Q}$ using the imitation task, a subset of hard motions ($\hat{Q}_{\mathrm{hard}}^{(k+1)}$) is identified by evaluating $\mathcal{P}^{(1)}$ on $\hat{Q}$.
    • The parameters of $\mathcal{P}^{(1)}$ are then frozen.
    • A new, randomly initialized primitive $\mathcal{P}^{(2)}$ is created.
    • Lateral connections are established, connecting each layer of $\mathcal{P}^{(1)}$ to $\mathcal{P}^{(2)}$, which allows $\mathcal{P}^{(2)}$ to leverage features learned by $\mathcal{P}^{(1)}$.
  • PMCP Adaptation: In PMCP, each new primitive $\mathcal{P}^{(k)}$ is responsible for learning a new, harder subset of motion sequences. The paper also considers a variant where each new primitive is initialized from the weights of the previous one (a weight-sharing scheme), similar to fine-tuning on harder sequences while the frozen earlier primitives preserve the ability to imitate already-learned sequences.

4.2.3.2. Fail-state Recovery as a New Task

Learning fail-state recovery is treated as a new task within the PMCP framework.

  • Types of Fail-states:
    1. Fallen on the ground: The humanoid has lost balance and is lying down.
    2. Faraway from the reference motion (> 0.5m): The humanoid has significantly deviated from the target path.
    3. Combination: Both fallen and faraway.
  • Recovery Objective: In these situations, the humanoid should get up, approach the reference motion naturally, and resume imitation.
  • Dedicated Primitive ($\mathcal{P}^{(F)}$): A new primitive $\mathcal{P}^{(F)}$ is appended at the end of the primitive stack specifically for fail-state recovery.
  • Modified State Space for Recovery: During fail-state recovery, the reference motion itself is not useful (a fallen humanoid should not imitate a standing reference). Therefore, the state space is modified to remove all reference motion information except the root's reference.
    • Reference Joint Rotation Modification: For the reference joint rotations $\hat{\pmb{\theta}}_t = [\hat{\pmb{\theta}}_t^0, \hat{\pmb{\theta}}_t^1, \dots, \hat{\pmb{\theta}}_t^J]$ (where $\hat{\pmb{\theta}}_t^i$ is the $i^{\mathrm{th}}$ joint's rotation), a modified reference $\hat{\pmb{\theta}}_t^{\prime}$ is constructed: $ \hat{\pmb{\theta}}_t^{\prime} = [\hat{\pmb{\theta}}_t^0, \pmb{\theta}_t^1, \dots, \pmb{\theta}_t^J] $ Here, all non-root joint rotations (i.e., $i = 1 \dots J$) are replaced with the simulated values $\pmb{\theta}_t^i$. This effectively sets the non-root joint goals to identity (no change from the current simulated pose) during recovery; only the root target $\hat{\pmb{\theta}}_t^0$ remains.
    • Modified Goal State for Fail-state ($s_t^{\mathrm{g-fail}}$): This goal state reflects the recovery objective: $ s_t^{\mathrm{g-fail}} \triangleq \big( \hat{\pmb{\theta}}_t^{\prime} \ominus \pmb{\theta}_t, \hat{p}_t^{\prime} - p_t, \hat{v}_t^{\prime} - v_t, \hat{\omega}_t^{\prime} - \omega_t, \hat{\pmb{\theta}}_t^{\prime}, \hat{p}_t^{\prime} \big) $ The only reference information guiding this state is the relative position and orientation of the target root.
    • Clamping Goal Position: If the reference root is too far away ($> 5$ m), the goal position difference is normalized and clamped to prevent excessively large goal signals: $ \frac{5 \times (\hat{\pmb{p}}_t^{\prime} - \pmb{p}_t)}{\| \hat{\pmb{p}}_t^{\prime} - \pmb{p}_t \|_2} $
    • Switching Condition: The goal state seamlessly switches between fail-state recovery and full-motion imitation based on the root's distance to the reference root (see the sketch after this list): $ s_t^{\mathrm{g}} = \begin{cases} s_t^{\mathrm{g}} & \| \hat{p}_t^0 - p_t^0 \|_2 \le 0.5 \\ s_t^{\mathrm{g-fail}} & \text{otherwise} \end{cases} $ If the simulated root position $p_t^0$ is within 0.5 m of the reference root position $\hat{p}_t^0$, the normal imitation goal state is used; otherwise, the fail-state goal state $s_t^{\mathrm{g-fail}}$ is used.
  • Creating Fail-states for Training:
    • Fallen states: The humanoid is randomly dropped on the ground, and random joint torques are applied for 150 time steps at the beginning of an episode (similar to ASE [32]).
    • Far states: The humanoid is initialized $2 \sim 5$ meters away from the reference motion.
  • Reward for Fail-state Recovery: $ r_t^{\mathrm{g-recover}} = \mathcal{R}^{\mathrm{recover}}(\pmb{s}_t, \hat{\pmb{q}}_t) = 0.5 r_t^{\mathrm{g-point}} + 0.5 r_t^{\mathrm{amp}} + 0.1 r_t^{\mathrm{energy}} $
    • $r_t^{\mathrm{g-point}} = (d_{t-1} - d_t)$: A point-goal reward that encourages the humanoid to reduce the distance $d_t$ between its root and the reference root [47].
    • The AMP style reward $r_t^{\mathrm{amp}}$ and energy penalty $r_t^{\mathrm{energy}}$ are also included to ensure natural and stable recovery.
  • Training Data for Recovery: $\mathcal{P}^{(F)}$ is trained using a handpicked subset of the AMASS dataset named $\pmb{Q}^{\mathrm{loco}}$, which contains mainly walking and running sequences. This biases the discriminator and AMP reward towards simple locomotion, which is appropriate for basic recovery.
  • Value Function and Discriminator: The existing value function and discriminator are continuously fine-tuned without initializing new ones for recovery.
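
A sketch of the switching logic described above, including the 5 m clamping of the root goal; the distance thresholds follow the paper, everything else is schematic:

```python
import numpy as np

def clamped_root_goal(p_ref_root, p_sim_root, max_dist=5.0):
    """Clamp the root goal offset (p_ref' - p_sim) to at most `max_dist` meters."""
    diff = p_ref_root - p_sim_root
    dist = np.linalg.norm(diff)
    return diff if dist <= max_dist else max_dist * diff / dist

def select_goal_state(p_ref_root, p_sim_root, imitation_goal, fail_goal, radius=0.5):
    """Use the full imitation goal while the simulated root stays within
    `radius` meters of the reference root; otherwise switch to the
    fail-state recovery goal."""
    if np.linalg.norm(p_ref_root - p_sim_root) <= radius:
        return imitation_goal
    return fail_goal
```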

4.2.3.3. Multiplicative Control

Once each primitive ($\mathcal{P}^{(1)}, \dots, \mathcal{P}^{(K)}, \mathcal{P}^{(F)}$) has been trained to imitate a subset of the dataset $\hat{Q}$ or to perform fail-state recovery, a composer $\mathcal{C}$ is trained to dynamically combine the learned primitives.

  • Composer ($\mathcal{C}$): The composer $\mathcal{C}(\pmb{w}_t^{1:K+1} \mid \pmb{s}_t)$ takes the same input state $\pmb{s}_t$ as the primitives and outputs weights $\pmb{w}_t^{1:K+1} \in \mathbb{R}^{K+1}$ to activate the primitives. These weights determine how much each primitive's output contributes to the final action.
  • PHC's Output Distribution: The final policy $\pi_{\mathrm{PHC}}$'s action distribution is a weighted multiplicative combination of the individual primitive distributions: $ \pi_{\mathrm{PHC}}(\pmb{a}_t \mid \pmb{s}_t) = \frac{1}{Z(\pmb{s}_t)} \prod_{i}^{k} \mathcal{P}^{(i)}(\pmb{a}_t \mid \pmb{s}_t)^{\mathcal{C}_i(\pmb{s}_t)}, \quad \mathcal{C}_i(\pmb{s}_t) \ge 0 $
    • $\mathcal{P}^{(i)}(\pmb{a}_t \mid \pmb{s}_t)$: The action distribution predicted by the $i^{\mathrm{th}}$ primitive.
    • $\mathcal{C}_i(\pmb{s}_t)$: The composer's activation weight for primitive $i$; $Z(\pmb{s}_t)$ normalizes the combination so that it remains a valid probability distribution.
  • Combined Action Distribution (for Gaussian Primitives): Since each $\mathcal{P}^{(k)}$ is an independent Gaussian policy, their multiplicative combination is again Gaussian, with the following parameters for each action dimension $j$: $ \mathcal{N}\left( \frac{1}{\sum_{l}^{k} \frac{\mathcal{C}_l(s_t)}{\sigma_l^j(s_t)}} \sum_{i}^{k} \frac{\mathcal{C}_i(s_t)}{\sigma_i^j(s_t)} \mu_i^j(s_t), \ \sigma^j(s_t) = \left( \sum_{i}^{k} \frac{\mathcal{C}_i(s_t)}{\sigma_i^j(s_t)} \right)^{-1} \right) $
    • $\mu_i^j(s_t)$: The mean output by primitive $i$ for action dimension $j$.
    • $\sigma_i^j(s_t)$: The variance (square of the standard deviation) output by primitive $i$ for action dimension $j$.
    • $\mathcal{C}_i(s_t)$: The composer's weight (activation) for primitive $i$ given state $s_t$. The combined mean is thus a precision-weighted average of the primitive means, and the combined variance is the inverse of the weighted sum of the primitives' inverse variances (see the sketch after this list).
  • MOE vs. MCP: Unlike a Mixture of Experts (MOE) policy, which typically activates only the top-1 expert at a time, the Multiplicative Control Policy (MCP) combines the distributions of all primitives (akin to a top-$\infty$ MOE), allowing a richer, blended policy output.
  • Training Process (Alg. 1): The primitives are trained progressively, and the composer is then trained to combine them, so that the composite policy leverages the specialized knowledge of each primitive. Fail-state recovery training is interleaved during composer training.
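
The per-dimension combination above can be implemented directly; the sketch treats $\sigma_i^j$ as the per-dimension variance, as in the bullets above, and assumes the composer's weights are non-negative:

```python
import numpy as np

def compose_primitives(means, variances, weights):
    """Multiplicatively combine K Gaussian primitives per action dimension.

    means, variances : arrays of shape (K, action_dim) from the primitives.
    weights          : non-negative composer activations C_i(s_t), shape (K,).
    Returns the mean and variance of the combined Gaussian."""
    precision = weights[:, None] / variances        # C_i(s_t) / sigma_i^j(s_t)
    total_precision = precision.sum(axis=0)         # sum over primitives i
    mu = (precision * means).sum(axis=0) / total_precision
    var = 1.0 / total_precision
    return mu, var

# usage: a_t is then sampled per dimension from N(mu, var).
```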

Algorithm 1: PMCP Training Procedure

The following is the Progressive Multiplicative Control Policy (PMCP) training procedure as described in Algorithm 1 of the original paper:

1 Function TrainPPO (π, Q(k), D, V, R):   
2 while not converged do   
3     M ← ∅ initialize sampling memory ;   
4     while M not full do   
5         q_1:T ← sample motion from Q ;   
6         for t = 1 . . . T do   
7             s_t ← (s_t^p, s_t^g) ;   
8             a_t ← π(a_t | s_t) ;   
9             s_{t+1} ← T(s_{t+1} | s_t, a_t) // simulation;   
10            r_t ← R(s_t, q_{t+1}) ;   
11            store (s_t, a_t, r_t, s_{t+1}) into memory M ;   
12    P^(k), V ← PPO update using experiences collected in M ;   
13    D ← Discriminator update using experiences collected in M ;   
14 return π;   
15 Input: Ground truth motion dataset Q ;   
16 D, V, Q_hard^(1) ← Q // Initialize discriminator, value function, and hard-sequence dataset ;   
17 for k ← 1 . . . K do   
18    Initialize P^(k) // with lateral connection/weight sharing;   
19    Q_hard^(k+1) ← eval(P^(k), Q_hard^(k)) // Collect new hard sequences from Q_hard^(k) that P^(k) fails on;   
20    P^(k) ← TrainPPO(P^(k), Q_hard^(k+1), D, V, R_imitation) ;   
21    freeze P^(k) ;   
22 P^(F) ← TrainPPO(P^(F), Q_loco, D, V, R_recover) // Fail-state Recovery;   
23 π_PHC ← {P^(1) ... P^(K), P^(F), c} // The final PHC is composed of all primitives and a composer c;   
24 PHC ← TrainPPO(π_PHC, Q, D, V, {R_imitation, R_recover}) // Train Composer;

Explanation of Algorithm 1:

TrainPPO Function (Lines 1-14): This is a generic PPO training loop used for training individual policies (primitives) and the final composer.

  • Line 1: Defines the TrainPPO function, which takes a policy ($\pi$), a motion dataset ($Q^{(k)}$ or a subset), a discriminator ($D$), a value function ($V$), and a reward function ($R$) as inputs.

  • Line 2: Training continues until a convergence criterion is met.

  • Line 3: Initializes an empty sampling memory $M$, which will store experience tuples collected during policy rollouts.

  • Line 4: Loops until the sampling memory $M$ is full.

  • Line 5: Samples a motion sequence $\hat{q}_{1:T}$ from the provided motion dataset $Q$.

  • Line 6: Iterates through each time step $t$ of the sampled motion sequence.

  • Line 7: Constructs the current state $s_t$ by combining the proprioception $s_t^{\mathrm{p}}$ and the goal state $s_t^{\mathrm{g}}$.

  • Line 8: The policy $\pi$ takes the current state $s_t$ and outputs an action $a_t$.

  • Line 9: The physics simulation (transition dynamics $\mathcal{T}$) advances the environment from $s_t$ to $s_{t+1}$ based on the action $a_t$.

  • Line 10: The reward function $R$ computes the reward $r_t$ from the current state $s_t$ and the next reference pose $\hat{q}_{t+1}$.

  • Line 11: The experience tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the sampling memory $M$.

  • Line 12: Once $M$ is full, PPO updates the policy $\mathcal{P}^{(k)}$ and the value function $V$ using the collected experiences.

  • Line 13: The discriminator $D$ is also updated using the experiences collected in $M$.

  • Line 14: Returns the trained policy $\pi$.

    Main Training Loop (Lines 15-24): This describes the progressive training of the primitives and the composer; a minimal Python sketch of the hard-sequence mining and primitive-freezing steps follows this list.

  • Line 15: Input is the full ground truth motion dataset $\hat{Q}$.

  • Line 16: Initializes the discriminator $D$ and the value function $V$, and sets the initial hard-sequence dataset $\hat{Q}_{\mathrm{hard}}^{(1)}$ to the entire ground truth motion dataset $\hat{Q}$.

  • Line 17: Loop for $k$ from 1 to $K$, where $K$ is the total number of motion imitation primitives. This signifies the progressive training stages.

  • Line 18: Initializes the $k^{\mathrm{th}}$ primitive $\mathcal{P}^{(k)}$. This involves setting up lateral connections to previous primitives or sharing weights from the previous primitive (as discussed in PNN).

  • Line 19: Evaluates the newly initialized primitive $\mathcal{P}^{(k)}$ (which inherits the knowledge of earlier primitives through weight sharing / lateral connections) on the current set of hard sequences $\hat{Q}_{\mathrm{hard}}^{(k)}$. The sequences it still fails to imitate become the new, harder set $\hat{Q}_{\mathrm{hard}}^{(k+1)}$.

  • Line 20: Trains the $k^{\mathrm{th}}$ primitive $\mathcal{P}^{(k)}$ with the TrainPPO function on the newly identified hard sequences $\hat{Q}_{\mathrm{hard}}^{(k+1)}$, using the imitation reward function $\mathcal{R}_{\mathrm{imitation}}$.

  • Line 21: After $\mathcal{P}^{(k)}$ is trained on its hard sequences, its parameters are frozen to prevent catastrophic forgetting.

  • Line 22: After all motion imitation primitives $\mathcal{P}^{(1)} \ldots \mathcal{P}^{(K)}$ are trained and frozen, a dedicated fail-state recovery primitive $\mathcal{P}^{(F)}$ is trained with the TrainPPO function on a locomotion-only dataset $\pmb{Q}^{\mathrm{loco}}$ and the fail-state recovery reward function $\mathcal{R}_{\mathrm{recover}}$.

  • Line 23: The final Perpetual Humanoid Controller $\pi_{\mathrm{PHC}}$ is formed by combining all the trained primitives ($\mathcal{P}^{(1)} \ldots \mathcal{P}^{(K)}, \mathcal{P}^{(F)}$) with a composer $c$.

  • Line 24: The composer $c$ (as part of $\pi_{\mathrm{PHC}}$) is then trained with the TrainPPO function on the full ground truth motion dataset $\hat{Q}$, using a combined reward structure that includes both the imitation and recovery rewards. This final stage teaches the composer how to dynamically select and blend the primitives depending on whether the humanoid should imitate the reference or recover from a fail-state.
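
The two steps that give PMCP its progressive character are the hard-sequence mining of line 19 and the freezing of line 21. Below is a minimal Python sketch of both under stated assumptions: `mine_hard_sequences` and `freeze` are hypothetical helper names, and the `can_imitate` callable stands in for actually rolling out the primitive in simulation and checking the failure criterion.

```python
from typing import Callable, List, Sequence

import torch.nn as nn

def mine_hard_sequences(can_imitate: Callable[[str], bool],
                        sequences: Sequence[str]) -> List[str]:
    """Line 19 of Algorithm 1: keep only the clips the current primitive still
    fails to track, so the next training stage focuses on them."""
    return [seq for seq in sequences if not can_imitate(seq)]

def freeze(primitive: nn.Module) -> None:
    """Line 21 of Algorithm 1 (PyTorch-style): freeze the trained primitive so
    later stages cannot overwrite it, which avoids catastrophic forgetting."""
    for p in primitive.parameters():
        p.requires_grad_(False)

# Toy usage: pretend the current primitive handles everything except backflips.
q_hard_k = ["walk_01", "run_02", "backflip_03", "dance_04"]
q_hard_next = mine_hard_sequences(lambda name: "backflip" not in name, q_hard_k)
print(q_hard_next)          # ['backflip_03'] -> training set for the next stage
freeze(nn.Linear(4, 4))     # stand-in for a trained primitive network
```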

4.2.4. Connecting with Motion Estimators

The PHC is designed to be task-agnostic, meaning it only requires the next-timestep reference pose $\tilde{\pmb{q}}_t$ or keypoints $\tilde{\pmb{p}}_t$ for motion tracking. This modular design allows it to be integrated with various off-the-shelf motion estimators and generators.

  • Video-based Pose Estimators:
    • HybrIK [17]: Used to estimate joint rotations $\tilde{\pmb{\theta}}_t$.
    • MeTRAbs [39, 38]: Used to estimate global 3D keypoints $\tilde{\pmb{p}}_t$.
    • Distinction: HybrIK provides rotations, while MeTRAbs provides keypoints, which aligns with the PHC's support for both rotation-based and keypoint-based imitation.
  • Language-based Motion Generation:
    • Motion Diffusion Model (MDM) [41]: A model that generates disjoint motion sequences from text prompts. PHC's recovery ability is crucial here to achieve in-betweening (smoothly transitioning between disconnected generated clips).

      The figure (Figure 4 from the original paper) shows examples of noisy motion imitation from video and language: (a) motion-capture imitation, (b) fail-state recovery, (c) language-based motion generation, and (d) noisy motion imitation from a live webcam stream (real-time pose estimation with MeTRAbs) driving a real-time simulated avatar. These control strategies adapt to noisy input while maintaining high-fidelity motion imitation.
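
Because the controller is task-agnostic and only consumes a next-timestep target, wiring it to a streaming estimator amounts to a simple per-frame loop. The sketch below is hypothetical glue code: `estimate_keypoints`, `policy`, and `sim` are stand-ins for a MeTRAbs-style estimator, the trained controller, and the physics simulator, not their actual APIs.

```python
import time

def run_live_avatar(estimate_keypoints, policy, sim, fps: float = 30.0):
    """Per-frame streaming loop: read the latest keypoint estimate and hand it
    to the controller as the next-timestep goal. All three arguments are
    hypothetical stand-ins, not real MeTRAbs or simulator interfaces."""
    dt = 1.0 / fps
    state = sim.reset()
    while True:
        tic = time.perf_counter()
        goal = estimate_keypoints()          # noisy keypoints from the webcam stream
        action = policy(state, goal)         # PHC only needs the next-frame target
        state = sim.step(action)             # physics keeps running even if the input is noisy
        time.sleep(max(0.0, dt - (time.perf_counter() - tic)))
```

A language-driven setup would look the same, with `estimate_keypoints` replaced by a generator that plays back MDM clips frame by frame.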

5. Experimental Setup

The paper evaluates the Perpetual Humanoid Controller (PHC) on its ability to imitate high-quality MoCap sequences, noisy motion sequences estimated from videos, and its capacity for fail-state recovery.

5.1. Datasets

  • AMASS [23]:

    • Source: A large-scale Motion Capture (MoCap) dataset that aggregates diverse MoCap data from various labs.
    • Characteristics: Contains tens of thousands of clips (40 hours of motion) with corresponding surface shapes (via SMPL parameters).
    • Usage: PHC is primarily trained on the training split of AMASS.
    • Filtering: Following UHC [20], sequences that are noisy or involve human-object interactions are removed. This results in 11,313 high-quality training sequences and 140 test sequences used for evaluation.
    • Q_loco: A handpicked subset of the AMASS dataset containing mainly walking and running sequences is used to train the fail-state recovery primitive ($\mathcal{P}^{(F)}$).
  • H36M (Human3.6M) [14]:

    • Source: A popular human motion dataset containing 3.6 million 3D human poses captured in various scenarios.
    • Usage: Used to evaluate the policy's ability to handle unseen MoCap sequences and noisy pose estimates from vision-based pose estimation methods.
    • Subsets Derived:
      • H36M-Motion*: Contains 140 high-quality MoCap sequences from the entire H36M dataset. This set tests PHC's generalization to unseen MoCap data.
      • H36M-Test-Video*: Contains 160 sequences of noisy poses estimated from videos in the H36M test split. This is crucial for evaluating PHC's robustness to real-world, noisy input from video. (The * indicates removal of human-chair interaction sequences).

5.2. Evaluation Metrics

The paper uses a comprehensive set of pose-based and physics-based metrics to evaluate motion imitation performance; a small NumPy sketch that implements them appears after the list below.

  1. Success Rate (Succ):

    • Conceptual Definition: This metric quantifies the percentage of motion clips that the humanoid controller can successfully imitate without major failures. An imitation episode is deemed unsuccessful if, at any point, the average global deviation of the humanoid's joints from the reference motion exceeds 0.5 meters.
    • Purpose: Succ measures the humanoid's ability to track the reference motion continuously without losing balance or significantly lagging behind.
    • Mathematical Formula: The paper does not provide a specific formula, but it implies a binary outcome per clip: 1 if successful, 0 if failed. The success rate is then the average over all clips. $ \mathrm{Succ} = \frac{1}{N_{clips}} \sum_{i=1}^{N_{clips}} \mathbf{1}(\mathrm{episode}_i \text{ is successful}) \times 100\% $
      • $N_{clips}$: Total number of motion clips evaluated.
      • $\mathbf{1}(\cdot)$: Indicator function, which is 1 if the condition is true, and 0 otherwise.
      • An episode is successful if $\forall t, \frac{1}{J} \sum_{j=1}^J \| p_{t,j} - \hat{p}_{t,j} \|_2 \leq 0.5 \text{ meters}$, where $p_{t,j}$ and $\hat{p}_{t,j}$ are the simulated and reference 3D positions of joint $j$ at time $t$, and $J$ is the number of joints.
  2. Root-relative Mean Per-Joint Position Error (MPJPE):

    • Conceptual Definition: Measures the average Euclidean distance between the simulated 3D joint positions and the ground truth 3D joint positions, after aligning the root joints (e.g., pelvis) of both the simulated and reference poses. This metric focuses on the relative accuracy of joint positions within the body, effectively removing global translation differences.
    • Purpose: MPJPE assesses the local fidelity of the imitated motion's pose structure.
    • Mathematical Formula: $ E_{\mathrm{mpjpe}} = \frac{1}{N_F \cdot J} \sum_{k=1}^{N_F} \sum_{j=1}^J \| (\hat{P}_{k,j} - \hat{P}_{k,root}) - (P_{k,j} - P_{k,root}) \|_2 $
      • $N_F$: The total number of frames evaluated across all successful episodes.
      • $J$: The total number of joints in the human body model.
      • $\hat{P}_{k,j}$: The ground truth 3D position vector of joint $j$ at frame $k$.
      • $P_{k,j}$: The simulated 3D position vector of joint $j$ at frame $k$.
      • $\hat{P}_{k,root}$: The ground truth 3D position vector of the root joint (e.g., pelvis) at frame $k$.
      • $P_{k,root}$: The simulated 3D position vector of the root joint at frame $k$.
      • $\|\cdot\|_2$: The Euclidean (L2) norm, representing the distance between two vectors.
      • $(\hat{P}_{k,j} - \hat{P}_{k,root})$: The relative position of joint $j$ to the root in the ground truth pose.
      • $(P_{k,j} - P_{k,root})$: The relative position of joint $j$ to the root in the simulated pose.
  3. Global MPJPE ($E_{\mathrm{g-mpjpe}}$):

    • Conceptual Definition: Measures the average Euclidean distance between the simulated 3D joint positions and the ground truth 3D joint positions without any root alignment.
    • Purpose: Global MPJPE considers both the relative joint accuracy and the global position accuracy of the entire pose.
    • Mathematical Formula: $ E_{\mathrm{g-mpjpe}} = \frac{1}{N_F \cdot J} \sum_{k=1}^{N_F} \sum_{j=1}^J \| \hat{P}_{k,j} - P_{k,j} \|_2 $
      • $N_F$: The total number of frames evaluated across all successful episodes.
      • $J$: The total number of joints in the human body model.
      • $\hat{P}_{k,j}$: The ground truth 3D position vector of joint $j$ at frame $k$.
      • $P_{k,j}$: The simulated 3D position vector of joint $j$ at frame $k$.
      • $\|\cdot\|_2$: The Euclidean (L2) norm, representing the distance between two vectors.
  4. Acceleration Error ($E_{\mathrm{acc}}$):

    • Conceptual Definition: Measures the difference in acceleration between the simulated motion and the reference MoCap motion.
    • Purpose: This physics-based metric indicates the smoothness and physical plausibility of the simulated motion. High acceleration error can suggest jerky or unnatural movements.
    • Mathematical Formula: The paper does not provide a specific formula. Generally, the acceleration of a joint position $P_j(t)$ can be approximated by the finite difference $A_j(t) \approx \frac{P_j(t+1) - 2P_j(t) + P_j(t-1)}{\Delta t^2}$. The error is then the average Euclidean distance between simulated and reference accelerations. $ E_{\mathrm{acc}} = \frac{1}{N_F \cdot J} \sum_{k=1}^{N_F} \sum_{j=1}^J \| A_{k,j} - \hat{A}_{k,j} \|_2 $
      • $N_F$: Total number of frames.
      • $J$: Total number of joints.
      • $A_{k,j}$: Simulated acceleration of joint $j$ at frame $k$.
      • $\hat{A}_{k,j}$: Reference acceleration of joint $j$ at frame $k$.
  5. Velocity Error ($E_{\mathrm{vel}}$):

    • Conceptual Definition: Measures the difference in velocity between the simulated motion and the reference MoCap motion.
    • Purpose: Another physics-based metric that, similar to acceleration error, reflects the physical realism and smoothness of the imitated motion.
    • Mathematical Formula: The paper does not provide a specific formula. Generally, the velocity of a joint position $P_j(t)$ can be approximated by the central difference $V_j(t) \approx \frac{P_j(t+1) - P_j(t-1)}{2 \Delta t}$. The error is then the average Euclidean distance between simulated and reference velocities. $ E_{\mathrm{vel}} = \frac{1}{N_F \cdot J} \sum_{k=1}^{N_F} \sum_{j=1}^J \| V_{k,j} - \hat{V}_{k,j} \|_2 $
      • $N_F$: Total number of frames.
      • $J$: Total number of joints.
      • $V_{k,j}$: Simulated velocity of joint $j$ at frame $k$.
      • $\hat{V}_{k,j}$: Reference velocity of joint $j$ at frame $k$.
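
For reference, here is a small NumPy sketch that computes these metrics for a single episode. The millimeter conversion and the finite-difference conventions for velocity and acceleration are assumptions on our part; the paper does not specify these details.

```python
import numpy as np

def imitation_metrics(pred, gt, dt=1.0 / 30.0, root=0, fail_thresh=0.5):
    """pred, gt: (T, J, 3) simulated and reference joint positions in meters.
    Returns the Sec. 5.2 metrics for one episode."""
    dev = np.linalg.norm(pred - gt, axis=-1)                 # (T, J) per-joint deviation
    success = bool(np.all(dev.mean(axis=1) <= fail_thresh))  # avg deviation never exceeds 0.5 m

    e_g_mpjpe = dev.mean() * 1000.0                          # global MPJPE, no root alignment
    pred_rel = pred - pred[:, root:root + 1]                 # root-align both poses
    gt_rel = gt - gt[:, root:root + 1]
    e_mpjpe = np.linalg.norm(pred_rel - gt_rel, axis=-1).mean() * 1000.0

    vel_p = np.gradient(pred, dt, axis=0)                    # central-difference velocities
    vel_g = np.gradient(gt, dt, axis=0)
    acc_p = np.gradient(vel_p, dt, axis=0)                   # central-difference accelerations
    acc_g = np.gradient(vel_g, dt, axis=0)
    e_vel = np.linalg.norm(vel_p - vel_g, axis=-1).mean() * 1000.0
    e_acc = np.linalg.norm(acc_p - acc_g, axis=-1).mean() * 1000.0
    return dict(success=success, e_g_mpjpe=e_g_mpjpe, e_mpjpe=e_mpjpe,
                e_acc=e_acc, e_vel=e_vel)

# Toy usage on random motion (10 frames, 24 joints).
rng = np.random.default_rng(0)
gt = rng.normal(size=(10, 24, 3))
print(imitation_metrics(gt + 0.01 * rng.normal(size=gt.shape), gt))
```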

5.3. Baselines

The main baseline for comparison is the state-of-the-art (SOTA) motion imitator UHC [20].

  • UHC [20]: The Universal Humanoid Control, which previously achieved a 97% success rate on the AMASS dataset.
    • Comparison modes: PHC is compared against UHC in two configurations:
      1. UHC with Residual Force Control (RFC): This is the original, high-performing UHC setup that uses external forces for stabilization.
      2. UHC without Residual Force Control (RFC): This configuration removes the external stabilizing forces from UHC to provide a fair comparison on physical realism and inherent stability. This is crucial for evaluating PHC's ability to perform without such a "hand of God."

5.4. Implementation Details

  • Primitives: The PHC uses four primitives in total, including the fail-state recovery primitive.
  • Training Time: Training all primitives and the composer takes approximately one week on a single NVIDIA A100 GPU.
  • Real-time Performance: Once trained, the composite policy runs at over 30 FPS (frames per second), which is sufficient for real-time applications.
  • Physics Simulator: NVIDIA's Isaac Gym [24] is used for physics simulation. Isaac Gym is known for its high-performance, GPU-accelerated physics simulation, which is critical for RL training efficiency.
  • Control Frequency: The control policy runs at 30 Hz.
  • Simulation Frequency: The physics simulation runs at 60 Hz, so the physics engine takes two simulation steps for each control step (illustrated by the toy stepping loop after this list).
  • Body Shape: For evaluation purposes, body shape variation is not considered, and the mean SMPL body shape is used. This simplifies the evaluation and focuses on motion imitation and recovery capabilities.
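
The 30 Hz control / 60 Hz simulation split means each policy action is held for two physics substeps. The toy loop below illustrates that decimation on a 1-DoF point mass driven by a PD controller; the gains, inertia, and the PD-target action parameterization are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Timing described above: control at 30 Hz, physics at 60 Hz -> 2 substeps per action.
CONTROL_HZ, SIM_HZ = 30, 60
SUBSTEPS = SIM_HZ // CONTROL_HZ
SIM_DT = 1.0 / SIM_HZ

# Toy 1-DoF joint tracked by a PD controller toward a policy-supplied target.
kp, kd, inertia = 50.0, 5.0, 1.0
q, qd = 0.0, 0.0
for control_step in range(CONTROL_HZ):            # one second of control
    q_target = np.sin(control_step / CONTROL_HZ)  # stand-in for the policy's PD target output
    for _ in range(SUBSTEPS):                     # hold the action for two physics substeps
        tau = kp * (q_target - q) - kd * qd       # PD torque
        qd += (tau / inertia) * SIM_DT            # semi-implicit Euler integration
        q += qd * SIM_DT
print(f"joint angle after 1 s: {q:.3f} rad")
```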

6. Results & Analysis

The experimental results demonstrate the PHC's superior performance in motion imitation on both high-quality and noisy data, as well as its robust fail-state recovery capabilities.

6.1. Motion Imitation

6.1.1. Motion Imitation on High-quality MoCap

The following are the results from Table 1 of the original paper:

AMASS-Train*:

| Method | RFC | Succ ↑ | E_g-mpjpe ↓ | E_mpjpe ↓ | E_acc ↓ | E_vel ↓ |
|---|---|---|---|---|---|---|
| UHC | ✓ | 97.0% | 36.4 | 25.1 | 4.4 | 5.9 |
| UHC | ✗ | 84.5% | 62.7 | 39.6 | 10.9 | 10.9 |
| Ours | ✗ | 98.9% | 37.5 | 26.9 | 3.3 | 4.9 |
| Ours-kp | ✗ | 98.7% | 40.7 | 32.3 | 3.5 | 5.5 |

AMASS-Test*:

| Method | RFC | Succ ↑ | E_g-mpjpe ↓ | E_mpjpe ↓ | E_acc ↓ | E_vel ↓ |
|---|---|---|---|---|---|---|
| UHC | ✓ | 96.4% | 50.0 | 31.2 | 9.7 | 12.1 |
| UHC | ✗ | 62.6% | 98.1 | 58.2 | 22.8 | 21.9 |
| Ours | ✗ | 96.4% | 47.4 | 30.9 | 6.8 | 9.1 |
| Ours-kp | ✗ | 97.1% | 53.1 | 39.5 | 7.5 | 10.4 |

H36M-Motion*:

| Method | RFC | Succ ↑ | E_g-mpjpe ↓ | E_mpjpe ↓ | E_acc ↓ | E_vel ↓ |
|---|---|---|---|---|---|---|
| UHC | ✓ | 87.0% | 59.7 | 35.4 | 4.9 | 7.4 |
| UHC | ✗ | 23.6% | 133.14 | 67.4 | 14.9 | 17.2 |
| Ours | ✗ | 92.9% | 50.3 | 33.3 | 3.7 | 5.5 |
| Ours-kp | ✗ | 95.7% | 49.5 | 39.2 | 3.7 | 5.8 |

Table 1: Quantitative results on imitating MoCap motion sequences (* indicates removing sequences containing human-object interaction). AMASS-Train*, AMASS-Test*, and H36M-Motion* are high-quality MoCap datasets. RFC: Residual Force Control. Succ ↑: higher is better. E_g-mpjpe, E_mpjpe, E_acc, E_vel ↓: lower is better.

Analysis:

  • PHC (Ours) vs. UHC with RFC: The PHC (Ours, no RFC) matches or outperforms UHC with RFC on most metrics across all datasets, despite using no external stabilizing forces.

    • On AMASS-Train*, PHC achieves a success rate of 98.9% (vs. 97.0% for UHC with RFC), with comparable or slightly higher position errors (E_g-mpjpe 37.5 vs. 36.4, E_mpjpe 26.9 vs. 25.1) but significantly lower acceleration (3.3 vs. 4.4) and velocity (4.9 vs. 5.9) errors. This indicates PHC can imitate training sequences with higher physical realism and smoothness while achieving a higher success rate.
    • On AMASS-Test* and H36M-Motion* (unseen data), PHC maintains a high success rate (96.4% and 92.9% respectively), which is comparable to or better than UHC with RFC (96.4% and 87.0%). Crucially, PHC consistently shows lower acceleration and velocity errors, indicating more natural and physically plausible motion even on unseen data.
  • PHC (Ours) vs. UHC without RFC: UHC trained without RFC performs significantly worse, especially on test sets. Its success rate drops dramatically (AMASS-Test*: 62.6%, H36M-Motion*: 23.6%). It also exhibits much higher acceleration and velocity errors, suggesting it struggles to stay balanced and resorts to high-frequency, unnatural movements. This highlights the effectiveness of PHC's design in achieving stability without external aids.

  • Keypoint-based Controller (Ours-kp): Surprisingly, the keypoint-based controller (Ours-kp) is competitive with, and in some cases even outperforms, the rotation-based controller (Ours). For instance, on H36M-Motion*, Ours-kp achieves the highest success rate of 95.7% and comparable error metrics. This suggests that providing only 3D keypoints can be a strong and potentially simpler input modality. The authors hypothesize this is due to the keypoint-based controller having more freedom in joint configuration to match keypoints, making it more robust.

    Overall: PHC demonstrates superior motion imitation capabilities compared to the RFC-enabled UHC while entirely abstaining from external forces, leading to more physically realistic and natural motions. Its progressive training and AMP integration are likely key contributors to this performance.

6.1.2. Motion Imitation on Noisy Input from Video

The following are the results from Table 2 of the original paper:

| Method | RFC | Pose Estimate | Succ ↑ | E_g-mpjpe ↓ | E_mpjpe ↓ |
|---|---|---|---|---|---|
| UHC | ✓ | HybrIK + MeTRAbs (root) | 58.1% | 75.5 | 49.3 |
| UHC | ✗ | HybrIK + MeTRAbs (root) | 18.1% | 126.1 | 67.1 |
| Ours | ✗ | HybrIK + MeTRAbs (root) | 88.7% | 55.4 | 34.7 |
| Ours-kp | ✗ | HybrIK + MeTRAbs (root) | 90.0% | 55.8 | 41.0 |
| Ours-kp | ✗ | MeTRAbs (all keypoints) | 91.9% | 55.7 | 41.1 |

Table 2: Motion imitation on noisy motion (H36M-Test-Video*). We use HybrIK [17] to estimate the joint rotations $\tilde{\theta}_t$ and MeTRAbs [39] for global 3D keypoints $\tilde{p}_t$. HybrIK + MeTRAbs (root): joint rotations $\tilde{\pmb{\theta}}_t$ from HybrIK and the root position $\tilde{p}_t^0$ from MeTRAbs. MeTRAbs (all keypoints): all keypoints $\tilde{p}_t$ from MeTRAbs, applicable only to our keypoint-based controller.

Analysis:

  • Robustness to Noisy Input: This table evaluates the performance on H36M-Test-Video*, which contains noisy pose estimates from video. This is a highly challenging scenario due to depth ambiguity, monocular global pose estimation, and depth-wise jitter.
  • PHC (Ours) vs. UHC:
    • PHC (Ours, no RFC) significantly outperforms both UHC variants. With HybrIK + MeTRAbs (root) input, PHC achieves an 88.7% success rate, far surpassing UHC with RFC (58.1%) and UHC without RFC (18.1%).
    • The MPJPEs for PHC are also substantially lower (E_g-mpjpe 55.4, E_mpjpe 34.7) compared to UHC with RFC (E_g-mpjpe 75.5, E_mpjpe 49.3) and UHC without RFC (E_g-mpjpe 126.1, E_mpjpe 67.1). This demonstrates PHC's exceptional robustness to noisy motion, making it practical for video-driven avatar control.
  • Keypoint-based Controller (Ours-kp) Advantage:
    • The keypoint-based controller (Ours-kp) consistently performs best. When using HybrIK + MeTRAbs (root) input, it reaches a 90.0% success rate.
    • When provided with MeTRAbs (all keypoints), Ours-kp achieves the highest success rate of 91.9%.
    • Explanation: The authors provide two reasons for Ours-kp's superior performance:
      1. Estimating 3D keypoints directly from images might be an easier task than estimating joint rotations, leading to higher quality MeTRAbs keypoints compared to HybrIK rotations.
      2. The keypoint-based controller has more freedom to find a joint configuration that matches the given keypoints, which makes it inherently more robust to noisy input that might specify inconsistent rotations.

6.2. Ablation Studies

The following are the results from Table 3 of the original paper:

| Row | RET | MCP | PNN | Rotation | Fail-Recover | Succ ↑ | E_g-mpjpe ↓ | E_mpjpe ↓ |
|---|---|---|---|---|---|---|---|---|
| R1 | ✗ | ✗ | ✗ | ✓ | ✗ | 51.2% | 56.2 | 34.4 |
| R2 | ✓ | ✗ | ✗ | ✓ | ✗ | 59.4% | 60.2 | 37.2 |
| R3 | ✓ | ✓ | ✗ | ✓ | ✗ | 66.2% | 59.0 | 38.3 |
| R4 | ✓ | ✓ | ✓ | ✓ | ✗ | 86.9% | 53.1 | 33.7 |
| R5 | ✓ | ✓ | ✓ | ✓ | ✓ | 88.7% | 55.4 | 34.7 |
| R6 | ✓ | ✓ | ✓ | ✗ | ✓ | 90.0% | 55.8 | 41.0 |

Table 3: Ablation on the components of our pipeline, performed using noisy pose estimates from HybrIK + MeTRAbs (root) on the H36M-Test-Video* data. RET: relaxed early termination. MCP: multiplicative control policy. PNN: progressive neural networks.

Analysis: The ablation study is performed on the H36M-Test-Video* dataset with noisy pose estimates to highlight the impact of each component on robustness.

  • Impact of Relaxed Early Termination (RET) (R1 vs. R2):

    • R1 (no RET) achieves a 51.2% success rate.
    • R2 (with RET) improves to 59.4%.
    • Finding: RET significantly boosts the success rate by allowing the ankle and toe joints to slightly deviate for better balance, confirming its importance. The E_g-mpjpe and E_mpjpe metrics are slightly higher for R2, suggesting a minor trade-off in exact pose matching for increased stability (a hedged sketch of such a relaxed termination check follows this list).
  • Impact of Multiplicative Control Policy (MCP) without Progressive Training (R2 vs. R3):

    • R2 (no MCP, no PNN) has 59.4% success.
    • R3 (with MCP but without PNN, i.e., the multiplicative policy is trained end-to-end without progressive primitive learning) improves to 66.2% success.
    • Finding: Even without progressive training, MCP provides a performance boost, likely due to its enlarged network capacity and its ability to blend multiple primitives instead of relying on a single monolithic network.
  • Impact of Progressive Neural Networks (PNN) / PMCP Pipeline (R3 vs. R4):

    • R3 (with MCP, no PNN) has 66.2% success.
    • R4 (with MCP and PNN – representing the PMCP pipeline for imitation only) jumps to 86.9% success.
    • Finding: The full PMCP pipeline (which includes PNN's progressive learning and MCP's composition) significantly boosts robustness and imitation performance. This validates that progressively learning with new network capacity is crucial for handling diverse and harder motion sequences effectively and mitigating catastrophic forgetting.
  • Impact of Fail-state Recovery (R4 vs. R5):

    • R4 (full PMCP for imitation, no fail-state recovery) achieves 86.9% success.
    • R5 (full PMCP for imitation and fail-state recovery) achieves 88.7% success.
    • Finding: Adding fail-state recovery capability improves the success rate without compromising motion imitation. This is a strong result, demonstrating that PMCP is effective in adding new tasks without catastrophic forgetting, and the recovery primitive contributes positively to overall robustness, even for typical imitation tasks (by providing graceful handling of failures).
  • Impact of Keypoint-based vs. Rotation-based (R5 vs. R6):

    • R5 (Rotation-based with full PMCP and fail-state recovery) has 88.7% success.
    • R6 (Keypoint-based with full PMCP and fail-state recovery) has 90.0% success.
    • Finding: The keypoint-based controller (Ours-kp) outperforms the rotation-based one on noisy video input. This reinforces the idea that keypoint-based imitation can be a simpler and more robust alternative, especially when input quality is compromised.
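
For illustration, a relaxed early-termination check along the lines described above might look like the sketch below. The joint names, the reuse of the 0.5 m threshold, and the exact relaxation rule are assumptions rather than the paper's precise definition.

```python
import numpy as np

RELAXED_JOINTS = {"L_Ankle", "R_Ankle", "L_Toe", "R_Toe"}   # assumed names, for illustration

def should_terminate(pred, gt, joint_names, thresh=0.5):
    """Terminate the episode when the mean deviation of the non-relaxed joints
    exceeds `thresh` meters; ankle/toe joints are excluded so the humanoid can
    shift its feet for balance. pred, gt: (J, 3) positions for the current frame."""
    keep = np.array([name not in RELAXED_JOINTS for name in joint_names])
    dev = np.linalg.norm(pred - gt, axis=-1)                # (J,) per-joint deviation
    return float(dev[keep].mean()) > thresh
```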

6.3. Real-time Simulated Avatars

The paper demonstrates PHC's capability for real-time simulated avatars.

  • A live demo (30 fps) is shown where a webcam video is used as input.
  • The keypoint-based controller and MeTRAbs-estimated keypoints are used in a streaming fashion.
  • The humanoid can perform various motions like posing and jumping while remaining stable.
  • The controller can also imitate reference motion generated by a motion language model like MDM [41]. The recovery ability of PHC is crucial for in-betweening (smoothly connecting) the disjoint motion sequences generated by MDM.

6.4. Fail-state Recovery

The following are the results from Table 4 of the original paper:

| Method | Fallen-State Succ-5s ↑ | Fallen-State Succ-10s ↑ | Far-State Succ-5s ↑ | Far-State Succ-10s ↑ | Fallen + Far-State Succ-5s ↑ | Fallen + Far-State Succ-10s ↑ |
|---|---|---|---|---|---|---|
| Ours | 95.0% | 98.8% | 83.7% | 99.5% | 93.4% | 98.8% |
| Ours-kp | 92.5% | 94.6% | 95.1% | 96.0% | 79.4% | 93.2% |

Table 4: We measure whether our controller can recover from fail-states by generating these scenarios (dropping the humanoid on the ground and placing it far from the reference motion) and measuring the time it takes to resume tracking. Succ-5s ↑ / Succ-10s ↑: percentage of successful recoveries within 5 / 10 seconds.

Analysis: This table evaluates the humanoid's ability to recover from various fail-states and resume tracking a standing-still reference motion. Recovery is considered successful if the humanoid reaches the reference motion within a given time budget (5 or 10 seconds); a small sketch of how these rates are computed follows the list below.

  • High Success Rates: Both the rotation-based (Ours) and keypoint-based (Ours-kp) controllers demonstrate very high success rates for fail-state recovery.
    • For Fallen-State and Far-State, both methods achieve success rates well above 90% within 10 seconds, and often above 80-90% within 5 seconds. This shows excellent capability in getting up and moving back to the reference.
    • Even in the most challenging Fallen + Far-State scenario (where the humanoid is both lying down and far from the target), Ours achieves 93.4% within 5s and 98.8% within 10s, while Ours-kp achieves 79.4% and 93.2% respectively.
  • Minor Differences between Rotation-based and Keypoint-based:
    • Ours (rotation-based) shows slightly higher success rates for Fallen-State and Fallen + Far-State in the 5s window, suggesting it might be slightly quicker to recover in some complex scenarios.
    • Ours-kp (keypoint-based) performs exceptionally well in the Far-State (95.1% in 5s), even outperforming Ours. This indicates it might be particularly effective at navigating towards a distant target.
  • Conclusion: The results robustly confirm that PHC can effectively and rapidly recover from diverse fail-states, ensuring perpetual control without the need for manual resets. The AMP reward and dedicated recovery primitive are key to this natural and stable recovery.
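
For completeness, Succ-5s / Succ-10s style numbers like those in Table 4 can be computed from per-trial recovery times as in the small sketch below; the trial times in the usage example are made up.

```python
import numpy as np

def recovery_success_rates(recovery_times_s, horizons=(5.0, 10.0)):
    """Given the time (in seconds) each trial took to get back within tracking
    range of the reference (np.inf if it never did), report Succ-5s / Succ-10s
    style percentages."""
    t = np.asarray(recovery_times_s, dtype=float)
    return {f"Succ-{int(h)}s": 100.0 * float(np.mean(t <= h)) for h in horizons}

# Toy usage: 5 simulated fall trials.
print(recovery_success_rates([2.1, 4.8, 6.3, 9.9, np.inf]))
# {'Succ-5s': 40.0, 'Succ-10s': 80.0}
```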

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces the Perpetual Humanoid Controller (PHC), a significant advancement in physics-based humanoid control for real-time simulated avatars. The PHC achieves high-fidelity motion imitation of large and diverse MoCap datasets (98.9% on AMASS-Train*) while rigorously adhering to physical realism by eliminating the need for external stabilizing forces. A core innovation, the progressive multiplicative control policy (PMCP), enables efficient scalability to tens of thousands of motion clips and the seamless integration of new tasks, such as natural fail-state recovery, without suffering from catastrophic forgetting. The PHC is demonstrated to be highly robust to noisy input originating from video-based pose estimators and language-based motion generators, making it a practical and task-agnostic solution for live, real-time multi-person avatar applications. Its ability to recover gracefully from various fail-states (fallen, faraway, or combined) and resume motion imitation ensures continuous, perpetual control without disruptive resets.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Dynamic Motion Imitation: While PHC achieves high success rates, it doesn't reach 100% on the training set. Highly dynamic motions such as high jumping and backflipping remain challenging. The authors hypothesize that learning such motions, especially when combined with simpler movements, requires more planning and intent than what is conveyed by a single next-frame pose target ($\hat{\pmb{q}}_{t+1}$).

  • Training Time: The progressive training procedure of PMCP results in a relatively long training time (around a week on an A100 GPU).

  • Disjoint Process for Downstream Tasks: The current setup involves a disjoint process where video pose estimators or motion generators operate independently of the physics simulation. For enhanced downstream tasks, tighter integration between pose estimation [52, 21], language-based motion generation [51], and the physics-based controller is needed.

    Based on these limitations, the authors suggest the following future work:

  1. Improved Imitation Capability: Focus on learning to imitate 100% of the motion sequences in the training set, particularly challenging dynamic movements. This might involve incorporating longer-term planning or intent into the controller.
  2. Terrain and Scene Awareness: Integrate terrain and scene awareness into the controller to enable more complex human-object interactions and navigation in varied environments.
  3. Tighter Integration with Downstream Tasks: Develop more integrated systems where the motion estimator or generator and the physics-based controller can co-adapt or provide feedback to each other.

7.3. Personal Insights & Critique

The Perpetual Humanoid Controller (PHC) is a highly impactful paper that addresses several critical challenges in physics-based character control.

Inspirations and Strengths:

  • Perpetual Control: The concept of "perpetual control" without resets is a game-changer for real-time avatar applications. It makes simulated characters feel more alive and responsive to continuous, real-world input, which is crucial for VR/AR and metaverse applications.
  • Physical Realism without Compromise: The commitment to avoiding external forces is commendable. While RFC is effective for stabilizing, it's an artificial crutch. PHC demonstrates that high-performance motion imitation is achievable within true physical realism, setting a new bar for plausibility.
  • PMCP for Scalability: The progressive multiplicative control policy is an elegant solution to catastrophic forgetting and the scalability problem in RL. By dynamically expanding capacity and composing primitives, it allows a single policy to master a vast motion repertoire and new tasks efficiently. This architecture could inspire similar solutions in other multi-task or curriculum learning RL domains.
  • Robustness to Noisy Input: The PHC's demonstrated robustness to noisy pose estimates from video is highly practical. Real-world pose estimation is inherently imperfect, and a controller that can gracefully handle these imperfections is invaluable for real-world deployment. The finding that keypoint-based control can be superior in noisy conditions is a valuable insight.
  • Natural Fail-state Recovery: The seamless and natural recovery from falls, without jerky movements or teleportation, adds a significant layer of realism and usability. The integration of this into the overall PMCP framework, rather than as a separate, ad-hoc module, is well-designed.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Computational Cost of PMCP: While PMCP is efficient in terms of learning, maintaining multiple primitive networks and a composer, even with frozen weights, adds to the memory footprint and computational complexity during inference compared to a single monolithic policy. The paper states it runs at over 30 FPS, which is good, but for extremely resource-constrained devices, this could be a factor.

  • Planning Horizon for Dynamic Motions: The limitation regarding high-dynamic motions (jumping, backflipping) is insightful. It highlights a common challenge in imitation learning: faithfully replicating momentary targets might not capture the long-term planning and intent required for such actions. Future work might explore hierarchical RL or latent skill discovery to address this, where higher-level policies dictate intent over longer horizons.

  • Generalizability to Diverse Environments: The current work focuses on flat ground. Incorporating terrain and scene awareness is listed as future work, but this is a substantial challenge. Handling uneven terrain, stairs, or complex object interactions (beyond just imitating MoCap) requires significant extensions to the state space and reward functions.

  • AMP Discriminator Bias: While AMP is powerful for natural motion, its discriminator is trained on MoCap data. If the MoCap itself has certain biases or limitations (e.g., lack of extreme motions or specific interaction types), the AMP reward might inadvertently limit the policy's ability to explore truly novel or very energetic behaviors beyond its training distribution.

  • Hyperparameter Tuning: RL systems with multiple reward terms, multiple policies, and progressive training stages often have a large number of hyperparameters (e.g., reward weights, PPO parameters, PD gains, PMCP primitive count, freeze points). The effort required to tune these for optimal performance can be substantial.

    Overall, PHC makes a substantial contribution towards realizing robust and realistic real-time avatars. Its PMCP architecture for scalable multi-task learning in RL is particularly noteworthy and has broader implications beyond humanoid control.
