Paper status: completed

AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control

Published: 04/06/2021
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents Adversarial Motion Priors (AMP), a fully automated method for generating graceful and realistic motions in physically simulated characters. It uses adversarial imitation learning so that task objectives can be specified with simple reward functions while behavior styles are learned from unstructured motion clips.

Abstract

Synthesizing graceful and life-like behaviors for physically simulated characters has been a fundamental challenge in computer animation. Data-driven methods that leverage motion tracking are a prominent class of techniques for producing high fidelity motions for a wide range of behaviors. However, the effectiveness of these tracking-based methods often hinges on carefully designed objective functions, and when applied to large and diverse motion datasets, these methods require significant additional machinery to select the appropriate motion for the character to track in a given scenario. In this work, we propose to obviate the need to manually design imitation objectives and mechanisms for motion selection by utilizing a fully automated approach based on adversarial imitation learning. High-level task objectives that the character should perform can be specified by relatively simple reward functions, while the low-level style of the character's behaviors can be specified by a dataset of unstructured motion clips, without any explicit clip selection or sequencing. These motion clips are used to train an adversarial motion prior, which specifies style-rewards for training the character through reinforcement learning (RL). The adversarial RL procedure automatically selects which motion to perform, dynamically interpolating and generalizing from the dataset. Our system produces high-quality motions that are comparable to those achieved by state-of-the-art tracking-based techniques, while also being able to easily accommodate large datasets of unstructured motion clips. Composition of disparate skills emerges automatically from the motion prior, without requiring a high-level motion planner or other task-specific annotations of the motion clips. We demonstrate the effectiveness of our framework on a diverse cast of complex simulated characters and a challenging suite of motor control tasks.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control

1.2. Authors

  • Xue Bin Peng (University of California, Berkeley)
  • Ze Ma (Shanghai Jiao Tong University)
  • Pieter Abbeel (University of California, Berkeley)
  • Sergey Levine (University of California, Berkeley)
  • Angjoo Kanazawa (University of California, Berkeley)

1.3. Journal/Conference

ACM Transactions on Graphics (ACM Trans. Graph. 40, 4, Article 1). This is a highly prestigious and influential journal in the field of computer graphics, widely recognized for publishing cutting-edge research. Publication here signifies high-quality and impactful work.

1.4. Publication Year

2021

1.5. Abstract

The paper addresses the long-standing challenge of synthesizing graceful and life-like behaviors for physically simulated characters in computer animation. Existing data-driven methods, particularly those based on motion tracking, often require intricate objective functions and complex mechanisms for selecting appropriate motions when dealing with large, diverse datasets. To overcome these limitations, the authors propose a fully automated approach called Adversarial Motion Priors (AMP) which leverages adversarial imitation learning.

AMP allows users to specify high-level task objectives using simple reward functions, while the low-level style of the character's behavior is learned from unstructured motion clips without explicit selection or sequencing. These motion clips are used to train an adversarial motion prior, which acts as a style-reward for training the character through reinforcement learning (RL). The adversarial RL procedure automatically selects, interpolates, and generalizes motions from the dataset. The system generates high-quality motions comparable to state-of-the-art tracking methods, accommodating large, unstructured motion datasets. It also automatically facilitates the composition of disparate skills without needing a high-level motion planner or task-specific annotations. The framework's effectiveness is demonstrated on diverse simulated characters and challenging motor control tasks.

2. Executive Summary

2.1. Background & Motivation

Synthesizing natural, life-like, and graceful motions for virtual characters is a fundamental and challenging problem in computer animation, crucial for creating immersive experiences in films, games, and virtual reality. Beyond entertainment, developing realistic control strategies for simulated characters also has implications for robotics, as natural motions often implicitly encode properties like safety and energy efficiency.

Prior methods face significant hurdles:

  • Optimization-based methods: While capable of producing physically plausible motions, these techniques struggle with defining quantitative metrics for "naturalness" or "life-likeness." Heuristics like symmetry or effort minimization often require careful, task-specific tuning and may not generalize well across different behaviors.

  • Data-driven kinematic methods: These methods leverage large datasets of human motion (e.g., mocap) to generate realistic motions. However, their ability to synthesize motions for novel situations is limited by data availability, making it difficult to cover all necessary behaviors, especially for non-human or fictional characters.

  • Data-driven physics-based methods (tracking-based): A common approach involves using a tracking objective to minimize the pose error between a simulated character and reference motions. While effective for high-quality single-skill imitation, scaling these methods to large, diverse, and unstructured motion datasets is challenging. They typically require a motion planner to select the appropriate reference motion for the character to track at each timestep, which introduces significant algorithmic overhead and necessitates annotating and organizing motion clips. Moreover, these methods often rely on manually designed pose error metrics that are difficult to tune across various skills.

    The paper's entry point is to address these limitations by proposing a system that obviates the need for manually designed imitation objectives and explicit motion selection mechanisms. The authors aim to enable users to specify high-level task objectives with simple reward functions, while the low-level style of motion is learned automatically from raw, unstructured motion clips.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of physics-based character control:

  • Adversarial Motion Priors (AMP): The core contribution is the introduction of Adversarial Motion Priors which, through adversarial imitation learning, learn a general, task-agnostic measure of motion similarity. This prior acts as a style-reward function, enabling a character to adopt behavioral characteristics from a dataset without needing explicit clip selection or sequence planning.

  • Automated Style Control from Unstructured Data: AMP provides a fully automated approach to learning low-level motion style from unstructured motion clips, eliminating the need for task-specific annotations, organization, or synchronization of the dataset. This significantly improves scalability to large and diverse motion repositories.

  • High-Quality and Diverse Motion Synthesis: The system produces high-quality, natural, and life-like motions comparable to state-of-the-art tracking-based techniques. It demonstrates the ability to learn highly dynamic and diverse motor skills, including acrobatic feats, for various complex simulated characters (humanoids, T-Rex, dogs).

  • Automatic Skill Composition and Generalization: The adversarial RL procedure automatically selects, interpolates, and generalizes different behaviors from the motion dataset. This allows for the emergence of complex skill compositions (e.g., walking and then punching, or running, leaping, and rolling to clear obstacles) in furtherance of high-level task objectives, all without a separate motion planner.

  • Stabilized Adversarial Training: The paper proposes several key design decisions, including the use of a Least-Squares Discriminator and a gradient penalty on discriminator observations, which are crucial for stabilizing the notoriously unstable adversarial training process and achieving consistent, high-quality results.

  • Decoupling Task and Style Specification: AMP provides a convenient interface that decouples the specification of what task a character should perform (via simple task rewards) from how it should perform it (via the learned motion style from examples). This allows characters to acquire more complex skills than those explicitly demonstrated in the original motion clips.

    The key finding is that by combining goal-conditioned reinforcement learning with an adversarially learned motion prior, it is possible to train robust and versatile physically simulated characters that perform complex tasks with natural and stylized behaviors, effectively bridging the gap between data-driven realism and physics-based control.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand AMP, a grasp of several core concepts in reinforcement learning, machine learning, and computer graphics is essential:

  • Physically Simulated Characters: These are virtual characters whose movements are governed by the laws of physics. Instead of simply playing back animations (kinematic methods), their bodies interact with the environment through forces, torques, gravity, and collisions. This provides physical realism but makes control more challenging.
  • Degrees of Freedom (DoF): Refers to the number of independent parameters that define the configuration of a physical system. For a character, this includes the position and orientation of its root (e.g., pelvis) and the rotational angles of its joints. A 34 DoF humanoid, for example, has 34 independent values describing its pose.
  • Proportional-Derivative (PD) Controllers: These are feedback control mechanisms widely used in robotics and character animation. For a joint, a PD controller attempts to reach a target position (e.g., a desired angle) by applying a torque proportional to both the position error (difference between current and target position) and the velocity error (difference between current and target velocity). They help the character's simulated muscles exert forces to achieve desired poses.
  • Reinforcement Learning (RL): A paradigm where an agent learns to make optimal decisions by interacting with an environment.
    • Agent: The entity that performs actions (e.g., the character's controller).
    • Environment: The world the agent interacts with (e.g., the physics simulation).
    • State ($\mathbf{s}$): A description of the current situation in the environment (e.g., character's joint angles, velocities, position).
    • Action ($\mathbf{a}$): A decision made by the agent that influences the environment (e.g., target joint angles for PD controllers).
    • Reward ($r$): A scalar feedback signal from the environment indicating how good or bad the agent's last action was. The agent's goal is to maximize cumulative reward.
    • Policy ($\pi$): A function or strategy that maps states to actions (or a distribution over actions). It dictates the agent's behavior.
    • Value Function ($V$): Estimates the expected future cumulative reward from a given state.
    • Expected Discounted Return ($J(\pi)$): The sum of future rewards, where later rewards are discounted by a factor $\gamma \in [0, 1)$. The agent aims to find a policy that maximizes this.
  • Goal-Conditioned Reinforcement Learning: An extension of standard RL where the policy also takes a goal $\mathbf{g}$ as input. This allows a single policy to learn a variety of behaviors to achieve different goals (e.g., walk to location A, then location B).
  • Generative Adversarial Networks (GANs): A class of generative models consisting of two neural networks, a generator and a discriminator, trained in a competitive "adversarial" process.
    • Generator: Tries to produce realistic data samples (e.g., images, motions) that mimic a real dataset.
    • Discriminator: Tries to distinguish between real data samples (from the dataset) and fake data samples (produced by the generator).
    • Adversarial Training: The generator tries to "fool" the discriminator, and the discriminator tries to get better at "unfooling" the generator. This minimax game drives both networks to improve, with the generator eventually learning to produce highly realistic data.
  • Adversarial Imitation Learning (AIL): Combines RL with GANs. Instead of manually designing a reward function for imitation, the discriminator learns to distinguish between the agent's behavior and expert demonstrations. The policy (agent's controller) is then trained using the discriminator's output as a reward, trying to make its behavior indistinguishable from the expert.
  • Dynamic Time Warping (DTW): An algorithm for measuring similarity between two temporal sequences that may vary in speed or duration. It finds an optimal alignment between two time series by "warping" one or both along the time axis. This is crucial for comparing the pose error of a simulated character's motion with a reference motion when their timings are not perfectly synchronized.
  • Exponential Map (for rotations): A way to represent 3D rotations as a 3D vector. The direction of the vector defines the axis of rotation, and its magnitude defines the angle of rotation. It's more compact than axis-angle or quaternion representations and avoids gimbal lock issues common with Euler angles.

3.2. Previous Works

The paper contextualizes its work by reviewing existing approaches for synthesizing natural motions for virtual characters:

  • Kinematic Methods: These methods typically do not use physics simulation explicitly. They primarily rely on datasets of human motion (mocap data) to generate animations.

    • Motion Graphs (Lee et al. 2002, 2010b; Agrawal and van de Panne 2016; Safonova and Hodgins 2007; Treuille et al. 2007): These techniques build a graph where nodes are poses and edges represent transitions. A controller then traverses this graph to stitch together motion clips.
    • Generative Models (Levine et al. 2012; Ye and Liu 2010; Holden et al. 2017; Ling et al. 2020; Zhang et al. 2018): Use models like Gaussian processes or neural networks (e.g., Phase-Functioned Neural Networks) to synthesize motions online.
    • Limitations: While capable of realistic motions given large datasets, their ability to generalize to novel situations or produce physically plausible interactions is limited by the training data. Collecting sufficient data for complex tasks or non-human characters is challenging.
  • Physics-Based Methods: These methods synthesize motions using a physics simulation and optimize controllers to achieve desired behaviors.

    • Optimization Techniques (Raibert and Hodgins 1991; Wampler et al. 2014; Mordatch et al. 2012; Tan et al. 2014): Trajectory optimization and reinforcement learning are used to find controllers that optimize an objective function.
    • Challenges: Designing effective objective functions that lead to natural and life-like behaviors is extremely difficult. Heuristics (symmetry, stability, effort minimization) are often incorporated but require careful tuning and may not be universally applicable. Even with biologically accurate actuators, motions can still appear unnatural.
  • Imitation Learning (Data-Driven Physics-Based Methods): These approaches combine physics simulation with reference motion data to improve motion quality.

    • Tracking Objectives (Da Silva et al. 2008; Kwon and Hodgins 2017; Lee et al. 2010a; Sharon and van de Panne 2005; Zordan and Hodgins 2002; Liu et al. 2016, 2010; Peng et al. 2018a): The most common method. The controller tries to minimize the pose error between the simulated character and target poses from a reference motion.
      • Synchronization: For individual clips, a phase variable (Lee et al. 2019; Peng et al. 2018a,b) can synchronize the character with the reference.
      • Scaling to Datasets: More recent methods provide target poses as inputs to the controller (Bergamin et al. 2019; Chentanez et al. 2018; Park et al. 2019; Won et al. 2020).
      • Limitations: Still requires a high-level motion planner (Bergamin et al. 2019; Park et al. 2019; Peng et al. 2017) to select which motion clip to track for a given task, introducing significant overhead. Also, manually designed pose error metrics are hard to tune across diverse skills.
  • Adversarial Imitation Learning (AIL) in Motion:

    • Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon 2016; Ziebart et al. 2008): This is a key precursor. Instead of manual objectives, GAIL trains a discriminator to distinguish between agent-generated behaviors and expert demonstrations. The policy then optimizes the discriminator's output as a reward.
      • Standard GAIL: Requires access to the demonstrator's actions, which are often unavailable in motion clips. The Imitation from Observations extension (Torabi et al. 2018) allows training on state transitions $(\mathbf{s}, \mathbf{s}')$ only.
      • Previous AIL Limitations: Often unstable, and resulting motion quality lagged behind tracking-based methods (Merel et al. 2017; Wang et al. 2017). Peng et al. [2019b] improved realism with an information bottleneck but still needed a phase variable, limiting it to single-motion imitation.
  • Latent Space Models (as motion priors):

    • General Latent Space Control (Burgard et al. 2008; Florensa et al. 2017; Hausman et al. 2018; Heess et al. 2016): Encodes behaviors into a low-dimensional latent representation, which is then mapped to controls. Can be used in hierarchical control.
    • For Character Control (Merel et al. 2019; Peng et al. 2019a): Latent representations trained from reference motions constrain character behavior.
    • Limitations: Realism is enforced implicitly. High-level controllers can still specify latent encodings that lead to unnatural behaviors. Luo et al. [2020] used an adversarial domain confusion loss in latent space, but this doesn't directly enforce similarity on the actual motions.

3.3. Technological Evolution

The field of character animation has evolved from purely artistic (keyframe animation) to physics-based (simulating natural dynamics), and then to data-driven methods (leveraging motion capture).

  1. Kinematic Era: Early data-driven methods focused on kinematic playback and motion graph techniques, offering realism but lacking physical interaction and generalization.
  2. Physics-Based Era: The introduction of physics engines allowed for physically plausible motions, but control was difficult, often relying on trajectory optimization or RL with hand-crafted reward functions, which struggled with naturalness.
  3. Hybrid (Tracking) Era: The convergence of data-driven and physics-based methods led to motion tracking, where physics-based characters try to imitate reference motion data. This improved realism and physical plausibility but introduced challenges in motion selection (motion planners) and reward engineering.
  4. Adversarial Era: The rise of Generative Adversarial Networks provided a new paradigm for learning complex distributions. Adversarial Imitation Learning (GAIL) promised to automate reward design for imitation but initially suffered from instability and lower motion quality in complex physics-based scenarios.
  5. AMP's Position: AMP sits at the forefront of this evolution, specifically addressing the limitations of prior AIL and tracking-based methods. It refines AIL to be robust enough for high-fidelity, physics-based character control from unstructured data, moving beyond the need for manual reward design and motion planners. It also improves upon latent space models by directly enforcing motion similarity through the reward function.

3.4. Differentiation Analysis

AMP distinguishes itself from previous methods primarily in its approach to motion style specification, scalability to unstructured data, and stability of adversarial training:

  • Compared to Tracking-Based Methods (e.g., Peng et al. 2018a; Park et al. 2019):

    • Reward Design: Tracking methods rely on manually designed pose error metrics to explicitly minimize the difference between the character's pose and a specific reference pose. AMP replaces this with a learned, task-agnostic style-reward from an adversarial discriminator, obviating manual engineering and tuning.
    • Motion Selection: Tracking methods often require a high-level motion planner or explicit phase variables to select and synchronize with a particular reference motion clip. AMP, through its adversarial motion prior, automatically selects, interpolates, and generalizes from a large dataset of unstructured motions, without any explicit synchronization or clip selection mechanism.
    • Dataset Structure: Tracking methods typically need annotated and organized motion clips, or at least a mechanism to feed specific target poses. AMP works directly with raw, unstructured motion clips.
    • Task Coupling: Tracking methods implicitly couple task and style (e.g., a "running" reward means tracking a running clip). AMP explicitly decouples task objectives from style objectives, allowing a character to perform novel tasks in a learned style.
  • Compared to Previous Adversarial Imitation Learning (AIL) Systems (e.g., Merel et al. 2017; Wang et al. 2017; Peng et al. 2019b):

    • Stability and Quality: Earlier AIL methods for physics-based characters were notoriously unstable and produced lower fidelity motions. AMP introduces several critical design decisions (Least-Squares Discriminator, gradient penalty, specific discriminator observations) that significantly improve training stability and yield high-quality, full-body motions comparable to state-of-the-art tracking methods.
    • Synchronization: Unlike Peng et al. [2019b], which still required a phase variable for synchronization, limiting it to single-motion imitation, AMP does not require any synchronization between the policy and reference motion. This is key to its ability to learn from large, diverse datasets.
    • General Motion Prior: AMP learns a truly general motion prior from unstructured datasets, rather than being limited to imitating a single motion per policy.
  • Compared to Latent Space Models (e.g., Peng et al. 2019a; Merel et al. 2020):

    • Direct vs. Implicit Enforcement: Latent space models enforce motion style implicitly by constraining actions through a learned latent representation. This can still lead to unnatural behaviors if the high-level controller specifies latent encodings outside the pre-trained distribution. AMP, by contrast, directly enforces similarity between the character's actual motions and the reference dataset through its discriminator reward, resulting in higher fidelity motions.

    • Training Phases: Latent space models often require a separate pre-training phase for the low-level controller. AMP's motion prior can be trained jointly with the policy, simplifying the overall training pipeline.

      In essence, AMP leverages the strengths of adversarial learning to create a task-agnostic motion prior that functions as a powerful, learned style-reward, overcoming the scalability and reward-engineering challenges of previous imitation learning techniques in physics-based character animation.

4. Methodology

The AMP system aims to synthesize a control policy that enables a character to achieve task objectives in a physically simulated environment while exhibiting behaviors that resemble a given dataset of reference motions. It decouples the "what" (task) from the "how" (style) of character behavior.

4.1. Principles

The core idea is to combine goal-conditioned reinforcement learning (RL) with an adversarial motion prior.

  1. Task Objective ($r^G$): High-level goals are defined by relatively simple, manually designed reward functions specific to the task (e.g., moving to a target location).

  2. Style Objective ($r^S$): Low-level motion style is learned from a dataset of unstructured motion clips using an adversarial discriminator. This discriminator becomes the motion prior, providing a reward signal that encourages the character's movements to be indistinguishable from the reference motions.

  3. Combined Reward: The policy is trained using a composite reward that linearly combines the task and style objectives.

  4. Automatic Skill Composition: The adversarial training allows the character to automatically select, interpolate, and generalize behaviors from the motion dataset to fulfill tasks, without requiring explicit motion planners or clip annotations.

    The overall system architecture is depicted in Figure 2 from the original paper:

    Fig. 2. Schematic overview of the system. Given a motion dataset defining a desired motion style for the character, the system trains a motion prior that specifies style-rewards $r_t^S$ for the policy during training. These style-rewards are combined with task-rewards $r_t^G$ and used to train a policy that enables a simulated character to satisfy task-specific goals $\mathbf{g}$ while also adopting behaviors that resemble the reference motions in the dataset.

As shown in the figure, a given motion dataset defines the desired motion style. This dataset is used to train a motion prior (the discriminator) which, in turn, provides style-rewards ($r_t^S$) to the policy. These style-rewards are combined with task-rewards ($r_t^G$) that define task-specific goals ($\mathbf{g}$). The reinforcement learning agent (policy) then learns to generate actions ($\mathbf{a}_t$) that drive the simulated character, aiming to satisfy both the task goals and the desired motion style.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Reward Function Formulation

The total reward $r_t$ at each time step $t$ is a combination of two components: $ r(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, \mathbf{g}) = w^G r^G(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, \mathbf{g}) + w^S r^S(\mathbf{s}_t, \mathbf{s}_{t+1}) $ Where:

  • $r^G(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, \mathbf{g})$: The task-specific reward (the "what"). It defines high-level objectives that the character should satisfy, conditioned on the current state $\mathbf{s}_t$, action $\mathbf{a}_t$, next state $\mathbf{s}_{t+1}$, and goal $\mathbf{g}$.

  • $r^S(\mathbf{s}_t, \mathbf{s}_{t+1})$: The style-reward (the "how"). It is a learned, task-agnostic reward component derived from the adversarial motion prior, specifying the low-level behavioral style.

  • $w^G$ and $w^S$: Manually specified weights for balancing the task and style objectives.

    The main challenge addressed by AMP is learning an effective $r^S$ that leads to naturalistic behaviors conforming to a particular style, which is the function of the Adversarial Motion Prior. A minimal sketch of how the two reward terms are combined is shown below.

4.2.2. Adversarial Motion Prior (AMP)

The style-reward is modeled using a learned discriminator DD, which is referred to as an adversarial motion prior. Its purpose is to quantify the similarity of the character's motion to the motions in the dataset, without explicit comparison to a specific clip.

4.2.2.1. Imitation from Observations

Traditional Generative Adversarial Imitation Learning (GAIL) typically requires access to both states ($\mathbf{s}$) and actions ($\mathbf{a}$) from expert demonstrations. However, when working with motion clips, only states (poses) are observed, and the underlying actions of the demonstrator are unknown. To adapt GAIL to this state-only setting, the discriminator is trained on state transitions $(\mathbf{s}_t, \mathbf{s}_{t+1})$ rather than state-action pairs $(\mathbf{s}_t, \mathbf{a}_t)$.

The objective for training the discriminator in a standard GAN framework with sigmoid cross-entropy loss would be: $ \arg\min_{D} \; -\mathbb{E}_{d^M(\mathbf{s}, \mathbf{s}')} \left[ \log D(\mathbf{s}, \mathbf{s}') \right] - \mathbb{E}_{d^\pi(\mathbf{s}, \mathbf{s}')} \left[ \log \left( 1 - D(\mathbf{s}, \mathbf{s}') \right) \right] $ Where:

  • $D(\mathbf{s}, \mathbf{s}')$: The discriminator's output, a scalar value representing the probability that the state transition from $\mathbf{s}$ to $\mathbf{s}'$ is "real" (from the dataset).

  • $d^M(\mathbf{s}, \mathbf{s}')$: The distribution of state transitions in the reference motion dataset $M$.

  • $d^\pi(\mathbf{s}, \mathbf{s}')$: The distribution of state transitions generated by following the policy $\pi$.

  • $\mathbb{E}[\cdot]$: Expected value.

  • $\log(\cdot)$: Natural logarithm.

    This objective encourages the discriminator to output 1 for real transitions and 0 for fake transitions.

4.2.2.2. Least-Squares Discriminator

The standard sigmoid cross-entropy loss in GANs can suffer from vanishing gradients, hindering policy training. To improve stability and performance, AMP adopts the Least-Squares GAN (LSGAN) objective [Mao et al. 2017]. The discriminator is trained using a least-squares regression to predict a score of 1 for real samples (from the dataset) and -1 for fake samples (from the policy).

The LSGAN objective for training the discriminator is given by: $ \arg\min_{D} \; \mathbb{E}_{d^M(\mathbf{s}, \mathbf{s}')} \left[ \left( D(\mathbf{s}, \mathbf{s}') - 1 \right)^2 \right] + \mathbb{E}_{d^\pi(\mathbf{s}, \mathbf{s}')} \left[ \left( D(\mathbf{s}, \mathbf{s}') + 1 \right)^2 \right] $ Where:

  • $D(\mathbf{s}, \mathbf{s}')$: The discriminator's output, a scalar value.

  • $d^M(\mathbf{s}, \mathbf{s}')$: The distribution of real state transitions from the dataset.

  • $d^\pi(\mathbf{s}, \mathbf{s}')$: The distribution of fake state transitions generated by the policy.

  • $(D(\mathbf{s}, \mathbf{s}') - 1)^2$: Penalty for real samples not predicted as 1.

  • $(D(\mathbf{s}, \mathbf{s}') + 1)^2$: Penalty for fake samples not predicted as -1.

    After the discriminator is trained, the style-reward for the policy is derived from its output. This reward encourages the policy to generate transitions that the discriminator classifies as "real" (i.e., close to 1). The paper uses the following specific form for the style-reward (a small code sketch of this mapping follows the list below): $ r^S(\mathbf{s}_t, \mathbf{s}_{t+1}) = \max \left[ 0, \; 1 - 0.25 \left( D(\mathbf{s}_t, \mathbf{s}_{t+1}) - 1 \right)^2 \right] $ Where:

  • $D(\mathbf{s}_t, \mathbf{s}_{t+1})$: The discriminator's raw output for the policy's generated state transition.

  • $(D(\mathbf{s}_t, \mathbf{s}_{t+1}) - 1)^2$: This term measures how far the discriminator's output is from the target value of 1 (which it assigns to real samples). The policy tries to minimize this, pushing $D(\mathbf{s}_t, \mathbf{s}_{t+1})$ towards 1.

  • $0.25$: A scaling factor.

  • $1 - 0.25(\cdot)^2$: Transforms the error into a reward signal, where a smaller error (discriminator output closer to 1) yields a higher reward.

  • $\max[0, \cdot]$: Clips the reward to be non-negative, ensuring it stays within $[0, 1]$, as is common practice in RL.
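As a concrete illustration, here is a minimal sketch of the score-to-reward mapping described above; the function name and the example score are ours, not taken from the paper's code.

```python
def style_reward(d_score: float) -> float:
    """Map a raw least-squares discriminator score to the AMP style reward.

    Scores near +1 (classified as "real") yield rewards near 1;
    scores at or below -1 (clearly "fake") yield a reward of 0.
    """
    return max(0.0, 1.0 - 0.25 * (d_score - 1.0) ** 2)

# Example: a transition scored 0.5 by the discriminator earns a reward of ~0.94.
print(style_reward(0.5))
```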

4.2.2.3. Discriminator Observations

The choice of features given to the discriminator is crucial for it to provide effective feedback to the policy. Before a state transition $(\mathbf{s}_t, \mathbf{s}_{t+1})$ is input to the discriminator, an observation map $\Phi(\mathbf{s})$ extracts a compact set of motion-relevant features. The discriminator then operates on $D(\Phi(\mathbf{s}), \Phi(\mathbf{s}'))$. These features are designed to be task-agnostic.

The set of features for $\Phi(\mathbf{s})$ includes the following (a sketch of such an observation map follows the list):

  • Linear velocity and angular velocity of the root: The root is typically the character's pelvis. These velocities are represented in the character's local coordinate frame.

  • Local rotation of each joint: The 3D rotation of each spherical joint is encoded using two 3D vectors (normal and tangent) in its coordinate frame, providing a smooth and unique representation.

  • Local velocity of each joint: The velocity of each joint relative to its parent or the root.

  • 3D positions of the end-effectors: Such as hands and feet, represented in the character's local coordinate frame.

    The character's local coordinate frame is defined with the origin at the root, the X-axis along the root's facing direction, and the Y-axis aligned with the global up vector.
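The sketch below assembles these features for a toy character representation; the field names (root_lin_vel, joint_rotations_6d, etc.) and the heading-aligned local-frame transform are assumptions for illustration and do not mirror the paper's code.

```python
import numpy as np

def to_local_frame(vec_world, heading_yaw):
    """Rotate a world-space 3D vector into the character's heading-aligned local frame
    (X along the root's facing direction, Y as the global up axis)."""
    c, s = np.cos(-heading_yaw), np.sin(-heading_yaw)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return rot @ vec_world

def observation_map(root_lin_vel, root_ang_vel, joint_rotations_6d,
                    joint_velocities, end_effector_pos, root_pos, heading_yaw):
    """Phi(s): concatenate the task-agnostic motion features fed to the discriminator."""
    feats = [
        to_local_frame(root_lin_vel, heading_yaw),      # root linear velocity (local frame)
        to_local_frame(root_ang_vel, heading_yaw),      # root angular velocity (local frame)
        joint_rotations_6d.reshape(-1),                 # per-joint normal/tangent rotation encoding
        joint_velocities.reshape(-1),                   # per-joint local velocities
        np.concatenate([to_local_frame(p - root_pos, heading_yaw)
                        for p in end_effector_pos]),    # end-effector positions relative to root
    ]
    return np.concatenate(feats)
```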

4.2.2.4. Gradient Penalty

GANs are prone to unstable training dynamics, often due to function approximation errors in the discriminator. Specifically, the discriminator might assign non-zero gradients on the manifold of real data samples, which can cause the generator (policy) to overshoot the data manifold instead of converging to it, leading to oscillations. To stabilize training, a gradient penalty is applied, which penalizes non-zero gradients of the discriminator output with respect to its input features on samples from the dataset.

The discriminator objective with the gradient penalty is: $ \arg\min_{D} \; \mathbb{E}_{d^M(\mathbf{s}, \mathbf{s}')} \left[ \left( D(\Phi(\mathbf{s}), \Phi(\mathbf{s}')) - 1 \right)^2 \right] + \mathbb{E}_{d^\pi(\mathbf{s}, \mathbf{s}')} \left[ \left( D(\Phi(\mathbf{s}), \Phi(\mathbf{s}')) + 1 \right)^2 \right] + \frac{w^{\mathrm{gp}}}{2} \, \mathbb{E}_{d^M(\mathbf{s}, \mathbf{s}')} \left[ \left\| \nabla_{\phi} D(\phi) \big|_{\phi = (\Phi(\mathbf{s}), \Phi(\mathbf{s}'))} \right\|^2 \right] $ Where (see the code sketch after this list):

  • The first two terms are the LSGAN objective described previously.
  • $\frac{w^{\mathrm{gp}}}{2} \, \mathbb{E}_{d^M(\mathbf{s}, \mathbf{s}')} \left[ \left\| \nabla_{\phi} D(\phi) \big|_{\phi = (\Phi(\mathbf{s}), \Phi(\mathbf{s}'))} \right\|^2 \right]$: The gradient penalty term.
  • $w^{\mathrm{gp}}$: A manually specified coefficient (weight) for the gradient penalty (e.g., 10).
  • $\nabla_\phi D(\phi)$: The gradient of the discriminator's output $D$ with respect to its input features $\phi = (\Phi(\mathbf{s}), \Phi(\mathbf{s}'))$.
  • $\|\cdot\|^2$: The squared Euclidean norm of the gradient.
  • The gradient penalty is calculated on samples from the real data distribution $d^M(\mathbf{s}, \mathbf{s}')$, not interpolated samples as in WGAN-GP. This encourages the discriminator's gradients to be close to zero on the actual data manifold.
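A minimal PyTorch sketch of this objective, assuming a simple MLP discriminator and pre-computed feature batches, might look as follows; the network architecture and batch handling are illustrative rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, obs_dim: int, hidden=(1024, 512)):
        super().__init__()
        layers, d = [], 2 * obs_dim  # input is the concatenated pair (Phi(s), Phi(s'))
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        layers += [nn.Linear(d, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, phi_pair):
        return self.net(phi_pair).squeeze(-1)

def discriminator_loss(disc, real_pairs, fake_pairs, w_gp=10.0):
    """LSGAN loss (targets +1 / -1) plus a gradient penalty computed on real samples only."""
    real_pairs = real_pairs.clone().requires_grad_(True)
    d_real = disc(real_pairs)
    d_fake = disc(fake_pairs)
    lsgan = ((d_real - 1.0) ** 2).mean() + ((d_fake + 1.0) ** 2).mean()

    # Penalize non-zero discriminator gradients on the real-data manifold.
    grad = torch.autograd.grad(outputs=d_real.sum(), inputs=real_pairs,
                               create_graph=True)[0]
    grad_pen = 0.5 * w_gp * (grad.norm(2, dim=-1) ** 2).mean()
    return lsgan + grad_pen
```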

4.2.3. Model Representation

4.2.3.1. States and Actions

  • State ($\mathbf{s}_t$): Consists of features describing the character's body configuration:
    • Relative positions of each link with respect to the root (pelvis).
    • Rotation of each link, represented using a 6D normal-tangent encoding.
    • Linear and angular velocities of each link. All features are expressed in the character's local coordinate system. Unlike prior methods, no phase information or target poses are included, as AMP does not explicitly track specific reference motions.
  • Action ($\mathbf{a}_t$): Specifies target positions for Proportional-Derivative (PD) controllers located at each of the character's joints.
    • For spherical joints (e.g., hip, shoulder), the target is a 3D exponential map $\mathbf{q} \in \mathbb{R}^3$. The rotation axis $\mathbf{v}$ and rotation angle $\theta$ are determined by: $ \mathbf{v} = \frac{\mathbf{q}}{\|\mathbf{q}\|_2}, \qquad \theta = \|\mathbf{q}\|_2 $ Where:
      • $\mathbf{q}$: The 3D vector representing the exponential map.
      • $\|\mathbf{q}\|_2$: The Euclidean norm (magnitude) of $\mathbf{q}$.
      • $\mathbf{v}$: The unit vector representing the rotation axis.
      • $\theta$: The rotation angle. This representation is compact and avoids gimbal lock.
    • For revolute joints (e.g., knee), the target is a 1D rotation angle $q = \theta$ (a conversion sketch for the exponential-map case follows below).
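For illustration, a minimal NumPy conversion from an exponential-map action to an axis-angle pair (and a quaternion) might look like this; the quaternion convention (w, x, y, z) is an assumption.

```python
import numpy as np

def exp_map_to_axis_angle(q, eps=1e-8):
    """Decompose a 3D exponential-map vector into a unit rotation axis and an angle."""
    angle = np.linalg.norm(q)
    if angle < eps:
        return np.array([1.0, 0.0, 0.0]), 0.0  # near-zero rotation: axis is arbitrary
    return q / angle, angle

def exp_map_to_quaternion(q):
    """Convert an exponential-map vector to a unit quaternion (w, x, y, z)."""
    axis, angle = exp_map_to_axis_angle(q)
    half = 0.5 * angle
    return np.concatenate([[np.cos(half)], np.sin(half) * axis])

# Example: a 90-degree rotation about the y-axis.
print(exp_map_to_quaternion(np.array([0.0, np.pi / 2, 0.0])))
```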

4.2.3.2. Network Architecture

All neural networks (policy, value function, discriminator) are fully-connected networks with ReLU activation functions.

  • Policy ($\pi$): Maps a given state $\mathbf{s}_t$ and goal $\mathbf{g}$ to a Gaussian distribution over actions $\pi(\mathbf{a}_t|\mathbf{s}_t, \mathbf{g}) = \mathcal{N}(\mu(\mathbf{s}_t, \mathbf{g}), \Sigma)$.
    • $\mu(\mathbf{s}_t, \mathbf{g})$: The mean of the Gaussian distribution, determined by a network with two hidden layers (1024 and 512 units).
    • $\Sigma$: A fixed diagonal covariance matrix, manually specified.
  • Value Function $V(\mathbf{s}_t, \mathbf{g})$ and Discriminator $D(\mathbf{s}_t, \mathbf{s}_{t+1})$: Modeled by separate networks with architectures similar to the policy (a sketch of the policy network follows below).
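A minimal PyTorch sketch of such a Gaussian policy, with the 1024/512 ReLU hidden layers described above and a fixed diagonal covariance, is shown below; the log-std value and the input/output dimensions are assumed placeholders.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim: int, goal_dim: int, action_dim: int, log_std=-1.0):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )
        # Fixed (non-learned) diagonal covariance, specified via a constant log-std.
        self.register_buffer("log_std", torch.full((action_dim,), log_std))

    def forward(self, state, goal):
        mean = self.mean_net(torch.cat([state, goal], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())

# Sampling an action for a single (state, goal) pair; dimensions are placeholders.
policy = GaussianPolicy(state_dim=105, goal_dim=3, action_dim=28)
dist = policy(torch.zeros(1, 105), torch.zeros(1, 3))
action = dist.sample()
```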

4.2.4. Training

AMP policies are trained using a combination of Generative Adversarial Imitation Learning (GAIL) and Proximal Policy Optimization (PPO) [Schulman et al. 2017].

Algorithm 1 in the paper outlines the training process:

 1: input M: dataset of reference motions
 2: D ← initialize discriminator
 3: π ← initialize policy
 4: V ← initialize value function
 5: B ← ∅ initialize replay buffer
 6: while not done do
 7:   for trajectory i = 1, ..., m do
 8:     τ^i ← { (s_t, a_t, r_t^G)_{t=0}^{T-1}, s_T, g }   // collect trajectory using policy π
 9:     for time step t = 0, ..., T-1 do
10:       d_t ← D(Φ(s_t), Φ(s_{t+1}))                     // query discriminator
11:       r_t^S ← style reward computed from d_t (Equation 7)
12:       r_t ← w^G r_t^G + w^S r_t^S                     // combine rewards
13:       record r_t in τ^i
14:     end for
15:     store τ^i in replay buffer B
16:   end for
17:   for update step = 1, ..., n do
18:     b^M ← sample batch of K transitions {(s_j, s_j')} from M   // real samples
19:     b^π ← sample batch of K transitions {(s_j, s_j')} from B   // fake samples
20:     update D according to Equation 8 using b^M and b^π
21:   end for
22:   update V and π using data from trajectories {τ^i}_{i=1}^m
23: end while

Detailed Steps:

  1. Initialization (Lines 1-5): Initialize the motion dataset $M$, discriminator $D$, policy $\pi$, value function $V$, and an empty replay buffer $\mathcal{B}$.
  2. Trajectory Collection (Lines 7-16):
    • The policy $\pi$ interacts with the environment for $m$ trajectories (episodes).
    • For each time step in a trajectory:
      • The task-reward $r_t^G$ is obtained from the environment based on the current goal $\mathbf{g}$.
      • The discriminator $D$ is queried with the observed state transition $(\mathbf{s}_t, \mathbf{s}_{t+1})$ (after applying the observation map $\Phi$) to get a raw score $d_t$.
      • The style-reward $r_t^S$ is calculated from $d_t$ using Equation 7.
      • The total reward $r_t$ is computed by linearly combining $r_t^G$ and $r_t^S$ (Equation 4).
    • Each collected trajectory $\tau^i$ (including states, actions, and the computed total rewards) is stored in the replay buffer $\mathcal{B}$.
  3. Discriminator Update (Lines 17-21):
    • For $n$ update steps, the discriminator $D$ is updated.
    • Two mini-batches of $K$ state transitions are sampled: $b^M$ from the real motion dataset $M$, and $b^\pi$ from the replay buffer $\mathcal{B}$ (containing policy-generated transitions from current and past iterations).
    • The discriminator $D$ is updated using the LSGAN objective with the gradient penalty (Equation 8), aiming to distinguish between real and fake transitions. The replay buffer helps prevent overfitting to only the most recent policy behaviors.
  4. Policy and Value Function Update (Line 22):
    • The policy $\pi$ and value function $V$ are updated with the trajectories $\{\tau^i\}$ collected in the current iteration using PPO.
    • The value function is updated with target values computed using TD($\lambda$) [Sutton and Barto 1998].
    • The policy is updated using advantages computed with Generalized Advantage Estimation, GAE($\lambda$) [Schulman et al. 2015].

Additional Training Details (a short sketch of these mechanisms follows the list):

  • Reference State Initialization: To aid exploration, characters are initialized to states sampled randomly from all motion clips in the dataset.
  • Early Termination: For most tasks, an episode terminates early if any part of the character's body (except feet) touches the ground. This encourages stable upright locomotion. For contact-rich tasks (e.g., rolling), this criterion is disabled.
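The sketch below illustrates these two mechanisms under simple assumptions; the clip data structure and the contact query are hypothetical stand-ins for whatever the simulator provides.

```python
import random

def reference_state_init(motion_clips):
    """Reference State Initialization: start an episode from a random frame of a random clip."""
    clip = random.choice(motion_clips)   # each clip: a list of character states (poses + velocities)
    return random.choice(clip)           # sampled state used to initialize the simulated character

def should_terminate_early(contact_bodies, allowed=("left_foot", "right_foot")):
    """Early Termination: end the episode if any body other than the feet touches the ground."""
    return any(body not in allowed for body in contact_bodies)

# Example: a hand touching the ground triggers termination.
print(should_terminate_early({"left_foot", "right_hand"}))  # True
```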

5. Experimental Setup

The effectiveness of AMP is evaluated on a suite of challenging motion control tasks involving complex 3D simulated characters.

5.1. Datasets

Reference motion clips are a combination of:

  • Public mocap libraries (e.g., CMU Graphics Lab Motion Capture Database, SFU Motion Capture Database).

  • Custom recorded mocap clips.

  • Artist-authored keyframe animations (e.g., for the T-Rex).

    The datasets are described as unstructured, meaning they are raw collections of clips without task-specific annotations or segmentation into individual skills. This is a key feature that AMP aims to leverage.

The following are the summary statistics of the different datasets used to train the motion priors (Table 2 from the original paper):

| Character | Dataset | Size (s) | Clips | Subjects |
| --- | --- | --- | --- | --- |
| Humanoid | Cartwheel | 13.6 | 3 | 1 |
| Humanoid | Jump | 28.6 | 10 | 4 |
| Humanoid | Locomotion | 434.1 | 56 | 8 |
| Humanoid | Run | 204.4 | 47 | 3 |
| Humanoid | Run + Leap + Roll | 22.1 | 10 | 7 |
| Humanoid | Stealthy | 136.5 | 3 | 1 |
| Humanoid | Walk | 229.6 | 9 | 5 |
| Humanoid | Walk + Punch | 247.8 | 15 | 9 |
| Humanoid | Zombie | 18.3 | 1 | 1 |
| T-Rex | Locomotion | 10.5 | 5 | 1 |

Example of data use: For the Target Heading task with a Locomotion dataset, the clips would contain various walking, running, and jogging motions. For Strike with Walk + Punch dataset, the clips would consist of walking motions and separate punching motions. The system automatically learns to compose these.

5.2. Evaluation Metrics

The paper uses two primary metrics to evaluate performance:

  1. Normalized Task Return:

    • Conceptual Definition: This metric quantifies how well the policy achieves its high-level task objectives. It normalizes the cumulative task-specific reward collected over an episode to a range between 0 and 1, where 0 is the minimum possible return and 1 is the maximum possible return for that task. It focuses solely on the success of the task, independent of motion style.
    • Mathematical Formula: (Not explicitly provided in the paper, but implied by "normalized task return, with 0 being the minimum possible return per episode and 1 being the maximum possible return.") Let $R_T$ be the total task-specific reward accumulated over an episode, $R_{T,\min}$ the minimum possible task reward, and $R_{T,\max}$ the maximum possible task reward. $ \text{Normalized Task Return} = \frac{R_T - R_{T,\min}}{R_{T,\max} - R_{T,\min}} $
    • Symbol Explanation:
      • $R_T$: The sum of task-specific rewards $r^G$ for a given episode.
      • $R_{T,\min}$: The minimum possible sum of task-specific rewards for an episode.
      • $R_{T,\max}$: The maximum possible sum of task-specific rewards for an episode.
  2. Pose Error (for single-clip imitation):

    • Conceptual Definition: This metric quantifies the fidelity of the simulated character's motion compared to a specific reference motion. It measures the average positional difference between corresponding joints of the simulated character and the reference motion. The paper states this metric better aligns with human perception of motion similarity.
    • Mathematical Formula: $ e_t^{\mathrm{pose}} = \frac{1}{N^{\mathrm{joint}}} \sum_{j \in \mathrm{joints}} \left\| \left( \mathbf{x}_t^j - \mathbf{x}_t^{\mathrm{root}} \right) - \left( \hat{\mathbf{x}}_t^j - \hat{\mathbf{x}}_t^{\mathrm{root}} \right) \right\|_2 $ The average pose error for an episode is then taken over time. Crucially, before computing this error, Dynamic Time Warping (DTW) is applied to align the reference motion with the simulated character's motion (a sketch of this computation follows the list).
    • Symbol Explanation:
      • $e_t^{\mathrm{pose}}$: The pose error at time step $t$.
      • $N^{\mathrm{joint}}$: The total number of joints in the character's body.
      • $j$: Index for a specific joint.
      • $\mathbf{x}_t^j$: The 3D Cartesian position of joint $j$ in the simulated character at time $t$.
      • $\mathbf{x}_t^{\mathrm{root}}$: The 3D Cartesian position of the root joint (pelvis) in the simulated character at time $t$.
      • $(\mathbf{x}_t^j - \mathbf{x}_t^{\mathrm{root}})$: The position of joint $j$ relative to the root in the simulated character.
      • $\hat{\mathbf{x}}_t^j$: The 3D Cartesian position of joint $j$ in the reference motion at time $t$.
      • $\hat{\mathbf{x}}_t^{\mathrm{root}}$: The 3D Cartesian position of the root joint in the reference motion at time $t$.
      • $(\hat{\mathbf{x}}_t^j - \hat{\mathbf{x}}_t^{\mathrm{root}})$: The position of joint $j$ relative to the root in the reference motion.
      • $\|\cdot\|_2$: The Euclidean (L2) norm, measuring the distance between the relative joint positions.
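For illustration, a minimal NumPy sketch of the per-frame pose error and a standard DTW alignment (not necessarily the exact DTW variant used in the paper) might look like this:

```python
import numpy as np

def pose_error(sim_joints, sim_root, ref_joints, ref_root):
    """Average root-relative joint position error between one simulated and one reference frame.
    sim_joints, ref_joints: (N_joint, 3) arrays; sim_root, ref_root: (3,) arrays."""
    rel_sim = sim_joints - sim_root
    rel_ref = ref_joints - ref_root
    return float(np.mean(np.linalg.norm(rel_sim - rel_ref, axis=-1)))

def dtw_average_error(sim_seq, ref_seq, frame_cost):
    """Align two sequences with dynamic time warping and return the average
    frame cost along the optimal alignment path."""
    n, m = len(sim_seq), len(ref_seq)
    cost = np.full((n + 1, m + 1), np.inf)
    length = np.zeros((n + 1, m + 1), dtype=int)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = frame_cost(sim_seq[i - 1], ref_seq[j - 1])
            prev = [(cost[i - 1, j - 1], length[i - 1, j - 1]),
                    (cost[i - 1, j],     length[i - 1, j]),
                    (cost[i, j - 1],     length[i, j - 1])]
            best_cost, best_len = min(prev, key=lambda p: p[0])
            cost[i, j] = c + best_cost
            length[i, j] = best_len + 1
    return cost[n, m] / max(length[n, m], 1)

# Example usage: frames as (joints, root) tuples
# err = dtw_average_error(sim_frames, ref_frames,
#                         lambda a, b: pose_error(a[0], a[1], b[0], b[1]))
```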

5.3. Baselines

AMP is compared against several baseline methods to highlight its advantages:

  1. Latent Space Models:

    • Approach: This baseline aims to learn a motion prior indirectly through a latent representation. A low-level controller is first pre-trained using a motion tracking objective to imitate the same reference motions used for AMP. This controller learns to map a latent variable ($\mathbf{z}_t$) to actions. Then, a separate high-level controller is trained for each downstream task, which specifies these latent variables to the fixed low-level controller.
    • Relevance: Represents methods that use hierarchical control and implicit style constraints.
    • Specifics (from Appendix C): An encoder $q(\mathbf{z}_t | \mathbf{g}_t)$ maps a goal to a distribution over latent variables. A latent encoding $\mathbf{z}_t \sim q(\mathbf{z}_t | \mathbf{g}_t)$ is sampled and passed to the policy $\pi(\mathbf{a}_t | \mathbf{s}_t, \mathbf{z}_t)$. The encoder and policy are trained jointly using an objective that includes a KL-regularizer (similar to Variational Autoencoders) with respect to a variational prior $p_0(\mathbf{z}_t) = \mathcal{N}(0, I)$. After pre-training, $\pi$ is fixed, and a new high-level controller $u(\mathbf{z}_t | \mathbf{s}_t, \mathbf{g}_t)$ is trained for each task. Pre-training uses a motion imitation task where $\mathbf{g}_t = (\hat{q}_{t+1}, \hat{q}_{t+2})$ specifies target poses. The latent encoding is 16-dimensional.
  2. No Data (Baseline from Scratch):

    • Approach: Policies are trained from scratch for each task using only the task-specific reward ($r^G$), without leveraging any motion data or motion prior.
    • Relevance: This baseline demonstrates the difficulty of achieving natural behaviors solely through manually designed task objectives, highlighting the value of data-driven style guidance.
  3. Motion Tracking (Peng et al. 2018a):

    • Approach: This is a state-of-the-art tracking-based imitation learning method. It uses a manually designed reward function that explicitly minimizes pose error between the simulated character and a reference motion. It requires a phase variable to synchronize the policy with the reference motion.
    • Relevance: Serves as a strong baseline for evaluating AMP's ability to produce high-quality motions, especially in single-clip imitation tasks, without the need for manual reward engineering or synchronization.

Simulation Details:

  • Physics Engine: Bullet physics engine [Coumans et al. 2013].
  • Simulation Frequency: 1.2 kHz.
  • Policy Query Rate: 30 Hz.
  • Actuation: PD controllers at character joints.
  • Training Time: 100-300 million samples, 30-140 hours on 16 CPU cores.
  • Gradient Penalty Coefficient ($w^{\mathrm{gp}}$): Set to 10.

6. Results & Analysis

6.1. Core Results Analysis

The AMP framework is demonstrated on a diverse cast of complex simulated characters (34 DoF humanoid, 59 DoF T-Rex, 64 DoF dog) and a challenging suite of motor control tasks.

6.1.1. Task Performance and Skill Composition

AMP successfully enables characters to perform various high-level tasks while adopting specific motion styles, accommodating large unstructured datasets. The weights for the task-reward ($w^G$) and style-reward ($w^S$) are both set to 0.5 for all tasks.

1. Target Heading / Target Location:

  • Locomotion Dataset: When trained with a Locomotion dataset (walking, running, jogging), the humanoid automatically selects appropriate gaits based on target speed. It walks at slow speeds (~1 m/s), jogs at medium speeds (~2.5 m/s), and runs at high speeds (~4.5 m/s). This demonstrates automatic interpolation and generalization of skills. Policies also develop human-like strategies like banking into turns and slowing before sharp direction changes.
  • Stylistic Behaviors: By providing different datasets (Zombie, Stealthy), the character learns distinct stylistic gaits (e.g., shambling zombie walk, stealthy movements), proving AMP's ability to control style via example.
  • No Motion Planner: The paper highlights that these intricate behaviors and transitions emerge automatically from the motion prior, without requiring a motion planner or explicit motion selection, a key advantage over many prior tracking-based systems.

2. Obstacles:

  • AMP trains visuomotor policies for traversing obstacle courses. Using a Run + Leap + Roll dataset, the character learns to leap over gaps and transition into a rolling behavior to pass under overhead obstructions.
  • Novel Recovery: When falling, the character can tuck its body into a roll to transition more quickly into a get-up behavior, even if this specific combined motion (fall + roll + getup) is not present in the dataset. This indicates generalization and discovery of novel strategies.
  • Diverse Obstacle Clearing Styles: Different datasets (Jump, Cartwheel) allow the character to clear stepping stones by jumping or cartwheeling, showcasing stylistic variation in complex environments.

3. Dribbling:

  • The humanoid learns to dribble a soccer ball to a target location, demonstrating the ability to handle complex object manipulation tasks while maintaining a specified style (e.g., Locomotion or Zombie).

4. Strike:

  • Using a Walk + Punch dataset, the character learns to walk to a target and then transition to a punching motion when in range. The dataset contains only walking-only or punching-only clips; the policy learns to temporally sequence these behaviors automatically. This is a strong example of composition of disparate skills emerging from the motion prior.

Figure 1 (Main Paper) & Figure 3 (Appendix A) illustrate these behaviors:

Figure 1: The simulated character running and vaulting over obstacles (left) and interacting with objects (right); these high-quality behaviors are produced automatically from unstructured motion clips.

Figure 3 (Appendix A): Examples of humanoid skills, including running, getting up, dribbling, striking, clearing obstacles, rolling, and jumping, each shown as a sequence of frames of the physically simulated character.

6.1.2. Single-Clip Imitation

Although designed for large datasets, AMP's effectiveness in imitating individual motion clips is also evaluated. Here, the policy is trained solely to maximize the style-reward $r_t^S$.

  • AMP can closely imitate a wide variety of highly dynamic and acrobatic skills (e.g., back-flip, side-flip, cartwheel, spin-kick) for humanoids, the T-Rex, and dogs. (See Figures 6 and 7.)
    • Fig. 6. Snapshots of behaviors learned by the Humanoid on the single-clip imitation tasks. Top-to-bottom: back-flip, side-flip, cartwheel, spin, spin-kick, roll. AMP enables the character to closely imitate a diverse corpus of highly dynamic and acrobatic skills.
    • Fig. 7. Locomotion behaviors learned for the T-Rex and dog characters (e.g., T-Rex walking, dog pacing and trotting), each shown as a sequence of dynamic poses.
  • The performance is evaluated using pose error after applying Dynamic Time Warping (DTW) for alignment, as AMP policies are not synchronized via phase variables and may progress at different rates.
  • AMP produces results of comparable quality to tracking-based methods (Peng et al. 2018a), but without the need for manual reward engineering or phase-based synchronization.
  • Limitation: For some complex motions like Front-Flip, AMP can converge to locally optimal behaviors (e.g., shuffling forward instead of flipping), a challenge that tracking-based methods can mitigate with early termination criteria.

6.1.3. Comparison to Baselines

  • Latent Space Models & No Data: Both AMP and latent space models produce substantially more life-like behaviors than policies trained from scratch (No Data). This is visually evident in the supplementary video and quantitatively in task performance (Figure 5).
    • Fig. 5. Learning curves comparing the task performance of AMP to latent space models (Latent Space) and policies trained from scratch without motion data (No Data). Our method achieves comparable performance across the various tasks, while also producing higher fidelity motions.
  • AMP vs. Latent Space Models: AMP is able to mitigate unnatural motions more effectively than latent space models because it directly enforces motion similarity through its reward function. Latent space models, which enforce style indirectly, can suffer from distribution mismatch between the high-level controller's latent encodings and the low-level controller's pre-trained distribution. While latent space models can sometimes solve tasks faster initially due to structured exploration, their pre-training phase can be sample-intensive. AMP does not require such a separate pre-training phase.

6.1.4. Ablation Studies (Critical Design Decisions)

  • Gradient Penalty: This is identified as the most vital component for stable training and effective performance. Models trained without the gradient penalty term ($w^{\mathrm{gp}}$) exhibit large performance fluctuations and noticeable visual artifacts. Its inclusion leads to faster learning and improved stability. (See Figure 8)
  • Velocity Features in Discriminator: Including velocity features in the discriminator's observations is crucial for imitating certain dynamic motions, such as rolling. Without them, the character might converge to an undesirable static pose (e.g., holding a fixed pose on the ground instead of rolling). (See Figure 8)
  • Dataset Diversity: Experiments on the Target Heading task using limited datasets (only walking or only running) versus a diverse Locomotion dataset show that the diversity of behaviors (e.g., gait transitions) is largely attributed to the motion prior and not solely the task objective. Policies trained with limited data cannot achieve the full range of target speeds. (See Figure 4)
    • Fig. 4. Performance of Target Heading policies trained with different datasets. Left: Learning curves comparing the normalized task returns of policies trained with a large dataset of diverse locomotion clips to policies trained with only walking or running reference motions. Three models are trained using each dataset. Right: Comparison of the target speed with the average speed achieved by the different policies. Policies trained using the larger Locomotion dataset are able to more closely follow the various target speeds by imitating different gaits.
  • Figure 8 illustrates these ablation study results:
    • Fig. 8. Learning curves of various methods on the single-clip imitation tasks. We compare AMP to the motion tracking approach proposed by Peng et al. \[2018a\] (Motion Tracking), as well as a version of AMP without velocity features for the discriminator (AMP - No Vel), and AMP without the gradient penalty regularizer (AMP - No GP). A comprehensive collection of learning curves for all skills is available in the Appendix. AMP produces results of comparable quality when compared to prior tracking-based methods, without requiring a manually designed reward function or synchronization between the policy and reference motion. Velocity features and the gradient penalty are vital for effective and consistent results on challenging skills.

6.1.5. Spatial Composition (Appendix D)

  • AMP also demonstrates some capability for spatial composition, where a character performs different skills simultaneously (e.g., walking while waving). Using datasets with only walking motions and only waving motions (no combined examples), the policy learns to combine these skills.
  • Policies trained with both datasets achieve relatively high rewards on both the heading and waving objectives, outperforming those trained with only one type of motion (see Table 6 and the sketch after this list).
  • Limitation: While possible, some unnatural behaviors can still emerge, especially when the target height for the hand is very high, suggesting areas for improvement in more complex spatial compositions.
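
For illustration only, the sketch below shows one plausible way the heading and waving objectives could be written and averaged into a single task reward alongside the shared style reward; the exponential reward shapes, coefficients, and function names are our assumptions, not the paper's definitions.

```python
import numpy as np

def heading_reward(root_vel_xy, target_dir, target_speed):
    """Hypothetical heading objective: move at target_speed along target_dir."""
    speed_along = float(np.dot(root_vel_xy, target_dir))
    return float(np.exp(-0.25 * (target_speed - speed_along) ** 2))

def waving_reward(hand_height, target_height):
    """Hypothetical waving objective: keep the hand near a target height."""
    return float(np.exp(-4.0 * (hand_height - target_height) ** 2))

def composed_task_reward(r_heading, r_waving, style_r, w_task=0.5, w_style=0.5):
    """Average the two task terms, then blend with the shared style reward."""
    return w_task * 0.5 * (r_heading + r_waving) + w_style * style_r
```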

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Character | Task | Dataset | Task Return |
| --- | --- | --- | --- |
| Humanoid | TargetHeading | Locomotion | 0.90 ± 0.01 |
| Humanoid | TargetHeading | Walk | 0.46 ± 0.01 |
| Humanoid | TargetHeading | Run | 0.63 ± 0.01 |
| Humanoid | TargetHeading | Stealthy | 0.89 ± 0.02 |
| Humanoid | TargetHeading | Zombie | 0.94 ± 0.00 |
| Humanoid | TargetLocation | Locomotion | 0.63 ± 0.01 |
| Humanoid | TargetLocation | Zombie | 0.50 ± 0.00 |
| Humanoid | Obstacles | Run + Leap + Roll | 0.27 ± 0.10 |
| Humanoid | Stepping Stones | Cartwheel | 0.43 ± 0.03 |
| Humanoid | Stepping Stones | Jump | 0.56 ± 0.12 |
| Humanoid | Dribble | Locomotion | 0.78 ± 0.05 |
| Humanoid | Dribble | Zombie | 0.60 ± 0.04 |
| Humanoid | Strike | Walk + Punch | 0.73 ± 0.02 |
| T-Rex | TargetLocation | Locomotion | 0.36 ± 0.03 |

The following are the results from Table 3 of the original paper:

| Character | Motion | Dataset Size | Motion Tracking | AMP (Ours) |
| --- | --- | --- | --- | --- |
| Humanoid | Back-Flip | 1.75 s | 0.076 ± 0.021 | 0.150 ± 0.028 |
| Humanoid | Cartwheel | 2.72 s | 0.039 ± 0.011 | 0.067 ± 0.014 |
| Humanoid | Crawl | 2.93 s | 0.044 ± 0.001 | 0.049 ± 0.007 |
| Humanoid | Dance | 1.62 s | 0.038 ± 0.001 | 0.055 ± 0.015 |
| Humanoid | Front-Flip | 1.65 s | 0.278 ± 0.040 | 0.425 ± 0.010 |
| Humanoid | Jog | 0.83 s | 0.029 ± 0.001 | 0.056 ± 0.001 |
| Humanoid | Jump | 1.77 s | 0.033 ± 0.001 | 0.083 ± 0.022 |
| Humanoid | Roll | 2.02 s | 0.072 ± 0.018 | 0.088 ± 0.008 |
| Humanoid | Run | 0.80 s | 0.028 ± 0.002 | 0.075 ± 0.015 |
| Humanoid | Spin | 1.00 s | 0.063 ± 0.022 | 0.047 ± 0.002 |
| Humanoid | Side-Flip | 2.44 s | 0.191 ± 0.043 | 0.124 ± 0.012 |
| Humanoid | Spin-Kick | 1.28 s | 0.042 ± 0.001 | 0.058 ± 0.012 |
| Humanoid | Walk | 1.30 s | 0.018 ± 0.005 | 0.030 ± 0.001 |
| Humanoid | Zombie | 1.68 s | 0.049 ± 0.013 | 0.058 ± 0.014 |
| T-Rex | Turn | 2.13 s | 0.098 ± 0.011 | 0.284 ± 0.023 |
| T-Rex | Walk | 2.00 s | 0.069 ± 0.005 | 0.096 ± 0.027 |
| Dog | Canter | 0.45 s | 0.026 ± 0.002 | 0.034 ± 0.002 |
| Dog | Pace | 0.63 s | 0.020 ± 0.001 | 0.024 ± 0.003 |
| Dog | Spin | 0.73 s | 0.026 ± 0.002 | 0.086 ± 0.008 |
| Dog | Trot | 0.52 s | 0.019 ± 0.001 | 0.026 ± 0.001 |

Values are average pose errors after DTW alignment (lower is better).

The following are the results from Table 4 of the original paper:

| Parameter | Value |
| --- | --- |
| $w^G$ Task-Reward Weight | 0.5 |
| $w^S$ Style-Reward Weight | 0.5 |
| $w^{\mathrm{gp}}$ Gradient Penalty | 10 |
| Samples Per Update Iteration | 4096 |
| Batch Size | 256 |
| $\pi$ Policy Stepsize (Single-Clip Imitation) | $2 \times 10^{-6}$ |
| $\pi$ Policy Stepsize (Tasks) | $4 \times 10^{-6}$ |
| $V$ Value Stepsize (Single-Clip Imitation) | $10^{-4}$ |
| $V$ Value Stepsize (Tasks) | $2 \times 10^{-5}$ |
| $D$ Discriminator Stepsize | $10^{-5}$ |
| $\mathcal{B}$ Discriminator Replay Buffer Size | $10^5$ |
| $\gamma$ Discount (Single-Clip Imitation) | 0.95 |
| $\gamma$ Discount (Tasks) | 0.99 |
| SGD Momentum | 0.9 |
| GAE($\lambda$) | 0.95 |
| TD($\lambda$) | 0.95 |
| PPO Clip Threshold | 0.02 |
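
For reference, the Table 4 settings could be collected into a single training configuration along the lines of the sketch below; the dictionary keys are illustrative and do not correspond to the authors' codebase.

```python
# A minimal sketch of the Table 4 hyperparameters as a training config.
AMP_HPARAMS = {
    "task_reward_weight": 0.5,            # w^G
    "style_reward_weight": 0.5,           # w^S
    "gradient_penalty_weight": 10.0,      # w^gp
    "samples_per_update": 4096,
    "batch_size": 256,
    "policy_lr": {"single_clip": 2e-6, "tasks": 4e-6},
    "value_lr": {"single_clip": 1e-4, "tasks": 2e-5},
    "discriminator_lr": 1e-5,
    "disc_replay_buffer_size": 100_000,   # B
    "discount": {"single_clip": 0.95, "tasks": 0.99},
    "sgd_momentum": 0.9,
    "gae_lambda": 0.95,
    "td_lambda": 0.95,
    "ppo_clip": 0.02,
}
```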

The following are the results from Table 5 of the original paper:

| Parameter | Value |
| --- | --- |
| Latent Encoding Dimension | 16 |
| $\lambda$ KL-Regularizer | $10^{-4}$ |
| Samples Per Update Iteration | 4096 |
| Batch Size | 256 |
| $\pi$ Policy Stepsize (Pre-Training) | $2.5 \times 10^{-6}$ |
| $u$ Policy Stepsize (Downstream Task) | $10^{-4}$ |
| $V$ Value Stepsize | $10^{-3}$ |
| $\gamma$ Discount (Pre-Training) | 0.95 |
| $\gamma$ Discount (Downstream Task) | 0.99 |
| SGD Momentum | 0.9 |
| GAE($\lambda$) | 0.95 |
| TD($\lambda$) | 0.95 |
| PPO Clip Threshold | 0.02 |

The following are the results from Table 6 of the original paper:

| Dataset (Size) | Heading Return | Waving Return |
| --- | --- | --- |
| Wave (51.7 s) | 0.683 ± 0.195 | 0.949 ± 0.144 |
| Walk (229.7 s) | 0.945 ± 0.192 | 0.306 ± 0.378 |
| Wave + Walk (281.4 s) | 0.885 ± 0.184 | 0.891 ± 0.202 |

6.3. Ablation Studies / Parameter Analysis

The paper meticulously investigates the impact of key design choices on AMP's performance and stability.

6.3.1. Importance of Gradient Penalty

  • Observation: The gradient penalty (from Equation 8) is identified as the most crucial component for AMP's success.
  • Impact: Without this regularization, training is highly unstable, characterized by large performance fluctuations (evident in the AMP - No GP curves in Figure 8 and Figure 9). This often leads to severe visual artifacts and unnatural behaviors in the final policies.
  • Benefit: The gradient penalty significantly improves stability during adversarial training and leads to substantially faster learning across a wide range of skills. This is a critical finding for making adversarial imitation learning practical for high-fidelity character control (a minimal sketch of the regularized discriminator loss follows this list).
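
Below is a minimal PyTorch-style sketch of the regularized discriminator update, assuming a least-squares objective with targets +1 for reference transitions and -1 for policy transitions, and a penalty on the discriminator's input gradient evaluated at the reference samples. The exact scaling of the penalty term is an assumption; its presence is what this ablation isolates.

```python
import torch

def discriminator_loss(discriminator: torch.nn.Module,
                       agent_obs: torch.Tensor,
                       demo_obs: torch.Tensor,
                       w_gp: float = 10.0) -> torch.Tensor:
    """Least-squares discriminator loss with a gradient penalty on reference data."""
    demo_obs = demo_obs.clone().requires_grad_(True)
    d_demo = discriminator(demo_obs)
    d_agent = discriminator(agent_obs)

    # Least-squares GAN terms: push reference scores toward +1, policy scores toward -1.
    ls_loss = ((d_demo - 1.0) ** 2).mean() + ((d_agent + 1.0) ** 2).mean()

    # Penalize the squared norm of dD/d(obs) at the reference samples.
    grad = torch.autograd.grad(d_demo.sum(), demo_obs, create_graph=True)[0]
    grad_penalty = (grad.norm(2, dim=-1) ** 2).mean()

    # The 1/2 factor is a common convention; treat the exact scaling as an assumption.
    return ls_loss + 0.5 * w_gp * grad_penalty
```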

6.3.2. Role of Velocity Features in Discriminator Observations

  • Observation: While one might assume that providing consecutive poses to the discriminator could implicitly convey velocity information, the ablation study (AMP - No Vel in Figure 8) shows this is often insufficient.
  • Impact: Without explicit velocity features in the discriminator observations $\Phi(s)$, the policy can converge to undesirable locally optimal behaviors. For instance, in the rolling motion, the character might learn to simply lie on the ground in a fixed pose rather than performing a dynamic roll.
  • Benefit: Including the explicit linear and angular velocities of the root and joints in the discriminator's input helps it capture the dynamics of the reference motions, guiding the policy toward more accurate and dynamic behaviors (a minimal sketch of such an observation follows this list).
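
The sketch below illustrates the kind of feature construction being ablated: a transition observation that concatenates pose features with (optionally) velocity features for two consecutive states. The specific feature contents are an assumption for illustration, not the paper's exact definition of $\Phi(s)$.

```python
import numpy as np

def disc_observation(pose, vel, next_pose, next_vel, include_velocity=True):
    """Build a discriminator observation for one transition (s, s')."""
    def phi(p, v):
        feats = [np.asarray(p, dtype=np.float32).ravel()]
        if include_velocity:  # dropping this branch mirrors the "AMP - No Vel" ablation
            feats.append(np.asarray(v, dtype=np.float32).ravel())
        return np.concatenate(feats)

    return np.concatenate([phi(pose, vel), phi(next_pose, next_vel)])
```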

6.3.3. Influence of Dataset Diversity on Skill Composition

  • Experiment: To confirm that the observed diversity and transitions between gaits (e.g., walking, jogging, running) are indeed a product of the motion prior and not just the task objective, policies for the Target Heading task are trained with:
    1. A large, diverse Locomotion dataset.
    2. Limited datasets containing only walking motions.
    3. Limited datasets containing only running motions.
  • Results (Figure 4):
    • Policies trained with only walking motions learn exclusively walking gaits and cannot achieve faster target speeds, resulting in lower task returns for higher speed goals.
    • Policies trained with only running motions struggle to match slower target speeds, as they only know how to run.
    • Policies trained with the diverse Locomotion dataset are much more flexible, able to smoothly transition between gaits to match a wider range of target speeds, achieving higher overall task returns.
  • Conclusion: This validates that the motion prior itself, learned from the diversity of the provided dataset, is responsible for enabling the policy to compose and dynamically adapt different skills to achieve task objectives.

6.3.4. Discount Factor (γ\gamma)

  • Observation (Table 4): Different values of the discount factor $\gamma$ are used for the RL objective depending on the setting.
  • Impact: A smaller $\gamma = 0.95$ is found to be more effective for single-clip imitation tasks, allowing the character to focus on short-term fidelity to the reference motion. For tasks with additional objectives that may require longer-horizon planning (e.g., Dribble, Strike), a larger $\gamma = 0.99$ is used (see the note after this list).
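
As a rough rule of thumb (our note, not a claim from the paper), the discount factor implies an effective planning horizon of about $1/(1-\gamma)$ control steps:

```latex
H_{\mathrm{eff}} \approx \frac{1}{1-\gamma}: \qquad
\frac{1}{1-0.95} = 20 \ \text{steps}, \qquad
\frac{1}{1-0.99} = 100 \ \text{steps}.
```

The shorter horizon keeps single-clip imitation focused on near-term fidelity to the reference, while the longer horizon lets task policies take credit for rewards that arrive many steps later, such as reaching a distant target.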

6.4. Learning Curves

The learning curves, such as those presented in Figure 8 and Figure 9 for single-clip imitation and Figure 5 for task performance, provide quantitative insights into training progression. Figure 10 provides a comprehensive collection of learning curves across various tasks and datasets.

  • General Trend: AMP often shows slower initial learning compared to motion tracking baselines in single-clip imitation (Figure 8), but eventually reaches comparable or even superior performance for some skills (e.g., Spin, Side-Flip). This is expected as AMP has to learn the reward function implicitly, while motion tracking benefits from a direct, engineered error signal.

  • Stability: The AMP curves (blue line) are generally smoother and more stable than AMP - No GP (purple line), reinforcing the importance of the gradient penalty.

  • Comparison to Baselines in Tasks (Figure 5): AMP and Latent Space models typically converge to higher task returns than No Data baselines, showcasing the benefit of data-driven style guidance.


Fig. 9. Pose error (vertical axis, in meters) versus number of training samples (horizontal axis) for AMP (blue) and the motion tracking baseline (orange) on individual skills such as back-flip, cartwheel, and dance.

Fig. 10. Learning curves of applying AMP to various tasks and datasets. Each subplot shows the task return as a function of the number of training samples.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces AMP (Adversarial Motion Priors), a novel and effective adversarial learning system for physics-based character animation. The core contribution is the development of a task-agnostic motion prior derived from an adversarial discriminator, which serves as a powerful style-reward for reinforcement learning. This approach successfully addresses key limitations of prior methods:

  • Elimination of Manual Reward Engineering: AMP obviates the need for manually designed imitation objectives, which are often difficult to construct and tune across diverse skills.

  • Scalability to Unstructured Data: It effectively leverages large, unstructured motion datasets without requiring task-specific annotations, segmentation, or high-level motion planners.

  • Automatic Skill Composition: The system automatically learns to compose, interpolate, and generalize disparate skills from the dataset to fulfill complex task objectives, demonstrating emergent intelligence in behavior sequencing.

  • High-Fidelity Motions: AMP produces high-quality, natural, and life-like full-body motions for physically simulated characters, achieving results comparable to state-of-the-art tracking-based techniques.

  • Enhanced Training Stability: Crucial design decisions, particularly the Least-Squares Discriminator and gradient penalty on discriminator observations, significantly improve the stability of adversarial training, making the approach practical.

    By decoupling task specification from style specification, AMP offers a flexible and powerful framework for generating sophisticated and stylized character behaviors in diverse simulated environments.

7.2. Limitations & Future Work

The authors identify several limitations and propose avenues for future research:

  • Mode Collapse: Like many GAN-based techniques, AMP is susceptible to mode collapse. When given a large and diverse motion dataset, the policy might only imitate a small subset of the available behaviors, neglecting other potentially optimal styles. This restricts the full exploitation of the dataset's diversity.
  • Training Motion Priors from Scratch: Currently, the motion prior is trained from scratch for each policy and task. Although the prior is largely task-agnostic, this process can be computationally intensive.
    • Future Work: Exploring techniques for developing more general and transferable motion priors could lead to modular objective functions that can be reused across different policies and tasks without retraining.
  • Task Dependencies in Motion Priors: While designed to be task-agnostic, the data used to train the current motion priors are generated by policies performing a particular task. This could inadvertently introduce some task dependencies into the prior, hindering its transferability.
    • Future Work: Training motion priors using data generated from a larger and more diverse repertoire of tasks might help to make them truly general and transferable to novel tasks.
  • Spatial vs. Temporal Composition: The experiments primarily focus on temporal composition of skills (performing different behaviors sequentially, e.g., walk then punch). While some spatial composition (performing multiple skills simultaneously, e.g., walk and wave) is demonstrated, AMP might still struggle with producing fully natural behaviors in highly complex spatial compositions, especially when skills conflict.
    • Future Work: Developing motion priors that are more amenable to complex spatial composition of disparate skills could lead to even more flexible and sophisticated behaviors.

7.3. Personal Insights & Critique

AMP represents a significant leap forward in physics-based character animation by effectively harnessing the power of adversarial imitation learning. The decoupling of task and style objectives is an elegant solution to a long-standing problem in the field, offering a more intuitive and scalable interface for animators and developers.

Key Strengths and Insights:

  • Practicality of AIL: The paper's most impactful contribution might be demonstrating that adversarial imitation learning, often criticized for instability, can indeed produce high-fidelity motions for complex physical characters when properly regularized. The identification of the gradient penalty and carefully chosen discriminator observations as critical for stability is a valuable insight for anyone working with GANs in RL.
  • Generalization and Emergent Behavior: The automatic composition of disparate skills and the generalization to novel recovery behaviors (e.g., tucking into a roll during a fall) without explicit programming or motion planning is highly impressive. This showcases the power of learning a general motion prior over explicit tracking.
  • Unstructured Data Advantage: The ability to work with unstructured motion clips is a major practical advantage. It removes the cumbersome preprocessing steps (annotation, segmentation) that plague many data-driven animation pipelines, making it much easier to leverage large, readily available motion datasets.

Critique and Areas for Further Consideration:

  • Mode Collapse Mitigation: While the paper acknowledges mode collapse as a limitation, it's a fundamental challenge for GANs. Future work could explore diversity-promoting techniques (e.g., conditional GANs, ensemble methods, or explicit diversity penalties) within AMP to ensure that the policy truly utilizes the full spectrum of styles available in large datasets, rather than converging to a subset.

  • Interpretability of Motion Prior: The adversarial motion prior functions as a black box. While effective, understanding what specific features of a motion it finds "natural" or "stylistic" could provide deeper insights and perhaps lead to more controllable or fine-tunable style generation.

  • Computational Cost: Training AMP policies still requires significant computational resources (30-140 hours on 16 CPU cores for 100-300 million samples). While comparable to other state-of-the-art RL methods, it's a barrier for rapid iteration. Research into more sample-efficient adversarial RL or transfer learning for motion priors is crucial.

  • Robustness to Unnatural Spatial Composition: The observation that spatial composition can still lead to "unnatural behaviors" when target hand height is high (Appendix D) points to a frontier. This is a complex problem where different skills might have conflicting kinematic or dynamic requirements. Perhaps a more explicit mechanism for modeling skill compatibility or blending in the latent space of the discriminator could be explored.

    In conclusion, AMP is a robust and innovative framework that pushes the boundaries of physically simulated character control. Its elegant solution to motion style specification and skill composition paves the way for more autonomous, natural, and expressive virtual agents, with clear implications for both computer graphics and robotics.
