AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control
TL;DR Summary
The paper presents a fully automated method called Adversarial Motion Priors (AMP) for generating graceful and realistic motions in physically simulated characters, using adversarial imitation learning so that high-level task objectives can be specified by simple reward functions while low-level behavior styles are learned from unstructured motion clips.
Abstract
Synthesizing graceful and life-like behaviors for physically simulated characters has been a fundamental challenge in computer animation. Data-driven methods that leverage motion tracking are a prominent class of techniques for producing high fidelity motions for a wide range of behaviors. However, the effectiveness of these tracking-based methods often hinges on carefully designed objective functions, and when applied to large and diverse motion datasets, these methods require significant additional machinery to select the appropriate motion for the character to track in a given scenario. In this work, we propose to obviate the need to manually design imitation objectives and mechanisms for motion selection by utilizing a fully automated approach based on adversarial imitation learning. High-level task objectives that the character should perform can be specified by relatively simple reward functions, while the low-level style of the character's behaviors can be specified by a dataset of unstructured motion clips, without any explicit clip selection or sequencing. These motion clips are used to train an adversarial motion prior, which specifies style-rewards for training the character through reinforcement learning (RL). The adversarial RL procedure automatically selects which motion to perform, dynamically interpolating and generalizing from the dataset. Our system produces high-quality motions that are comparable to those achieved by state-of-the-art tracking-based techniques, while also being able to easily accommodate large datasets of unstructured motion clips. Composition of disparate skills emerges automatically from the motion prior, without requiring a high-level motion planner or other task-specific annotations of the motion clips. We demonstrate the effectiveness of our framework on a diverse cast of complex simulated characters and a challenging suite of motor control tasks.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control
1.2. Authors
- Xue Bin Peng (University of California, Berkeley)
- Ze Ma (Shanghai Jiao Tong University)
- Pieter Abbeel (University of California, Berkeley)
- Sergey Levine (University of California, Berkeley)
- Angjoo Kanazawa (University of California, Berkeley)
1.3. Journal/Conference
ACM Transactions on Graphics (ACM Trans. Graph. 40, 4, Article 1). This is a highly prestigious and influential journal in the field of computer graphics, widely recognized for publishing cutting-edge research. Publication here signifies high-quality and impactful work.
1.4. Publication Year
2021
1.5. Abstract
The paper addresses the long-standing challenge of synthesizing graceful and life-like behaviors for physically simulated characters in computer animation. Existing data-driven methods, particularly those based on motion tracking, often require intricate objective functions and complex mechanisms for selecting appropriate motions when dealing with large, diverse datasets. To overcome these limitations, the authors propose a fully automated approach called Adversarial Motion Priors (AMP) which leverages adversarial imitation learning.
AMP allows users to specify high-level task objectives using simple reward functions, while the low-level style of the character's behavior is learned from unstructured motion clips without explicit selection or sequencing. These motion clips are used to train an adversarial motion prior, which acts as a style-reward for training the character through reinforcement learning (RL). The adversarial RL procedure automatically selects, interpolates, and generalizes motions from the dataset. The system generates high-quality motions comparable to state-of-the-art tracking methods, accommodating large, unstructured motion datasets. It also automatically facilitates the composition of disparate skills without needing a high-level motion planner or task-specific annotations. The framework's effectiveness is demonstrated on diverse simulated characters and challenging motor control tasks.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2104.02180v2
- PDF Link: https://arxiv.org/pdf/2104.02180v2.pdf
- Publication Status: Published in ACM Transactions on Graphics (Vol. 40, No. 4, Article 1) in August 2021. The arXiv link is a preprint; the paper has since been formally published.
2. Executive Summary
2.1. Background & Motivation
Synthesizing natural, life-like, and graceful motions for virtual characters is a fundamental and challenging problem in computer animation, crucial for creating immersive experiences in films, games, and virtual reality. Beyond entertainment, developing realistic control strategies for simulated characters also has implications for robotics, as natural motions often implicitly encode properties like safety and energy efficiency.
Prior methods face significant hurdles:
- Optimization-based methods: While capable of producing physically plausible motions, these techniques struggle to define quantitative metrics for "naturalness" or "life-likeness." Heuristics like symmetry or effort minimization often require careful, task-specific tuning and may not generalize well across different behaviors.
- Data-driven kinematic methods: These methods leverage large datasets of human motion (e.g., mocap) to generate realistic motions. However, their ability to synthesize motions for novel situations is limited by data availability, making it difficult to cover all necessary behaviors, especially for non-human or fictional characters.
- Data-driven physics-based methods (tracking-based): A common approach uses a tracking objective to minimize the pose error between a simulated character and reference motions. While effective for high-quality single-skill imitation, scaling these methods to large, diverse, and unstructured motion datasets is challenging. They typically require a motion planner to select the appropriate reference motion for the character to track at each timestep, which introduces significant algorithmic overhead and necessitates annotating and organizing motion clips. Moreover, these methods often rely on manually designed pose error metrics that are difficult to tune across various skills.

The paper's entry point is to address these limitations by proposing a system that obviates the need for manually designed imitation objectives and explicit motion selection mechanisms. The authors aim to enable users to specify high-level task objectives with simple reward functions, while the low-level style of motion is learned automatically from raw, unstructured motion clips.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of physics-based character control:
- Adversarial Motion Priors (AMP): The core contribution is the introduction of Adversarial Motion Priors which, through adversarial imitation learning, learn a general, task-agnostic measure of motion similarity. This prior acts as a style-reward function, enabling a character to adopt behavioral characteristics from a dataset without needing explicit clip selection or sequence planning.
- Automated Style Control from Unstructured Data: AMP provides a fully automated approach to learning low-level motion style from unstructured motion clips, eliminating the need for task-specific annotations, organization, or synchronization of the dataset. This significantly improves scalability to large and diverse motion repositories.
- High-Quality and Diverse Motion Synthesis: The system produces high-quality, natural, and life-like motions comparable to state-of-the-art tracking-based techniques. It demonstrates the ability to learn highly dynamic and diverse motor skills, including acrobatic feats, for various complex simulated characters (humanoids, a T-Rex, a dog).
- Automatic Skill Composition and Generalization: The adversarial RL procedure automatically selects, interpolates, and generalizes different behaviors from the motion dataset. This allows complex skill compositions (e.g., walking and then punching, or running, leaping, and rolling to clear obstacles) to emerge in furtherance of high-level task objectives, all without a separate motion planner.
- Stabilized Adversarial Training: The paper proposes several key design decisions, including the use of a Least-Squares Discriminator and a gradient penalty on discriminator observations, which are crucial for stabilizing the notoriously unstable adversarial training process and achieving consistent, high-quality results.
- Decoupling Task and Style Specification: AMP provides a convenient interface that decouples the specification of what task a character should perform (via simple task rewards) from how it should perform it (via the learned motion style from examples). This allows characters to acquire more complex skills than those explicitly demonstrated in the original motion clips.
The key finding is that by combining goal-conditioned reinforcement learning with an adversarially learned motion prior, it is possible to train robust and versatile physically simulated characters that perform complex tasks with natural and stylized behaviors, effectively bridging the gap between data-driven realism and physics-based control.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand AMP, a grasp of several core concepts in reinforcement learning, machine learning, and computer graphics is essential:
- Physically Simulated Characters: These are virtual characters whose movements are governed by the laws of physics. Instead of simply playing back animations (kinematic methods), their bodies interact with the environment through forces, torques, gravity, and collisions. This provides physical realism but makes control more challenging.
- Degrees of Freedom (DoF): The number of independent parameters that define the configuration of a physical system. For a character, this includes the position and orientation of its root (e.g., pelvis) and the rotational angles of its joints. A 34 DoF humanoid, for example, has 34 independent values describing its pose.
- Proportional-Derivative (PD) Controllers: Feedback control mechanisms widely used in robotics and character animation. For a joint, a PD controller attempts to reach a target position (e.g., a desired angle) by applying a torque proportional to both the position error (difference between current and target position) and the velocity error (difference between current and target velocity). They allow the character's simulated actuators to exert forces that drive the body toward desired poses (see the sketch after this list).
- Reinforcement Learning (RL): A paradigm in which an agent learns to make optimal decisions by interacting with an environment.
  - Agent: The entity that performs actions (e.g., the character's controller).
  - Environment: The world the agent interacts with (e.g., the physics simulation).
  - State (s_t): A description of the current situation in the environment (e.g., the character's joint angles, velocities, position).
  - Action (a_t): A decision made by the agent that influences the environment (e.g., target joint angles for PD controllers).
  - Reward (r_t): A scalar feedback signal from the environment indicating how good or bad the agent's last action was. The agent's goal is to maximize cumulative reward.
  - Policy (π): A function or strategy that maps states to actions (or a distribution over actions). It dictates the agent's behavior.
  - Value Function (V): Estimates the expected future cumulative reward from a given state.
  - Expected Discounted Return: The sum of future rewards, where later rewards are discounted by a factor γ. The agent aims to find a policy that maximizes this.
- Goal-Conditioned Reinforcement Learning: An extension of standard RL where the policy also takes a goal g as input. This allows a single policy to learn a variety of behaviors to achieve different goals (e.g., walk to location A, then location B).
- Generative Adversarial Networks (GANs): A class of generative models consisting of two neural networks, a generator and a discriminator, trained in a competitive "adversarial" process.
  - Generator: Tries to produce realistic data samples (e.g., images, motions) that mimic a real dataset.
  - Discriminator: Tries to distinguish between real data samples (from the dataset) and fake data samples (produced by the generator).
  - Adversarial Training: The generator tries to "fool" the discriminator, while the discriminator tries to get better at detecting the generator's samples. This minimax game drives both networks to improve, with the generator eventually learning to produce highly realistic data.
- Adversarial Imitation Learning (AIL): Combines RL with GANs. Instead of manually designing a reward function for imitation, the discriminator learns to distinguish between the agent's behavior and expert demonstrations. The policy (the agent's controller) is then trained using the discriminator's output as a reward, trying to make its behavior indistinguishable from the expert's.
- Dynamic Time Warping (DTW): An algorithm for measuring similarity between two temporal sequences that may vary in speed or duration. It finds an optimal alignment between two time series by "warping" one or both along the time axis. This is crucial for comparing the pose error of a simulated character's motion with a reference motion when their timings are not perfectly synchronized.
- Exponential Map (for rotations): A way to represent 3D rotations as a 3D vector. The direction of the vector defines the axis of rotation, and its magnitude defines the angle of rotation. It is more compact than axis-angle or quaternion representations and avoids the gimbal lock issues common with Euler angles.
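To make the PD controller idea concrete, here is a minimal sketch of how a single joint's torque could be computed from a target angle. The function name and gain values are illustrative assumptions, not values from the paper.

```python
def pd_torque(q, q_dot, q_target, kp=300.0, kd=30.0):
    """Minimal 1-DoF PD controller: torque from position and velocity errors.

    q        : current joint angle (rad)
    q_dot    : current joint angular velocity (rad/s)
    q_target : target joint angle specified by the policy (rad)
    kp, kd   : illustrative proportional/derivative gains (not from the paper)
    """
    # Drive the joint toward the target angle while damping its velocity.
    return kp * (q_target - q) - kd * q_dot

# Example: joint at 0.1 rad moving at 0.5 rad/s, commanded to 0.3 rad.
tau = pd_torque(0.1, 0.5, 0.3)
print(f"applied torque: {tau:.1f}")
```

In AMP, the policy outputs the target positions q_target, and PD controllers of this form convert them into joint torques at the simulation rate.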
3.2. Previous Works
The paper contextualizes its work by reviewing existing approaches for synthesizing natural motions for virtual characters:
- Kinematic Methods: These methods typically do not use physics simulation explicitly. They primarily rely on datasets of human motion (mocap data) to generate animations.
  - Motion Graphs (Lee et al. 2002, 2010b; Agrawal and van de Panne 2016; Safonova and Hodgins 2007; Treuille et al. 2007): These techniques build a graph where nodes are poses and edges represent transitions. A controller then traverses this graph to stitch together motion clips.
  - Generative Models (Levine et al. 2012; Ye and Liu 2010; Holden et al. 2017; Ling et al. 2020; Zhang et al. 2018): Use models like Gaussian processes or neural networks (e.g., Phase-Functioned Neural Networks) to synthesize motions online.
  - Limitations: While capable of realistic motions given large datasets, their ability to generalize to novel situations or produce physically plausible interactions is limited by the training data. Collecting sufficient data for complex tasks or non-human characters is challenging.
- Physics-Based Methods: These methods synthesize motions using a physics simulation and optimize controllers to achieve desired behaviors.
  - Optimization Techniques (Raibert and Hodgins 1991; Wampler et al. 2014; Mordatch et al. 2012; Tan et al. 2014): Trajectory optimization and reinforcement learning are used to find controllers that optimize an objective function.
  - Challenges: Designing effective objective functions that lead to natural and life-like behaviors is extremely difficult. Heuristics (symmetry, stability, effort minimization) are often incorporated but require careful tuning and may not be universally applicable. Even with biologically accurate actuators, motions can still appear unnatural.
- Imitation Learning (Data-Driven Physics-Based Methods): These approaches combine physics simulation with reference motion data to improve motion quality.
  - Tracking Objectives (Da Silva et al. 2008; Kwon and Hodgins 2017; Lee et al. 2010a; Sharon and van de Panne 2005; Zordan and Hodgins 2002; Liu et al. 2016, 2010; Peng et al. 2018a): The most common method. The controller tries to minimize the pose error between the simulated character and target poses from a reference motion.
  - Synchronization: For individual clips, a phase variable (Lee et al. 2019; Peng et al. 2018a,b) can synchronize the character with the reference.
  - Scaling to Datasets: More recent methods provide target poses as inputs to the controller (Bergamin et al. 2019; Chentanez et al. 2018; Park et al. 2019; Won et al. 2020).
  - Limitations: These methods still require a high-level motion planner (Bergamin et al. 2019; Park et al. 2019; Peng et al. 2017) to select which motion clip to track for a given task, introducing significant overhead. Manually designed pose error metrics are also hard to tune across diverse skills.
- Adversarial Imitation Learning (AIL) in Motion:
  - Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon 2016; Ziebart et al. 2008): A key precursor. Instead of manual objectives, GAIL trains a discriminator to distinguish between agent-generated behaviors and expert demonstrations. The policy then optimizes the discriminator's output as a reward.
  - Standard GAIL: Requires access to the demonstrator's actions, which are often unavailable in motion clips. The Imitation from Observations extension (Torabi et al. 2018) allows training on state transitions only.
  - Previous AIL Limitations: Often unstable, and the resulting motion quality lagged behind tracking-based methods (Merel et al. 2017; Wang et al. 2017). Peng et al. [2019b] improved realism with an information bottleneck but still needed a phase variable, limiting it to single-motion imitation.
- Latent Space Models (as motion priors):
  - General Latent Space Control (Burgard et al. 2008; Florensa et al. 2017; Hausman et al. 2018; Heess et al. 2016): Encodes behaviors into a low-dimensional latent representation, which is then mapped to controls. Can be used in hierarchical control.
  - For Character Control (Merel et al. 2019; Peng et al. 2019a): Latent representations trained from reference motions constrain the character's behavior.
  - Limitations: Realism is enforced only implicitly. High-level controllers can still specify latent encodings that lead to unnatural behaviors. Luo et al. [2020] used an adversarial domain confusion loss in latent space, but this does not directly enforce similarity on the actual motions.
3.3. Technological Evolution
The field of character animation has evolved from purely artistic (keyframe animation) to physics-based (simulating natural dynamics), and then to data-driven methods (leveraging motion capture).
- Kinematic Era: Early data-driven methods focused on kinematic playback and motion graph techniques, offering realism but lacking physical interaction and generalization.
- Physics-Based Era: The introduction of physics engines allowed for physically plausible motions, but control was difficult, often relying on trajectory optimization or RL with hand-crafted reward functions, which struggled to produce natural motion.
- Hybrid (Tracking) Era: The convergence of data-driven and physics-based methods led to motion tracking, where physics-based characters try to imitate reference motion data. This improved realism and physical plausibility but introduced challenges in motion selection (motion planners) and reward engineering.
- Adversarial Era: The rise of Generative Adversarial Networks provided a new paradigm for learning complex distributions. Adversarial Imitation Learning (GAIL) promised to automate reward design for imitation but initially suffered from instability and lower motion quality in complex physics-based scenarios.
- AMP's Position: AMP sits at the forefront of this evolution, specifically addressing the limitations of prior AIL and tracking-based methods. It refines AIL to be robust enough for high-fidelity, physics-based character control from unstructured data, moving beyond the need for manual reward design and motion planners. It also improves upon latent space models by directly enforcing motion similarity through the reward function.
3.4. Differentiation Analysis
AMP distinguishes itself from previous methods primarily in its approach to motion style specification, scalability to unstructured data, and stability of adversarial training:
- Compared to Tracking-Based Methods (e.g., Peng et al. 2018a; Park et al. 2019):
  - Reward Design: Tracking methods rely on manually designed pose error metrics to explicitly minimize the difference between the character's pose and a specific reference pose. AMP replaces this with a learned, task-agnostic style-reward from an adversarial discriminator, obviating manual engineering and tuning.
  - Motion Selection: Tracking methods often require a high-level motion planner or explicit phase variables to select and synchronize with a particular reference motion clip. AMP, through its adversarial motion prior, automatically selects, interpolates, and generalizes from a large dataset of unstructured motions, without any explicit synchronization or clip selection mechanism.
  - Dataset Structure: Tracking methods typically need annotated and organized motion clips, or at least a mechanism to feed specific target poses. AMP works directly with raw, unstructured motion clips.
  - Task Coupling: Tracking methods implicitly couple task and style (e.g., a "running" reward means tracking a running clip). AMP explicitly decouples task objectives from style objectives, allowing a character to perform novel tasks in a learned style.
- Compared to Previous Adversarial Imitation Learning (AIL) Systems (e.g., Merel et al. 2017; Wang et al. 2017; Peng et al. 2019b):
  - Stability and Quality: Earlier AIL methods for physics-based characters were notoriously unstable and produced lower-fidelity motions. AMP introduces several critical design decisions (a Least-Squares Discriminator, a gradient penalty, and carefully chosen discriminator observations) that significantly improve training stability and yield high-quality, full-body motions comparable to state-of-the-art tracking methods.
  - Synchronization: Unlike Peng et al. [2019b], which still required a phase variable for synchronization and was therefore limited to single-motion imitation, AMP does not require any synchronization between the policy and the reference motion. This is key to its ability to learn from large, diverse datasets.
  - General Motion Prior: AMP learns a truly general motion prior from unstructured datasets, rather than being limited to imitating a single motion per policy.
- Compared to Latent Space Models (e.g., Peng et al. 2019a; Merel et al. 2020):
  - Direct vs. Implicit Enforcement: Latent space models enforce motion style implicitly by constraining actions through a learned latent representation. This can still lead to unnatural behaviors if the high-level controller specifies latent encodings outside the pre-trained distribution. AMP, by contrast, directly enforces similarity between the character's actual motions and the reference dataset through its discriminator reward, resulting in higher-fidelity motions.
  - Training Phases: Latent space models often require a separate pre-training phase for the low-level controller. AMP's motion prior can be trained jointly with the policy, simplifying the overall training pipeline.

In essence, AMP leverages the strengths of adversarial learning to create a task-agnostic motion prior that functions as a powerful, learned style-reward, overcoming the scalability and reward-engineering challenges of previous imitation learning techniques in physics-based character animation.
4. Methodology
The AMP system aims to synthesize a control policy that enables a character to achieve task objectives in a physically simulated environment while exhibiting behaviors that resemble a given dataset of reference motions. It decouples the "what" (task) from the "how" (style) of character behavior.
4.1. Principles
The core idea is to combine goal-conditioned reinforcement learning (RL) with an adversarial motion prior.
- Task Objective (r^G): High-level goals are defined by relatively simple, manually designed reward functions specific to the task (e.g., moving to a target location).
- Style Objective (r^S): Low-level motion style is learned from a dataset of unstructured motion clips using an adversarial discriminator. This discriminator becomes the motion prior, providing a reward signal that encourages the character's movements to be indistinguishable from the reference motions.
- Combined Reward: The policy is trained using a composite reward that linearly combines the task and style objectives.
- Automatic Skill Composition: The adversarial training allows the character to automatically select, interpolate, and generalize behaviors from the motion dataset to fulfill tasks, without requiring explicit motion planners or clip annotations.
The overall system architecture is depicted in Figure 2 from the original paper:
This figure is a schematic of the overall system: given a motion dataset that defines the character's motion style, the system trains a motion prior that specifies style-rewards r^S for the policy being trained. These style-rewards are combined with task-rewards r^G to train a policy so that the simulated character achieves task-specific goals g while exhibiting behaviors that resemble the reference motions in the dataset.
As shown in the figure, a given motion dataset defines the desired motion style. This dataset is used to train a motion prior (the discriminator) which, in turn, provides style-rewards (r^S) to the policy. These style-rewards are combined with task-rewards (r^G) that define task-specific goals (g). The reinforcement learning agent (the policy) then learns to generate actions (a_t) that drive the simulated character, aiming to satisfy both the task goals and the desired motion style.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Reward Function Formulation
The total reward at each time step is a combination of two components:
$
r(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, \mathbf{g}) = w^G\, r^G(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, \mathbf{g}) + w^S\, r^S(\mathbf{s}_t, \mathbf{s}_{t+1})
$
Where:
- r^G(s_t, a_t, s_{t+1}, g): The task-specific reward (the "what"). It defines high-level objectives that the character should satisfy, conditioned on the current state s_t, action a_t, next state s_{t+1}, and goal g.
- r^S(s_t, s_{t+1}): The style-reward (the "how"). It is a learned, task-agnostic reward component derived from the adversarial motion prior, specifying the low-level behavioral style.
- w^G and w^S: Manually specified weights for balancing the task and style objectives.

The main challenge addressed by AMP is learning an effective style-reward r^S that leads to naturalistic behaviors conforming to a particular style; this is the function of the Adversarial Motion Prior. A minimal sketch of the reward combination appears below.
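The following is a tiny illustrative sketch of the linear reward combination; the 0.5/0.5 weighting mirrors the values reported in the paper's experiments, everything else is assumed for illustration.

```python
def total_reward(task_reward, style_reward, w_task=0.5, w_style=0.5):
    """Linear combination of the task-reward r^G and the style-reward r^S."""
    return w_task * task_reward + w_style * style_reward

# Example: a mediocre task reward paired with a high style reward.
print(total_reward(task_reward=0.4, style_reward=0.9))  # 0.65
```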
4.2.2. Adversarial Motion Prior (AMP)
The style-reward r^S is modeled using a learned discriminator D(s, s'), which is referred to as an adversarial motion prior. Its purpose is to quantify the similarity of the character's motion to the motions in the dataset, without explicit comparison to a specific clip.
4.2.2.1. Imitation from Observations
Traditional Generative Adversarial Imitation Learning (GAIL) typically requires access to both states (s) and actions (a) from expert demonstrations. However, when working with motion clips, only states (poses) are observed, and the underlying actions of the demonstrator are unknown. To adapt GAIL to this state-only setting, the discriminator is trained on state transitions (s, s') rather than state-action pairs (s, a).
The objective for training the discriminator in a standard GAN framework with a sigmoid cross-entropy loss would be:
$
\underset{D}{\arg\min} \; -\mathbb{E}_{d^{M}(\mathbf{s}, \mathbf{s}')}\left[\log\left(D(\mathbf{s}, \mathbf{s}')\right)\right] - \mathbb{E}_{d^{\pi}(\mathbf{s}, \mathbf{s}')}\left[\log\left(1 - D(\mathbf{s}, \mathbf{s}')\right)\right]
$
Where:
- D(s, s'): The discriminator's output, a scalar representing the probability that the transition from state s to state s' is "real" (from the dataset).
- d^M(s, s'): The distribution of state transitions in the reference motion dataset M.
- d^π(s, s'): The distribution of state transitions generated by following the policy π.
- E[·]: Expected value.
- log: Natural logarithm.

This objective encourages the discriminator to output 1 for real transitions and 0 for fake transitions.
4.2.2.2. Least-Squares Discriminator
The standard sigmoid cross-entropy loss in GANs can suffer from vanishing gradients, hindering policy training. To improve stability and performance, AMP adopts the Least-Squares GAN (LSGAN) objective [Mao et al. 2017]. The discriminator is trained using a least-squares regression to predict a score of 1 for real samples (from the dataset) and -1 for fake samples (from the policy).
The LSGAN objective for training the discriminator is given by:
$
\underset{D}{\arg\min} \; \mathbb{E}_{d^{M}(\mathbf{s}, \mathbf{s}')}\left[\left(D(\mathbf{s}, \mathbf{s}') - 1\right)^2\right] + \mathbb{E}_{d^{\pi}(\mathbf{s}, \mathbf{s}')}\left[\left(D(\mathbf{s}, \mathbf{s}') + 1\right)^2\right]
$
Where:
- D(s, s'): The discriminator's output, a scalar value.
- d^M(s, s'): The distribution of real state transitions from the dataset.
- d^π(s, s'): The distribution of fake state transitions generated by the policy.
- (D(s, s') - 1)^2: Penalizes real samples whose score deviates from 1.
- (D(s, s') + 1)^2: Penalizes fake samples whose score deviates from -1.

After the discriminator is trained, the style-reward for the policy is derived from its output. This reward encourages the policy to generate transitions that the discriminator classifies as "real" (i.e., with a score close to 1). The paper uses the following specific form for the style-reward:
$
r^S(\mathbf{s}_t, \mathbf{s}_{t+1}) = \max\left[0, \; 1 - 0.25\left(D(\mathbf{s}_t, \mathbf{s}_{t+1}) - 1\right)^2\right]
$
Where:
- D(s_t, s_{t+1}): The discriminator's raw output for the policy's generated state transition.
- (D(s_t, s_{t+1}) - 1)^2: Measures how far the discriminator's output is from the target value of 1 (the score assigned to real samples). The policy tries to minimize this, pushing D toward 1.
- 0.25: A scaling factor.
- 1 - 0.25(·)^2: Transforms the error into a reward signal, where a smaller error (discriminator output closer to 1) yields a higher reward.
- max[0, ·]: Clips the reward to be non-negative, keeping it within [0, 1], as is common practice in RL.
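A minimal sketch of this reward mapping, directly implementing the equation above:

```python
def style_reward(d_score):
    """Style-reward from the discriminator score: r^S = max[0, 1 - 0.25 (D - 1)^2].

    A score of +1 (confidently "real") yields reward 1.0; scores at or
    below -1 (confidently "fake") yield reward 0.0.
    """
    return max(0.0, 1.0 - 0.25 * (d_score - 1.0) ** 2)

# Examples across the discriminator's target range [-1, 1].
for d in (-1.0, 0.0, 0.5, 1.0):
    print(d, style_reward(d))  # 0.0, 0.75, 0.9375, 1.0
```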
4.2.2.3. Discriminator Observations
The choice of features given to the discriminator is crucial for it to provide effective feedback to the policy. Before a state transition (s, s') is input to the discriminator, an observation map Φ extracts a compact set of motion-relevant features, so the discriminator operates on D(Φ(s), Φ(s')). These features are designed to be task-agnostic.
The set of features in Φ(s) includes:
- Linear velocity and angular velocity of the root: The root is typically the character's pelvis. These velocities are represented in the character's local coordinate frame.
- Local rotation of each joint: The 3D rotation of each spherical joint is encoded using two 3D vectors (the normal and tangent of its coordinate frame), providing a smooth and unique representation.
- Local velocity of each joint: The velocity of each joint relative to its parent or the root.
- 3D positions of the end-effectors: Such as the hands and feet, represented in the character's local coordinate frame.

The character's local coordinate frame is defined with the origin at the root, the X-axis along the root's facing direction, and the Y-axis aligned with the global up vector. A sketch of such an observation map appears below.
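The following is an illustrative sketch of how such an observation map might concatenate these features into a single vector. The function name, argument shapes, and feature ordering are assumptions for illustration, not the authors' exact layout.

```python
import numpy as np

def discriminator_observation(root_lin_vel, root_ang_vel,
                              joint_rotations, joint_velocities,
                              end_effector_positions):
    """Illustrative observation map Phi(s) for the discriminator.

    All inputs are assumed to already be expressed in the character's local
    coordinate frame (origin at the root, X along the facing direction, Y up).

    root_lin_vel, root_ang_vel : (3,) arrays
    joint_rotations            : (J, 6) array, normal-tangent encoding per joint
    joint_velocities           : (J, 3) array, local joint velocities
    end_effector_positions     : (E, 3) array, e.g., hands and feet
    """
    return np.concatenate([
        root_lin_vel.ravel(),
        root_ang_vel.ravel(),
        joint_rotations.ravel(),
        joint_velocities.ravel(),
        end_effector_positions.ravel(),
    ])

# A transition observation for the discriminator is then the pair
# (Phi(s), Phi(s')), e.g., the concatenation of two such feature vectors.
```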
4.2.2.4. Gradient Penalty
GANs are prone to unstable training dynamics, often due to function approximation errors in the discriminator. Specifically, the discriminator might assign non-zero gradients on the manifold of real data samples, which can cause the generator (policy) to overshoot the data manifold instead of converging to it, leading to oscillations. To stabilize training, a gradient penalty is applied, which penalizes non-zero gradients of the discriminator output with respect to its input features on samples from the dataset.
The discriminator objective with the gradient penalty is:
$
\underset{D}{\arg\min} \;\; \mathbb{E}_{d^{M}(\mathbf{s}, \mathbf{s}')}\left[\left(D(\Phi(\mathbf{s}), \Phi(\mathbf{s}')) - 1\right)^2\right] + \mathbb{E}_{d^{\pi}(\mathbf{s}, \mathbf{s}')}\left[\left(D(\Phi(\mathbf{s}), \Phi(\mathbf{s}')) + 1\right)^2\right] + \frac{w^{\mathrm{gp}}}{2}\, \mathbb{E}_{d^{M}(\mathbf{s}, \mathbf{s}')}\left[\left\| \nabla_{\phi} D(\phi) \big|_{\phi = (\Phi(\mathbf{s}), \Phi(\mathbf{s}'))} \right\|^2\right]
$
Where:
- The first two terms are the LSGAN objective described previously.
- The third term is the gradient penalty:
  - w^gp: A manually specified coefficient (weight) for the gradient penalty (e.g., 10).
  - ∇_φ D(φ): The gradient of the discriminator's output with respect to its input features φ = (Φ(s), Φ(s')).
  - ‖·‖^2: The squared Euclidean norm of the gradient.
- The gradient penalty is computed on samples from the real data distribution d^M, not on interpolated samples as in WGAN-GP. This encourages the discriminator's gradients to be close to zero on the actual data manifold.

A sketch of this discriminator objective appears below.
4.2.3. Model Representation
4.2.3.1. States and Actions
- State (s_t): Consists of features describing the character's body configuration:
  - Relative positions of each link with respect to the root (pelvis).
  - Rotation of each link, represented using a 6D normal-tangent encoding.
  - Linear and angular velocities of each link.
  All features are expressed in the character's local coordinate system. Unlike prior methods, no phase information or target poses are included, since AMP does not explicitly track specific reference motions.
- Action (a_t): Specifies target positions for the Proportional-Derivative (PD) controllers located at each of the character's joints.
  - For spherical joints (e.g., hip, shoulder), the target rotation is specified as a 3D exponential map q. The rotation axis and rotation angle are determined by:
    $
    \mathbf{v} = \frac{\mathbf{q}}{\|\mathbf{q}\|_2}, \qquad \theta = \|\mathbf{q}\|_2
    $
    Where:
    - q: The 3D vector representing the exponential map.
    - ‖q‖_2: The Euclidean norm (magnitude) of q.
    - v: The unit vector representing the rotation axis.
    - θ: The rotation angle.
    This representation is compact and avoids gimbal lock. A conversion sketch appears below.
  - For revolute joints (e.g., knee), the target is a 1D rotation angle θ.
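The following is a purely illustrative conversion from an exponential-map target to an axis-angle pair, matching the formulas above; in practice the simulator's own rotation utilities would handle this.

```python
import numpy as np

def exp_map_to_axis_angle(q, eps=1e-8):
    """Convert a 3D exponential-map vector q into a rotation axis v and angle theta."""
    theta = np.linalg.norm(q)                    # rotation angle = magnitude of q
    if theta < eps:
        return np.array([1.0, 0.0, 0.0]), 0.0    # near-zero rotation: arbitrary axis
    return q / theta, theta                      # rotation axis = normalized q

axis, angle = exp_map_to_axis_angle(np.array([0.0, 1.5708, 0.0]))
print(axis, angle)  # ~[0, 1, 0], ~pi/2: a 90-degree rotation about the Y axis
```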
4.2.3.2. Network Architecture
All neural networks (policy, value function, discriminator) are fully-connected networks with ReLU activation functions.
- Policy (π): Maps a given state s and goal g to a Gaussian distribution over actions, π(a | s, g) = N(μ(s, g), Σ).
  - μ(s, g): The mean of the Gaussian, produced by a network with two hidden layers (1024 and 512 units).
  - Σ: A fixed, manually specified diagonal covariance matrix.
- Value Function (V) and Discriminator (D): Modeled by separate networks with architectures similar to the policy's. A sketch of such a policy network follows.
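The following is an illustrative PyTorch sketch of a Gaussian policy with the stated hidden-layer sizes; the observation, goal, and action dimensions and the fixed standard deviation are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Policy network: two hidden layers (1024, 512) with ReLU, outputting the
    mean of a Gaussian over actions with a fixed diagonal covariance."""
    def __init__(self, obs_dim, goal_dim, act_dim, fixed_std=0.05):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, act_dim),
        )
        # Fixed (non-learned) standard deviation for every action dimension.
        self.register_buffer("std", torch.full((act_dim,), fixed_std))

    def forward(self, state, goal):
        mean = self.mean_net(torch.cat([state, goal], dim=-1))
        return Normal(mean, self.std)

policy = GaussianPolicy(obs_dim=105, goal_dim=3, act_dim=28)  # illustrative sizes
dist = policy(torch.randn(1, 105), torch.randn(1, 3))
action = dist.sample()  # an action drawn from the policy's Gaussian
```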
4.2.4. Training
AMP policies are trained using a combination of Generative Adversarial Imitation Learning (GAIL) and Proximal Policy Optimization (PPO) [Schulman et al. 2017].
Algorithm 1 in the paper outlines the training process. Detailed steps:
- Initialization (Lines 1-5): Initialize the motion dataset M, the discriminator D, the policy π, the value function V, and an empty replay buffer B.
- Trajectory Collection (Lines 7-16):
  - The policy π interacts with the environment to collect a batch of trajectories (episodes).
  - For each time step t in a trajectory:
    - The task-reward r^G is obtained from the environment based on the current goal g.
    - The discriminator is queried with the observed state transition (after applying the observation map Φ) to get a raw score d = D(Φ(s_t), Φ(s_{t+1})).
    - The style-reward r^S is calculated from d using Equation 7.
    - The total reward r is computed by linearly combining r^G and r^S (Equation 4).
  - Each collected trajectory (including states, actions, and the computed total rewards) is stored in the replay buffer B.
- Discriminator Update (Lines 17-21):
  - The discriminator is updated for a fixed number of steps per iteration.
  - Two mini-batches of state transitions are sampled: one from the real motion dataset M, and one from the replay buffer B (which contains policy-generated transitions from the current and past iterations).
  - The discriminator is updated using the LSGAN objective with the gradient penalty (Equation 8), aiming to distinguish between real and fake transitions. The replay buffer helps prevent overfitting to only the most recent policy behaviors.
- Policy and Value Function Update (Line 22):
  - The policy and value function are updated with PPO using the trajectories collected in the current iteration.
  - The value function is updated with target values computed using TD(λ) [Sutton and Barto 1998].
  - The policy is updated using advantages computed with Generalized Advantage Estimation, GAE(λ) [Schulman et al. 2015].
Additional Training Details:
- Reference State Initialization: To aid exploration, characters are initialized to states sampled randomly from all motion clips in the dataset.
- Early Termination: For most tasks, an episode terminates early if any part of the character's body (except feet) touches the ground. This encourages stable upright locomotion. For contact-rich tasks (e.g., rolling), this criterion is disabled.
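To make the procedure concrete, here is a highly simplified sketch of the training loop described above. All interfaces (collect_episodes, ppo_update, discriminator, dataset, replay_buffer and their methods) are hypothetical placeholders supplied by the caller, not the authors' code.

```python
def train_amp(collect_episodes, ppo_update, discriminator, dataset, replay_buffer,
              iterations=1000, disc_updates_per_iter=2, batch_size=256,
              w_task=0.5, w_style=0.5):
    """Minimal sketch of the AMP training loop (Algorithm 1), under assumed interfaces:
      collect_episodes() -> list of episodes; each step has .obs_transition and .task_reward
      ppo_update(episodes) -> PPO policy/value update
      discriminator.score / .update, dataset.sample, replay_buffer.store / .sample
    """
    for _ in range(iterations):
        # 1. Roll out the current policy and relabel rewards.
        episodes = collect_episodes()
        for episode in episodes:
            for step in episode:
                d = discriminator.score(step.obs_transition)       # D(Phi(s), Phi(s'))
                r_style = max(0.0, 1.0 - 0.25 * (d - 1.0) ** 2)    # style-reward (Eq. 7)
                step.reward = w_task * step.task_reward + w_style * r_style  # Eq. 4
            replay_buffer.store(episode)

        # 2. Update the discriminator: dataset transitions vs. buffered policy
        #    transitions, using the LSGAN objective with the gradient penalty (Eq. 8).
        for _ in range(disc_updates_per_iter):
            discriminator.update(dataset.sample(batch_size),
                                 replay_buffer.sample(batch_size))

        # 3. Update the policy and value function with PPO (TD(lambda) targets, GAE).
        ppo_update(episodes)
```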
5. Experimental Setup
The effectiveness of AMP is evaluated on a suite of challenging motion control tasks involving complex 3D simulated characters.
5.1. Datasets
Reference motion clips are a combination of:
- Public mocap libraries (e.g., the CMU Graphics Lab Motion Capture Database and the SFU Motion Capture Database).
- Custom recorded mocap clips.
- Artist-authored keyframe animations (e.g., for the T-Rex).

The datasets are described as unstructured, meaning they are raw collections of clips without task-specific annotations or segmentation into individual skills. This is a key feature that AMP aims to leverage.
The following are the summary statistics of the different datasets used to train the motion priors (Table 2 from the original paper):
| Character | Dataset | Size (s) | Clips | Subjects |
|---|---|---|---|---|
| Humanoid | Cartwheel | 13.6 | 3 | 1 |
| Humanoid | Jump | 28.6 | 10 | 4 |
| Humanoid | Locomotion | 434.1 | 56 | 8 |
| Humanoid | Run | 204.4 | 47 | 3 |
| Humanoid | Run + Leap + Roll | 22.1 | 10 | 7 |
| Humanoid | Stealthy | 136.5 | 3 | 1 |
| Humanoid | Walk | 229.6 | 9 | 5 |
| Humanoid | Walk + Punch | 247.8 | 15 | 9 |
| Humanoid | Zombie | 18.3 | 1 | 1 |
| T-Rex | Locomotion | 10.5 | 5 | 1 |
Example of data use: For the Target Heading task with the Locomotion dataset, the clips contain various walking, running, and jogging motions. For the Strike task with the Walk + Punch dataset, the clips consist of walking motions and separate punching motions. The system automatically learns to compose these.
5.2. Evaluation Metrics
The paper uses two primary metrics to evaluate performance:
- Normalized Task Return:
  - Conceptual Definition: This metric quantifies how well the policy achieves its high-level task objectives. It normalizes the cumulative task-specific reward collected over an episode to a range between 0 and 1, where 0 is the minimum possible return and 1 is the maximum possible return for that task. It focuses solely on task success, independent of motion style.
  - Mathematical Formula: (Not given explicitly in the paper, but implied by "normalized task return, with 0 being the minimum possible return per episode and 1 being the maximum possible return.") Let R_T be the total task-specific reward accumulated over an episode, R_{T,min} the minimum possible task return, and R_{T,max} the maximum possible task return:
    $
    \text{Normalized Task Return} = \frac{R_T - R_{T,\min}}{R_{T,\max} - R_{T,\min}}
    $
  - Symbol Explanation:
    - R_T: The sum of task-specific rewards for a given episode.
    - R_{T,min}: The minimum possible sum of task-specific rewards for an episode.
    - R_{T,max}: The maximum possible sum of task-specific rewards for an episode.
- Pose Error (for single-clip imitation):
  - Conceptual Definition: This metric quantifies the fidelity of the simulated character's motion compared to a specific reference motion. It measures the average positional difference between corresponding joints of the simulated character and the reference motion. The paper states this metric aligns better with human perception of motion similarity.
  - Mathematical Formula:
    $
    e_t^{\mathrm{pose}} = \frac{1}{N^{\mathrm{joint}}} \sum_{j \in \mathrm{joints}} \left\| \left(\mathbf{x}_t^j - \mathbf{x}_t^{\mathrm{root}}\right) - \left(\hat{\mathbf{x}}_t^j - \hat{\mathbf{x}}_t^{\mathrm{root}}\right) \right\|_2
    $
    The average pose error for an episode is then taken over time. Crucially, before computing this error, Dynamic Time Warping (DTW) is applied to align the reference motion with the simulated character's motion.
  - Symbol Explanation:
    - e_t^pose: The pose error at time step t.
    - N^joint: The total number of joints in the character's body.
    - j: Index for a specific joint.
    - x_t^j: The 3D Cartesian position of joint j in the simulated character at time t.
    - x_t^root: The 3D Cartesian position of the root joint (pelvis) in the simulated character at time t.
    - x_t^j - x_t^root: The position of joint j relative to the root in the simulated character.
    - x̂_t^j: The 3D Cartesian position of joint j in the reference motion at time t.
    - x̂_t^root: The 3D Cartesian position of the root joint in the reference motion at time t.
    - x̂_t^j - x̂_t^root: The position of joint j relative to the root in the reference motion.
    - ‖·‖_2: The Euclidean (L2) norm, measuring the distance between the relative joint positions.

A minimal sketch of this metric appears below.
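The following is an illustrative sketch of the per-frame pose error plus a textbook dynamic-time-warping accumulation over a precomputed frame-cost matrix. The exact alignment and normalization details used by the authors may differ; this only conveys the idea.

```python
import numpy as np

def pose_error(sim_joints, sim_root, ref_joints, ref_root):
    """Per-frame pose error: mean distance between root-relative joint positions.

    sim_joints, ref_joints : (J, 3) joint positions for one frame
    sim_root, ref_root     : (3,) root (pelvis) positions for the same frame
    """
    rel_sim = sim_joints - sim_root
    rel_ref = ref_joints - ref_root
    return np.linalg.norm(rel_sim - rel_ref, axis=-1).mean()

def dtw_average_cost(cost):
    """Dynamic time warping over a precomputed (n, m) frame-cost matrix, where
    cost[i, j] is the pose error between simulated frame i and reference frame j.
    Returns the accumulated cost of the optimal warping path, roughly normalized
    by an upper bound on the path length.
    """
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)
```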
5.3. Baselines
AMP is compared against several baseline methods to highlight its advantages:
- Latent Space Models:
  - Approach: This baseline aims to learn a motion prior indirectly through a latent representation. A low-level controller is first pre-trained with a motion tracking objective to imitate the same reference motions used for AMP; this controller learns to map a latent variable z to actions. A separate high-level controller is then trained for each downstream task, which specifies latent variables to the fixed low-level controller.
  - Relevance: Represents methods that use hierarchical control and implicit style constraints.
  - Specifics (from Appendix C): An encoder maps a goal to a distribution over latent variables. A latent encoding z is sampled and passed to the policy. The encoder and policy are trained jointly using an objective that includes a KL-regularizer (similar to Variational Autoencoders) with respect to a variational prior. After pre-training, the low-level policy is fixed, and a new high-level controller is trained for each task. Pre-training uses a motion imitation task in which the goal specifies target poses. The latent encoding is 16-dimensional.
- No Data (Baseline from Scratch):
  - Approach: Policies are trained from scratch for each task using only the task-specific reward r^G, without leveraging any motion data or motion prior.
  - Relevance: This baseline demonstrates the difficulty of achieving natural behaviors solely through manually designed task objectives, highlighting the value of data-driven style guidance.
- Motion Tracking (Peng et al. 2018a):
  - Approach: A state-of-the-art tracking-based imitation learning method. It uses a manually designed reward function that explicitly minimizes the pose error between the simulated character and a reference motion, and requires a phase variable to synchronize the policy with the reference motion.
  - Relevance: Serves as a strong baseline for evaluating AMP's ability to produce high-quality motions, especially in single-clip imitation tasks, without the need for manual reward engineering or synchronization.
Simulation Details:
- Physics Engine: Bullet physics engine [Coumans et al. 2013].
- Simulation Frequency: 1.2 kHz.
- Policy Query Rate: 30 Hz.
- Actuation: PD controllers at the character's joints.
- Training Time: 100-300 million samples, requiring roughly 30-140 hours on 16 CPU cores.
- Gradient Penalty Coefficient (w^gp): Set to 10.
6. Results & Analysis
6.1. Core Results Analysis
The AMP framework is demonstrated on a diverse cast of complex simulated characters (34 DoF humanoid, 59 DoF T-Rex, 64 DoF dog) and a challenging suite of motor control tasks.
6.1.1. Task Performance and Skill Composition
AMP successfully enables characters to perform various high-level tasks while adopting specific motion styles, accommodating large unstructured datasets. The weights for the task-reward (w^G) and style-reward (w^S) are both set to 0.5 for all tasks.
1. Target Heading / Target Location:
- Locomotion Dataset: When trained with the Locomotion dataset (walking, running, jogging), the humanoid automatically selects appropriate gaits based on the target speed: it walks at slow target speeds, jogs at medium speeds, and runs at high speeds. This demonstrates automatic interpolation and generalization of skills. Policies also develop human-like strategies such as banking into turns and slowing before sharp direction changes.
- Stylistic Behaviors: By providing different datasets (Zombie, Stealthy), the character learns distinct stylistic gaits (e.g., a shambling zombie walk or stealthy movements), proving AMP's ability to control style via example.
- No Motion Planner: These intricate behaviors and transitions emerge automatically from the motion prior, without requiring a motion planner or explicit motion selection, a key advantage over many prior tracking-based systems.
2. Obstacles:
- AMP trains visuomotor policies for traversing obstacle courses. Using the Run + Leap + Roll dataset, the character learns to leap over gaps and transition into a rolling behavior to pass under overhead obstructions.
- Novel Recovery: When falling, the character can tuck its body into a roll to transition more quickly into a get-up behavior, even though this specific combined motion (fall + roll + get-up) is not present in the dataset. This indicates generalization and the discovery of novel strategies.
- Diverse Obstacle Clearing Styles: Different datasets (Jump, Cartwheel) allow the character to clear stepping stones by jumping or cartwheeling, showcasing stylistic variation in complex environments.
3. Dribbling:
- The humanoid learns to dribble a soccer ball to a target location, demonstrating the ability to handle complex object manipulation tasks while maintaining a specified style (e.g., Locomotion or Zombie).
4. Strike:
- Using the Walk + Punch dataset, the character learns to walk to a target and then transition to a punching motion when in range. The dataset contains only walking-only and punching-only clips; the policy learns to temporally sequence these behaviors automatically. This is a strong example of composition of disparate skills emerging from the motion prior.
Figure 1 (Main Paper) & Figure 3 (Appendix A) illustrate these behaviors:
This image shows complex motion behaviors of physically simulated characters: on the left, a character running and clearing obstacles; on the right, the character interacting with objects. Using unstructured motion clips, the character automatically produces high-quality motions.
This image shows a variety of humanoid behaviors, including running, getting up, dribbling, striking, clearing obstacles, rolling, and jumping, each displayed as a sequence of frames illustrating the character's graceful movement under physics-based control.
6.1.2. Single-Clip Imitation
Although designed for large datasets, AMP's effectiveness in imitating individual motion clips is also evaluated. Here, the policy is trained solely to maximize the style-reward r^S.
- AMP can closely imitate a wide variety of highly dynamic and acrobatic skills (e.g., back-flip, side-flip, cartwheel, spin-kick) for the humanoid, T-Rex, and dog characters (see Figures 6 and 7).

This image shows snapshots of skills learned by the humanoid in single-clip imitation tasks. From top to bottom: back-flip, side-flip, cartwheel, spin, spin-kick, and roll. AMP enables the character to reliably imitate a diverse set of highly dynamic and acrobatic skills.
This image shows motions of the other characters, including the T-Rex walking and the dog jogging and trotting. Each motion is shown as a series of dynamic poses, illustrating the diversity of physics-based character control.

- Performance is evaluated using pose error after applying Dynamic Time Warping (DTW) for alignment, since AMP policies are not synchronized via phase variables and may progress at different rates.
- AMP produces results of comparable quality to tracking-based methods (Peng et al. 2018a), but without the need for manual reward engineering or phase-based synchronization.
- Limitation: For some complex motions like the Front-Flip, AMP can converge to locally optimal behaviors (e.g., shuffling forward instead of flipping), a challenge that tracking-based methods can mitigate with early termination criteria.
6.1.3. Comparison to Baselines
- Latent Space Models & No Data: Both AMP and the latent space models produce substantially more life-like behaviors than policies trained from scratch (No Data). This is visually evident in the supplementary video and quantitatively in task performance (Figure 5).

Figure 5 compares task performance between AMP, the No Data baseline, and the latent space models on different tasks (e.g., Target Location, Dribble, Strike, and Obstacles). AMP achieves performance comparable to the other techniques while producing more realistic motions.

- AMP vs. Latent Space Models: AMP mitigates unnatural motions more effectively than latent space models because it directly enforces motion similarity through its reward function. Latent space models, which enforce style indirectly, can suffer from a distribution mismatch between the high-level controller's latent encodings and the low-level controller's pre-trained distribution. While latent space models can sometimes solve tasks faster initially due to structured exploration, their pre-training phase can be sample-intensive; AMP does not require such a separate pre-training phase.
6.1.4. Ablation Studies (Critical Design Decisions)
- Gradient Penalty: This is identified as the most vital component for stable training and effective performance. Models trained without the gradient penalty (AMP - No GP) exhibit large performance fluctuations and noticeable visual artifacts; its inclusion leads to faster learning and improved stability (see Figure 8).
- Velocity Features in Discriminator: Including velocity features in the discriminator's observations is crucial for imitating certain dynamic motions, such as rolling. Without them, the character might converge to an undesirable static pose (e.g., holding a fixed pose on the ground instead of rolling) (see Figure 8).
- Dataset Diversity: Experiments on the Target Heading task using limited datasets (only walking or only running) versus the diverse Locomotion dataset show that the diversity of behaviors (e.g., gait transitions) is largely attributable to the motion prior and not solely the task objective. Policies trained with limited data cannot achieve the full range of target speeds (see Figure 4).

Figure 4 shows the performance of Target Heading policies trained on different datasets. Left: learning curves comparing the normalized task returns of policies trained with the large, diverse Locomotion dataset against policies trained with only walking or only running reference motions. Right: target speed versus the actual average speed achieved by each policy.

Figure 8 illustrates these ablation results with learning curves on single-clip imitation tasks, comparing AMP (ours), AMP without velocity features (AMP - No Vel), AMP without gradient penalty regularization (AMP - No GP), and the motion tracking method of Peng et al. AMP achieves low pose error without requiring manually designed reward functions or synchronization with a reference motion.
6.1.5. Spatial Composition (Appendix D)
- AMP also demonstrates some capability for spatial composition, where a character performs different skills simultaneously (e.g., walking while waving). Using datasets with only walking motions and only waving motions (no combined examples), the policy learns to combine these skills.
- Policies trained with both datasets achieve relatively high rewards on both the heading and waving objectives, outperforming those trained with only one type of motion (see Table 6).
- Limitation: Some unnatural behaviors can still emerge, especially when the target height for the hand is very high, suggesting room for improvement in more complex spatial compositions.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Character | Task | Dataset | Task Return |
|---|---|---|---|
| Humanoid | Target Heading | Locomotion | 0.90 ± 0.01 |
| Humanoid | Target Heading | Walk | 0.46 ± 0.01 |
| Humanoid | Target Heading | Run | 0.63 ± 0.01 |
| Humanoid | Target Heading | Stealthy | 0.89 ± 0.02 |
| Humanoid | Target Heading | Zombie | 0.94 ± 0.00 |
| Humanoid | Target Location | Locomotion | 0.63 ± 0.01 |
| Humanoid | Target Location | Zombie | 0.50 ± 0.00 |
| Humanoid | Obstacles | Run + Leap + Roll | 0.27 ± 0.10 |
| Humanoid | Stepping Stones | Cartwheel | 0.43 ± 0.03 |
| Humanoid | Stepping Stones | Jump | 0.56 ± 0.12 |
| Humanoid | Dribble | Locomotion | 0.78 ± 0.05 |
| Humanoid | Dribble | Zombie | 0.60 ± 0.04 |
| Humanoid | Strike | Walk + Punch | 0.73 ± 0.02 |
| T-Rex | Target Location | Locomotion | 0.36 ± 0.03 |
The following are the results from Table 3 of the original paper:
| Character | Motion | Dataset Size | Motion Tracking | AMP (Ours) |
|---|---|---|---|---|
| Humanoid | Back-Flip | 1.75s | 0.076 ± 0.021 | 0.150 ± 0.028 |
| Humanoid | Cartwheel | 2.72s | 0.039 ± 0.011 | 0.067 ± 0.014 |
| Humanoid | Crawl | 2.93s | 0.044 ± 0.001 | 0.049 ± 0.007 |
| Humanoid | Dance | 1.62s | 0.038 ± 0.001 | 0.055 ± 0.015 |
| Humanoid | Front-Flip | 1.65s | 0.278 ± 0.040 | 0.425 ± 0.010 |
| Humanoid | Jog | 0.83s | 0.029 ± 0.001 | 0.056 ± 0.001 |
| Humanoid | Jump | 1.77s | 0.033 ± 0.001 | 0.083 ± 0.022 |
| Humanoid | Roll | 2.02s | 0.072 ± 0.018 | 0.088 ± 0.008 |
| Humanoid | Run | 0.80s | 0.028 ± 0.002 | 0.075 ± 0.015 |
| Humanoid | Spin | 1.00s | 0.063 ± 0.022 | 0.047 ± 0.002 |
| Humanoid | Side-Flip | 2.44s | 0.191 ± 0.043 | 0.124 ± 0.012 |
| Humanoid | Spin-Kick | 1.28s | 0.042 ± 0.001 | 0.058 ± 0.012 |
| Humanoid | Walk | 1.30s | 0.018 ± 0.005 | 0.030 ± 0.001 |
| Humanoid | Zombie | 1.68s | 0.049 ± 0.013 | 0.058 ± 0.014 |
| T-Rex | Turn | 2.13s | 0.098 ± 0.011 | 0.284 ± 0.023 |
| T-Rex | Walk | 2.00s | 0.069 ± 0.005 | 0.096 ± 0.027 |
| Dog | Canter | 0.45s | 0.026 ± 0.002 | 0.034 ± 0.002 |
| Dog | Pace | 0.63s | 0.020 ± 0.001 | 0.024 ± 0.003 |
| Dog | Spin | 0.73s | 0.026 ± 0.002 | 0.086 ± 0.008 |
| Dog | Trot | 0.52s | 0.019 ± 0.001 | 0.026 ± 0.001 |
The following are the results from Table 4 of the original paper:
| Parameter | Value |
|---|---|
| Task-Reward Weight | 0.5 |
| Style-Reward Weight | 0.5 |
| Gradient Penalty | 10 |
| Samples Per Update Iteration | 4096 |
| Batch Size | 256 |
| Policy Stepsize (Single-Clip Imitation) | |
| Policy Stepsize (Tasks) | |
| Value Stepsize (Single-Clip Imitation) | |
| Value Stepsize (Tasks) | |
| Discriminator Stepsize | |
| Discriminator Replay Buffer Size | |
| Discount (Single-Clip Imitation) | 0.95 |
| Discount (Tasks) | 0.99 |
| SGD Momentum | 0.9 |
| GAE(λ) | 0.95 |
| TD(λ) | 0.95 |
| PPO Clip Threshold | 0.02 |
The following are the results from Table 5 of the original paper:
| Parameter | Value |
|---|---|
| Latent Encoding Dimension | 16 |
| KL-Regularizer | |
| Samples Per Update Iteration | 4096 |
| Batch Size | 256 |
| Policy Stepsize (Pre-Training) | |
| Policy Stepsize (Downstream Task) | |
| Value Stepsize | |
| Discount (Pre-Training) | 0.95 |
| Discount (Downstream Task) | 0.99 |
| SGD Momentum | 0.9 |
| GAE(λ) | 0.95 |
| TD(λ) | 0.95 |
| PPO Clip Threshold | 0.02 |
The following are the results from Table 6 of the original paper:
| Dataset (Size) | Heading Return | Waving Return |
|---|---|---|
| Wave (51.7s) | 0.683 ± 0.195 | 0.949 ± 0.144 |
| Walk (229.7s) | 0.945 ± 0.192 | 0.306 ± 0.378 |
| Wave + Walk (281.4s) | 0.885 ± 0.184 | 0.891 ± 0.202 |
6.3. Ablation Studies / Parameter Analysis
The paper meticulously investigates the impact of key design choices on AMP's performance and stability.
6.3.1. Importance of Gradient Penalty
- Observation: The gradient penalty (from Equation 8) is identified as the most crucial component for AMP's success.
- Impact: Without this regularization, training is highly unstable, characterized by large performance fluctuations (evident in the curves in Figures 8 and 9). This often leads to severe visual artifacts and unnatural behaviors in the final policies.
- Benefit: The gradient penalty significantly improves stability during adversarial training and leads to substantially faster learning across a wide range of skills. This is a critical finding for making adversarial imitation learning practical for high-fidelity character control. A minimal sketch of such a penalty term appears below.
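The following PyTorch-style sketch shows a gradient penalty applied to the discriminator at real (dataset) samples, in the spirit of Equation 8. The discriminator architecture, the observation dimensionality, and the coefficient `GP_COEFF = 10` (taken from Table 4) are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

GP_COEFF = 10.0  # gradient penalty weight from Table 4 (assumed to correspond to Equation 8)

# Illustrative discriminator: maps a (state, next-state) feature vector to a scalar score.
discriminator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))

def gradient_penalty(real_obs: torch.Tensor) -> torch.Tensor:
    """Penalize the squared gradient norm of the discriminator at real (dataset) samples.

    This is a sketch of the regularizer that the ablations identify as critical for
    stable adversarial training; the paper's exact normalization may differ.
    """
    real_obs = real_obs.clone().requires_grad_(True)
    scores = discriminator(real_obs)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=real_obs, create_graph=True)[0]
    return GP_COEFF * 0.5 * grads.pow(2).sum(dim=-1).mean()

# Hypothetical usage inside a discriminator update:
real_batch = torch.randn(256, 64)        # stand-in for dataset discriminator observations
loss_gp = gradient_penalty(real_batch)   # added to the least-squares discriminator loss
print(loss_gp.item())
```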
6.3.2. Role of Velocity Features in Discriminator Observations
- Observation: While one might assume that providing consecutive poses to the discriminator implicitly conveys velocity information, the ablation study (AMP - No Vel in Figure 8) shows this is often insufficient.
- Impact: Without explicit velocity features in the discriminator observations, the policy can converge to undesirable locally optimal behaviors. For instance, on the rolling motion, the character may learn to simply lie on the ground in a fixed pose rather than performing a dynamic roll.
- Benefit: Including explicit linear and angular velocities of the root and joints in the discriminator's input helps it better capture the dynamics of the reference motions, guiding the policy toward more accurate and dynamic behaviors. A sketch of such an observation vector follows this list.
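As an illustration of what such discriminator observations might contain, here is a hedged sketch that stacks pose and explicit velocity features for a state transition. The `CharacterState` fields and the exact feature set (reference frame, which joints are included) are assumptions, not the paper's precise definition.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CharacterState:
    """Illustrative per-frame state of the simulated character (assumed fields)."""
    root_lin_vel: np.ndarray      # (3,) root linear velocity
    root_ang_vel: np.ndarray      # (3,) root angular velocity
    joint_rotations: np.ndarray   # (num_joints, 4) joint rotations, e.g. quaternions
    joint_velocities: np.ndarray  # (num_joints, 3) joint angular velocities

def discriminator_observation(state: CharacterState, next_state: CharacterState) -> np.ndarray:
    """Concatenate pose AND explicit velocity features for a (state, next-state) pair.

    The ablation suggests that dropping the velocity terms lets the policy collapse
    to static poses, e.g. lying still instead of performing a dynamic roll.
    """
    def features(s: CharacterState) -> np.ndarray:
        return np.concatenate([
            s.joint_rotations.ravel(),   # pose features
            s.root_lin_vel,              # explicit root linear velocity
            s.root_ang_vel,              # explicit root angular velocity
            s.joint_velocities.ravel(),  # explicit joint velocities
        ])
    return np.concatenate([features(state), features(next_state)])

# Hypothetical usage with a 15-joint character:
s0 = CharacterState(np.zeros(3), np.zeros(3), np.zeros((15, 4)), np.zeros((15, 3)))
s1 = CharacterState(np.ones(3), np.zeros(3), np.zeros((15, 4)), np.zeros((15, 3)))
print(discriminator_observation(s0, s1).shape)
```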
6.3.3. Influence of Dataset Diversity on Skill Composition
- Experiment: To confirm that the observed diversity and transitions between gaits (e.g., walking, jogging, running) are a product of the motion prior and not just the task objective, policies for the Target Heading task are trained with:
  - A large, diverse Locomotion dataset.
  - Limited datasets containing only walking motions.
  - Limited datasets containing only running motions.
- Results (Figure 4):
  - Policies trained with only walking motions learn exclusively walking gaits and cannot achieve faster target speeds, resulting in lower task returns for higher speed goals.
  - Policies trained with only running motions struggle to match slower target speeds, as they only know how to run.
  - Policies trained with the diverse Locomotion dataset are much more flexible, able to smoothly transition between gaits to match a wider range of target speeds, achieving higher overall task returns.
- Conclusion: This validates that the motion prior itself, learned from the diversity of the provided dataset, is responsible for enabling the policy to compose and dynamically adapt different skills to achieve task objectives (see the dataset sketch after this list).
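To make the "unstructured dataset" point concrete, here is a hedged sketch of how such datasets might be assembled: clips are simply pooled with no labels, segmentation, or sequencing, and the discriminator samples transitions from the pool. The `MotionClip` structure and the uniform sampling scheme are illustrative assumptions, not the paper's implementation.

```python
import random
import numpy as np
from dataclasses import dataclass

@dataclass
class MotionClip:
    """Illustrative reference clip: a sequence of precomputed discriminator observations."""
    name: str
    transitions: np.ndarray  # (num_transitions, obs_dim) features for (s, s') pairs

def build_dataset(clips: list[MotionClip]) -> list[MotionClip]:
    """Pool clips with no annotations, segmentation, or ordering."""
    return list(clips)

def sample_real_transitions(dataset: list[MotionClip], batch_size: int) -> np.ndarray:
    """Sample discriminator 'real' examples from the pooled clips."""
    rows = []
    for _ in range(batch_size):
        clip = random.choice(dataset)
        rows.append(clip.transitions[random.randrange(len(clip.transitions))])
    return np.stack(rows)

# The ablation's conditions differ only in which clips go into the pool:
walk_only = build_dataset([MotionClip("walk", np.random.randn(100, 64))])
run_only = build_dataset([MotionClip("run", np.random.randn(100, 64))])
locomotion = build_dataset(walk_only + run_only)  # diverse pool enables gait transitions
print(sample_real_transitions(locomotion, 4).shape)
```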
6.3.4. Discount Factor (γ)
- Observation (Table 4): The discount factor (γ) for the RL objective differs between settings.
- Impact: A smaller γ (0.95) is found to be more effective for single-clip imitation tasks, allowing the character to focus on short-term fidelity to the reference motion. For tasks with additional objectives that may require longer-horizon planning (e.g., Dribble, Strike), a larger γ (0.99) is used. A brief effective-horizon reading of these values follows.
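A rough way to interpret these values is through the effective planning horizon implied by the discount, commonly approximated as 1/(1 − γ); this back-of-the-envelope reading is our own framing, not a calculation from the paper.

$$
\text{effective horizon} \approx \frac{1}{1-\gamma}:\qquad
\gamma = 0.95 \;\Rightarrow\; \approx 20 \text{ steps},\qquad
\gamma = 0.99 \;\Rightarrow\; \approx 100 \text{ steps}.
$$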
6.4. Learning Curves
The learning curves, such as those presented in Figure 8 and Figure 9 for single-clip imitation and Figure 5 for task performance, provide quantitative insights into training progression. Figure 10 provides a comprehensive collection of learning curves across various tasks and datasets.
- General Trend: AMP often shows slower initial learning than motion tracking baselines on single-clip imitation (Figure 8), but eventually reaches comparable or even superior performance for some skills (e.g., Spin, Side-Flip). This is expected, since AMP must learn the reward function implicitly, whereas motion tracking benefits from a direct, engineered error signal.
- Stability: The AMP curves (blue line) are generally smoother and more stable than the variant without the gradient penalty (AMP - No GP, purple line), reinforcing the importance of that regularizer.
- Comparison to Baselines in Tasks (Figure 5): AMP and Latent Space models typically converge to higher task returns than No Data baselines, showcasing the benefit of data-driven style guidance.
Figure 8 (learning curves, single-clip imitation): compares AMP (ours), AMP without velocity features (AMP - No Vel), AMP without the gradient-penalty regularizer (AMP - No GP), and the motion tracking method of Peng et al. AMP achieves low pose error without a manually designed reward function or synchronization with the reference motion.

Figure 9 (pose error vs. samples): plots position error (meters) against the number of samples for AMP (blue) and motion tracking (orange) while executing motions such as the back-flip, cartwheel, and dance, showing AMP's error comparing favorably on several motions.

Figure 10 (learning curves across tasks and datasets): multiple subplots show task return versus number of samples for the various tasks, illustrating AMP's performance across a range of settings.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces AMP (Adversarial Motion Priors), a novel and effective adversarial learning system for physics-based character animation. The core contribution is the development of a task-agnostic motion prior derived from an adversarial discriminator, which serves as a powerful style-reward for reinforcement learning. This approach successfully addresses key limitations of prior methods:
- Elimination of Manual Reward Engineering: AMP obviates the need for manually designed imitation objectives, which are often difficult to construct and tune across diverse skills.
- Scalability to Unstructured Data: It effectively leverages large, unstructured motion datasets without requiring task-specific annotations, segmentation, or high-level motion planners.
- Automatic Skill Composition: The system automatically learns to compose, interpolate, and generalize disparate skills from the dataset to fulfill complex task objectives, demonstrating emergent behavior sequencing.
- High-Fidelity Motions: AMP produces high-quality, natural, and life-like full-body motions for physically simulated characters, achieving results comparable to state-of-the-art tracking-based techniques.
- Enhanced Training Stability: Crucial design decisions, particularly the Least-Squares Discriminator and the gradient penalty on discriminator observations, significantly improve the stability of adversarial training, making the approach practical.

By decoupling task specification from style specification, AMP offers a flexible and powerful framework for generating sophisticated and stylized character behaviors in diverse simulated environments.
7.2. Limitations & Future Work
The authors identify several limitations and propose avenues for future research:
- Mode Collapse: Like many GAN-based techniques, AMP is susceptible to mode collapse. When given a large and diverse motion dataset, the policy might imitate only a small subset of the available behaviors, neglecting other potentially optimal styles. This restricts full exploitation of the dataset's diversity.
- Training Motion Priors from Scratch: Currently, the motion prior is trained from scratch for each policy and task. Although the prior is largely task-agnostic, this process can be computationally intensive.
  - Future Work: Exploring techniques for developing more general and transferable motion priors could lead to modular objective functions that can be reused across different policies and tasks without retraining.
- Task Dependencies in Motion Priors: While designed to be task-agnostic, the data used to train the current motion priors are generated by policies performing a particular task. This could inadvertently introduce some task dependencies into the prior, hindering its transferability.
  - Future Work: Training motion priors on data generated from a larger and more diverse repertoire of tasks might help make them truly general and transferable to novel tasks.
- Spatial vs. Temporal Composition: The experiments primarily focus on temporal composition of skills (performing different behaviors sequentially, e.g., walk then punch). While some spatial composition (performing multiple skills simultaneously, e.g., walk and wave) is demonstrated, AMP might still struggle to produce fully natural behaviors in highly complex spatial compositions, especially when skills conflict.
  - Future Work: Developing motion priors that are more amenable to complex spatial composition of disparate skills could lead to even more flexible and sophisticated behaviors.
7.3. Personal Insights & Critique
AMP represents a significant leap forward in physics-based character animation by effectively harnessing the power of adversarial imitation learning. The decoupling of task and style objectives is an elegant solution to a long-standing problem in the field, offering a more intuitive and scalable interface for animators and developers.
Key Strengths and Insights:
- Practicality of AIL: The paper's most impactful contribution may be demonstrating that adversarial imitation learning, often criticized for instability, can indeed produce high-fidelity motions for complex physical characters when properly regularized. Identifying the gradient penalty and carefully chosen discriminator observations as critical for stability is a valuable insight for anyone working with GANs in RL.
- Generalization and Emergent Behavior: The automatic composition of disparate skills and the generalization to novel recovery behaviors (e.g., tucking into a roll during a fall) without explicit programming or motion planning is highly impressive. This showcases the power of learning a general motion prior over explicit tracking.
- Unstructured Data Advantage: The ability to work with unstructured motion clips is a major practical advantage. It removes the cumbersome preprocessing steps (annotation, segmentation) that plague many data-driven animation pipelines, making it much easier to leverage large, readily available motion datasets.
Critique and Areas for Further Consideration:
- Mode Collapse Mitigation: While the paper acknowledges mode collapse as a limitation, it is a fundamental challenge for GANs. Future work could explore diversity-promoting techniques (e.g., conditional GANs, ensemble methods, or explicit diversity penalties) within AMP to ensure that the policy truly utilizes the full spectrum of styles available in large datasets, rather than converging to a subset.
- Interpretability of the Motion Prior: The adversarial motion prior functions as a black box. While effective, understanding which specific features of a motion it deems "natural" or "stylistic" could provide deeper insights and perhaps lead to more controllable or fine-tunable style generation.
- Computational Cost: Training AMP policies still requires significant computational resources (30-140 hours on 16 CPU cores for 100-300 million samples). While comparable to other state-of-the-art RL methods, this is a barrier to rapid iteration. Research into more sample-efficient adversarial RL or transfer learning for motion priors is crucial.
- Robustness in Spatial Composition: The observation that spatial composition can still produce unnatural behaviors when the target hand height is high (Appendix D) points to a frontier. This is a complex problem where different skills may have conflicting kinematic or dynamic requirements. A more explicit mechanism for modeling skill compatibility or blending in the latent space of the discriminator could be explored.

In conclusion, AMP is a robust and innovative framework that pushes the boundaries of physically simulated character control. Its elegant solution to motion style specification and skill composition paves the way for more autonomous, natural, and expressive virtual agents, with clear implications for both computer graphics and robotics.