DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills
TL;DR Summary
DeepMimic uses reinforcement learning to train physics-based characters that robustly imitate diverse, complex motions while adapting to tasks and environments, enabling versatile, interactive control across multiple characters and skills.
Abstract
A longstanding goal in character animation is to combine data-driven specification of behavior with a system that can execute a similar behavior in a physical simulation, thus enabling realistic responses to perturbations and environmental variation. We show that well-known reinforcement learning (RL) methods can be adapted to learn robust control policies capable of imitating a broad range of example motion clips, while also learning complex recoveries, adapting to changes in morphology, and accomplishing user-specified goals. Our method handles keyframed motions, highly-dynamic actions such as motion-captured flips and spins, and retargeted motions. By combining a motion-imitation objective with a task objective, we can train characters that react intelligently in interactive settings, e.g., by walking in a desired direction or throwing a ball at a user-specified target. This approach thus combines the convenience and motion quality of using motion clips to define the desired style and appearance, with the flexibility and generality afforded by RL methods and physics-based animation. We further explore a number of methods for integrating multiple clips into the learning process to develop multi-skilled agents capable of performing a rich repertoire of diverse skills. We demonstrate results using multiple characters (human, Atlas robot, bipedal dinosaur, dragon) and a large variety of skills, including locomotion, acrobatics, and martial arts.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills
- Authors: Xue Bin Peng (University of California, Berkeley), Pieter Abbeel (University of California, Berkeley), Sergey Levine (University of California, Berkeley), Michiel van de Panne (University of British Columbia).
- Journal/Conference: ACM Transactions on Graphics (TOG), presented at SIGGRAPH 2018. SIGGRAPH is the premier conference for computer graphics and interactive techniques, making it a highly prestigious venue. Publication in TOG indicates a work of significant and lasting impact.
- Publication Year: 2018
- Abstract: The paper presents a method to create physics-based character controllers that can realistically imitate example motions while also adapting to new situations and accomplishing tasks. The core idea is to use reinforcement learning (RL) where the reward function encourages the character to mimic a reference motion (e.g., from motion capture) and achieve a specific goal (e.g., walk in a certain direction). This approach combines the visual quality of data-driven animation with the robustness of physics-based simulation. The method is shown to work for a wide range of highly dynamic skills (like flips and kicks), various characters (humanoid, Atlas robot, dinosaur, dragon), and can integrate multiple motion clips to create agents with a diverse repertoire of abilities.
- Original Source Link:
- Official Link: https://doi.org/10.1145/3197517.3201311
- Preprint (arXiv): https://arxiv.org/abs/1804.02717
- The paper was formally published in ACM Transactions on Graphics.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Creating realistic and interactive character animations is a major challenge. Kinematic (data-driven) methods produce high-quality motion but struggle to react realistically to unexpected physical perturbations or changes in the environment. Physics-based methods are realistic by nature but are notoriously difficult to control (directability), often requiring painstaking manual design of controllers for each specific skill.
- Existing Gaps: Prior deep RL methods for character control could learn tasks from scratch but often resulted in unnatural, jittery, or "un-humanlike" motions. Methods that did use motion data were often complex, limited in the types of motions they could track, or couldn't easily incorporate new task goals. There was a need for a general framework that could leverage motion data for visual quality while using the power of RL for robustness and goal-directed behavior.
- Innovation: DeepMimic introduces a conceptually simple yet powerful framework that directly integrates motion imitation into the RL reward function. The agent is rewarded for both "looking like" a reference motion and for achieving a task. The authors also identify and demonstrate the critical importance of two training techniques—Reference State Initialization (RSI) and Early Termination (ET)—for successfully learning highly dynamic and complex skills.
- Main Contributions / Findings (What):
- A General Framework: The paper presents a unified framework for learning physics-based character skills by combining a motion imitation reward with a task reward within a standard deep RL algorithm (PPO).
- High-Quality Dynamic Skills: It demonstrates the ability to learn an unprecedented range of highly dynamic and acrobatic skills (e.g., backflips, cartwheels, spinkicks) that were previously very difficult to achieve with learning-based methods. The resulting motions are robust and can recover from significant perturbations.
- Critical Training Techniques: It identifies Reference State Initialization (RSI) and Early Termination (ET) as crucial components for enabling the learning of these complex skills, providing a key insight for future research in the field.
- Multi-Skill Integration: It explores and validates three distinct methods for creating agents capable of performing multiple skills: using a multi-clip reward, training a "skill selector" policy, and composing pre-trained policies using their value functions.
- Versatility: The framework's effectiveness is demonstrated across multiple character morphologies, including a standard humanoid, the Atlas robot, a T-Rex, and a dragon.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Physics-Based Character Animation: A subfield of computer graphics where character motion is generated by simulating the physical forces (gravity, friction, muscle actuations) acting on a character model (represented as a collection of rigid bodies and joints). This ensures physical realism but makes control extremely difficult.
- Kinematic Animation: Motion is defined by specifying the position, orientation, and angles of joints over time, typically sourced from motion capture data or hand-animation. It offers high artistic control and visual quality but lacks inherent physical realism and cannot automatically react to unforeseen physical interactions.
- Reinforcement Learning (RL): A machine learning paradigm where an "agent" learns to make decisions by interacting with an "environment". The agent receives a state from the environment, takes an action, and receives a reward signal. The goal is to learn a policy (a strategy for choosing actions) that maximizes the cumulative reward over time.
- Policy Gradient Methods: A class of RL algorithms that directly optimize the parameters of the agent's policy. They work by estimating the gradient of the total expected reward with respect to the policy parameters and updating the parameters in the direction that increases the reward.
- PD Controllers (Proportional-Derivative Controllers): A simple feedback control mechanism. Instead of commanding raw torques, the policy in DeepMimic specifies a target angle for each joint. The PD controller then calculates the torque needed to move the joint to that target angle, based on the current angle (Proportional term) and angular velocity (Derivative term). This simplifies the control problem for the RL agent.
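To make the PD-control idea concrete, here is a minimal Python sketch; the gain values `kp` and `kd` and the function name are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of PD control for one joint (illustrative gains, not the
# paper's values): the policy outputs `target_angle`, and the controller
# converts it into a torque applied in the physics simulation.
def pd_torque(target_angle, angle, angular_velocity, kp=300.0, kd=30.0):
    proportional = kp * (target_angle - angle)   # drive toward the target angle
    derivative = -kd * angular_velocity          # damp the joint's velocity
    return proportional + derivative

# Example: joint at 0.1 rad moving at 0.5 rad/s, policy requests 0.4 rad.
torque = pd_torque(target_angle=0.4, angle=0.1, angular_velocity=0.5)
```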
- Previous Works:
- Kinematic Models: Methods like motion graphs [Lee et al. 2010b] stitch together pieces of motion capture data to generate new animations. While good for known situations, they cannot synthesize truly novel reactions to physical forces.
- Physics-based Models: Early work often relied on manually designing controllers [Yin et al. 2007] or using trajectory optimization [Mordatch et al. 2012]. These were either not generalizable to new skills or too computationally expensive for real-time control.
- Reinforcement Learning for Animation: Prior deep RL work [Heess et al. 2017] successfully trained agents to walk or run, but the motions often looked unnatural and artifact-prone. Generative Adversarial Imitation Learning (GAIL) [Ho and Ermon 2016] tried to learn a reward function from data, but results still lacked the quality of traditional animation.
- Motion Imitation: The idea of tracking reference motions is not new. SAMCON [Liu et al. 2010] was a highly capable but complex system that could reproduce acrobatic skills. DeepLoco [Peng et al. 2017a] used an imitation reward similar to DeepMimic's but was limited to locomotion and used fixed initial states, preventing it from learning dynamic aerial maneuvers.
- Differentiation: DeepMimic's key innovation lies in its simplicity and generality. Unlike the complex, multi-stage pipeline of SAMCON, DeepMimic uses a standard, end-to-end deep RL approach. Unlike prior deep RL methods, it achieves exceptionally high motion quality by directly incorporating a carefully designed imitation reward. Crucially, its use of RSI and ET unlocks the ability to learn the highly dynamic, acrobatic skills that were previously out of reach for general-purpose learning algorithms.
4. Methodology (Core Technology & Implementation)
The core idea of DeepMimic is to train a neural network policy that outputs actions to control a physics-based character. The training is guided by a reward function that encourages the character to imitate a reference motion while also accomplishing a task goal.
- Principles: The system is built on standard policy gradient reinforcement learning. The central intuition is that a good reward function can shape the agent's behavior. By rewarding similarity to high-quality motion capture data, the resulting learned policy will naturally produce visually appealing motions.
- Steps & Procedures:
- Input: The system takes a character model, a reference motion clip (e.g., a .bvh file from motion capture), and an optional task definition.
- State and Action Space:
- State (s): The state vector describes the character's physical configuration. It includes:
- Relative positions and rotations (quaternions) of each body part with respect to the root (pelvis).
- Linear and angular velocities of each body part.
- A phase variable that indicates the current progress through the reference motion cycle.
- All features are in the character's local coordinate frame for robustness to global position and orientation.
- Goal (g): An optional vector specifying the task goal, e.g., a target walking direction.
- Action (a): The policy outputs a vector of target angles for the PD controllers at each joint.
- Training Loop: The policy is trained using Proximal Policy Optimization (PPO), a state-of-the-art policy gradient algorithm.
- An episode begins. A starting state is chosen using Reference State Initialization (RSI).
- The agent simulates the character step-by-step. At each step, it observes the state $s_t$, computes an action $a_t$ from its policy, and applies the resulting torques via PD controllers.
- The simulation provides the next state $s_{t+1}$ and a reward $r_t$.
- The episode ends if the character falls (Early Termination, ET) or a time limit is reached.
- The collected trajectory data (states, actions, rewards) is used to update the policy and value function networks. This process is repeated for millions of simulation steps.
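As a concrete illustration of the state construction above, the following sketch assembles per-link features (expressed in the root's local frame) and the phase variable into one vector. The `character` object and its helper methods are hypothetical placeholders, not structures from the paper's code.

```python
import numpy as np

# Minimal sketch of building the state vector: a phase variable in [0, 1) plus
# each link's position, rotation, and velocities in the root's local frame.
# `character`, `links`, and the `to_root_frame*` helpers are hypothetical.
def build_state(character, time_in_cycle, cycle_duration):
    phase = (time_in_cycle % cycle_duration) / cycle_duration
    features = [np.array([phase])]
    for link in character.links:
        features.append(character.to_root_frame(link.position))        # 3D position
        features.append(character.to_root_frame_quat(link.rotation))   # quaternion (4D)
        features.append(character.to_root_frame(link.linear_velocity))
        features.append(character.to_root_frame(link.angular_velocity))
    return np.concatenate(features)
```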
- Mathematical Formulas & Key Details:
1. Policy and Value Networks: The policy $\pi(a \mid s, g)$ is a neural network that maps state-goal pairs to a Gaussian distribution over actions. The value function $V(s, g)$ is a separate network with a similar architecture that estimates the expected future reward from a given state. For tasks involving terrain, a visuomotor policy is used.
Fig. 2. Schematic illustration of the visuomotor policy network. The heightmap is processed by 3 convolutional layers with 16 8×8 filters, 32 4×4 filters, and 32 4×4 filters. The feature maps are then processed by 64 fully-connected units. The resulting features are concatenated with the input state and goal and processed by two fully-connected layers with 1024 and 512 units. The output is produced by a layer of linear units. ReLU activations are used for all hidden layers. For tasks that do not require a heightmap, the networks consist only of layers 5-7. As shown in Image 7, for vision-based tasks, a heightmap of the terrain is processed by convolutional layers; the output is then concatenated with the standard state and goal vectors and fed into fully-connected layers to produce the final action.
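For concreteness, here is a minimal PyTorch sketch of the non-visual policy trunk described in the caption (two hidden layers of 1024 and 512 ReLU units with a linear output). The fixed action standard deviation, class name, and parameter names are simplifying assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code) of the non-visual policy network:
# hidden layers of 1024 and 512 ReLU units, a linear output for the mean of a
# Gaussian over actions, and a fixed standard deviation (a simplification).
class MimicPolicy(nn.Module):
    def __init__(self, state_dim, goal_dim, action_dim, sigma=0.05):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, action_dim),  # linear output: mean PD target angles
        )
        self.sigma = sigma

    def forward(self, state, goal):
        mean = self.trunk(torch.cat([state, goal], dim=-1))
        return torch.distributions.Normal(mean, self.sigma)

# Example sizes from Table 1: the humanoid has 197 state features and
# 36 action parameters; the 2-D goal (e.g., a target heading) is hypothetical.
policy = MimicPolicy(state_dim=197, goal_dim=2, action_dim=36)
```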
2. Total Reward Function: The total reward at each timestep is a weighted sum of an imitation reward and a task reward: $r_t = \omega^I r^I_t + \omega^G r^G_t$.
- $r^I_t$: The imitation reward, encouraging the character to look like the reference motion.
- $r^G_t$: The task reward, encouraging goal achievement (e.g., moving forward).
- $\omega^I, \omega^G$: Weights to balance the two objectives (the paper uses $\omega^I = 0.7$ and $\omega^G = 0.3$).
3. Imitation Reward ($r^I_t$): This is a composite reward designed to match multiple aspects of the reference motion: $r^I_t = w^p r^p_t + w^v r^v_t + w^e r^e_t + w^c r^c_t$. The paper uses weights $w^p = 0.65$, $w^v = 0.1$, $w^e = 0.15$, and $w^c = 0.1$. Each component is an exponential of a negative squared error, which rewards being close to the target and does not harshly penalize large deviations.
- Pose Reward ($r^p_t$): Matches the joint orientations, $r^p_t = \exp\!\left[-2 \sum_j \lVert \hat{q}^j_t \ominus q^j_t \rVert^2\right]$.
- $q^j_t$: The orientation (quaternion) of joint $j$ of the simulated character.
- $\hat{q}^j_t$: The target orientation of joint $j$ from the reference motion.
- $\hat{q}^j_t \ominus q^j_t$: The quaternion difference, representing the rotation needed to get from $q^j_t$ to $\hat{q}^j_t$.
- $\lVert \cdot \rVert$: The angle of rotation represented by the difference quaternion.
- Velocity Reward ($r^v_t$): Matches the joint angular velocities, $r^v_t = \exp\!\left[-0.1 \sum_j \lVert \hat{\dot{q}}^j_t - \dot{q}^j_t \rVert^2\right]$.
- $\dot{q}^j_t$: The angular velocity of joint $j$ of the simulated character.
- $\hat{\dot{q}}^j_t$: The target angular velocity from the reference motion (calculated via finite differences).
- End-Effector Reward ($r^e_t$): Matches the world-space positions of the hands and feet, $r^e_t = \exp\!\left[-40 \sum_e \lVert \hat{p}^e_t - p^e_t \rVert^2\right]$.
- $p^e_t$: The 3D world position of end-effector $e$ (e.g., left foot) of the simulated character.
- $\hat{p}^e_t$: The target position from the reference motion.
- Center-of-Mass Reward ($r^c_t$): Matches the center-of-mass position, $r^c_t = \exp\!\left[-10 \lVert \hat{p}^c_t - p^c_t \rVert^2\right]$.
- $p^c_t$: The 3D position of the center-of-mass of the simulated character.
- $\hat{p}^c_t$: The target center-of-mass position from the reference motion.
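The sketch below shows how the composite imitation reward could be evaluated with the weights and scales listed above. For brevity it treats joint orientations as scalar angles; the paper uses quaternion differences for 3D joints, and the array inputs are hypothetical.

```python
import numpy as np

# Minimal sketch of the composite imitation reward. For simplicity, joint
# "orientations" are treated as scalar angles; the paper uses quaternion
# differences. Inputs are numpy arrays of matching shapes (hypothetical).
def imitation_reward(q, dq, ee, com, q_ref, dq_ref, ee_ref, com_ref):
    r_pose = np.exp(-2.0  * np.sum((q_ref  - q)   ** 2))   # joint orientations
    r_vel  = np.exp(-0.1  * np.sum((dq_ref - dq)  ** 2))   # joint velocities
    r_ee   = np.exp(-40.0 * np.sum((ee_ref - ee)  ** 2))   # end-effector positions
    r_com  = np.exp(-10.0 * np.sum((com_ref - com) ** 2))  # center of mass
    return 0.65 * r_pose + 0.1 * r_vel + 0.15 * r_ee + 0.1 * r_com
```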
4. Training Strategies:
- Reference State Initialization (RSI): Instead of always starting an episode from the first frame of the motion, RSI initializes the character at a random frame sampled from the reference motion. This is crucial for learning complex skills. For example, to learn a backflip, the agent can be initialized mid-air or just before landing, allowing it to learn the difficult landing phase without first having to master the perfect jump. This dramatically simplifies the exploration problem.
- Early Termination (ET): An episode is immediately terminated if the character enters a "failure state", such as its head or torso hitting the ground. This prevents the agent from wasting time and network capacity learning to recover from unrecoverable situations and implicitly encourages it to avoid falling.
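A minimal sketch of how RSI and ET fit into a rollout is shown below; `env`, `policy`, and `reference_motion` are hypothetical stand-ins for the simulator, the learned policy, and the mocap clip, and are not names from the paper's released code.

```python
import numpy as np

# Minimal sketch of an episode with Reference State Initialization (RSI) and
# Early Termination (ET). All objects and method names here are hypothetical.
def rollout(env, policy, reference_motion, max_steps=600):
    # RSI: start from a randomly sampled phase of the reference motion.
    phase = np.random.uniform(0.0, 1.0)
    state = env.reset(pose=reference_motion.sample(phase))

    trajectory = []
    for _ in range(max_steps):
        action = policy.sample(state)
        next_state, reward, fallen = env.step(action)
        trajectory.append((state, action, reward))
        if fallen:  # ET: stop as soon as the head or torso touches the ground
            break
        state = next_state
    return trajectory
```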
5. Multi-Skill Integration:
- Multi-Clip Reward: To learn from multiple clips (e.g., walking straight, turning left, turning right), the imitation reward is modified to be the maximum reward achievable across all clips. This allows the policy to dynamically choose which clip is most relevant to imitate at any moment.
- Skill Selector: A single policy is trained on a set of skills. The goal input is a one-hot vector indicating which skill the user wants the character to perform. The policy must learn to both imitate the selected skill and transition smoothly between different skills.
- Composite Policy: Multiple policies are trained independently, each for a single skill. At runtime, a master policy mixes the individual policies, $\Pi(a \mid s) = \sum_i w_i(s)\, \pi_i(a \mid s)$ with $w_i(s) = \frac{\exp(V_i(s)/\mathcal{T})}{\sum_j \exp(V_j(s)/\mathcal{T})}$. The weight for each policy is determined by its value function $V_i(s)$, which estimates how well that skill can be performed from the current state $s$.
- $\pi_i$: The $i$-th pre-trained policy.
- $V_i(s)$: The value function for the $i$-th policy.
- $w_i(s)$: The probability of choosing policy $\pi_i$, derived from a Boltzmann (softmax) distribution over the value functions.
- $\mathcal{T}$: A temperature parameter controlling the sharpness of the selection.
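The composite-policy selection rule can be sketched as follows; the `policies` and `value_fns` lists and the temperature value are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the composite policy: a skill is chosen with probability
# proportional to exp(V_i(s) / T), then that skill's policy proposes the action.
# `policies` and `value_fns` are hypothetical per-skill callables.
def composite_action(state, policies, value_fns, temperature=0.3):
    values = np.array([v(state) for v in value_fns])
    logits = values / temperature
    logits -= logits.max()                          # for numerical stability
    weights = np.exp(logits) / np.sum(np.exp(logits))
    i = np.random.choice(len(policies), p=weights)
    return policies[i].sample(state)
```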
5. Experimental Setup
- Datasets: The reference motions are primarily sourced from publicly available motion capture (mocap) datasets. For non-humanoid characters like the T-Rex and dragon, keyframed animations were used instead, demonstrating the method's flexibility.
- Characters: Four main characters with varying morphologies and dynamics were used to test the generality of the approach.
Illustration of the four character models used: Humanoid, Atlas, T-Rex, and Dragon. The properties of these characters are detailed in Table 1 below.
Table 1. Properties of the characters (manual transcription from the paper).

| Property | Humanoid | Atlas | T-Rex | Dragon |
| --- | --- | --- | --- | --- |
| Links | 13 | 12 | 20 | 32 |
| Total Mass (kg) | 45 | 169.8 | 54.5 | 72.5 |
| Height (m) | 1.62 | 1.82 | 1.66 | 1.83 |
| Degrees of Freedom | 34 | 31 | 55 | 79 |
| State Features | 197 | 184 | 262 | 418 |
| Action Parameters | 36 | 32 | 64 | 94 |

- Evaluation Metrics:
- Normalized Return (NR): The primary quantitative metric used to report performance.
- Conceptual Definition: NR measures the quality of the learned policy's performance on a scale from 0 to 1. A score of 1 represents the maximum possible reward (perfect imitation), while a score of 0 represents the minimum possible reward (e.g., falling immediately). It provides a standardized way to compare performance across different skills with different reward scales.
- Mathematical Formula: The paper does not provide an explicit formula, but it can be defined as $\text{NR} = \frac{R - R_{\min}}{R_{\max} - R_{\min}}$.
- Symbol Explanation:
- $R$: The expected total reward obtained by the learned policy over an episode.
- $R_{\min}$: The minimum possible episode reward, typically 0.
- $R_{\max}$: The maximum possible episode reward, achieved by perfectly tracking the reference motion for the entire episode duration.
- Baselines: The main comparisons are ablation studies of their own method. The baselines are:
- DeepMimic (Full Method): with both Reference State Initialization (RSI) and Early Termination (ET).
- No RSI: training with a fixed initial state but with ET.
- No ET: training with RSI but episodes run for a fixed length regardless of falls.
- No RSI & No ET: the most basic setup, with a fixed initial state and fixed-length episodes.
- Tasks:
- Target Heading: A task where the character must follow a reference locomotion gait (e.g., walking or running) while steering towards a user-specified target direction $d^*_t$. The task reward is $r^G_t = \exp\!\left[-2.5 \max\left(0,\, v^* - d^{*\top}_t v^{\mathrm{com}}_t\right)^2\right]$, where $v^{\mathrm{com}}_t$ is the character's center-of-mass velocity and $v^*$ is the desired speed. This reward encourages the character to maintain a minimum speed $v^*$ in the target direction $d^*_t$.
- Other tasks included striking a target with a kick and throwing a ball at a target.
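A small sketch of the target-heading reward above, assuming a desired speed of 1 m/s and planar (2D) velocity and direction vectors; the input names are hypothetical.

```python
import numpy as np

# Minimal sketch of the target-heading task reward: penalize falling short of a
# desired speed along the target direction. Inputs are 2D numpy vectors
# (hypothetical names); a `desired_speed` of 1 m/s is an assumption.
def heading_reward(com_velocity, target_direction, desired_speed=1.0):
    speed_along_target = float(np.dot(target_direction, com_velocity))
    shortfall = max(0.0, desired_speed - speed_along_target)
    return np.exp(-2.5 * shortfall ** 2)
```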
6. Results & Analysis
The paper presents extensive results, both quantitative and qualitative, to demonstrate the effectiveness and versatility of DeepMimic.
- Core Results: The method successfully learns a large and diverse corpus of skills for the humanoid character, many of which are highly dynamic and had not been previously synthesized with general-purpose RL methods.
Table 2. Performance statistics of imitating various skills (manual transcription from the paper). All skills are performed by the humanoid unless stated otherwise. $T_{\mathrm{cycle}}$ is the clip length, $N_{\mathrm{samples}}$ is the number of training samples, and NR is the Normalized Return.

| Skill | Tcycle (s) | Nsamples (10^6) | NR |
| --- | --- | --- | --- |
| Backflip | 1.75 | 72 | 0.729 |
| Balance Beam | 0.73 | 96 | 0.783 |
| Baseball Pitch | 2.47 | 57 | 0.785 |
| Cartwheel | 2.72 | 51 | 0.804 |
| Crawl | 2.93 | 68 | 0.932 |
| Dance A | 1.62 | 67 | 0.863 |
| Dance B | 2.53 | 79 | 0.822 |
| Frontflip | 1.65 | 81 | 0.485 |
| Getup Face-Down | 3.28 | 49 | 0.885 |
| Getup Face-Up | 4.02 | 66 | 0.838 |
| Headspin | 1.92 | 112 | 0.640 |
| Jog | 0.80 | 51 | 0.951 |
| Jump | 1.77 | 86 | 0.947 |
| Kick | 1.53 | 50 | 0.854 |
| Landing | 2.83 | 66 | 0.590 |
| Punch | 2.13 | 60 | 0.812 |
| Roll | 2.02 | 81 | 0.735 |
| Run | 0.80 | 53 | 0.951 |
| Sideflip | 2.44 | 64 | 0.805 |
| Spin | 4.42 | 191 | 0.664 |
| Spinkick | 1.28 | 67 | 0.748 |
| Vault 1-Handed | 1.53 | 41 | 0.695 |
| Vault 2-Handed | 1.90 | 87 | 0.757 |
| Walk | 1.26 | 61 | 0.985 |
| Atlas: Backflip | 1.75 | 63 | 0.630 |
| Atlas: Run | 0.80 | - | - |

The table shows high NR scores across the board, with simpler motions like Walk and Jog achieving near-perfect imitation (>0.95) and highly complex acrobatic skills like Backflip and Cartwheel achieving strong scores (>0.70). The framework also successfully transfers to the significantly heavier and dynamically different Atlas robot. Qualitative results (shown in images 1, 5, 6, 7, 9) are a key part of the paper's contribution, demonstrating fluid, natural-looking, and robust motions that are visually almost indistinguishable from the reference mocap but can adapt to perturbations and new goals.
- Ablations / Parameter Sensitivity: The ablation study highlights the critical role of RSI and ET.
Fig. 11 (from PDF). Learning curves for policies trained with and without reference state initialization (RSI) and early termination (ET). Image 3 shows the learning curves for four skills.
- For simple skills like Walk: all methods eventually learn the skill, but the full method (RSI + ET) learns much faster.
- For dynamic skills like Backflip, Sideflip, and Spinkick: the results are stark. The methods without RSI completely fail to learn, achieving near-zero rewards. The method with RSI but no ET learns, but more slowly and to a lower final performance than the full method. This provides conclusive evidence that RSI is essential for exploration in complex skills, and ET further improves sample efficiency and final performance.
- Task-Directed Behavior:
- Figure 7 in the paper shows a character successfully learning to strike a randomly placed target with a spinkick, demonstrating the seamless integration of the task ($r^G_t$) and imitation ($r^I_t$) objectives.
- Figure 8 shows that without the guidance of a reference motion (i.e., only a task reward to throw a ball), the agent fails to discover the complex throwing behavior and instead learns a simple but suboptimal strategy (running towards the target). This highlights the importance of the imitation reward for specifying complex, stylized behaviors.
- Figure 4 (Image 9 in this analysis) shows a character successfully navigating varied and challenging terrains (gaps, stairs, beams) by adapting a standard walking motion, demonstrating the robustness afforded by the physics-based approach.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents DeepMimic, a powerful and general framework for creating physics-based character controllers. By combining an imitation-based reward with a task reward and using standard deep RL algorithms, the method can learn an extensive repertoire of complex and highly dynamic skills. The authors demonstrate that the resulting controllers are not only visually realistic but also robust, goal-aware, and adaptable to different characters and environments. The paper's key insight is that this success is critically dependent on the use of two simple but effective training strategies: Reference State Initialization (RSI) and Early Termination (ET).
- Limitations & Future Work:
- Data Dependency: The method's quality is tied to the quality of the reference motion. It cannot invent skills for which no reference data exists.
- Scalability: While the paper demonstrates three methods for multi-skill integration, scaling to hundreds or thousands of skills within a single framework remains a challenge.
- Simulation-to-Reality Gap: The work is purely in simulation. Transferring these policies to a real-world robot like the Atlas would require addressing the sim-to-real gap.
- The authors suggest future work on developing better methods for integrating large numbers of clips to build agents with even richer skill repertoires.
- Personal Insights & Critique:
- Impact: DeepMimic was a landmark paper in character animation. It dramatically raised the bar for what was considered possible with deep RL and set a new standard for motion quality. Its conceptual simplicity, combined with its impressive results, made it highly influential and inspired a wave of subsequent research in learned character control.
- Strength: The paper's greatest strength is its clear identification and empirical validation of RSI and ET. These "tricks" are not just minor improvements; they are enabling factors for learning complex skills. This was a crucial contribution to the RL community, showing that how you structure the learning problem can be as important as the learning algorithm itself.
- Critique: While powerful, the framework is still a form of imitation. The agent learns to be robust around a given motion style but does not exhibit true creativity or problem-solving from first principles. The reliance on clean motion capture data also remains a practical bottleneck for many potential applications. Nevertheless, as a tool for combining the best of data-driven and physics-based animation, DeepMimic represents a major step forward.