Learning Robust and Agile Legged Locomotion Using Adversarial Motion Priors
TL;DR Summary
This study introduces the first blind locomotion system for legged robots, achieving robust traversal in complex terrains and rapid movement on natural terrains, utilizing Adversarial Motion Priors (AMP) for zero-shot generalization.
Abstract
Developing both robust and agile locomotion skills for legged robots is non-trivial. In this work, we present the first blind locomotion system capable of traversing challenging terrains robustly while moving rapidly over natural terrains. Our approach incorporates the Adversarial Motion Priors (AMP) in locomotion policy training and demonstrates zero-shot generalization from the motion dataset on flat terrains to challenging terrains in the real world. We show this result on a quadruped robot Go1 using only proprioceptive sensors consisting of the IMU and joint encoders. Experiments on the Go1 demonstrate the robust and natural motion generated by the proposed method for traversing challenging terrains while moving rapidly over natural terrains.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Learning Robust and Agile Legged Locomotion Using Adversarial Motion Priors." It focuses on developing advanced locomotion skills for legged robots that combine both robustness in challenging terrains and agility on natural terrains.
1.2. Authors
The authors of the paper are Jinze Wu, Guiyang Xin, Chenkun Qi, and Yufei Xue. Their affiliations are marked with footnote symbols in the original manuscript, indicating different research institutions or departments. Guiyang Xin and Chenkun Qi are noted as Members of IEEE, indicating their professional involvement and recognition in the field of electrical and electronics engineering, which encompasses robotics and automation.
1.3. Journal/Conference
The paper does not explicitly state the journal or conference it was published in, but the standard format for academic papers suggests it is likely intended for a peer-reviewed publication. The index terms "Legged Robots," "Reinforcement Learning," and "Machine Learning for Robot Control" are typical for robotics and AI conferences or journals.
1.4. Publication Year
The publication date (UTC) is given as 2023-06-28T00:00:00.000Z.
1.5. Abstract
The paper addresses the challenge of developing both robust and agile locomotion skills for legged robots. It introduces a novel blind locomotion system—meaning it relies solely on internal (proprioceptive) sensors without external perception—that can robustly traverse challenging terrains while moving rapidly over natural ones. The core methodology integrates Adversarial Motion Priors (AMP) into the locomotion policy training. A key finding is its zero-shot generalization capability, where the policy, trained on motion data from flat terrains, successfully applies to complex real-world environments. This system is demonstrated on a Unitree Go1 quadruped robot using only IMU and joint encoders. Experiments show the method generates natural and robust motion for traversing obstacles and agile movement over varied outdoor terrains.
1.6. Original Source Link
The original source link provided is /files/papers/695a514dba758f5b2a1ecae0/paper.pdf. This typically refers to a local file path on a system or a server, suggesting it is a direct link to the PDF of the paper. Its publication status is likely an officially published or accepted paper, given the formal structure and presentation.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the non-trivial task of developing locomotion skills for legged robots that are simultaneously robust (able to handle challenging terrains) and agile (able to move rapidly over natural terrains).
This problem is crucial in the current field because legged robots have the potential for versatile real-world applications (e.g., search and rescue, exploration, logistics) that require adapting to diverse and unpredictable environments. Current research faces specific challenges:
-
Robustness: Most existing methods for robust locomotion over challenging terrains rely on computationally intensive exteroceptive sensors like cameras and LIDARs. These sensors are susceptible to environmental factors (lighting, weather) and require complex foothold planning, making the systems less reliable and more complex.
- Agility: Approaches for agile locomotion often use model predictive control (MPC) with hand-designed models, which struggle to balance model accuracy with computational complexity.
- Combination Challenge: Combining these two capabilities, robustness and agility, into a single controller is a significant hurdle.
- Naturalness of Motion: Even with Reinforcement Learning (RL) methods, many legged robot movements in experiments appear unnatural and jerky, indicating a lack of refined gait control.

The paper's entry point and innovative idea revolve around using proprioceptive sensors (such as joint encoders and IMU), which are more robust and fundamental, to achieve both robust and agile locomotion. It leverages Reinforcement Learning (RL) to learn a blind locomotion controller (one that does not use external perception) and addresses the unnatural motion problem by incorporating Adversarial Motion Priors (AMP) into the policy training.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of legged robotics and Reinforcement Learning:
-
First Blind Locomotion System for Robust and Agile Skills: The paper presents what it claims to be the first blind locomotion system capable of traversing challenging terrains robustly while moving rapidly over natural terrains. This addresses the long-standing challenge of combining these two capabilities without relying on exteroceptive sensors.
- Incorporation of Adversarial Motion Priors (AMP) for Natural Gaits: The approach successfully integrates Adversarial Motion Priors (AMP) into the locomotion policy training. This AMP technique introduces "soft gait priors" that replace complex auxiliary rewards with a style reward, enabling the robot to learn natural and smooth motion styles.
- Zero-Shot Generalization: The work demonstrates zero-shot generalization from a motion dataset collected on flat terrains to challenging, real-world environments. This means the robot can adapt to unforeseen obstacles and disturbances without prior explicit training on those specific difficult terrains.
- Teacher-Student Training Framework: The authors propose a teacher-student training framework that concurrently learns robust locomotion over challenging terrains and agile locomotion over even terrains, relying solely on proprioception. This framework effectively distills knowledge from a teacher (trained with privileged information in simulation) to a student (which operates blindly).
- Demonstration on Real Hardware (Go1 Robot): The method is validated on a real quadruped robot, the Unitree Go1, using only proprioceptive sensors (IMU and joint encoders). The robot achieves robust traversal of obstacles up to 25 cm high (e.g., curbs, stairs, rocky terrain, vegetation) and agile movements outdoors (sprinting at roughly 3.5 m/s and spinning at up to 5.8 rad/s), exhibiting natural and smooth gaits across all conditions.

The key findings are that a single RL policy, trained with AMP and a teacher-student framework, can indeed achieve both robust and agile locomotion capabilities on a real robot using only proprioceptive feedback, showcasing impressive zero-shot generalization and natural motion. These findings remove the need for separate controllers or complex exteroceptive sensing for robust and agile behaviors, offering a more streamlined and reliable approach.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner needs to grasp several fundamental concepts related to robotics, control, and machine learning:
-
Legged Robots: These are robots that use articulated limbs (legs) for locomotion, mimicking biological organisms. They are designed to navigate complex and uneven terrains where wheeled or tracked robots might struggle, offering high mobility and adaptability. Examples include quadrupedal (four-legged) robots like the Unitree Go1 used in this paper.
-
Locomotion: Refers to the act or power of moving from place to place. In robotics, it involves the algorithms and control strategies that enable a robot to move its body and limbs to achieve desired velocities and navigate an environment.
-
Proprioceptive Sensors: These are internal sensors that provide information about the robot's own state, such as joint angles, joint velocities, body orientation, and angular velocity. They report on the robot's internal configuration and movement.
- IMU (Inertial Measurement Unit): A device that measures and reports a body's velocity, orientation, and gravitational forces. It typically combines accelerometers, gyroscopes, and sometimes magnetometers.
- Joint Encoders: Sensors attached to robot joints that measure the angular position or velocity of each joint.
- Why they are important: Proprioceptive sensors are generally more robust to environmental conditions (e.g., lighting, weather) than exteroceptive sensors and are fundamental for low-level control and stability.
-
Exteroceptive Sensors: These are external sensors that gather information about the robot's environment.
- Cameras: Provide visual data (images/videos) of the surroundings.
- LIDARs (Light Detection and Ranging): Use pulsed laser light to measure distances to the surrounding environment, creating detailed 3D maps.
- Why they are challenging: While providing rich environmental context, exteroceptive sensors can be computationally intensive, susceptible to environmental conditions (e.g., dust, fog, bright sunlight), and require complex processing for tasks like foothold planning.
-
Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent performs actions, receives rewards (or penalties) based on those actions, and observes new states. The goal is to learn a policy, a mapping from states to actions, that maximizes the cumulative reward over time (a tiny numeric sketch of the discounted return follows this concept list).
  - Agent: The learner or decision-maker (e.g., the robot's controller).
  - Environment: The world with which the agent interacts (e.g., the simulated or real terrain).
  - State ($s_t$ or $x_t$): A complete description of the environment at a given time.
  - Action ($a_t$): A decision or output from the agent that affects the environment.
  - Reward ($r_t$): A scalar feedback signal from the environment indicating the desirability of an action taken from a state.
  - Policy ($\pi_\theta$): A function parameterized by $\theta$ that dictates the agent's behavior, mapping observed states to actions.
  - Discount Factor ($\gamma$): A value between 0 and 1 that determines the present value of future rewards. A higher $\gamma$ makes the agent care more about long-term rewards.
-
Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It assumes that the future state depends only on the current state and action, not on the sequence of events that preceded it (the Markov property).
  - Partially Observable Markov Decision Process (POMDP): An MDP variant where the agent does not have direct access to the true underlying state of the environment. Instead, it receives observations that are related to the state but may be noisy or incomplete. This is typical for blind locomotion, where external terrain features are unknown.
-
Sim-to-real Transfer: The process of training a robot control policy in a simulation environment and then deploying it on a physical robot. This is crucial because training in the real world is often expensive, time-consuming, and potentially damaging to the robot. However, sim-to-real gaps (differences between simulation and reality) can make direct transfer challenging. Techniques like dynamics randomization are used to bridge these gaps.
- Model Predictive Control (MPC): An advanced control method that uses a model of the system to predict its future behavior over a finite time horizon. At each time step, an optimization problem is solved to determine the optimal control actions that minimize a cost function while satisfying constraints. Only the first part of the optimal control sequence is implemented, and the process is repeated at the next time step with updated measurements.
Generative Adversarial Networks (GAN): A class of machine learning frameworks where two neural networks, a generator and a discriminator, compete against each other.
  - Generator: Learns to create new data instances that resemble the training data (e.g., realistic images or robot motions).
  - Discriminator: Learns to distinguish between real data (from the training dataset) and fake data (produced by the generator).
  - The two networks are trained simultaneously in a zero-sum game: the generator tries to fool the discriminator, and the discriminator tries not to be fooled. This adversarial process drives both networks to improve.
-
Trajectory Optimization (TO): A method used to find a sequence of control inputs (actions) and corresponding states (trajectory) that optimize a certain objective function (e.g., minimize energy, maximize speed) while satisfying system dynamics and constraints. It's often used to generate physically plausible and efficient motions for robots.
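To make the role of the discount factor concrete, here is a tiny numeric sketch (not taken from the paper; the reward values and discount factors are arbitrary illustrations):

```python
# Minimal illustration of the discounted return used as the RL objective.
# The reward sequence and discount factors below are arbitrary examples.

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]           # five steps of reward 1
print(discounted_return(rewards, gamma=0.5))   # 1.9375: future rewards discounted heavily
print(discounted_return(rewards, gamma=0.99))  # ~4.90: the agent values long-term reward
```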
3.2. Previous Works
The paper discusses various prior studies, broadly categorized into those focusing on robust locomotion, agile locomotion, and motion imitation using RL.
- RL for Robust Locomotion (using privileged learning or exteroceptive sensors):
  - [1] Hwangbo et al. (2019): Used an actuator network to model complex actuator dynamics, aiding sim-to-real transfer for the ANYmal robot. This improved agility and dynamics.
  - [2] Lee et al. (2020): Extended the robustness of locomotion by training robots on diverse terrains using the privileged learning paradigm. Privileged learning involves training a "teacher" policy with access to additional information (privileged states) available only in simulation, then distilling this knowledge into a "student" policy that only uses deployable observations.
  - [3] Kumar et al. (2021), RMA (Rapid Motor Adaptation): A similar teacher-student framework that further improved robustness. The student policy included a 1-D CNN adaptation module to infer privileged information from observations. This is a direct baseline for comparison in the current paper.
  - [4] Miki et al. (2022) & [5] Agarwal et al. (2022): Integrated proprioceptive and exteroceptive states (e.g., cameras, LIDARs) to improve locomotion efficiency in the wild. These works demonstrated robust locomotion but typically at low or moderate speeds, and they relied on external sensing. The current paper differentiates itself by achieving robustness blindly.
  - [6] Yu et al. (2022): Focused on visual locomotion for complex terrains, again highlighting the reliance on exteroceptive sensing.
- RL for Agile Locomotion (high-speed, but potentially unnatural):
  - [7] Margolis et al. (2022): Used RL to train a Mini Cheetah robot for high-speed movements over natural terrains. They achieved remarkable results in sprinting and spinning but noted that the locomotion gaits could be unnatural. The current paper adopts its grid adaptive curriculum for high-speed velocity commands.
  - [8] Ji et al. (2022): Proposed concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion. This method also focused on agility but did not explicitly address naturalness of gait on challenging terrains; it serves as the Concurrent baseline in this paper.
- RL for Motion Imitation (to achieve natural gaits):
  - [15] Peng et al. (2018), DeepMimic: Used example-guided deep reinforcement learning for physics-based character skills. These methods often utilize motion tracking techniques, where an agent mimics a desired motion by explicitly tracking target poses.
  - [16] Peng et al. (2020): Further explored motion tracking in RL for legged robots.
  - [17] Florensa et al. (2017), [18] Peng et al. (2019), [19] Yang et al. (2020): Used hierarchical policies and latent space models to learn reusable skills from large motion datasets, often dividing the problem into pre-training and transfer stages. Many of these were verified in simulation on moderately challenging terrains.
  - [20] Bogdanovic et al. (2022) & [21] Fuchioka et al. (2022): Leveraged trajectory optimization (TO) to generate stable demonstrations for model-free RL for robust locomotion. These methods still rely on a pre-training stage to track reference motions, which can be decoupled from high-level tasks.
- Generative Adversarial Imitation Learning (GAIL) and Adversarial Motion Priors (AMP):
  - [22] Ho and Ermon (2016), GAIL: A general approach for imitating behaviors from unstructured demonstration datasets without motion tracking. It integrates GANs into RL, with a discriminator measuring similarity between policy behavior and demonstrations.
  - [23] Goodfellow et al. (2020): The foundational work on Generative Adversarial Networks (GANs).
  - [24] Peng et al. (2021), AMP (Adversarial Motion Priors): Proposed AMP, using the GAIL framework to predict whether a state transition is real (from a dataset) or fake (from the agent). This enabled simulated agents to perform high-level tasks with motion styles drawn from unstructured datasets. It is the direct inspiration for the current paper's core methodology.
  - [25] Peng et al. (2022): Extended AMP to a latent space (ASE) for reusable adversarial skill embeddings.
  - [26] Escontrela et al. (2022): Applied AMP to learn quadrupedal locomotion on flats without complex reward functions.
  - [27] Vollenweider et al. (2022): Extended AMP to multiple-style control for wheeled-legged locomotion.
  - [28] Li et al. (2022): Showed AMP can enable quadruped robots to learn agile skills using a Wasserstein GAN formulation and rough partial demonstrations.
  - Key limitation of these prior AMP works: Most motion priors were collected on even terrains, leaving the effectiveness of AMP for robust locomotion over complex terrains unknown. This is precisely what the current paper investigates.
3.3. Technological Evolution
The field of legged robot locomotion control has evolved from traditional optimization-based approaches (MPC, trajectory optimization) that rely on hand-designed models and explicit environment sensing, to Reinforcement Learning (RL). Early RL efforts focused on either robustness (often with privileged learning or exteroceptive sensors) or agility (high-speed movements). However, these often resulted in unnatural or jerky gaits and struggled to combine both robust and agile capabilities in a single controller, especially without external sensing.
To address the unnatural gait problem, researchers turned to imitation learning, using motion tracking techniques or hierarchical policies to learn from demonstrations. Generative Adversarial Imitation Learning (GAIL) and more recently Adversarial Motion Priors (AMP) emerged as powerful tools to learn natural motion styles from unstructured datasets, essentially replacing complex reward engineering with a style reward derived from a discriminator.
This paper marks a significant step in this evolution by taking AMP, which was primarily shown effective for stylized motion on flat terrains in simulation, and applying it to achieve both robust and agile blind locomotion on a real robot over challenging, unseen terrains. It bridges the gap between stylized natural motion and real-world robustness and agility, showcasing zero-shot generalization with only proprioceptive sensors.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach presents several core differences and innovations:
-
Combines Robustness and Agility Blindly: Prior work typically achieved either robust locomotion (often relying on exteroceptive sensors [4], [5], [6] or privileged information in simulation [2], [3]) or agile locomotion (often with unnatural gaits [7], [8]). This paper is the first to achieve both robust and agile locomotion with a single policy on a real robot, using only proprioceptive sensors. This blind locomotion capability is a major differentiator.
- Zero-Shot Generalization to Challenging Terrains with AMP: While AMP has been used for stylized motion [24], [26], [27], [28], its application to robust locomotion over complex, unseen terrains was largely unexplored, especially when the motion dataset only contains flat-terrain motions. This paper demonstrates zero-shot generalization from flat-terrain AMP data to real-world challenging terrains (stairs, curbs, vegetation), which is a novel finding for AMP's capabilities.
- Concise Reward Functions via AMP: Instead of complex, hand-engineered auxiliary rewards to induce natural gaits or robustness (which can be tedious to tune), this method leverages the AMP style reward to implicitly encourage natural motion, significantly simplifying the reward function design. This contrasts with explicit motion tracking methods [15], [16], [20], [21] that require pre-training to track reference motions.
- Effective Teacher-Student Framework for Blind Locomotion on Challenging Terrains: The paper utilizes a teacher-student training framework similar to RMA [3], but with a crucial difference in how the student's low-level network is handled. Unlike RMA, which freezes the student's low-level network, this paper reuses the teacher's weights and continues training them for the student, which is shown to be more effective for traversing increasingly difficult terrains (demonstrated in the comparison on 25 cm steps). This allows the student to better adapt to complex situations by continuously refining its control layers.
- Proprioception-Only Sensing: Unlike works that combine proprioceptive and exteroceptive sensors [4], [5], [6], this paper strictly adheres to proprioception-only sensing, making the system inherently more robust to environmental conditions and less computationally demanding for perception. The memory encoder (LSTM) in the student policy is critical for inferring unobservable states from historical proprioceptive observations.
4. Methodology
4.1. Principles
The core idea of the method is to enable a quadruped robot to achieve both robust and agile locomotion using only proprioceptive sensors by integrating Adversarial Motion Priors (AMP) within a teacher-student Reinforcement Learning (RL) framework. The theoretical basis is that by learning from a dataset of natural-looking motions (even if collected on flat terrains) via an adversarial process, the robot can acquire a "soft gait prior" that promotes natural movement. Simultaneously, the teacher-student framework (inspired by privileged learning) helps the blind student policy implicitly infer unobservable environmental properties (like terrain roughness) from a history of proprioceptive observations, allowing it to adapt to challenging terrains and achieve sim-to-real transfer. The reward function is kept concise by using the AMP-derived style reward in place of complex, hand-engineered auxiliary rewards.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology can be broken down into three main parts: Reinforcement Learning Problem Formulation, Reward Terms Design (including AMP), and Motion Dataset Generation, followed by the Training details for both Teacher and Student policies.
4.2.1. Reinforcement Learning Problem Formulation
The control problem is formulated as a discrete-time dynamic system.
The goal of RL is to find optimal parameters for a policy to maximize the expected discounted return over future trajectories.
The expected discounted return is defined as:

$ J(\theta) = \mathbb{E}_{\tau \sim p(\tau \mid \pi_\theta)} \left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right] $

Here,
- $J(\theta)$: The objective function representing the expected total discounted reward, which the RL agent aims to maximize.
- $\theta$: The parameters of the policy.
- $\mathbb{E}_{\tau \sim p(\tau \mid \pi_\theta)}$: The expectation taken over trajectories $\tau$ generated by following the policy.
- $\gamma$: The discount factor (a value between 0 and 1, exclusive), which weighs immediate rewards more heavily than future rewards.
- $r_t$: The reward received at time step $t$.

The paper focuses on blind locomotion, which is a Partially Observable Markov Decision Process (POMDP) because exteroceptive sensors are not used, meaning terrains are not fully observable. To handle this, privileged learning [14] is employed:
- A teacher policy is trained in simulation with access to privileged states (information only available in simulation, such as accurate linear velocity or terrain properties).
- A student policy then learns to imitate the teacher's actions and to infer the privileged states from a history of its proprioceptive observations via supervised learning.
State Space
- Teacher Policy State ($x_t$): Consists of three components:
  - Proprioceptive Observation ($o^p_t$): This is the only part accessible to the student (see the sketch after this list). It includes:
    - Orientation of the gravity vector (in the robot's base frame).
    - Base angular velocity (in the robot's base frame).
    - Joint positions.
    - Joint velocities.
    - Previous action ($a_{t-1}$) selected by the current policy.
    - Desired base velocity command vector.
  - Privileged State: Information available only to the teacher in simulation. It includes:
    - Base linear velocity (computed from state estimation, typically noisy in reality).
    - Friction coefficients.
    - Restitution coefficients (for ground contact).
    - Contact forces.
    - External forces and their positions on the robot.
    - Collision states (of the trunk, thigh, and shank).
  - Terrain Information ($i^e_t$): 187 measurements of terrain points sampled from a grid around the robot base, each representing the vertical distance from the sampled point to the robot base.
- Student Policy State: Only the proprioceptive observation ($o^p_t$) is accessible.
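To make the teacher/student observation split concrete, the sketch below assembles a proprioceptive observation vector from the quantities listed above. The function name, the use of NumPy, and the exact ordering and dimensionality (45 here) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def build_proprioceptive_obs(gravity_b, base_ang_vel_b, joint_pos, joint_vel,
                             prev_action, command):
    """Concatenate the proprioceptive quantities listed above into one vector.

    gravity_b      : (3,)  gravity direction expressed in the base frame
    base_ang_vel_b : (3,)  base angular velocity in the base frame
    joint_pos      : (12,) joint positions (3 joints x 4 legs)
    joint_vel      : (12,) joint velocities
    prev_action    : (12,) previous policy action
    command        : (3,)  desired longitudinal, lateral, and yaw velocity
    """
    return np.concatenate([gravity_b, base_ang_vel_b, joint_pos,
                           joint_vel, prev_action, command])

obs = build_proprioceptive_obs(np.array([0.0, 0.0, -1.0]), np.zeros(3),
                               np.zeros(12), np.zeros(12), np.zeros(12),
                               np.array([1.0, 0.0, 0.0]))
print(obs.shape)  # (45,) under this illustrative layout
```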
Action Space
- The policy action $a_t$ is a 12-dimensional vector.
- It is interpreted as a target joint position offset. This offset is added to a time-invariant nominal joint position to specify the target motor position for each of the robot's 12 joints (3 joints per leg x 4 legs). Joint PD controllers (proportional-derivative controllers) then compute the torque commands from these target motor positions with fixed gains $K_p$ and $K_d$ (nominally 20 and 0.5; see Table II). A minimal sketch of this computation follows.
  - PD Controller: A feedback control loop mechanism widely used in industrial control systems. It calculates an "error" value as the difference between a measured process variable and a desired setpoint, and attempts to minimize the error by adjusting the control inputs. In this case, it adjusts motor torques based on the difference between the target joint position and the current joint position (P-term) and on the joint velocity (D-term).
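A minimal sketch of the joint PD torque computation described above, assuming a zero target joint velocity and the nominal gains from Table II (the function and variable names are illustrative):

```python
import numpy as np

KP, KD = 20.0, 0.5          # nominal joint PD gains (randomized during training, Table II)

def pd_torques(target_q, q, dq, kp=KP, kd=KD):
    """Torque command from target joint positions, with desired joint velocity = 0."""
    return kp * (target_q - q) - kd * dq

nominal_q = np.zeros(12)             # time-invariant nominal joint positions (placeholder)
action = 0.1 * np.ones(12)           # 12-D policy output: target joint position offsets
target_q = nominal_q + action        # target motor positions
tau = pd_torques(target_q, q=np.zeros(12), dq=np.zeros(12))
```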
4.2.2. Reward Terms Design
The paper aims for concise reward functions to induce robust and agile legged locomotion with a natural gait. The total reward is composed of three terms:

$ r_t = r_t^{g} + r_t^{s} + r_t^{r} $

Here,
- $r_t$: The total reward at time step $t$.
- $r_t^{g}$: The task reward component, encouraging velocity tracking.
- $r_t^{s}$: The style reward component, based on AMP, encouraging a natural gait.
- $r_t^{r}$: The regularization reward component, ensuring smoothness and safety.
Task Reward ($r_t^{g}$)
This term focuses on tracking desired linear and angular velocities. Its specific equation and weight are provided in Table I.
Style Reward ($r_t^{s}$)
The style reward is crucial for promoting a natural and smooth trot gait. It evaluates the similarity between the demonstrator's behavior (state transitions from a reference dataset $\mathcal{D}$) and the agent's behavior (state transitions produced by the policy $\pi_\theta$). The more similar they are, the higher the style reward. It is based on the Adversarial Motion Priors (AMP) framework.
The system uses a discriminator $D_\varphi$, a neural network with parameters $\varphi$, to predict whether a state transition $(s_t, s_{t+1})$ is a real sample from the reference dataset $\mathcal{D}$ or a fake sample produced by the agent $\pi_\theta$.
The AMP state includes:
- Joint positions
- Joint velocities
- Base linear velocity
- Base angular velocity
- Base height with respect to the terrains.
To mitigate mode collapse (a common GAN issue where the generator produces a limited variety of samples), the reference dataset contains only trot gait motion clips.
The training objective for the discriminator is defined as a least squares GAN formulation with a gradient penalty:

$ \arg\min_{\varphi} \; \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( D_\varphi(s_t, s_{t+1}) - 1 \right)^2 \right] + \mathbb{E}_{(s_t, s_{t+1}) \sim \pi_\theta} \left[ \left( D_\varphi(s_t, s_{t+1}) + 1 \right)^2 \right] + \frac{w^{gp}}{2} \, \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}} \left[ \left\| \nabla D_\varphi(s_t, s_{t+1}) \right\|^2 \right] $

Here,
- $\arg\min_{\varphi}$: The discriminator aims to find the parameters $\varphi$ that minimize this objective function.
- $\mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}$: Expectation over state transitions sampled from the real motion dataset $\mathcal{D}$.
- $\mathbb{E}_{(s_t, s_{t+1}) \sim \pi_\theta}$: Expectation over state transitions generated by the agent's policy $\pi_\theta$.
- $D_\varphi(s_t, s_{t+1})$: The output of the discriminator, which tries to output 1 for real samples and -1 for fake samples (as per the least squares GAN formulation).
- $\left( D_\varphi(s_t, s_{t+1}) - 1 \right)^2$: Encourages the discriminator to output 1 for real samples.
- $\left( D_\varphi(s_t, s_{t+1}) + 1 \right)^2$: Encourages the discriminator to output -1 for fake samples.
- $w^{gp}$: A manually specified coefficient for the gradient penalty term.
- $\left\| \nabla D_\varphi(s_t, s_{t+1}) \right\|^2$: The gradient penalty term, which encourages the discriminator's gradients on real data samples to be close to zero. This helps stabilize GAN training and prevent mode collapse.

The style reward for the agent is then defined based on the discriminator's score:

$ r_t^{s} = \max \left[ 0, \; 1 - 0.25 \left( D_\varphi(s_t, s_{t+1}) - 1 \right)^2 \right] $

Here,
- $r_t^{s}$: The style reward received by the agent for a state transition generated by its policy.
- $D_\varphi(s_t, s_{t+1})$: The score given by the discriminator for the agent's state transition.
- The $\max[0, \cdot]$ term ensures the reward is non-negative.
- The $1 - 0.25(\cdot - 1)^2$ term scales the style reward to the range $[0, 1]$: if $D_\varphi(s_t, s_{t+1})$ is close to 1 (meaning the discriminator believes the agent's motion is very similar to the real data), the reward will be close to 1.

A code-level sketch of this discriminator objective and style reward is given below.
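The following PyTorch-style sketch shows one way the least-squares discriminator objective with gradient penalty and the style reward could be implemented. The hidden sizes follow Table III, but the gradient-penalty coefficient, batch handling, and all names are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class AMPDiscriminator(nn.Module):
    def __init__(self, amp_state_dim, hidden=(1024, 512)):
        super().__init__()
        layers, last = [], 2 * amp_state_dim           # input: (s_t, s_{t+1}) pair
        for h in hidden:
            layers += [nn.Linear(last, h), nn.ELU()]
            last = h
        layers += [nn.Linear(last, 1)]                  # linear output: d_score
        self.net = nn.Sequential(*layers)

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)

def discriminator_loss(disc, real_s, real_s_next, fake_s, fake_s_next, w_gp=10.0):
    # Least-squares targets: +1 for dataset transitions, -1 for policy transitions.
    loss_real = ((disc(real_s, real_s_next) - 1.0) ** 2).mean()
    loss_fake = ((disc(fake_s, fake_s_next) + 1.0) ** 2).mean()

    # Gradient penalty on real samples: push the input gradients toward zero.
    real = torch.cat([real_s, real_s_next], dim=-1).requires_grad_(True)
    scores = disc.net(real).squeeze(-1)
    grad = torch.autograd.grad(scores.sum(), real, create_graph=True)[0]
    gp = (grad.norm(2, dim=-1) ** 2).mean()

    return loss_real + loss_fake + 0.5 * w_gp * gp

def style_reward(disc, s, s_next):
    # r^s = max[0, 1 - 0.25 * (d_score - 1)^2]
    with torch.no_grad():
        d = disc(s, s_next)
    return torch.clamp(1.0 - 0.25 * (d - 1.0) ** 2, min=0.0)
```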
Regularization Reward ($r_t^{r}$)
This term is added to aid sim-to-real transfer over challenging terrains, imposing constraints on motion for smoothness and safety.
- Smoothness: Induced by joint-level penalties (penalizing large joint torques and changes in joint positions) and a stride duration bonus (encouraging consistent foot-ground contact timing).
- Safety: Ensured by penalties for collisions (e.g., of the robot trunk with the ground, or self-collisions).

The detailed reward terms and their weights are summarized in Table I.
The following are the results from Table I of the original paper:
| Term | Equation | Weight |
|---|---|---|
| Task ($r^g$): linear velocity tracking | $\exp\left(-\lVert v^{des}_{xy} - v_{t,xy} \rVert^2 / 0.15\right)$ | 1.0 |
| Task ($r^g$): angular velocity tracking | $\exp\left(-\lVert \omega^{des}_{z} - \omega_{t,z} \rVert^2 / 0.15\right)$ | 0.5 |
| Smoothness ($r^r$): joint torques | $-\lVert \tau_t \rVert^2$ | $1 \times 10^{-4}$ |
| Smoothness ($r^r$): joint accelerations | $-\lVert \ddot{q}_t \rVert^2$ | $2.5 \times 10^{-7}$ |
| Smoothness ($r^r$): joint position changes | $-\lVert q_{t-1} - q_t \rVert^2$ | 0.1 |
| Smoothness ($r^r$): stride duration | $\sum_{i} \min\left(t_{air,i} - 0.5, \, 0\right)$ | 1.0 |
| Safety ($r^r$): collisions | $-n_{collision}$ | 0.1 |
| Style ($r^s$) | $\max\left[0, \, 1 - 0.25 \left(d^{score} - 1\right)^2\right]$ | 0.5 |
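As a small worked illustration of the velocity-tracking task terms in Table I (a sketch assuming NumPy arrays for the commanded and measured base velocities; not the paper's implementation):

```python
import numpy as np

def task_reward(v_des_xy, v_xy, yaw_rate_des, yaw_rate, sigma=0.15):
    """Weighted sum of the two velocity-tracking terms from Table I (weights 1.0 and 0.5)."""
    lin = np.exp(-np.sum((v_des_xy - v_xy) ** 2) / sigma)
    ang = np.exp(-(yaw_rate_des - yaw_rate) ** 2 / sigma)
    return 1.0 * lin + 0.5 * ang

print(task_reward(np.array([1.0, 0.0]), np.array([0.9, 0.05]),
                  yaw_rate_des=0.0, yaw_rate=0.1))  # ~1.39 for close tracking (max 1.5)
```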
4.2.3. Motion Dataset Generation
The motion dataset for AMP is generated using trajectory optimization (TO) techniques.
- Method: The paper adopts a single TO formulation from previous work [11] to generate quadrupedal locomotion with a trotting gait on flat terrain.
- Robot Model: The robot is first represented using a simplified centroidal dynamics model to reduce computational complexity.
- Optimization Problem: This model is then transcribed into a nonlinear programming problem, which explicitly enforces friction cone constraints (ensuring the feet do not slip) and kinematic constraints (ensuring joints operate within their physical limits).
- Solver: The TO problem is solved using TOWR [29], an open-source trajectory optimizer that does not require a hand-tuned cost function, simplifying the optimization process.
- Dataset Content: The dataset includes various locomotion trajectories: forward, backward, lateral left, lateral right, left steering, right steering, and combined movements. The total duration of these motions is 30 seconds.
- Benefit: Generating motion datasets using TO ensures that they fully match the state space of the simulated agent and the demonstrator, eliminating the need for complex motion retargeting techniques like those in [15].
4.2.4. Training
Simulation Environment and General Settings
- Simulator: IsaacGym [9], a high-performance GPU-accelerated physics simulator, is used for training.
- Parallelism: 4096 parallel agents are trained concurrently on different terrain types.
- Training Time:
  - Teacher policy: 400 million simulated time steps.
  - Student policy: 200 million simulated time steps.
  - Total wall-clock time: seven hours for both stages.
- Episode Duration: Each RL episode lasts a maximum of 1000 steps (equivalent to 20 seconds) and terminates early if specific termination criteria are met.
- Control Frequency: 50 Hz in simulation.
- Hardware: All training is performed on a single NVIDIA RTX 3090Ti GPU.
Termination Criteria
An episode terminates, and a new one starts, if any of the following conditions are met:
- Trunk collisions with the ground.
- Large body inclinations (the robot tipping over too much).
- The robot being trapped for a long period of time (not making progress).
Dynamics Randomization
To enhance the robustness of the policy and facilitate sim-to-real transfer, various dynamic parameters are randomized in each episode during training. Some of these randomized parameters are included in the privileged state to aid the teacher policy training. Observation noise (as described in [9]) is also added.
The randomized parameters include:
-
Mass of the trunk and legs.
-
Mass and position of payload applied to the robot's body.
-
Ground friction and restitution coefficients.
-
Motor strength.
-
Joint-level
PD gains. -
Initial joint positions.
The
randomization rangesfor these parameters are detailed in Table II.
The following are the results from Table II of the original paper:
| Parameters | Range[Min, Max] | Unit |
|---|---|---|
| Link Mass | [0.8, 1.2]×nominal value | Kg |
| Payload Mass | [0, 3] | Kg |
| Payload Position | [-0.1, 0.1] relative to base origin | m |
| Ground Friction | [0.05, 2.75] | - |
| Ground Restitution | [0.0, 1.0] | - |
| Motor Strength | [0.8, 1.2]× motor torque | Nm |
| Joint Kp | [0.8, 1.2]×20 | - |
| Joint Kd | [0.8, 1.2]×0.5 | - |
| Initial Joint Positions | [0.5, 1.5]×nominal value | rad |
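A minimal sketch of how per-episode dynamics parameters could be drawn from the Table II ranges (illustrative only; sampling the payload position independently per axis is an assumption, and the simulator-side application of these parameters is omitted):

```python
import numpy as np

NOMINAL_KP, NOMINAL_KD = 20.0, 0.5

def sample_episode_randomization(rng, nominal_link_mass):
    """Draw one set of randomized dynamics parameters per episode, using Table II ranges."""
    return {
        "link_mass":        nominal_link_mass * rng.uniform(0.8, 1.2),
        "payload_mass":     rng.uniform(0.0, 3.0),                    # kg
        "payload_position": rng.uniform(-0.1, 0.1, size=3),           # m, relative to base origin
        "friction":         rng.uniform(0.05, 2.75),
        "restitution":      rng.uniform(0.0, 1.0),
        "motor_strength":   rng.uniform(0.8, 1.2),                    # x nominal motor torque
        "kp":               NOMINAL_KP * rng.uniform(0.8, 1.2),
        "kd":               NOMINAL_KD * rng.uniform(0.8, 1.2),
        "init_joint_scale": rng.uniform(0.5, 1.5),                    # x nominal joint positions
    }

params = sample_episode_randomization(np.random.default_rng(0), nominal_link_mass=1.0)
```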
Training Curriculum
Given the complexity of blind locomotion over challenging terrains, an automated terrain curriculum is adopted and improved from [9].
- Terrain Types: Five types of procedurally generated terrains are created: rough flats, slopes, waves, stairs, and discrete steps.
- Curriculum Structure: A height-field map of 100 terrains arranged in a 20x10 grid is used. Each row contains the same type of terrain with increasing difficulty, and each terrain section is 8 m long and wide.
  - Rough flats: noise amplitude increases with difficulty.
  - Slopes: inclination increases with difficulty.
  - Waves: amplitude increases with difficulty.
  - Stairs: fixed step width, with step height increasing with difficulty.
  - Discrete obstacles: two height levels, increasing with difficulty.
- Progression Logic (a minimal sketch of this rule is given after Fig. 2 below):
  - Initially, robots are assigned to all terrain types at the lowest difficulty.
  - A robot moves to a more difficult terrain once it adapts to its current difficulty, defined as achieving more than 85% of the average linear velocity tracking reward (meaning it successfully follows velocity commands).
  - If a robot fails to travel at least half the distance required by its commanded linear velocity by the end of an episode, it is reset to an easier terrain.
  - To prevent skill forgetting, robots that solve the hardest terrains are looped back to a randomly selected difficulty level of the current terrain type.
- Command-Conditioned Policy: Robots learn to track desired velocity commands, representing longitudinal velocity, lateral velocity, and yaw velocity in the base frame. During the terrain curriculum phase, longitudinal and lateral velocity commands are sampled from a small range, and the yaw velocity command is computed from heading errors to guide the robot out of the terrains.
- Concurrent High-Speed Training: After an agent on rough flat terrains successfully steps out of the roughest flats, its velocity command sampling switches to a grid adaptive curriculum strategy [7]. This enables learning high-speed sprinting and spinning. This high-speed training terminates the terrain curriculum only for the agents engaged in it, not for others.

The following figure (Figure 2 from the original paper) illustrates the teacher-student training framework with adversarial motion priors.
Fig. 2. The teacher-student training framework with adversarial motion priors. The teacher policy is trained with RL using the proprioceptive observation, privileged state, and terrain information; its total reward $r_t$ consists of a task reward, a style reward based on AMP (computed by the GAN discriminator from transitions in the trajectory-optimization dataset), and a regularization reward. The student policy receives only the proprioceptive observation and is trained to imitate the teacher's actions while reconstructing the teacher's latent representation via a reconstruction loss.
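To make the Progression Logic above concrete, here is a minimal sketch of the per-episode promotion/demotion rule (the thresholds follow the text; the function signature, field names, and number of difficulty levels are illustrative assumptions):

```python
import numpy as np

def update_terrain_level(level, max_level, tracking_reward_frac,
                         distance_traveled, commanded_distance, rng):
    """Promote, demote, or loop a robot's terrain difficulty after an episode.

    tracking_reward_frac : fraction of the linear-velocity tracking reward achieved
    commanded_distance   : distance implied by the commanded velocity over the episode
    """
    if distance_traveled < 0.5 * commanded_distance:
        return max(level - 1, 0)                        # demote to an easier terrain
    if tracking_reward_frac > 0.85:
        if level + 1 > max_level:
            return int(rng.integers(0, max_level + 1))  # loop back to prevent skill forgetting
        return level + 1                                # promote to a harder terrain
    return level

new_level = update_terrain_level(level=3, max_level=9, tracking_reward_frac=0.9,
                                 distance_traveled=7.0, commanded_distance=8.0,
                                 rng=np.random.default_rng(0))
```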
4.2.5. Teacher Policy Training and Architecture
- Training Algorithm: The teacher policy is trained using Proximal Policy Optimization (PPO) [31].
  - PPO: An on-policy RL algorithm that aims to strike a balance between ease of implementation, sample efficiency, and good performance. It iteratively optimizes a clipped surrogate objective function to update the policy while keeping the new policy close to the old one.
- Synchronization: The training of the teacher policy and the discriminator $D_\varphi$ is synchronized.
  - The teacher performs a rollout in the environment, generating state transitions.
  - Each transition is fed to the discriminator to obtain a score $d^{score}$.
  - The score is used to compute the trot gait style reward $r_t^{s}$ (as defined above), which is then added to the other rewards obtained by the teacher.
  - After collecting rollout data, the teacher policy parameters and discriminator parameters are optimized in every training step, using PPO for the policy and the least-squares objective above for the discriminator.
- Teacher Policy Architecture (as shown in Figure 2 and detailed in Table III):
  - Consists of Multi-Layer Perceptron (MLP) components.
  - Terrain Encoder ($E_{\theta_e}$): Compresses the terrain information into a low-dimensional latent representation.
  - Privileged Encoder ($E_{\theta_p}$): Compresses the privileged state into a low-dimensional latent representation $l^p$.
  - Full Latent Representation: $l_t$ is formed by concatenating the two encoder outputs. This compression preserves the essential information that the student later has to reconstruct.
  - Low-Level Network: Takes the concatenated latent $l_t$ and the proprioceptive observation $o^p_t$ as input. It has a tanh output layer that outputs the mean of a Gaussian distribution; the variance is determined by PPO.
  - Critic Network: An MLP with three hidden layers, providing target values for the Generalized Advantage Estimator (GAE) used in PPO.
- Discriminator ($D_\varphi$): A single MLP with two hidden layers and a linear output layer.
MLPwith two hidden layers and a linear output layer.The following are the results from Table III of the original paper:
| Module | Inputs | Hidden Layers | Outputs |
|---|---|---|---|
| Low-Level (MLP) | $l_t$, $o^p_t$ | [256, 128, 64] | $a_t$ |
| Critic (MLP) | $x_t$ | [512, 256, 128] | $V_t$ |
| Memory (LSTM) | $o^p_t$, $h_{t-1}$, $c_{t-1}$ | [256, 256, 256] | $h_t$ |
| $E_{\theta_p}$ (MLP) | privileged state | [64, 32] | $l^p$ |
| $E_{\theta_e}$ (MLP) | $i^e_t$ | [256, 128] | $l_t$ (terrain part) |
| $E_{\theta_m}$ (MLP) | $h_t$ | [256, 128] | student latent representation |
| $D$ (MLP) | $s^{AMP}_t$, $s^{AMP}_{t+1}$ | [1024, 512] | $d^{score}$ |
All networks use ELU activations for hidden layers.
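A condensed PyTorch sketch of the teacher's modules with the hidden-layer sizes from Table III. The latent and input dimensions, and the omission of the Gaussian action distribution, critic, and PPO machinery, are simplifying assumptions for illustration:

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, out_act=None):
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), nn.ELU()]
        last = h
    layers += [nn.Linear(last, out_dim)]
    if out_act is not None:
        layers += [out_act]
    return nn.Sequential(*layers)

class TeacherPolicy(nn.Module):
    def __init__(self, prop_dim, priv_dim, terrain_dim,
                 latent_priv=8, latent_terrain=16, action_dim=12):
        super().__init__()
        # Encoders compress privileged state and terrain information into latents.
        self.priv_encoder    = mlp(priv_dim,    [64, 32],   latent_priv)
        self.terrain_encoder = mlp(terrain_dim, [256, 128], latent_terrain)
        # Low-level network maps (latent, proprioception) to the action mean (tanh output).
        self.low_level = mlp(latent_priv + latent_terrain + prop_dim,
                             [256, 128, 64], action_dim, out_act=nn.Tanh())

    def forward(self, prop_obs, priv_state, terrain_info):
        latent = torch.cat([self.priv_encoder(priv_state),
                            self.terrain_encoder(terrain_info)], dim=-1)
        action_mean = self.low_level(torch.cat([latent, prop_obs], dim=-1))
        return action_mean, latent
```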
4.2.6. Student Policy Training and Architecture
- Goal: The student policy is trained to replicate the teacher's actions without access to the privileged state and terrain information. This means the student operates in a POMDP setting and must infer unobservable states from its history of proprioceptive observations.
- Memory Encoder: Since the student needs to infer unobservable states from historical observations, a memory encoder is used to capture sequential correlations.
  - Architecture: A Long Short-Term Memory (LSTM) network [33] is chosen as the RNN architecture. LSTMs are suitable for processing sequential data and retaining past information within hidden states, without requiring large memory buffers to store raw historical observations.
    - The proprioceptive observation, along with the past hidden state and cell state of the LSTM, is fed into the LSTM module.
    - The LSTM module outputs the current hidden state $h_t$.
    - This is then passed to an MLP encoder $E_{\theta_m}$ to produce the student's latent representation.
  - Memory Length: The student policy is trained with a sequence of 50 proprioceptive observations, corresponding to 1 second of memory, to learn robust blind locomotion.
- Training Losses (Supervised Fashion): The student is trained by minimizing two losses:
  - Imitation Loss: Enables the student to mimic the teacher's actions. This is typically a mean squared error (MSE) loss between the student's actions and the teacher's actions.
  - Reconstruction Loss: Encourages the student's memory encoder to reconstruct the teacher's full latent representation. This helps the student learn to infer the context (terrain and privileged information) that the teacher had access to.
- Robustness Enhancement: The dataset aggregation strategy (DAgger) [32] is employed to generate samples by rolling out the student policy. This iterative process increases the robustness of the student policy by exposing it to states it might encounter during its own execution, mitigating distribution shift.
- Curriculum: The same training curriculum (terrain progression and command scheduling) as the teacher's is applied to the student.
- No Discriminator: No discriminator is trained during student training.
- Low-Level Network: The student's low-level MLP network has the same structure as the teacher's low-level net. Critically, its weights are initialized from the teacher's low-level net and continue to be trained (not frozen) during student training, which speeds up convergence and improves adaptation, unlike some prior methods. A minimal sketch of the student modules and losses is given below.
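A minimal PyTorch sketch of the student's memory encoder and the two supervised losses described above (hidden sizes follow Table III; the latent dimension, equal loss weighting, and use of only the last hidden state are illustrative assumptions):

```python
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    def __init__(self, prop_dim, latent_dim, action_dim=12):
        super().__init__()
        self.memory = nn.LSTM(prop_dim, 256, num_layers=3, batch_first=True)
        self.memory_encoder = nn.Sequential(nn.Linear(256, 256), nn.ELU(),
                                            nn.Linear(256, 128), nn.ELU(),
                                            nn.Linear(128, latent_dim))
        # Low-level net: same structure as the teacher's; in the paper it is initialized
        # from the teacher's weights and kept trainable during student training.
        self.low_level = nn.Sequential(nn.Linear(latent_dim + prop_dim, 256), nn.ELU(),
                                       nn.Linear(256, 128), nn.ELU(),
                                       nn.Linear(128, 64), nn.ELU(),
                                       nn.Linear(64, action_dim), nn.Tanh())

    def forward(self, prop_seq):                    # prop_seq: (batch, T=50, prop_dim)
        h_seq, _ = self.memory(prop_seq)
        latent = self.memory_encoder(h_seq[:, -1])  # latent from the last hidden state
        action = self.low_level(torch.cat([latent, prop_seq[:, -1]], dim=-1))
        return action, latent

def student_losses(student, prop_seq, teacher_action, teacher_latent):
    action, latent = student(prop_seq)
    imitation_loss      = ((action - teacher_action) ** 2).mean()   # mimic teacher actions
    reconstruction_loss = ((latent - teacher_latent) ** 2).mean()   # reconstruct teacher latent
    return imitation_loss + reconstruction_loss
```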
5. Experimental Setup
5.1. Datasets
The paper primarily uses one type of dataset for training the Adversarial Motion Priors (AMP):
- AMP Dataset: This dataset consists of quadrupedal locomotion trajectories generated using Trajectory Optimization (TO) techniques [11], specifically for flat terrain with a trotting gait.
  - Source: Generated synthetically using TO with TOWR [29].
  - Scale: Total duration of 30 seconds.
  - Characteristics: Contains various trot gait motion clips including forward, backward, lateral left, lateral right, left steering, right steering, and combined locomotion. Each state transition includes joint positions, joint velocities, base linear velocity, base angular velocity, and base height with respect to the terrain.
  - Domain: Critically, it only contains motions on flat terrains.
  - Why chosen: This dataset is chosen because TO generates stable, dynamically feasible motions that exactly match the simulated agent's state space, avoiding issues with motion capture data or motion retargeting. The trot gait is a common and efficient gait for quadrupedal robots. Its limited domain (flat terrains) allows the paper to test zero-shot generalization to challenging terrains.

For the actual Reinforcement Learning training, the IsaacGym simulator is used to create diverse and challenging environments. These procedurally generated terrains (rough flats, slopes, waves, stairs, discrete steps) are not pre-recorded datasets but are generated dynamically during training based on a curriculum.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the robustness and agility of the proposed locomotion controller.
-
Success Rate (for Obstacle Traversal):
- Conceptual Definition: This metric quantifies the reliability of the robot in successfully navigating specific obstacles, particularly steps. It measures the proportion of attempts where the robot completes a defined traversal task without falling, getting stuck, or incurring significant undesirable events. It focuses on the robot's ability to maintain stability and complete the task under challenging conditions.
- Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100% $
- Symbol Explanation:
Number of Successful Trials: The count of trials where the robot successfully traversed the obstacle (defined as traversing the step with both front and hind legs).Total Number of Trials: The total number of attempts made for a given obstacle type and height (10 trials per step height in this paper).
-
Average Running Speed (m/s):
  - Conceptual Definition: This metric quantifies the robot's agility and efficiency over a given distance. It measures how quickly the robot can cover a certain distance on various terrains, indicating its ability to sustain desired linear velocities.
  - Mathematical Formula: $ \text{Average Running Speed} = \frac{\text{Total Distance}}{\text{Time Elapsed}} $
  - Symbol Explanation:
    - Total Distance: The fixed distance the robot is commanded to sprint (e.g., 10 m in this paper).
    - Time Elapsed: The total time taken by the robot to cover the specified total distance.
-
Maximum Yaw Velocity (rad/s):
  - Conceptual Definition: This metric evaluates the robot's agility in rotational movements (spinning). It quantifies how fast the robot can rotate around its vertical axis, demonstrating its dynamic control over angular motion.
  - Mathematical Formula: No explicit formula is given; it is the peak angular velocity achieved around the Z-axis (yaw).
  - Symbol Explanation:
    - rad/s: Radians per second, the unit of angular velocity.
-
Froude Number (Fr):
  - Conceptual Definition: The Froude number is a dimensionless quantity used to compare inertial forces with gravitational forces in fluid dynamics and, analogously, in animal locomotion. For legged locomotion, it characterizes the gait and dynamic similarity of different animals or robots, independent of their absolute size or speed. A higher Froude number generally implies more dynamic "running" or "bounding" gaits, while lower values correspond to more static "walking" gaits.
  - Mathematical Formula: For legged locomotion, the Froude number is typically defined as: $ \mathrm{Fr} = \frac{v^2}{gL} $
  - Symbol Explanation:
    - $v$: The average forward speed of the robot (m/s).
    - $g$: The acceleration due to gravity (approximately 9.81 m/s²).
    - $L$: The characteristic length of the robot, often taken as the leg length or hip height (m). In the context of quadrupeds, this is usually the hip height [35].
5.3. Baselines
The paper compares its method against several baselines, all of which use only proprioception, for a fair comparison in the single-step tests. All RL methods were trained using the same curriculum strategy, reward functions, random seed, and low-level network architecture as the proposed method.
- 1. RMA [3] (Rapid Motor Adaptation):
  - Description: This baseline uses a teacher-student training framework similar to the proposed method. The student policy incorporates a 1-D CNN adaptation module to implicitly infer privileged information and adapt its control. A key difference from the proposed method is that RMA freezes the student's low-level network (copied from the teacher) during student training, relying only on the adaptation module to infer context.
  - Representativeness: RMA is a state-of-the-art method for robust blind locomotion on legged robots, demonstrating strong sim-to-real transfer.
- 2. Concurrent [8]:
  - Description: This method involves a policy concurrently trained with a state estimation network that explicitly estimates the body state. For comparison, no terrain information was provided during its training (though the original work by Ji et al. may use or estimate it).
  - Representativeness: Represents methods that use concurrent state estimation to handle partially observable environments for dynamic and robust locomotion, often focusing on agility.
- 3. Domain Randomization (DR):
  - Description: This baseline directly trains the policy with Reinforcement Learning without any privileged state or terrain information (i.e., no teacher-student distillation or explicit state estimation). It relies solely on extensive dynamics randomization and observation noise during training to achieve sim-to-real transfer and robustness.
  - Representativeness: Domain randomization is a common and foundational technique for sim-to-real transfer in RL robotics.
- 4. Built-in MPC:
  - Description: This refers to the Model Predictive Control (MPC) controller that ships as the standard controller on the Unitree Go1 Edu robot.
  - Representativeness: Provides a strong baseline representing traditional optimization-based control methods, which often excel in known, structured environments but can struggle with unforeseen disturbances or complex, varied terrains without online adaptation.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results validate the effectiveness of the proposed method in achieving both robust and agile locomotion with natural gaits using only proprioceptive sensors and demonstrating zero-shot generalization.
6.1.1. Ablation Study for the Design of the Reward Terms
The ablation study evaluates the impact of the different reward components on the robot's gait motion. Three policies were trained for the same amount of time and with the same random seed:
- Total Reward: $r_t^{g} + r_t^{s} + r_t^{r}$ (the proposed method).
- Task + Style Reward: $r_t^{g} + r_t^{s}$.
- Task + Regularization Reward: $r_t^{g} + r_t^{r}$.

The following figure (Figure 3 from the original paper) shows the positions and velocities of the two joints (thigh and shank) of the front-right leg during forward walking in simulation.

Fig. 3. Measured positions and velocities of the two joints (thigh and shank) of the front-right leg during forward walking in simulation, recorded over a 2.5 s window.
As can be seen from Figure 3:
- The policy trained with the total reward terms ($r_t^{g} + r_t^{s} + r_t^{r}$, purple lines) shows joint positions and velocities that are closest to those generated from the Trajectory Optimization (TO) dataset (red lines). This indicates that the combination of task tracking, style imitation via AMP, and regularization successfully produces a gait that closely matches the desired natural trot. The resulting motion is natural and smooth.
- The policies trained with only two of the reward terms (orange and blue lines) generate jerky velocities and positions.
  - When the regularization reward ($r_t^{r}$) is omitted, the gait lacks smoothness, suggesting that the safety and joint-level penalties are crucial for stable and fluid movement.
  - When the style reward ($r_t^{s}$) is omitted, the gait deviates significantly from the TO reference, confirming that AMP is essential for learning natural motion styles and preventing the "unnatural and jerky" behaviors often seen in RL locomotion.

This ablation study validates that all three components of the reward function are necessary for achieving a natural, smooth, and desired gait, with the AMP-based style reward being key for naturalness and the regularization reward for smoothness and safety.
-
6.1.2. Evaluation of the Robust Locomotion
Traversability for Challenging Obstacles
The robot's robustness was tested on significant obstacles. For example, Figure 4 shows the robot successfully traversing a 25 cm high step. This is a highly challenging obstacle for the Go1 robot, which stands 28 cm tall, implying the obstacle is almost as tall as the robot's standing height. The AMP dataset only contained motions with 8 cm foot clearance, meaning this 25 cm step represents a completely unseen challenge in the motion prior.
The following figure (Figure 4 from the original paper) illustrates the robot's locomotion over challenging steps.

Fig. 4. Locomotion over challenging steps. The robot learns to detect huge steps of 25 cm by head collision and lifts its base height and swing height to overcome the obstruction.
Figure 4 illustrates the emergent foot-trapping reflex:
- The policy first identifies the head collision (the front of the robot body hitting the obstacle) purely from proprioceptive observations. This highlights the student policy's ability to implicitly infer contact and the presence of an obstacle from its memory of the proprioceptive stream.
- It then raises the body height to ensure the trunk clears the obstacle.
- Next, it lifts the foot as high as possible to step over.

This demonstrates the policy's impressive adaptation and recovery behaviors without any exteroceptive sensors; it learns to adapt robustly to obstacles by interpreting changes in its own body configuration.
Comparison with Baselines on Step Traversal
The method was compared with several proprioception-only baselines in single-step tests (10 trials per step height, with a constant forward velocity command applied for 10 s). A test was considered successful if the robot traversed the step with both front and hind legs.
The following figure (Figure 5 from the original paper) shows the success rates of different methods for different step heights.

Fig. 5. Success rates of different methods for different step heights (left: stepping up; right: stepping down). Success rate was evaluated over 10 trials for each step height. Red: ours; green: RMA; orange: Concurrent; blue: Built-in MPC; purple: Domain Randomization.
From Figure 5, the following observations can be made:
- Superior Performance: The proposed method (red line) significantly outperformed all other baselines in both stepping up and stepping down. It maintained a near-100% success rate up to a 25 cm step height, demonstrating exceptional robustness.
- Effectiveness of the Teacher-Student Framework: Teacher-student training frameworks (ours and RMA, green line) are generally more effective than methods without explicit terrain information (Concurrent, orange; Domain Randomization, purple; Built-in MPC, blue). Methods without access to terrain information, even implicitly, often failed when step heights exceeded 13 cm. This confirms that inferring terrain properties, even implicitly, is crucial for negotiating large obstacles.
- Advantage Over RMA: While RMA performed well up to 15 cm, its performance decreased rapidly beyond that. The paper attributes this difference to how the student policy's low-level network is handled:
  - Our method: Reuses the teacher's low-level network weights but continues training them for the student. This allows the student to refine its control layers and better adapt to increasingly difficult terrains, even when the reconstruction errors of the latent representations might be larger.
  - RMA: Freezes the student's low-level network after copying it from the teacher. With fixed weights, the output becomes unstable as terrain difficulty increases and the adaptation module's inferred information becomes less accurate, leading to a rapid performance drop. This highlights the importance of allowing the entire student network to adapt.
Robustness in Diverse Environments
The robot's robustness was further evaluated in diverse real-world environments (shown in Figure 1 and the accompanying video), including large curbs, dense vegetation, moderately rocky terrains, loose rubble, and grassland. These terrains present significant challenges due to variations in material properties, deformability, and friction. Despite never having seen these specific terrains in simulation (which only used procedurally generated terrains), the policy demonstrated strong zero-shot generalization. The robot showed impressive recovery and adaptation behaviors to continue running over these uneven surfaces, indicating that the history of proprioceptive observations allows it to infer and react to changes in interaction dynamics.
The following figure (Figure 1 from the original paper) shows the robot achieving robust locomotion over challenging terrains and agile locomotion over natural terrains.

Fig. 1. Our robot can achieve robust locomotion over challenging terrains and agile locomotion over natural terrains. The robot can successfully traverse curbs, stairs, rocky and vegetation while running and spinning at high speed over natural terrains. Whether on even or complex terrains, at high or low speeds, the emergent behaviors learned by our robot exhibit a natural gait and smooth motion.
6.1.3. Evaluation of the Agile Locomotion
Outdoor Sprinting
The agility of the controller was tested through outdoor sprinting on three different real-world terrains over a 10 m distance, with a maximum longitudinal velocity command of 4 m/s.
The following are the results from Table IV of the original paper:
| Terrains | Time Elapsed (s) | Speed (m/s) |
|---|---|---|
| Plastic Track (Real) | 2.89 ± 0.04 | 3.46 ± 0.04 |
| Rocky Road (Real) | 3.19 ± 0.06 | 3.14 ± 0.06 |
| Grassland (Real) | 3.65 ± 0.10 | 2.74 ± 0.07 |
From Table IV:
- High Speeds Achieved: The robot achieved a maximum average speed of 3.46 m/s on the plastic track, which is an even terrain. This demonstrates high agility.
- Generalization to Natural Terrains: On the rocky road and grassland (terrains with more variation in friction, restitution, and deformation, and not seen during simulation training), the robot still maintained high average speeds of 3.14 m/s and 2.74 m/s, respectively. This again confirms the strong zero-shot generalization of the policy to unseen natural terrains (a short sanity-check computation of these speeds follows this list).
- Sim-to-Real Gaps: The paper notes that tracking errors still remain (e.g., an actual speed of 3.46 m/s against a 4 m/s command). These are attributed to remaining sim-to-real gaps, potentially stemming from the robot's motors or unrealistic dynamics in the simulator.
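As a quick sanity check, the average speeds in Table IV follow directly from the 10 m sprint distance and the elapsed times. The short sketch below reproduces them, using first-order uncertainty propagation for the quotient (an assumption about how the error bars were computed, so small rounding differences remain).

```python
# Reproduce Table IV's average speeds from distance / elapsed time.
DISTANCE_M = 10.0

runs = {
    "Plastic Track": (2.89, 0.04),
    "Rocky Road":    (3.19, 0.06),
    "Grassland":     (3.65, 0.10),
}

for terrain, (t_mean, t_std) in runs.items():
    v_mean = DISTANCE_M / t_mean
    # First-order propagation: for v = d / t, sigma_v ≈ v * sigma_t / t.
    v_std = v_mean * t_std / t_mean
    print(f"{terrain:14s} {v_mean:.2f} ± {v_std:.2f} m/s")

# Plastic Track  3.46 ± 0.05 m/s   (Table IV reports 3.46 ± 0.04)
# Rocky Road     3.13 ± 0.06 m/s   (Table IV reports 3.14 ± 0.06)
# Grassland      2.74 ± 0.08 m/s   (Table IV reports 2.74 ± 0.07)
```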
High-Speed Spinning
The robot also demonstrated impressive high-speed spinning ability on a flat lab floor, accelerating to a maximum yaw velocity of 5.8 rad/s and then stopping safely.
Agility in Uneven Environments
The robot was able to run robustly at high speed on vegetated terrain and small stairs. These environments, with their tangles of branches and leaves and their small steps, can significantly disrupt high-speed movement. The robot's ability to recover, adapt, and continue running under such conditions further highlights the combined robustness and agility of the single policy.
Overall Comparison with Previous RL Methods
Table V provides an overall comparison of the proposed method's locomotion abilities with other RL controllers on different quadrupeds.
The following are the results from Table V of the original paper:
| Robot | Blind | Complex Terrains | Froude [35] | Speed (m/s) |
|---|---|---|---|---|
| A1 [36] | Yes | Unknown | 1.1 | 1.7 |
| MC [7] | Yes | Unknown | 5.1 | 3.9 |
| ANYmal [4] | No | Yes | 0.3 | 1.2 |
| Ours Go1 | Yes | Yes | 4.2 | 3.5 |
From Table V:
- Blind and Complex Terrains: The proposed method (Ours Go1) is one of the few that achieves blind locomotion (Yes) and can traverse complex terrains (Yes), setting it apart from Mini Cheetah (MC) [7] and A1 [36] (where complex-terrain performance is Unknown) and from ANYmal [4] (which is not blind).
- Froude Number and Speed:
  - The Froude number for our robot is 4.2, indicating a highly dynamic, agile gait comparable to MC [7] (5.1). This is significantly higher than A1 (1.1) and ANYmal (0.3), which correspond to more walking-like gaits (a minimal consistency check of these numbers follows this list).
  - Our robot achieves a speed of 3.5 m/s, which is highly competitive, especially considering its blindness and complex-terrain capabilities. MC [7] achieves a higher speed of 3.9 m/s, but without reported performance on complex terrains.

These results conclusively demonstrate that the proposed method yields a highly competitive robot capable of both robust locomotion over challenging terrains and agile locomotion over natural terrains, all while operating blindly with a natural gait.
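For context, the Froude number in Table V is the standard dimensionless measure of gait dynamism for legged systems, Fr = v^2 / (g h), where h is a characteristic leg (hip) height. The sketch below checks that the reported values are mutually consistent; the 0.30 m hip height used for the Go1 is our assumption, not a number stated in this section.

```python
# Froude number for legged locomotion: Fr = v^2 / (g * h).
G = 9.81             # gravitational acceleration, m/s^2
HIP_HEIGHT_M = 0.30  # assumed nominal Go1 hip height (not quoted in this section)

def froude(speed_mps: float, leg_height_m: float = HIP_HEIGHT_M, g: float = G) -> float:
    """Dimensionless Froude number: ratio of inertial to gravitational effects."""
    return speed_mps ** 2 / (g * leg_height_m)

def implied_leg_height(speed_mps: float, fr: float, g: float = G) -> float:
    """Leg height implied by a reported (speed, Froude) pair from Table V."""
    return speed_mps ** 2 / (g * fr)

print(f"Go1 at 3.5 m/s: Fr ≈ {froude(3.5):.1f}")                                   # ~4.2, matching Table V
print(f"MC (3.9 m/s, Fr 5.1): implied h ≈ {implied_leg_height(3.9, 5.1):.2f} m")   # ~0.30 m
```

Froude numbers well above 1 correspond to running gaits with aerial phases, while values around 0.3-0.5 are typical of walking, which matches the contrast the table draws between the Go1/MC and ANYmal results.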
6.2. Ablation Studies / Parameter Analysis
The paper presents an ablation study of the reward term design (discussed in Section 6.1.1 and Figure 3), which confirmed that the task reward, style reward, and regularization reward are all necessary for generating natural, smooth, and robust gaits.
While not a formal ablation study in the results section, the comparison with RMA in the robust-locomotion evaluation (Section 6.1.2 and Figure 5) can be read as an analysis of a key architectural decision: how the student's low-level network is treated during training. The finding that continuing to train the student's low-level network (as in the proposed method) outperforms freezing it (as in RMA) on more difficult terrains highlights a design choice that significantly affects performance and generalization. It indicates that the low-level control layers should remain trainable so they can adapt to the environmental context inferred by the memory network, especially when facing extreme challenges.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully demonstrates a pioneering Reinforcement Learning (RL)-based approach that enables legged robots to achieve both robust and agile locomotion simultaneously. A single policy, trained using Adversarial Motion Priors (AMP) within a teacher-student framework, effectively generates natural gaits. A critical contribution is the zero-shot generalization of these learned skills from flat-terrain motion data to diverse, challenging real-world environments. The system, deployed on a Unitree Go1 quadruped robot using only proprioceptive sensors (IMU and joint encoders), showcases impressive capabilities in traversing 25 cm-high obstacles and sprinting at up to roughly 3.5 m/s while maintaining natural, smooth motions. To the best of the authors' knowledge, this is the first system to achieve such a combination of blind robustness and agility in legged locomotion.
7.2. Limitations & Future Work
The authors acknowledge a significant limitation inherent to blind locomotion:
- Lack of Proactive Planning: Since the system relies solely on proprioception, it operates in a "trial and error" manner and cannot perform tasks that require planning ahead of time. Examples given include efficiently climbing stairs (which might require a high-level plan for foot placement and body trajectory) or avoiding holes (which requires foresight of upcoming negative obstacles). The robot reacts to direct physical interactions or inferred states rather than pre-emptively strategizing.

For future work, the authors propose:

- Integrating Vision Sensors (Exteroception): Extending the system with vision sensors (e.g., cameras, LIDARs) to incorporate exteroception. This would allow the robot to perceive and understand its environment more directly, enabling planning ahead and potentially improving locomotion efficiency in complex scenarios such as stair climbing or hole avoidance.
7.3. Personal Insights & Critique
This paper offers several profound inspirations. The most significant is the demonstration that AMP can be a powerful tool not just for stylized animation but for real-world robot control, particularly in achieving natural gaits with zero-shot generalization. The finding that motion priors learned on flat terrains can transfer to challenging, unseen environments is remarkable and opens avenues for generating diverse and realistic robot behaviors from simpler, more accessible datasets. The elegance of replacing complex reward engineering with a style reward is also highly appealing, simplifying the RL process.
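For readers less familiar with AMP, the mechanism being praised here is compact: a discriminator is trained to distinguish transitions from the motion dataset from those produced by the policy, and its output is converted into a style reward that is added to the task and regularization rewards. The sketch below follows the least-squares style-reward form commonly used with AMP (Peng et al., 2021); the network sizes and transition dimension are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

TRANSITION_DIM = 60  # placeholder size of an (s, s') transition feature vector

class AMPDiscriminator(nn.Module):
    """Scores transitions: ~+1 for dataset-like motion, ~-1 for policy-generated motion."""

    def __init__(self, dim: int = TRANSITION_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, transition: torch.Tensor) -> torch.Tensor:
        return self.net(transition)

def style_reward(disc: AMPDiscriminator, transition: torch.Tensor) -> torch.Tensor:
    """Least-squares AMP style reward: r = max(0, 1 - 0.25 * (D(s, s') - 1)^2)."""
    d = disc(transition)
    return torch.clamp(1.0 - 0.25 * (d - 1.0) ** 2, min=0.0)

# The total reward then combines task, style, and regularization terms,
# e.g. r_total = w_task * r_task + w_style * r_style + w_reg * r_reg (weights are design choices).
disc = AMPDiscriminator()
policy_transitions = torch.randn(8, TRANSITION_DIM)  # batch of policy-generated transitions
print(style_reward(disc, policy_transitions).shape)  # torch.Size([8, 1])
```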
The teacher-student framework, combined with the continuous training of the student's low-level network, is a crucial insight. It highlights that while privileged information can kickstart learning, allowing the student's entire control architecture to remain adaptable is key for robust generalization to extreme conditions. This nuanced approach to sim-to-real transfer and blind locomotion is valuable.
However, a potential issue or area for improvement, as hinted at by the authors' sim-to-real gap discussion, is the realism of the Isaac Gym simulator's dynamics randomization. While effective, it is possible that more sophisticated or adaptive domain randomization could further reduce the gap between commanded and actual speeds, especially for agile tasks. The paper does not delve deeply into the specifics of the sim-to-real gap beyond attributing it to the motors or unrealistic simulator dynamics; further analysis or targeted mitigation strategies could enhance the practical applicability of the system.
The core limitation of blind locomotion is well-articulated by the authors. This method excels at reactive adaptation, but true intelligence in navigating complex environments often requires a predictive capacity. The suggested future work of integrating exteroception is a natural and necessary next step. However, it will introduce new challenges, such as sensor fusion, handling noisy visual data, and integrating high-level planning with the low-level proprioceptive control. The paper provides a strong foundation upon which such hybrid systems can be built.
The methods and conclusions could be highly transferable to other domains involving complex dynamic systems where natural motion or adaptive behavior under partial observability is desired, such as humanoid robotics, prosthetics control, or even virtual character animation with more realistic physics. The AMP approach, in particular, could be generalized to imbue various RL agents with naturalistic styles without arduous reward shaping.