Reinforcement Learning for Versatile, Dynamic, and Robust Bipedal Locomotion Control
TL;DR Summary
This paper develops dynamic locomotion controllers for bipedal robots using deep reinforcement learning, going beyond single-skill limitations with a novel dual-history architecture that enhances adaptivity and robustness. The resulting controllers deliver superior performance across diverse skills in both simulation and on the real robot.
Abstract
This paper presents a comprehensive study on using deep reinforcement learning (RL) to create dynamic locomotion controllers for bipedal robots. Going beyond focusing on a single locomotion skill, we develop a general control solution that can be used for a range of dynamic bipedal skills, from periodic walking and running to aperiodic jumping and standing. Our RL-based controller incorporates a novel dual-history architecture, utilizing both a long-term and short-term input/output (I/O) history of the robot. This control architecture, when trained through the proposed end-to-end RL approach, consistently outperforms other methods across a diverse range of skills in both simulation and the real world. The study also delves into the adaptivity and robustness introduced by the proposed RL system in developing locomotion controllers. We demonstrate that the proposed architecture can adapt to both time-invariant dynamics shifts and time-variant changes, such as contact events, by effectively using the robot's I/O history. Additionally, we identify task randomization as another key source of robustness, fostering better task generalization and compliance to disturbances. The resulting control policies can be successfully deployed on Cassie, a torque-controlled human-sized bipedal robot. This work pushes the limits of agility for bipedal robots through extensive real-world experiments. We demonstrate a diverse range of locomotion skills, including: robust standing, versatile walking, fast running with a demonstration of a 400-meter dash, and a diverse set of jumping skills, such as standing long jumps and high jumps.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Reinforcement Learning for Versatile, Dynamic, and Robust Bipedal Locomotion Control
1.2. Authors
Zhongyu Li, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, Koushil Sreenath
- Zhongyu Li: Affiliated with the University of California, Berkeley.
- Xue Bin Peng: Affiliated with Simon Fraser University.
- Pieter Abbeel: Affiliated with the University of California, Berkeley.
- Sergey Levine: Affiliated with the University of California, Berkeley.
- Glen Berseth: Affiliated with Université de Montréal and Mila - Quebec AI Institute.
- Koushil Sreenath: Affiliated with the University of California, Berkeley.
The authors represent prominent institutions known for their leading research in robotics, artificial intelligence, and control systems, suggesting a strong background in deep reinforcement learning and robotic locomotion.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. The abstract mentions that a preliminary version was presented at Robotics: Science and Systems [6]. Robotics: Science and Systems (RSS) is a highly reputable and selective conference in robotics research. Publication on arXiv indicates the work is shared for community review and dissemination before, or in parallel with, formal peer review.
1.4. Publication Year
2024
1.5. Abstract
This paper presents a comprehensive study on using deep reinforcement learning (RL) to develop dynamic locomotion controllers for bipedal robots. Unlike prior work focusing on single skills, this research develops a general control solution capable of handling a range of dynamic bipedal skills, including periodic walking and running, and aperiodic jumping and standing. The RL-based controller introduces a novel dual-history architecture that utilizes both long-term and short-term input/output (I/O) history of the robot. This architecture, trained via an end-to-end RL approach, consistently surpasses other methods across diverse skills in both simulation and real-world deployment. The study also investigates the adaptivity and robustness of the proposed RL system. It demonstrates that the architecture can adapt to time-invariant dynamics shifts and time-variant changes (e.g., contact events) by effectively leveraging the robot's I/O history. Task randomization is identified as another crucial source of robustness, fostering better task generalization and disturbance compliance. The resulting control policies are successfully deployed on Cassie, a torque-controlled human-sized bipedal robot. This work significantly pushes the limits of agility for bipedal robots through extensive real-world experiments, showcasing robust standing, versatile walking, fast running (including a 400-meter dash), and diverse jumping skills (standing long jumps and high jumps).
1.6. Original Source Link
https://arxiv.org/abs/2401.16889 The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The overarching goal in bipedal robotics is to develop robots capable of operating reliably in diverse human environments. A significant bottleneck is the lack of a general control solution for diverse, agile, and robust legged locomotion skills (like walking, running, and jumping) for high-dimensional, human-sized bipedal robots.
Core Problem:
Previous research often focuses on single locomotion skills or struggles with the complexity of underactuated dynamics and distinct contact plans of bipedal robots. Bipedal robots, with their floating base and high-dimensional nonlinear dynamics, present challenges for motion planning and control due to:
- Underactuated Dynamics & Contacts: Reliance on contacts with the environment for movement leads to discontinuities in trajectories, making contact mode planning and stabilization difficult. Leveraging full-order dynamics models is computationally expensive.
- Diverse Locomotion Skills: Different skills (periodic like walking/running vs. aperiodic like jumping) have distinct stability requirements. Periodic skills can leverage orbital stability, while aperiodic skills require finite-time stability, which is further complicated by large impact forces during landing.
Why this problem is important: Human environments are predominantly tailored for bipedal locomotion. Addressing these control challenges is critical for enabling bipedal robots to operate effectively and safely in human-centric spaces, unlocking their full potential for complex, agile maneuvers.
Paper's Entry Point / Innovative Idea:
The paper leverages deep reinforcement learning (RL) to create controllers that can learn directly from the robot's full-order dynamics. Its innovative idea centers on a novel dual-history architecture for RL-based controllers and a multi-stage training framework that emphasizes task randomization. This approach aims to:
- Develop a general control solution: Go beyond single-skill focus to encompass a wide range of dynamic bipedal skills.
- Improve adaptivity: Enable controllers to leverage proprioceptive information (I/O history) to adapt to uncertain, potentially time-varying dynamics.
- Enhance robustness: Generalize to new environments and unexpected scenarios, demonstrating robust behaviors through task randomization.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of legged locomotion control for bipedal robots:
- Development of a New Framework for General Bipedal Locomotion Control: Introduces a general RL framework effective across periodic (walking, running), aperiodic (jumping), and stationary (standing) skills. The resulting controllers are zero-shot deployable on real robots.
- Novel Design Choices for RL-based Control Policy: Proposes a dual-history architecture for non-recurrent RL policies, integrating both long-term and short-term input/output (I/O) history. Combined with an end-to-end training strategy, this achieves state-of-the-art performance, validated in simulation and real-world experiments.
- Empirical Investigation of Adaptivity in RL Controllers: Conducts a detailed empirical study showing that RL-induced adaptivity covers both time-invariant dynamics shifts and time-variant changes (such as contact events) by effectively using the robot's I/O history.
- Improved Robustness in RL Controllers: Identifies task randomization as a key source of robustness, distinct from traditional dynamics randomization. It significantly enhances task generalization and disturbance compliance.
- Extensive Real-World Validation and Demonstrations: Successfully deploys the system on Cassie, a human-sized bipedal robot, demonstrating robust standing, versatile walking, fast running (including a 400-meter dash), and diverse jumping skills (standing long jumps and high jumps). This pushes the agility limits for bipedal robots.
Key Conclusions/Findings:
- The dual-history architecture effectively leverages both recent feedback (short history) and system identification / state estimation (long history) for superior control.
- End-to-end training of the base policy and history encoder is more effective than policy distillation methods (like Teacher-Student or RMA) for bipedal control.
- Adaptivity in RL controllers stems from the ability of the history encoder to capture meaningful information about time-varying events and time-invariant dynamics changes.
- Task randomization is a crucial, "orthogonal" source of robustness, enabling policies to generalize across tasks and exhibit compliant recovery behaviors not explicitly trained for.
- The developed skill-specific versatile policies can implicitly learn contact planning online and dynamically adjust maneuvers for stability and robustness.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Reinforcement Learning (RL)
Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent does not receive explicit instructions but learns through trial-and-error.
- Agent: The learner or decision-maker (in this paper, the robot's controller).
- Environment: The world with which the agent interacts (the physical robot and its surroundings).
- State ($s_t$): A representation of the environment's situation at time $t$.
- Observation ($o_t$): The partial information the agent perceives about the state (in POMDPs).
- Action ($a_t$): A decision made by the agent that affects the environment.
- Reward ($r_t$): A scalar feedback signal from the environment indicating the desirability of the agent's action at a given state. The agent's goal is to maximize the cumulative reward over time.
- Policy ($\pi$): A function that maps states (or observations) to actions. The goal of RL is to find an optimal policy that maximizes the expected cumulative reward.
- Deep Reinforcement Learning (DRL): Combines RL with deep neural networks to handle high-dimensional states and actions, often using deep learning models (like MLPs or CNNs) as function approximators for policies or value functions.
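To make the agent-environment loop above concrete, here is a minimal, self-contained Python sketch; the toy environment, its dynamics, and the random stand-in policy are purely illustrative and are not part of the paper.

```python
import numpy as np

class ToyEnv:
    """Illustrative 1-D environment: the agent tries to drive its state toward zero."""
    def reset(self):
        self.state = np.random.uniform(-1.0, 1.0)
        return self.state

    def step(self, action):
        self.state += 0.1 * action          # simple toy dynamics
        reward = -abs(self.state)           # higher reward the closer to zero
        done = abs(self.state) > 2.0
        return self.state, reward, done

def random_policy(observation):
    return np.random.uniform(-1.0, 1.0)    # stand-in for a learned policy pi(a | o)

env = ToyEnv()
obs, total_return = env.reset(), 0.0
for t in range(100):                        # one episode of trial-and-error
    action = random_policy(obs)
    obs, reward, done = env.step(action)
    total_return += reward                  # RL seeks to maximize this cumulative reward
    if done:
        break
```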
3.1.2. Bipedal Robots
Bipedal robots are robots that walk on two legs, mimicking human or animal locomotion.
- Floating Base: The main body of the robot (e.g., torso or pelvis), which is not fixed to the ground and has 6 Degrees of Freedom (DoFs): 3 translational and 3 rotational. This makes control more complex due to underactuated dynamics.
- Underactuated Dynamics: Systems where the number of actuators (motors) is less than the number of Degrees of Freedom (DoFs). Bipedal robots are inherently underactuated during certain phases of locomotion (e.g., the flight phase in running/jumping, or even when standing with both feet on the ground, as torso motions are not directly actuated).
- Torque-controlled robots: Robots whose actuators (motors) are directly commanded with desired torques (forces that cause rotation). This offers finer control over interaction forces but requires more sophisticated controllers compared to position-controlled robots. Cassie is a torque-controlled robot.
3.1.3. Sim-to-Real Transfer
Sim-to-real transfer is the process of training a robot control policy in a simulated environment and then deploying it on a physical robot without significant retraining or fine-tuning. This is crucial because training directly on physical robots can be expensive, time-consuming, and risky.
- Dynamics Randomization: A common technique in sim-to-real transfer where physical parameters of the simulated robot and environment (e.g., mass, friction, motor gains, sensor noise, latency) are varied randomly during training. This forces the RL agent to learn policies that are robust to uncertainties and discrepancies between simulation and the real world.
- Zero-shot transfer: When a policy trained purely in simulation works effectively on a real robot without any additional training or fine-tuning on the hardware.
3.1.4. Partially Observable Markov Decision Process (POMDP)
A Partially Observable Markov Decision Process (POMDP) is a generalization of an MDP where the agent does not know the exact state of the environment. Instead, it makes observations that are probabilistically related to the state.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is defined by states, actions, transition probabilities, and rewards.
- Belief State: In a POMDP, the agent maintains a belief state, which is a probability distribution over the possible underlying states, given the history of observations and actions.
3.1.5. Input/Output (I/O) History
In the context of robot control, I/O history refers to a sequence of past inputs (actions sent to the robot, e.g., motor commands) and outputs (observable states or sensor readings from the robot, e.g., joint positions, velocities, base orientation) over a certain time window. This history provides the controller with context about recent dynamics, enabling it to infer unobservable states or adapt to changing system parameters.
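A minimal sketch of how such a rolling I/O history buffer might be maintained; the 66-step and 4-step window lengths follow the paper's long and short histories, while the class name, dimensions, and zero-padding scheme are assumptions for illustration.

```python
from collections import deque
import numpy as np

class IOHistory:
    """Rolling buffer of (output, input) pairs: robot observations and previous actions."""
    def __init__(self, obs_dim, act_dim, long_len=66, short_len=4):
        self.long = deque(maxlen=long_len)      # roughly 2 s of history at the policy rate
        self.short_len = short_len
        self.obs_dim, self.act_dim = obs_dim, act_dim

    def append(self, obs, prev_action):
        self.long.append(np.concatenate([obs, prev_action]))

    def long_history(self):
        # Returns a (long_len, obs_dim + act_dim) array, zero-padded until the buffer fills up.
        pad = [np.zeros(self.obs_dim + self.act_dim)] * (self.long.maxlen - len(self.long))
        return np.stack(pad + list(self.long))

    def short_history(self):
        # The most recent few timesteps, flattened, for direct real-time feedback.
        return self.long_history()[-self.short_len:].ravel()
```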
3.1.6. Low Pass Filter (LPF)
A Low Pass Filter (LPF) is an electronic or digital filter that passes low-frequency signals and attenuates (reduces the amplitude of) signals with frequencies higher than a certain cutoff frequency. In robotics, it's often used to smooth out noisy sensor data or control commands, preventing jerky movements and reducing wear on mechanical components.
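As a rough illustration of action smoothing, the sketch below implements a simple discrete first-order low-pass filter. The paper uses a Butterworth filter; the cutoff frequency and sampling rate here are illustrative placeholders, not the paper's values.

```python
import numpy as np

class FirstOrderLowPass:
    """Simple discrete first-order low-pass filter for smoothing action commands."""
    def __init__(self, cutoff_hz, sample_hz):
        # Euler discretization of an RC low-pass filter.
        rc = 1.0 / (2.0 * np.pi * cutoff_hz)
        dt = 1.0 / sample_hz
        self.alpha = dt / (rc + dt)
        self.y = None

    def __call__(self, x):
        self.y = x if self.y is None else self.y + self.alpha * (x - self.y)
        return self.y

# Example: smooth a noisy desired-motor-position command sampled at an assumed 33 Hz.
lpf = FirstOrderLowPass(cutoff_hz=4.0, sample_hz=33.0)   # illustrative cutoff
smoothed = [lpf(q) for q in np.random.randn(100)]
```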
3.2. Previous Works
The paper categorizes previous efforts into model-based optimal control (OC) and model-free reinforcement learning (RL).
3.2.1. Model-based Optimal Control (OC)
Model-based optimal control methods formulate locomotion as an optimization problem where the robot's dynamics model is explicitly used as a constraint.
- Challenges:
  - Computational Complexity: Full-order dynamics are too complex for online optimization.
  - Contact Planning: Making and breaking contact creates non-smooth trajectories and is difficult to optimize.
  - Scalability: Often task-specific, requiring significant effort to adapt to new skills.
- Approaches (see the LIP sketch after this list for intuition on reduced-order models):
  - Cascaded Optimization: Hierarchical control where high-level planning generates reference trajectories and low-level controllers handle real-time execution.
  - Reduced-Order Models: Simplifications of the robot dynamics (e.g., centroidal dynamics, the Linear Inverted Pendulum (LIP), the Angular Momentum LIP (ALIP), the Hybrid-LIP (H-LIP)) for online trajectory optimization.
    - LIP Model: A simplified walking model that assumes the center of mass (CoM) moves at a constant height and that horizontal motion is decoupled from vertical motion. This allows for simpler planning but limits dynamic capabilities.
  - Whole-Body Control (WBC): A reactive controller that translates reduced-order states to joint-level inputs, often by solving a Quadratic Program (QP) with various constraints.
  - Hybrid Zero Dynamics (HZD): Uses the robot's full-order model to design attractive periodic gaits offline, with online feedback controllers to enforce virtual constraints. It relies on the stability of periodic gaits, making it less suitable for aperiodic motions like jumping.
- Contact Planning:
  - Many studies pre-define fixed contact sequences (e.g., for walking, running, jumping).
  - Efforts to integrate contact planning with trajectory optimization involve mixed-integer programming or contact-implicit methods (using complementarity constraints or bilevel optimization), but these are often limited to offline optimization for bipedal robots.
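For intuition on the reduced-order models mentioned above, here is a minimal forward simulation of the Linear Inverted Pendulum, whose horizontal CoM dynamics reduce to $\ddot{x} = (g/z_0)(x - p)$ with constant CoM height $z_0$ and foot/CoP location $p$; all numerical values are illustrative.

```python
import numpy as np

def lip_rollout(x0, xdot0, p_foot, z0=0.9, g=9.81, dt=0.001, steps=500):
    """Integrate the Linear Inverted Pendulum: xddot = (g / z0) * (x - p_foot)."""
    omega2 = g / z0                       # squared natural frequency of the pendulum
    x, xdot, traj = x0, xdot0, []
    for _ in range(steps):
        xddot = omega2 * (x - p_foot)     # horizontal CoM acceleration
        xdot += xddot * dt
        x += xdot * dt
        traj.append(x)
    return np.array(traj)

# CoM starting 5 cm behind the stance foot with a small forward velocity.
com_trajectory = lip_rollout(x0=-0.05, xdot0=0.3, p_foot=0.0)
```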
3.2.2. Model-free Reinforcement Learning (RL)
Model-free RL allows robots to learn control policies through trial-and-error without an explicit dynamics model.
- Control Policy Structure:
  - Observation History: Varies from states-only history to I/O history (the robot's input and output).
  - History Length: Ranges from short (1-15 timesteps) to long (50+ timesteps).
  - Policy Architecture: MLPs for shorter histories, recurrent units (like LSTMs) for longer sequences. The paper notes that a short state history was reported to work better for bipedal humanoid control [79].
- Sim-to-Real Transfer Techniques in RL:
  - End-to-end training with history: Policies trained directly under randomized dynamics using a history of robot measurements or I/O. Applied to bipedal robots [12, 14, 13].
  - Policy Distillation (Teacher-Student / RMA): An expert policy (teacher) with access to privileged information supervises a student policy that only has proprioceptive feedback.
    - Teacher-Student (TS) / Rapid Motor Adaptation (RMA): A two-stage method. First, an expert policy is trained in simulation with access to ground-truth environment parameters (privileged information). Second, a student policy learns to mimic the expert's behavior, typically by inferring these privileged parameters from proprioceptive observations (like I/O history) using an encoder.
      - Expert Policy: A policy that directly uses information unavailable to the real robot (e.g., exact mass, friction, external forces). It learns how to react optimally if it knew these parameters.
      - Student Policy: The policy actually deployed on the robot. It learns to estimate the privileged parameters from its limited sensory observations and uses these estimates to adapt its behavior.
    - Adaptive Rapid Motor Adaptation (A-RMA): An extension of RMA in which, after the student policy learns to infer privileged information, the base MLP (which consumes the inferred information) is further fine-tuned with the encoder frozen. This allows the control part of the policy to compensate for inaccuracies in the inference mechanism.
- Scalability to Different Locomotion Skills:
  - Single-skill, fixed-task policies: Early RL efforts focused on basic skills like walking forward.
  - Single-skill, multi-task policies: Training policies to track varying commands (e.g., different walking velocities) without explicit reference motions.
  - Parameterized Reference Motions: Providing parameterized motions or using policy distillation from task-specific policies for bipedal robots.
  - Adversarial Motion Priors (AMP): Using adversarial training to learn diverse skills by matching the learned motion to a distribution of reference motions [92].
3.3. Technological Evolution
Early bipedal locomotion heavily relied on model-based optimal control, which provided theoretical guarantees but struggled with computational complexity, real-world uncertainties, and scaling to diverse, dynamic behaviors. Simplifications (e.g., LIP model, fixed contact sequences) were necessary, limiting agility and robustness.
The rise of deep learning and reinforcement learning offered a new paradigm. Initially successful in simulation, RL faced the sim-to-real gap. Techniques like dynamics randomization and policy distillation (e.g., Teacher-Student, RMA) bridged this gap, first for quadrupedal robots and then adapted for bipedal robots.
This paper's work fits into the current era of RL-driven robotics, building on these sim-to-real advancements. It pushes beyond single-skill focus and static robustness (from dynamics randomization) by introducing task randomization and a dual-history architecture to achieve truly versatile, dynamic, and robust control for human-sized bipedal robots. It moves towards more general and adaptive controllers that can handle the inherent complexity and uncertainties of real-world bipedal locomotion.
3.4. Differentiation Analysis
Compared to prior model-based and model-free methods, this paper's core innovations and differentiations are:
- Generalization across Diverse Skills:
  - Prior Model-based OC: Highly task-specific, requiring significant re-engineering for each new skill (e.g., HZD for walking vs. running). Aperiodic skills like jumping are particularly challenging.
  - Prior RL: Often focused on single skills or required separate policies/fine-tuning for different tasks within a skill.
  - This Paper: Develops a single, general RL framework that produces skill-specific versatile policies (e.g., one policy for all walking tasks, one for all running tasks, one for all jumping tasks) across periodic, aperiodic, and stationary skills using largely the same architecture and training scheme.
- Novel Dual-History Architecture:
  - Prior RL (History Usage): Varied from no history, to short history (MLP), to long states-only or I/O history (recurrent networks or CNNs). Ablation studies often contrasted long histories with immediate state feedback.
  - This Paper: Introduces a dual-history approach with both a long-term I/O history (encoded by a 1D CNN) and a short-term I/O history fed directly into the base MLP. This is shown to significantly outperform policies relying solely on long history, short history, or states-only history, by providing both contextual system identification and immediate real-time feedback.
- Emphasis on Task Randomization for Robustness:
  - Prior RL (Robustness): Primarily attributed to dynamics randomization (e.g., Teacher-Student, RMA) to bridge the sim-to-real gap. While effective, this mainly addresses parametric uncertainty.
  - This Paper: Identifies and empirically demonstrates task randomization as an "orthogonal" and crucial source of robustness. By training on a wide range of tasks within a skill, the robot learns generalized behaviors and compliant recovery maneuvers that go beyond adhering strictly to a single commanded task, even under unseen perturbations. This is a novel insight into RL-based robustness.
- End-to-End Training vs. Policy Distillation:
  - Prior RL (Sim-to-Real): Policy distillation (TS/RMA) is prevalent, especially for quadrupeds, relying on an expert policy and inferred privileged information.
  - This Paper: Advocates for and demonstrates the superiority of end-to-end training (jointly training the base policy and history encoder) over policy distillation for complex bipedal control. It argues that direct adaptive control, without explicit estimation of pre-selected parameters, allows the policy to implicitly learn more relevant time-varying information (like contact events) for control.
- Real-World Agility and Novel Capabilities:
  - Prior Work: While some bipedal robots achieved running or jumping, demonstrations of versatile and robust performance across multiple dynamic skills, especially with record-setting feats on Cassie (e.g., the 400-meter dash and the long and high jumps reported here), were limited.
  - This Paper: Achieves unprecedented levels of real-world agility, setting new benchmarks for Cassie and human-sized bipedal robots, including demonstrating online contact planning (implicit in the learned policies) for recovery.
4. Methodology
4.1. Principles
The core idea of this method is to leverage deep reinforcement learning (RL) to create general, dynamic, and robust locomotion controllers for bipedal robots. The theoretical basis or intuition behind it is that by exposing an RL agent to a wide variety of tasks and environmental uncertainties (dynamics randomization) within a simulation, and by providing it with a rich contextual observation (dual-history I/O), the agent can learn adaptive and robust control policies that are effectively model-free and direct adaptive. This means the robot learns to identify aspects of its own dynamics and environment changes implicitly from its past inputs and outputs and adjusts its control actions directly, rather than relying on an explicit model or estimated parameters.
4.2. Core Methodology In-depth (Layer by Layer)
The proposed RL system for general bipedal locomotion control is structured as follows:
4.2.1. Robot Model and Control Policy Structure
The experimental platform is Cassie, a human-sized, torque-controlled bipedal robot.
- Joints: 7 joints per leg, totaling 14 leg joints; 10 are actuated by motors and 4 are passive (associated with Cassie's springs).
- Floating Base: The robot's floating base has 6 DoFs (3 translational and 3 rotational).
- Generalized Coordinates: The full system's generalized coordinates $\mathbf{q}$ stack the floating-base pose with the leg joint coordinates.
- Observable States ($\mathbf{o}_t$): Only partial states are reliably measured or estimated. $\mathbf{o}_t$ contains the motor positions and velocities, the base orientation, and the estimated base linear velocity (from an EKF).

The robot's dynamics equation is given by:
$ \mathbf{M}(\mathbf{q})\ddot{\mathbf{q}} + \mathbf{C}(\mathbf{q}, \dot{\mathbf{q}})\dot{\mathbf{q}} + \mathbf{G}(\mathbf{q}) = \mathbf{B}\boldsymbol{\tau} + \boldsymbol{\tau}_s(\mathbf{q}) + \mathbf{F}_{\mathrm{ext}} $
Where:
- $\mathbf{M}(\mathbf{q})$ is the generalized mass matrix.
- $\mathbf{C}(\mathbf{q}, \dot{\mathbf{q}})$ is the centrifugal and Coriolis matrix.
- $\mathbf{G}(\mathbf{q})$ is the generalized gravity vector.
- $\mathbf{q}$ are the generalized coordinates, and $\dot{\mathbf{q}}$, $\ddot{\mathbf{q}}$ are their time derivatives.
- $\boldsymbol{\tau}$ are the motor torques (control inputs).
- $\mathbf{B}$ distributes the motor torques to the generalized coordinates.
- $\boldsymbol{\tau}_s(\mathbf{q})$ represents state-dependent spring torques (from the passive joints).
- $\mathbf{F}_{\mathrm{ext}}$ represents generalized external forces, including foot contact wrenches and joint-level friction/perturbations.

The control policy is a deep neural network with parameters $\theta$.

- Action Space: The policy outputs desired motor positions as the robot's action.
- Action Smoothing: The action is first smoothed by a Low Pass Filter (LPF), specifically a Butterworth low-pass filter with a fixed cut-off frequency. This helps prevent jerky movements.
- Control Loop: The policy is queried at the control rate (33 Hz, consistent with the 66-sample, 2-second I/O history described below). The filtered actions are then fed to joint-level PD controllers, running at a much higher rate, that compute the motor torques.

The following figure (Figure 3 from the original paper) illustrates the proposed RL-based controller architecture:
Fig. 3: The proposed RL-based controller architecture leverages a dual history of input (a) and output (o) (I/O) from the robot. The control policy processes a 2-second-long I/O history, which is first encoded by a 1D CNN along its time axis before being merged into a base MLP. In addition, a short history spanning 4 timesteps is fed directly into the base MLP, together with a skill-specific reference motion and variable commands that parameterize the tasks. The policy outputs desired motor positions as the robot's actions, which are smoothed by a low-pass filter (LPF). These filtered outputs are used by joint-level PD controllers, running at a much higher rate, to specify the motor torques. This architecture is general across locomotion skills such as standing, walking, running, and jumping. The figure also annotates Cassie's generalized coordinates, including the actuated joints (marked in red) and passive joints (marked in blue).
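To make the low-level control path concrete (filtered policy output → joint-level PD → motor torques), here is a minimal PD-law sketch; the gains, dimensions, and values are placeholders rather than the paper's settings.

```python
import numpy as np

def pd_torque(q_des, q, qdot, kp, kd, qdot_des=None):
    """Joint-level PD law: tau = Kp * (q_des - q) + Kd * (qdot_des - qdot)."""
    qdot_des = np.zeros_like(qdot) if qdot_des is None else qdot_des
    return kp * (q_des - q) + kd * (qdot_des - qdot)

# Illustrative use inside the fast inner loop: the desired positions come from the
# (low-pass-filtered) policy output, held constant between policy queries.
kp = np.full(10, 80.0)      # placeholder gains for Cassie's 10 actuated joints
kd = np.full(10, 2.0)
q, qdot = np.zeros(10), np.zeros(10)
q_des = 0.1 * np.ones(10)   # filtered policy action (desired motor positions)
tau = pd_torque(q_des, q, qdot, kp, kd)
```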
4.2.2. Dual-History Architecture
The policy's input at each timestep consists of four components:
- Given Command: A time-varying vector that parameterizes the task the robot needs to accomplish (e.g., desired velocity, target position).
- Reference Motion: A preview of the desired trajectory, sampled at future timesteps. This helps the robot anticipate and avoid being short-sighted. It includes desired motor positions and, for some skills, the base height.
- Robot's Short I/O History: A brief history spanning the last 4 timesteps of observations ($\mathbf{o}_{t:t-4}$) and actions ($\mathbf{a}_{t-1:t-4}$). This provides immediate feedback for real-time control.
- Robot's Long I/O History: A longer history spanning two seconds, comprising 66 observation-action pairs. This is crucial for system identification and state estimation, especially during ballistic movements.
Policy Representation Details:
- Base Network: A multilayer perceptron (MLP) with two hidden layers, each having 512 tanh units. It receives the command, reference motion, and short I/O history directly.
- Long-Term History Encoder: A 1D convolutional neural network (CNN) with two hidden layers (filter parameters including [4, 16, 2]), using ReLU activations and no padding. This CNN encodes the 66-timestep long I/O history into a latent representation, which is then fed into the base MLP.
- Output Layer: The base MLP's output layer consists of tanh units that specify the mean of the Gaussian distribution over the normalized action (desired motor positions). The standard deviation is fixed to a constant value.
4.2.3. Multi-Stage Training Framework
A multi-stage training strategy is developed to train versatile locomotion control policies in simulation for zero-shot transfer to hardware. This strategy provides a structured curriculum, as illustrated in the following figure (Figure 4 from the original paper):
Fig. 4: The multi-stage training framework for obtaining a versatile control policy that can be zero-shot transferred to the real world. It starts with a single-task training stage, in which the robot is encouraged to mimic a single reference motion with a fixed goal. This is followed by a task randomization stage, which expands the range of tasks the robot learns and fosters task generalization, resulting in a versatile policy. Once the robot is adept at various locomotion tasks and their transitions, extensive dynamics randomization is incorporated to enhance policy robustness for sim-to-real transfer. This framework is suitable for diverse bipedal locomotion skills, including walking, running, and jumping, and for learning from different sources of skill-specific reference motions such as trajectory optimization, human mocap, and animation.
- Stage 1: Single-Task Training:
  - Objective: To acquire a locomotion skill from scratch (e.g., walking forward, running forward, jumping in place) with a fixed command.
  - Focus: Mastering the basic skill while avoiding undesired strategies.
- Stage 2: Task Randomization:
  - Objective: To develop a versatile policy by encouraging the robot to perform a large variety of tasks within the acquired skill.
  - Mechanism: Introduces diverse commands to expand the range of tasks.
  - Combining the Standing Skill: For periodic skills (walking, running), a sub-stage is added to learn transitions to/from standing.
    - For Walking: After mastering diverse walking tasks, a standing command is issued after random walking intervals. The reference motion changes to standing and the smoothing rewards are increased. This allows learning transitions from walking to standing and back.
    - For Running: Similar to walking, but uses a separate reference motion for the transition from fast running to standing (retargeted human mocap) because this transition is more challenging.
    - For Jumping: Standing is inherently part of the post-landing phase, so no separate sub-stage is needed.
- Stage 3: Dynamics Randomization:
  - Objective: To robustify the policy for successful zero-shot transfer from simulation to hardware.
  - Mechanism: Introduces extensive randomization of dynamics parameters in simulation after the robot is proficient in a simple simulation environment.
4.2.4. Reference Motion
The framework can accommodate diverse sources of reference motion, crucial for shaping the robot's desired behaviors.
- Trajectory Optimization: For the walking skill, a library of 1331 diverse periodic walking gaits is generated based on the robot's full-order dynamics [8, 26]. These gaits span ranges of desired sagittal velocity, lateral velocity, and walking height. A reference motion is a set of Bézier trajectories for each actuated motor over a fixed walking period.
- Motion Capture: For the running skill, motion capture data from a human actor [100] is retargeted to Cassie's morphology using inverse kinematics. A single reference motion for periodic running and one for the running-to-standing transition are used.
- Animation: For the jumping skill, a hand-crafted animation from a 3D animation suite is used, providing a single jumping-in-place motion. Crucially, no trajectory optimization is performed to make this kinematically feasible motion dynamically feasible for the robot; RL is relied upon to learn dynamic feasibility.
4.2.5. Reward Function
The reward function incentivizes the robot to perform desired locomotion skills and complete tasks. It is a weighted summation of several reward components:
$ r_t = \sum_{i} w_i \, r_{i,t} $
Each individual reward component follows the format:
$ r_{i,t} = \exp\left( -\alpha_i \, \| \mathbf{x}_i - \hat{\mathbf{x}}_i \| \right) $
Where:
- $\mathbf{x}_i$ and $\hat{\mathbf{x}}_i$ are two vectors (e.g., the actual state vs. the desired state).
- $\| \mathbf{x}_i - \hat{\mathbf{x}}_i \|$ is the Euclidean distance between the vectors; minimizing this distance maximizes the reward.
- $\alpha_i$ is a scaling factor introduced for each term to normalize units, ensuring the output range is (0, 1].
- $w_i$ is the weight of each component (collectively, the weight vector).

The reward components are grouped into three key terms:
- Motion Tracking: Incentivizes following the provided reference motion.
  - Motor position reward.
  - Global pelvis height reward.
  - Global foot height reward: accounts for terrain height variations or target elevated heights (privileged information available in simulation).
- Task Completion: Ensures the robot accomplishes the assigned tasks.
  - Pelvis velocity tracking (desired linear and angular velocities).
  - Global pose tracking (position and orientation; a cosine term handles angular periodicity).
  - For jumping, desired landing targets are specified, and average-velocity terms shape the otherwise sparse position reward over the jumping timespan.
- Smoothing: Discourages jerky behaviors.
  - Foot impact: reduce impact forces.
  - Torque: reduce energy consumption.
  - Motor velocity: encourage smooth motions.
  - Joint acceleration: damp out accelerations.
  - Change of action: regulate action changes.
Reward Weights:
- Across Stages: Motion tracking dominates in Stage 1. Task completion takes precedence in Stage 2 (task randomization). Smoothing weights start low and are gradually increased in later stages.
- Across Skills: Weights are largely consistent, but foot height tracking and task completion receive higher weights for skills with a flight phase (running, jumping). The change-of-action weight is increased for the aggressive movements in running and jumping.
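A minimal sketch of how such a weighted exponential-tracking reward could be computed; the component choices, scaling factors, and weights shown are illustrative, not the paper's values.

```python
import numpy as np

def reward_term(x, x_hat, alpha):
    """Generic tracking term: exp(-alpha * ||x - x_hat||), which lies in (0, 1]."""
    return np.exp(-alpha * np.linalg.norm(np.asarray(x) - np.asarray(x_hat)))

def total_reward(terms, weights):
    """Weighted sum of individual reward components."""
    return sum(w * r for w, r in zip(weights, terms))

# Illustrative use with made-up quantities and scales:
q_motor, q_ref = np.zeros(10), 0.05 * np.ones(10)              # motor positions vs. reference
v_pelvis, v_cmd = np.array([0.9, 0.0]), np.array([1.0, 0.0])    # pelvis vs. commanded velocity
terms = [reward_term(q_motor, q_ref, alpha=5.0),                # motion tracking
         reward_term(v_pelvis, v_cmd, alpha=2.0)]               # task completion
r_t = total_reward(terms, weights=[0.4, 0.6])
```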
4.2.6. Episode Design
- Unified Approach: Consistent across all skills and stages.
- Episode Duration: 2500 timesteps (76 seconds).
- Variable Tasks (Stage 2): Commands randomized after 1-15 second intervals.
- Aperiodic Tasks (Stage 1): Episode length adjusted to cover full trajectory (e.g., for jumping) plus a significant extension for learning to maintain the final standing pose.
- Early Termination Conditions: In addition to standard conditions (e.g., robot falling), specific conditions are added to encourage desired behaviors:
  - Foot height tracking tolerance: Terminates the episode if the foot-height tracking error exceeds a tolerance. This is especially effective for flight phases. The tolerance starts tight and is relaxed later in training.
  - Task completion tolerance: Terminates the episode if the task error exceeds a tolerance, which is gradually reduced over training.
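A small sketch of an early-termination check of this kind; the threshold values and the fall condition are placeholder assumptions, and in the paper the tolerances are adjusted over the curriculum rather than fixed.

```python
def should_terminate(foot_height_err, task_err, pelvis_height,
                     foot_tol=0.05, task_tol=0.3, min_pelvis_height=0.55):
    """Illustrative early-termination check with placeholder thresholds."""
    if pelvis_height < min_pelvis_height:   # standard "robot has fallen" condition
        return True
    if foot_height_err > foot_tol:          # foot-height tracking tolerance (flight phases)
        return True
    if task_err > task_tol:                 # task-completion tolerance
        return True
    return False
```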
4.2.7. Dynamics Randomization
Applied in Stage 3 to ensure sim-to-real transfer. Parameters are sampled from uniform distributions at each episode.
- Modeling Uncertainty:
  - Ground friction coefficient: [0.3, 3.0]
  - Joint damping ratio
  - Spring stiffness (crucial for Cassie's passive joints)
  - Link mass
  - Link inertia
  - Pelvis (root) CoM position
  - Other link CoM positions
  - Motor PD gains (randomized independently per joint)
- Measurement Uncertainty:
  - Motor position noise mean
  - Motor velocity noise mean
  - Gyro rotation noise
  - Linear velocity estimation error
  - Communication delay (applied as a zero-order hold)
- Optional Randomization:
  - Randomized Perturbation: External wrenches (forces and torques) applied to the robot's pelvis at randomized time intervals (different intervals for walking and running). Excluded for jumping and for transitions to standing, as it hindered learning.
  - Randomized Terrain: Various types (waved, slopes, stairs, steps) generated from parameterized height maps. Used only after proficiency under the other dynamics randomization (e.g., for running).
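A minimal sketch of per-episode dynamics-parameter sampling; only the ground-friction range comes from the text above, and the remaining parameter names and ranges are illustrative placeholders.

```python
import numpy as np

def sample_dynamics(rng=np.random.default_rng()):
    """Sample one set of randomized dynamics parameters at the start of an episode.
    Only the friction range is taken from the text; the other ranges are placeholders."""
    return {
        "ground_friction": rng.uniform(0.3, 3.0),
        "link_mass_scale": rng.uniform(0.8, 1.2),        # placeholder multiplicative range
        "joint_damping_scale": rng.uniform(0.5, 2.5),    # placeholder
        "pd_gain_scale": rng.uniform(0.9, 1.1),          # placeholder, per joint in practice
        "comm_delay_s": rng.uniform(0.0, 0.025),         # placeholder zero-order-hold delay
    }

# A new simulation instance would be configured with these values each episode:
params = sample_dynamics()
```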
4.2.8. Training Details
- Simulator: MuJoCo, based on [101, 102].
- RL Algorithm: Proximal Policy Optimization (PPO) [103].
- Policy: The actor (control policy) described in Section 4.2.2.
- Value Function: A 2-layer MLP with access to ground-truth observations.
- Hyperparameters: Differ across stages and skills; detailed in Appendix D (Tables VII, VIII).
5. Experimental Setup
5.1. Datasets
The paper does not use traditional "datasets" in the sense of a fixed collection of samples. Instead, it generates data through continuous interaction with a simulated environment (powered by MuJoCo) during the reinforcement learning training process. The "dataset" for learning is thus the stream of observations, actions, and rewards experienced by the robot across millions of simulation steps, especially under dynamics randomization and task randomization.
The training environment is the simulated Cassie robot, which is a torque-controlled bipedal robot.
- Source: The Cassie robot model and simulation environment are based on previous work [101, 102].
- Characteristics: The simulation includes detailed physics models of the robot's body, joints, and interactions with the ground. It also incorporates simulated sensor noise, communication delays, and dynamics parameter variations as part of the dynamics randomization process.
- Domain: Robot locomotion control, specifically for human-sized bipedal robots.
- Effectiveness: The simulated environments are designed to be as close to the real world as possible, while also being sufficiently diverse (via randomization) to enable zero-shot transfer to the physical Cassie robot.
5.2. Evaluation Metrics
The paper evaluates the performance of the control policies using both quantitative metrics (primarily Mean Absolute Error) and qualitative observations from real-world experiments.
5.2.1. Mean Absolute Error (MAE)
- Conceptual Definition: Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. It quantifies the average absolute difference between actual values and desired (or reference) values. A lower MAE indicates better accuracy or tracking performance.
- Mathematical Formula: $ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $
- Symbol Explanation:
  - $N$: The total number of data points or observations.
  - $y_i$: The $i$-th actual or observed value (e.g., actual velocity, actual joint position).
  - $\hat{y}_i$: The $i$-th desired or reference value (e.g., commanded velocity, reference joint position).
  - $|\cdot|$: The absolute value.
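A tiny reference implementation of the MAE metric as defined above; the example numbers are arbitrary.

```python
import numpy as np

def mean_absolute_error(y, y_hat):
    """MAE between actual values y and desired/reference values y_hat."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(np.abs(y - y_hat))

# e.g., tracking error between estimated and commanded sagittal velocity
mae = mean_absolute_error([0.02, -0.05, 0.01], [0.0, 0.0, 0.0])
```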
5.2.2. Qualitative Observations
Beyond MAE, the paper relies heavily on qualitative assessments from real-world deployment, including:
- Drift: Observing if the robot maintains its position or path when commanded to stay in place.
- Stability: The robot's ability to maintain balance and prevent falls under various conditions.
- Compliance: How the robot reacts to external disturbances or unexpected terrain changes, e.g., by adjusting its gait rather than falling.
- Recovery Maneuvers: The robot's ability to regain stability after a perturbation, potentially by performing complex sequences of actions.
- Flight Phase: The presence and characteristics of periods where both feet are off the ground, indicative of dynamic running or jumping.
- Completion Time: For specific tasks like the 100-meter or 400-meter dash.
- Accuracy of Landing: For jumping tasks, how precisely the robot lands on a target.
5.3. Baselines
The paper conducts an extensive ablation study and comparisons against several baselines to evaluate its proposed policy architecture and training strategy. These baselines represent common design choices in RL-based locomotion control:
- Ours (Proposed Method):
  - Architecture: Dual-history architecture (short-term I/O fed directly to the base MLP, long-term I/O encoded by a 1D CNN before the MLP).
  - Action: Directly specifies desired motor positions.
  - Training: End-to-end training.
- Residual:
  - Architecture: Same as Ours, but the policy outputs a residual term that is added to a reference motor position.
  - Purpose: Tests the impact of residual learning, a common approach in prior work [11, 12].
- State Feedback Only:
  - Architecture: Same model structure and action space as Ours.
  - Observation: Relies solely on historical states (the robot's output history), omitting the robot's input history.
  - Purpose: Evaluates the importance of including the robot's input history in the I/O observations. This choice is common in prior work [12, 14, 22].
- Long History Only:
  - Architecture: Relies only on the long-term I/O history encoded by the CNN. The base MLP receives the latest observation (immediate state feedback) but no explicit short history.
  - Purpose: Serves as a baseline to demonstrate the added benefit of explicitly providing short history alongside long history [5, 74].
- Short History Only:
  - Architecture: Relies solely on the short-term I/O history, excluding the long-term I/O history CNN encoder.
  - Purpose: Represents an approach common in quadruped control and some bipedal work [13, 65, 70].
- RMA/Teacher-Student (Policy Distillation):
  - Mechanism: Two-phase training:
    - Expert (Teacher) Policy: Trained by RL with access to privileged environment information (encoded into an 8D extrinsics vector by an MLP encoder). This policy can only operate in simulation.
    - RMA (Student) Policy: Copies the base MLP from the expert policy and learns to use the long I/O history encoder to estimate the teacher's extrinsics vector.
  - Modification: In this study, all expert, RMA, and A-RMA policies are modified to include short I/O histories in their base MLP for a fairer comparison with Ours.
  - Purpose: Compares end-to-end training with policy distillation methods [71, 74].
- A-RMA (Adaptive Rapid Motor Adaptation):
  - Mechanism: An additional training phase after RMA. The long I/O history encoder's parameters remain fixed, while the base MLP is further updated through RL.
  - Purpose: Explores improvements over standard RMA by fine-tuning the control part of the policy [67].

The experimental design for these baselines involved training 3 policies per method for each locomotion skill (walking, running, jumping), using identical multi-stage training frameworks and hyperparameters but different random seeds. This resulted in distinct control policies for comprehensive evaluation.
5.4. Hyperparameters
The paper details the command ranges for different skills and the hyperparameters used in PPO training.
The following are the results from Table VI of the original paper:
| Task Parameters | Range |
|---|---|
| Walking | |
| Sagittal Velocity | [-1.5, 1.5] m/s |
| Lateral Velocity | [-0.6, 0.6] m/s |
| Turning Velocity | [-45, 45] deg/s |
| Walking Height | [0.65, 1.0] m |
| Running | |
| Sagittal Velocity | [2.0, 5.0] m/s |
| Lateral Velocity | [-0.75, 0.75] m/s |
| Turning Velocity | [-30, 30] deg/s |
| Jumping | |
| Sagittal Landing Location | [-0.5, 1.5] m |
| Lateral Landing Location | [-1.0, 1.0] m |
| Turning Direction at Landing | [-100, 100] deg |
| Elevation Change | [-0.5, 0.5] m |
The following are the results from Table VII of the original paper:
| PPO Training Iterations | Walking | Running | Jumping |
|---|---|---|---|
| Single-Task | 6000 | 6000 | 6000 |
| Task Randomization | 8000 | 18000 | 12000 |
| Combining Standing | 2000 | 5000 | N/A (inherent) |
| Dynamics Randomization | 8000 | 15000 | 20000 |
| Added Perturbation Training | 5000 | 5000 | N/A (not used) |
The following are the results from Table VIII of the original paper:
| Hyperparameter | Value |
|---|---|
| PPO iteration batch size | 65536 |
| PPO clip rate | 0.2 |
| Optimization step size (both actor and critic) | 1e-4 |
| Optimization batch size | 8192 |
| Optimization epochs | 2 |
| Discount factor ($\gamma$) | 0.98 |
| GAE smoothing factor ($\lambda$) | 0.95 |
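For convenience, the Table VIII values can be collected into a single configuration object; how such values are actually passed to the authors' training code is an assumption.

```python
# A minimal dictionary view of the PPO hyperparameters in Table VIII (values from the table;
# the key names and how they are consumed by a training loop are illustrative).
ppo_config = {
    "iteration_batch_size": 65536,
    "clip_rate": 0.2,
    "step_size": 1e-4,          # actor and critic
    "minibatch_size": 8192,
    "optimization_epochs": 2,
    "discount_gamma": 0.98,
    "gae_lambda": 0.95,
}
```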
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a comprehensive analysis of the proposed RL-based controller's performance, focusing on its policy architecture, adaptivity, robustness, and real-world deployment. The results consistently validate the effectiveness of the proposed method across various dynamic bipedal locomotion skills.
6.1.1. Advantages of Policy Architecture
The analysis of learning performance in simulation (Stage 3, with randomized tasks and dynamics) and sim-to-real transfer for in-place walking demonstrates the superiority of the dual-history architecture and end-to-end training.
The following figure (Figure 6 from the original paper) displays the learning performance for training walking, running, and jumping policies with randomized tasks and dynamics parameters.
Fig. 6: Learning performance for training walking, running, and jumping policies with randomized tasks and dynamics parameters, using our method and various baselines. The y-axis shows the normalized return and the x-axis shows the number of samples. The shaded region represents the standard deviation across policies trained from three different random seeds, indicating the consistency of our method.
Key Observations from Learning Performance (Fig. 6):

- Choices of Action (Residual vs. Direct): Policies using a residual term (purple curves) consistently show deteriorated learning performance across all skills. This suggests that directly specifying desired motor positions (as in Ours) is more effective, as residual terms can introduce undesired movements and hinder the policy's ability to explore beyond the reference motions.
- Choices of Observation (I/O History vs. State Feedback Only): Omitting the robot's input (action) history (pink curves) leads to a decline in learning performance. This highlights the crucial role of utilizing both input and output history, which lets the control policy perform system identification and state estimation and enhances adaptivity to uncertain dynamics.
- History Length (Long vs. Short vs. Dual-History):
  - Long History Only (blue curves) generally performs worse than Short History Only (orange curves) or other methods.
  - The proposed dual-history approach (Ours, red curves), which provides direct access to the short I/O history in the base MLP while also using a long-history encoder, significantly improves learning performance. This indicates that short history provides critical real-time feedback that complements the contextual information from the long history.
- Comparison with Policy Distillation (RMA/Teacher-Student):
  - RMA (student) policies (green curves) show significant degradation compared to the expert policy (black curves) and Ours. This is attributed to unavoidable errors in estimating pre-selected environment parameters from the long history. RMA even fails to learn challenging skills like running.
  - A-RMA (dark green curves), which fine-tunes the base MLP after RMA, improves performance but still falls short of Ours, despite using more training samples. In cases where the encoder struggles (running), A-RMA effectively behaves like Short History Only.
  - Ours achieves performance similar to the expert policy (the theoretical upper bound for the student) but, unlike the expert, is deployable in the real world.

The following figure (Figure 7 from the original paper) compares the in-place walking performance of different policy architectures on real hardware.
Fig. 7: Real-world in-place walking experiments with different policy architectures. The top row shows snapshots of Cassie in the middle of walking. The bottom row compares the estimated sagittal velocity (black line), lateral velocity (red line), and pelvis yaw angle (blue line) over time. The policies are trained from the same random seed (corresponding to one of the seeds reported in Fig. 8) and deployed without any tuning. Our proposed method shows minimal drift in the sagittal and lateral directions, as well as in yaw angle, compared to other methods.

Case Study: In-place Walking Experiments (Real World):

- The proposed method (Ours) demonstrates notably lower tracking errors and successfully maintains in-place walking with minimal drift.
- Other methods (Long History Only, Short History Only, State Feedback Only) result in substantial drift (e.g., to the robot's left).
- RMA shows the most obvious sagittal shift and walks forward at high speed despite a zero-velocity command. A-RMA reduces this but still experiences considerable lateral movement.
- The Residual policy fails to maintain a stable gait.

The following figure (Figure 8 from the original paper) quantitatively compares the speed and orientation tracking errors for in-place walking across simulation and real-world deployments, for policies trained with different random seeds.
Fig. 8: Quantitative comparisons of the speed tracking error and orientation tracking error (MAE) for in-place walking, using policies trained from different random seeds with the same method. The figure shows results in (a) simulation and (b) real-world tests.

- In simulation (Fig. 8a), most policies show good tracking performance.
- In the real world (Fig. 8b), Ours exhibits minimal degradation and consistently better control performance, both in command tracking and in stabilizing the floating base. Other methods show significant speed drift and rotational tracking errors (e.g., Long History Only performs worst at stabilizing the pelvis rotation despite minimal oscillation in simulation).
Summary of Policy Architecture Advantages:

- Direct Motor Commands: Policies should directly specify motor-level commands rather than use a residual term.
- Full I/O History: Utilize a history of both the robot's input and output, not solely state feedback.
- Dual-History Approach: Combine long-term history (for system identification) with short-term history (for real-time feedback) in the base policy.
- End-to-End Training: Train the base policy and history encoder end-to-end for better performance and reduced complexity compared to policy distillation.
6.1.2. Source of Adaptivity
The paper investigates the latent representation from the long I/O history encoder to understand how the proposed method adapts to varying environment parameters.
The following figure (Figure 9 from the original paper) shows latent representations for periodic running and their changes under different dynamics.
Fig. 9: (a) (Top) Recorded latent representation after long-term I/O history encoder during running. (Bottom) Comparison of two selected dimensions (marked as red lines in the top plot) with recorded impact forces on each of the robot's feet. (b) The blue plot shows the robot's latent representation with default dynamics parameters during running. The red plots indicate changes in the same region under different dynamics. (c)(Top) Recorded latent representation after long-term I/O history encoder during jumping. (Bottom) Comparison of two selected dimensions (marked as red lines in the top plot) with recorded total impact forces on both of the robot's two feet. (d) The blue plot shows the robot's latent representation with default dynamics parameters during jumping. The red plots indicate changes in the same region under different dynamics.
Time-Varying Embedding (Adaptivity to Time-Varying Events):

- Periodic Running (Fig. 9a):
  - The latent embedding exhibits a periodic pattern once the gait stabilizes.
  - It captures time-varying disturbances, showing variations when a persistent backward perturbation force is applied.
  - Two latent dimensions show a strong correlation with the foot impact forces, effectively performing contact estimation. These signals accurately track take-off and landing events (they are zero when the foot is in swing phase).
  - Intriguingly, these latent values shift to a lower envelope during the perturbation, even though the ground impact force is unchanged, suggesting implicit learning of external forces and generalized dynamics without explicit engineering.
- Aperiodic Jumping (Fig. 9c):
  - The latent representation distinguishes between jumping phases (more varying signals) and standing phases (less varying signals).
  - Different jumping tasks result in distinct latent values during the jumping phases.
  - Two latent dimensions correlate with contact events: one with the take-off event (it increases and drops before the contact force reaches zero) and one with the landing event (it becomes active upon landing). This suggests the robot learns separate take-off and landing cues, which are more informative for control than a single binary contact variable.
Adaptive Embedding for Changes in Dynamics (Adaptivity to Time-Invariant Dynamics Shifts):

- Running (Fig. 9b) and Jumping (Fig. 9d):
  - Varying dynamics parameters (e.g., link CoM position, link mass, joint damping ratio, PD gains, ground friction) results in significant shifts of the latent embedding compared to the default model.
  - Despite these changes in the latent representation, control performance metrics (e.g., task completion error, motion tracking error) show minimal change, highlighting the controller's adaptivity to time-invariant dynamics shifts.
  - Latency (communication delay) causes noticeable changes in the latent embedding, but increased measurement noise (2x the training bounds) has little effect, suggesting the CNN encoder effectively filters out zero-mean noise.

Summary of Adaptivity:
The history encoder enables the proposed controller to adapt by capturing meaningful information from the I/O history about:
- Time-varying events (external perturbations, contact events).
- Time-invariant changes in dynamics parameters.
- Measurement noise, which it filters out.
This capability explains the strong performance in challenging training settings with extensive dynamics randomization.
6.1.3. Advantages of Versatile Policies and Source of Robustness
The paper identifies task randomization as a key source of robustness, distinct from dynamics randomization. A single versatile policy capable of diverse tasks significantly improves robustness.
The following figure (Figure 10 from the original paper) demonstrates robustness tests in simulation for walking, running, and jumping under out-of-distribution uncertainties.
Fig. 10: Robustness tests in simulation for walking, running, and jumping policies under out-of-distribution uncertainties that exceed training bounds. (a, b) Walking under consistent lateral force and CoM offset. (c, d) Running under consistent forward force and CoM offset. (e, f) Jumping under lateral force and CoM offset. Policies are: (i) Single Task (trained with dynamics randomization), (ii) Single-Task w/ Perturbation (trained with dynamics randomization + perturbations), (iii) Versatile (trained with task randomization + dynamics randomization). Versatile policies show compliant behaviors and task generalization to handle disturbances, especially in running and jumping (Fig. 10c, 10e).
Robustness Tests in Simulation (Fig. 10):

- Walking (Fig. 10a, 10b):
  - Under a consistent lateral pulling force: Single-Task (i) fails. Single-Task w/ Perturbation (ii) progresses with minor deviation. Versatile (iii, Ours), without perturbation training, stabilizes by using a learned side-walking skill to compensate, showing a compliant gait.
  - Under a CoM offset: Single-Task (i) fails. Single-Task w/ Perturbation (ii) uses learned stabilization to counter the backward force, walking forward at reduced speed. Versatile (iii) leverages backward walking gaits to offset the CoM shift.
- Running (Fig. 10c, 10d):
  - Under a constant forward perturbation: Single-Task (i) and Single-Task w/ Perturbation (ii) fail to maintain stable gaits. Versatile (iii) adapts by using faster running skills to overcome the perturbation.
  - Under a CoM offset: Similar results; the Versatile policy successfully handles the offset.
- Jumping (Fig. 10e, 10f):
  - Under a lateral perturbation (Fig. 10e) and a forward CoM offset (Fig. 10f), Versatile policies (without perturbation training) successfully adapt by using lateral or forward jump skills, respectively.
Conclusion on Robustness Sources:

- Dynamics Randomization (+ Perturbation Training): Allows policies to function within an expanded range of scenarios, but limits them to the trained task.
- Task Randomization: Enables policies to generalize learned tasks for greater robustness and compliance, finding better maneuvers even without extensive dynamics randomization.

The following figure (Figure 11 from the original paper) shows robust standing recovery with the versatile walking policy.
Fig. 11: Robust standing recoverywithversatile walking policy. (a)Single-task standing policyfails if pushed beyondsupport region. (b)Single-task standing policy trained with perturbationsstill fails after being pushed too far. (c)Versatile walking policy(trained withtask randomization) recovers by transitioning towalking gaitswhen pushed beyond itssupport regionwhile standing. This is an illustration oftask generalizationforrobustnessin the real world.Case Study: Robust Standing Experiments (Real World):
- When single-task standing policies (trained with or without perturbations) are pushed beyond their support region, they lose balance (Fig. 11a, 11b).
- A versatile walking policy (also trained with the standing skill) demonstrates intelligent recovery maneuvers (Fig. 11c). When pushed, it transitions to a walking gait, executes several steps (forward/backward), and then smoothly reverts to a standing pose. This occurs autonomously without human commands or explicit perturbation training, showcasing task generalization.

The following figure (Figure 12 from the original paper) illustrates other robust recovery maneuvers from lateral push, collision, and unstable landing.
Fig. 12: More robust recovery maneuvers by versatile policies in the real world. (a) When laterally perturbed while standing, the versatile walking policy utilizes varied walking skills to recover and return to a stand. (b) The versatile running policy shows robustness to a collision with a track guard, using side-stepping skills to disengage and maintain stability. (c) The versatile jumping policy executes a corrective hop after an unstable landing from a complex multi-axis jump to achieve a more stable configuration.

- The versatile walking policy recovers from lateral perturbations by using varied walking skills to lower its center of mass and return to a stand (Fig. 12a).
- The versatile running policy recovers from a collision with a track guard by using side-stepping skills and maintaining a stable stance (Fig. 12b).
- The versatile jumping policy performs a corrective hop after an unstable landing to achieve a more stable configuration (Fig. 12c).

These complex, long-horizon recovery maneuvers are emergent properties of versatile policies, not explicitly trained.
Understanding Robustness from Training Distributions:
The findings are conceptually illustrated by considering training distributions, analogous to invariant sets in control theory.
The following figure (Figure 13 from the original paper) conceptually illustrates training distributions.
Fig. 13: An illustration of the concept of training distributions using different methods to enhance robustness. During deployment, as conceptually illustrated by the red curve, we want the robot controlled by its RL-based policy to operate inside the training distribution of the robot's trajectories. When the training is focused on a single task, the training distribution is confined to nominal trajectories specific to that task, drawn as the yellow region. Incorporating extensive dynamics randomization, including simulated perturbations or varying terrains, can expand this distribution. However, this expansion is still centered around the fixed task. Task randomization significantly broadens the training distribution (to the orange region) by enabling the robot to learn and generalize various control strategies across different tasks (marked as different faded yellow regions). It is important to note that task randomization can be combined with dynamics randomization, further widening the training distribution and enhancing the policy's robustness.
- Single-task policies have a limited training distribution.
- Dynamics randomization expands this distribution but keeps it centered around the fixed task.
- Task randomization significantly broadens the training distribution by allowing the robot to learn and generalize various control strategies across different tasks. This enables the robot to remain within its enhanced training distribution even when faced with disturbances, using learned tasks for recovery.
- Combining task randomization with dynamics randomization further expands the training distribution and enhances robustness. Task randomization is an "orthogonal" way to improve robustness beyond pushing the limits of dynamics randomization (which can hinder learning if too extreme).
Summary of Robustness:
Versatile policies (trained with task randomization) show significant improvements in robustness compared to task-specific policies. This stems from their ability to generalize learned tasks and find better maneuvers to tackle unforeseen situations, leading to better stability and compliance to disturbances.
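To make the idea of task randomization concrete, the following is a minimal sketch of per-episode task sampling; the command fields and ranges are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def sample_walking_task(rng, randomize_task=True):
    """Resample the commanded task at the start of each training episode."""
    if not randomize_task:
        # Single-task training: a fixed command, so the training distribution
        # stays centered on one nominal trajectory.
        return dict(vx=0.5, vy=0.0, yaw_rate=0.0, height=0.9)
    # Task randomization: commands are drawn from wide ranges, so the policy
    # must learn many gaits it can later reuse as recovery maneuvers.
    return dict(
        vx=rng.uniform(-1.0, 1.0),        # sagittal velocity command (m/s)
        vy=rng.uniform(-0.5, 0.5),        # lateral velocity command (m/s)
        yaw_rate=rng.uniform(-0.5, 0.5),  # turning rate command (rad/s)
        height=rng.uniform(0.65, 1.0),    # walking height command (m)
    )

rng = np.random.default_rng(0)
task = sample_walking_task(rng)  # drawn anew each episode during training
```

Task randomization in this sense can be layered on top of dynamics randomization, matching the widened training distribution illustrated in Fig. 13.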
6.1.4. Dynamic Bipedal Locomotion in the Real World
Extensive real-world experiments on Cassie validate the adaptivity and robustness of the developed skill-specific versatile policies.
6.1.4.1. Walking Experiments
The following figure (Figure 14 from the original paper) illustrates the walking policy's performance in tracking variable commands and consistency over time.
Fig. 14: The walking policy tracking variable commands in the real world without any tuning. (a) Tracking variable sagittal velocity, lateral velocity, and walking height. (b, c) Consistency of tracking performance over a long timespan (492 and 325 days after initial testing). The tracking errors (MAE) are reported for each test, showing minimal degradation over time.
Tracking Performance:
- Variable Commands (Fig. 14a): The policy efficiently tracks varying, fast-changing commands for sagittal velocity, lateral velocity, and walking height, with the MAE reported for each dimension.
- The following figure (Figure 15 from the original paper) shows the robot tracking turning yaw commands.
Fig. 15: A snapshot from the real world demonstrating the robot reliably tracking various turning yaw commands using the same controller. The robot can execute full turns in both counterclockwise and clockwise directions.
- The policy reliably tracks varying turning commands (clockwise/counterclockwise) (Fig. 15).
- Consistency over a Long Timespan (Fig. 14b, 14c): The RL-based walking policy adapts to changing robot dynamics (due to wear and tear) and consistently performs well over extended periods (492 and 325 days after initial testing) with minimal tracking-error degradation. This highlights the adaptivity without hardware-specific tuning.
- Fast Walking (Fig. 16):
  - The robot can transition from standstill to fast forward walking, tracking the commanded speed, and quickly return to standing (Fig. 16a, 16c).
  - The following figure (Figure 16 from the original paper) shows fast forward and backward walking.
Fig. 16: Fast forward walking (a) and fast backward walking (b) demonstrations of Cassie in the real world without any tuning. The robot can transition from a stationary stance to a rapid gait and return to standing with a single command, even during dynamic maneuvers like fast walking. The recorded sagittal velocity for fast forward walking is shown in (c), with the average and peak speeds while tracking the command.
  - It also performs fast backward walking while tracking the commanded speed (Fig. 16b).
-
Robust Walking Maneuvers:
- Uneven Terrains (Untrained) (Fig. 17): The policy shows considerable robustness to varying elevation changes (backward walking on small stairs or declined slopes) despite no specific training on this terrain and lacking terrain elevation sensors. This is attributed to robustness to changes in contact timing or contact wrench.
  - The following figure (Figure 17 from the original paper) shows walking on uneven terrains.
Fig. 17: Robust walking on uneven terrains without any tuning. The robot can walk backward on stairs (a) and declined slopes (b). The sagittal velocity and pelvis height are consistently tracked. The robot lacks terrain elevation sensors, adapting through its I/O history and robustness to contact changes.
- Robustness to Random Perturbations:
  - Impulse Perturbation (Fig. 18a): A substantial lateral perturbation force causes a lateral velocity peak. The robot swiftly recovers by moving in the opposite lateral direction, compensating for the perturbation and restoring stable in-place walking.
  - The following figure (Figure 18 from the original paper) shows recovery from a lateral impulse and a comparison with model-based control.
Fig. 18: Robust walking under lateral impulse perturbation (a) by the RL-based walking policy versus a model-based controller (b) in the real world. In (a), the robot, despite being pushed laterally and accelerated, maintains a stable walking gait and compensates for the lateral impulse by walking in the opposite direction. The corresponding lateral velocity is recorded in the lower part of the figure. Additionally, the planar position (q_y, q_x) is estimated, with points appearing in progressively darker colors as they are recorded later in time. The model-based controller (b) fails to maintain control when subjected to a similar lateral perturbation (recorded in Vid. 3).
  - Persistent Perturbation (Fig. 19): Under a persistent lateral dragging force or random sagittal forces, the robot shows compliance, following the force directions while commanded to walk in place, without losing balance. This demonstrates potential for safe human-robot interaction.
  - The following figure (Figure 19 from the original paper) shows compliance to persistent perturbations.
Fig. 19: Robust walking under persistent and random external perturbation (a) with a lateral dragging force and (b) with a sagittal force at a lower height. The robot shows compliance to these forces, maintaining balance and returning to the commanded task after the force is removed. This demonstrates the advantages of the RL-based policy for safe human-robot interaction.
- Comparison with a Model-based Controller (Fig. 18b): The model-based controller fails to maintain control under the lateral perturbation, resulting in a crash, due to its inability to deal with large modeling errors from unmodeled external forces. It also fails under persistent perturbation and shows obvious lateral drifts without manual tuning.
6.1.4.2. Running Experiments
The versatile running policies achieve impressive real-world feats.
- Running a 400-meter Dash (Fig. 20):
  - The robot successfully completes a 400-meter dash in 2 minutes and 34 seconds.
  - It smoothly transitions from standing to running, accelerates to its average estimated cruising speed, and maintains the desired speed while responding to varying turning commands (MAE of 5.95 degrees).
  - Substantial flight phases are evident, distinguishing it from fast walking. Lateral running skills learned during training allow correction of lateral drifts.
  - The following figure (Figure 20 from the original paper) shows the 400-meter dash demonstration and recorded data.
Fig. 20: Cassie completing a 400-meter dash on a standard outdoor running track. (a) Snapshots showing the robot transitioning from standing to running, maintaining a dynamic gait with flight phases, and navigating turns. (b) Recorded data of sagittal velocity, lateral velocity, and turning yaw angle. The robot's peak and average speeds are reported while it tracks the commands.
- Tracking Varying Commands while Running (Fig. 21):
  - The same versatile policy reliably tracks varying sagittal velocity (Fig. 21a) and lateral velocity (Fig. 21b).
  - Command changes in one dimension do not affect control performance in the others, indicating decoupled control during fast running.
  - The robot performs a 90-degree sharp turn in 2 seconds (5 steps) with a natural running gait (Fig. 21c), despite not being specifically trained for sharp turns (only smooth turning rates). This demonstrates task generalization.
  - The following figure (Figure 21 from the original paper) illustrates tracking variable sagittal velocity, lateral velocity, and sharp turning.
Fig. 21: Versatile running policy tracking variable commands: (a) sagittal velocity, (b) lateral velocity, and (c) performing a 90-degree sharp turn in the real world. A command change in one dimension does not affect control performance in the others, showing decoupled control. The sharp-turn scenario was not specifically trained, demonstrating task generalization.
- Running a 100-meter Dash (Fig. 22, Table V):
  - Achieves a fastest time of 27.06 seconds (Table V).
  - Transitions from a stationary stand to a fast running gait within 1.8 seconds (Fig. 22a) with aggressive maneuvers.
  - Reaches its peak estimated speed during the cruising phase, showing notable flight phases.
  - The following figure (Figure 22 from the original paper) shows the 100-meter dash demonstration.
Fig. 22: Cassie completing a 100-meter dash in the real world. (a) Transition from a stationary stance to a rapid running gait within 1.8 seconds. (b) Cruising phase, maintaining a fast running gait at the peak estimated speed. (c) Recorded sagittal velocity during the dash, showing estimated and desired speeds. The robot completed the dash in 27.06 seconds.
The following are the results from Table V of the original paper:

| Trial | Completion Time (s) |
| --- | --- |
| 1 | 27.06 |
| 2 | 27.99 |
| 3 | 28.28 |
- Running on Uneven Terrains (Trained) (Fig. 23):
  - Successfully traverses terrains with different slopes (inclined and lateral sections) without explicit terrain height estimation or external sensors.
  - Maintains a stable running gait with flight phases even on challenging terrains, avoiding degradation to walking. This is the first demonstration of running (with flight phases) over large uneven terrains by a human-sized bipedal robot.
  - The following figure (Figure 23 from the original paper) shows running on uneven terrains.
Fig. 23: Cassie running on different types of terrains with different slopes. (a) Snapshots of the robot traversing an incline, a lateral slope, and another incline while maintaining flight phases. (b) Phase plots of the left/right thigh during terrain traversal, showing consistent gait adaptation.
- Robust Running Maneuvers (Fig. 24):
  - Recovers from an abrupt impulse perturbation (a safety cord causing a speed drop, leaning, and twisting) by maintaining stability and quickly returning to a stable running gait, owing to training on simulated perturbations and diverse running tasks.
  - The following figure (Figure 24 from the original paper) illustrates robust running maneuvers.
Fig. 24: Robust running maneuvers in the real world. (a) Recovery from an abrupt impulse perturbation while running at high speed. (b) Compensation for a lateral perturbation by exerting a lateral running gait. These demonstrate the robustness of versatile running policies against unexpected disturbances.
  - Compensates for lateral perturbations by exerting a lateral running gait.
6.1.4.3. Jumping Experiments
The versatile jumping policies achieve a large variety of different bipedal jumps.
- Jump and Turn (Flat-ground Policy) (Fig. 25a, Fig. 26c, Fig. 35):
  - Executes various target jumps by just changing the command: jumping in place while turning, jumping backward, and jumping forward.
  - Adjusts the take-off pose (leaning backward for rear targets, forward for forward jumps).
  - Capable of multi-axis jumps (forward, lateral, and turning simultaneously) (Fig. 26c).
  - The following figure (Figure 25 from the original paper) shows jump-and-turn and jumps to elevated platforms.
Fig. 25: Versatile jumping capabilities of Cassie in the real world without any tuning. (a) Flat-ground policy executing various target jumps: (i) in-place jump with a 60° turn, (ii) backward jump, (iii) forward jump. (b) Discrete-terrain policy jumping to elevated platforms: (i) high jump, (ii) long jump to a forward, elevated target, (iii) forward jump to an elevated target.
- Jump to Elevated Platforms (Discrete-terrain Policy) (Fig. 25b, Fig. 26):
  - Jumps precisely to targets at different positions and elevations.
  - Achieves standing long jumps over 1.4 meters and standing high jumps to a 0.44-meter elevated platform. These are novel capabilities for human-sized bipedal robots.
  - Adjusts take-off maneuvers for different targets and manages angular momentum after landing impacts.
  - The following figure (Figure 26 from the original paper) shows diverse bipedal jumps.
Fig. 26: Diverse bipedal jumps demonstrated in the real world. (a-c) Examples using the flat-ground policy: (a) lateral jump, (b) diagonal jump (ahead and to the left), (c) complex jump (forward, lateral, and turning simultaneously). (d-g) Examples using the discrete-terrain policy jumping to elevated platforms: (d) in-place jump onto an elevated platform, (e-g) forward jumps to platforms at different distances and elevations.
- Robust Jumping Maneuvers (Fig. 27):
  - Under an impulse perturbation at the jump's apex, the robot deviates and lands in an unstable pose (backward lean, toes pitched up).
  - It then executes a backward hop (a behavior learned through task randomization) to correct its pose and achieve a more stable landing configuration. This is a successful real-world recovery from a perturbation during a jump.
  - The following figure (Figure 27 from the original paper) shows robust jumping maneuvers.
Fig. 27: Robust jumping maneuvers in the real world. (a) Impulse perturbation applied at the apex of an in-place jump. (b) The robot's response: deviation from the nominal trajectory, an unstable landing pose, and a subsequent backward hop to correct its position and achieve a more stable configuration. This illustrates task generalization for agile recovery from unforeseen perturbations.

Emergent Behaviors:
- Online Contact Strategy: The robot develops its own contact strategy online, deviating from the reference motions and enhancing stability (e.g., small hops after landing in jumping, varying double-support phases in walking, not strictly following periodic contact in running). This aligns with contact-implicit optimization but is achieved online on a real robot.
- Unified Control Policy Challenges: Combining a highly dynamic skill (aperiodic jumping) with a stationary skill (standing) within a single policy can lead to oscillations in the stationary phase, indicating the challenge of learning a single unified policy across vastly different dynamic characteristics.
6.2. Data Presentation (Tables)
All relevant tables are transcribed and presented in Section 5.4. The dynamics randomization table (Table IV) is included in Section 4.2.7.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Ablation on Action Filter (LPF)
The following figure (Figure 28 from the original paper) shows the ablation study on the use of Low Pass Filter (LPF).
Fig. 28: Ablation study on the use of Low Pass Filter (LPF) as an action filter for the training of the in-place jumping skill from scratch. Without using the LPF, the training return (blue curve) is much lower than the one using LPF (ours, red curve). These two policies are obtained by using the exact hyperparameters and training settings. The underlying reason is that it is harder for the RL-based policy to damp out high-frequency jittering motion without the use of LPF.
- An LPF applied after the policy output helps smooth the actions (a minimal sketch is given after this list).
- Training a jumping-in-place policy without the LPF results in worse learning performance (a lower converged return) due to jittering motion.
- The LPF reduces the need for excessively high smoothing reward weights, which could otherwise lead to suboptimal stationary behaviors rather than dynamic skill learning.
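A minimal sketch of a first-order low-pass filter applied to the policy's action before it reaches the joint PD controllers; the cutoff frequency, loop rate, and action dimension here are illustrative assumptions rather than the paper's exact filter parameters.

```python
import numpy as np

class ActionLowPassFilter:
    def __init__(self, cutoff_hz=10.0, loop_hz=2000.0, action_dim=10):
        # Exponential-smoothing coefficient derived from the RC cutoff frequency.
        dt = 1.0 / loop_hz
        rc = 1.0 / (2.0 * np.pi * cutoff_hz)
        self.alpha = dt / (dt + rc)
        self.prev = np.zeros(action_dim)

    def __call__(self, raw_action):
        # Smooth out high-frequency jitter in the raw policy output.
        self.prev = self.alpha * np.asarray(raw_action) + (1.0 - self.alpha) * self.prev
        return self.prev
```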
6.3.2. Comparison of Different History Lengths
The following figure (Figure 29 from the original paper) shows the learning performance using different lengths of I/O history.
Fig. 29: Learning performance using different lengths of the robot's I/O history when training a single-task running policy with dynamics randomization. All of these policies used the proposed dual-history-based policy. When increasing the explicit length of robot history from 1 second (pink curve), 2 seconds (red curve, as used in this work), and 3 seconds (blue curve), we observe an increase in learning performance. However, if we keep increasing the history length, such as to 4 seconds (dark blue curve), the improvement of the learning performance may get saturated.
- Increasing the history length (e.g., from 1 s to 2 s, or 2 s to 3 s) generally enhances learning performance, as it provides more information for state estimation and dynamics parameter inference.
- However, continuing to increase the history length can lead to saturation (e.g., a 4 s history), as it may introduce redundant information that the robot needs to filter out.
- A 2-second history length is found to perform consistently well across skills. (A minimal sketch of such a history buffer follows this list.)
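The following is a minimal sketch of a fixed-length I/O history buffer whose window length can be varied for this ablation; the policy rate and observation/action dimensions are illustrative assumptions.

```python
from collections import deque
import numpy as np

class IOHistory:
    def __init__(self, seconds=2.0, policy_hz=33, obs_dim=42, act_dim=10):
        self.length = int(seconds * policy_hz)
        zero = np.zeros(obs_dim + act_dim)
        # Fixed-size buffer: old entries fall off as new ones are pushed.
        self.buf = deque([zero] * self.length, maxlen=self.length)

    def push(self, obs, prev_action):
        # Each entry pairs the latest observation with the previous action.
        self.buf.append(np.concatenate([obs, prev_action]))

    def as_array(self):
        # Shape (history_len, obs_dim + act_dim), consumed by the history encoder.
        return np.stack(self.buf)
```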
6.3.3. Comparison of Different Temporal Encoders
The following figure (Figure 30 from the original paper) compares learning performance using TCN and LSTM encoders with and without the dual-history approach.
Fig. 30: Learning performance using different neural network architectures to encode the robot's I/O history when training a single-task walking policy with dynamics randomization. For both Long Hist. Only methods, we still provide explicit immediate state feedback alongside the temporal encoder. As shown in Fig. 30a, using the proposed dual-history approach by providing an explicit short I/O history alongside the TCN encoder, the learning performance is much better than the TCN only. The TCN encodes 2-second robot I/O history and has 3 layers with filter sizes of [34,34,34], a kernel size of 5, a dilation base of 2, and a stride size of 2, with ReLU activation, as suggested in [71]. Fig. 30b shows that the dual-history approach will not help with the LSTM-based policy. The LSTM encoder has 1 layer of 128 units. However, both TCN with dual-history approach and Long Hist. Only perform better than the LSTM-based policy while using the hyperparameters tuned for LSTM. It suggests that LSTM may only learn to leverage a recent short history and converge to a more suboptimal policy.
- TCN (Non-recurrent): The dual-history approach significantly improves learning performance for TCN encoders, consistent with the 1D CNN results.
- LSTM (Recurrent): The dual-history approach does not significantly aid learning for LSTM-based policies. The LSTM tends to converge to suboptimal policies and is sensitive to hyperparameter tuning across different MDPs. LSTM-based policies also struggle to learn highly dynamic skills like jumping (Fig. 31), highlighting this sensitivity.
A minimal sketch of a dual-history policy with a TCN-style encoder is given below.
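This PyTorch sketch illustrates the dual-history idea with a TCN-style long-history encoder. The conv-layer sizes follow the figure caption (three layers of 34 filters, kernel size 5, dilation base 2, stride 2, ReLU); the added padding, the I/O dimension, the history lengths, and the MLP widths are assumptions made so the example runs end to end.

```python
import torch
import torch.nn as nn

class DualHistoryPolicy(nn.Module):
    def __init__(self, io_dim=52, long_len=66, short_len=4, act_dim=10):
        super().__init__()
        # Long-history encoder: 3 dilated 1D conv layers, 34 filters each,
        # kernel size 5, dilation base 2, stride 2, ReLU (padding is an assumption).
        layers, in_ch = [], io_dim
        for i in range(3):
            layers += [nn.Conv1d(in_ch, 34, kernel_size=5, stride=2,
                                 dilation=2 ** i, padding=2 * 2 ** i), nn.ReLU()]
            in_ch = 34
        self.long_encoder = nn.Sequential(*layers)
        # MLP base consuming the encoded long history plus the raw short history.
        self.base = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, long_hist, short_hist):
        # long_hist: (B, long_len, io_dim); short_hist: (B, short_len, io_dim)
        z = self.long_encoder(long_hist.transpose(1, 2)).flatten(1)
        return self.base(torch.cat([z, short_hist.flatten(1)], dim=-1))

policy = DualHistoryPolicy()
action = policy(torch.randn(8, 66, 52), torch.randn(8, 4, 52))  # (8, 10)
```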
6.3.4. Latent Visualization of Walking Policy
The following figure (Figure 32 from the original paper) shows latent visualization and adaptivity for the walking policy.
Fig. 32: Adaptivity of the walking policy demonstrated by latent representations. (a) Recorded latent representation after the long-term I/O history encoder during walking. The lower part of the figure compares two selected dimensions (marked as red lines) with the recorded impact forces on each of the robot's feet. (b) The blue plot shows the robot's latent representation with default dynamics parameters during walking. The red plots indicate changes in the same region under different dynamics. Despite significant environment changes, control performance metrics like task completion error and motion tracking error show little change.
- The latent embedding from the walking policy (Fig. 32a) also captures time-variant changes (external perturbations, contact events) and shows a periodic pattern.
- Time-invariant dynamics shifts (Fig. 32b) cause changes in the latent embedding, but control performance remains largely unaffected, confirming the adaptivity.
6.3.5. Saliency Map Analysis
- Saliency maps of the MLP base (which produces the final action) show that the robot focuses more on the short I/O history, especially the most recent observation, for both running and walking. This supports the importance of the direct short-history input.
- Saliency maps on the encoded long history show that different parts of the long-history embedding are attended to under external perturbation, indicating its use in adjusting the actions. This confirms that the long I/O history provides useful context. (A minimal gradient-based saliency sketch follows.)
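A minimal sketch of an input-gradient saliency analysis, assuming a PyTorch policy such as the DualHistoryPolicy sketch above. Note that this computes sensitivity with respect to the raw history inputs, whereas the paper reports saliency of the MLP base with respect to the short history and the encoded long history; the idea is the same.

```python
import torch

def saliency(policy, long_hist, short_hist):
    """Gradient magnitude of the action norm w.r.t. each input entry."""
    long_hist = long_hist.clone().requires_grad_(True)
    short_hist = short_hist.clone().requires_grad_(True)
    action = policy(long_hist, short_hist)
    # Reduce the action vector to a scalar so a single backward pass suffices.
    action.norm(dim=-1).sum().backward()
    # Larger values indicate inputs the policy is more sensitive to.
    return long_hist.grad.abs(), short_hist.grad.abs()
```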
6.3.6. Estimator Errors in High-Speed Running
The following figure (Figure 34 from the original paper) illustrates estimation errors in high-speed running.
Fig. 34: The large estimation error of the robot's onboard velocity estimator (based on an EKF) during high-speed running in simulation. The robot is controlled by the proposed running policy to track variable commands (black dashed line); the estimated sagittal velocity is recorded as the red line, while the robot's actual running speed is recorded as the blue line. The ground-truth speed is obtained from the simulator. Although accurate at slow speeds, the estimated velocity shows a significant error in the high-speed region compared to the ground truth, and the robot's actual speed tends to form the upper envelope of the estimated speed. In real-world experiments, only the state estimator's output is available to report, such as the running speed tracking results in Fig. 20b. This comparison indicates that the robot's actual running speed in the real world is faster than the reported estimate and closer to the command.
- The onboard EKF velocity estimator shows large estimation errors for the sagittal velocity during high-speed running compared to the ground truth; the actual speed tends to be higher than the estimated speed.
- This implies that the robot's actual running speed in real-world experiments is faster and closer to the command than the reported estimated values suggest, indicating even better tracking performance.
- This highlights the necessity of training with an inaccurate estimator to enable sim-to-real transfer, and points to developing reliable state estimators as future work. (A minimal sketch of quantifying the estimator gap follows this list.)
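The following is a minimal sketch of how the estimator gap shown in Fig. 34 could be quantified by comparing the onboard-estimated sagittal velocity against the simulator's ground truth; the high-speed threshold is an illustrative assumption.

```python
import numpy as np

def velocity_estimation_error(v_est, v_true, high_speed_threshold=3.0):
    """Summarize EKF velocity error, overall and in the high-speed region."""
    v_est, v_true = np.asarray(v_est), np.asarray(v_true)
    err = v_true - v_est
    high = v_true > high_speed_threshold
    return {
        "mae_all": float(np.abs(err).mean()),
        "mae_high_speed": float(np.abs(err[high]).mean()) if high.any() else 0.0,
        # A positive bias means the true speed exceeds the estimate, i.e. the
        # actual speed forms the upper envelope of the estimated signal.
        "bias_high_speed": float(err[high].mean()) if high.any() else 0.0,
    }
```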
6.4. Data Presentation (Figures)
All figures are integrated into the text at their most relevant points, as per the instructions.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work makes significant strides in bipedal robot locomotion control by introducing an RL-based framework that yields versatile, robust, and dynamic controllers. The core innovation lies in its dual-history architecture, which effectively integrates both long-term and short-term I/O history to enhance adaptivity. The framework employs a multi-stage training strategy that includes single-task training, task randomization, and dynamics randomization. Key findings show that the long I/O history encoder implicitly captures time-variant events (like contact forces) and time-invariant dynamics shifts, allowing the controller to adapt without explicit model parameters. Furthermore, task randomization is demonstrated as a crucial, orthogonal source of robustness, enabling task generalization and compliant recovery maneuvers from unforeseen disturbances, a capability distinct from dynamics randomization.
The effectiveness of the proposed method is rigorously validated on Cassie, a torque-controlled human-sized bipedal robot, through extensive real-world experiments. These demonstrations include:
- Robust standing and versatile walking with consistent performance over long periods (over a year).
- Fast running, including a 400-meter dash and running over challenging uneven terrains with flight phases.
- A diverse repertoire of jumping skills, such as standing long jumps (1.4 meters) and high jumps (0.44 meters elevated).

The work successfully bridges the sim-to-real gap and pushes the limits of agility for human-sized bipedal robots.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Learning a Unified Control Policy: While the current work achieves skill-specific versatile policies, learning a single, truly unified policy that encompasses all the different locomotion skills (walking, running, jumping, and potentially manipulation) remains challenging. The current approach of combining skills (e.g., standing with walking) can lead to catastrophic forgetting if not carefully managed.
- Adversarial Motion Prior (AMP): While AMP could potentially help learn unified policies without explicit motion-tracking rewards, applying it to aggressive real-world bipedal locomotion is challenging due to mode-collapse issues in GAN-styled methods and the large sim-to-real gap.
- Continual RL / Imitation Learning: Continual RL (to keep learning new skills) or imitation learning from offline datasets could be avenues toward unified policies, but their robustness and sim-to-real transfer for bipedal robots are open questions.
- Precision vs. Generalization: While the policies demonstrate generalization and adaptivity, achieving perfectly precise control (e.g., minimal errors in fast-running tasks) with a single policy handling wide variations remains an open question. There is a trade-off between generalization and precision.
- Oscillations in Standing: After large jumps, the robot occasionally oscillates while standing. This indicates the difficulty of learning both dynamic aperiodic skills and stationary skills with a single policy.
- Reliable State Estimators: The paper highlights large estimation errors from the onboard velocity estimator during high-speed running. Developing more reliable state estimators for dynamic bipedal locomotion skills using RL is an interesting future direction.

Future work suggestions include:
- Humanoid Robots: Extending the method to humanoid robots that leverage upper-body motions for agility and stability.
- Depth Vision Integration: Integrating depth vision directly into the locomotion controller by adding an additional depth encoder alongside the I/O history encoder.
- Loco-Manipulation Tasks: Combining bipedal locomotion with bimanual manipulation to tackle long-horizon loco-manipulation tasks.
7.3. Personal Insights & Critique
This paper represents a significant step forward in bipedal locomotion control. The dual-history architecture is a clever and effective design choice that addresses a fundamental challenge in RL-based control: how to best incorporate past information for adaptivity and real-time responsiveness. The empirical validation of I/O history for implicit system identification and contact estimation is a powerful demonstration of RL's emergent capabilities.
The most profound insight, in my opinion, is the explicit identification and emphasis on task randomization as a source of robustness. This idea, that exposing a robot to a wide range of tasks rather than just dynamics variations leads to more flexible and compliant behaviors, is very intuitive yet often overlooked in the RL community's focus on dynamics randomization. It suggests a shift in how we think about robustness in RL, moving from merely hardening against noise to fostering intelligent, adaptable reactions grounded in a broader behavioral repertoire. This could inspire new curriculum learning strategies for complex robotic tasks.
The real-world experiments on Cassie are exceptionally impressive, particularly the long-term consistency, the 400-meter dash, uneven terrain running, and diverse jumping feats. These demonstrations are a strong testament to the practical applicability and superior performance of the proposed method.
Potential Issues / Areas for Improvement:
- Explainability of Latent Space: While the paper shows that the latent space changes with dynamics shifts and correlates with contact events, a deeper understanding of which specific dynamics parameters or environmental features are encoded in each latent dimension could provide more interpretability and potentially guide future controller designs.
- Trade-off between Generalization and Precision: The paper briefly mentions this trade-off. While versatility is critical, many industrial or mission-critical applications also require extreme precision. Future work could explore integrating precision-focused fine-tuning or hierarchical control on top of these generalized policies.
- Unified Policy for All Skills: The current approach provides skill-specific versatile policies. While effective, the ultimate goal of a single policy for all locomotion and potentially manipulation remains a grand challenge. The paper acknowledges this, and the issue of catastrophic forgetting when simply combining skills with current methods is significant. Exploring more advanced continual learning or multi-task learning architectures would be crucial here.
- State Estimation Reliance: The reliance on an EKF for linear velocity estimation, which shows significant errors at high speeds, highlights a potential vulnerability. While the RL policy adapts to this noisy input, an RL-based state estimator learned jointly with the control policy could further improve performance and robustness, especially for highly dynamic tasks.
- Hyperparameter Sensitivity: The paper notes the LSTM's sensitivity to hyperparameters. While CNN-based non-recurrent policies seem more robust in this regard, a deeper analysis of the hyperparameter landscape for different history-encoder types could further inform design choices for future RL-based locomotion.

Overall, this paper provides a robust and highly impactful contribution to legged locomotion, setting new benchmarks and offering valuable insights into the design of adaptive and robust RL controllers for complex bipedal robots.