PIE: Parkour with Implicit-Explicit Learning Framework for Legged Robots
TL;DR Summary
The PIE framework enhances legged robots' parkour capabilities by utilizing dual-level implicit-explicit estimation, enabling even low-cost robots to perform exceptionally on challenging terrains through simple training and successful zero-shot deployment.
Abstract
Parkour presents a highly challenging task for legged robots, requiring them to traverse various terrains with agile and smooth locomotion. This necessitates comprehensive understanding of both the robot's own state and the surrounding terrain, despite the inherent unreliability of robot perception and actuation. Current state-of-the-art methods either rely on complex pre-trained high-level terrain reconstruction modules or limit the maximum potential of robot parkour to avoid failure due to inaccurate perception. In this paper, we propose a one-stage end-to-end learning-based parkour framework: Parkour with Implicit-Explicit learning framework for legged robots (PIE) that leverages dual-level implicit-explicit estimation. With this mechanism, even a low-cost quadruped robot equipped with an unreliable egocentric depth camera can achieve exceptional performance on challenging parkour terrains using a relatively simple training process and reward function. While the training process is conducted entirely in simulation, our real-world validation demonstrates successful zero-shot deployment of our framework, showcasing superior parkour performance on harsh terrains.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is a framework for enabling legged robots to perform parkour maneuvers, titled PIE: Parkour with Implicit-Explicit Learning Framework for Legged Robots.
1.2. Authors
The authors are Shixin Luo, Songbo Li, Ruiqi Yu, Zhicheng Wang, Jun Wu, and Qiuguo Zhu. Shixin Luo and Songbo Li are noted as contributing equally to this work. Qiuguo Zhu is the corresponding author. Their affiliations are:
- Institute of Cyber-Systems and Control, Zhejiang University, 310027, China.
- State Key Laboratory of Industrial Control Technology, 310027, China (for Qiuguo Zhu and Jun Wu).
1.3. Journal/Conference
The paper was released as the preprint arXiv:2408.13740. However, the manuscript's received, revised, and accepted dates (April 23, 2024; July 25, 2024; and August 23, 2024, respectively) and the note "This paper was recommended for publication by Editor Aleksandra Faust upon evaluation of the Associate Editor and Reviewers' comments" indicate that it has been accepted for publication at a peer-reviewed venue; this editorial workflow is characteristic of a robotics journal such as IEEE Robotics and Automation Letters.
1.4. Publication Year
The arXiv record lists a publication timestamp of 2024-08-25T07:01:37 (UTC), i.e., the paper was published in 2024.
1.5. Abstract
The paper addresses the challenge of enabling legged robots to perform parkour, a task requiring agile and smooth locomotion over diverse terrains. This necessitates a comprehensive understanding of the robot's state and surroundings, despite inherent unreliability in robot perception and actuation. Current state-of-the-art methods either rely on complex, pre-trained high-level terrain reconstruction modules or limit robot parkour potential to avoid failures from inaccurate perception.
The authors propose PIE (Parkour with Implicit-Explicit learning framework for legged robots), a one-stage, end-to-end learning-based parkour framework. PIE employs a dual-level implicit-explicit estimation mechanism. This approach allows even low-cost quadruped robots, equipped with unreliable egocentric depth cameras, to achieve exceptional parkour performance on challenging terrains with a relatively simple training process and reward function. The framework is trained entirely in simulation, and its real-world validation demonstrates successful zero-shot deployment, showcasing superior parkour performance on harsh terrains.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2408.13740
- PDF Link: https://arxiv.org/pdf/2408.13740v3.pdf
The paper is available as a preprint on arXiv and has been accepted for publication in a peer-reviewed venue.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is enabling legged robots, specifically quadruped robots, to perform parkour maneuvers with high agility, smoothness, and robustness across varied and challenging terrains. Parkour involves navigating obstacles by running, jumping, and climbing, tasks that demand precise balance, stability, and real-time environmental understanding.
This problem is important because successful implementation could open up new possibilities for robot technology in extreme environments, driving advancements in real-world applications such as search and rescue, exploration in unstructured environments, and logistics in complex terrains.
Specific challenges or gaps in prior research include:
- Unreliability of Perception and Actuation: Robot sensors (like depth cameras) and actuators (motors) are inherently noisy and have latency, making precise real-time terrain estimation and control difficult, especially for high-stakes maneuvers near edges or during jumps.
- Complexity of State-of-the-Art Methods: Existing methods often rely on complex, pre-trained high-level terrain reconstruction modules, which complicate the overall system and training.
- Limited Performance Potential: To compensate for unreliable perception, many approaches limit the maximum potential of robot parkour, avoiding challenging scenarios where inaccurate perception could lead to failure.
- Two-Stage Training Paradigms: Many learning-based parkour approaches use a two-stage training process (e.g., teacher-student policies), which can lead to information loss and performance degradation in the deployed student policy.
- Lack of Seamless Behavior Integration: Integrating multiple complex behaviors (running, jumping, climbing) seamlessly into a neural network using a simple training process and reward function remains challenging.
- Focus on Explicit Terrain Estimation: Prior works primarily focus on explicit terrain estimation, often lacking an implicit understanding of the surroundings, which hinders maximal robot performance.
The paper's entry point, or innovative idea, is to propose a one-stage end-to-end learning-based parkour framework called PIE (Parkour with Implicit-Explicit learning framework for legged robots). The key innovation within PIE is its dual-level implicit-explicit estimation mechanism, which aims to give the robot a more robust and comprehensive understanding of its own state and surroundings, pushing the limits of parkour performance even with low-cost robots and unreliable sensors.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- A Novel One-Stage Learning Framework: PIE is introduced as a novel one-stage, end-to-end learning-based parkour framework. This eliminates the complexities and information loss associated with the two-stage training paradigms common in prior works.
- Dual-Level Implicit-Explicit Estimation: The framework leverages a unique dual-level implicit-explicit estimation approach to enhance the quality of estimating the robot's state and surroundings.
  - Level 1 (Understanding State/Surroundings): Integrates real-time proprioception with exteroception to implicitly infer the robot's state and surroundings by estimating its successor state, alongside explicitly estimating terrain from visual data. This improves estimation accuracy and robustness against unreliable sensors.
  - Level 2 (Latent vs. Physical Quantity): Explicitly estimates specific physical quantities (like base velocity and foot clearance) alongside encoded latent vectors, further enhancing the robot's ability to execute complex parkour maneuvers.
- Exceptional Parkour Capabilities: Experiments demonstrate that PIE significantly enhances the parkour capabilities of quadruped robots. The robot can:
  - Leap onto and jump off steps 3x its height (0.75 m steps for a robot with a 0.25 m thigh-joint height).
  - Negotiate gaps 3x its length (roughly 1 m gaps for a robot with 0.34 m between its thigh joints).
  - Climb up and down stairs 1x its height (0.25 m stairs).
  These results represent a significant performance improvement (at least 50% compared to state-of-the-art frameworks) and push the limits of parkour for quadruped robots.
- Sim-to-Real Transferability and Robustness: The framework, trained entirely in simulation, demonstrates successful zero-shot deployment in the real world (indoor and challenging outdoor environments) without extensive fine-tuning. This showcases remarkable robustness and generalization, even in the presence of significant disturbances to the depth camera in outdoor settings.
- Simplicity in Training: PIE achieves these advanced capabilities using a relatively simple training process and reward function, in contrast to methods requiring intricate reward designs or imitation learning.

The key conclusions are that the dual-level implicit-explicit estimation mechanism, coupled with a one-stage end-to-end learning framework, is highly effective in enabling robust and agile parkour locomotion for legged robots, overcoming the limitations of unreliable perception and achieving impressive sim-to-real transfer.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the PIE paper, a reader should be familiar with several core concepts in robotics, artificial intelligence, and machine learning:
- Legged Robots / Quadruped Robots: Robots that use legs for locomotion, offering advantages over wheeled robots in traversing uneven, unstructured, or obstacle-laden terrains. Quadruped robots have four legs, mimicking animals like dogs or cats, providing inherent stability and agility.
- Parkour: A training discipline that involves moving rapidly through an area, typically an urban environment, negotiating obstacles by running, jumping, and climbing. In robotics, it refers to similar agile traversal of challenging and varied terrains.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent is not explicitly programmed with how to perform a task but learns through trial and error, observing the state of the environment and the rewards received for its actions.
  - Agent: The learner or decision-maker. In this paper, it's the robot's control policy.
  - Environment: The world with which the agent interacts. For PIE, this is the simulated or real-world terrain and robot physics.
  - State: A complete description of the environment at a given time (e.g., robot's joint angles, velocities, orientation, terrain information).
  - Action: A decision made by the agent that affects the environment (e.g., target joint angles for the robot).
  - Reward: A scalar feedback signal from the environment indicating how well the agent is doing. The agent's goal is to maximize cumulative reward.
  - Policy: A function that maps states to actions, defining the agent's behavior.
- Deep Learning (DL): A subfield of machine learning that uses neural networks with many layers (deep neural networks) to learn complex patterns from data.
  - Neural Network: A computational model inspired by the structure of the human brain, composed of interconnected neurons (nodes) organized in layers.
  - Multi-Layer Perceptron (MLP): A type of feedforward neural network consisting of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node is a neuron that uses a non-linear activation function. MLPs are used for various tasks, including classification and regression.
  - Convolutional Neural Network (CNN): A specialized type of neural network particularly effective for processing grid-like data, such as images. CNNs use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input data. They are crucial for visual perception tasks.
- Asymmetric Actor-Critic Architecture: A common design in Reinforcement Learning where the actor (policy network) and critic (value network) have different observation inputs. The actor receives only observations available during real-world deployment (e.g., proprioception, camera data), while the critic can receive additional privileged information (e.g., ground truth base velocity, full terrain height map) available only in simulation. This allows the critic to learn a more accurate value function, guiding the actor more effectively during training, while the actor remains deployable in the real world without the privileged information.
- Proximal Policy Optimization (PPO): A popular Reinforcement Learning algorithm that updates the policy with small, conservative steps, ensuring stability while maximizing rewards. It is an on-policy algorithm, meaning it learns from data collected by the current policy. PPO is known for its balance of performance and ease of implementation.
- Transformer Encoder: A type of neural network architecture, originally developed for natural language processing, that relies heavily on self-attention mechanisms to weigh the importance of different parts of the input. In vision and robotics, transformers are increasingly used for processing sequential or multi-modal data, allowing for complex interactions between different feature streams (e.g., visual and proprioceptive).
- Gated Recurrent Unit (GRU): A type of recurrent neural network (RNN) designed to handle sequential data. GRUs have gating mechanisms that regulate the flow of information, allowing them to capture long-term dependencies in sequences and mitigate the vanishing gradient problem common in simpler RNNs. They are used to maintain memory of past states or observations.
- Autoencoder / Variational Autoencoder (VAE):
  - Autoencoder: A type of neural network used for unsupervised learning of efficient data codings (or latent representations). It learns to encode input data into a lower-dimensional latent space and then decode it back to reconstruct the original input. The goal is for the latent representation to capture the most important features of the data.
  - Variational Autoencoder (VAE): An extension of the autoencoder that provides a probabilistic description of the latent space. Instead of mapping inputs to a fixed vector, a VAE maps them to a probability distribution (mean and variance). It encourages the latent space to have a regular, continuous structure, typically by penalizing deviations from a standard normal distribution using KL divergence. This makes VAEs useful for generating new data and for learning meaningful latent representations (see the sketch after this list).
- Proprioception: The sense of the relative position of one's own body parts and the strength of effort being employed in movement. In robotics, proprioceptive sensors (e.g., joint encoders, Inertial Measurement Units (IMUs)) provide data about the robot's internal state, such as joint angles, velocities, body orientation, and angular velocity.
- Exteroception: The sense that provides information about the external environment. In robotics, exteroceptive sensors (e.g., cameras, LiDARs, depth sensors) provide data about the robot's surroundings, such as terrain shape, obstacles, and distances.
- Sim-to-Real Transfer / Zero-Shot Deployment: The process of training a robot control policy entirely in a simulated environment and then deploying it directly (without further training or fine-tuning) in the real world. Zero-shot deployment implies successful transfer with no real-world training at all. This is a major challenge due to the reality gap (discrepancies between simulation and reality).
- Domain Randomization: A technique used to bridge the sim-to-real gap. During simulation training, various physical parameters of the environment and robot (e.g., mass, friction, sensor noise, camera intrinsics) are randomly varied within specified ranges. This forces the policy to learn to be robust to these variations, making it more likely to generalize to the unpredictable real world.
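Since the VAE and KL-divergence machinery above is central to PIE's latent estimation, here is a minimal sketch of the idea (layer sizes, names, and dimensions are illustrative assumptions, not the paper's network):

```python
import torch
import torch.nn as nn

class TinyVAEHead(nn.Module):
    """Minimal VAE-style latent head: maps features to (mu, logvar),
    samples z with the reparameterization trick, and exposes the KL term."""
    def __init__(self, in_dim: int = 64, latent_dim: int = 16):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, feat: torch.Tensor):
        mu, logvar = self.mu(feat), self.logvar(feat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
        return z, kl
```

The KL penalty is what keeps the latent distribution close to a standard normal prior, giving the downstream policy a regular, well-structured input space.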
3.2. Previous Works
The paper categorizes related work into Vision-Guided Locomotion and Robot Parkour.
3.2.1. Vision-Guided Locomotion
This field focuses on enabling robots to navigate using visual inputs.
- Traditional Decoupled Approaches: Historically, this problem was split into:
  - Perception Component: Translating visual inputs (cameras, LiDARs) into elevation maps or traversability maps.
  - Controller Component: Using model-based methods (e.g., [13], [14]) or RL methods (e.g., [8], [15]) for locomotion.
  - Limitation: This decoupling often leads to information loss and system delays, hindering flexible adaptation to complex terrains.
- Learning-Based End-to-End Control Systems: Recent advancements have favored end-to-end approaches, showing promise in complex terrains.
  - Agarwal et al. [10]: Designed a two-stage learning framework. A teacher policy (with privileged information like ground truth base velocity and height map) guides a student policy. The student policy directly predicts joint angles from depth camera inputs and proprioceptive feedback.
  - Yang et al. [11]: Proposed a coupled training framework using a transformer structure to integrate both proprioception and visual observations. A self-attention mechanism fuses these inputs for autonomous navigation in indoor and outdoor environments with varying obstacles.
  - Yang et al. [12]: Utilized a 3D voxel representation with SE(3) equivariance as features from visual inputs, aiming for precise terrain understanding.
3.2.2. Robot Parkour
This subfield specifically addresses agile, nimble, and robust movement in highly dynamic situations, requiring precise environmental understanding.
- Hoeller et al. [1]: Described a hierarchical pipeline for navigation in parkour terrains.
  - Limitation: Occupancy voxels from an encoder-decoder architecture often contain errors and cannot be fully trusted, leading to inappropriate responses, high training costs, and low scalability in unstructured terrain.
- Zhuang et al. [2]: Proposed a multi-stage method using soft/hard constraints to accelerate training. The robot learns to traverse terrains directly from depth images.
  - Limitation: Its privileged physics information is strongly tied to the geometric properties of obstacles in simulation, making it difficult to train for terrains not solely describable by geometry.
- Cheng et al. [3]: Adopted a framework similar to Agarwal et al. [10].
  - Innovation: Introduced waypoints into the teacher policy's privileged inputs to guide the student policy in learning autonomous heading.
  - Limitation: Manual specification of waypoints based on terrain imposes considerable limitations.
- Common Limitations of Prior Parkour Works:
  - Two-Stage Training: All the aforementioned parkour works use a two-stage training paradigm, leading to information loss and performance degradation in the deployed student policy.
  - Explicit Terrain Estimation: They primarily focus on explicit terrain estimation distilled from the teacher policy, lacking implicit estimation of the surroundings, which hinders maximal performance.
3.3. Technological Evolution
The field of legged robot locomotion has evolved significantly:
- Early Model-Based Control: Initial efforts relied heavily on precise physical models of robots and environments, using analytical methods for control. These were often brittle to real-world uncertainties.
- Blind Locomotion with Proprioception: Advancements came with learning-based approaches focusing on proprioceptive sensors (IMUs, joint encoders). Robots learned to walk and adapt to various terrains without direct visual input, inferring terrain implicitly (e.g., Hwangbo et al. [4], Nahrendra et al. [5]). This highlighted the power of learning and robust state estimation.
- Vision-Guided Locomotion (Decoupled): To tackle more complex, unknown environments, exteroceptive sensors (cameras, LiDAR) were introduced. Initially, perception and control were often decoupled: vision generated maps, and a separate controller used these maps (e.g., Fankhauser et al. [13]).
- Vision-Guided Locomotion (End-to-End RL): The trend then shifted toward end-to-end Reinforcement Learning, where visual inputs directly inform the control policy (e.g., Miki et al. [8], Agarwal et al. [10], Yang et al. [11], [12]). This reduced latency and allowed for more complex, learned behaviors. The asymmetric actor-critic architecture became prominent for enabling sim-to-real transfer.
- Robot Parkour: Building on vision-guided locomotion, parkour emerged as a highly challenging benchmark requiring extreme agility and robustness (e.g., Hoeller et al. [1], Zhuang et al. [2], Cheng et al. [3]). These works often still employed two-stage learning and focused primarily on explicit terrain understanding.

PIE fits into this timeline as a cutting-edge end-to-end Reinforcement Learning approach for robot parkour. It addresses the limitations of previous parkour methods by introducing a one-stage training paradigm and a novel dual-level implicit-explicit estimation to achieve superior performance and sim-to-real transfer with unreliable sensors.
3.4. Differentiation Analysis
Compared to the main methods in related work, PIE presents several core differences and innovations:
- One-Stage End-to-End Training vs. Two-Stage:
  - Prior Works (e.g., Agarwal et al. [10], Cheng et al. [3], Zhuang et al. [2]): Many learning-based parkour methods use a two-stage training paradigm, typically involving a teacher policy trained with privileged information and a student policy that learns to mimic the teacher's behavior using only exteroceptive and proprioceptive inputs.
  - PIE's Innovation: PIE proposes a one-stage end-to-end learning framework. This simplifies the training process, avoids potential information loss during the mimicking phase, and allows for a more direct optimization of the control policy based on raw sensor data. It leverages an asymmetric actor-critic architecture within this one-stage setup, where the critic uses privileged information but the overall training is unified.
- Dual-Level Implicit-Explicit Estimation vs. Primarily Explicit:
  - Prior Works (e.g., Hoeller et al. [1], Cheng et al. [3]): These methods largely focus on explicit terrain estimation (e.g., elevation maps, occupancy voxels) to inform the control policy. While beneficial, this can be unreliable with noisy sensors and may lack a holistic understanding of the robot's interaction with the environment.
  - PIE's Innovation: PIE introduces a dual-level implicit-explicit estimation.
    - Level 1 (Implicit-Explicit Understanding of State/Surroundings): It goes beyond explicit terrain maps by integrating proprioception with exteroception to implicitly infer the robot's state and surroundings through successor proprioceptive state prediction ($\hat{\mathbf{o}}_{t+1}$), alongside explicit height map reconstruction ($\hat{\mathbf{m}}_t$). This blend allows the robot to anticipate future states and adapt based on its internal model, even when external perception is noisy.
    - Level 2 (Latent vs. Physical Quantities): It explicitly estimates crucial physical quantities like base velocity ($\hat{\mathbf{v}}_t$) and foot clearance ($\hat{\mathbf{h}}_t^f$), which are vital for parkour maneuvers, in addition to purely latent vectors (the encoded height-map latent and $\mathbf{z}_t$). This provides the policy with direct, interpretable information alongside robust, compressed representations.
- Robustness to Unreliable Perception and Challenging Terrains:
  - Prior Works: Often struggle with latency and noise from exteroceptive sensors, especially near edges or during dynamic maneuvers, leading to limited performance or reliance on highly accurate terrain reconstruction.
  - PIE's Innovation: Through its dual-level implicit-explicit estimation, particularly the implicit estimation of the successor proprioceptive state, PIE demonstrates superior robustness, especially when camera inputs are noisy or vision contradicts proprioception. This enables the robot to achieve significantly higher parkour abilities (e.g., jumps and gaps of 3x the robot's height/length) and better sim-to-real transfer on harsh terrains and in outdoor environments.
- Simpler Training Process and Reward Function:
  - Prior Works (e.g., Zhuang et al. [2], Cheng et al. [3]): May use complex reward functions, soft/hard constraints, or waypoints to guide training.
  - PIE's Innovation: PIE utilizes a relatively simple reward function, closely aligned with prior research on blind walking. This highlights the effectiveness of its implicit-explicit estimation and network architecture in extracting complex behaviors without over-engineering the reward.

In essence, PIE differentiates itself by offering a more holistic and robust understanding of the robot's internal state and external environment within a streamlined, one-stage learning framework, leading to a new level of parkour performance and generalization for legged robots.
4. Methodology
The proposed PIE framework is a one-stage, end-to-end learning-based framework that directly computes desired joint angle commands from raw depth images and onboard proprioception using a single neural network. It aims to circumvent the performance reductions seen in previous two-stage learning-based parkour methodologies by enhancing the estimation of the robot's state and surroundings through a dual-level implicit-explicit estimation.
4.1. Principles
The core idea behind PIE is to enable a quadruped robot to perform complex parkour maneuvers by providing its control policy with a comprehensive and robust understanding of both its internal state and its external environment. This understanding is achieved through a dual-level implicit-explicit estimation mechanism, which processes proprioceptive (robot's internal state) and exteroceptive (environmental perception, e.g., depth camera) sensor data.
The implicit-explicit estimation works on two levels:
- Understanding of Robot's State and Surroundings: Beyond just explicitly perceiving the terrain, the framework implicitly infers the robot's future state and surrounding terrain by predicting the successor proprioceptive state (what the robot's internal state will be at the next timestep). This implicit imagination of future states, combined with explicit terrain estimation (like a height map), allows the robot to anticipate dangers and plan maneuvers more effectively, especially where vision is unreliable.
- Nature of Estimated Values (Latent vs. Physical): The framework explicitly estimates specific, physically meaningful quantities (e.g., base velocity, foot clearance) that are directly relevant to control, alongside learning encoded latent vectors. This combination provides the policy with both interpretable, direct control signals and robust, compressed representations of complex information, reducing noise influence and enhancing stability.

By integrating these dual levels within a one-stage end-to-end reinforcement learning setup, PIE seeks to create a highly agile and robust control policy capable of zero-shot sim-to-real transfer, even with low-cost hardware and unreliable sensors.
4.2. Core Methodology In-depth (Layer by Layer)
The PIE framework adopts an asymmetric actor-critic architecture to combine the benefits of privileged information during training with real-world deployability. It consists of three subnetworks: an actor, a critic, and an estimator. The optimization involves two concurrent parts: optimizing the actor-critic using Proximal Policy Optimization (PPO) and optimizing the estimator.
The following figure (Figure 2 from the original paper) shows the overall architecture of the PIE framework:
The figure is a schematic of the quadruped robot control system based on the implicit-explicit learning framework. It shows the different observation inputs and how they are processed by a multi-layer perceptron (MLP), a convolutional neural network (CNN), and a transformer encoder for decision-making and motion control; the annotated robot dimensions are 34 cm in length and 25 cm in height.
4.2.1. Policy Network (Actor)
The actor network is responsible for generating the robot's actions (joint commands). Crucially, its inputs are limited to data obtainable during real-world deployment.
Inputs to the Actor Network: The actor network takes the following as input:
- $\mathbf{o}_t$: proprioceptive observation at time $t$.
- $\hat{\mathbf{v}}_t$: estimated base velocity at time $t$.
- $\hat{\mathbf{h}}_t^f$: estimated foot clearance at time $t$.
- The encoded height map estimation (a latent vector representing the terrain) at time $t$.
- $\mathbf{z}_t$: purely latent vector (representing implicit state and surroundings) at time $t$.

The proprioceptive observation is a 45-dimensional vector directly measured from the robot's joint encoders and IMU. It is defined as:
$
\mathbf{o}_t = \left[ \omega_t \quad \mathbf{g}_t \quad \mathbf{c}_t \quad \pmb{\theta}_t \quad \dot{\pmb{\theta}}_t \quad \bar{\mathbf{a}}_{t-1} \right]^T
$
Where:
- $\omega_t$: body angular velocity.
- $\mathbf{g}_t$: gravity direction vector in the body frame.
- $\mathbf{c}_t$: velocity command (desired linear and angular velocities for the robot).
- $\pmb{\theta}_t$: joint angles.
- $\dot{\pmb{\theta}}_t$: joint angular velocities.
- $\bar{\mathbf{a}}_{t-1}$: previous action (motor commands), providing temporal context.

The estimated terms ($\hat{\mathbf{v}}_t$, $\hat{\mathbf{h}}_t^f$, the height-map latent, and $\mathbf{z}_t$) are the outputs of the estimator subnetwork, which processes raw depth images and historical proprioception to provide these estimations.
Output of the Actor Network: The actor network produces an action $\mathbf{a}_t$, a 12-dimensional vector representing the desired joint position offsets for the 12 joints of the quadruped robot (3 joints per leg, 4 legs).

Action Space & Target Joint Angle:
For stability and to maintain a default stance, the action is added as a bias to the robot's standstill pose $\pmb{\theta}_{stand}$. The final target joint angle is thus defined as:
$
\pmb{\theta}_{target} = \pmb{\theta}_{stand} + \mathbf{a}_t
$
This is then sent to the robot's low-level joint controllers (e.g., PD controllers) to achieve the desired joint positions.
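As an illustration of this action pipeline, here is a minimal sketch (the standing pose and PD gains are placeholder assumptions, not the paper's values):

```python
import numpy as np

THETA_STAND = np.zeros(12)   # placeholder standstill pose [rad]
KP, KD = 30.0, 0.75          # placeholder joint-space PD gains

def joint_torques(action: np.ndarray, theta: np.ndarray, theta_dot: np.ndarray) -> np.ndarray:
    """Convert a 12-D policy action (joint offsets) into motor torques."""
    theta_target = THETA_STAND + action                  # theta_target = theta_stand + a_t
    return KP * (theta_target - theta) - KD * theta_dot  # low-level PD tracking
```

Biasing the action around the standing pose keeps the policy's output centered on a stable configuration, which tends to ease early training.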
4.2.2. Value Network (Critic)
The value network (critic) is responsible for estimating the state value $V(\mathbf{s}_t)$, which represents the expected cumulative future reward from a given state. To provide a more accurate estimate and guide the actor effectively, the critic can incorporate privileged information that is typically only available in simulation.

Inputs to the Value Network: The input to the value network, $\mathbf{s}_t$, includes:
- $\mathbf{o}_t$: proprioceptive observation (same as for the actor).
- $\mathbf{v}_t$: ground truth base velocity (privileged information).
- $\mathbf{m}_t$: height map scan dots (privileged information, representing the true terrain height map).

Thus, $\mathbf{s}_t$ is defined as:
$
\mathbf{s}_t = \left[ \mathbf{o}_t \quad \mathbf{v}_t \quad \mathbf{m}_t \right]^T
$
The use of privileged information for the critic is common practice in asymmetric actor-critic architectures, accelerating learning and improving the quality of the value estimates.
4.2.3. Reward Function
To emphasize the robustness of the PIE framework and avoid over-engineering, a relatively simple reward function is utilized, similar to those found in prior research on blind walking [19], [20]. The reward function encourages desired behaviors (like velocity tracking and stable locomotion) and penalizes undesirable ones (like collisions, high accelerations, and power consumption). The total reward is a sum of weighted individual reward terms.
The following are the results from Table I of the original paper:
| Reward | Equation ($r_i$) | Weight ($w_i$) |
|---|---|---|
| Lin. velocity tracking | $\exp\{-4(\mathbf{v}_{xy}^{cmd} - \mathbf{v}_{xy})^2\}$ | 1.5 |
| Ang. velocity tracking | $\exp\{-4(\omega_{yaw}^{cmd} - \omega_{yaw})^2\}$ | 0.5 |
| Linear velocity (z) | $v_z^2$ | -1.0 |
| Angular velocity (xy) | $|\omega_{xy}|^2$ | -0.05 |
| Orientation | $|\mathbf{g}_{xy}|^2$ | -1.0 |
| Joint accelerations | $|\ddot{\pmb{\theta}}|^2$ | $-2.5 \times 10^{-7}$ |
| Joint power | $|\pmb{\tau} \cdot \dot{\pmb{\theta}}|$ | $-2 \times 10^{-5}$ |
| Collision | $n_{collision}$ | -10.0 |
| Action rate | $(\mathbf{a}_t - \mathbf{a}_{t-1})^2$ | -0.01 |
| Smoothness | $(\mathbf{a}_t - 2\mathbf{a}_{t-1} + \mathbf{a}_{t-2})^2$ | -0.01 |
Where:
- Lin. velocity tracking: rewards matching the robot's horizontal linear velocity $\mathbf{v}_{xy}$ to the commanded linear velocity $\mathbf{v}_{xy}^{cmd}$. The exponential term heavily penalizes deviations.
- Ang. velocity tracking: rewards matching the yaw angular velocity $\omega_{yaw}$ to the commanded angular velocity $\omega_{yaw}^{cmd}$.
- Linear velocity (z): penalizes vertical linear velocity, encouraging the robot to maintain a stable height.
- Angular velocity (xy): penalizes roll and pitch angular velocities, promoting body stability and preventing tipping.
- Orientation: penalizes deviations from an upright orientation, measured via the horizontal components of the gravity vector in the body frame.
- Joint accelerations: penalizes high joint accelerations, promoting smooth movements and reducing wear on actuators.
- Joint power: penalizes high joint power consumption, encouraging energy-efficient locomotion; $\pmb{\tau}$ represents joint torque.
- Collision: heavily penalizes collisions of any part of the robot (other than the feet) with the environment; $n_{collision}$ is the number of collision points.
- Action rate: penalizes large changes in actions (joint position commands) between consecutive time steps, promoting smooth control signals.
- Smoothness: penalizes large second-order differences in actions, further encouraging smooth, continuous movements.

A sketch of how a few of these terms combine is given below.
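For concreteness, this is how the two tracking terms and the action-rate penalty from Table I could be computed over a batch of parallel environments (a sketch; tensor shapes and names are assumptions):

```python
import torch

def tracking_rewards(v_cmd_xy: torch.Tensor,  # (N, 2) commanded xy velocity
                     v_xy: torch.Tensor,       # (N, 2) measured xy velocity
                     w_cmd_yaw: torch.Tensor,  # (N,)   commanded yaw rate
                     w_yaw: torch.Tensor,      # (N,)   measured yaw rate
                     a_t: torch.Tensor,        # (N, 12) current action
                     a_prev: torch.Tensor) -> torch.Tensor:
    """Weighted sum of three Table I reward terms, per environment."""
    r_lin = 1.5 * torch.exp(-4.0 * torch.sum((v_cmd_xy - v_xy) ** 2, dim=-1))
    r_ang = 0.5 * torch.exp(-4.0 * (w_cmd_yaw - w_yaw) ** 2)
    r_rate = -0.01 * torch.sum((a_t - a_prev) ** 2, dim=-1)
    return r_lin + r_ang + r_rate
```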
4.2.4. Estimator
The estimator subnetwork is at the heart of the PIE framework, implementing the dual-level implicit-explicit estimation. Its primary role is to process raw sensor data and generate the estimations ($\hat{\mathbf{v}}_t$, $\hat{\mathbf{h}}_t^f$, the height-map latent, and $\mathbf{z}_t$) that are fed into the actor network.
Dual-Level Implicit-Explicit Estimation: The estimated vectors are categorized by two levels:
Level 1: Understanding Robot's State and Surroundings (Implicit vs. Explicit) This level addresses how the robot comprehends its environment and its interaction with it.
- Challenge: Exteroceptive sensors (like depth cameras) suffer from latency and noise, making purely explicit terrain estimation unreliable, especially for critical parkour maneuvers near edges. A blind policy can implicitly infer its surroundings but needs direct environmental interaction for feedback, which is too slow for anticipatory parkour.
- PIE's Solution: Multi-Head Auto-Encoder Mechanism: PIE integrates implicit and explicit estimation of the robot's state and surroundings (see the sketch after this list).
  - Encoder Module Inputs:
    - Temporal Depth Images ($\mathbf{d}_t^{H_2}$): depth observations from the near past are stacked along the channel dimension. The paper sets $H_2 = 2$, meaning two recent depth images are used. These are processed by a CNN encoder to extract visual features.
    - Temporal Proprioceptive Observations ($\mathbf{o}_t^{H_1}$): proprioceptive observations from the recent past are concatenated. The paper sets $H_1 = 10$, using ten recent proprioceptive states. These are processed by an MLP encoder to extract proprioceptive features.
  - Cross-Modal Reasoning: A shared transformer encoder is employed to integrate the visual features from the CNN and the proprioceptive features from the MLP. This allows cross-modal attention to fuse information from both modalities.
  - Memory Generation: The outputs of the transformer (fused depth and proprioceptive features) are concatenated and fed into a GRU (Gated Recurrent Unit). The GRU generates memories of the state and terrain, capturing temporal dependencies and historical context, which is especially important since the egocentric camera cannot perceive terrain beneath or behind the robot.
  - Estimator Outputs for Actor Input: The outputs of the GRU serve as input vectors for the policy network (actor). These include:
    - Encoded Height Map Estimation: a latent vector representing the explicit understanding of the surrounding terrain. It is decoded into $\hat{\mathbf{m}}_t$ to reconstruct the high-dimensional height map $\mathbf{m}_t$ (the ground truth terrain map). This is an explicit estimation of the terrain.
    - Purely Latent Vector ($\mathbf{z}_t$): designed to encapsulate implicit information about the robot's state and surroundings; it is not directly interpretable as a physical quantity. PIE uses a Variational Autoencoder (VAE) structure for $\mathbf{z}_t$. When decoded alongside the other vectors, $\mathbf{z}_t$ aims to reconstruct $\hat{\mathbf{o}}_{t+1}$, the robot's successor proprioceptive state. This is an implicit estimation of the robot's state and its interaction with the surroundings.
  - Losses for Level 1:
    - Mean-Squared-Error (MSE) loss between the decoded reconstruction $\hat{\mathbf{m}}_t$ and the ground truth $\mathbf{m}_t$ for explicit terrain estimation.
    - Mean-Squared-Error (MSE) loss between the decoded reconstruction $\hat{\mathbf{o}}_{t+1}$ and the ground truth $\mathbf{o}_{t+1}$ for implicit successor state estimation.
    - KL divergence as a latent loss for the VAE structure of $\mathbf{z}_t$, regularizing its distribution.
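A compact sketch of this multi-head encoder data flow (all module sizes, the token layout, and the class name are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class EstimatorSketch(nn.Module):
    """Depth (CNN) + proprioception (MLP) -> shared transformer -> GRU memory."""
    def __init__(self, prop_dim: int = 45, h1: int = 10, d_model: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(                      # 2 stacked depth frames as channels
            nn.Conv2d(2, 16, 5, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.mlp = nn.Sequential(nn.Linear(prop_dim * h1, 128), nn.ELU(),
                                 nn.Linear(128, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.gru = nn.GRU(2 * d_model, 128, batch_first=True)

    def forward(self, depth, prop_hist, hidden=None):
        tokens = torch.stack([self.cnn(depth), self.mlp(prop_hist)], dim=1)
        fused = self.fusion(tokens)            # cross-modal attention over 2 tokens
        x = fused.flatten(1).unsqueeze(1)      # one timestep for the recurrent memory
        mem, hidden = self.gru(x, hidden)
        return mem.squeeze(1), hidden          # decoder/estimation heads attach here
```

The separate output heads for the height-map latent, $\mathbf{z}_t$, $\hat{\mathbf{v}}_t$, and $\hat{\mathbf{h}}_t^f$ would be small linear layers reading from the returned memory vector.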
Level 2: Nature of Estimated Values (Encoded Latent Vector vs. Explicit Physical Quantity) This level concerns whether the output vector is a compressed, abstract representation or a direct physical measurement.
- Encoded Latent Vectors (the height-map latent and $\mathbf{z}_t$): These are outputs from the GRU that are further processed by decoders for reconstruction. The benefit of using encoded latent vectors is compression and dimension reduction, which aids noise reduction and increases robustness.
- Explicit Physical Quantities ($\hat{\mathbf{v}}_t$, $\hat{\mathbf{h}}_t^f$): These are explicitly estimated by the estimator and are directly interpretable as physical measurements.
  - $\hat{\mathbf{v}}_t$: estimated base velocity. Explicitly estimating this helps with velocity tracking during training.
  - $\hat{\mathbf{h}}_t^f$: estimated foot clearance. This provides direct information about the height of the terrain relative to the robot's feet, which is crucial for understanding terrain in parkour scenarios (e.g., stepping on edges, clearing obstacles).
- Loss for Level 2: Mean-Squared-Error (MSE) loss is used for both $\hat{\mathbf{v}}_t$ (against the ground truth base velocity $\mathbf{v}_t$) and $\hat{\mathbf{h}}_t^f$ (against the ground truth foot clearance $\mathbf{h}_t^f$).
Overall Training Loss for the Estimator:
The total training loss for the estimator combines all these objectives:
$
\mathcal{L} = D_{\mathrm{KL}} \big( q ( \mathbf{z}_t \mid \mathbf{o}_t^{H_1} , \mathbf{d}_t^{H_2} ) \,\|\, p ( \mathbf{z}_t ) \big) + \mathrm{MSE} \big( \hat{\mathbf{o}}_{t+1} , \mathbf{o}_{t+1} \big) + \mathrm{MSE} ( \hat{\mathbf{m}}_t , \mathbf{m}_t ) + \mathrm{MSE} ( \hat{\mathbf{v}}_t , \mathbf{v}_t ) + \mathrm{MSE} ( \hat{\mathbf{h}}_t^f , \mathbf{h}_t^f )
$
Where:
- $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$: KL divergence loss.
  - $q(\mathbf{z}_t \mid \mathbf{o}_t^{H_1}, \mathbf{d}_t^{H_2})$: the posterior distribution of the latent vector $\mathbf{z}_t$, conditioned on the historical proprioceptive observations $\mathbf{o}_t^{H_1}$ and temporal depth images $\mathbf{d}_t^{H_2}$. This is learned by the encoder part of the VAE.
  - $p(\mathbf{z}_t)$: the prior distribution of the latent vector $\mathbf{z}_t$, parameterized as a standard normal distribution (mean 0, variance 1). This term encourages the learned latent space to be well structured.
- $\mathrm{MSE}(\hat{\mathbf{o}}_{t+1}, \mathbf{o}_{t+1})$: Mean-Squared-Error loss between the predicted successor proprioceptive state and the ground truth successor proprioceptive state.
- $\mathrm{MSE}(\hat{\mathbf{m}}_t, \mathbf{m}_t)$: Mean-Squared-Error loss between the reconstructed height map and the ground truth height map.
- $\mathrm{MSE}(\hat{\mathbf{v}}_t, \mathbf{v}_t)$: Mean-Squared-Error loss between the estimated base velocity and the ground truth base velocity.
- $\mathrm{MSE}(\hat{\mathbf{h}}_t^f, \mathbf{h}_t^f)$: Mean-Squared-Error loss between the estimated foot clearance and the ground truth foot clearance.
4.2.5. Training Details
1. Simulation Platform:
- The actor, critic, and estimator are trained in simulation using Isaac Gym, a high-performance GPU-accelerated robotics simulator.
- Isaac Gym is configured with 4096 parallel environments, enabling massively parallel training.
- Leveraging NVIDIA Warp (a Python framework for high-performance GPU programming), the training process completes 10,000 iterations in under 20 hours on an NVIDIA RTX 4090 GPU. This efficiency allows rapid development and deployment of the network.
2. Training Curriculum:
- A training curriculum is employed, progressively increasing terrain difficulty over time. This allows the policy to first master simpler tasks and then gradually adapt to more challenging environments (a sketch of one possible curriculum rule follows this list).
- Parkour terrains in simulation include:
  - Gaps: up to 1 m wide.
  - Steps: up to 0.75 m high.
  - Hurdles: up to 0.75 m high (covered by step randomization).
  - Stairs: with a height up to 0.25 m.
- Lateral velocity commands and horizontal angular velocity commands are each sampled from fixed ranges.
- Terrain Randomization: To ensure the policy generalizes beyond fixed simulation paradigms, various terrains are randomized. For example, the depth and width of gaps are randomized to train the robot to handle scenarios where a large gap might be mistaken for flat ground between two steps, requiring a jump down and then up.
3. Domain Randomization:
To enhance the robustness of the network trained in simulation and facilitate smooth sim-to-real transfer, various physical and sensor parameters are randomized during training.
The following are the results from Table II of the original paper:
| Parameter | Randomization range | Unit |
|---|---|---|
| Payload | [−1, 2] | kg |
| Kp factor | [0.9, 1.1] | Nm/rad |
| Kd factor | [0.9, 1.1] | Nms/rad |
| Motor strength factor | [0.9, 1.1] | Nm |
| Center of mass shift | [−50, 50] | mm |
| Friction coefficient | [0.2, 1.2] | - |
| Initial joint positions | [0.5, 1.5] | rad |
| System delay | [0, 15] | ms |
| Camera position (x) | [−10, 10] | mm |
| Camera position (y) | [−10, 10] | mm |
| Camera position (z) | [−10, 10] | mm |
| Camera pitch | [−1, 1] | deg |
| Camera horizontal FOV | [86, 88] | deg |
These parameters are randomized to introduce variability that mimics real-world uncertainties and variations, thereby improving the policy's ability to generalize to unseen real-world conditions. Examples include variations in robot weight (Payload), joint stiffness (Kp factor), damping (Kd factor), motor capabilities (Motor strength factor), balance (Center of mass shift), ground interaction (Friction coefficient), initial conditions (Initial joint positions), control loop latency (System delay), and sensor noise/placement (Camera position, Camera pitch, Camera horizontal FOV).
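Operationally, such randomization amounts to drawing each parameter uniformly from its range at every episode reset; a minimal sketch over a few rows of Table II (the dictionary keys and the reset hook are assumptions):

```python
import numpy as np

RANGES = {                      # subset of Table II: (low, high)
    "payload_kg":      (-1.0, 2.0),
    "kp_factor":       (0.9, 1.1),
    "friction":        (0.2, 1.2),
    "system_delay_ms": (0.0, 15.0),
}

def sample_domain_params(rng: np.random.Generator) -> dict:
    """Draw one randomized parameter set for an episode reset."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}

params = sample_domain_params(np.random.default_rng(0))
```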
5. Experimental Setup
The effectiveness of the PIE framework was evaluated through ablation studies and comparisons with existing parkour frameworks in both simulation and real-world environments.
5.1. Datasets
The experiments primarily used simulated parkour terrains for training and evaluation. These terrains were procedurally generated and randomized to include a variety of obstacles at different difficulty levels.
- Types of Terrains:
  - Gaps: open spaces the robot must jump over.
  - Steps/Hurdles: obstacles of varying heights the robot must climb over or jump onto/off.
  - Stairs: a series of steps with varying heights and widths.
  - Ramps: inclined surfaces (used for generalization testing in the real world, not specifically trained on).
- Difficulty Levels: For simulation metrics, terrains were configured with ten levels of increasing difficulty.
- Real-World Environments:
- Indoor Labs: Structured parkour courses with fabricated steps, gaps, and stairs for controlled testing and direct comparison.
- Outdoor Environments:
  - Long-distance hike: a 2 km round trip on a mountain trail with an elevation gain of 153 m, featuring continuous curved stairs of varying heights/widths, irregularly shaped steps/hurdles, steep ramps, deformable/slippery ground, and rocky surfaces.
  - Dark outdoor conditions: nighttime tests involving steps, irregular rocks, slopes, and stairs, challenging the depth camera's performance.

The datasets (simulated terrains) were chosen to cover a wide range of parkour challenges, ensuring the learned policy is robust and versatile. The real-world environments then validate the sim-to-real transfer capabilities and generalization.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
5.2.1. Mean Terminated Difficulty Level
- Conceptual Definition: This metric quantifies the average difficulty level of terrain that a robot can successfully traverse before terminating due to a fall or collision. It is specifically used in simulation experiments to assess the policy's maximum traversal capability across different terrain types (gap, stairs, step). Terrains are arranged in a curriculum-like fashion with increasing difficulty levels (e.g., 1 to 10).
- Mathematical Formula: The paper does not provide an explicit formula, but conceptually the metric is the average of the maximum difficulty levels reached across multiple trials and environments for a given terrain type. Let $L_{i,j}$ be the maximum difficulty level reached by robot $j$ in environment set $i$ for a given terrain type. Then
$
\text{Mean Terminated Difficulty Level} = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} L_{i,j}
$
- Symbol Explanation:
  - $N$: total number of environment sets created for a specific terrain type (e.g., 40 sets).
  - $M$: number of robots tested simultaneously in each environment set (e.g., 100 robots).
  - $L_{i,j}$: the highest difficulty level successfully completed by robot $j$ in environment set $i$ before termination.
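Computed over the grid of environment sets and robots, the metric reduces to a plain average; a small sketch under that assumption:

```python
import numpy as np

def mean_terminated_difficulty(levels: np.ndarray) -> float:
    """levels[i, j] = highest difficulty level reached by robot j in
    environment set i (e.g., an N=40 by M=100 array of trial outcomes)."""
    return float(levels.mean())  # grand mean over all sets and robots

# e.g., 40 environment sets x 100 robots, levels in 1..10
demo = np.random.default_rng(0).integers(1, 11, size=(40, 100))
print(mean_terminated_difficulty(demo))
```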
5.2.2. Average Success Rate
- Conceptual Definition: This metric measures the percentage of trials where the robot successfully completes a given task or traverses a specific terrain without falling or colliding. It is used in both simulation (for camera error studies) and real-world experiments (for ablations and SOTA comparisons).
- Mathematical Formula:
$
\text{Average Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%
$
- Symbol Explanation:
  - Number of Successful Trials: the count of attempts where the robot completed the task as defined (e.g., traversed the obstacle, reached the goal) without failure.
  - Total Number of Trials: the total number of attempts made for a specific task or terrain.
5.2.3. Parkour Abilities (Relative Size of Traversable Obstacles)
- Conceptual Definition: This metric provides a qualitative and quantitative comparison of how large an obstacle a robot can reliably traverse relative to its own physical dimensions (height or length). It's a key indicator of parkour performance, typically expressed as a multiple (e.g., "3x robot height").
- Mathematical Formula: The paper does not provide a formal formula, but it is conceptually calculated as: $ \text{Obstacle Ratio} = \frac{\text{Maximum Traversable Obstacle Dimension}}{\text{Robot's Relevant Dimension}} $
- Symbol Explanation:
  - Maximum Traversable Obstacle Dimension: the largest height (for steps/stairs) or length (for gaps) of an obstacle that the robot can consistently overcome.
  - Robot's Relevant Dimension: the robot's thigh joint height (for steps/stairs) or the distance between its two thigh joints (for gaps), as defined in the paper.
5.3. Baselines
The paper compared PIE against several baselines:
5.3.1. Ablation Study Baselines (PIE variants):
These variants were designed to verify the effectiveness of different components of PIE's dual-level implicit-explicit estimation.
- PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$: removes the implicit estimation of the robot's successor proprioceptive state. It relies solely on explicit terrain estimation from vision and the other explicit estimates.
- PIE w/o reconstructing $\hat{\mathbf{m}}_t$: lacks the explicit estimation of the terrain (height map reconstruction). It uses both proprioception and exteroception but only reconstructs $\mathbf{o}_{t+1}$ as an implicit estimation of the surroundings.
- PIE w/o estimating $\hat{\mathbf{v}}_t$: trained without explicitly estimating the base velocity ($\hat{\mathbf{v}}_t$) as an input to the actor.
- PIE w/o estimating $\hat{\mathbf{h}}_t^f$: trained without explicitly estimating the foot clearance ($\hat{\mathbf{h}}_t^f$) as an input to the actor.
- PIE using predicted $\hat{\mathbf{o}}_{t+1}$: the policy network uses the directly predicted successor proprioceptive state ($\hat{\mathbf{o}}_{t+1}$) as input, instead of the purely latent vector ($\mathbf{z}_t$) used for reconstruction. This tests whether a direct, higher-dimensional prediction is better than a compressed latent representation.

5.3.2. Prior Work Baselines
These are state-of-the-art methods in robot parkour:
- Hoeller et al. [1]: AnymalC robot, hierarchical pipeline for parkour.
- Zhuang et al. [2]: Unitree-A1 robot, multi-stage method using soft/hard constraints.
- Cheng et al. [3]: Unitree-A1 robot, two-stage framework with waypoints for guidance.

5.3.3. Robot Hardware
- PIE was deployed on a DEEP Robotics Lite3 quadruped robot.
  - Specifications: thigh joint height 25 cm, distance between two thigh joints 34 cm, weight 12.7 kg, peak knee joint torque 30.5 Nm.
  - Sensors: onboard Intel RealSense D435i depth camera (10 Hz), joint encoders, IMU.
  - Processor: onboard Rockchip RK3588.
- The comparison baselines Zhuang et al. [2] and Cheng et al. [3] used the Unitree A1 robot.
  - Specifications: thigh joint height 26 cm, distance between two thigh joints 40 cm, weight 12 kg, peak knee joint torque 33.5 Nm.
- The paper notes that the specifications of the Lite3 are comparable to the Unitree A1, making a direct comparison of parkour abilities reasonable.

6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate PIE's superior performance in parkour, its robustness to sensor errors, and its excellent sim-to-real transfer capabilities.

6.1.1. Simulation Experiments
In simulation, PIE was evaluated against its ablations on various terrains (gap, stairs, step) with 10 increasing difficulty levels.

The following are the results from Table III of the original paper (mean terminated difficulty level by terrain):

| Method | Gap | Stairs | Step |
|---|---|---|---|
| PIE (ours) | 9.9 | 9.86 | 9.81 |
| PIE w/o $\hat{\mathbf{o}}_{t+1}$ | 9.51 | 9.45 | 9.62 |
| PIE w/o $\hat{\mathbf{h}}_t^f$ | 7.41 | 7.36 | 3.09 |
| PIE w/o $\hat{\mathbf{v}}_t$ | 8.7 | 8.22 | 8.48 |
| PIE w/o $\hat{\mathbf{m}}_t$ | 9.75 | 4.25 | 1.67 |
| PIE using predicted $\hat{\mathbf{o}}_{t+1}$ | 9.23 | 7.28 | 3.29 |

- PIE (ours) consistently achieved the highest mean terminated difficulty levels across all terrain types (9.9 for Gap, 9.86 for Stairs, 9.81 for Step), validating the effectiveness of the full dual-level implicit-explicit estimation framework.
- PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$ performed commendably, but slightly below the full PIE. Its reliance solely on explicit terrain estimation implies a less comprehensive terrain understanding, making it more susceptible to falls on the most challenging terrains due to minor deviations in foothold. This highlights the benefit of implicit successor state prediction.
- PIE w/o estimating $\hat{\mathbf{h}}_t^f$ showed a significant decline, especially on Step terrains (3.09). This indicates that explicit estimation of foot clearance is crucial for the robot's direct understanding of the terrain beneath its feet, essential for extreme parkour maneuvers along terrain edges.
- PIE w/o estimating $\hat{\mathbf{v}}_t$ also demonstrated reduced performance. Without an explicit estimate of base velocity, the policy introduces biases in velocity tracking, diminishing maneuverability on high-difficulty terrains.
- PIE w/o reconstructing $\hat{\mathbf{m}}_t$ performed poorly, particularly on Stairs (4.25) and Step (1.67). This shows that regressing a reconstructed height map from the latent vector is vital for extracting useful terrain information from the depth image; without it, the robot struggles with complex terrains.
- PIE using predicted $\hat{\mathbf{o}}_{t+1}$ struggled significantly on Step (3.29) and Stairs (7.28). This suggests that using the raw predicted successor proprioceptive state as input is less effective than a compressed purely latent vector ($\mathbf{z}_t$). The distribution of $\hat{\mathbf{o}}_{t+1}$ is more complex than a standard normal distribution, making it harder for the policy network to extract useful information directly, reinforcing the benefit of the VAE structure and latent representations.

The following figure (Figure 3 from the original paper) shows the average success rate of PIE and PIE w/o $\hat{\mathbf{o}}_{t+1}$ under various camera input errors:
The figure is a chart showing the average success rate of PIE and PIE w/o $\hat{\mathbf{o}}_{t+1}$ under different camera input errors; it includes results for five types of camera error, showing the robots' performance under each condition.
The figure illustrates the average success rate of PIE and PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$ under five types of camera input errors, beyond standard domain randomization.
- Overall Performance: PIE consistently outperforms PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$ across all terrains and camera error types.
- Impact of Noise: The performance disparity between the two methods becomes more significant when considerable noise is added to the camera. This strongly suggests that the implicit estimation via $\hat{\mathbf{o}}_{t+1}$ is crucial for robustness against unreliable depth cameras and various interferences.
- Vision-Proprioception Trust: The results imply that with $\hat{\mathbf{o}}_{t+1}$ estimation, the policy can make correct decisions even when vision contradicts proprioception, placing more trust in proprioception when visual inputs are compromised. This enables PIE to better acquire robot state estimation and terrain understanding under challenging visual conditions.
6.1.2. Real-World Indoor Experiments
Real-world experiments were conducted on a DEEP Robotics Lite3 robot, comparing PIE with its ablations and prior works on step, gap, stairs, and ramp terrains (ramp was for generalization testing). Each method was tested for ten trials on each difficulty level.
The following figure (Figure 4 from the original paper) shows the comparative success rates of PIE and its ablations on various real-world terrains:
The figure shows four different obstacle types (step, gap, stairs, and ramp) with the corresponding success-rate charts. Successful trials for each obstacle are shown below it, using different variants of the PIE framework; the charts on the right show how the robots' success rates vary across conditions and provide comparisons with other methods.
- Outstanding Performance of PIE: PIE demonstrates outstanding performance across all skills and terrains in the real world, surpassing all ablations and previous related works. It enables the robot to:
  - climb obstacles as high as 0.75 m (3x robot height);
  - leap over gaps as large as 1 m (3x robot length);
  - climb stairs as high as 0.25 m (1x robot height).
- Significant Improvement: These results represent a performance improvement of at least 50% compared to state-of-the-art robot parkour frameworks, as summarized in Table IV.
- Remarkable Sim-to-Real Transferability: PIE exhibits success rates consistent with its simulation performance, showcasing excellent sim-to-real transferability without extensive fine-tuning.
- Generalization to Unseen Terrains: Despite no specific training on ramp terrains in simulation, PIE demonstrates better generalization performance on ramps than the ablations.

The following are the results from Table IV of the original paper:

| Method | Robot | Step | Gap | Stairs |
|---|---|---|---|---|
| Hoeller et al. [1] | AnymalC | 2× | 1.5× | 0.5× |
| Zhuang et al. [2] | Unitree-A1 | 1.6× | 1.5× | - |
| Cheng et al. [3] | Unitree-A1 | 2× | 2× | - |
| PIE (ours) | DEEP Robotics Lite3 | 3× | 3× | 1× |
This table provides a direct comparison of Parkour Abilities (relative to robot height/length). PIE significantly outperforms all previous works, achieving 3x for both Step and Gap and 1x for Stairs.
**Ablation Performance in Real-World:**

- **`PIE w/o reconstructing` $\hat{\mathbf{o}}_{t+1}$:** Performed relatively better among the ablations, but its performance noticeably decreased compared to simulation due to larger `delay` and `noise` in real-world perception and actuation. This further underscores the importance of the implicit $\hat{\mathbf{o}}_{t+1}$ estimation for real-world robustness.
- **`PIE w/o estimating` $\hat{\mathbf{v}}_t$:** As terrain difficulty increased, the estimation of `base velocity` deteriorated, leading to a noticeable decline in success rates when following velocity commands.
- **`PIE w/o reconstructing` $\hat{\mathbf{m}}_t$:** Faced significant challenges in extracting useful terrain information from real-world depth images, which are more complicated than simulated ones. External perception became an interference rather than an assistance, leading to near-zero success rates across all terrains. This highlights the crucial role of the explicit `height map reconstruction` in real-world visual understanding.

6.1.3. Real-World Outdoor Experiments

To assess robustness and generalization in highly disturbed outdoor settings, extensive tests were conducted.

The following figure (Figure 5 from the original paper) shows a quadruped robot performing parkour tasks over various terrains during a long-distance hike:

*The figure illustrates a quadruped robot performing parkour tasks over various terrains; it includes the robot's trajectory and photos of multiple natural terrains, validating the effectiveness of the proposed PIE framework, with elevation markers at 180 m and 27 m.*

- **Long-Distance Hike:** The robot completed a `2 km round trip` from ZJU Yuquan campus to Laohe Mountain, with an `elevation gain of 153 m`, in just `40 minutes` without stops. This challenging trail included continuous curved stairs of varying heights and widths, irregularly shaped steps and hurdles, steep ramps, deformable and slippery ground, and rocky surfaces. This demonstrates remarkable robustness and sustained performance in complex, unstructured natural environments.

The following figure (Figure 6 from the original paper) shows tests in a dark outdoor environment:

*The figure shows the robot being tested in a dark outdoor environment. Despite the near absence of visible light, the robot still accurately executes agile maneuvers and performs well on difficult parkour tasks.*

- **Dark Outdoor Conditions:** Tests were successfully conducted at night in dim outdoor conditions. Despite the near absence of visible light, the robot accurately performed agile maneuvers, including continuously jumping over high steps and irregular rocks, and climbing up and down slopes and stairs. This showcases the depth camera's utility and the framework's robustness even when visual conditions are extremely challenging for perception.

6.2. Ablation Studies / Parameter Analysis

The ablation studies, detailed in Section 6.1.1 (Simulation Experiments) and Section 6.1.2 (Real-World Indoor Experiments), rigorously verified the contribution of each component of the `dual-level implicit-explicit estimation` framework (a plausible combined training objective is sketched after this list):

* **Importance of Implicit Successor State Estimation ($\hat{\mathbf{o}}_{t+1}$):** Removing it degraded performance across terrains and, above all, robustness to camera errors and to real-world delay and noise.
* **Importance of Explicit Height Map Reconstruction ($\hat{\mathbf{m}}_t$):** Without regressing $\hat{\mathbf{m}}_t$, the policy failed to extract useful terrain information from depth images, with near-zero real-world success rates.
* **Importance of Explicit Base Velocity Estimation ($\hat{\mathbf{v}}_t$):** Its removal biased velocity tracking and reduced maneuverability on high-difficulty terrains.
* **Importance of Explicit Foot Clearance Estimation ($\hat{\mathbf{h}}_t^f$):** Its removal impaired the robot's direct understanding of the terrain beneath its feet, which is essential along terrain edges.
* **Benefit of the Latent Representation ($\mathbf{z}_t$ over predicted $\mathbf{o}_{t+1}$):** `PIE using predicted` $\mathbf{o}_{t+1}$ performed worse than using the purely latent vector ($\mathbf{z}_t$). This suggests that the compressed, regularized latent representation from the `VAE` is more informative and easier for the policy to use than a high-dimensional, potentially noisy raw prediction of the successor state, especially given the complex distribution of $\mathbf{o}_{t+1}$.
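As a concrete reading of how these pieces combine, a training objective consistent with the ablations above would pair the VAE's reconstruction and KL terms with regression losses on the explicit estimates. The form below is a plausible sketch, not the paper's exact loss; the weights $\beta$ and $\lambda_{(\cdot)}$ are hypothetical:

$$
\mathcal{L}_{\text{est}} = \underbrace{\big\| \hat{\mathbf{o}}_{t+1} - \mathbf{o}_{t+1} \big\|^2 + \beta\, D_{\mathrm{KL}}\big( q(\mathbf{z}_t)\,\big\|\,\mathcal{N}(\mathbf{0},\mathbf{I}) \big)}_{\text{implicit (VAE)}} + \underbrace{\lambda_v \big\| \hat{\mathbf{v}}_t - \mathbf{v}_t \big\|^2 + \lambda_h \big\| \hat{\mathbf{h}}_t^f - \mathbf{h}_t^f \big\|^2 + \lambda_m \big\| \hat{\mathbf{m}}_t - \mathbf{m}_t \big\|^2}_{\text{explicit regression}}
$$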
The `domain randomization` parameters (Table II) were crucial for `sim-to-real transfer`. By varying physical properties (mass, friction, CoM) and sensor characteristics (camera position, FOV, system delay), the policy was trained to be robust to real-world discrepancies, enabling `zero-shot deployment` (a hypothetical sampler in this spirit is sketched below).
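For illustration, episode-level randomization of this kind is often implemented as a simple sampler; the parameter names and ranges below are assumptions in the spirit of Table II, not the paper's values.

```python
# Hypothetical domain-randomization sampler in the spirit of Table II.
# Parameter names and ranges are illustrative assumptions, not the paper's values.
import random

def sample_episode_randomization():
    return {
        "added_base_mass_kg":  random.uniform(-1.0, 2.0),
        "friction_coeff":      random.uniform(0.4, 1.2),
        "com_offset_m":        [random.uniform(-0.03, 0.03) for _ in range(3)],
        "camera_pos_offset_m": [random.uniform(-0.01, 0.01) for _ in range(3)],
        "camera_fov_deg":      random.uniform(85.0, 89.0),
        "system_delay_ms":     random.uniform(0.0, 30.0),
    }

# Resample at the start of every training episode so the policy
# never overfits to a single physical configuration.
params = sample_episode_randomization()
```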
6.3. Emergent Behaviors
The paper notes several emergent behaviors that highlight the framework's sophistication:
- **Natural and Agile Gait:** Despite using a simple reward function and no imitation learning, the robot developed a natural and agile gait, seamlessly transitioning across complex terrains, akin to real animals. The authors hypothesize this is due to the ability to predict the `successor state`, which improves delay management and fosters an `internal model` of the robot itself and its environment.
- **Robustness in Emergency Scenarios:** The robot demonstrated timely and accurate responses even in scenarios where external perception had slight deviations (e.g., being tripped, misstepping during a jump takeoff). This is particularly impressive for `Reinforcement Learning`, which is not typically adept at precise maneuvers.

The following figure (Figure 7 from the original paper) illustrates the robot's ability to quickly regain stability despite estimation inaccuracies during intense maneuvers:
*The figure is a sequence illustration of a quadruped robot jumping and stabilizing in different environments. Despite distance estimation errors during a jump, the robot still completes the leap and lands steadily, and it quickly regains stability upon encountering a sudden step void while climbing stairs.*
The figure illustrates two examples of the robot's robust recovery:
- Gap Jump Recovery: When leaping over a gap, even if distance estimation errors caused the front and rear legs to not fully support on the platform before takeoff, the robot still successfully executed the jump and landed smoothly.
- Stair Void Recovery: When encountering a sudden step void while ascending stairs, the robot promptly stabilized itself and continued stepping upwards.
These examples showcase `PIE`'s ability to handle perception-actuation discrepancies and maintain stability through dynamic recovery.
7. Conclusion & Reflections
7.1. Conclusion Summary
In this work, the authors proposed PIE (Parkour with Implicit-Explicit learning framework for legged robots), a novel one-stage, end-to-end learning-based framework designed to enhance the parkour capabilities of legged robots. The core innovation lies in its dual-level implicit-explicit estimation mechanism, which refines the robot's understanding of both its own state and its environment.
Through extensive experiments in simulation and real-world environments, PIE demonstrated:
- **Superior Parkour Performance:** It significantly pushed the limits of robot parkour, enabling a low-cost quadruped robot (DEEP Robotics Lite3) to traverse obstacles up to 3× its height/length and stairs up to 1× its height. This represents a substantial improvement (at least 50%) over existing state-of-the-art learning-based parkour frameworks.
- **Robustness to Perception Errors:** The `implicit successor state estimation` proved crucial for maintaining performance even with unreliable and noisy camera inputs.
- **Effective Sim-to-Real Transfer:** The framework, trained entirely in simulation with `domain randomization`, achieved successful `zero-shot deployment` in diverse real-world indoor and challenging outdoor environments (including a mountain hike and nighttime tests), showcasing remarkable generalization capabilities and robustness.
- **Efficient Training:** It achieved these results with a relatively simple training process and reward function, without relying on complex imitation learning or intricate behavior constraints.

Overall, `PIE` presents a unified policy that robustly integrates `proprioceptive` and `exteroceptive` information to enable highly agile and generalized `parkour` locomotion for legged robots.
7.2. Limitations & Future Work
The authors acknowledge several limitations of the current PIE framework and propose future research directions:
- **Lack of 3D Terrain Understanding:** The current framework primarily relies on 2D height maps or latent representations derived from depth images, meaning it lacks a full 3D understanding of the terrain. Consequently, the robot is unable to crouch under obstacles.
- **Limited Semantic Information from Perception:** The external perception relies solely on `depth images`, which provide geometric information but lack the richer `semantic information` that could be extracted from `RGB images` (e.g., identifying object types, or traversability cues beyond height).
- **Training Confined to Static Environments:** The current training is conducted in `static environments`. It has not been extended to `dynamic scenes` (e.g., moving obstacles, changing ground conditions), which could potentially lead to `confusion in visual estimation` and compromise performance.

In the future, the authors aim to:

- **Design a New Unified Learning-Based Sensorimotor Integration Framework:** This framework would extract `3D terrain information` from `depth images` and obtain `abundant semantic information` from `RGB images`.
- **Achieve Better Adaptability and Mobility in Various Environments:** The goal is to enhance the robot's capabilities to handle a wider range of complex and dynamic environments by leveraging richer, multi-modal perceptual inputs.
7.3. Personal Insights & Critique
This paper presents a highly impressive advancement in legged robot locomotion, particularly for parkour.
Inspirations and Strengths:
- **Elegance of One-Stage Learning:** Moving from a two-stage teacher-student paradigm to a one-stage end-to-end learning framework is a significant simplification that inherently reduces potential information bottlenecks and training complexity. The `asymmetric actor-critic` architecture is cleverly utilized within this one-stage design.
- **Power of Implicit-Explicit Estimation:** The `dual-level implicit-explicit estimation` is the core innovation. The idea that a robot can implicitly infer its future state and surroundings, effectively building an "internal model" through `successor proprioceptive state prediction`, is very powerful. Combined with the explicit estimates (height map, base velocity, foot clearance), it creates a robust and comprehensive understanding that is resilient to noisy sensors. This approach hints at a more generalizable form of intelligence in which prediction and self-awareness augment direct perception.
- **Remarkable Sim-to-Real Transfer:** The `zero-shot deployment` success, especially on harsh outdoor terrain and in dim light, is a testament to the effectiveness of the chosen methodology and the `domain randomization` strategy. Achieving such high performance on a low-cost robot is also commendable, suggesting that advanced locomotion capabilities are becoming more accessible.
- **Emergent Agile Behaviors:** The observation that the robot developed natural and agile gaits and displayed robust recovery behaviors despite a simple reward function is fascinating. It suggests that optimizing for successor state prediction and basic locomotion objectives, coupled with a rich state representation, can lead to highly sophisticated, animal-like movements. This aligns with theories of predictive coding in biological brains.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- **"Low-Cost Robot" Quantification:** While the paper mentions using a "low-cost quadruped robot," it doesn't provide a quantitative comparison of its cost relative to the `AnymalC` or even the `Unitree A1` (which is often considered a more accessible research platform). A clearer definition or comparison of "low-cost" would strengthen this claim.
- **"Unreliable Egocentric Depth Camera" Quantification:** The term "unreliable" is subjective. While the experiments show robustness to noise, quantifying the typical noise levels, latency, and failure modes of the Intel RealSense D435i camera in different environments would provide a more rigorous basis for the robustness claims.
- **Scalability to More Complex Environments:** The current `parkour` terrains, while challenging, are often discrete obstacles. While the outdoor hike is impressive, general unstructured environments might involve more subtle traversability cues, dynamic elements, or terrains that require finer manipulation (e.g., pushing objects, navigating through dense foliage). The lack of full 3D understanding (as noted by the authors) might limit performance in such scenarios.
- **Energy Efficiency:** While `joint power` is penalized in the reward function, a deeper analysis of energy consumption during various maneuvers, especially the high-impact jumps, would be valuable for long-duration deployments.
- **Lack of Explicit High-Level Planning:** The framework focuses on reactive control. For truly complex, multi-objective navigation (e.g., reaching a distant goal while minimizing energy and avoiding specific obstacles), integrating a high-level planner that leverages the robust terrain understanding could be a fruitful direction.
Overall, `PIE` is an excellent contribution that pushes the boundaries of agile legged locomotion. Its `implicit-explicit estimation` paradigm is a powerful concept that could generalize to other complex robotic control tasks where robust perception and internal modeling are critical.