
PIE: Parkour with Implicit-Explicit Learning Framework for Legged Robots

Published: 08/25/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The PIE framework enhances legged robots' parkour capabilities by utilizing dual-level implicit-explicit estimation, enabling even low-cost robots to perform exceptionally on challenging terrains through simple training and successful zero-shot deployment.

Abstract

Parkour presents a highly challenging task for legged robots, requiring them to traverse various terrains with agile and smooth locomotion. This necessitates comprehensive understanding of both the robot's own state and the surrounding terrain, despite the inherent unreliability of robot perception and actuation. Current state-of-the-art methods either rely on complex pre-trained high-level terrain reconstruction modules or limit the maximum potential of robot parkour to avoid failure due to inaccurate perception. In this paper, we propose a one-stage end-to-end learning-based parkour framework: Parkour with Implicit-Explicit learning framework for legged robots (PIE) that leverages dual-level implicit-explicit estimation. With this mechanism, even a low-cost quadruped robot equipped with an unreliable egocentric depth camera can achieve exceptional performance on challenging parkour terrains using a relatively simple training process and reward function. While the training process is conducted entirely in simulation, our real-world validation demonstrates successful zero-shot deployment of our framework, showcasing superior parkour performance on harsh terrains.

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of the paper is a framework for enabling legged robots to perform parkour maneuvers, titled PIE: Parkour with Implicit-Explicit Learning Framework for Legged Robots.

1.2. Authors

The authors are Shixin Luo, Songbo Li, Ruiqi Yu, Zhicheng Wang, Jun Wu, and Qiuguo Zhu. Shixin Luo and Songbo Li are noted as contributing equally to this work. Qiuguo Zhu is the corresponding author. Their affiliations are:

  • Institute of Cyber-Systems and Control, Zhejiang University, 310027, China.
  • State Key Laboratory of Industrial Control Technology, 310027, China (for Qiuguo Zhu and Jun Wu).

1.3. Journal/Conference

The paper is available on arXiv as preprint arXiv:2408.13740. The manuscript dates (received April 23, 2024; revised July 25, 2024; accepted August 23, 2024) and the note "This paper was recommended for publication by Editor Aleksandra Faust upon evaluation of the Associate Editor and Reviewers' comments" indicate acceptance at a peer-reviewed venue. This editorial workflow (Editor, Associate Editor, Reviewers) is characteristic of IEEE Robotics and Automation Letters (RA-L), which is the most plausible venue for this work.

1.4. Publication Year

The paper was published on August 25, 2024 (2024-08-25 UTC).

1.5. Abstract

The paper addresses the challenge of enabling legged robots to perform parkour, a task requiring agile and smooth locomotion over diverse terrains. This necessitates a comprehensive understanding of the robot's state and surroundings, despite inherent unreliability in robot perception and actuation. Current state-of-the-art methods either rely on complex, pre-trained high-level terrain reconstruction modules or limit robot parkour potential to avoid failures from inaccurate perception.

The authors propose PIE (Parkour with Implicit-Explicit learning framework for legged robots), a one-stage, end-to-end learning-based parkour framework. PIE employs a dual-level implicit-explicit estimation mechanism. This approach allows even low-cost quadruped robots, equipped with unreliable egocentric depth cameras, to achieve exceptional parkour performance on challenging terrains with a relatively simple training process and reward function. The framework is trained entirely in simulation, and its real-world validation demonstrates successful zero-shot deployment, showcasing superior parkour performance on harsh terrains.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is enabling legged robots, specifically quadruped robots, to perform parkour maneuvers with high agility, smoothness, and robustness across varied and challenging terrains. Parkour involves navigating obstacles by running, jumping, and climbing, tasks that demand precise balance, stability, and real-time environmental understanding.

This problem is important because successful implementation could open up new possibilities for robot technology in extreme environments, driving advancements in real-world applications such as search and rescue, exploration in unstructured environments, and logistics in complex terrains.

Specific challenges or gaps in prior research include:

  • Unreliability of Perception and Actuation: Robot sensors (like depth cameras) and actuators (motors) are inherently noisy and have latency, making precise real-time terrain estimation and control difficult, especially for high-stakes maneuvers near edges or during jumps.

  • Complexity of State-of-the-Art Methods: Existing methods often rely on complex, pre-trained high-level terrain reconstruction modules, which complicate the overall system and training.

  • Limited Performance Potential: To compensate for unreliable perception, many approaches limit the maximum potential of robot parkour, avoiding challenging scenarios where inaccurate perception could lead to failure.

  • Two-Stage Training Paradigms: Many learning-based parkour approaches use a two-stage training process (e.g., teacher-student policies), which can lead to information loss and performance degradation in the deployed student policy.

  • Lack of Seamless Behavior Integration: Integrating multiple complex behaviors (running, jumping, climbing) seamlessly into a neural network using a simple training process and reward function remains challenging.

  • Focus on Explicit Terrain Estimation: Prior works primarily focus on explicit terrain estimation, often lacking the implicit understanding of the surroundings, which hinders maximal robot performance.

    The paper's entry point or innovative idea is to propose a one-stage end-to-end learning-based parkour framework called PIE (Parkour with Implicit-Explicit learning framework for legged robots). The key innovation within PIE is its dual-level implicit-explicit estimation mechanism. This mechanism aims to enhance the robot's understanding of its own state and surroundings in a more robust and comprehensive way, pushing the limits of parkour performance even with low-cost robots and unreliable sensors.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • A Novel One-Stage Learning Framework: PIE is introduced as a novel one-stage, end-to-end learning-based parkour framework. This eliminates the complexities and information loss associated with two-stage training paradigms common in prior works.

  • Dual-Level Implicit-Explicit Estimation: The framework leverages a unique dual-level implicit-explicit estimation approach to enhance the quality of estimating the robot's state and surroundings.

    • Level 1 (Understanding State/Surroundings): Integrates real-time proprioception with exteroception to implicitly infer the robot's state and surroundings by estimating its successor state, alongside explicitly estimating terrain from visual data. This improves estimation accuracy and robustness against unreliable sensors.
    • Level 2 (Latent vs. Physical Quantity): Explicitly estimates specific physical quantities (like base velocity and foot clearance) alongside encoded latent vectors, further enhancing the robot's ability to execute complex parkour maneuvers.
  • Exceptional Parkour Capabilities: Experiments demonstrate that PIE significantly enhances the parkour capabilities of quadruped robots. The robot can:

    • Leap onto and jump off steps 3x its height (e.g., 0.75 m for a robot 0.25 m tall).
    • Negotiate gaps 3x its length (e.g., 1 m for a robot 0.34 m long).
    • Climb up and down stairs 1x its height (e.g., 0.25 m for a robot 0.25 m tall). These results represent a significant performance improvement (at least 50% over state-of-the-art frameworks) and push the limits of parkour for quadruped robots.
  • Sim-to-Real Transferability and Robustness: The framework, trained entirely in simulation, demonstrates successful zero-shot deployment in the real world (indoor and challenging outdoor environments) without extensive fine-tuning. This showcases remarkable robustness and generalization capabilities, even in the presence of significant disturbances to the depth camera in outdoor settings.

  • Simplicity in Training: PIE achieves these advanced capabilities using a relatively simple training process and reward function, contrasting with methods requiring intricate reward designs or imitation learning.

    The key conclusions reached are that the dual-level implicit-explicit estimation mechanism, coupled with a one-stage end-to-end learning framework, is highly effective in enabling robust and agile parkour locomotion for legged robots, overcoming the limitations of unreliable perception and achieving impressive sim-to-real transfer.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the PIE paper, a reader should be familiar with several core concepts in robotics, artificial intelligence, and machine learning:

  • Legged Robots / Quadruped Robots: Robots that use legs for locomotion, offering advantages over wheeled robots in traversing uneven, unstructured, or obstacle-laden terrains. Quadruped robots have four legs, mimicking animals like dogs or cats, providing inherent stability and agility.
  • Parkour: A training discipline that involves moving rapidly through an area, typically an urban environment, negotiating obstacles by running, jumping, and climbing. In robotics, it refers to similar agile traversal of challenging and varied terrains.
  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent is not explicitly programmed with how to perform a task but learns through trial and error, observing the state of the environment and the rewards received for its actions.
    • Agent: The learner or decision-maker. In this paper, it's the robot's control policy.
    • Environment: The world with which the agent interacts. For PIE, this is the simulated or real-world terrain and robot physics.
    • State: A complete description of the environment at a given time (e.g., robot's joint angles, velocities, orientation, terrain information).
    • Action: A decision made by the agent that affects the environment (e.g., target joint angles for the robot).
    • Reward: A scalar feedback signal from the environment indicating how well the agent is doing. The agent's goal is to maximize cumulative reward.
    • Policy: A function that maps states to actions, defining the agent's behavior.
  • Deep Learning (DL): A subfield of machine learning that uses neural networks with many layers (deep neural networks) to learn complex patterns from data.
    • Neural Network: A computational model inspired by the structure of the human brain, composed of interconnected neurons (nodes) organized in layers.
    • Multi-Layer Perceptron (MLP): A type of feedforward neural network consisting of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node is a neuron that uses a non-linear activation function. MLPs are used for various tasks, including classification and regression.
    • Convolutional Neural Network (CNN): A specialized type of neural network particularly effective for processing grid-like data, such as images. CNNs use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input data. They are crucial for visual perception tasks.
  • Asymmetric Actor-Critic Architecture: A common design in Reinforcement Learning where the actor (policy network) and critic (value network) have different observation inputs. The actor receives only observations available during real-world deployment (e.g., proprioception, camera data), while the critic can receive additional privileged information (e.g., ground truth base velocity, full terrain height map) only available in simulation. This allows the critic to learn a more accurate value function, guiding the actor more effectively during training, but the actor remains deployable in the real world without the privileged information.
  • Proximal Policy Optimization (PPO): A popular Reinforcement Learning algorithm that updates the policy with a small, conservative step, ensuring stability while maximizing rewards. It's an on-policy algorithm, meaning it learns from data collected by the current policy. PPO is known for its balance of performance and ease of implementation.
  • Transformer Encoder: A type of neural network architecture, originally developed for natural language processing, that relies heavily on self-attention mechanisms to weigh the importance of different parts of the input. In vision and robotics, transformers are increasingly used for processing sequential or multi-modal data, allowing for complex interactions between different feature streams (e.g., visual and proprioceptive).
  • Gated Recurrent Unit (GRU): A type of recurrent neural network (RNN) designed to handle sequential data. GRUs have gating mechanisms that regulate the flow of information, allowing them to capture long-term dependencies in sequences and mitigate the vanishing gradient problem common in simpler RNNs. They are used to maintain memory of past states or observations.
  • Autoencoder / Variational Autoencoder (VAE):
    • Autoencoder: A type of neural network used for unsupervised learning of efficient data codings (or latent representations). It learns to encode input data into a lower-dimensional latent space and then decode it back to reconstruct the original input. The goal is that the latent representation captures the most important features of the data.
    • Variational Autoencoder (VAE): An extension of the autoencoder that provides a probabilistic way of describing the latent space. Instead of mapping inputs to a fixed vector, a VAE maps them to a probability distribution (mean and variance). It tries to ensure that the latent space has a regular, continuous structure, typically by penalizing deviations from a standard normal distribution using KL divergence. This makes VAEs useful for generating new data and for learning meaningful latent representations. A minimal code sketch follows at the end of this list.
  • Proprioception: The sense of the relative position of one's own body parts and the strength of effort being employed in movement. In robotics, proprioceptive sensors (e.g., joint encoders, Inertial Measurement Units (IMUs)) provide data about the robot's internal state, such as joint angles, velocities, body orientation, and angular velocity.
  • Exteroception: The sense that provides information about the external environment. In robotics, exteroceptive sensors (e.g., cameras, LiDARs, depth sensors) provide data about the robot's surroundings, such as terrain shape, obstacles, and distances.
  • Sim-to-Real Transfer / Zero-Shot Deployment: The process of training a robot control policy entirely in a simulated environment and then deploying it directly (without further training or fine-tuning) in the real world. Zero-shot deployment implies successful transfer with no real-world training at all. This is a major challenge due to the reality gap (discrepancies between simulation and reality).
  • Domain Randomization: A technique used to bridge the sim-to-real gap. During simulation training, various physical parameters of the environment and robot (e.g., mass, friction, sensor noise, camera intrinsics) are randomly varied within specified ranges. This forces the policy to learn to be robust to these variations, making it more likely to generalize to the unpredictable real world.
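To make the autoencoder/VAE machinery above concrete, here is a minimal, self-contained sketch of the encode-reparameterize-decode pattern and the KL term against a standard normal prior. All layer sizes and names are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE: encode -> reparameterize -> decode, with a KL penalty."""
    def __init__(self, in_dim=45, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ELU())
        self.mu_head = nn.Linear(64, latent_dim)      # mean of q(z|x)
        self.logvar_head = nn.Linear(64, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ELU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.decoder(z)
        # closed-form KL( N(mu, sigma^2) || N(0, I) )
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return recon, kl

vae = TinyVAE()
x = torch.randn(8, 45)                               # dummy batch
recon, kl = vae(x)
loss = nn.functional.mse_loss(recon, x) + 1e-3 * kl  # reconstruction + weighted KL
```

The same KL-regularized latent structure is what the PIE estimator uses for its purely latent vector $\mathbf{z}_t$ (Section 4.2.4).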

3.2. Previous Works

The paper categorizes related work into Vision-Guided Locomotion and Robot Parkour.

3.2.1. Vision-Guided Locomotion

This field focuses on enabling robots to navigate using visual inputs.

  • Traditional Decoupled Approaches: Historically, this problem was split into:
    • Perception Component: Translating visual inputs (cameras, LiDARs) into elevation maps or traversability maps.
    • Controller Component: Using model-based methods (e.g., [13], [14]) or RL methods (e.g., [8], [15]) for locomotion.
    • Limitation: This decoupling often leads to information loss and system delays, hindering flexible adaptation to complex terrains.
  • Learning-Based End-to-End Control Systems: Recent advancements have favored end-to-end approaches, showing promise in complex terrains.
    • Agarwal et al. [10]: Designed a two-stage learning framework. A teacher policy (with privileged information like ground truth base velocity and height map) guides a student policy. The student policy directly predicts joint angles from depth camera inputs and proprioceptive feedback.
    • Yang et al. [11]: Proposed a coupled training framework using a transformer structure to integrate both proprioception and visual observations. A self-attention mechanism fuses these inputs for autonomous navigation in indoor and outdoor environments with varying obstacles.
    • Yang et al. [12]: Utilized a 3D voxel representation with SE(3) equivariance as features from visual inputs, aiming for precise terrain understanding.

3.2.2. Robot Parkour

This subfield specifically addresses agile, nimble, and robust movement in highly dynamic situations, requiring precise environmental understanding.

  • Hoeller et al. [1]: Described a hierarchical pipeline for navigation in parkour terrains.
    • Limitation: Occupancy voxels from an encoder-decoder architecture often contain incorrectness and mistrust, leading to inappropriate responses, high training costs, and low scalability in unstructured terrain.
  • Zhuang et al. [2]: Proposed a multi-stage method using soft/hard constraints to accelerate training. The robot learns to traverse terrains directly from depth images.
    • Limitation: Its privileged physics information is strongly tied to the geometric properties of obstacles in simulation, making it difficult to train for terrains not solely describable by geometry.
  • Cheng et al. [3]: Adopted a framework similar to Agarwal et al. [10].
    • Innovation: Introduced waypoints into the teacher policy's privileged inputs to guide the student policy in learning autonomous heading.
    • Limitation: Manual specification of waypoints based on terrain imposes considerable limitations.
  • Common Limitations of Prior Parkour Works:
    • Two-Stage Training: All aforementioned parkour works use a two-stage training paradigm, leading to information loss and performance degradation in the deployed student policy.
    • Explicit Terrain Estimation: Primarily focus on explicit terrain estimation distilled from the teacher policy, lacking implicit estimation of the surroundings, which hinders maximal performance.

3.3. Technological Evolution

The field of legged robot locomotion has evolved significantly:

  1. Early Model-Based Control: Initial efforts relied heavily on precise physical models of robots and environments, using analytical methods for control. These were often brittle to real-world uncertainties.

  2. Blind Locomotion with Proprioception: Advancements came with learning-based approaches focusing on proprioceptive sensors (IMUs, joint encoders). Robots learned to walk and adapt to various terrains without direct visual input, inferring terrain implicitly (e.g., Hwangbo et al. [4], Nahrendra et al. [5]). This highlighted the power of learning and robust state estimation.

  3. Vision-Guided Locomotion (Decoupled): To tackle more complex, unknown environments, exteroceptive sensors (cameras, LiDAR) were introduced. Initially, perception and control were often decoupled, where vision generated maps, and a separate controller used these maps (e.g., Fankhauser et al. [13]).

  4. Vision-Guided Locomotion (End-to-End RL): The trend shifted towards end-to-end Reinforcement Learning, where visual inputs directly inform the control policy (e.g., Miki et al. [8], Agarwal et al. [10], Yang et al. [11], [12]). This reduced latency and allowed for more complex, learned behaviors. The asymmetric actor-critic architecture became prominent for enabling sim-to-real transfer.

  5. Robot Parkour: Building on vision-guided locomotion, parkour emerged as a highly challenging benchmark, requiring extreme agility and robustness (e.g., Hoeller et al. [1], Zhuang et al. [2], Cheng et al. [3]). These works often still employed two-stage learning and focused primarily on explicit terrain understanding.

    PIE fits into this timeline as a cutting-edge end-to-end Reinforcement Learning approach for robot parkour. It addresses the limitations of previous parkour methods by introducing a one-stage training paradigm and a novel dual-level implicit-explicit estimation to achieve superior performance and sim-to-real transfer with unreliable sensors.

3.4. Differentiation Analysis

Compared to the main methods in related work, PIE presents several core differences and innovations:

  • One-Stage End-to-End Training vs. Two-Stage:

    • Prior Works (e.g., Agarwal et al. [10], Cheng et al. [3], Zhuang et al. [2]): Many learning-based parkour methods use a two-stage training paradigm, typically involving a teacher policy trained with privileged information and a student policy that learns to mimic the teacher's behavior using only exteroceptive and proprioceptive inputs.
    • PIE's Innovation: PIE proposes a one-stage end-to-end learning framework. This simplifies the training process, avoids potential information loss during the mimicking phase, and allows for a more direct optimization of the control policy based on raw sensor data. It leverages an asymmetric actor-critic architecture within this one-stage setup, where the critic uses privileged information, but the overall training is unified.
  • Dual-Level Implicit-Explicit Estimation vs. Primarily Explicit:

    • Prior Works (e.g., Hoeller et al. [1], Cheng et al. [3]): These methods largely focus on explicit terrain estimation (e.g., elevation maps, occupancy voxels) to inform the control policy. While beneficial, this can be unreliable with noisy sensors and may lack a holistic understanding of the robot's interaction with the environment.
    • PIE's Innovation: PIE introduces a dual-level implicit-explicit estimation.
      • Level 1 (Implicit-Explicit Understanding of State/Surroundings): It goes beyond explicit terrain maps by integrating proprioception with exteroception to implicitly infer the robot's state and surroundings through successor proprioceptive state prediction ($\hat{\mathbf{o}}_{t+1}$), alongside explicit height map reconstruction ($\hat{\mathbf{m}}_t$). This blend allows the robot to anticipate future states and adapt based on its internal model, even when external perception is noisy.
      • Level 2 (Latent vs. Physical Quantities): It explicitly estimates crucial physical quantities like base velocity ($\hat{\mathbf{v}}_t$) and foot clearance ($\hat{\mathbf{h}}_t^f$), which are vital for parkour maneuvers, in addition to purely latent vectors ($\mathbf{z}_t^m$, $\mathbf{z}_t$). This provides the policy with direct, interpretable information alongside robust, compressed representations.
  • Robustness to Unreliable Perception and Challenging Terrains:

    • Prior Works: Often struggle with latency and noise from exteroceptive sensors, especially near edges or during dynamic maneuvers, leading to limited performance or reliance on highly accurate terrain reconstruction.
    • PIE's Innovation: Through its dual-level implicit-explicit estimation, particularly the implicit estimation of successor proprioceptive state, PIE demonstrates superior robustness, especially when camera inputs are noisy or vision contradicts proprioception. This enables the robot to achieve significantly higher parkour abilities (e.g., 3x robot height/length jumps/gaps) and better sim-to-real transfer on harsh terrains and outdoor environments.
  • Simpler Training Process and Reward Function:

    • Prior Works (e.g., Zhuang et al. [2], Cheng et al. [3]): May use complex reward functions, soft/hard constraints, or waypoints to guide training.

    • PIE's Innovation: PIE utilizes a relatively simple reward function, closely aligned with prior research on blind walking. This highlights the effectiveness of its implicit-explicit estimation and network architecture in extracting complex behaviors without overly engineering the reward.

      In essence, PIE differentiates itself by offering a more holistic and robust understanding of the robot's internal state and external environment within a streamlined, one-stage learning framework, leading to a new level of parkour performance and generalization for legged robots.

4. Methodology

The proposed PIE framework is a one-stage, end-to-end learning-based framework that directly computes desired joint angle commands from raw depth images and onboard proprioception using a single neural network. It aims to circumvent the performance reductions seen in previous two-stage learning-based parkour methodologies by enhancing the estimation of the robot's state and surroundings through a dual-level implicit-explicit estimation.

4.1. Principles

The core idea behind PIE is to enable a quadruped robot to perform complex parkour maneuvers by providing its control policy with a comprehensive and robust understanding of both its internal state and its external environment. This understanding is achieved through a dual-level implicit-explicit estimation mechanism, which processes proprioceptive (robot's internal state) and exteroceptive (environmental perception, e.g., depth camera) sensor data.

The implicit-explicit estimation works on two levels:

  1. Understanding of Robot's State and Surroundings: Beyond just explicitly perceiving the terrain, the framework implicitly infers the robot's future state and surrounding terrain by predicting the successor proprioceptive state (what the robot's internal state will be in the next timestep). This implicit imagination of future states, combined with explicit terrain estimation (like a height map), allows the robot to anticipate dangers and plan maneuvers more effectively, especially where vision is unreliable.

  2. Nature of Estimated Values (Latent vs. Physical): The framework explicitly estimates specific, physically meaningful quantities (e.g., base velocity, foot clearance) that are directly relevant to control, alongside learning encoded latent vectors. This combination provides the policy with both interpretable, direct control signals and robust, compressed representations of complex information, reducing noise influence and enhancing stability.

    By integrating these dual levels within a one-stage end-to-end reinforcement learning setup, PIE seeks to create a highly agile and robust control policy capable of zero-shot sim-to-real transfer, even with low-cost hardware and unreliable sensors.

4.2. Core Methodology In-depth (Layer by Layer)

The PIE framework adopts an asymmetric actor-critic architecture to combine the benefits of privileged information during training with real-world deployability. It consists of three subnetworks: an actor, a critic, and an estimator. The optimization involves two concurrent parts: optimizing the actor-critic using Proximal Policy Optimization (PPO) and optimizing the estimator.

The following figure (Figure 2 from the original paper) shows the overall architecture of the PIE framework:

[Figure 2 (schematic): The quadruped robot control system based on the implicit-explicit learning framework, showing the different observation inputs and how they are processed by MLPs, a CNN, and a transformer encoder for decision-making and motion control. The depicted robot is 34 cm long and 25 cm tall.]

4.2.1. Policy Network (Actor)

The actor network is responsible for generating the robot's actions (joint commands). Crucially, its inputs are limited to data obtainable during real-world deployment.

Inputs to the Actor Network: The actor network takes the following as input:

  • $\mathbf{o}_t$: Proprioceptive observation at time $t$.

  • $\hat{\mathbf{v}}_t$: Estimated base velocity at time $t$.

  • $\hat{\mathbf{h}}_t^f$: Estimated foot clearance at time $t$.

  • $\mathbf{z}_t^m$: Encoded height map estimation (a latent vector representing the terrain) at time $t$.

  • $\mathbf{z}_t$: Purely latent vector (representing implicit state and surroundings) at time $t$.

    The proprioceptive observation $\mathbf{o}_t$ is a 45-dimensional vector directly measured from the robot's joint encoders and IMU. It is defined as: $ \mathbf{o}_t = \left[ \omega_t \quad \mathbf{g}_t \quad \mathbf{c}_t \quad \pmb{\theta}_t \quad \dot{\pmb{\theta}}_t \quad \bar{\mathbf{a}}_{t-1} \right]^T $ Where:

  • $\omega_t$: Body angular velocity.

  • $\mathbf{g}_t$: Gravity direction vector in the body frame.

  • $\mathbf{c}_t$: Velocity command (desired linear and angular velocities for the robot).

  • $\pmb{\theta}_t$: Joint angles.

  • $\dot{\pmb{\theta}}_t$: Joint angular velocities.

  • $\bar{\mathbf{a}}_{t-1}$: Previous action (motor commands), providing temporal context.

    The terms $\hat{\mathbf{v}}_t$, $\hat{\mathbf{h}}_t^f$, $\mathbf{z}_t^m$, and $\mathbf{z}_t$ are the outputs of the estimator subnetwork, which processes raw depth images and historical proprioception to provide these estimations. A dimension check for $\mathbf{o}_t$ is sketched below.
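As a sanity check on the stated dimensionality, here is a minimal sketch of assembling $\mathbf{o}_t$. The component sizes are inferred from the definitions above; treating the command $\mathbf{c}_t$ as a 3-vector is an assumption:

```python
import numpy as np

omega_t  = np.zeros(3)    # body angular velocity
g_t      = np.zeros(3)    # gravity direction in the body frame
c_t      = np.zeros(3)    # velocity command (assumed: v_x, v_y, yaw rate)
theta_t  = np.zeros(12)   # joint angles (3 joints x 4 legs)
dtheta_t = np.zeros(12)   # joint angular velocities
a_prev   = np.zeros(12)   # previous action

o_t = np.concatenate([omega_t, g_t, c_t, theta_t, dtheta_t, a_prev])
assert o_t.shape == (45,)  # 3 + 3 + 3 + 12 + 12 + 12 = 45, as stated
```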

Output of the Actor Network: The actor network produces an action $\mathbf{a}_t$, a 12-dimensional vector representing the desired joint position offsets for the 12 joints of the quadruped robot (3 joints per leg, 4 legs).

Action Space & Target Joint Angle: For stability and to maintain a default stance, the action $\mathbf{a}_t$ is added as a bias to the robot's standstill pose $\pmb{\theta}_{stand}$. The final target joint angle $\pmb{\theta}_{target}$ is thus defined as: $ \pmb{\theta}_{target} = \pmb{\theta}_{stand} + \mathbf{a}_t $ This $\pmb{\theta}_{target}$ is then sent to the robot's low-level joint controllers (e.g., PD controllers) to achieve the desired joint positions.
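A minimal sketch of this action-to-target mapping, with an illustrative PD tracking law at the joint level. The gains and standstill pose are placeholders, not values from the paper:

```python
import numpy as np

theta_stand = np.zeros(12)   # nominal standstill joint angles (placeholder values)
kp, kd = 30.0, 0.7           # illustrative PD gains, not from the paper

def joint_torques(a_t, theta, dtheta):
    theta_target = theta_stand + a_t   # the 12-dim action biases the stand pose
    return kp * (theta_target - theta) - kd * dtheta  # PD tracking of the target

tau = joint_torques(np.zeros(12), np.zeros(12), np.zeros(12))
```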

4.2.2. Value Network (Critic)

The value network (critic) estimates the state value, i.e., the expected cumulative future reward from a given state. To provide a more accurate estimation and guide the actor effectively, the critic network can incorporate privileged information that is typically only available in simulation.

Inputs to the Value Network: The input to the value network, $\mathbf{s}_t$, includes:

  • $\mathbf{o}_t$: Proprioceptive observation (same as for the actor).

  • $\mathbf{v}_t$: Ground truth base velocity (privileged information).

  • $\mathbf{m}_t$: Height map scan dots (privileged information, representing the true terrain height map).

    Thus, $\mathbf{s}_t$ is defined as: $ \mathbf{s}_t = \left[ \mathbf{o}_t \quad \mathbf{v}_t \quad \mathbf{m}_t \right]^T $ The use of privileged information for the critic is a common practice in asymmetric actor-critic architectures to accelerate learning and improve the quality of the value estimates. A minimal sketch of this observation split follows.
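The sketch below shows the asymmetric observation split; only $\mathbf{o}_t$'s 45 dimensions are stated in the paper, and the 187-point height map scan size is an assumption for illustration:

```python
import torch

o_t = torch.zeros(45)    # proprioception (available on the real robot)
v_t = torch.zeros(3)     # ground-truth base velocity (simulation only)
m_t = torch.zeros(187)   # height map scan dots (simulation only; size assumed)

actor_obs  = o_t                         # plus the estimator outputs at runtime
critic_obs = torch.cat([o_t, v_t, m_t])  # s_t = [o_t  v_t  m_t]^T
# The critic maps critic_obs to a scalar value; the deployed actor never
# needs v_t or m_t, which is what makes zero-shot deployment possible.
```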

4.2.3. Reward Function

To emphasize the robustness of the PIE framework and avoid over-engineering, a relatively simple reward function is utilized, similar to those found in prior research on blind walking [19], [20]. The reward function encourages desired behaviors (like velocity tracking and stable locomotion) and penalizes undesirable ones (like collisions, high accelerations, and power consumption). The total reward is a sum of weighted individual reward terms.

The following are the results from Table I of the original paper:

Reward | Equation ($r_i$) | Weight ($w_i$)
--- | --- | ---
Lin. velocity tracking | $\exp\{-4 \| \mathbf{v}_{cmd}^{xy} - \mathbf{v}_{xy} \|^2\}$ | 1.5
Ang. velocity tracking | $\exp\{-4 (\omega_{cmd} - \omega_{yaw})^2\}$ | 0.5
Linear velocity (z) | $v_z^2$ | -1.0
Angular velocity (xy) | $|\omega_{xy}|^2$ | -0.05
Orientation | $|\mathbf{g}_{xy}|^2$ | -1.0
Joint accelerations | $|\ddot{\pmb{\theta}}|^2$ | $-2.5 \times 10^{-7}$
Joint power | $|\pmb{\tau} \cdot \dot{\pmb{\theta}}|$ | $-2 \times 10^{-5}$
Collision | $n_{collision}$ | -10.0
Action rate | $(\mathbf{a}_t - \mathbf{a}_{t-1})^2$ | -0.01
Smoothness | $(\mathbf{a}_t - 2\mathbf{a}_{t-1} + \mathbf{a}_{t-2})^2$ | -0.01

Where:

  • Lin. velocity tracking: Rewards the robot for matching its horizontal linear velocity ($\mathbf{v}_{xy}$) with the commanded linear velocity ($\mathbf{v}_{cmd}$). The exponential term heavily penalizes deviations.
  • Ang. velocity tracking: Rewards the robot for matching its yaw angular velocity ($\omega_{yaw}$) with the commanded angular velocity ($\omega_{cmd}$).
  • Linear velocity (z): Penalizes vertical linear velocity, encouraging the robot to maintain a stable height.
  • Angular velocity (xy): Penalizes roll and pitch angular velocities, promoting body stability and preventing tipping.
  • Orientation: Penalizes deviations from a desired upright orientation, measured via the horizontal components of the gravity vector $\mathbf{g}$ in the body frame.
  • Joint accelerations: Penalizes high joint accelerations, promoting smooth movements and reducing wear on actuators.
  • Joint power: Penalizes high joint power consumption, encouraging energy-efficient locomotion. $\pmb{\tau}$ represents joint torque.
  • Collision: Heavily penalizes collisions of any part of the robot (other than the feet) with the environment. $n_{collision}$ is the number of collision points.
  • Action rate: Penalizes large changes in actions (joint position commands) between consecutive time steps, promoting smooth control signals.
  • Smoothness: Penalizes large second-order differences in actions, further encouraging smooth, continuous movements. (Two of these terms are sketched in code below.)
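As referenced above, here is a minimal sketch (illustrative, not the paper's code) of two representative terms from Table I, the linear velocity tracking reward and the action-rate penalty:

```python
import numpy as np

def lin_vel_tracking(v_cmd_xy, v_xy, w=1.5):
    # w * exp{-4 * ||v_cmd - v_xy||^2}
    return w * np.exp(-4.0 * np.sum((v_cmd_xy - v_xy) ** 2))

def action_rate_penalty(a_t, a_prev, w=-0.01):
    # w * ||a_t - a_{t-1}||^2
    return w * np.sum((a_t - a_prev) ** 2)

r = (lin_vel_tracking(np.array([1.0, 0.0]), np.array([0.9, 0.05]))
     + action_rate_penalty(np.zeros(12), 0.1 * np.ones(12)))
```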

4.2.4. Estimator

The estimator subnetwork is at the heart of the PIE framework, implementing the dual-level implicit-explicit estimation. Its primary role is to process raw sensor data and generate the estimations ($\hat{\mathbf{v}}_t$, $\hat{\mathbf{h}}_t^f$, $\mathbf{z}_t^m$, $\mathbf{z}_t$) that are fed into the actor network.

Dual-Level Implicit-Explicit Estimation: The estimated vectors are categorized by two levels:

Level 1: Understanding Robot's State and Surroundings (Implicit vs. Explicit) This level addresses how the robot comprehends its environment and its interaction with it.

  • Challenge: Exteroceptive sensors (like depth cameras) suffer from latency and noise, making purely explicit terrain estimation unreliable, especially for critical parkour maneuvers near edges. A blind policy can implicitly infer surroundings but needs direct environmental interaction for feedback, which is too slow for anticipatory parkour.
  • PIE's Solution: Multi-Head Auto-Encoder Mechanism: PIE integrates implicit and explicit estimation of the robot's state and surroundings.
    • Encoder Module Inputs:
      • Temporal Depth Images ($\mathbf{d}_t^{H_2}$): Depth observations from the near past are stacked along the channel dimension. The paper sets $H_2 = 2$, meaning two recent depth images are used. These are processed by a CNN encoder to extract visual features.
      • Temporal Proprioceptive Observations ($\mathbf{o}_t^{H_1}$): Proprioceptive observations from the recent past are concatenated. The paper sets $H_1 = 10$, using the ten most recent proprioceptive states. These are processed by an MLP encoder to extract proprioceptive features.
    • Cross-Modal Reasoning: A shared transformer encoder is employed to integrate the visual features from the CNN and proprioceptive features from the MLP. This allows for cross-modal attention to fuse information from both modalities.
    • Memory Generation: The outputs of the transformer (fused depth and proprioceptive features) are concatenated and fed into a GRU (Gated Recurrent Unit). The GRU generates memories of the state and terrain, effectively capturing temporal dependencies and historical context, especially important since the egocentric camera cannot perceive terrain beneath or behind the robot.
    • Estimator Outputs for Actor Input: The outputs of the GRU serve as input vectors for the policy network (actor). These include:
      • Encoded Height Map Estimation ($\mathbf{z}_t^m$): This is a latent vector representing the explicit understanding of the surrounding terrain. It is decoded into $\hat{\mathbf{m}}_t$ to reconstruct the high-dimensional height map $\mathbf{m}_t$ (the ground truth terrain map). This is an explicit estimation of the terrain.
      • Purely Latent Vector ($\mathbf{z}_t$): This vector is designed to encapsulate implicit information about the robot's state and surroundings; it is not directly interpretable as a physical quantity. PIE uses a Variational Autoencoder (VAE) structure for $\mathbf{z}_t$. When decoded alongside other vectors, $\mathbf{z}_t$ aims to produce $\hat{\mathbf{o}}_{t+1}$, a reconstruction of the robot's successor proprioceptive state $\mathbf{o}_{t+1}$. This is an implicit estimation of the robot's state and its interaction with the surroundings.
    • Loss for Level 1:
      • Mean-Squared-Error (MSE) loss is used between the decoded reconstruction $\hat{\mathbf{m}}_t$ and the ground truth $\mathbf{m}_t$ for explicit terrain estimation.
      • Mean-Squared-Error (MSE) loss is used between the decoded reconstruction $\hat{\mathbf{o}}_{t+1}$ and the ground truth $\mathbf{o}_{t+1}$ for implicit successor state estimation.
      • KL divergence is used as a latent loss for the VAE structure of $\mathbf{z}_t$ to regularize its distribution. (A structural sketch of this estimator module follows.)
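The following structural sketch shows the Level-1 data flow: CNN and MLP encoders, a shared transformer for cross-modal fusion, a GRU for memory, and the output heads. All layer sizes, the two-token layout, and head dimensions are assumptions for illustration; only the overall flow follows the paper's description:

```python
import torch
import torch.nn as nn

class EstimatorSketch(nn.Module):
    def __init__(self, prop_dim=45, H1=10, latent_dim=16, map_dim=187):
        super().__init__()
        self.cnn = nn.Sequential(                      # temporal depth images (H2=2 channels)
            nn.Conv2d(2, 16, 5, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64))
        self.mlp = nn.Sequential(nn.Linear(prop_dim * H1, 128), nn.ELU(),
                                 nn.Linear(128, 64))   # stacked proprioception
        enc_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.gru = nn.GRU(input_size=128, hidden_size=128, batch_first=True)
        # output heads: explicit terrain latent, VAE latent, and physical estimates
        self.z_m_head = nn.Linear(128, latent_dim)     # encoded height map z_t^m
        self.z_mu = nn.Linear(128, latent_dim)         # VAE mean for z_t
        self.z_logvar = nn.Linear(128, latent_dim)     # VAE log-variance for z_t
        self.v_head = nn.Linear(128, 3)                # base velocity estimate
        self.h_head = nn.Linear(128, 4)                # foot clearance estimate (4 feet)
        self.map_decoder = nn.Linear(latent_dim, map_dim)  # z_t^m -> reconstructed map
        # (the decoder that turns z_t into the successor state o_hat_{t+1} is
        #  omitted here for brevity)

    def forward(self, depth, prop_hist, hidden=None):
        vis = self.cnn(depth)                          # (B, 64) visual features
        pro = self.mlp(prop_hist.flatten(1))           # (B, 64) proprioceptive features
        tokens = torch.stack([vis, pro], dim=1)        # two tokens for cross-modal attention
        fused = self.transformer(tokens)               # (B, 2, 64)
        x = fused.flatten(1).unsqueeze(1)              # (B, 1, 128) one GRU step
        mem, hidden = self.gru(x, hidden)              # memory of state and terrain
        mem = mem.squeeze(1)
        z_m = self.z_m_head(mem)
        mu, logvar = self.z_mu(mem), self.z_logvar(mem)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized z_t
        return (z_m, z, mu, logvar, self.v_head(mem), self.h_head(mem),
                self.map_decoder(z_m), hidden)

est = EstimatorSketch()
outputs = est(torch.zeros(8, 2, 64, 64), torch.zeros(8, 10, 45))
```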

Level 2: Nature of Estimated Values (Encoded Latent Vector vs. Explicit Physical Quantity) This level concerns whether the output vector is a compressed, abstract representation or a direct physical measurement.

  • Encoded Latent Vectors ($\mathbf{z}_t^m$, $\mathbf{z}_t$): These are the outputs from the GRU that are further processed by decoders for reconstruction. The benefit of using encoded latent vectors is compression and dimension reduction, which helps with noise reduction and increases robustness.
  • Explicit Physical Quantities ($\hat{\mathbf{v}}_t$, $\hat{\mathbf{h}}_t^f$): These are explicitly estimated by the estimator and are directly interpretable as physical measurements.
    • $\hat{\mathbf{v}}_t$: Estimated base velocity. Explicitly estimating this helps with velocity tracking during training.
    • $\hat{\mathbf{h}}_t^f$: Estimated foot clearance. This provides relevant, direct information about the height of the terrain relative to the robot's feet, which is crucial for understanding terrain in parkour scenarios (e.g., stepping on edges, clearing obstacles).
    • Loss for Level 2: Mean-Squared-Error (MSE) loss is used for both $\hat{\mathbf{v}}_t$ (against the ground truth base velocity $\mathbf{v}_t$) and $\hat{\mathbf{h}}_t^f$ (against the ground truth foot clearance $\mathbf{h}_t^f$).

Overall Training Loss for the Estimator: The total training loss for the estimator combines all these objectives: $ \mathcal{L} = D_{\mathrm{KL}}\big( q(\mathbf{z}_t \mid \mathbf{o}_t^{H_1}, \mathbf{d}_t^{H_2}) \,\big\|\, p(\mathbf{z}_t) \big) + \mathrm{MSE}\big( \hat{\mathbf{o}}_{t+1}, \mathbf{o}_{t+1} \big) + \mathrm{MSE}( \hat{\mathbf{m}}_t, \mathbf{m}_t ) + \mathrm{MSE}( \hat{\mathbf{v}}_t, \mathbf{v}_t ) + \mathrm{MSE}( \hat{\mathbf{h}}_t^f, \mathbf{h}_t^f ) $ Where:

  • $D_{\mathrm{KL}}\big( q(\mathbf{z}_t \mid \mathbf{o}_t^{H_1}, \mathbf{d}_t^{H_2}) \,\|\, p(\mathbf{z}_t) \big)$: KL divergence loss.
    • $q(\mathbf{z}_t \mid \mathbf{o}_t^{H_1}, \mathbf{d}_t^{H_2})$: The posterior distribution of the latent vector $\mathbf{z}_t$, conditioned on the historical proprioceptive observations $\mathbf{o}_t^{H_1}$ and temporal depth images $\mathbf{d}_t^{H_2}$. This is learned by the encoder part of the VAE.
    • $p(\mathbf{z}_t)$: The prior distribution of the latent vector $\mathbf{z}_t$, parameterized as a standard normal distribution (mean 0, variance 1). This term encourages the learned latent space to be well-structured.
  • $\mathrm{MSE}\big( \hat{\mathbf{o}}_{t+1}, \mathbf{o}_{t+1} \big)$: Mean-Squared-Error loss between the predicted successor proprioceptive state and the ground truth.
  • $\mathrm{MSE}( \hat{\mathbf{m}}_t, \mathbf{m}_t )$: Mean-Squared-Error loss between the reconstructed height map and the ground truth height map.
  • $\mathrm{MSE}( \hat{\mathbf{v}}_t, \mathbf{v}_t )$: Mean-Squared-Error loss between the estimated and ground truth base velocity.
  • $\mathrm{MSE}( \hat{\mathbf{h}}_t^f, \mathbf{h}_t^f )$: Mean-Squared-Error loss between the estimated and ground truth foot clearance. (A code sketch of this combined loss follows.)
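A minimal sketch of this combined objective, assuming a diagonal Gaussian posterior so the KL term has a closed form (tensor names are placeholders):

```python
import torch
import torch.nn.functional as F

def estimator_loss(mu, logvar, o_next_hat, o_next, m_hat, m, v_hat, v, h_hat, h):
    # closed-form KL( q(z | o_t^{H1}, d_t^{H2}) || N(0, I) ) for a diagonal Gaussian
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return (kl
            + F.mse_loss(o_next_hat, o_next)  # implicit successor-state term
            + F.mse_loss(m_hat, m)            # explicit height map term
            + F.mse_loss(v_hat, v)            # base velocity term
            + F.mse_loss(h_hat, h))           # foot clearance term
```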

4.2.5. Training Details

1. Simulation Platform:

  • The actor, critic, and estimator are trained in simulation using Isaac Gym, a high-performance GPU-accelerated robotics simulator.
  • Isaac Gym is configured with 4096 parallel environments, enabling massively parallel training.
  • Leveraging NVIDIA Warp (a Python framework for high-performance GPU programming), the training process completes 10,000 iterations in under 20 hours on an NVIDIA RTX 4090 GPU. This efficiency allows for rapid development and deployment of the network.

2. Training Curriculum:

  • A training curriculum is employed, progressively increasing the difficulty of the terrains over time. This allows the policy to first master simpler tasks and then gradually adapt to more challenging environments.
  • Parkour terrains in simulation include:
    • Gaps: Up to 1m wide.
    • Steps: Up to 0.75m height.
    • Hurdles: Up to 0.75m height (covered by step randomization).
    • Stairs: With a height up to 0.25m.
  • Lateral velocity commands are sampled from $[0.0, 1.5]$ m/s.
  • Horizontal angular velocity commands are sampled from $[-1.2, 1.2]$ rad/s.
  • Terrain Randomization: To ensure the policy generalizes beyond fixed simulation paradigms, various terrains are randomized. For example, the depth and width of gaps are randomized to train the robot to handle scenarios where a large gap might be mistaken for flat ground between two steps, requiring a jump down and then up.

3. Domain Randomization: To enhance the robustness of the network trained in simulation and facilitate smooth sim-to-real transfer, various physical and sensor parameters are randomized during training.

The following are the results from Table II of the original paper:

Parameter | Randomization range | Unit
--- | --- | ---
Payload | [−1, 2] | kg
Kp factor | [0.9, 1.1] | Nm/rad
Kd factor | [0.9, 1.1] | Nms/rad
Motor strength factor | [0.9, 1.1] | Nm
Center of mass shift | [−50, 50] | mm
Friction coefficient | [0.2, 1.2] | -
Initial joint positions | [0.5, 1.5] | rad
System delay | [0, 15] | ms
Camera position (x) | [−10, 10] | mm
Camera position (y) | [−10, 10] | mm
Camera position (z) | [−10, 10] | mm
Camera pitch | [−1, 1] | deg
Camera horizontal FOV | [86, 88] | deg

These parameters are randomized to introduce variability that mimics real-world uncertainties and variations, thereby improving the policy's ability to generalize to unseen real-world conditions. Examples include variations in robot weight (Payload), joint stiffness (Kp factor), damping (Kd factor), motor capabilities (Motor strength factor), balance (Center of mass shift), ground interaction (Friction coefficient), initial conditions (Initial joint positions), control loop latency (System delay), and sensor noise/placement (Camera position, Camera pitch, Camera horizontal FOV).
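A minimal sketch of per-reset sampling over the Table II ranges; the dictionary keys and the uniform sampling are assumptions for illustration:

```python
import numpy as np

# Ranges copied from Table II; keys and structure are illustrative.
RANGES = {
    "payload_kg": (-1.0, 2.0),
    "kp_factor": (0.9, 1.1),
    "kd_factor": (0.9, 1.1),
    "motor_strength_factor": (0.9, 1.1),
    "com_shift_mm": (-50.0, 50.0),
    "friction": (0.2, 1.2),
    "system_delay_ms": (0.0, 15.0),
    "camera_pos_xyz_mm": (-10.0, 10.0),
    "camera_pitch_deg": (-1.0, 1.0),
    "camera_hfov_deg": (86.0, 88.0),
}

def sample_randomization(rng):
    # one uniform draw per parameter at every environment reset
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}

params = sample_randomization(np.random.default_rng(0))
```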

5. Experimental Setup

The effectiveness of the PIE framework was evaluated through ablation studies and comparisons with existing parkour frameworks in both simulation and real-world environments.

5.1. Datasets

The experiments primarily used simulated parkour terrains for training and evaluation. These terrains were procedurally generated and randomized to include a variety of obstacles at different difficulty levels.

  • Types of Terrains:
    • Gaps: Open spaces the robot must jump over.
    • Steps / Hurdles: Obstacles of varying heights the robot must climb over or jump onto/off.
    • Stairs: A series of steps with varying heights and widths.
    • Ramps: Inclined surfaces (used for generalization testing in real-world, not specifically trained on).
  • Difficulty Levels: For simulation metrics, terrains were configured with ten levels of increasing difficulty.
  • Real-World Environments:
    • Indoor Labs: Structured parkour courses with fabricated steps, gaps, and stairs for controlled testing and direct comparison.
    • Outdoor Environments:
      • Long-distance hike: A 2 km round trip on a mountain trail with an elevation gain of 153m, featuring continuous curved stairs of varying heights/widths, irregularly shaped steps/hurdles, steep ramps, deformable/slippery ground, and rocky surfaces.

      • Dark outdoor conditions: Nighttime tests involving steps, irregular rocks, slopes, and stairs, challenging the depth camera's performance.

        The datasets (simulated terrains) were chosen to cover a wide range of parkour challenges, ensuring the learned policy is robust and versatile. The real-world environments then validate the sim-to-real transfer capabilities and generalization.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

5.2.1. Mean Terminated Difficulty Level

  • Conceptual Definition: This metric quantifies the average difficulty level of terrain that a robot can successfully traverse before terminating due to a fall or collision. It is specifically used in simulation experiments to assess the policy's maximum traversal capability across different terrain types (gap, stairs, step). Terrains are arranged in a curriculum-like fashion with increasing difficulty levels (e.g., 1 to 10).
  • Mathematical Formula: The paper does not provide an explicit formula, but conceptually it is calculated as the average of the maximum difficulty levels reached across multiple trials and environments for a specific terrain type. Let $L_{i,j}$ be the maximum difficulty level reached by robot $j$ in environment $i$ for a given terrain type. $ \text{Mean Terminated Difficulty Level} = \frac{1}{N \cdot M} \sum_{i=1}^{N} \sum_{j=1}^{M} L_{i,j} $ A one-line computation is sketched after the symbol list.
  • Symbol Explanation:
    • $N$: Total number of environment sets created for a specific terrain type (e.g., 40 sets).
    • $M$: Number of robots tested simultaneously in each environment set (e.g., 100 robots).
    • $L_{i,j}$: The highest difficulty level successfully completed by robot $j$ in environment set $i$ before termination.
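As referenced above, computing the metric reduces to a mean over an $N \times M$ array of terminated levels; a minimal sketch with dummy data:

```python
import numpy as np

# 40 environment sets x 100 robots, dummy terminated levels in 1..10
L = np.random.default_rng(0).integers(1, 11, size=(40, 100))
mean_terminated_difficulty = L.mean()  # = (1 / (N*M)) * sum_{i,j} L_{i,j}
```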

5.2.2. Average Success Rate

  • Conceptual Definition: This metric measures the percentage of trials where the robot successfully completes a given task or traverses a specific terrain without falling or colliding. It is used in both simulation (for camera error studies) and real-world experiments (for ablations and SOTA comparisons).
  • Mathematical Formula: $ \text{Average Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
  • Symbol Explanation:
    • Number of Successful Trials: The count of attempts where the robot completed the task as defined (e.g., traversed the obstacle, reached the goal) without failure.
    • Total Number of Trials: The total number of attempts made for a specific task or terrain.

5.2.3. Parkour Abilities (Relative Size of Traversable Obstacles)

  • Conceptual Definition: This metric provides a qualitative and quantitative comparison of how large an obstacle a robot can reliably traverse relative to its own physical dimensions (height or length). It's a key indicator of parkour performance, typically expressed as a multiple (e.g., "3x robot height").
  • Mathematical Formula: The paper does not provide a formal formula, but it is conceptually calculated as: $ \text{Obstacle Ratio} = \frac{\text{Maximum Traversable Obstacle Dimension}}{\text{Robot's Relevant Dimension}} $
  • Symbol Explanation:
    • Maximum Traversable Obstacle Dimension: The largest height (for steps/stairs) or length (for gaps) of an obstacle that the robot can consistently overcome.
    • Robot's Relevant Dimension: The robot's thigh joint height (for steps/stairs) or distance between two thigh joints (for gaps), as defined in the paper.
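    For example, the Lite3 clearing a 0.75 m step with its 0.25 m thigh joint height gives an obstacle ratio of $0.75 / 0.25 = 3\times$, matching the paper's "3x robot height" claim.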

5.3. Baselines

The paper compared PIE against several baselines:

5.3.1. Ablation Study Baselines (PIE variants):

These variants were designed to verify the effectiveness of different components of PIE's dual-level implicit-explicit estimation.

  • PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$: This variant removes the implicit estimation of the robot's successor proprioceptive state. It relies solely on explicit terrain estimation from vision and the other explicit estimates.

  • PIE w/o reconstructing $\hat{\mathbf{m}}_t$: This variant lacks the explicit estimation of the terrain (height map reconstruction). It uses both proprioception and exteroception but only reconstructs $\mathbf{o}_{t+1}$ as an implicit estimation of the surroundings.

  • PIE w/o estimating $\hat{\mathbf{v}}_t$: This variant is trained without explicitly estimating the base velocity ($\hat{\mathbf{v}}_t$) as an input to the actor.

  • PIE w/o estimating $\hat{\mathbf{h}}_t^f$: This variant is trained without explicitly estimating the foot clearance ($\hat{\mathbf{h}}_t^f$) as an input to the actor.

  • PIE using predicted $\mathbf{o}_{t+1}$: In this variant, the policy network uses the directly predicted successor proprioceptive state ($\hat{\mathbf{o}}_{t+1}$) as input, instead of the purely latent vector ($\mathbf{z}_t$) used for reconstruction. This tests whether a direct, higher-dimensional prediction is better than a compressed latent representation.

5.3.2. Prior Work Baselines

These are state-of-the-art methods in robot parkour:

  • Hoeller et al. [1]: AnymalC robot, hierarchical pipeline for parkour.
  • Zhuang et al. [2]: Unitree-A1 robot, multi-stage method using soft/hard constraints.
  • Cheng et al. [3]: Unitree-A1 robot, two-stage framework with waypoints for guidance.

5.3.3. Robot Hardware

  • PIE was deployed on a DEEP Robotics Lite3 quadruped robot.
    • Specifications: thigh joint height 25 cm, distance between two thigh joints 34 cm, weight 12.7 kg, peak knee joint torque 30.5 Nm.
    • Sensors: onboard Intel RealSense D435i depth camera (10 Hz), joint encoders, IMU.
    • Processor: onboard Rockchip RK3588.
  • The comparison baselines Zhuang et al. [2] and Cheng et al. [3] used the Unitree A1 robot.
    • Specifications: thigh joint height 26 cm, distance between two thigh joints 40 cm, weight 12 kg, peak knee joint torque 33.5 Nm.
  • The paper notes that the Lite3's specifications are comparable to the Unitree A1's, making a direct comparison of parkour abilities reasonable.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate PIE's superior performance in parkour, its robustness to sensor errors, and its excellent sim-to-real transfer capabilities.

6.1.1. Simulation Experiments

In simulation, PIE was evaluated against its ablations on various terrains (gap, stairs, step) with 10 increasing difficulty levels.

The following are the results from Table III of the original paper (all values are the Mean Terminated Difficulty Level):

Method | Gap | Stairs | Step
--- | --- | --- | ---
PIE (ours) | 9.9 | 9.86 | 9.81
PIE w/o $\hat{\mathbf{o}}_{t+1}$ | 9.51 | 9.45 | 9.62
PIE w/o $\hat{\mathbf{h}}_t^f$ | 7.41 | 7.36 | 3.09
PIE w/o $\hat{\mathbf{v}}_t$ | 8.7 | 8.22 | 8.48
PIE w/o $\hat{\mathbf{m}}_t$ | 9.75 | 4.25 | 1.67
PIE using predicted $\hat{\mathbf{o}}_{t+1}$ | 9.23 | 7.28 | 3.29

  • PIE (ours) consistently achieved the highest mean terminated difficulty levels across all terrain types (9.9 for Gap, 9.86 for Stairs, 9.81 for Step). This validates the effectiveness of the full dual-level implicit-explicit estimation framework.
  • PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$ performed commendably, but slightly below the full PIE. Its reliance solely on explicit terrain estimation implies a less comprehensive terrain understanding, making it more susceptible to falls on the most challenging terrains due to minor deviations in foothold. This highlights the benefit of implicit successor state prediction.
  • PIE w/o estimating $\hat{\mathbf{h}}_t^f$ showed a significant decline, especially on Step terrains (3.09). This indicates that explicit estimation of foot clearance is crucial for the robot's direct understanding of the terrain beneath its feet, essential for extreme parkour maneuvers along terrain edges.
  • PIE w/o estimating $\hat{\mathbf{v}}_t$ also demonstrated reduced performance. Without an explicit estimate of base velocity, the policy introduces biases in velocity tracking, diminishing maneuverability on high-difficulty terrains.
  • PIE w/o reconstructing $\hat{\mathbf{m}}_t$ performed poorly, particularly on Stairs (4.25) and Step (1.67). This shows that regressing a reconstructed height map from the latent vector is vital for extracting useful terrain information from the depth image; without it, the robot struggles with complex terrains.
  • PIE using predicted $\hat{\mathbf{o}}_{t+1}$ struggled significantly on Step (3.29) and Stairs (7.28). This suggests that using the raw predicted successor proprioceptive state as input is less effective than a compressed purely latent vector ($\mathbf{z}_t$). The distribution of $\mathbf{o}_{t+1}$ is more complex than a standard normal distribution, making it harder for the policy network to extract useful information directly, reinforcing the benefit of the VAE structure and latent representations.

The following figure (Figure 3 from the original paper) shows the average success rate of PIE and PIE without $\hat{\mathbf{o}}_{t+1}$ under various camera input errors:

[Fig. 3. Simulation results for PIE and PIE without $\hat{\mathbf{o}}_{t+1}$ in the presence of various camera input errors. The five plots correspond to the five types of camera error introduced; the y-axis is the average success rate across all terrains, including gap, stairs, and step.]

The figure illustrates the average success rate of PIE and PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$ under five types of camera input errors, beyond standard domain randomization.

  • Overall Performance: PIE consistently outperforms PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$ across all terrains and camera error types.
  • Impact of Noise: The performance disparity between the two methods becomes more significant when considerable noise is added to the camera. This strongly suggests that the implicit estimation via $\hat{\mathbf{o}}_{t+1}$ is crucial for robustness against unreliable depth cameras and various interferences.
  • Vision-Proprioception Trust: The results imply that with $\hat{\mathbf{o}}_{t+1}$ estimation, the policy can make correct decisions even when vision contradicts proprioception, placing more trust in proprioception when visual inputs are compromised. This enables PIE to better acquire robot state estimation and terrain understanding under challenging visual conditions.

6.1.2. Real-World Indoor Experiments

Real-world experiments were conducted on a DEEP Robotics Lite3 robot, comparing PIE with its ablations and prior works on step, gap, stairs, and ramp terrains (ramp was for generalization testing). Each method was tested for ten trials on each difficulty level.

The following figure (Figure 4 from the original paper) shows the comparative success rates of PIE and its ablations on various real-world terrains:

The figure shows four obstacle types (step, gap, stairs, and ramp) together with the corresponding success-rate charts. Successful trials with different variants of the PIE framework are shown beneath each obstacle, while the charts on the right compare success rates across methods and conditions.

  • Outstanding Performance of PIE: PIE demonstrates outstanding performance across all skills and terrains in the real world, surpassing all ablations and previous related works.

    • It enables the robot to climb obstacles as high as 0.75m0.75 \mathrm{m} (3x robot height).
    • Leap over gaps as large as 1m1 \mathrm{m} (3x robot length).
    • Climb stairs as high as 0.25m0.25 \mathrm{m} (1x robot height).
  • Significant Improvement: These results represent a significant performance improvement of at least 50% compared to state-of-the-art robot parkour frameworks, as summarized in Table IV.

  • Remarkable Sim-to-Real Transferability: PIE exhibits consistent success rates compared to its simulation performance, showcasing excellent sim-to-real transferability without extensive fine-tuning.

  • Generalization to Unseen Terrains: Despite no specific training on ramp terrains in simulation, PIE demonstrates better generalization performance on ramps compared to ablations.

    The following are the results from Table IV of the original paper:

    <div class="table-wrapper"><table>
    <thead>
    <tr>
    <th>Method</th>
    <th>Robot</th>
    <th>Step</th>
    <th>Gap</th>
    <th>Stairs</th>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td>Hoeller et al. [1]</td>
    <td>AnymalC</td>
    <td>1.5×</td>
    <td>0.5×</td>
    <td>-</td>
    </tr>
    <tr>
    <td>Zhuang et al. [2]</td>
    <td>Unitree-A1</td>
    <td>1.6×</td>
    <td>1.5×</td>
    <td>-</td>
    </tr>
    <tr>
    <td>Cheng et al. [3]</td>
    <td>Unitree-A1</td>
    <td>2×</td>
    <td>2×</td>
    <td>-</td>
    </tr>
    <tr>
    <td>PIE (ours)</td>
    <td>DEEP Robotics Lite3</td>
    <td>3×</td>
    <td>3×</td>
    <td>1×</td>
    </tr>
    </tbody>
    </table></div>

This table provides a direct comparison of Parkour Abilities (relative to robot height/length). PIE significantly outperforms all previous works, achieving 3x for both Step and Gap and 1x for Stairs.
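The relative abilities in Table IV are simply obstacle size divided by the corresponding robot dimension. The quick check below reproduces PIE's row from the Lite3 dimensions quoted in the hardware description; using the thigh joint height as "robot height" and the thigh joint spacing as "robot length" is our assumption about the normalization.

```python
# Relative parkour ability = obstacle size / robot reference dimension.
thigh_height = 0.25  # m, Lite3 thigh joint height, taken here as "robot height"
body_length = 0.34   # m, Lite3 thigh joint spacing, taken here as "robot length"

print(f"Step:   {0.75 / thigh_height:.2f}x robot height")  # 0.75 m step   -> 3.00x
print(f"Gap:    {1.00 / body_length:.2f}x robot length")   # 1.00 m gap    -> 2.94x
print(f"Stairs: {0.25 / thigh_height:.2f}x robot height")  # 0.25 m stairs -> 1.00x
```

The gap ratio comes out at roughly 2.9×, so the paper's 3× figure presumably rounds or uses a slightly different length reference.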

Ablation Performance in the Real World:

  • PIE w/o reconstructing o^t+1\hat{\mathbf{o}}_{t+1}: Performed relatively better among ablations, but its performance noticeably decreased compared to simulation due to larger delay and noise in real-world perception and actuation. This further underscores the importance of the implicit estimation for real-world robustness.

  • PIE w/o estimating $\hat{\mathbf{h}}_t^f$: Struggled to handle terrain edges correctly due to the lack of an intuitive foot clearance estimate, resulting in lower success rates.

  • PIE w/o estimating $\hat{\mathbf{v}}_t$: As terrain difficulty increased, the base velocity estimate deteriorated, leading to a noticeable decline in success rates when following velocity commands.

  • PIE w/o reconstructing $\hat{\mathbf{m}}_t$: Faced significant challenges in extracting useful terrain information from real-world depth images, which are far more complicated than simulated ones. External perception became an interference rather than an aid, leading to near-zero success rates across all terrains. This highlights the crucial role of explicit height map reconstruction in real-world visual understanding.

6.1.3. Real-World Outdoor Experiments

To assess robustness and generalization in highly disturbed outdoor settings, extensive tests were conducted.

The following figure (Figure 5 from the original paper) shows the quadruped robot performing parkour over various terrains during a long-distance hike:

![Fig. 5. The robot performing parkour over various terrains during a long-distance hike.](/files/papers/6959ca5b5411c3e2652eaedc/images/5.jpg)

Fig. 5. The robot performing parkour over various natural terrains during a long-distance hike. The figure shows the robot's trajectory alongside photos of the terrains encountered, including 180 m and 27 m elevation markers.

  • Long-Distance Hike: The robot completed a 2 km round trip from the ZJU Yuquan campus to Laohe Mountain, with an elevation gain of 153 m, in just 40 minutes without stopping. The challenging trail included continuous curved stairs of varying heights and widths, irregularly shaped steps and hurdles, steep ramps, deformable and slippery ground, and rocky surfaces. This demonstrates remarkable robustness and sustained performance in complex, unstructured natural environments.

The following figure (Figure 6 from the original paper) shows tests in a dark outdoor environment:

![Fig. 6. Tests in a dark outdoor environment.](/files/papers/6959ca5b5411c3e2652eaedc/images/6.jpg)

Fig. 6. Tests in a dark outdoor environment. Despite the near absence of visible light, the robot was able to accurately perform agile maneuvers.

  • Dark Outdoor Conditions: Tests were successfully conducted at night in dim outdoor conditions. Despite the near absence of visible light, the robot accurately performed agile maneuvers, including continuously jumping over high steps and irregular rocks and climbing up and down slopes and stairs. This showcases the depth camera's utility and the framework's robustness even under visual conditions that are extremely challenging for perception.

6.2. Ablation Studies / Parameter Analysis

The ablation studies, detailed in Sections 6.1.1 (simulation) and 6.1.2 (real-world indoor), rigorously verified the contribution of each component of the dual-level implicit-explicit estimation framework:

  • Implicit successor state estimation ($\hat{\mathbf{o}}_{t+1}$): Removing this component (PIE w/o reconstructing $\hat{\mathbf{o}}_{t+1}$) caused a slight performance drop in simulation and a more noticeable decrease in the real world, especially under camera noise. This confirms that implicitly inferring the robot's future state provides comprehensive terrain understanding and crucial robustness against unreliable exteroception.

  • Explicit height map reconstruction ($\hat{\mathbf{m}}_t$): PIE w/o reconstructing $\hat{\mathbf{m}}_t$ performed very poorly, particularly in real-world environments, confirming that explicitly reconstructing a height map from depth images is necessary for extracting useful terrain information for challenging parkour.

  • Explicit base velocity estimation ($\hat{\mathbf{v}}_t$): PIE w/o estimating $\hat{\mathbf{v}}_t$ had significant velocity-tracking issues on difficult terrains, demonstrating that explicitly estimating and providing this physical quantity is vital for accurate maneuverability.

  • Explicit foot clearance estimation ($\hat{\mathbf{h}}_t^f$): PIE w/o estimating $\hat{\mathbf{h}}_t^f$ failed drastically on step and stairs terrains, emphasizing that direct knowledge of foot clearance is critical for negotiating edges and varied step heights.

  • Latent representation ($\mathbf{z}_t$) vs. raw prediction ($\hat{\mathbf{o}}_{t+1}$): PIE using the predicted $\mathbf{o}_{t+1}$ performed worse than using the purely latent vector ($\mathbf{z}_t$). The compressed, regularized latent representation from the VAE is more informative and easier for the policy to exploit than a high-dimensional, potentially noisy raw prediction of the successor state, especially given the complex distribution of $\mathbf{o}_{t+1}$. A sketch of how these estimation objectives could combine into one training loss follows.
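As a compact summary of what these ablations isolate, one plausible composite estimator objective, assuming a mean-squared error per estimation target and a standard VAE KL regularizer with an illustrative weight $\beta$ (this exact form is our assumption, not a formula quoted from the paper), is:

$$
\mathcal{L}_{\text{est}} =
\left\| \hat{\mathbf{o}}_{t+1} - \mathbf{o}_{t+1} \right\|_2^2
+ \left\| \hat{\mathbf{m}}_t - \mathbf{m}_t \right\|_2^2
+ \left\| \hat{\mathbf{v}}_t - \mathbf{v}_t \right\|_2^2
+ \left\| \hat{\mathbf{h}}_t^f - \mathbf{h}_t^f \right\|_2^2
+ \beta \, D_{\mathrm{KL}}\!\left( q(\mathbf{z}_t \mid \cdot) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I}) \right)
$$

Dropping any single term corresponds to the matching ablation above, and the KL term is what pulls $\mathbf{z}_t$ toward the standard normal distribution that makes it easier for the policy to consume than the raw $\hat{\mathbf{o}}_{t+1}$.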

    The domain randomization parameters (Table II) were crucial for sim-to-real transfer. By varying physical properties (mass, friction, CoM) and sensor characteristics (camera position, FOV, system delay), the policy was trained to be robust to real-world discrepancies, leading to zero-shot deployment.
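To illustrate the mechanism (the specific parameters and bounds belong to Table II, which is not reproduced here, so the names and ranges below are placeholders), per-episode randomization can be as simple as sampling each property uniformly from its range:

```python
import random

# Placeholder randomization ranges in the spirit of Table II; the paper's
# actual parameter names and bounds may differ.
RANDOMIZATION_RANGES = {
    "added_base_mass_kg": (-1.0, 2.0),
    "ground_friction": (0.4, 1.2),
    "com_offset_m": (-0.03, 0.03),
    "camera_pitch_deg": (-3.0, 3.0),
    "camera_fov_deg": (85.0, 89.0),
    "system_delay_s": (0.0, 0.03),
}

def sample_episode_params(rng=random):
    """Draw one set of physical/sensor parameters per training episode so the
    policy never overfits to a single simulated robot or camera."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

params = sample_episode_params()  # apply to the simulator before each episode
```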

6.3. Emergent Behaviors

The paper notes several emergent behaviors that highlight the framework's sophistication:

  • Natural and Agile Gait: Despite using a simple reward function and no imitation learning, the robot developed a natural and agile gait, seamlessly transitioning across complex terrains, akin to real animals. The authors hypothesize this is due to the ability to predict the successor state, which improves delay management and fosters an internal model of itself and the environment.

  • Robustness in Emergent Scenarios: The robot demonstrated timely and accurate responses even in scenarios where external perception had slight deviations (e.g., being tripped, misstepping during a jump takeoff). This is particularly impressive for Reinforcement Learning, which is not typically adept at precise maneuvers.

    The following figure (Figure 7 from the original paper) illustrates the robot's ability to quickly regain stability despite estimation inaccuracies during intense maneuvers:

    Fig. 7. Sequences showing the robot jumping and stabilizing in different environments. Despite distance estimation errors during a jump, the robot still completed the jump and landed stably, and it quickly regained stability when encountering a sudden step void while climbing stairs.

The figure illustrates two examples of the robot's robust recovery:

  • Gap Jump Recovery: When leaping over a gap, even if distance estimation errors caused the front and rear legs to not fully support on the platform before takeoff, the robot still successfully executed the jump and landed smoothly.
  • Stair Void Recovery: When encountering a sudden step void while ascending stairs, the robot promptly stabilized itself and continued stepping upwards. These examples showcase PIE's ability to handle perception-actuation discrepancies and maintain stability through dynamic recovery.

7. Conclusion & Reflections

7.1. Conclusion Summary

In this work, the authors proposed PIE (Parkour with Implicit-Explicit learning framework for legged robots), a novel one-stage, end-to-end learning-based framework designed to enhance the parkour capabilities of legged robots. The core innovation lies in its dual-level explicit-implicit estimation mechanism, which refines the robot's understanding of both its own state and its environment.

Through extensive experiments in simulation and real-world environments, PIE demonstrated:

  • Superior Parkour Performance: It significantly pushed the limits of robot parkour, enabling a low-cost quadruped robot (DEEP Robotics Lite3) to traverse obstacles up to 3x its height/length and stairs up to 1x its height. This represents a substantial improvement (at least 50%) over existing state-of-the-art learning-based parkour frameworks.

  • Robustness to Perception Errors: The implicit successor state estimation proved crucial for maintaining performance even with unreliable and noisy camera inputs.

  • Effective Sim-to-Real Transfer: The framework, trained entirely in simulation with domain randomization, achieved successful zero-shot deployment in diverse real-world indoor and challenging outdoor environments (including a mountain hike and nighttime tests), showcasing remarkable generalization capabilities and robustness.

  • Efficient Training: It achieved these results with a relatively simple training process and reward function, without relying on complex imitation learning or intricate behavior constraints.

    Overall, PIE presents a unified policy that robustly integrates proprioceptive and exteroceptive information to enable highly agile and generalized parkour locomotion for legged robots.

7.2. Limitations & Future Work

The authors acknowledge several limitations of the current PIE framework and propose future research directions:

  • Lack of 3D Terrain Understanding: The current framework primarily relies on 2D height maps or latent representations derived from depth images, meaning it lacks a full 3D understanding of the terrain. Consequently, the robot is unable to crouch under obstacles.

  • Limited Semantic Information from Perception: The external perception relies solely on depth images, which provide geometric information but lack the richer semantic information that could be extracted from RGB images (e.g., identifying object types, traversability cues beyond height).

  • Training Confined to Static Environments: The current training is conducted in static environments. It has not been extended to dynamic scenes (e.g., moving obstacles, changing ground conditions), which could potentially lead to confusion in visual estimation and compromise performance.

    In the future, the authors aim to:

  • Design a New Unified Learning-Based Sensorimotor Integration Framework: This framework would extract 3D terrain information from depth images and obtain abundant semantic information from RGB images.

  • Achieve Better Adaptability and Mobility in Various Environments: The goal is to enhance the robot's capabilities to handle a wider range of complex and dynamic environments by leveraging richer, multi-modal perceptual inputs.

7.3. Personal Insights & Critique

This paper presents a highly impressive advancement in legged robot locomotion, particularly for parkour.

Inspirations and Strengths:

  • Elegance of One-Stage Learning: Moving from a two-stage teacher-student paradigm to a one-stage end-to-end learning framework is a significant simplification that inherently reduces potential information bottlenecks and training complexity. The asymmetric actor-critic architecture is cleverly utilized within this one-stage design.
  • Power of Implicit-Explicit Estimation: The dual-level implicit-explicit estimation is the core innovation. The idea that a robot can implicitly infer its future state and surroundings, effectively building an "internal model" through successor proprioceptive state prediction, is very powerful. This, combined with explicit estimations (height map, base velocity, foot clearance), creates a robust and comprehensive understanding that is resilient to noisy sensors. This approach hints at a more generalizable form of intelligence where prediction and self-awareness augment direct perception.
  • Remarkable Sim-to-Real Transfer: The zero-shot deployment success, especially in harsh outdoor terrains and dim light, is a testament to the effectiveness of the chosen methodology and the domain randomization strategy. Achieving such high performance on a low-cost robot is also commendable, suggesting that advanced locomotion capabilities are becoming more accessible.
  • Emergent Agile Behaviors: The observation that the robot developed natural and agile gaits and displayed robust recovery behaviors despite a simple reward function is fascinating. It suggests that optimizing for successor state prediction and basic locomotion objectives, coupled with a rich state representation, can lead to highly sophisticated, animal-like movements. This aligns with theories of predictive coding in biological brains.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • "Low-Cost Robot" Quantification: While the paper mentions using a "low-cost quadruped robot," it doesn't provide a quantitative comparison of its cost relative to the AnymalC or even the Unitree A1 (which is often considered a more accessible research platform). A clearer definition or comparison of "low-cost" would strengthen this claim.

  • "Unreliable Egocentric Depth Camera" Quantification: The term "unreliable" is subjective. While experiments show robustness to noise, quantifying the typical noise levels, latency, and failure modes of the Intel RealSense D435i camera in different environments would provide a more rigorous basis for the claims about robustness.

  • Scalability to More Complex Environments: The current parkour terrains, while challenging, are often discrete obstacles. While the outdoor hike is impressive, general unstructured environments might involve more subtle traversability cues, dynamic elements, or terrains that require finer manipulation (e.g., pushing objects, navigating through dense foliage). The lack of full 3D understanding (as noted by authors) might limit performance in such scenarios.

  • Energy Efficiency: While joint power is penalized in the reward function, a deeper analysis of energy consumption during various maneuvers, especially the high-impact jumps, would be valuable for long-duration deployments.

  • Lack of Explicit High-Level Planning: The framework focuses on reactive control. For truly complex, multi-objective navigation (e.g., reaching a distant goal while minimizing energy and avoiding specific obstacles), integrating a high-level planner that leverages the robust terrain understanding could be a fruitful direction.

    Overall, PIE is an excellent contribution that pushes the boundaries of agile legged locomotion. Its implicit-explicit estimation paradigm is a powerful concept that could be generalized to other complex robotic control tasks where robust perception and internal modeling are critical.
