PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
TL;DR Summary
The paper introduces the PvP framework, addressing sample inefficiency in humanoid robot control by leveraging the complementarity of proprioceptive and privileged states. It improves learning efficiency without manual data augmentation, significantly enhancing performance in velocity tracking and motion imitation tasks on the LimX Oli humanoid robot.
Abstract
Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
1.2. Authors
Mingqi Yuan, Tao Yu, Haolin Song, Bo Li, Xin Jin, Hua Chen, Wenjun Zeng
Affiliations: HK PolyU (Hong Kong Polytechnic University); LimX Dynamics; EIT, Ningbo (Ningbo Institute of Industrial Technology, Chinese Academy of Sciences); USTC (University of Science and Technology of China); ZJU-UIUC (Zhejiang University-University of Illinois at Urbana-Champaign Institute); SUSTech (Southern University of Science and Technology)
The authors' backgrounds appear to span robotics, artificial intelligence, and computer vision, given their affiliations with universities and a robotics company (LimX Dynamics).
1.3. Journal/Conference
This paper is published as a preprint on arXiv (arxiv.org/abs/2512.13093). As an arXiv preprint, it signifies that the research has been made publicly available but has not yet undergone formal peer review by a journal or conference. arXiv is a widely recognized platform for sharing cutting-edge research in fields like AI and Robotics, allowing rapid dissemination of new findings.
1.4. Publication Year
2025 (specifically, December 15, 2025)
1.5. Abstract
The paper addresses the challenge of sample inefficiency in reinforcement learning (RL) for whole-body control (WBC) of humanoid robots, a crucial aspect for complex tasks in dynamic environments. This inefficiency stems from the intricate dynamics and partial observability inherent to humanoids. To tackle this, the authors propose PvP, a Proprioceptive-Privileged contrastive learning framework. PvP leverages the inherent complementary nature between proprioceptive states (sensor readings directly from the robot's body, like joint positions) and privileged states (full simulator information, often unavailable in the real world but useful during training). This framework learns compact, task-relevant latent representations without needing manually designed data augmentations, leading to faster and more stable policy learning. To facilitate systematic evaluation, the authors also developed SRL4Humanoid, an open-source, modular framework that provides high-quality implementations of representative State Representation Learning (SRL) methods for humanoid robot learning. Extensive experiments conducted on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to other SRL baselines. The study also offers practical insights into integrating SRL with RL for humanoid WBC, providing valuable guidance for data-efficient humanoid robot learning.
1.6. Original Source Link
https://arxiv.org/abs/2512.13093 PDF Link: https://arxiv.org/pdf/2512.13093v1.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the sample inefficiency of Reinforcement Learning (RL) when applied to Whole-Body Control (WBC) of humanoid robots. WBC is essential for humanoids to perform complex tasks, coordinate numerous joints, and achieve balanced, agile, and safe behaviors in dynamic, real-world environments.
This problem is particularly important due to several challenges:
- Complex Dynamics: Humanoid robots possess highly intricate dynamics, underactuation (fewer actuators than degrees of freedom), and strong coupling between different actions (like locomotion, manipulation, and balance).
- Partial Observability: Real-world robots often operate with incomplete sensory information (e.g., only internal sensor readings, not a full understanding of the environment's state).
- Composite Reward Structures: To ensure both task performance (e.g., tracking accuracy) and reliability (e.g., energy efficiency) in real-world deployment, RL policies often need to optimize complex reward functions, further increasing the difficulty and sample complexity.
- Limitations of Traditional Methods: Traditional model-based control methods struggle with flexible real-time control and robust performance under non-stationary conditions, leading to a shift towards data-driven RL approaches.

While RL has shown promising results in humanoid WBC (e.g., for motion tracking and multi-gait locomotion), its high sample requirements remain a significant barrier to widespread application. Existing solutions like simulation acceleration and data augmentation help but don't fully address the underlying issue of processing high-dimensional, noisy, and redundant sensory inputs.
The paper's entry point is State Representation Learning (SRL). SRL offers a promising solution by transforming raw, high-dimensional sensory inputs into compact, informative latent representations. These representations ideally preserve task-relevant dynamics while filtering out noise and redundancy. Prior SRL work in robotics has explored reconstruction-based methods and contrastive learning, but their integration into humanoid WBC in a unified, end-to-end framework that enhances both learning efficiency and real-world deployment reliability remains underexplored. The innovative idea is to leverage the intrinsic complementarity between proprioceptive states (what the robot directly senses about itself) and privileged states (full environmental and robot state information available in simulation) using contrastive learning.
2.2. Main Contributions / Findings
The paper makes three primary contributions:
- Proposed PvP Framework: They introduce PvP (Proprioceptive-Privileged contrastive learning), a novel framework that enhances proprioceptive representations for policy learning. PvP achieves this by performing contrastive learning between the robot's proprioceptive states and privileged states. This approach effectively combines complementary information from different sensory modalities without requiring hand-crafted data augmentations, leading to more stable and incrementally improved learning.
- Developed SRL4Humanoid Framework: The authors developed SRL4Humanoid, which they claim is the first unified, modular, and open-source framework. This toolkit provides high-quality implementations of representative SRL methods specifically tailored for humanoid robot learning. SRL4Humanoid aims to enable reproducible research and accelerate future progress in the community by offering a systematic platform for evaluating SRL techniques in humanoid WBC.
- Extensive Experimental Validation and Practical Insights: Through extensive experiments on the LimX Oli humanoid robot across two challenging tasks (velocity tracking and motion imitation), PvP is demonstrated to significantly outperform existing SRL baselines in terms of both sample efficiency and final policy performance. The study also provides valuable practical insights into how different SRL methods, training configurations (e.g., update intervals, data proportions), and application targets (policy vs. value encoder) affect the efficiency and performance of humanoid WBC learning. Key findings include that PvP not only accelerates learning but also ensures greater action smoothness, crucial for real-world reliability, and that applying SRL to the policy encoder is generally more effective than to the value encoder.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Humanoid Robots: Robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Their human-like morphology offers advantages like versatility, adaptability to human-centered environments, and intuitive interaction. However, their complex structure (many Degrees of Freedom, or DoF) makes control challenging.
- Whole-Body Control (WBC): A control strategy for robots with many DoF that aims to coordinate all joints and actuators simultaneously to achieve desired tasks while maintaining balance, respecting physical limits (e.g., joint limits, torque limits), and optimizing for criteria like energy efficiency or smoothness. It's crucial for complex behaviors like walking, running, or interacting with objects.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make optimal decisions by interacting with an environment. The agent performs actions in the environment, receives observations (information about the state), and gets rewards or penalties. The goal is to learn a policy – a mapping from observations to actions – that maximizes the cumulative reward over time.
  - Partially Observable Markov Decision Process (POMDP): A mathematical framework for modeling sequential decision-making in situations where the agent does not have direct access to the environment's true state but instead receives observations that are probabilistically related to the state. This is highly relevant for robots, which often rely on noisy or incomplete sensor readings. A POMDP is defined by the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, \Omega, R, \gamma)$:
    - $\mathcal{S}$: The full state space of the environment and agent.
    - $\mathcal{O}$: The observation space, which is the information the agent perceives.
    - $\mathcal{A}$: The action space, the set of possible actions the agent can take.
    - $P(s' \mid s, a)$: The transition probability function, describing the probability of moving to state $s'$ after taking action $a$ in state $s$.
    - $\Omega(o \mid s', a)$: The observation function, describing the probability of observing $o$ when the environment is in state $s'$ and action $a$ was taken.
    - R(s, a, s'): The reward function, which assigns a numerical value to state-action-state' transitions, indicating how desirable they are.
    - $\gamma$: The discount factor, which determines the present value of future rewards. A higher $\gamma$ makes the agent consider future rewards more heavily.
  - Policy ($\pi$): A function that maps observations (and potentially commands) to actions. In RL, this policy is often parameterized by a set of weights (e.g., a neural network) which the RL algorithm aims to optimize.
  - Sample Efficiency: In RL, sample efficiency refers to how much experience (i.e., how many interactions with the environment, or "samples") an agent needs to learn a good policy. Low sample efficiency means requiring a huge amount of data, which is costly and time-consuming, especially for real robots.
- State Representation Learning (SRL): A set of techniques aimed at learning low-dimensional, compact, and informative representations of raw, high-dimensional sensory inputs (like images, sensor readings, joint positions, etc.). The goal is to extract task-relevant features while discarding noise and redundant information. This can significantly improve RL by simplifying the observation space, making the learning process faster, more stable, and more generalizable.
- Proprioceptive State: The internal sensory information a robot receives about its own body's position, orientation, and movement; these are signals directly measurable on hardware. For humanoid robots, this typically comprises joint positions (the angles of its joints), joint velocities (the rates of change of joint angles), base angular velocity (the rotational speed of the robot's main body, or base), and a gravity (orientation) estimate of the robot's orientation relative to gravity, often represented in the base frame.
- Privileged State: The complete and true state of the robot and its environment, typically available only in a simulator during training. It contains information that is either difficult, unreliable, or impossible to measure directly on a real robot. This can include root pose and velocity (the exact position, orientation, linear, and angular velocity of the robot's base in the world), per-link poses and velocities, contact indicators (precise information about which parts of the robot are in contact with the environment), and environment/terrain features. Crucially, the proprioceptive state is a subset of the privileged state.
- Contrastive Learning: A self-supervised learning paradigm where representations are learned by bringing "similar" (positive) pairs of data points closer together in a latent space while pushing "dissimilar" (negative) pairs apart. This helps the model learn what makes different data points similar or different. In SRL, positive pairs might be different augmented views of the same observation, or temporally adjacent observations.
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm. It's an actor-critic method (meaning it uses separate neural networks for the policy and value function). PPO improves stability and sample efficiency compared to earlier policy gradient methods by using a clipped surrogate objective function. This objective limits the size of policy updates at each step, preventing large, destructive changes and ensuring that the new policy doesn't stray too far from the old one.
- SimSiam: Stands for "Simple Siamese" representation learning. It's a self-supervised learning method that learns meaningful representations without needing negative sample pairs, large batches, or momentum encoders (common components in other contrastive learning methods). It uses two identical encoder networks (a Siamese network architecture) that process two different augmented views of the same input. A key innovation is the stop-gradient operation on one branch, which prevents the network from collapsing (i.e., learning trivial, constant representations for all inputs). The objective is to maximize the cosine similarity between the outputs of the two branches.
- Variational Autoencoders (VAE): A type of generative model that learns a compressed, latent representation of input data. It consists of an encoder that maps input data to a probabilistic distribution over a latent space, and a decoder that reconstructs the original data from samples drawn from this latent space. VAEs enforce a prior distribution (e.g., Gaussian) on the latent space through a Kullback-Leibler (KL) divergence term in their loss function, balancing reconstruction accuracy with regularization to ensure the latent space is well-structured and facilitates generation.
- Self-Predictive Representations (SPR): An SRL method that learns latent representations by enforcing multi-step consistency between predicted and encoded future states. It encourages representations that capture environment dynamics. The agent learns to predict future latent states from current latent states and actions, thereby creating representations that are useful for predicting future outcomes and understanding dynamics.
SRLmethod that learnslatent representationsby enforcing multi-step consistency between predicted and encoded future states. It encourages representations that capture environment dynamics. Theagentlearns to predict futurelatent statesfrom currentlatent statesandactions, thereby creatingrepresentationsthat are useful for predicting future outcomes and understandingdynamics.
3.2. Previous Works
The paper frames SRL for RL into three predominant categories:
-
Reconstruction-based Methods: These methods learn representations by encoding high-dimensional observations into a lower-dimensional latent space and then attempting to reconstruct the original observations (or parts of them) from this latent representation.
- Examples cited: [4, 5, 7, 10, 46].
- A pioneering example is
VAE[10], which balances reconstruction fidelity with regularization by enforcing a prior distribution on the latent space.Beta-VAE[7] extends this by adding a hyperparameter to control the disentanglement of latent factors. - Relevance to paper: The paper acknowledges that reconstruction-based methods can struggle with suboptimal representation quality and poor generalization because they often focus on preserving complete state information, including irrelevant details, rather than solely task-relevant features.
-
Dynamics Modeling Methods: These
SRLapproaches encouragerepresentationsthat explicitly capture theenvironment's dynamicsthrough predictive modeling.- Examples cited: [9, 12, 20, 25, 31].
Forward modelspredict future states from current state-action pairs, whileinverse modelsinfer actions from transitions [9, 12, 25]. These models aim to provide richcontrollable features.SPR[31] is a specific dynamics modeling method mentioned, which learnspredictive latent representationsby enforcing multi-step consistency between predicted and encoded futurelatent states.- Relevance to paper:
SPRis used as a baselineSRLmethod in the paper's experiments.
-
Contrastive Learning Techniques: These methods structure the
latent spaceby enforcing similarity between "positive pairs" (e.g., different augmented views of the same observation, or temporally adjacent states) while pushing "negative pairs" apart. This yieldsinvariantandtemporally smooth representationsthat can improveRL efficiencyand generalization.- Examples cited: [14, 15, 36, 45, 51].
CURL[15] leverages augmented image pairs to enforce invariances.ATC[36] aligns embeddings of temporally close observations.CDPC[51] refinescontrastive predictive codingwith temporal-difference objectives.SimSiam[1] is specifically mentioned as a contrastive learning baseline that the paper'sPvPframework builds upon and adapts.- Relevance to paper:
SimSiamis a direct inspiration and a baseline forPvP.
SRL for Humanoid Robot Learning (Specific Prior Work):
The paper highlights pioneering work applying SRL to humanoid robots:
- [50] proposes
Any2Track, which usesSRLto enhance motion tracking by integrating ahistory-informed adaptation module. This module extractsdynamics-aware world model predictionsto adapt to disturbances like terrain or external forces. - [38] presents a
world model reconstruction frameworkthat usessensor denoisingandworld state estimationto improve locomotion in unpredictable environments. - [37] explores reconstruction-based
SRL(e.g., elevation-based internal maps) forhumanoid locomotion over challenging terrain. - [18] uses
contrastive learning-based abstractions (perceptive internal models from height maps) to refinestate embeddingsand improveRLperformance. - Limitations of prior SRL in Humanoids: The paper notes that while these approaches show potential, they are often either reconstruction-based (which might preserve irrelevant details) or rely on a single state modality (like only perceptive or proprioceptive information). This limits their ability to capture the full spectrum of task-relevant dynamics or be seamlessly integrated into an end-to-end
RLframework for real-world reliability.
3.3. Technological Evolution
The evolution of humanoid WBC has generally progressed as follows:
- Traditional Model-Based Control: Early approaches heavily relied on precise mathematical models of the robot's dynamics and environment. Techniques like
inverse kinematics,optimization-based control[11], andModel Predictive Control (MPC)[26] were common. While providing theoretical guarantees, these methods often struggle with real-time flexibility, robustness to unmodeled disturbances, and performance in complex, dynamic, or non-stationary real-world conditions. - Data-Driven Reinforcement Learning (RL): To overcome the limitations of model-based methods,
RLemerged as a dominant paradigm.RLallows robots to learn complex behaviors directly from interaction, adapting to uncertainties and achieving behaviors difficult to hand-design [8, 16, 17, 44, 48, 49].RLpolicies can achieve impressive generalization (e.g., diverse motions [17] or multi-gait locomotion [44]). - Addressing RL Sample Inefficiency: Despite its successes,
RL's primary bottleneck issample inefficiency, especially for high-dimensional, complex systems like humanoids. This led to research into:-
Simulation Acceleration: Faster simulators (e.g.,
Isaac Gym[19],MuJoCo XLA[39]) to generate more data quickly. -
Data Augmentation: Techniques to artificially expand the training data diversity [22, 29, 30].
-
State Representation Learning (SRL): The focus of this paper, which aims to make
RLmore efficient by providing better inputs to theRL agent.This paper's work fits into the third stage, specifically focusing on advancing
SRLforhumanoid WBC. It aims to bridge the gap inSRLby proposing a method (PvP) that leverages bothproprioceptiveandprivileged statesto learn more effective representations, and by providing a standardized framework (SRL4Humanoid) for reproducible research in this area.
-
3.4. Differentiation Analysis
Compared to main methods in related work, PvP offers several core differences and innovations:
-
Leveraging
ProprioceptiveandPrivileged Statesfor Contrastive Learning:- Differentiation: Many prior
SRLmethods, especially those forRL, rely on a single input modality (e.g., only images forCURL[15], or only proprioceptive states for methods likePIM[18] (which uses height maps as "perceptive internal models")). Reconstruction-based methods [38, 43] often try to predictprivileged informationfromproprioceptive states, but this can lead to suboptimal representations by forcing theencoderto reconstruct all details, not just task-relevant ones. - Innovation:
PvPdirectly uses bothproprioceptiveandprivileged statesas input to acontrastive learningframework. It treats theprivileged state(which inherently contains more comprehensive, ground-truth information) as a "pseudo augmentation" of theproprioceptive state. This allows thepolicy encoderto learn representations that implicitly incorporate richer,privileged informationduring training, without needing that information during real-world deployment (where onlyproprioceptive statesare available to thepolicy).
- Differentiation: Many prior
-
Absence of Hand-crafted Data Augmentations:
- Differentiation: Many
contrastive learningmethods (e.g.,CURL[15]) heavily rely onhand-crafted data augmentations(like random crops, color jittering, Gaussian noise) to generate positive pairs and learn invariances. Designing effective augmentations for diverse robotic sensor inputs can be challenging and task-specific. - Innovation:
PvPleverages the "intrinsic complementarity" between theproprioceptiveandprivileged statesthemselves. By applying aZeroMaskingoperation to theprivileged stateto create aproprioceptive-likeview (), it generates two related views—the fullprivileged state() and the maskedprivileged state()—that serve as natural positive pairs forcontrastive learning. This removes the need for designing complex, domain-specific augmentations.
- Differentiation: Many
-
Enhanced Task-Relevant Feature Learning:
- Differentiation: Reconstruction-based methods often prioritize fidelity to the input, potentially retaining irrelevant details. Single-modality contrastive methods might be limited by the information contained within that single modality.
- Innovation: By contrasting a rich
privileged statewith a constrainedproprioceptive-likeview,PvPis encouraged to learn features from theproprioceptive statethat are robust and predictive of aspects present in theprivileged state(likeroot linear velocity), which are often highlytask-relevant. This leads to more compact and informativelatent representationsthat filter out noise and redundancy, acceleratingRLand improving final performance.
-
SRL4Humanoid Framework:
-
Innovation: The development of
SRL4Humanoidas a unified, modular, and open-source framework is itself a significant contribution. It addresses the common challenge of reproducibility and systematic evaluation inRLresearch by providing high-quality implementations of diverseSRLmethods specifically for humanoid robots. This facilitates comparative studies and future research in the community.In essence,
PvPuniquely capitalizes on the rich information available in simulation (privileged state) to improve the representation learning for real-world deployableproprioceptive policies, doing so in a self-supervised, augmentation-free manner within acontrastive learningframework.
-
4. Methodology
4.1. Principles
The core idea behind PvP is to address the sample inefficiency of Reinforcement Learning (RL) for humanoid Whole-Body Control (WBC) by learning superior state representations. It achieves this through contrastive learning that leverages the inherent complementary relationship between two distinct sources of state information: the proprioceptive state and the privileged state.
The intuition is that while the proprioceptive state (e.g., joint angles, velocities, base angular velocity) is what the robot directly perceives and uses for control in the real world, the privileged state (e.g., root linear velocity, full environment state) provides a more complete, ground-truth understanding of the robot's true state and its interaction with the environment, especially during simulation training. The privileged state contains all the information from the proprioceptive state plus additional, typically unobservable, rich context. By setting up a contrastive learning task where the proprioceptive view is contrasted with a privileged view derived from the same underlying state, the policy encoder learns to extract robust, task-relevant features from the proprioceptive state that are predictive of the privileged information. This process implicitly guides the encoder to focus on meaningful aspects of the proprioceptive input, leading to compact, informative latent representations that are less susceptible to noise and redundancy, without requiring manually designed data augmentations. This, in turn, facilitates faster and more stable policy learning in RL.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology section first outlines the general humanoid Whole-Body Control (WBC) problem and the Reinforcement Learning (RL) framework, then details the proposed PvP algorithm, and finally introduces the SRL4Humanoid framework that supports the experiments.
4.2.1. Humanoid Whole-Body Control (WBC)
WBC is fundamental for humanoid robots to perform diverse and complex tasks. The objective is to design a control function that maps continuous commands and observations to appropriate control signals. The paper employs learning-based methods to directly learn a parameterized policy that outputs joint actions.
In practice, actions are often defined as offsets to nominal joint positions for various body parts. These offsets are then added to pre-defined nominal targets to obtain the final target joint positions, which are subsequently tracked by a proportional-derivative (PD) controller with fixed gains.
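To make this action-to-torque pipeline concrete, here is a minimal sketch of how an offset-style action is typically converted into joint torques by a fixed-gain PD controller. The gains, nominal pose, and dimensions are illustrative placeholders, not values from the paper.

```python
import numpy as np

NUM_JOINTS = 31                      # LimX Oli has 31 actuated joints
KP, KD = 60.0, 1.5                   # illustrative fixed PD gains (not from the paper)
q_nominal = np.zeros(NUM_JOINTS)     # nominal joint positions (placeholder)

def pd_torques(action, q, dq):
    """Map a policy action (offsets to nominal joint positions) to joint torques."""
    q_target = q_nominal + action            # offsets added to the nominal targets
    return KP * (q_target - q) - KD * dq     # PD tracking law with fixed gains

# usage: tau = pd_torques(policy_output, joint_pos, joint_vel)
```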
4.2.2. Reinforcement Learning (RL) Background
The paper casts humanoid WBC as an infinite-horizon partially observable Markov decision process (POMDP) [35].
A POMDP is formally defined by the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, \Omega, R, \gamma)$:
-
: The full
statespace of the humanoid robot and its environment. -
: The
observationspace, which is the information theagentreceives. -
: The
actionspace, the set of possible actions theagentcan take. -
: The transition probability function, describing the probability of moving to
stateafter takingactioninstate. -
: The
observation function, which describes the probability ofobservationoccurring when theenvironmentis instateandactionwas taken. -
R(s, a, s'): Thereward function, which provides a scalar feedback signal for transitions. -
: The
discount factor, which weights the importance of future rewards.The primary objective of
RL is to learn an optimal policy $\pi_{\pmb\theta}$ that maximizes the expected discounted return: $ J_{\pi}(\pmb\theta) = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^t R(\pmb s_t, \pmb a_t, \pmb s_{t+1}) \Big] . $ Here, $\mathbb{E}_{\pi}$ denotes the expectation over trajectories (sequences of states, actions, and observations) generated by policy $\pi_{\pmb\theta}$.
The paper further defines specific state and action spaces for the humanoid robot:
- Proprioceptive State Space: This is the observation space for the policy. At time step $t$, the proprioceptive state consists of signals directly measurable on hardware. Typically, these include joint positions, joint velocities, base angular velocity, and a gravity (orientation) estimate in the base frame.
- Privileged State Space: This represents the full simulator state (from the state space $\mathcal{S}$). At time step $t$, the privileged state $\pmb s_t$ is used only during training (e.g., by the critic/teacher network) and is unavailable or unreliable on the real robot. Typical components include root pose and velocity, per-link poses and velocities, contact indicators, and environment/terrain features. Importantly, the proprioceptive state is a subset of the privileged state.
- Action Space: The action specifies angular deviations for actuated joints relative to their nominal positions. These deviations are added to nominal configurations to get final target joint positions, which a PD controller then tracks.
4.2.3. PvP Implementation
The PvP framework aims to enhance policy learning by leveraging both proprioceptive and privileged states through contrastive learning. This approach is motivated by the limitations of previous methods: reconstruction-based SRL often learns suboptimal and poorly generalizable representations by trying to preserve all state details, while existing contrastive learning methods for RL typically rely on a single state modality, missing the rich privileged information.
The overview of the PvP approach is illustrated in Figure 2 (from the original paper):
Figure 2. An overview of the PvP approach. (a) The components of privileged state (e.g., root linear velocity) and proprioceptive state (e.g., joint positions). (b) PvP conducts contrastive learning based on the intrinsic complementarity between the two state modalities.
As shown in Figure 2(a), the privileged state contains comprehensive information, including the proprioceptive observations (e.g., joint positions, angular velocities) and additional privileged information (e.g., root linear velocity). This comprehensive privileged state can be considered a rich, implicit "pseudo augmentation" of the proprioceptive state.
To create a pair for contrastive learning that reflects the intrinsic complementarity, PvP generates a second view from the privileged state. This is done by applying a ZeroMasking operation to the privileged-only components of $\pmb s_t$, while keeping the proprioceptive observations intact:
$
\tilde{\pmb s}_t = \mathrm{ZeroMasking}(\pmb s_t) .
$
Here, ZeroMasking specifically targets and zeroes out components of the privileged state that are not part of the proprioceptive state (e.g., root linear velocity, contact indicators). This operation effectively creates a proprioceptive-only view $\tilde{\pmb s}_t$ that is structurally similar to the proprioceptive state but derived from the same comprehensive privileged state $\pmb s_t$.
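The masking step can be sketched as follows, assuming (purely for illustration) that the privileged state is a flat vector whose first `proprio_dim` entries are the proprioceptive components and whose remaining entries are privileged-only terms; the exact layout in the paper's implementation may differ.

```python
import torch

def zero_masking(priv_state: torch.Tensor, proprio_dim: int) -> torch.Tensor:
    """Zero out the privileged-only components, keeping the proprioceptive part.

    priv_state: (batch, priv_dim) privileged states from the simulator.
    proprio_dim: number of leading dimensions assumed to be proprioceptive.
    """
    masked = priv_state.clone()
    masked[:, proprio_dim:] = 0.0   # e.g., root linear velocity, contact indicators
    return masked

# s_tilde = zero_masking(s_priv, proprio_dim=107)  # dimension is illustrative
```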
The derived data pair is then used to train the policy encoder, following the architecture and principles of the SimSiam algorithm [1].
Formally, let $f_{\pmb\theta}$ be the policy encoder (a neural network parameterized by $\pmb\theta$) and $h_{\psi}$ be the predictor (another neural network parameterized by $\psi$). The process involves encoding both views and then predicting from one to the other:
$
\begin{array} { r l } { z = f_{\pmb\theta}(\pmb s) , } & { \tilde{z} = f_{\pmb\theta}(\tilde{\pmb s}) , } \\ { \pmb p = h_{\psi}\left( z \right) , } & { \tilde{\pmb p} = h_{\psi}\left( \tilde{z} \right) , } \end{array}
$
Here, $z$ and $\tilde{z}$ are the latent representations obtained by feeding the privileged state $\pmb s$ and the zero-masked privileged state $\tilde{\pmb s}$ (the proprioceptive-like view) through the policy encoder $f_{\pmb\theta}$, respectively. $\pmb p$ and $\tilde{\pmb p}$ are the outputs of the predictor $h_{\psi}$ applied to these latent representations.
Finally, the PvP loss () is defined using negative cosine similarity with a stop-gradient operation, as in SimSiam:
$
L_{\mathrm{PvP}} = D_{\mathrm{ncs}}\left( \pmb p, \mathbf{sg}(\tilde{\pmb z}) \right) + D_{\mathrm{ncs}}\left( \tilde{\pmb p}, \mathbf{sg}(\pmb z) \right) ,
$
where $D_{\mathrm{ncs}}(\cdot, \cdot)$ is the negative cosine similarity loss between the normalized vectors. The stop-gradient operation, denoted by $\mathbf{sg}(\cdot)$, is crucial. It means that gradients are computed for the branch containing the predictor output ($\pmb p$ or $\tilde{\pmb p}$) and propagated back through the encoder and predictor, but not through the target representation ($\mathbf{sg}(\tilde{\pmb z})$ or $\mathbf{sg}(\pmb z)$). This prevents the network from learning a trivial solution where both outputs collapse to a constant, which would happen if gradients were allowed to flow through both branches simultaneously.
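A minimal PyTorch sketch of this symmetric loss is shown below. The layer sizes are placeholders and the names (`encoder`, `predictor`, `pvp_loss`) are hypothetical rather than the paper's code; only the structure (shared encoder, predictor head, negative cosine similarity with stop-gradient) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(256, 128), nn.ELU(), nn.Linear(128, 128))   # f_theta (sizes illustrative)
predictor = nn.Sequential(nn.Linear(128, 64), nn.ELU(), nn.Linear(64, 128))   # h_psi

def neg_cos_sim(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # negative cosine similarity; detaching z implements the stop-gradient
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def pvp_loss(s_priv: torch.Tensor, s_masked: torch.Tensor) -> torch.Tensor:
    """Symmetric SimSiam-style loss between the privileged and masked views."""
    z, z_tilde = encoder(s_priv), encoder(s_masked)
    p, p_tilde = predictor(z), predictor(z_tilde)
    return neg_cos_sim(p, z_tilde) + neg_cos_sim(p_tilde, z)
```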
Advantages of PvP:
- Richer Information: By leveraging both proprioceptive and privileged states, PvP combines complementary information, reducing SRL complexity while enhancing the learned representations. This provides an alternative way for the policy to access implicit privileged information.
- No Hand-crafted Augmentations: PvP exploits the intrinsic complementarity between the two state modalities, avoiding the need for complex, task-specific data augmentations.
- Versatility: The framework is simple and general, making it applicable to a wide range of tasks.
4.2.4. The SRL4Humanoid Framework
To support systematic experimentation and reproducible research, the authors introduce SRL4Humanoid, a unified, modular, and open-source framework.
The architecture of SRL4Humanoid is depicted in Figure 3 (from the original paper):
The figure is an architecture diagram of the SRL4Humanoid framework, showing how inputs from the proprioceptive state and the privileged state are processed by the policy encoder and the value encoder, passed to the corresponding policy head and value head, and finally updated through the SRL loss and the PPO loss.
Figure 3. The architecture of the SRL4Humanoid framework, in which the SRL and RL processes are fully decoupled.
As shown in Figure 3, the SRL4Humanoid framework designs the SRL and RL processes to be fully decoupled.
- It uses Proximal Policy Optimization (PPO) [28] as the backbone RL algorithm.
- The policy network (which decides actions) accepts the proprioceptive state of the robot as input.
- The value network (which estimates the expected future reward) accepts the privileged state of the environment to perform value estimation. This is a common practice in RL for robustness and faster learning, as the critic can use more information than the actor.
- The SRL objective can be applied to either the policy encoder (to improve proprioceptive representations) or the value encoder (to improve privileged state representations). SRL4Humanoid implements three representative SRL algorithms for comparative study: SimSiam [1], SPR [31], and VAE [10], each representing a different methodological paradigm (contrastive, dynamics modeling, and reconstruction-based, respectively).
The joint optimization objective of RL and SRL is defined as:
$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{RL}} + \lambda \cdot \mathcal{L}_{\mathrm{SRL}} ,
$
where $\mathcal{L}_{\mathrm{RL}}$ is the RL loss (e.g., the PPO loss), $\mathcal{L}_{\mathrm{SRL}}$ is the SRL loss (e.g., the PvP, SimSiam, SPR, or VAE loss), and $\lambda$ is a weighting coefficient that balances the contribution of the SRL objective.
By default, the updates for $\mathcal{L}_{\mathrm{RL}}$ and $\mathcal{L}_{\mathrm{SRL}}$ are synchronized, meaning they share data batches and follow RL's update frequency. However, the authors found that continuous SRL updates can sometimes degrade learning efficiency, especially in early stages when RL generates large amounts of repetitive, low-quality data. This can cause the SRL module to prematurely converge to local optima.
To mitigate this, an interval update mechanism is employed:
$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{RL}} + \mathbb{1}(T) \cdot \lambda \cdot \mathcal{L}_{\mathrm{SRL}} ,
$
where $\mathbb{1}(T)$ is an indicator function that equals 1 every $T$ time steps (i.e., the SRL loss is applied only at intervals of $T$ steps) and 0 otherwise. This allows the SRL module to be trained less frequently during the initial noisy phases, preserving its ability to continuously influence policy learning more effectively.
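A sketch of how this interval-gated objective can be wired into a training step is given below; the interval, coefficient, and function name are illustrative placeholders, not values or APIs from the paper.

```python
def total_loss(step: int, rl_loss, srl_loss, lam: float = 0.5, interval: int = 20):
    """Combine the RL loss with the SRL loss only every `interval` update steps."""
    gate = 1.0 if step % interval == 0 else 0.0   # plays the role of the indicator 1(T)
    return rl_loss + gate * lam * srl_loss
```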
The workflow of SRL4Humanoid is summarized in Algorithm 1:
| Step | Operation |
|---|---|
| 1 | Initialize the policy network πθ and the value network Vφ |
| 2 | Initialize the SRL module Sψ |
| 3 | Set all the hyperparameters, such as the maximum number of episodes E and the number of update epochs K |
| 4 | for episode = 1 to E do |
| 5 | Sample rollouts using the policy network πθ |
| 6 | Perform generalized advantage estimation (GAE) to get the estimated task returns |
| 7 | for epoch = 1 to K do |
| 8 | Sample a mini-batch B from the rollout data |
| 9 | Use B to compute the policy and value loss |
| 10 | Use B to compute the SRL loss |
| 11 | Compute the total loss following Eq. (6) |
| 12 | Update the policy network, value network, and the SRL module |
| 13 | end |
| 14 | end |
| 15 | Output the optimized policy πθ |
Algorithm 1: SRL4Humanoid Training Workflow
- Initialization: Initialize the policy network $\pi_{\pmb\theta}$, value network $V_{\phi}$, and the SRL module $S_{\psi}$. Set hyperparameters like the maximum number of episodes $E$ and update epochs $K$.
- Episode Loop: For each episode from 1 to $E$:
  a. Rollout Collection: Sample rollouts (sequences of states, actions, rewards) using the current policy network.
  b. Return Estimation: Compute estimated task returns using Generalized Advantage Estimation (GAE) [27]. GAE helps in estimating how much better an action is compared to the average action from a given state, crucial for policy gradient methods (see the sketch after this list).
  c. Epoch Loop: For each epoch from 1 to $K$:
    i. Mini-batch Sampling: Sample a mini-batch $B$ from the collected rollout data.
    ii. Loss Computation: Compute the policy loss and value loss using mini-batch $B$, and compute the SRL loss using the same mini-batch.
    iii. Total Loss Calculation: Compute the total loss using the combined objective (Equation 6).
    iv. Parameter Update: Update the parameters of the policy network, value network, and SRL module by performing gradient descent on the total loss.
- Output: After all episodes, output the optimized policy network $\pi_{\pmb\theta}$.
SRL Algorithmic Baselines Implemented in SRL4Humanoid (from Appendix A):
The framework integrates different SRL methods, each with its own loss function:
-
PPO [28]: The backbone RL algorithm.
- Policy Loss:
$
L_{\pi}(\pmb\theta) = - \mathbb{E}_{\tau \sim \pi} \Big[ \min \big( \rho_t(\pmb\theta) A_t , \ \mathrm{clip}\big( \rho_t(\pmb\theta), 1 - \epsilon, 1 + \epsilon \big) A_t \big) \Big] ,
$
where $\rho_t(\pmb\theta)$ is the probability ratio between the new policy and the old policy, and $A_t$ is the advantage estimate (from GAE). $\epsilon$ is a clipping range coefficient that limits how much the probability ratio can deviate from 1. This clipping helps to prevent large, destructive policy updates (see the code sketch after this list).
- Value Loss: The value network is trained to minimize the mean squared error between its predicted value and the target discounted returns computed with GAE: $ L_V(\phi) = \mathbb{E}_{\tau \sim \pi} \big[ ( V_{\phi}(s) - V_t^{\mathrm{target}} )^2 \big] . $
-
VAE (Variational Autoencoders) [10]: A reconstruction-based SRL method.
- Loss Function:
$
\mathcal{L}_{\mathrm{VAE}} = - \mathbb{E}_{q_{\phi}(z|o)} \big[ \log p_{\theta}(o|z) \big] + D_{\mathrm{KL}}\big( q_{\phi}(z|o) \,\|\, p_{\theta}(z) \big) ,
$
where $q_{\phi}(z|o)$ is the encoder (mapping observation $o$ to latent variable $z$), $p_{\theta}(o|z)$ is the decoder (reconstructing the observation from the latent variable), and $D_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence. The first term is the reconstruction loss (expected negative log-likelihood of the observation given the latent variable), encouraging fidelity. The second term is the KL divergence between the encoder's distribution and a prior distribution (typically a standard Gaussian), acting as a regularizer to ensure the latent space is well-behaved.
-
SPR (Self-Predictive Representations) [31]: A dynamics modeling SRL method.
- Loss Function:
$
L_{\mathrm{SPR}} = \sum_{k=1}^{K} \big\lVert f_{\pmb\theta}^{(k)}(z_t, \pmb a_{t:t+k-1}) - \mathbf{sg}\big(g_{\phi}(z_{t+k})\big) \big\rVert_2^2 ,
$
This loss encourages the online dynamics model to predict future latent states accurately. Here, $f_{\pmb\theta}^{(k)}(z_t, \pmb a_{t:t+k-1})$ represents the $k$-step prediction of the future latent state starting from $z_t$ and taking actions $\pmb a_{t:t+k-1}$, and $\mathbf{sg}(\cdot)$ is the stop-gradient operation. $g_{\phi}$ is the target dynamics model whose parameters are an exponential moving average (EMA) of the online model's parameters $\pmb\theta$. The L2 norm measures the squared difference between the predicted and target future latent states.
-
SimSiam (Simple Siamese) [1]: A self-supervised contrastive learning method.
- Loss Function:
$
L_{\mathrm{SimSiam}} = \frac{1}{2} \left[ - \frac{ f_{\theta}(x_1) \cdot f_{\theta}(x_2) }{ \lVert f_{\theta}(x_1) \rVert_2 \, \lVert f_{\theta}(x_2) \rVert_2 } \right] ,
$
This is the negative cosine similarity between the outputs of the encoder network for two augmented views $x_1$ and $x_2$ of the same input. The SimSiam architecture applies this loss in a symmetrical way, similar to how PvP uses it, typically involving a predictor head and a stop-gradient on one of the encoder outputs to prevent collapse. The term presented here is the core cosine similarity part; in the context of PvP and SimSiam, the full loss involves two symmetrical terms, each with a predictor and a stop-gradient on the target latent representation.
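As referenced in the PPO entry above, here is a minimal PyTorch sketch of the clipped surrogate policy loss and the value loss used by the backbone algorithm; the clip coefficient and tensor names are illustrative, not taken from the paper's code.

```python
import torch

def ppo_losses(new_log_prob, old_log_prob, advantages, values, value_targets, clip_eps=0.2):
    """Clipped PPO policy loss and mean-squared value loss for one mini-batch."""
    ratio = torch.exp(new_log_prob - old_log_prob)          # probability ratio rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()     # clipped surrogate objective
    value_loss = (values - value_targets).pow(2).mean()     # regression to GAE return targets
    return policy_loss, value_loss
```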
4.3. Proprioceptive State vs. Privileged State for Tasks
The paper provides detailed breakdowns of the proprioceptive and privileged states used for each task in the supplementary material.
The following are the details of the proprioceptive state and privileged state of the LimX-Oli-31dof-Velocity task (Table 2 from the original paper):
| Proprioceptive State | Privileged State |
|---|---|
| base_ang_vel (3x5) | base_lin_vel (3) |
| projected_gravity (3x5) | base_ang_vel (3) |
| gait (5) | projected_gravity (3) |
| velocity_commands (3x5) | velocity_commands (3) |
| joint_pos (31x5) | joint_pos (31) |
| joint_vel (31x5) | joint_vel (31) |
| actions (31x5) | actions (31) |
| | gait (5) |
For the velocity tracking task, the policy encoder's input is a history of 5 consecutive proprioceptive states (e.g., base_ang_vel (3-dim vector) is stacked 5 times, making it 3x5=15 dimensions). This is done to enhance robustness. The privileged state for the critic provides single-time-step ground truth information.
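The 5-step history can be produced with a simple frame-stacking buffer like the one below; the class name and buffer layout are illustrative assumptions rather than the paper's implementation, and only the entries marked with "x5" in Table 2 come from such a history.

```python
import numpy as np
from collections import deque

class ProprioHistory:
    """Stack the last `history` proprioceptive observations into one policy input."""
    def __init__(self, obs_dim: int, history: int = 5):
        self.buffer = deque([np.zeros(obs_dim)] * history, maxlen=history)

    def push(self, obs: np.ndarray) -> np.ndarray:
        self.buffer.append(obs)                 # drop the oldest frame, keep the newest 5
        return np.concatenate(self.buffer)      # flattened history fed to the policy encoder
```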
The following are the details of the proprioceptive state and privileged state of the LimX-Oli-31dof-Mimic task (Table 4 from the original paper):
| Proprioceptive State | Privileged State |
|---|---|
| base_ang_vel (3) | base_lin_vel (3) |
| projected_gravity (3) | base_ang_vel (3) |
| joint_pos (31) | base_pos_z (1) |
| joint_vel (31) | body_mass (40) |
| actions (31) | base_quat (6) |
| mimic reference (69) | projected_gravity (3) |
| | velocity_commands (3) |
| | joint_pos (31) |
| | joint_vel (31) |
| | actions (31) |
| | previous actions (31) |
| | mimic reference (69) |
For the motion imitation task, the proprioceptive state also includes mimic reference (69 dimensions), which describes the target motion. The privileged state is significantly richer, including body_mass, base_quat (quaternion for orientation), and previous actions, which are typically not available or reliably observable by the robot's proprioceptors.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on the LimX Oli humanoid robot, which serves as the test platform.
- Robot Platform:
LimX Oli(Figure 4 from the original paper).-
A full-size humanoid robot with 31
degrees of freedom (DoF).The specifications of the LimX Oli humanoid robot used in the experiments, and the screenshots of the two designed tasks are presented in Figure 4 (from the original paper):
The figure shows the specifications of the LimX Oli humanoid robot, including its height, shoulder width, arm length, weight, and the detailed degrees of freedom of each joint group; the right side shows screenshots of the two tasks, velocity tracking and motion imitation.
-
Figure 4. The specifications of the LimX Oli humanoid robot used in the experiments, and the screenshots of the two designed tasks.
- Tasks: Two representative
Whole-Body Control (WBC)tasks are designed on this platform:LimX-Oli-31dof-Velocity(Velocity Tracking):- Description: The robot is required to track a velocity command on flat terrain.
- Commands: Velocity commands are resampled every 10 seconds.
- Linear velocity on the x-axis:
- Linear velocity on the y-axis:
- Angular velocity on the z-axis:
- Reward Terms: Key reward terms are detailed in Table 1 (from original paper).
LimX-Oli-31dof-Mimic(Motion Imitation):-
Description: The robot is required to imitate different pre-recorded human animations.
-
Data: A set of 20 pre-recorded human motions (examples shown in Figure 12 from original paper), each with a maximum length of 43 seconds and 4,300 frames.
-
Reward Terms: Key reward terms are detailed in Table 3 (from original paper).
Example screenshots of the motion capture data are presented in Figure 12 (from the original paper):
The figure shows several example screenshots of the motion capture data, containing sequences of multiple poses and movements.
-
Figure 12. Example screenshots of the motion capture data.
- Choice of Datasets/Tasks: These two tasks cover primary categories of evaluation benchmarks in current humanoid robot research [44], allowing for a comprehensive evaluation of the proposed methods' ability to handle both reactive control (velocity tracking) and complex, high-dimensional trajectory following (motion imitation). The
LimX Olirobot provides a realistic and challenging platform.
5.2. Evaluation Metrics
The evaluation metrics focus on four aspects: overall task performance, specific Key Performance Indicators (KPIs), training efficiency, and real-world deployment effectiveness.
-
Overall Task Performance:
- Conceptual Definition: This metric quantifies the overall success of the
RL agentin achieving its objective by summing up all individual reward components. It reflects theexpected discounted returnthat thepolicyaims to maximize. - Calculation: Computed as the weighted summation of all sub-reward functions defined for each task (e.g.,
linear velocity tracking,angular velocity tracking,base height,action smoothness, etc.). The specific weights are given in Tables 1 and 3.
- Conceptual Definition: This metric quantifies the overall success of the
-
Key Performance Indicators (KPIs):
-
Velocity Tracking Accuracy (for
LimX-Oli-31dof-Velocitytask):- Conceptual Definition: Measures how closely the robot's actual linear and angular velocities match the commanded velocities. Higher accuracy indicates better command following.
- Mathematical Formula (from Table 1):
- Linear velocity tracking:
- Angular velocity tracking:
- Symbol Explanation:
- : Robot's actual linear velocity in the xy-plane.
- : Commanded linear velocity.
- : Robot's actual angular velocity around the z-axis.
- : Commanded angular velocity.
- : A scaling factor (standard deviation-like term) to adjust sensitivity.
- : Squared Euclidean norm.
- : Exponential function, used to convert a negative error into a positive reward, where a smaller error yields a reward closer to 1.
- The angular velocity reward is presented as a negative squared error, implying it's a penalty term that is minimized, or transformed into a positive reward similar to linear velocity tracking.
-
Position Alignment Accuracy (for
LimX-Oli-31dof-Mimictask):- Conceptual Definition: Quantifies how well the robot's body parts (joints, feet, waist) match the reference poses from the human motion capture data. Higher accuracy means better imitation.
- Mathematical Formula (from Table 3):
- Position tracking:
- Feet distance tracking:
- Waist pitch orientation tracking:
\exp\left(-\frac{\sum_{i=1}^n |\delta_i - \delta_{\mathrm{ref}}|^2}{n^2}\right)
- Symbol Explanation:
- : Robot's current joint positions.
- : Reference joint positions from the mimic motion.
- : Robot's actual feet distance.
- : Reference feet distance from the mimic motion.
- : Pitch orientation of the -th waist joint (if multiple, or a specific one).
- : Reference waist pitch orientation from the mimic motion.
- : Number of waist joints considered for orientation.
- : A scaling factor.
- : Squared Euclidean norm.
- : Squared absolute value.
- : Exponential function, converting error into a positive reward.
-
Action Smoothness (for
LimX-Oli-31dof-Velocityand implicit forLimX-Oli-31dof-Mimic):- Conceptual Definition: Measures the jerkiness or abruptness of the robot's movements. A smoother policy changes actions less drastically over consecutive time steps, which is crucial for stability, energy efficiency, and preventing damage to the robot in real-world deployment.
- Mathematical Formula (from Table 1, under
Action smoothness):- (Note: The formula in the table is , which seems to be a typo or a non-standard form for jerk. The standard finite difference for acceleration of action is . Assuming the intent is to penalize rapid changes in acceleration/jerk, the first form is more standard. However, strictly adhering to the paper's formula, it is )
- Symbol Explanation:
- : Action at current time step .
- : Action at previous time step
t-1. - : Action at two previous time steps
t-2. - : A negative weighting coefficient (e.g., -2.5e-3 from Table 1), indicating a penalty.
- : Squared Euclidean norm, penalizing large differences.
-
-
Training Efficiency:
- Conceptual Definition: Evaluates how quickly an
RL agentcan learn a goodpolicyand converge to high performance. It is often measured by the number ofenvironment interactionsortraining episodesrequired to reach a certain performance level. - Measurement: Compared by analyzing the learning curves (e.g., reward accumulation over episodes) of different methods, looking for faster increases and earlier plateaus.
- Conceptual Definition: Evaluates how quickly an
-
Real-world Deployment Effectiveness:
- Conceptual Definition: Assesses how well the learned
policyperforms on a physical robot or in a highly realistic simulation, considering factors like robustness, stability, and safety.Action smoothnessis a key indicator here. - Measurement: Primarily through
Sim2Sim evaluation(e.g., onMuJoCo) andreal-robot testing, visually and quantitatively observing behaviors, especially theaction smoothness reward term.
- Conceptual Definition: Assesses how well the learned
5.3. Baselines
The paper uses PPO [28] as the backbone RL algorithm. The proposed PvP method is compared against vanilla PPO and PPO combined with three representative SRL methods, each exemplifying a distinct methodological paradigm:
-
Vanilla PPO:
- Description: The standard PPO algorithm without any State Representation Learning module. This serves as the fundamental baseline to evaluate the effectiveness of adding SRL.
- Representativeness: Represents the current state-of-the-art policy gradient method for many RL tasks, especially in robotics.

PPO + VAE [10]:
- Description: PPO integrated with a Variational Autoencoder (VAE) for SRL. The VAE learns latent representations by attempting to reconstruct the input observations, while regularizing the latent space.
- Representativeness: Exemplifies reconstruction-based SRL methods, which are widely used for learning compressed representations.

PPO + SPR [31]:
- Description: PPO integrated with Self-Predictive Representations (SPR) for SRL. SPR learns latent representations by enforcing multi-step consistency between predicted and encoded future states, focusing on capturing environment dynamics.
- Representativeness: Represents dynamics modeling-based SRL methods, which aim to extract features relevant to predicting future states and actions.

PPO + SimSiam [1]:
- Description: PPO integrated with the SimSiam algorithm for SRL. SimSiam is a self-supervised contrastive learning method that learns representations by comparing augmented views of the same input, using a Siamese network architecture and stop-gradient to prevent collapse.
- Representativeness: Exemplifies contrastive learning-based SRL methods that emphasize learning invariant features without explicit negative pairs. This is the closest baseline to PvP in terms of its underlying contrastive learning mechanism.

By comparing against these diverse baselines, the authors aim to demonstrate
-
5.4. PPO Network Architectures and Hyperparameters
The network architectures for the policy and value networks and the PPO hyperparameters are fixed across all experiments to ensure a fair comparison and isolate the effects of the SRL methods.
The following are the architectures of the policy and value network, which remain fixed for all the experiments (Table 5 from the original paper):
| Part | Policy Network | Value Network |
|---|---|---|
| Encoder | Linear(O. D., 512), ELU(), Linear(512, 256), ELU(), Linear(256, 128) | Linear(O. D., 512), ELU(), Linear(512, 256), ELU(), Linear(256, 128) |
| Head | Linear(128, 128), ELU(), Linear(128, 31) | Linear(256, 128), ELU(), Linear(128, 1) |
-
O. D.stands for "On-Demand," meaning the input dimension is determined by the specific task's observation space. -
ELU()is theExponential Linear Unitactivation function. -
Both
policyandvaluenetworks have a multi-layerencoderstructure, but theirheadsdiffer: thepolicy headoutputs 31 values (for 31 actuated joints), while thevalue headoutputs a single scalar value.The following are the PPO hyperparameters for the two tasks, which remain fixed for all experiments (Table 6 from the original paper):
| Hyperparameter | Value |
|---|---|
| Reward normalization | Yes |
| LSTM | No |
| Maximum Episodes | 30000 |
| Episode steps | 32 |
| Number of workers | 1 |
| Environments per worker | 4096 |
| Optimizer | Adam |
| Learning rate | 1e-3 |
| Learning rate scheduler | Adaptive |
| GAE coefficient | 0.95 |
| Action entropy coefficient | 0.01 |
| Value loss coefficient | 1.0 |
| Value clip range | 0.2 |
| Max gradient norm | 0.5 |
| Number of mini-batches | 4 |
| Number of learning epochs | 5 |
| Desired KL divergence | 0.01 |
| Discount factor | 0.99 |
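For reference, a minimal PyTorch sketch of the Table 5 network layout is shown below; `obs_dim` and `priv_dim` stand in for the task-dependent "On-Demand" input sizes, the class name is illustrative, and the value head is wired to the 128-dimensional encoder output so the sketch runs end to end (which differs from the 256-input head printed in Table 5).

```python
import torch.nn as nn

def mlp_encoder(in_dim: int) -> nn.Sequential:
    """Encoder layout shared by the policy and value networks in Table 5."""
    return nn.Sequential(nn.Linear(in_dim, 512), nn.ELU(),
                         nn.Linear(512, 256), nn.ELU(),
                         nn.Linear(256, 128))

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, priv_dim: int, num_joints: int = 31):
        super().__init__()
        self.policy_encoder = mlp_encoder(obs_dim)    # proprioceptive input
        self.value_encoder = mlp_encoder(priv_dim)    # privileged input
        self.policy_head = nn.Sequential(nn.Linear(128, 128), nn.ELU(), nn.Linear(128, num_joints))
        self.value_head = nn.Sequential(nn.Linear(128, 128), nn.ELU(), nn.Linear(128, 1))

    def forward(self, obs, priv):
        return self.policy_head(self.policy_encoder(obs)), self.value_head(self.value_encoder(priv))
```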
5.5. SRL Specific Hyperparameters (Appendix C)
- PPO + PvP:
  - Privileged Information: Root linear velocity (relative to the world coordinate system) for velocity tracking, with root orientation information also involved for motion imitation.
  - Zero Masking: Applied to the proprioceptive state throughout training to align its dimension with the privileged state. (Note: this phrasing is slightly confusing. Based on the PvP method description, ZeroMasking is applied to the privileged state to derive a proprioceptive-like view $\tilde{\pmb s}_t$, which is then contrasted with the original privileged state $\pmb s_t$; it does not mask the proprioceptive state itself for dimension alignment, but masks elements within the privileged state.)
  - Loss Coefficient ($\lambda$): Searched over a set of candidate values, with 0.5 used as the baseline.
- PPO + SimSiam [1]:
  - Loss Coefficient ($\lambda$): Searched over a set of candidate values, with 0.5 used as the baseline.
  - Data Augmentation Operations: Searched over {random_masking, gaussian_noise, random_amplitude_scaling, identity_mapping}. The random_masking and identity_mapping operations were used as baseline settings, because the proprioceptive state in the simulator is already subject to domain randomization, which acts as a form of data augmentation.
- PPO + SPR [31]:
  - Loss Coefficient ($\lambda$): Searched over a set of candidate values, with 0.5 used as the baseline.
  - Data Augmentation Operations: Searched over {random_masking, gaussian_noise, random_amplitude_scaling, identity_mapping}. The gaussian_noise operation was used as the baseline setting.
  - Number of Prediction Steps ($K$ in $L_{\mathrm{SPR}}$): Searched over a set of candidate values, with 5 steps used as the baseline.
  - Average Loss: Whether to use an average loss (not specified if included in the baseline).
- PPO + VAE [10]:
Loss Coefficient(): Searched over , with0.1used as the baseline setting.
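To make the ZeroMasking clarification above concrete, here is a hedged sketch of deriving a proprioceptive-like view by zeroing the privileged-only entries of the privileged state and contrasting the two views. The index layout, module names, and SimSiam-style loss form are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pvp_contrastive_loss(encoder, projector, predictor,
                         privileged_state, privileged_dims):
    """Hedged sketch of a PvP-style contrastive objective.

    privileged_state: (batch, d_priv) tensor containing proprioceptive
        entries plus privileged-only entries (e.g., root linear velocity).
    privileged_dims: indices of the privileged-only entries to zero out,
        yielding a proprioceptive-like view of the same dimension.
    """
    masked_view = privileged_state.clone()
    masked_view[:, privileged_dims] = 0.0  # ZeroMasking: drop privileged-only info

    z_priv = projector(encoder(privileged_state))   # full privileged view
    z_prop = projector(encoder(masked_view))        # proprioceptive-like view

    # SimSiam-style symmetric loss with stop-gradient on the target branch
    loss = -0.5 * (
        F.cosine_similarity(predictor(z_prop), z_priv.detach(), dim=-1).mean()
        + F.cosine_similarity(predictor(z_priv), z_prop.detach(), dim=-1).mean()
    )
    return loss
```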
6. Results & Analysis
The experimental results are presented to answer specific research questions (Q1-Q6) regarding the performance, impact of training configurations, and computational efficiency of PvP and other SRL methods for humanoid WBC.
6.1. Core Results Analysis
Q1: Can the proposed PvP algorithm outperform the baseline methods?
The paper presents an analysis of overall task performance, action smoothness optimization, and key performance indicators (KPIs) to answer this question.
The overall reward comparison between the vanilla PPO agent and its combinations with the four SRL methods on the two humanoid WBC tasks is presented in Figure 5 (from the original paper):
Figure 5. The normalized scores of the LimX Oli robot across velocity tracking and motion imitation tasks with various algorithms over training progress. It compares PPO with several combined strategies, including VAE, SPR, SimSiam, and PvP, showing that PvP significantly improves sample efficiency and final performance.
- Overall Task Performance (Figure 5):
  - Velocity Tracking Task (Left Plot): PvP significantly accelerates the learning process, achieving higher normalized scores much faster than other methods. Vanilla PPO is the slowest; SPR and SimSiam offer marginal improvements over PPO, and VAE shows minimal benefit. This demonstrates PvP's advantage in leveraging privileged information to extract more informative features, thereby accelerating learning.
  - Motion Imitation Task (Right Plot): PvP again achieves the highest performance and sample efficiency. SimSiam and SPR also outperform Vanilla PPO, showing the benefits of SRL. However, VAE exhibits performance degradation, suggesting that simple reconstruction of sensory data is insufficient or even detrimental for learning efficiency in this complex task.
  - Conclusion: Learning high-quality features, especially with PvP, improves both learning efficiency and final performance in humanoid WBC tasks.
The comparison of action smoothness optimization between the vanilla PPO agent and its combinations with the four SRL methods is presented in Figure 6 (from the original paper):
Figure 6. The comparison of action smoothness optimization between the vanilla PPO agent and its combinations with the four SRL methods. The solid line and shaded region denote the mean and standard deviation, respectively.
- Action Smoothness Optimization (Figure 6 - Velocity Tracking Task):
  - Action smoothness is a crucial reward term that penalizes abrupt movements, ensuring smoother and more controlled motions, which is vital for reliable real-world deployment; a common formulation of such a penalty is sketched after this list.
  - PvP significantly accelerates the convergence of this penalty term, i.e., it minimizes the penalty and thus achieves smoother actions much faster. This indicates that PvP not only speeds up policy learning in simulation but also directly contributes to safer and more reliable behavior in real-world applications.
  - Other SRL methods also show some improvement over Vanilla PPO, but PvP's effect is most pronounced.
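As a reference point for the penalty discussed above, the following is one common formulation of an action-smoothness term in legged/humanoid RL (a penalty on consecutive action differences). The paper does not give its exact definition, so this is an assumed form with illustrative names.

```python
import torch

def action_smoothness_penalty(a_t, a_prev, a_prev2=None):
    """One common action-smoothness penalty (assumed form, not the paper's).

    The first-order term penalizes the change between consecutive actions;
    the optional second-order term penalizes changes in that change.
    """
    penalty = torch.sum((a_t - a_prev) ** 2, dim=-1)
    if a_prev2 is not None:
        penalty = penalty + torch.sum((a_t - 2 * a_prev + a_prev2) ** 2, dim=-1)
    return -penalty  # negative: added to the reward as a penalty
```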
The tracking performance comparison between the PPO agent and its combinations with the four SRL methods is presented in Figure 7 (from the original paper):
Figure 7. The tracking performance comparison between the PPO agent and its combinations with the four SRL methods. Our PvP achieves the highest performance in terms of the three key tracking metrics.
- Tracking Performance (Figure 7 - Motion Imitation Task):
  - For motion imitation, control precision is paramount. The figure compares PvP and other baselines across three KPIs: waist pitch orientation, feet distance, and joint position (likely tracking error; smaller is better for the negative values).
  - PvP consistently achieves the highest performance across all three metrics, indicating that it provides reliable gains on critical KPIs in diverse, demanding tasks. The curves show that PvP reaches lower error (higher reward) faster and attains better final performance.
Overall Answer to Q1: Yes, PvP consistently and significantly outperforms Vanilla PPO and the other SRL baselines (VAE, SPR, SimSiam) in terms of sample efficiency, final performance (overall reward and specific KPIs), and action smoothness across both velocity tracking and motion imitation tasks.
Q2: How does the proportion of training time affect SRL's performance?
The training progress comparison of the four SRL methods with different training time proportions on the two humanoid WBC tasks is presented in Figure 8 (from the original paper):
Figure 8. Training progress comparison of the four SRL methods with different training time proportions on the two humanoid WBC tasks. The solid line and shaded region denote the mean and standard deviation, respectively.
- This question investigates the interval update mechanism for the SRL loss, exploring different update intervals. The intervals tested are 1 (synchronous update), 50, and 100.
- Velocity Tracking Task (Left Plot): Adjusting the update interval has minimal effect on the performance curves for all SRL methods; the curves for intervals 1, 50, and 100 largely overlap.
- Motion Imitation Task (Right Plot): Here, the update interval has a clear impact. An update interval of 50 (meaning the SRL loss is applied every 50 time steps) generally yields optimal or near-optimal performance for all SRL methods. Applying SRL at every step (interval 1) or too infrequently (interval 100) can lead to slightly lower performance or slower convergence compared to interval 50.
- Conclusion: Carefully selecting the update interval (e.g., 50) can improve SRL performance, especially in more complex tasks like motion imitation. This is attributed to preventing SRL from prematurely falling into local optima due to low-quality data in early training stages, and to reducing computational overhead. A minimal sketch of interval-gated SRL updates is given after this list.
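The interval update mechanism can be pictured as gating the auxiliary SRL loss inside the training loop. The sketch below is a hedged illustration; `agent.ppo_loss`, `agent.optimizer`, and `srl_module.loss` are placeholder names, not SRL4Humanoid APIs.

```python
def training_update(step, batch, agent, srl_module,
                    srl_interval=50, srl_coef=0.5):
    """One optimization step with an interval-gated auxiliary SRL term.

    The SRL loss is only added every `srl_interval` steps; otherwise the
    update reduces to a plain PPO step.
    """
    loss = agent.ppo_loss(batch)
    if step % srl_interval == 0:
        loss = loss + srl_coef * srl_module.loss(batch)
    agent.optimizer.zero_grad()
    loss.backward()
    agent.optimizer.step()
    return loss.detach()
```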
Q3: How does the proportion of training data affect SRL's performance?
The training progress comparison of the four SRL methods with different training data proportions on the two humanoid WBC tasks is presented in Figure 9 (from the original paper):
Figure 9. Training progress comparison of the four SRL methods with different training data proportions on the two humanoid WBC tasks. The solid line and shaded region denote the mean and standard deviation, respectively.
- This question explores the impact of using different percentages of the collected rollout data for SRL updates in each episode (10%, 50%, and 100%), with data resampled via random masking.
- Velocity Tracking Task (Left Plot): Using different proportions of training data for SRL results in nearly identical training curves for all methods, suggesting that for this task the quantity of data used for SRL within each update does not significantly alter performance.
- Motion Imitation Task (Right Plot): In contrast, for motion imitation, increasing the data proportion generally enhances performance; this effect is particularly noticeable for SimSiam and PvP, where the larger data proportions reach the best or comparable final performance.
- Conclusion: Allocating an appropriate (typically larger) proportion of the training data to SRL accelerates learning and improves performance, especially in more data-sensitive tasks like motion imitation. A short subsampling sketch is given after this list.
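The data-proportion ablation amounts to subsampling the collected rollouts before the SRL update. Below is a hedged sketch using a plain random subset; the actual resampling-via-random-masking procedure in the paper may differ.

```python
import torch

def sample_srl_batch(rollout_states: torch.Tensor, proportion: float = 0.5):
    """Randomly subsample a proportion of rollout states for the SRL update.

    rollout_states: (num_samples, state_dim) tensor of collected states.
    proportion: fraction of the rollout used for the auxiliary SRL loss
        (e.g., 0.1, 0.5, or 1.0 as in the ablation).
    """
    num_samples = rollout_states.shape[0]
    num_selected = max(1, int(proportion * num_samples))
    idx = torch.randperm(num_samples)[:num_selected]
    return rollout_states[idx]
```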
Q4: Which encoder (policy or value) benefits more from applying SRL loss?
The learning curves of applying the SRL to the value encoder are presented in Figure 10 (from the original paper):
Figure 10. Learning curves of applying the SRL to the value encoder. The solid line and shaded region denote the mean and standard deviation, respectively.
- Previous experiments applied the SRL loss to the policy encoder (which processes the proprioceptive state). This ablation study investigates applying the SRL loss to the value encoder (which processes the richer privileged state) instead. SPR is excluded because it requires state-action pairs for training, which might not align cleanly with the value encoder's typical state-only input.
- Comparison (Figure 10): Applying the SRL loss to the value encoder generally leads to slower convergence and lower overall performance compared to applying it to the policy encoder (as seen in Figure 5).
- Velocity Tracking Task (Top Plot - Action Smoothness): When SRL (specifically SimSiam or PvP) is applied to the value encoder, a collapse in training is observed, indicated by a sharp drop in action smoothness (i.e., highly unsmooth, heavily penalized actions) before eventual recovery. This suggests significant instability.
- Motion Imitation Task (Bottom Plot - Normalized Score): While PvP and SimSiam on the value encoder still learn, their final performance and convergence speed are inferior to applying SRL to the policy encoder.
- Conclusion: Applying SRL to the policy encoder (which learns representations for the proprioceptive state used by the actor) yields a more stable learning process and enhanced performance. The value encoder (which guides the critic based on privileged states) appears less receptive to direct SRL enhancement, or at least requires careful tuning to avoid instability. A minimal sketch of the two attachment options follows.
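The two attachment points can be summarized as choosing which encoder the auxiliary SRL gradients flow into. The sketch below is illustrative only; the function and argument names are assumptions.

```python
def total_loss(ppo_losses, srl_loss_fn, policy_encoder, value_encoder,
               proprio_batch, privileged_batch,
               srl_coef=0.5, attach_to="policy"):
    """Combine the PPO objective with an auxiliary SRL term on one encoder.

    attach_to="policy": SRL gradients flow into the policy encoder
        (proprioceptive inputs), the stable, better-performing setting.
    attach_to="value": SRL gradients flow into the value encoder
        (privileged inputs), which the ablation found prone to collapse.
    """
    loss = ppo_losses["policy"] + ppo_losses["value"]
    if attach_to == "policy":
        loss = loss + srl_coef * srl_loss_fn(policy_encoder, proprio_batch)
    else:
        loss = loss + srl_coef * srl_loss_fn(value_encoder, privileged_batch)
    return loss
```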
Q5: How computation-efficient are these SRL methods?
- Findings: The authors state that all experiments were run on a single GPU using IsaacLab [23]. Their implementation allows the SRL module to run entirely on the GPU, implying that it does not significantly impact overall training efficiency or introduce a CPU bottleneck.
- Conclusion: SRL4Humanoid effectively accelerates humanoid WBC tasks with minimal additional computational cost. Specific quantitative benchmarks (e.g., training time with/without SRL, memory usage) are not detailed in the paper text but are noted to be attached to the compute reporting form (CRF).
Q6: How do the SRL-enhanced methods behave in real-world deployment?
The Sim2Sim evaluation on the MuJoCo simulator is presented in Figure 11 (from the original paper):
Figure 11. Sim2Sim evaluation on the MuJoCo simulator. The first two rows demonstrate motion imitation ability, and the last two rows show velocity tracking ability.
- Sim2Sim Evaluation (Figure 11): The authors first conduct a thorough Sim2Sim evaluation on the MuJoCo platform [40]. MuJoCo is known for its high simulation precision, which is considered closer to real-world conditions than IsaacLab [21]. Figure 11 visually demonstrates the robot executing complex tasks (motion imitation in the first two rows and velocity tracking in the last two rows) with the learned policy, indicating that the policies generalize well to a different, more realistic simulator.
- Real-world Evaluation (Figure 1): Following the Sim2Sim evaluation, real-robot testing was performed on the LimX Oli humanoid robot. The paper includes an image of the robot performing tasks, implying successful deployment (Figure 1 is the first image of the entire paper, showing the robot in action). More demonstrations are available in the Supporting Videos.
- Conclusion: The Sim2Sim and real-robot evaluations confirm the effectiveness of the SRL-enhanced policies (particularly PvP) in real-world scenarios, demonstrating the robot's ability to perform complex tasks with the learned policies. The earlier results on action smoothness (Figure 6) also implicitly support better real-world behavior.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully proposes PvP (Proprioceptive-Privileged contrastive learning), a novel framework designed to enhance sample efficiency and performance in humanoid Whole-Body Control (WBC) tasks. PvP achieves this by effectively leveraging the intrinsic complementarity between proprioceptive and privileged states for contrastive learning, thereby producing robust and task-relevant proprioceptive representations for policy learning without relying on complex, hand-crafted data augmentations.
Alongside PvP, the authors introduce SRL4Humanoid, a unified, modular, and open-source framework. This framework provides high-quality implementations of representative State Representation Learning (SRL) techniques, facilitating reproducible research and systematic evaluation in humanoid robot learning.
Extensive experiments on the LimX Oli humanoid robot, across challenging velocity tracking and motion imitation tasks, demonstrate PvP's significant improvements in sample efficiency, overall reward, key performance indicators (KPIs), and crucially, action smoothness, compared to various SRL baselines. The study also provides valuable practical insights into optimal SRL integration strategies, such as the effectiveness of interval updates and the superior benefit of applying SRL to the policy encoder over the value encoder. These findings collectively highlight the substantial potential of SRL-empowered Reinforcement Learning for humanoid WBC tasks.
7.2. Limitations & Future Work
The authors acknowledge several limitations and outline future research directions:
- Exploration of Additional SRL Techniques: While the efficacy of the PvP framework was demonstrated using several representative SRL methods, future research could explore integrating other SRL techniques to further enhance policy learning. The field of SRL is rapidly evolving, and new methods might offer further benefits.
- Incorporating Multimodal Data: Recent advancements in perception-based humanoid research highlight the potential of integrating multimodal data, such as RGB or depth images, into policy learning. The current PvP framework primarily focuses on proprioceptive and privileged states (which are mostly numerical robot/environment states). The authors aim to extend their work to these settings, which would expand the capabilities of humanoid robots to operate in more complex, visually rich environments.
7.3. Personal Insights & Critique
This paper presents a well-structured and empirically strong contribution to data-efficient humanoid robot learning.
Personal Insights:
- Elegant Use of Privileged Information: The core idea of PvP, using the privileged state as a "pseudo augmentation" for the proprioceptive state in a contrastive learning framework, is quite elegant. It cleverly bridges the gap between the rich information available in simulation and the limited information available to a real robot's policy. This approach allows the policy encoder to learn highly informative features from proprioceptive data that implicitly capture privileged dynamics, without ever explicitly exposing privileged information to the deployed policy, making it robust for sim-to-real transfer.
- Practical Framework for Reproducibility: The SRL4Humanoid framework is a significant practical contribution. The lack of standardized, modular codebases is a common bottleneck in RL research. By open-sourcing high-quality implementations, the authors facilitate reproducibility and accelerate future research in humanoid WBC, allowing other researchers to easily build upon or compare against their work.
- Thorough Ablation Studies: The detailed ablation studies on training time proportion, training data proportion, and SRL application to the value encoder provide invaluable practical guidance. These insights are often overlooked in theoretical papers but are critical for practitioners aiming to implement SRL effectively in complex RL systems. The finding about interval updates for the SRL loss is particularly useful for avoiding performance degradation during early training.
- Emphasis on Real-world Metrics: The focus on action smoothness as a key performance indicator, and its significant improvement with PvP, is commendable. For real-world robot deployment, safety and smooth operation are often as important as task accuracy.
Critique / Areas for Improvement:
- "ZeroMasking" Nuance: While the concept of ZeroMasking is clear, the paper states in the supplementary that "we attach the zero mask to the proprioceptive state in the whole training to align its dimension with the privileged state." This phrasing might be slightly confusing. The methodological description (Section 4.1) correctly states that ZeroMasking is applied to the privileged state to derive a proprioceptive-like view. Clarifying this in the supplementary or main text would prevent potential misinterpretation. It would also be interesting to know whether other masking strategies (e.g., Gaussian noise on privileged features, or simply dropping privileged features) were considered for creating the masked view, and why ZeroMasking was chosen.
- Quantitative Computational Efficiency: While the paper mentions that SRL4Humanoid runs on the GPU with minimal cost and attaches logs to a CRF, providing some quantitative benchmarks (e.g., actual training time comparisons, GPU memory usage with/without SRL) within the paper's results section would strengthen this claim and offer more concrete evidence for practitioners.
- Generalization of Privileged Information: The privileged state in this paper includes root linear velocity and root orientation. While these are very useful, they are still somewhat restricted. The authors' future work mentions multimodal data, which is a step further. It would be insightful to discuss how this privileged information might be scaled or generalized if the privileged state included more abstract or complex concepts (e.g., the intent of a human collaborator, or properties of an unknown object the robot is interacting with).
- Robustness to Imperfect Privileged Information: In some simulations, even privileged information might not be perfectly accurate. The paper assumes a perfect privileged state. Exploring PvP's robustness when the privileged state itself is noisy or slightly inaccurate would be an interesting avenue.
- Further Investigation into Value Encoder Collapse: The collapse in training when SRL is applied to the value encoder (Figure 10) is a significant observation. While the paper notes it and advises against it, a deeper analysis into why this occurs (e.g., gradient dynamics, conflicting objectives, differing representation needs of the actor vs. the critic) would be valuable for understanding SRL integration more generally.
Transferability & Applications:
The PvP framework and the SRL4Humanoid toolkit are highly transferable. The principle of using privileged information to guide representation learning for a proprioceptive policy could be applied to other complex robotic systems (e.g., legged robots, manipulators) or even other RL domains where a rich "teacher" signal is available during training but not deployment. The modular nature of SRL4Humanoid also makes it a valuable resource for comparing different SRL approaches in various RL contexts, beyond just humanoids.