PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
TL;DR Summary
The paper introduces the PvP framework, addressing sample inefficiency in humanoid robot control by leveraging the complementarity of proprioceptive and privileged states. It improves learning efficiency without manual data augmentation, significantly enhancing performance in velocity tracking and motion imitation tasks on the LimX Oli humanoid robot.
Abstract
Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
1.2. Authors
Mingqi Yuan, Tao Yu, Haolin Song, Bo Li, Xin Jin, Hua Chen, Wenjun Zeng
Affiliations: HK PolyU (Hong Kong Polytechnic University); LimX Dynamics; EIT, Ningbo (Ningbo Institute of Industrial Technology, Chinese Academy of Sciences); USTC (University of Science and Technology of China); ZJU-UIUC (Zhejiang University-University of Illinois at Urbana-Champaign Institute); SUSTech (Southern University of Science and Technology)
The authors' backgrounds appear to span robotics, artificial intelligence, and computer vision, given their affiliations with universities and a robotics company (LimX Dynamics).
1.3. Journal/Conference
This paper is published as a preprint on arXiv (arxiv.org/abs/2512.13093). As an arXiv preprint, it signifies that the research has been made publicly available but has not yet undergone formal peer review by a journal or conference. arXiv is a widely recognized platform for sharing cutting-edge research in fields like AI and Robotics, allowing rapid dissemination of new findings.
1.4. Publication Year
2025 (specifically, December 15, 2025)
1.5. Abstract
The paper addresses the challenge of sample inefficiency in reinforcement learning (RL) for whole-body control (WBC) of humanoid robots, a crucial aspect for complex tasks in dynamic environments. This inefficiency stems from the intricate dynamics and partial observability inherent to humanoids. To tackle this, the authors propose PvP, a Proprioceptive-Privileged contrastive learning framework. PvP leverages the inherent complementary nature between proprioceptive states (sensor readings directly from the robot's body, like joint positions) and privileged states (full simulator information, often unavailable in the real world but useful during training). This framework learns compact, task-relevant latent representations without needing manually designed data augmentations, leading to faster and more stable policy learning. To facilitate systematic evaluation, the authors also developed SRL4Humanoid, an open-source, modular framework that provides high-quality implementations of representative State Representation Learning (SRL) methods for humanoid robot learning. Extensive experiments conducted on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to other SRL baselines. The study also offers practical insights into integrating SRL with RL for humanoid WBC, providing valuable guidance for data-efficient humanoid robot learning.
1.6. Original Source Link
https://arxiv.org/abs/2512.13093 PDF Link: https://arxiv.org/pdf/2512.13093v1.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the sample inefficiency of Reinforcement Learning (RL) when applied to Whole-Body Control (WBC) of humanoid robots. WBC is essential for humanoids to perform complex tasks, coordinate numerous joints, and achieve balanced, agile, and safe behaviors in dynamic, real-world environments.
This problem is particularly important due to several challenges:
- Complex Dynamics: Humanoid robots possess highly intricate dynamics, underactuation (fewer actuators than degrees of freedom), and strong coupling between different actions (like locomotion, manipulation, and balance).
- Partial Observability: Real-world robots often operate with incomplete sensory information (e.g., only internal sensor readings, not a full understanding of the environment's state).
- Composite Reward Structures: To ensure both task performance (e.g., tracking accuracy) and reliability (e.g., energy efficiency) in real-world deployment, RL policies often need to optimize complex reward functions, further increasing the difficulty and sample complexity.
- Limitations of Traditional Methods: Traditional model-based control methods struggle with flexible real-time control and robust performance under non-stationary conditions, leading to a shift towards data-driven RL approaches.

While RL has shown promising results in humanoid WBC (e.g., for motion tracking and multi-gait locomotion), its high sample requirements remain a significant barrier to widespread application. Existing solutions like simulation acceleration and data augmentation help but don't fully address the underlying issue of processing high-dimensional, noisy, and redundant sensory inputs.
The paper's entry point is State Representation Learning (SRL). SRL offers a promising solution by transforming raw, high-dimensional sensory inputs into compact, informative latent representations. These representations ideally preserve task-relevant dynamics while filtering out noise and redundancy. Prior SRL work in robotics has explored reconstruction-based methods and contrastive learning, but their integration into humanoid WBC in a unified, end-to-end framework that enhances both learning efficiency and real-world deployment reliability remains underexplored. The innovative idea is to leverage the intrinsic complementarity between proprioceptive states (what the robot directly senses about itself) and privileged states (full environmental and robot state information available in simulation) using contrastive learning.
2.2. Main Contributions / Findings
The paper makes three primary contributions:
- Proposed PvP Framework: They introduce PvP (Proprioceptive-Privileged contrastive learning), a novel framework that enhances proprioceptive representations for policy learning. PvP achieves this by performing contrastive learning between the robot's proprioceptive states and privileged states. This approach effectively combines complementary information from different sensory modalities without requiring hand-crafted data augmentations, leading to more stable and incrementally improved learning.
- Developed SRL4Humanoid Framework: The authors developed SRL4Humanoid, which they claim is the first unified, modular, and open-source framework. This toolkit provides high-quality implementations of representative SRL methods specifically tailored for humanoid robot learning. SRL4Humanoid aims to enable reproducible research and accelerate future progress in the community by offering a systematic platform for evaluating SRL techniques in humanoid WBC.
- Extensive Experimental Validation and Practical Insights: Through extensive experiments on the LimX Oli humanoid robot across two challenging tasks (velocity tracking and motion imitation), PvP is demonstrated to significantly outperform existing SRL baselines in terms of both sample efficiency and final policy performance. The study also provides valuable practical insights into how different SRL methods, training configurations (e.g., update intervals, data proportions), and application targets (policy vs. value encoder) affect the efficiency and performance of humanoid WBC learning. Key findings include that PvP not only accelerates learning but also ensures greater action smoothness, crucial for real-world reliability, and that applying SRL to the policy encoder is generally more effective than to the value encoder.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Humanoid Robots: Robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Their human-like morphology offers advantages like versatility, adaptability to human-centered environments, and intuitive interaction. However, their complex structure (many Degrees of Freedom, or DoF) makes control challenging.
- Whole-Body Control (WBC): A control strategy for robots with many DoF that aims to coordinate all joints and actuators simultaneously to achieve desired tasks while maintaining balance, respecting physical limits (e.g., joint limits, torque limits), and optimizing for criteria like energy efficiency or smoothness. It's crucial for complex behaviors like walking, running, or interacting with objects.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make optimal decisions by interacting with an environment. The agent performs actions in the environment, receives observations (information about the state), and gets rewards or penalties. The goal is to learn a policy – a mapping from observations to actions – that maximizes the cumulative reward over time.
  - Partially Observable Markov Decision Process (POMDP): A mathematical framework for modeling sequential decision-making in situations where the agent does not have direct access to the environment's true state but instead receives observations that are probabilistically related to the state. This is highly relevant for robots, which often rely on noisy or incomplete sensor readings. A POMDP is defined by the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, \Omega, R, \gamma)$:
    - $\mathcal{S}$: The full state space of the environment and agent.
    - $\mathcal{O}$: The observation space, which is the information the agent perceives.
    - $\mathcal{A}$: The action space, the set of possible actions the agent can take.
    - $P(s' \mid s, a)$: The transition probability function, describing the probability of moving to state $s'$ after taking action $a$ in state $s$.
    - $\Omega(o \mid s', a)$: The observation function, describing the probability of observing $o$ when the environment is in state $s'$ and action $a$ was taken.
    - R(s, a, s'): The reward function, which assigns a numerical value to state-action-state' transitions, indicating how desirable they are.
    - $\gamma$: The discount factor, which determines the present value of future rewards. A higher $\gamma$ makes the agent consider future rewards more heavily.
  - Policy ($\pi$): A function that maps observations (and potentially commands) to actions. In RL, this policy is often parameterized by a set of weights (e.g., a neural network) which the RL algorithm aims to optimize.
  - Sample Efficiency: In RL, sample efficiency refers to how much experience (i.e., how many interactions with the environment, or "samples") an agent needs to learn a good policy. Low sample efficiency means requiring a huge amount of data, which is costly and time-consuming, especially for real robots.
- State Representation Learning (SRL): A set of techniques aimed at learning low-dimensional, compact, and informative representations of raw, high-dimensional sensory inputs (like images, sensor readings, joint positions, etc.). The goal is to extract task-relevant features while discarding noise and redundant information. This can significantly improve RL by simplifying the observation space, making the learning process faster, more stable, and more generalizable.
- Proprioceptive State: The internal sensory information a robot receives about its own body's position, orientation, and movement; these are signals directly measurable on hardware. For humanoid robots, this typically comprises joint positions (the angles of its joints), joint velocities (the rates of change of joint angles), base angular velocity (the rotational speed of the robot's main body, or base), and a gravity (orientation) estimate of the robot's orientation relative to gravity, often represented in the base frame.
- Privileged State: The complete and true state of the robot and its environment, typically available only in a simulator during training. It contains information that is either difficult, unreliable, or impossible to measure directly on a real robot. This can include root pose and velocity (the exact position, orientation, linear, and angular velocity of the robot's base in the world), per-link poses and velocities, contact indicators (precise information about which parts of the robot are in contact with the environment), and environment/terrain features. Crucially, the proprioceptive state is a subset of the privileged state.
- Contrastive Learning: A self-supervised learning paradigm where representations are learned by bringing "similar" (positive) pairs of data points closer together in a latent space while pushing "dissimilar" (negative) pairs apart. This helps the model learn what makes different data points similar or different. In SRL, positive pairs might be different augmented views of the same observation, or temporally adjacent observations.
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm. It's an actor-critic method (meaning it uses separate neural networks for the policy and value function). PPO improves stability and sample efficiency compared to earlier policy gradient methods by using a clipped surrogate objective function. This objective limits the size of policy updates at each step, preventing large, destructive changes and ensuring that the new policy doesn't stray too far from the old one.
- SimSiam: Stands for "Simple Siamese" representation learning. It's a self-supervised learning method that learns meaningful representations without needing negative sample pairs, large batches, or momentum encoders (common components in other contrastive learning methods). It uses two identical encoder networks (a Siamese network architecture) that process two different augmented views of the same input. A key innovation is the stop-gradient operation on one branch, which prevents the network from collapsing (i.e., learning trivial, constant representations for all inputs). The objective is to maximize the cosine similarity between the outputs of the two branches.
- Variational Autoencoders (VAE): A type of generative model that learns a compressed, latent representation of input data. It consists of an encoder that maps input data to a probabilistic distribution over a latent space, and a decoder that reconstructs the original data from samples drawn from this latent space. VAEs enforce a prior distribution (e.g., Gaussian) on the latent space through a Kullback-Leibler (KL) divergence term in their loss function, balancing reconstruction accuracy with regularization to ensure the latent space is well-structured and facilitates generation.
- Self-Predictive Representations (SPR): An SRL method that learns latent representations by enforcing multi-step consistency between predicted and encoded future states. It encourages representations that capture environment dynamics. The agent learns to predict future latent states from current latent states and actions, thereby creating representations that are useful for predicting future outcomes and understanding dynamics.
SRLmethod that learnslatent representationsby enforcing multi-step consistency between predicted and encoded future states. It encourages representations that capture environment dynamics. Theagentlearns to predict futurelatent statesfrom currentlatent statesandactions, thereby creatingrepresentationsthat are useful for predicting future outcomes and understandingdynamics.
3.2. Previous Works
The paper frames SRL for RL into three predominant categories:
-
Reconstruction-based Methods: These methods learn representations by encoding high-dimensional observations into a lower-dimensional latent space and then attempting to reconstruct the original observations (or parts of them) from this latent representation.
- Examples cited: [4, 5, 7, 10, 46].
- A pioneering example is
VAE[10], which balances reconstruction fidelity with regularization by enforcing a prior distribution on the latent space.Beta-VAE[7] extends this by adding a hyperparameter to control the disentanglement of latent factors. - Relevance to paper: The paper acknowledges that reconstruction-based methods can struggle with suboptimal representation quality and poor generalization because they often focus on preserving complete state information, including irrelevant details, rather than solely task-relevant features.
-
Dynamics Modeling Methods: These
SRLapproaches encouragerepresentationsthat explicitly capture theenvironment's dynamicsthrough predictive modeling.- Examples cited: [9, 12, 20, 25, 31].
Forward modelspredict future states from current state-action pairs, whileinverse modelsinfer actions from transitions [9, 12, 25]. These models aim to provide richcontrollable features.SPR[31] is a specific dynamics modeling method mentioned, which learnspredictive latent representationsby enforcing multi-step consistency between predicted and encoded futurelatent states.- Relevance to paper:
SPRis used as a baselineSRLmethod in the paper's experiments.
-
Contrastive Learning Techniques: These methods structure the
latent spaceby enforcing similarity between "positive pairs" (e.g., different augmented views of the same observation, or temporally adjacent states) while pushing "negative pairs" apart. This yieldsinvariantandtemporally smooth representationsthat can improveRL efficiencyand generalization.- Examples cited: [14, 15, 36, 45, 51].
CURL[15] leverages augmented image pairs to enforce invariances.ATC[36] aligns embeddings of temporally close observations.CDPC[51] refinescontrastive predictive codingwith temporal-difference objectives.SimSiam[1] is specifically mentioned as a contrastive learning baseline that the paper'sPvPframework builds upon and adapts.- Relevance to paper:
SimSiamis a direct inspiration and a baseline forPvP.
SRL for Humanoid Robot Learning (Specific Prior Work):
The paper highlights pioneering work applying SRL to humanoid robots:
- [50] proposes
Any2Track, which usesSRLto enhance motion tracking by integrating ahistory-informed adaptation module. This module extractsdynamics-aware world model predictionsto adapt to disturbances like terrain or external forces. - [38] presents a
world model reconstruction frameworkthat usessensor denoisingandworld state estimationto improve locomotion in unpredictable environments. - [37] explores reconstruction-based
SRL(e.g., elevation-based internal maps) forhumanoid locomotion over challenging terrain. - [18] uses
contrastive learning-based abstractions (perceptive internal models from height maps) to refinestate embeddingsand improveRLperformance. - Limitations of prior SRL in Humanoids: The paper notes that while these approaches show potential, they are often either reconstruction-based (which might preserve irrelevant details) or rely on a single state modality (like only perceptive or proprioceptive information). This limits their ability to capture the full spectrum of task-relevant dynamics or be seamlessly integrated into an end-to-end
RLframework for real-world reliability.
3.3. Technological Evolution
The evolution of humanoid WBC has generally progressed as follows:
- Traditional Model-Based Control: Early approaches heavily relied on precise mathematical models of the robot's dynamics and environment. Techniques like
inverse kinematics,optimization-based control[11], andModel Predictive Control (MPC)[26] were common. While providing theoretical guarantees, these methods often struggle with real-time flexibility, robustness to unmodeled disturbances, and performance in complex, dynamic, or non-stationary real-world conditions. - Data-Driven Reinforcement Learning (RL): To overcome the limitations of model-based methods,
RLemerged as a dominant paradigm.RLallows robots to learn complex behaviors directly from interaction, adapting to uncertainties and achieving behaviors difficult to hand-design [8, 16, 17, 44, 48, 49].RLpolicies can achieve impressive generalization (e.g., diverse motions [17] or multi-gait locomotion [44]). - Addressing RL Sample Inefficiency: Despite its successes,
RL's primary bottleneck issample inefficiency, especially for high-dimensional, complex systems like humanoids. This led to research into:-
Simulation Acceleration: Faster simulators (e.g.,
Isaac Gym[19],MuJoCo XLA[39]) to generate more data quickly. -
Data Augmentation: Techniques to artificially expand the training data diversity [22, 29, 30].
-
State Representation Learning (SRL): The focus of this paper, which aims to make
RLmore efficient by providing better inputs to theRL agent.This paper's work fits into the third stage, specifically focusing on advancing
SRLforhumanoid WBC. It aims to bridge the gap inSRLby proposing a method (PvP) that leverages bothproprioceptiveandprivileged statesto learn more effective representations, and by providing a standardized framework (SRL4Humanoid) for reproducible research in this area.
-
3.4. Differentiation Analysis
Compared to main methods in related work, PvP offers several core differences and innovations:
-
Leveraging
ProprioceptiveandPrivileged Statesfor Contrastive Learning:- Differentiation: Many prior
SRLmethods, especially those forRL, rely on a single input modality (e.g., only images forCURL[15], or only proprioceptive states for methods likePIM[18] (which uses height maps as "perceptive internal models")). Reconstruction-based methods [38, 43] often try to predictprivileged informationfromproprioceptive states, but this can lead to suboptimal representations by forcing theencoderto reconstruct all details, not just task-relevant ones. - Innovation:
PvPdirectly uses bothproprioceptiveandprivileged statesas input to acontrastive learningframework. It treats theprivileged state(which inherently contains more comprehensive, ground-truth information) as a "pseudo augmentation" of theproprioceptive state. This allows thepolicy encoderto learn representations that implicitly incorporate richer,privileged informationduring training, without needing that information during real-world deployment (where onlyproprioceptive statesare available to thepolicy).
- Differentiation: Many prior
-
Absence of Hand-crafted Data Augmentations:
- Differentiation: Many
contrastive learningmethods (e.g.,CURL[15]) heavily rely onhand-crafted data augmentations(like random crops, color jittering, Gaussian noise) to generate positive pairs and learn invariances. Designing effective augmentations for diverse robotic sensor inputs can be challenging and task-specific. - Innovation:
PvPleverages the "intrinsic complementarity" between theproprioceptiveandprivileged statesthemselves. By applying aZeroMaskingoperation to theprivileged stateto create aproprioceptive-likeview (), it generates two related views—the fullprivileged state() and the maskedprivileged state()—that serve as natural positive pairs forcontrastive learning. This removes the need for designing complex, domain-specific augmentations.
- Differentiation: Many
-
Enhanced Task-Relevant Feature Learning:
- Differentiation: Reconstruction-based methods often prioritize fidelity to the input, potentially retaining irrelevant details. Single-modality contrastive methods might be limited by the information contained within that single modality.
- Innovation: By contrasting a rich
privileged statewith a constrainedproprioceptive-likeview,PvPis encouraged to learn features from theproprioceptive statethat are robust and predictive of aspects present in theprivileged state(likeroot linear velocity), which are often highlytask-relevant. This leads to more compact and informativelatent representationsthat filter out noise and redundancy, acceleratingRLand improving final performance.
-
SRL4Humanoid Framework:
-
Innovation: The development of
SRL4Humanoidas a unified, modular, and open-source framework is itself a significant contribution. It addresses the common challenge of reproducibility and systematic evaluation inRLresearch by providing high-quality implementations of diverseSRLmethods specifically for humanoid robots. This facilitates comparative studies and future research in the community.In essence,
PvPuniquely capitalizes on the rich information available in simulation (privileged state) to improve the representation learning for real-world deployableproprioceptive policies, doing so in a self-supervised, augmentation-free manner within acontrastive learningframework.
-
4. Methodology
4.1. Principles
The core idea behind PvP is to address the sample inefficiency of Reinforcement Learning (RL) for humanoid Whole-Body Control (WBC) by learning superior state representations. It achieves this through contrastive learning that leverages the inherent complementary relationship between two distinct sources of state information: the proprioceptive state and the privileged state.
The intuition is that while the proprioceptive state (e.g., joint angles, velocities, base angular velocity) is what the robot directly perceives and uses for control in the real world, the privileged state (e.g., root linear velocity, full environment state) provides a more complete, ground-truth understanding of the robot's true state and its interaction with the environment, especially during simulation training. The privileged state contains all the information from the proprioceptive state plus additional, typically unobservable, rich context. By setting up a contrastive learning task where the proprioceptive view is contrasted with a privileged view derived from the same underlying state, the policy encoder learns to extract robust, task-relevant features from the proprioceptive state that are predictive of the privileged information. This process implicitly guides the encoder to focus on meaningful aspects of the proprioceptive input, leading to compact, informative latent representations that are less susceptible to noise and redundancy, without requiring manually designed data augmentations. This, in turn, facilitates faster and more stable policy learning in RL.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology section first outlines the general humanoid Whole-Body Control (WBC) problem and the Reinforcement Learning (RL) framework, then details the proposed PvP algorithm, and finally introduces the SRL4Humanoid framework that supports the experiments.
4.2.1. Humanoid Whole-Body Control (WBC)
WBC is fundamental for humanoid robots to perform diverse and complex tasks. The objective is to design a control function that maps continuous commands and observations to appropriate control signals. The paper employs learning-based methods to directly learn a parameterized policy that outputs joint actions.
In practice, actions are often defined as offsets to nominal joint positions for various body parts. These offsets are then added to pre-defined nominal targets to obtain the final target joint positions, which are subsequently tracked by a proportional-derivative (PD) controller with fixed gains.
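To make this action-to-torque pipeline concrete, here is a minimal sketch of how an offset-style action is typically converted into joint torques by a fixed-gain PD controller. The gains, nominal pose, and dimensions are illustrative placeholders, not values from the paper.

```python
import numpy as np

NUM_JOINTS = 31                      # LimX Oli has 31 actuated joints
KP, KD = 60.0, 1.5                   # illustrative fixed PD gains (not from the paper)
q_nominal = np.zeros(NUM_JOINTS)     # nominal joint positions (placeholder)

def pd_torques(action, q, dq):
    """Map a policy action (offsets to nominal joint positions) to joint torques."""
    q_target = q_nominal + action            # offsets added to the nominal targets
    return KP * (q_target - q) - KD * dq     # PD tracking law with fixed gains

# usage: tau = pd_torques(policy_output, joint_pos, joint_vel)
```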
4.2.2. Reinforcement Learning (RL) Background
The paper casts humanoid WBC as an infinite-horizon partially observable Markov decision process (POMDP) [35].
A POMDP is formally defined by the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, \Omega, R, \gamma)$:
-
: The full
statespace of the humanoid robot and its environment. -
: The
observationspace, which is the information theagentreceives. -
: The
actionspace, the set of possible actions theagentcan take. -
: The transition probability function, describing the probability of moving to
stateafter takingactioninstate. -
: The
observation function, which describes the probability ofobservationoccurring when theenvironmentis instateandactionwas taken. -
R(s, a, s'): Thereward function, which provides a scalar feedback signal for transitions. -
: The
discount factor, which weights the importance of future rewards.The primary objective of
RL is to learn an optimal policy $\pi_{\pmb\theta}$ that maximizes the expected discounted return: $ J_{\pi}(\pmb\theta) = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^t R(\pmb s_t, \pmb a_t, \pmb s_{t+1}) \Big] . $ Here, $\mathbb{E}_{\pi}$ denotes the expectation over trajectories (sequences of states, actions, and observations) generated by policy $\pi_{\pmb\theta}$.
The paper further defines specific state and action spaces for the humanoid robot:
- Proprioceptive State Space: This is the observation space for the policy. At time step $t$, the proprioceptive state consists of signals directly measurable on hardware. Typically, these include joint positions, joint velocities, base angular velocity, and a gravity (orientation) estimate in the base frame.
- Privileged State Space: This represents the full simulator state (from the state space $\mathcal{S}$). At time step $t$, the privileged state $\pmb s_t$ is used only during training (e.g., by the critic/teacher network) and is unavailable or unreliable on the real robot. Typical components include root pose and velocity, per-link poses and velocities, contact indicators, and environment/terrain features. Importantly, the proprioceptive state is a subset of the privileged state.
- Action Space: The action specifies angular deviations for actuated joints relative to their nominal positions. These deviations are added to nominal configurations to get final target joint positions, which a PD controller then tracks.
4.2.3. PvP Implementation
The PvP framework aims to enhance policy learning by leveraging both proprioceptive and privileged states through contrastive learning. This approach is motivated by the limitations of previous methods: reconstruction-based SRL often learns suboptimal and poorly generalizable representations by trying to preserve all state details, while existing contrastive learning methods for RL typically rely on a single state modality, missing the rich privileged information.
The overview of the PvP approach is illustrated in Figure 2 (from the original paper):
Figure 2. An overview of the PvP approach. (a) The components of privileged state (e.g., root linear velocity) and proprioceptive state (e.g., joint positions). (b) PvP conducts contrastive learning based on the intrinsic complementarity between the two state modalities.
As shown in Figure 2(a), the privileged state contains comprehensive information, including the proprioceptive observations (e.g., joint positions, angular velocities) and additional privileged information (e.g., root linear velocity). This comprehensive privileged state can be considered a rich, implicit "pseudo augmentation" of the proprioceptive state.
To create a pair for contrastive learning that reflects the intrinsic complementarity, PvP generates a second view from the privileged state. This is done by applying a ZeroMasking operation to the privileged-only components of $\pmb s_t$, while keeping the proprioceptive observations intact:
$
\tilde{\pmb s}_t = \mathrm{ZeroMasking}(\pmb s_t) .
$
Here, ZeroMasking specifically targets and zeroes out components of the privileged state that are not part of the proprioceptive state (e.g., root linear velocity, contact indicators). This operation effectively creates a proprioceptive-only view $\tilde{\pmb s}_t$ that is structurally similar to the proprioceptive state but derived from the same comprehensive privileged state $\pmb s_t$.
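The masking step can be sketched as follows, assuming (purely for illustration) that the privileged state is a flat vector whose first `proprio_dim` entries are the proprioceptive components and whose remaining entries are privileged-only terms; the exact layout in the paper's implementation may differ.

```python
import torch

def zero_masking(priv_state: torch.Tensor, proprio_dim: int) -> torch.Tensor:
    """Zero out the privileged-only components, keeping the proprioceptive part.

    priv_state: (batch, priv_dim) privileged states from the simulator.
    proprio_dim: number of leading dimensions assumed to be proprioceptive.
    """
    masked = priv_state.clone()
    masked[:, proprio_dim:] = 0.0   # e.g., root linear velocity, contact indicators
    return masked

# s_tilde = zero_masking(s_priv, proprio_dim=107)  # dimension is illustrative
```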
The derived data pair is then used to train the policy encoder, following the architecture and principles of the SimSiam algorithm [1].
Formally, let $f_{\pmb\theta}$ be the policy encoder (a neural network parameterized by $\pmb\theta$) and $h_{\psi}$ be the predictor (another neural network parameterized by $\psi$). The process involves encoding both views and then predicting from one to the other:
$
\begin{array} { r l } { z = f_{\pmb\theta}(\pmb s) , } & { \tilde{z} = f_{\pmb\theta}(\tilde{\pmb s}) , } \\ { \pmb p = h_{\psi}\left( z \right) , } & { \tilde{\pmb p} = h_{\psi}\left( \tilde{z} \right) , } \end{array}
$
Here, $z$ and $\tilde{z}$ are the latent representations obtained by feeding the privileged state $\pmb s$ and the zero-masked privileged state $\tilde{\pmb s}$ (the proprioceptive-like view) through the policy encoder $f_{\pmb\theta}$, respectively. $\pmb p$ and $\tilde{\pmb p}$ are the outputs of the predictor $h_{\psi}$ applied to these latent representations.
Finally, the PvP loss () is defined using negative cosine similarity with a stop-gradient operation, as in SimSiam:
$
L_{\mathrm{PvP}} = D_{\mathrm{ncs}}\left( \pmb p, \mathbf{sg}(\tilde{\pmb z}) \right) + D_{\mathrm{ncs}}\left( \tilde{\pmb p}, \mathbf{sg}(\pmb z) \right) ,
$
where $D_{\mathrm{ncs}}(\cdot, \cdot)$ is the negative cosine similarity loss between the normalized vectors. The stop-gradient operation, denoted by $\mathbf{sg}(\cdot)$, is crucial. It means that gradients are computed for the branch containing the predictor output ($\pmb p$ or $\tilde{\pmb p}$) and propagated back through the encoder and predictor, but not through the target representation ($\mathbf{sg}(\tilde{\pmb z})$ or $\mathbf{sg}(\pmb z)$). This prevents the network from learning a trivial solution where both outputs collapse to a constant, which would happen if gradients were allowed to flow through both branches simultaneously.
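A minimal PyTorch sketch of this symmetric loss is shown below. The layer sizes are placeholders and the names (`encoder`, `predictor`, `pvp_loss`) are hypothetical rather than the paper's code; only the structure (shared encoder, predictor head, negative cosine similarity with stop-gradient) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(256, 128), nn.ELU(), nn.Linear(128, 128))   # f_theta (sizes illustrative)
predictor = nn.Sequential(nn.Linear(128, 64), nn.ELU(), nn.Linear(64, 128))   # h_psi

def neg_cos_sim(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # negative cosine similarity; detaching z implements the stop-gradient
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def pvp_loss(s_priv: torch.Tensor, s_masked: torch.Tensor) -> torch.Tensor:
    """Symmetric SimSiam-style loss between the privileged and masked views."""
    z, z_tilde = encoder(s_priv), encoder(s_masked)
    p, p_tilde = predictor(z), predictor(z_tilde)
    return neg_cos_sim(p, z_tilde) + neg_cos_sim(p_tilde, z)
```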
Advantages of PvP:
- Richer Information: By leveraging both proprioceptive and privileged states, PvP combines complementary information, reducing SRL complexity while enhancing the learned representations. This provides an alternative way for the policy to access implicit privileged information.
- No Hand-crafted Augmentations: PvP exploits the intrinsic complementarity between the two state modalities, avoiding the need for complex, task-specific data augmentations.
- Versatility: The framework is simple and general, making it applicable to a wide range of tasks.
4.2.4. The SRL4Humanoid Framework
To support systematic experimentation and reproducible research, the authors introduce SRL4Humanoid, a unified, modular, and open-source framework.
The architecture of SRL4Humanoid is depicted in Figure 3 (from the original paper):
The figure is an architecture diagram of the SRL4Humanoid framework, showing how inputs from the proprioceptive state and the privileged state are processed by the policy encoder and the value encoder, passed to the corresponding policy head and value head, and finally updated through the SRL loss and the PPO loss.
Figure 3. The architecture of the SRL4Humanoid framework, in which the SRL and RL processes are fully decoupled.
As shown in Figure 3, the SRL4Humanoid framework designs the SRL and RL processes to be fully decoupled.
- It uses Proximal Policy Optimization (PPO) [28] as the backbone RL algorithm.
- The policy network (which decides actions) accepts the proprioceptive state of the robot as input.
- The value network (which estimates the expected future reward) accepts the privileged state of the environment to perform value estimation. This is a common practice in RL for robustness and faster learning, as the critic can use more information than the actor.
- The SRL objective can be applied to either the policy encoder (to improve proprioceptive representations) or the value encoder (to improve privileged state representations). SRL4Humanoid implements three representative SRL algorithms for comparative study: SimSiam [1], SPR [31], and VAE [10], each representing a different methodological paradigm (contrastive, dynamics modeling, and reconstruction-based, respectively).
The joint optimization objective of RL and SRL is defined as:
$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{RL}} + \lambda \cdot \mathcal{L}_{\mathrm{SRL}} ,
$
where $\mathcal{L}_{\mathrm{RL}}$ is the RL loss (e.g., the PPO loss), $\mathcal{L}_{\mathrm{SRL}}$ is the SRL loss (e.g., the PvP, SimSiam, SPR, or VAE loss), and $\lambda$ is a weighting coefficient that balances the contribution of the SRL objective.
By default, the updates for $\mathcal{L}_{\mathrm{RL}}$ and $\mathcal{L}_{\mathrm{SRL}}$ are synchronized, meaning they share data batches and follow RL's update frequency. However, the authors found that continuous SRL updates can sometimes degrade learning efficiency, especially in early stages when RL generates large amounts of repetitive, low-quality data. This can cause the SRL module to prematurely converge to local optima.
To mitigate this, an interval update mechanism is employed:
$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{RL}} + \mathbb{1}(T) \cdot \lambda \cdot \mathcal{L}_{\mathrm{SRL}} ,
$
where $\mathbb{1}(T)$ is an indicator function that equals 1 every $T$ time steps (i.e., the SRL loss is applied only at intervals of $T$ steps) and 0 otherwise. This allows the SRL module to be trained less frequently during the initial noisy phases, preserving its ability to continuously influence policy learning more effectively.
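A sketch of how this interval-gated objective can be wired into a training step is given below; the interval, coefficient, and function name are illustrative placeholders, not values or APIs from the paper.

```python
def total_loss(step: int, rl_loss, srl_loss, lam: float = 0.5, interval: int = 20):
    """Combine the RL loss with the SRL loss only every `interval` update steps."""
    gate = 1.0 if step % interval == 0 else 0.0   # plays the role of the indicator 1(T)
    return rl_loss + gate * lam * srl_loss
```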
The workflow of SRL4Humanoid is summarized in Algorithm 1:
| Step | Operation |
|---|---|
| 1 | Initialize the policy network πθ and the value network Vφ |
| 2 | Initialize the SRL module Sψ |
| 3 | Set all the hyperparameters, such as the maximum number of episodes E and the number of update epochs K |
| 4 | for episode = 1 to E do |
| 5 | Sample rollouts using the policy network πθ |
| 6 | Perform generalized advantage estimation (GAE) to get the estimated task returns |
| 7 | for epoch = 1 to K do |
| 8 | Sample a mini-batch B from the rollout data |
| 9 | Use B to compute the policy and value loss |
| 10 | Use B to compute the SRL loss |
| 11 | Compute the total loss following Eq. (6) |
| 12 | Update the policy network, value network, and the SRL module |
| 13 | end |
| 14 | end |
| 15 | Output the optimized policy πθ |
Algorithm 1: SRL4Humanoid Training Workflow
- Initialization: Initialize the policy network $\pi_{\pmb\theta}$, value network $V_{\phi}$, and the SRL module $S_{\psi}$. Set hyperparameters like the maximum number of episodes $E$ and update epochs $K$.
- Episode Loop: For each episode from 1 to $E$:
  a. Rollout Collection: Sample rollouts (sequences of states, actions, rewards) using the current policy network.
  b. Return Estimation: Compute estimated task returns using Generalized Advantage Estimation (GAE) [27]. GAE helps in estimating how much better an action is compared to the average action from a given state, crucial for policy gradient methods (see the sketch after this list).
  c. Epoch Loop: For each epoch from 1 to $K$:
    i. Mini-batch Sampling: Sample a mini-batch $B$ from the collected rollout data.
    ii. Loss Computation: Compute the policy loss and value loss using mini-batch $B$, and compute the SRL loss using the same mini-batch.
    iii. Total Loss Calculation: Compute the total loss using the combined objective (Equation 6).
    iv. Parameter Update: Update the parameters of the policy network, value network, and SRL module by performing gradient descent on the total loss.
- Output: After all episodes, output the optimized policy network $\pi_{\pmb\theta}$.
SRL Algorithmic Baselines Implemented in SRL4Humanoid (from Appendix A):
The framework integrates different SRL methods, each with its own loss function:
-
PPO [28]: The backbone RL algorithm.
- Policy Loss:
$
L_{\pi}(\pmb\theta) = - \mathbb{E}_{\tau \sim \pi} \Big[ \min \big( \rho_t(\pmb\theta) A_t , \ \mathrm{clip}\big( \rho_t(\pmb\theta), 1 - \epsilon, 1 + \epsilon \big) A_t \big) \Big] ,
$
where $\rho_t(\pmb\theta)$ is the probability ratio between the new policy and the old policy, and $A_t$ is the advantage estimate (from GAE). $\epsilon$ is a clipping range coefficient that limits how much the probability ratio can deviate from 1. This clipping helps to prevent large, destructive policy updates (see the code sketch after this list).
- Value Loss: The value network is trained to minimize the mean squared error between its predicted value and the target discounted returns computed with GAE: $ L_V(\phi) = \mathbb{E}_{\tau \sim \pi} \big[ ( V_{\phi}(s) - V_t^{\mathrm{target}} )^2 \big] . $
-
VAE (Variational Autoencoders) [10]: A reconstruction-based SRL method.
- Loss Function:
$
\mathcal{L}_{\mathrm{VAE}} = - \mathbb{E}_{q_{\phi}(z|o)} \big[ \log p_{\theta}(o|z) \big] + D_{\mathrm{KL}}\big( q_{\phi}(z|o) \,\|\, p_{\theta}(z) \big) ,
$
where $q_{\phi}(z|o)$ is the encoder (mapping observation $o$ to latent variable $z$), $p_{\theta}(o|z)$ is the decoder (reconstructing the observation from the latent variable), and $D_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence. The first term is the reconstruction loss (expected negative log-likelihood of the observation given the latent variable), encouraging fidelity. The second term is the KL divergence between the encoder's distribution and a prior distribution (typically a standard Gaussian), acting as a regularizer to ensure the latent space is well-behaved.
-
SPR (Self-Predictive Representations) [31]: A dynamics modeling SRL method.
- Loss Function:
$
L_{\mathrm{SPR}} = \sum_{k=1}^{K} \big\lVert f_{\pmb\theta}^{(k)}(z_t, \pmb a_{t:t+k-1}) - \mathbf{sg}\big(g_{\phi}(z_{t+k})\big) \big\rVert_2^2 ,
$
This loss encourages the online dynamics model to predict future latent states accurately. Here, $f_{\pmb\theta}^{(k)}(z_t, \pmb a_{t:t+k-1})$ represents the $k$-step prediction of the future latent state starting from $z_t$ and taking actions $\pmb a_{t:t+k-1}$, and $\mathbf{sg}(\cdot)$ is the stop-gradient operation. $g_{\phi}$ is the target dynamics model whose parameters are an exponential moving average (EMA) of the online model's parameters $\pmb\theta$. The L2 norm measures the squared difference between the predicted and target future latent states.
-
SimSiam (Simple Siamese) [1]: A self-supervised contrastive learning method.
- Loss Function:
$
L_{\mathrm{SimSiam}} = \frac{1}{2} \left[ - \frac{ f_{\theta}(x_1) \cdot f_{\theta}(x_2) }{ \lVert f_{\theta}(x_1) \rVert_2 \, \lVert f_{\theta}(x_2) \rVert_2 } \right] ,
$
This is the negative cosine similarity between the outputs of the encoder network for two augmented views $x_1$ and $x_2$ of the same input. The SimSiam architecture applies this loss in a symmetrical way, similar to how PvP uses it, typically involving a predictor head and a stop-gradient on one of the encoder outputs to prevent collapse. The term presented here is the core cosine similarity part; in the context of PvP and SimSiam, the full loss involves two symmetrical terms, each with a predictor and a stop-gradient on the target latent representation.
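As referenced in the PPO entry above, here is a minimal PyTorch sketch of the clipped surrogate policy loss and the value loss used by the backbone algorithm; the clip coefficient and tensor names are illustrative, not taken from the paper's code.

```python
import torch

def ppo_losses(new_log_prob, old_log_prob, advantages, values, value_targets, clip_eps=0.2):
    """Clipped PPO policy loss and mean-squared value loss for one mini-batch."""
    ratio = torch.exp(new_log_prob - old_log_prob)          # probability ratio rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()     # clipped surrogate objective
    value_loss = (values - value_targets).pow(2).mean()     # regression to GAE return targets
    return policy_loss, value_loss
```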
4.3. Proprioceptive State vs. Privileged State for Tasks
The paper provides detailed breakdowns of the proprioceptive and privileged states used for each task in the supplementary material.
The following are the details of the proprioceptive state and privileged state of the LimX-Oli-31dof-Velocity task (Table 2 from the original paper):
| Proprioceptive State | Privileged State |
|---|---|
| base_ang_vel (3x5) | base_lin_vel (3) |
| projected_gravity (3x5) | base_ang_vel (3) |
| gait (5) | projected_gravity (3) |
| velocity_commands (3x5) | velocity_commands (3) |
| joint_pos (31x5) | joint_pos (31) |
| joint_vel (31x5) | joint_vel (31) |
| actions (31x5) | actions (31) |
| | gait (5) |
For the velocity tracking task, the policy encoder's input is a history of 5 consecutive proprioceptive states (e.g., base_ang_vel (3-dim vector) is stacked 5 times, making it 3x5=15 dimensions). This is done to enhance robustness. The privileged state for the critic provides single-time-step ground truth information.
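The 5-step history can be produced with a simple frame-stacking buffer like the one below; the class name and buffer layout are illustrative assumptions rather than the paper's implementation, and only the entries marked with "x5" in Table 2 come from such a history.

```python
import numpy as np
from collections import deque

class ProprioHistory:
    """Stack the last `history` proprioceptive observations into one policy input."""
    def __init__(self, obs_dim: int, history: int = 5):
        self.buffer = deque([np.zeros(obs_dim)] * history, maxlen=history)

    def push(self, obs: np.ndarray) -> np.ndarray:
        self.buffer.append(obs)                 # drop the oldest frame, keep the newest 5
        return np.concatenate(self.buffer)      # flattened history fed to the policy encoder
```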
The following are the details of the proprioceptive state and privileged state of the LimX-Oli-31dof-Mimic task (Table 4 from the original paper):
| Proprioceptive State | Privileged State |
|---|---|
| base_ang_vel (3) | base_lin_vel (3) |
| projected_gravity (3) | base_ang_vel (3) |
| joint_pos (31) | base_pos_z (1) |
| joint_vel (31) | body_mass (40) |
| actions (31) | base_quat (6) |
| mimic reference (69) | projected_gravity (3) |
| | velocity_commands (3) |
| | joint_pos (31) |
| | joint_vel (31) |
| | actions (31) |
| | previous actions (31) |
| | mimic reference (69) |
For the motion imitation task, the proprioceptive state also includes mimic reference (69 dimensions), which describes the target motion. The privileged state is significantly richer, including body_mass, base_quat (quaternion for orientation), and previous actions, which are typically not available or reliably observable by the robot's proprioceptors.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on the LimX Oli humanoid robot, which serves as the test platform.
- Robot Platform:
LimX Oli(Figure 4 from the original paper).-
A full-size humanoid robot with 31
degrees of freedom (DoF).The specifications of the LimX Oli humanoid robot used in the experiments, and the screenshots of the two designed tasks are presented in Figure 4 (from the original paper):
The figure shows the specifications of the LimX Oli humanoid robot, including its height, shoulder width, arm length, weight, and the detailed degrees of freedom of each joint group; the right side shows screenshots of the two tasks, velocity tracking and motion imitation.
-
Figure 4. The specifications of the LimX Oli humanoid robot used in the experiments, and the screenshots of the two designed tasks.
- Tasks: Two representative
Whole-Body Control (WBC)tasks are designed on this platform:LimX-Oli-31dof-Velocity(Velocity Tracking):- Description: The robot is required to track a velocity command on flat terrain.
- Commands: Velocity commands are resampled every 10 seconds.
- Linear velocity on the x-axis:
- Linear velocity on the y-axis:
- Angular velocity on the z-axis:
- Reward Terms: Key reward terms are detailed in Table 1 (from original paper).
LimX-Oli-31dof-Mimic(Motion Imitation):-
Description: The robot is required to imitate different pre-recorded human animations.
-
Data: A set of 20 pre-recorded human motions (examples shown in Figure 12 from original paper), each with a maximum length of 43 seconds and 4,300 frames.
-
Reward Terms: Key reward terms are detailed in Table 3 (from original paper).
Example screenshots of the motion capture data are presented in Figure 12 (from the original paper):
The figure shows several example screenshots of the motion capture data, containing sequences of multiple poses and movements.
-
Figure 12. Example screenshots of the motion capture data.
- Choice of Datasets/Tasks: These two tasks cover primary categories of evaluation benchmarks in current humanoid robot research [44], allowing for a comprehensive evaluation of the proposed methods' ability to handle both reactive control (velocity tracking) and complex, high-dimensional trajectory following (motion imitation). The
LimX Olirobot provides a realistic and challenging platform.
5.2. Evaluation Metrics
The evaluation metrics focus on four aspects: overall task performance, specific Key Performance Indicators (KPIs), training efficiency, and real-world deployment effectiveness.
-
Overall Task Performance:
- Conceptual Definition: This metric quantifies the overall success of the
RL agentin achieving its objective by summing up all individual reward components. It reflects theexpected discounted returnthat thepolicyaims to maximize. - Calculation: Computed as the weighted summation of all sub-reward functions defined for each task (e.g.,
linear velocity tracking,angular velocity tracking,base height,action smoothness, etc.). The specific weights are given in Tables 1 and 3.
- Conceptual Definition: This metric quantifies the overall success of the
-
Key Performance Indicators (KPIs):
-
Velocity Tracking Accuracy (for
LimX-Oli-31dof-Velocitytask):- Conceptual Definition: Measures how closely the robot's actual linear and angular velocities match the commanded velocities. Higher accuracy indicates better command following.
- Mathematical Formula (from Table 1):
- Linear velocity tracking:
- Angular velocity tracking:
- Symbol Explanation:
- : Robot's actual linear velocity in the xy-plane.
- : Commanded linear velocity.
- : Robot's actual angular velocity around the z-axis.
- : Commanded angular velocity.
- : A scaling factor (standard deviation-like term) to adjust sensitivity.
- : Squared Euclidean norm.
- : Exponential function, used to convert a negative error into a positive reward, where a smaller error yields a reward closer to 1.
- The angular velocity reward is presented as a negative squared error, implying it's a penalty term that is minimized, or transformed into a positive reward similar to linear velocity tracking.
-
Position Alignment Accuracy (for
LimX-Oli-31dof-Mimictask):- Conceptual Definition: Quantifies how well the robot's body parts (joints, feet, waist) match the reference poses from the human motion capture data. Higher accuracy means better imitation.
- Mathematical Formula (from Table 3):
- Position tracking:
- Feet distance tracking:
- Waist pitch orientation tracking:
\exp\left(-\frac{\sum_{i=1}^n |\delta_i - \delta_{\mathrm{ref}}|^2}{n^2}\right)
- Symbol Explanation:
- : Robot's current joint positions.
- : Reference joint positions from the mimic motion.
- : Robot's actual feet distance.
- : Reference feet distance from the mimic motion.
- : Pitch orientation of the -th waist joint (if multiple, or a specific one).
- : Reference waist pitch orientation from the mimic motion.
- : Number of waist joints considered for orientation.
- : A scaling factor.
- : Squared Euclidean norm.
- : Squared absolute value.
- : Exponential function, converting error into a positive reward.
-
Action Smoothness (for
LimX-Oli-31dof-Velocityand implicit forLimX-Oli-31dof-Mimic):- Conceptual Definition: Measures the jerkiness or abruptness of the robot's movements. A smoother policy changes actions less drastically over consecutive time steps, which is crucial for stability, energy efficiency, and preventing damage to the robot in real-world deployment.
- Mathematical Formula (from Table 1, under
Action smoothness):- (Note: The formula in the table is , which seems to be a typo or a non-standard form for jerk. The standard finite difference for acceleration of action is . Assuming the intent is to penalize rapid changes in acceleration/jerk, the first form is more standard. However, strictly adhering to the paper's formula, it is )
- Symbol Explanation:
- : Action at current time step .
- : Action at previous time step
t-1. - : Action at two previous time steps
t-2. - : A negative weighting coefficient (e.g., -2.5e-3 from Table 1), indicating a penalty.
- : Squared Euclidean norm, penalizing large differences.
-
-
Training Efficiency:
- Conceptual Definition: Evaluates how quickly an
RL agentcan learn a goodpolicyand converge to high performance. It is often measured by the number ofenvironment interactionsortraining episodesrequired to reach a certain performance level. - Measurement: Compared by analyzing the learning curves (e.g., reward accumulation over episodes) of different methods, looking for faster increases and earlier plateaus.
- Conceptual Definition: Evaluates how quickly an
-
Real-world Deployment Effectiveness:
- Conceptual Definition: Assesses how well the learned
policyperforms on a physical robot or in a highly realistic simulation, considering factors like robustness, stability, and safety.Action smoothnessis a key indicator here. - Measurement: Primarily through
Sim2Sim evaluation(e.g., onMuJoCo) andreal-robot testing, visually and quantitatively observing behaviors, especially theaction smoothness reward term.
- Conceptual Definition: Assesses how well the learned
5.3. Baselines
The paper uses PPO [28] as the backbone RL algorithm. The proposed PvP method is compared against vanilla PPO and PPO combined with three representative SRL methods, each exemplifying a distinct methodological paradigm:
-
Vanilla PPO:
- Description: The standard PPO algorithm without any State Representation Learning module. This serves as the fundamental baseline to evaluate the effectiveness of adding SRL.
- Representativeness: Represents the current state-of-the-art policy gradient method for many RL tasks, especially in robotics.

PPO + VAE [10]:
- Description: PPO integrated with a Variational Autoencoder (VAE) for SRL. The VAE learns latent representations by attempting to reconstruct the input observations, while regularizing the latent space.
- Representativeness: Exemplifies reconstruction-based SRL methods, which are widely used for learning compressed representations.

PPO + SPR [31]:
- Description: PPO integrated with Self-Predictive Representations (SPR) for SRL. SPR learns latent representations by enforcing multi-step consistency between predicted and encoded future states, focusing on capturing environment dynamics.
- Representativeness: Represents dynamics modeling-based SRL methods, which aim to extract features relevant to predicting future states and actions.

PPO + SimSiam [1]:
- Description: PPO integrated with the SimSiam algorithm for SRL. SimSiam is a self-supervised contrastive learning method that learns representations by comparing augmented views of the same input, using a Siamese network architecture and stop-gradient to prevent collapse.
- Representativeness: Exemplifies contrastive learning-based SRL methods that emphasize learning invariant features without explicit negative pairs. This is the closest baseline to PvP in terms of its underlying contrastive learning mechanism.

By comparing against these diverse baselines, the authors aim to demonstrate
-
5.4. PPO Network Architectures and Hyperparameters
The network architectures for the policy and value networks and the PPO hyperparameters are fixed across all experiments to ensure a fair comparison and isolate the effects of the SRL methods.
The following are the architectures of the policy and value network, which remain fixed for all the experiments (Table 5 from the original paper):
| Part | Policy Network | Value Network |
|---|---|---|
| Encoder | Linear(O. D., 512), ELU(), Linear(512, 256), ELU(), Linear(256, 128) | Linear(O. D., 512), ELU(), Linear(512, 256), ELU(), Linear(256, 128) |
| Head | Linear(128, 128), ELU(), Linear(128, 31) | Linear(256, 128), ELU(), Linear(128, 1) |
-
O. D.stands for "On-Demand," meaning the input dimension is determined by the specific task's observation space. -
ELU()is theExponential Linear Unitactivation function. -
Both
policyandvaluenetworks have a multi-layerencoderstructure, but theirheadsdiffer: thepolicy headoutputs 31 values (for 31 actuated joints), while thevalue headoutputs a single scalar value.The following are the PPO hyperparameters for the two tasks, which remain fixed for all experiments (Table 6 from the original paper):
| Hyperparameter | Value |
|---|---|
| Reward normalization | Yes |
| LSTM | No |
| Maximum Episodes | 30000 |
| Episode steps | 32 |
| Number of workers | 1 |
| Environments per worker | 4096 |
| Optimizer | Adam |
| Learning rate | 1e-3 |
| Learning rate scheduler | Adaptive |
| GAE coefficient | 0.95 |
| Action entropy coefficient | 0.01 |
| Value loss coefficient | 1.0 |
| Value clip range | 0.2 |
| Max gradient norm | 0.5 |
| Number of mini-batches | 4 |
| Number of learning epochs | 5 |
| Desired KL divergence | 0.01 |
| Discount factor | 0.99 |
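For reference, a minimal PyTorch sketch of the Table 5 network layout is shown below; `obs_dim` and `priv_dim` stand in for the task-dependent "On-Demand" input sizes, the class name is illustrative, and the value head is wired to the 128-dimensional encoder output so the sketch runs end to end (which differs from the 256-input head printed in Table 5).

```python
import torch.nn as nn

def mlp_encoder(in_dim: int) -> nn.Sequential:
    """Encoder layout shared by the policy and value networks in Table 5."""
    return nn.Sequential(nn.Linear(in_dim, 512), nn.ELU(),
                         nn.Linear(512, 256), nn.ELU(),
                         nn.Linear(256, 128))

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, priv_dim: int, num_joints: int = 31):
        super().__init__()
        self.policy_encoder = mlp_encoder(obs_dim)    # proprioceptive input
        self.value_encoder = mlp_encoder(priv_dim)    # privileged input
        self.policy_head = nn.Sequential(nn.Linear(128, 128), nn.ELU(), nn.Linear(128, num_joints))
        self.value_head = nn.Sequential(nn.Linear(128, 128), nn.ELU(), nn.Linear(128, 1))

    def forward(self, obs, priv):
        return self.policy_head(self.policy_encoder(obs)), self.value_head(self.value_encoder(priv))
```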
5.5. SRL Specific Hyperparameters (Appendix C)
- PPO + PvP:
  - Privileged Information: Root linear velocity (relative to the world coordinate system) for velocity tracking, with root orientation information also involved for motion imitation.
  - Zero Masking: Applied to the proprioceptive state throughout training to align its dimension with the privileged state. (Note: this phrasing is slightly confusing. Based on the PvP method description, ZeroMasking is applied to the privileged state to derive a proprioceptive-like view $\tilde{\pmb s}_t$, which is then contrasted with the original privileged state $\pmb s_t$; it does not mask the proprioceptive state itself for dimension alignment, but masks elements within the privileged state.)
  - Loss Coefficient ($\lambda$): Searched over a set of candidate values, with 0.5 used as the baseline.
- PPO + SimSiam [1]:
  - Loss Coefficient ($\lambda$): Searched over a set of candidate values, with 0.5 used as the baseline.
  - Data Augmentation Operations: Searched over {random_masking, gaussian_noise, random_amplitude_scaling, identity_mapping}. The random_masking and identity_mapping operations were used as baseline settings, because the proprioceptive state in the simulator is already subject to domain randomization, which acts as a form of data augmentation.
- PPO + SPR [31]:
  - Loss Coefficient ($\lambda$): Searched over a set of candidate values, with 0.5 used as the baseline.
  - Data Augmentation Operations: Searched over {random_masking, gaussian_noise, random_amplitude_scaling, identity_mapping}. The gaussian_noise operation was used as the baseline setting.
  - Number of Prediction Steps ($K$ in $L_{\mathrm{SPR}}$): Searched over a set of candidate values, with 5 steps used as the baseline.
  - Average Loss: Whether to use an average loss (not specified if included in the baseline).
- PPO + VAE [10]:
Loss Coefficient(): Searched over , with0.1used as the baseline setting.
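To make the ZeroMasking clarification above concrete, here is a hedged sketch of deriving a proprioceptive-like view by zeroing the privileged-only entries of the privileged state and contrasting the two views. The index layout, module names, and SimSiam-style loss form are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pvp_contrastive_loss(encoder, projector, predictor,
                         privileged_state, privileged_dims):
    """Hedged sketch of a PvP-style contrastive objective.

    privileged_state: (batch, d_priv) tensor containing proprioceptive
        entries plus privileged-only entries (e.g., root linear velocity).
    privileged_dims: indices of the privileged-only entries to zero out,
        yielding a proprioceptive-like view of the same dimension.
    """
    masked_view = privileged_state.clone()
    masked_view[:, privileged_dims] = 0.0  # ZeroMasking: drop privileged-only info

    z_priv = projector(encoder(privileged_state))   # full privileged view
    z_prop = projector(encoder(masked_view))        # proprioceptive-like view

    # SimSiam-style symmetric loss with stop-gradient on the target branch
    loss = -0.5 * (
        F.cosine_similarity(predictor(z_prop), z_priv.detach(), dim=-1).mean()
        + F.cosine_similarity(predictor(z_priv), z_prop.detach(), dim=-1).mean()
    )
    return loss
```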
6. Results & Analysis
The experimental results are presented to answer specific research questions (Q1-Q6) regarding the performance, impact of training configurations, and computational efficiency of PvP and other SRL methods for humanoid WBC.
6.1. Core Results Analysis
Q1: Can the proposed PvP algorithm outperform the baseline methods?
The paper presents an analysis of overall task performance, action smoothness optimization, and key performance indicators (KPIs) to answer this question.
The overall reward comparison between the vanilla PPO agent and its combinations with the four SRL methods on the two humanoid WBC tasks is presented in Figure 5 (from the original paper):
Figure 5. The normalized scores of the LimX Oli robot across velocity tracking and motion imitation tasks with various algorithms over training progress. It compares PPO with several combined strategies, including VAE, SPR, SimSiam, and PvP, showing that PvP significantly improves sample efficiency and final performance.
- Overall Task Performance (Figure 5):
  - Velocity Tracking Task (Left Plot): PvP significantly accelerates the learning process, achieving higher normalized scores much faster than other methods. Vanilla PPO is the slowest; SPR and SimSiam offer marginal improvements over PPO, and VAE shows minimal benefit. This demonstrates PvP's advantage in leveraging privileged information to extract more informative features, thereby accelerating learning.
  - Motion Imitation Task (Right Plot): PvP again achieves the highest performance and sample efficiency. SimSiam and SPR also outperform Vanilla PPO, showing the benefits of SRL. However, VAE exhibits performance degradation, suggesting that simple reconstruction of sensory data is insufficient or even detrimental for learning efficiency in this complex task.
  - Conclusion: Learning high-quality features, especially with PvP, improves both learning efficiency and final performance in humanoid WBC tasks.
The comparison of action smoothness optimization between the vanilla PPO agent and its combinations with the four SRL methods is presented in Figure 6 (from the original paper):
Figure 6. The comparison of action smoothness optimization between the vanilla PPO agent and its combinations with the four SRL methods. The solid line and shaded region denote the mean and standard deviation, respectively.
- Action Smoothness Optimization (Figure 6 - Velocity Tracking Task):
  - Action smoothness is a crucial reward term that penalizes abrupt movements, ensuring smoother and more controlled motions, which is vital for reliable real-world deployment; a common formulation of such a penalty is sketched after this list.
  - PvP significantly accelerates the convergence of this penalty term, i.e., it minimizes the penalty and thus achieves smoother actions much faster. This indicates that PvP not only speeds up policy learning in simulation but also directly contributes to safer and more reliable behavior in real-world applications.
  - Other SRL methods also show some improvement over Vanilla PPO, but PvP's effect is most pronounced.
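As a reference point for the penalty discussed above, the following is one common formulation of an action-smoothness term in legged/humanoid RL (a penalty on consecutive action differences). The paper does not give its exact definition, so this is an assumed form with illustrative names.

```python
import torch

def action_smoothness_penalty(a_t, a_prev, a_prev2=None):
    """One common action-smoothness penalty (assumed form, not the paper's).

    The first-order term penalizes the change between consecutive actions;
    the optional second-order term penalizes changes in that change.
    """
    penalty = torch.sum((a_t - a_prev) ** 2, dim=-1)
    if a_prev2 is not None:
        penalty = penalty + torch.sum((a_t - 2 * a_prev + a_prev2) ** 2, dim=-1)
    return -penalty  # negative: added to the reward as a penalty
```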
The tracking performance comparison between the PPO agent and its combinations with the four SRL methods is presented in Figure 7 (from the original paper):
Figure 7. The tracking performance comparison between the PPO agent and its combinations with the four SRL methods. Our PvP achieves the highest performance in terms of the three key tracking metrics.
- Tracking Performance (Figure 7 - Motion Imitation Task):
  - For motion imitation, control precision is paramount. The figure compares PvP and other baselines across three KPIs: waist pitch orientation, feet distance, and joint position (likely tracking error; smaller is better for the negative values).
  - PvP consistently achieves the highest performance across all three metrics, indicating that it provides reliable gains on critical KPIs in diverse, demanding tasks. The curves show that PvP reaches lower error (higher reward) faster and attains better final performance.
Overall Answer to Q1: Yes, PvP consistently and significantly outperforms Vanilla PPO and the other SRL baselines (VAE, SPR, SimSiam) in terms of sample efficiency, final performance (overall reward and specific KPIs), and action smoothness across both velocity tracking and motion imitation tasks.
Q2: How does the proportion of training time affect SRL's performance?
The training progress comparison of the four SRL methods with different training time proportions on the two humanoid WBC tasks is presented in Figure 8 (from the original paper):
Figure 8. Training progress comparison of the four SRL methods with different training time proportions on the two humanoid WBC tasks. The solid line and shaded region denote the mean and standard deviation, respectively.
- This question investigates the interval update mechanism for the SRL loss, exploring different update intervals. The intervals tested are 1 (synchronous update), 50, and 100.
- Velocity Tracking Task (Left Plot): Adjusting the update interval has minimal effect on the performance curves for all SRL methods; the curves for intervals 1, 50, and 100 largely overlap.
- Motion Imitation Task (Right Plot): Here, the update interval has a clear impact. An update interval of 50 (meaning the SRL loss is applied every 50 time steps) generally yields optimal or near-optimal performance for all SRL methods. Applying SRL at every step (interval 1) or too infrequently (interval 100) can lead to slightly lower performance or slower convergence compared to interval 50.
- Conclusion: Carefully selecting the update interval (e.g., 50) can improve SRL performance, especially in more complex tasks like motion imitation. This is attributed to preventing SRL from prematurely falling into local optima due to low-quality data in early training stages, and to reducing computational overhead. A minimal sketch of interval-gated SRL updates is given after this list.
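The interval update mechanism can be pictured as gating the auxiliary SRL loss inside the training loop. The sketch below is a hedged illustration; `agent.ppo_loss`, `agent.optimizer`, and `srl_module.loss` are placeholder names, not SRL4Humanoid APIs.

```python
def training_update(step, batch, agent, srl_module,
                    srl_interval=50, srl_coef=0.5):
    """One optimization step with an interval-gated auxiliary SRL term.

    The SRL loss is only added every `srl_interval` steps; otherwise the
    update reduces to a plain PPO step.
    """
    loss = agent.ppo_loss(batch)
    if step % srl_interval == 0:
        loss = loss + srl_coef * srl_module.loss(batch)
    agent.optimizer.zero_grad()
    loss.backward()
    agent.optimizer.step()
    return loss.detach()
```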
Q3: How does the proportion of training data affect SRL's performance?
The training progress comparison of the four SRL methods with different training data proportions on the two humanoid WBC tasks is presented in Figure 9 (from the original paper):
Figure 9. Training progress comparison of the four SRL methods with different training data proportions on the two humanoid WBC tasks. The solid line and shaded region denote the mean and standard deviation, respectively.
- This question explores the impact of using different percentages of the collected rollout data for SRL updates in each episode (10%, 50%, and 100%), with data resampled via random masking.
- Velocity Tracking Task (Left Plot): Using different proportions of training data for SRL results in nearly identical training curves for all methods, suggesting that for this task the quantity of data used for SRL within each update does not significantly alter performance.
- Motion Imitation Task (Right Plot): In contrast, for motion imitation, increasing the data proportion generally enhances performance; this effect is particularly noticeable for SimSiam and PvP, where the larger data proportions reach the best or comparable final performance.
- Conclusion: Allocating an appropriate (typically larger) proportion of the training data to SRL accelerates learning and improves performance, especially in more data-sensitive tasks like motion imitation. A short subsampling sketch is given after this list.
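The data-proportion ablation amounts to subsampling the collected rollouts before the SRL update. Below is a hedged sketch using a plain random subset; the actual resampling-via-random-masking procedure in the paper may differ.

```python
import torch

def sample_srl_batch(rollout_states: torch.Tensor, proportion: float = 0.5):
    """Randomly subsample a proportion of rollout states for the SRL update.

    rollout_states: (num_samples, state_dim) tensor of collected states.
    proportion: fraction of the rollout used for the auxiliary SRL loss
        (e.g., 0.1, 0.5, or 1.0 as in the ablation).
    """
    num_samples = rollout_states.shape[0]
    num_selected = max(1, int(proportion * num_samples))
    idx = torch.randperm(num_samples)[:num_selected]
    return rollout_states[idx]
```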
Q4: Which encoder (policy or value) benefits more from applying SRL loss?
The learning curves of applying the SRL to the value encoder are presented in Figure 10 (from the original paper):
Figure 10. Learning curves of applying the SRL to the value encoder. The solid line and shaded region denote the mean and standard deviation, respectively.
- Previous experiments applied the SRL loss to the policy encoder (which processes the proprioceptive state). This ablation study investigates applying the SRL loss to the value encoder (which processes the richer privileged state) instead. SPR is excluded because it requires state-action pairs for training, which might not align cleanly with the value encoder's typical state-only input.
- Comparison (Figure 10): Applying the SRL loss to the value encoder generally leads to slower convergence and lower overall performance compared to applying it to the policy encoder (as seen in Figure 5).
- Velocity Tracking Task (Top Plot - Action Smoothness): When SRL (specifically SimSiam or PvP) is applied to the value encoder, a collapse in training is observed, indicated by a sharp drop in action smoothness (i.e., highly unsmooth, heavily penalized actions) before eventual recovery. This suggests significant instability.
- Motion Imitation Task (Bottom Plot - Normalized Score): While PvP and SimSiam on the value encoder still learn, their final performance and convergence speed are inferior to applying SRL to the policy encoder.
- Conclusion: Applying SRL to the policy encoder (which learns representations for the proprioceptive state used by the actor) yields a more stable learning process and enhanced performance. The value encoder (which guides the critic based on privileged states) appears less receptive to direct SRL enhancement, or at least requires careful tuning to avoid instability. A minimal sketch of the two attachment options follows.
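The two attachment points can be summarized as choosing which encoder the auxiliary SRL gradients flow into. The sketch below is illustrative only; the function and argument names are assumptions.

```python
def total_loss(ppo_losses, srl_loss_fn, policy_encoder, value_encoder,
               proprio_batch, privileged_batch,
               srl_coef=0.5, attach_to="policy"):
    """Combine the PPO objective with an auxiliary SRL term on one encoder.

    attach_to="policy": SRL gradients flow into the policy encoder
        (proprioceptive inputs), the stable, better-performing setting.
    attach_to="value": SRL gradients flow into the value encoder
        (privileged inputs), which the ablation found prone to collapse.
    """
    loss = ppo_losses["policy"] + ppo_losses["value"]
    if attach_to == "policy":
        loss = loss + srl_coef * srl_loss_fn(policy_encoder, proprio_batch)
    else:
        loss = loss + srl_coef * srl_loss_fn(value_encoder, privileged_batch)
    return loss
```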
Q5: How computation-efficient are these SRL methods?
- Findings: The authors state that all experiments were run on a single GPU using IsaacLab [23]. Their implementation allows the SRL module to run entirely on the GPU, implying that it does not significantly impact overall training efficiency or introduce a CPU bottleneck.
- Conclusion: SRL4Humanoid effectively accelerates humanoid WBC tasks with minimal additional computational cost. Specific quantitative benchmarks (e.g., training time with/without SRL, memory usage) are not detailed in the paper text but are noted to be attached to the compute reporting form (CRF).
Q6: How do the SRL-enhanced methods behave in real-world deployment?
The Sim2Sim evaluation on the MuJoCo simulator is presented in Figure 11 (from the original paper):
Figure 11. Sim2Sim evaluation on the MuJoCo simulator. The first two rows demonstrate motion imitation ability, and the last two rows show velocity tracking ability.
- Sim2Sim Evaluation (Figure 11): The authors first conduct a thorough Sim2Sim evaluation on the MuJoCo platform [40]. MuJoCo is known for its high simulation precision, which is considered closer to real-world conditions than IsaacLab [21]. Figure 11 visually demonstrates the robot executing complex tasks (motion imitation in the first two rows and velocity tracking in the last two rows) with the learned policy, indicating that the policies generalize well to a different, more realistic simulator.
- Real-world Evaluation (Figure 1): Following the Sim2Sim evaluation, real-robot testing was performed on the LimX Oli humanoid robot. The paper includes an image of the robot performing tasks, implying successful deployment (Figure 1 is the first image of the entire paper, showing the robot in action). More demonstrations are available in the Supporting Videos.
- Conclusion: The Sim2Sim and real-robot evaluations confirm the effectiveness of the SRL-enhanced policies (particularly PvP) in real-world scenarios, demonstrating the robot's ability to perform complex tasks with the learned policies. The earlier results on action smoothness (Figure 6) also implicitly support better real-world behavior.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully proposes PvP (Proprioceptive-Privileged contrastive learning), a novel framework designed to enhance sample efficiency and performance in humanoid Whole-Body Control (WBC) tasks. PvP achieves this by effectively leveraging the intrinsic complementarity between proprioceptive and privileged states for contrastive learning, thereby producing robust and task-relevant proprioceptive representations for policy learning without relying on complex, hand-crafted data augmentations.
Alongside PvP, the authors introduce SRL4Humanoid, a unified, modular, and open-source framework. This framework provides high-quality implementations of representative State Representation Learning (SRL) techniques, facilitating reproducible research and systematic evaluation in humanoid robot learning.
Extensive experiments on the LimX Oli humanoid robot, across challenging velocity tracking and motion imitation tasks, demonstrate PvP's significant improvements in sample efficiency, overall reward, key performance indicators (KPIs), and crucially, action smoothness, compared to various SRL baselines. The study also provides valuable practical insights into optimal SRL integration strategies, such as the effectiveness of interval updates and the superior benefit of applying SRL to the policy encoder over the value encoder. These findings collectively highlight the substantial potential of SRL-empowered Reinforcement Learning for humanoid WBC tasks.
7.2. Limitations & Future Work
The authors acknowledge several limitations and outline future research directions:
- Exploration of Additional SRL Techniques: While the efficacy of the PvP framework was demonstrated using several representative SRL methods, future research could explore integrating other SRL techniques to further enhance policy learning. The field of SRL is rapidly evolving, and new methods might offer further benefits.
- Incorporating Multimodal Data: Recent advancements in perception-based humanoid research highlight the potential of integrating multimodal data, such as RGB or depth images, into policy learning. The current PvP framework primarily focuses on proprioceptive and privileged states (which are mostly numerical robot/environment states). The authors aim to extend their work to these settings, which would expand the capabilities of humanoid robots to operate in more complex, visually rich environments.
7.3. Personal Insights & Critique
This paper presents a well-structured and empirically strong contribution to data-efficient humanoid robot learning.
Personal Insights:
- Elegant Use of Privileged Information: The core idea of PvP, using the privileged state as a "pseudo augmentation" for the proprioceptive state in a contrastive learning framework, is quite elegant. It cleverly bridges the gap between the rich information available in simulation and the limited information available to a real robot's policy. This approach allows the policy encoder to learn highly informative features from proprioceptive data that implicitly capture privileged dynamics, without ever explicitly exposing privileged information to the deployed policy, making it robust for sim-to-real transfer.
- Practical Framework for Reproducibility: The SRL4Humanoid framework is a significant practical contribution. The lack of standardized, modular codebases is a common bottleneck in RL research. By open-sourcing high-quality implementations, the authors facilitate reproducibility and accelerate future research in humanoid WBC, allowing other researchers to easily build upon or compare against their work.
- Thorough Ablation Studies: The detailed ablation studies on training time proportion, training data proportion, and SRL application to the value encoder provide invaluable practical guidance. These insights are often overlooked in theoretical papers but are critical for practitioners aiming to implement SRL effectively in complex RL systems. The finding about interval updates for the SRL loss is particularly useful for avoiding performance degradation during early training.
- Emphasis on Real-world Metrics: The focus on action smoothness as a key performance indicator, and its significant improvement with PvP, is commendable. For real-world robot deployment, safety and smooth operation are often as important as task accuracy.
Critique / Areas for Improvement:
- "ZeroMasking" Nuance: While the concept of ZeroMasking is clear, the paper states in the supplementary that "we attach the zero mask to the proprioceptive state in the whole training to align its dimension with the privileged state." This phrasing might be slightly confusing. The methodological description (Section 4.1) correctly states that ZeroMasking is applied to the privileged state to derive a proprioceptive-like view. Clarifying this in the supplementary or main text would prevent potential misinterpretation. It would also be interesting to know whether other masking strategies (e.g., Gaussian noise on privileged features, or simply dropping privileged features) were considered for creating the masked view, and why ZeroMasking was chosen.
- Quantitative Computational Efficiency: While the paper mentions that SRL4Humanoid runs on the GPU with minimal cost and attaches logs to a CRF, providing some quantitative benchmarks (e.g., actual training time comparisons, GPU memory usage with/without SRL) within the paper's results section would strengthen this claim and offer more concrete evidence for practitioners.
- Generalization of Privileged Information: The privileged state in this paper includes root linear velocity and root orientation. While these are very useful, they are still somewhat restricted. The authors' future work mentions multimodal data, which is a step further. It would be insightful to discuss how this privileged information might be scaled or generalized if the privileged state included more abstract or complex concepts (e.g., the intent of a human collaborator, or properties of an unknown object the robot is interacting with).
- Robustness to Imperfect Privileged Information: In some simulations, even privileged information might not be perfectly accurate. The paper assumes a perfect privileged state. Exploring PvP's robustness when the privileged state itself is noisy or slightly inaccurate would be an interesting avenue.
- Further Investigation into Value Encoder Collapse: The collapse in training when SRL is applied to the value encoder (Figure 10) is a significant observation. While the paper notes it and advises against it, a deeper analysis into why this occurs (e.g., gradient dynamics, conflicting objectives, differing representation needs of the actor vs. the critic) would be valuable for understanding SRL integration more generally.
Transferability & Applications:
The PvP framework and the SRL4Humanoid toolkit are highly transferable. The principle of using privileged information to guide representation learning for a proprioceptive policy could be applied to other complex robotic systems (e.g., legged robots, manipulators) or even other RL domains where a rich "teacher" signal is available during training but not deployment. The modular nature of SRL4Humanoid also makes it a valuable resource for comparing different SRL approaches in various RL contexts, beyond just humanoids.