WHOLEBODYVLA: TOWARDS UNIFIED LATENT VLA FOR WHOLE-BODY LOCO-MANIPULATION CONTROL
TL;DR Summary
This study presents `WholeBodyVLA`, a unified latent vision-language-action framework enhancing humanoid robots' performance in loco-manipulation tasks. It learns from low-cost egocentric videos and employs a tailored reinforcement learning policy, achieving a 21.3% performance improvement over the best prior baseline on the AgiBot X2 humanoid.
Abstract
Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables the Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of the first of its kind enabling large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming the best prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is the development of WholeBodyVLA, a unified latent vision-language-action (VLA) framework designed for whole-body loco-manipulation control in humanoid robots.
1.2. Authors
The authors are: Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, and Hongyang Li. Their affiliations include Fudan University, OpenDriveLab & MMLab at The University of Hong Kong, AgiBot, and SII. Haoran Jiang and Jin Chen are noted as having equal contributions, while Zhihui Peng and Hongyang Li are indicated as project co-leads. These affiliations suggest a collaborative effort between academic institutions and robotics companies, pooling expertise in robotics, computer vision, and machine learning.
1.3. Journal/Conference
The paper does not explicitly state the journal or conference it was published in. However, given the nature of the research (humanoid robotics, vision-language-action models, reinforcement learning) and the affiliations (MMLab at HKU, OpenDriveLab), it is likely intended for a top-tier robotics or AI conference (e.g., CoRL, RSS, ICRA, IROS, NeurIPS, ICML) or a reputable journal in these fields. These venues are highly competitive and influential in their respective domains.
1.4. Publication Year
The publication date (UTC) is 2025-12-11.
1.5. Abstract
Humanoid robots face significant challenges in performing complex loco-manipulation tasks due to deficiencies in manipulation-aware locomotion in existing modular or end-to-end approaches. This limitation restricts robots to small workspaces. The paper identifies two root causes: (1) the scarcity of humanoid teleoperation data for acquiring loco-manipulation knowledge, and (2) the imprecise and unstable execution of locomotion commands by current RL controllers.
To address these, the authors propose WholeBodyVLA, a unified latent learning framework. This framework enables Vision-Language-Action (VLA) systems to learn from low-cost action-free egocentric videos, augmenting data collection with an efficient human data pipeline. To improve execution, they introduce a loco-manipulation-oriented (LMO) RL policy specifically designed for accurate and stable core loco-manipulation movements (advancing, turning, squatting).
WholeBodyVLA is presented as a pioneering framework for large-space humanoid loco-manipulation. Experimental results on the AgiBot X2 humanoid demonstrate superior performance, outperforming prior baselines by 21.3%, along with strong generalization and extensibility across diverse tasks.
1.6. Original Source Link
https://opendrivelab.com/WholeBodyVLA/static/pdf/WholeBodyVLA.pdf

This is the official PDF link provided for the research paper, which appears to be a preprint or technical report associated with the OpenDriveLab and AgiBot projects.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the limited ability of humanoid robots to perform large-space, complex loco-manipulation tasks effectively. Loco-manipulation refers to tasks that require close coordination between locomotion (movement through an environment) and manipulation (dexterous interaction with objects).
This problem is critically important because humanoid robots are envisioned as general-purpose embodied agents capable of operating in human-centric environments. Achieving this vision necessitates seamless loco-manipulation. However, current approaches, whether modular (separating locomotion and manipulation stages) or end-to-end (single policy for both), suffer from manipulation-aware locomotion deficiencies. This means robots often fail to position themselves optimally for manipulation tasks, restricting their operational workspace.
The paper attributes these deficiencies to two main challenges:
- Data Scarcity: Learning robust loco-manipulation knowledge requires vast amounts of data. Humanoid teleoperation data, where skilled operators control a robot to perform tasks, is prohibitively expensive and time-consuming to collect at scale. Without sufficient data, models lack the experience to learn how locomotion should strategically support manipulation.
- Execution Imprecision and Instability: Even with a high-level plan, existing Reinforcement Learning (RL) controllers for humanoid locomotion often struggle with precision and stability in executing commands like starting, stopping, turning, or squatting. This decision-execution misalignment leads to failures (e.g., stumbling, path deviation) that prevent successful loco-manipulation.

The paper's innovative idea is to tackle these issues by (1) leveraging readily available low-cost action-free egocentric videos to acquire rich loco-manipulation priors through a unified latent learning framework, and (2) developing a specialized Loco-Manipulation-Oriented (LMO) RL policy that provides more stable and precise low-level control for manipulation-relevant movements.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Introduction of WholeBodyVLA: They present WholeBodyVLA, a novel Vision-Language-Action (VLA) framework that enables bipedal humanoids to perform end-to-end large-space loco-manipulation autonomously in real-world settings. This framework unifies high-level reasoning with low-level control, addressing the limitations of prior decoupled or partial loco-manipulation systems.
- Unified Latent Learning Framework: They propose unified latent learning, a method that allows joint locomotion-manipulation learning from abundant, low-cost action-free egocentric videos. This approach significantly alleviates the data scarcity problem by converting observation-only videos into discrete latent actions that serve as supervision for VLA training. The framework uses separate Latent Action Models (LAMs) for manipulation and locomotion to handle their distinct visual change patterns, which are then jointly used to supervise the VLA.
- Loco-Manipulation-Oriented (LMO) RL Policy: They introduce an LMO RL policy specifically designed to mitigate decision-execution misalignment. Unlike traditional velocity-tracking objectives, the LMO policy utilizes a simplified discrete command interface tailored for accurate and stable execution of fundamental loco-manipulation movements such as advancing, turning, and squatting, even under dynamic disturbances. This ensures that high-level VLA decisions are faithfully translated into reliable robot actions.

The key conclusions and findings are:
- WholeBodyVLA significantly outperforms prior modular and end-to-end baselines in real-world loco-manipulation tasks on the AgiBot X2 humanoid, reaching a 78.0% average success rate versus 64.0% for the modular design and 56.7% for the best prior VLA baseline (Table 2).
- The unified latent learning from action-free videos dramatically improves performance and reduces reliance on expensive teleoperation data. Models pretrained with more human videos achieve comparable performance with significantly less fine-tuning data.
- The LMO RL policy is crucial for reliable long-range and multi-step loco-manipulation, reducing the stumbling, path deviation, and turning-while-advancing errors that plague velocity-based controllers.
- WholeBodyVLA demonstrates strong generalization across changed start-poses, objects, layouts, and appearances, and extends to long-horizon and challenging scenarios (e.g., uneven terrain traversal, visual navigation, vacuum cleaning, wiping).

These findings address the core problem by providing a framework that can learn from more accessible data and execute actions more reliably, paving the way for more capable and autonomous humanoid robots.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand WholeBodyVLA, a foundational grasp of several robotics and machine learning concepts is essential for beginners:
- Humanoid Robots: These are robots designed to resemble the human body, typically bipedal (two legs) with a torso, head, and two arms. Their human-like form makes them suitable for operating in environments designed for humans, but also poses significant challenges in terms of balance, locomotion, and dexterous manipulation. The AgiBot X2 humanoid used in this paper is an example, featuring multiple degrees of freedom (DoF) for arms, legs, and waist.
- Locomotion: The ability of a robot to move through its environment. For humanoid robots, this includes walking forward/backward, lateral stepping, turning, squatting, and maintaining balance.
- Manipulation: The ability of a robot to interact with objects in its environment, such as grasping, lifting, placing, pushing, or using tools. This often requires dexterous (skillful) control of robotic arms and end-effectors (like grippers).
- Loco-manipulation: The coordinated execution of both locomotion and manipulation tasks. For example, walking towards a table, then squatting to pick up an object, and finally turning to place it on a cart. This requires seamless integration and awareness between movement and interaction.
- Vision-Language-Action (VLA) Systems: These are AI systems that take visual observations (e.g., camera images) and language instructions (e.g., "pick up the red box") as input and output robot actions (e.g., motor commands). They aim to enable robots to understand high-level human commands and translate them into physical actions in the real world.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired behaviors and penalties for undesired ones, learning a policy (a mapping from states to actions) that maximizes cumulative reward. RL controllers in robotics are policies that learn to generate low-level motor commands to achieve a desired behavior (e.g., walking stably).
- Imitation Learning: A learning paradigm where a robot learns by observing demonstrations from an expert (e.g., a human teleoperating the robot). The goal is to imitate the expert's actions given the same observations. This is often used to bootstrap RL policies or learn complex skills that are difficult to define with explicit reward functions.
- Teleoperation Data: Data collected by a human operator remotely controlling a robot. This provides direct action-labeled trajectories (observations paired with corresponding robot actions), which are valuable for imitation learning. However, it is typically expensive and time-consuming to gather.
- Egocentric Videos / Action-Free Videos: Videos captured from the perspective of the robot (or a human performing the task, simulating the robot's viewpoint). Action-free means these videos only contain visual observations, without explicit robot motor commands or human joint angles as labels. They are cheap and abundant but require methods to extract actionable information.
- Latent Space / Latent Actions: In machine learning, a latent space is a lower-dimensional representation of data, often learned by neural networks. Latent actions are abstract, compressed representations of actions within this latent space. Instead of directly predicting raw motor commands, a model might predict a latent action token, which is then decoded into actual robot movements. This can help in learning from diverse data sources or simplifying complex action spaces.
- Vector Quantized Variational AutoEncoder (VQ-VAE): An unsupervised deep learning model that learns to compress data into a discrete latent representation.
  - Autoencoder: A neural network that learns to encode data into a lower-dimensional latent space and then decode it back to its original form. The goal is reconstruction of the input.
  - Variational Autoencoder (VAE): A type of autoencoder that learns a probability distribution for the latent space, allowing for generative capabilities.
  - Vector Quantization (VQ): A technique that forces the latent representations to be discrete by mapping them to the closest vector in a fixed, learned codebook. This turns continuous latent variables into discrete tokens, which are easier to work with in some contexts (e.g., language models).
  - VQ-VAE is used in this paper to turn frame-to-frame visual changes from videos into compact, discrete latent tokens that represent actions.
- Transformer Architecture: A neural network architecture originally developed for natural language processing, known for its self-attention mechanism. Transformers are highly effective at processing sequential data and capturing long-range dependencies. They are increasingly used in vision (e.g., Vision Transformers, DINOv2) and for VLA systems to process visual and language inputs. DINOv2 is a self-supervised vision transformer used for robust visual features.
- Proprioceptive States: Sensory information about the robot's own body state, such as joint angles, joint velocities, base angular velocity, and gravity vector. These are internal measurements, as opposed to exteroceptive sensing (external environment sensing like vision).
3.2. Previous Works
The paper discusses related work in two main categories: humanoid whole-body control and Vision-Language-Action models.
3.2.1. Humanoid Whole-Body Control
- Loco-manipulation controllers:
  - Early works focused on isolated manipulation or motion imitation (e.g., Cheng et al., 2024; Ji et al., 2024; He et al., 2024; 2025b;a; Ze et al., 2025; Truong et al., 2025; Chen et al., 2025b; Xie et al., 2025; Wang et al., 2025; Cheng et al., 2025; Fu et al., 2025).
  - More recently, RL-based whole-body controllers emerged, often using velocity-tracking interfaces (e.g., Ben et al., 2025; Shi et al., 2025; Li et al., 2025a; Zhang et al., 2025a;b; Sun et al., 2025). These optimize for per-step errors against commanded speeds.
  - Limitations: While suitable for general cruising, velocity-tracking formulations often leave start-stop semantics implicit, lead to fragmented gaits across different speeds, and lack supervision for episode-level controllability (e.g., braking precision, heading fidelity) crucial for loco-manipulation. Upper-body influence is typically modeled as simple noise, not reflecting the realistic inertial couplings from manipulation tasks.
  - Robustness Improvements: Some methods added robustness, like HOMIE (Ben et al., 2025) with PD-stabilized arms, AMO (Li et al., 2025a) with trajectory-optimization hybrids, FALCON (Zhang et al., 2025a) with a force curriculum, R2S2 (Zhang et al., 2025b) with skill libraries, or ULC (Sun et al., 2025) with residual controllers. However, these still inherit the limitations of velocity-centric training, resulting in inconsistent gaits and unstable teleoperation trajectories.
- High-level planners for humanoids:
  - These systems process RGB vision or language inputs to provide higher-level commands to RL controllers. LEVERB (Xue et al., 2025) embeds latent verbs into RL for low-level control. R2S2 (Zhang et al., 2025b), Being-0 (Yuan et al., 2025), and HEAD (Chen et al., 2025a) use modular planners driven by vision-language models (VLMs) to sequence discrete locomotion and manipulation skills.
  - Limitations: These frameworks suffer from brittle skill boundaries (robots ending up in unstable configurations), error accumulation due to a lack of closed-loop feedback, and reliance on cloud-based perception (introducing latency). Being-0 also uses GPT-4o and detectors as external modules.
- VLA for humanoids:
  - Initial efforts like Humanoid-VLA (Ding et al., 2025) focus on locomotion, while GR00T (Bjorck et al., 2025) targets manipulation for humanoid embodiments, each neglecting the other critical primitive for seamless loco-manipulation.
  - The Boston Dynamics demonstration (Boston Dynamics, 2025) shows impressive loco-manipulation but is constrained to a limited workspace and relies heavily on expensive Motion Capture (MoCap) data.
- Overall Gap: These limitations highlight the need for unified frameworks that directly couple vision and language with whole-body control without brittle modular boundaries, a gap WholeBodyVLA aims to fill.
3.2.2. Vision-Language-Action Models
- General VLA systems:
  - Recent advances, building on multimodal foundation models and imitation learning from large-scale real-robot trajectories, have led to powerful VLA systems (e.g., RT-2 (Brohan et al., 2023), OpenVLA (Kim et al., 2024), RDT (Liu et al., 2025), Pi0 (Black et al., 2024), Pi0.5 (Intelligence et al., 2025)).
  - Limitations: These models typically emphasize upper-body manipulation only and do not provide an end-to-end solution for the autonomous whole-body control required by loco-manipulation tasks. WholeBodyVLA aims to integrate locomotion and manipulation into a unified VLA for bipedal humanoids.
- Latent action learning:
  - This approach addresses the data scarcity of action-labeled trajectories by learning from action-free videos. It compresses frame-to-frame visual changes into compact, discrete tokens that supervise policy learning.
  - Representative examples include Genie (Bruce et al., 2024), LAPA (Ye et al., 2025), IGOR (Chen et al., 2024), and UniVLA (Bu et al., 2025b). These studies demonstrate the effectiveness of using action-free videos as supervisory signals for VLA training.
  - Relevance to WholeBodyVLA: Given the even greater difficulty of obtaining large-scale humanoid loco-manipulation data, WholeBodyVLA leverages latent action learning in a unified manner for both locomotion and manipulation.
3.3. Technological Evolution
The field has seen a progression from isolated control of robot components (e.g., separate navigation for mobile robots, arm control for manipulators) to more integrated whole-body control for complex platforms like humanoids. Initially, whole-body control often relied on motion imitation from human MoCap data or simpler velocity-tracking RL policies, which were effective for general motions but struggled with precision and stability under real-world manipulation loads.
The advent of Vision-Language Models (VLMs) and multimodal foundation models has shifted the paradigm towards Vision-Language-Action (VLA) systems, enabling robots to understand semantic instructions and perceive their environment using rich visual input. However, initial VLA efforts for humanoids tended to focus either on locomotion or manipulation, not their seamless integration. Moreover, these systems still faced the hurdle of data scarcity for robot-specific action-labeled trajectories.
Latent action learning emerged as a crucial technique to bridge the data gap, allowing models to learn from readily available action-free videos. WholeBodyVLA fits into this timeline by combining these advancements: it builds on the VLA paradigm, uses latent action learning from action-free videos to overcome data scarcity for humanoid loco-manipulation, and crucially, introduces a specialized RL controller (LMO) to address the precision and stability issues inherent in low-level humanoid locomotion for manipulation-aware tasks. This represents an evolution towards truly end-to-end, general-purpose humanoid agents.
3.4. Differentiation Analysis
WholeBodyVLA differentiates itself from previous works primarily through two key innovations: unified latent learning and the loco-manipulation-oriented (LMO) RL policy.
- Against Modular Pipelines (e.g., R2S2, Being-0, HEAD):
  - Differentiation: WholeBodyVLA offers an end-to-end framework, learning closed-loop feedback and joint optimization between locomotion and manipulation. Modular approaches typically use high-level planners to sequence discrete skills (e.g., navigate, then grasp), leading to brittle skill boundaries and error accumulation, where the robot might end up in suboptimal or unstable configurations after locomotion, making subsequent manipulation difficult.
  - Innovation: WholeBodyVLA avoids these handoff issues by learning a cohesive action space where locomotion and manipulation interact, guided by unified latent actions.
- Against End-to-End VLA for Humanoids (e.g., Humanoid-VLA, GR00T, Boston Dynamics):
  - Differentiation: Humanoid-VLA focuses on locomotion, GR00T on manipulation, and the Boston Dynamics demo, while impressive, relies on expensive MoCap data and is constrained to a limited workspace. WholeBodyVLA specifically targets unified loco-manipulation across large spaces.
  - Innovation: WholeBodyVLA leverages unified latent learning from low-cost action-free videos to mitigate the data scarcity problem, which is a major bottleneck for training end-to-end whole-body policies. This allows it to acquire loco-manipulation knowledge at a scale not feasible with teleoperation or MoCap alone.
- Against Existing RL Loco-manipulation Controllers (e.g., HOMIE, AMO, FALCON, ULC):
  - Differentiation: These controllers often use continuous velocity-tracking objectives, which are good for general cruising but lack the precision and stability needed for start-stop semantics and the fine-grained positional control critical for manipulation-aware locomotion. They also poorly model the structured inertial couplings from real manipulation tasks.
  - Innovation: The LMO RL policy in WholeBodyVLA replaces velocity tracking with a discrete command interface specifically designed for fundamental loco-manipulation movements (advancing, turning, squatting). This ensures more faithful execution, predictable on/off transitions, and better stability under dynamic disturbances (like carrying loads or arm movements).
- Against General Latent Action Learning (e.g., Genie, LAPA, IGOR, UniVLA):
  - Differentiation: While previous works demonstrated the effectiveness of latent action learning for tabletop manipulation or general tasks, WholeBodyVLA applies it specifically to the much more complex domain of humanoid whole-body loco-manipulation.
  - Innovation: It introduces a crucial distinction by training separate latent action models (LAMs) for manipulation and locomotion. This addresses the inherent modality differences (static camera in manipulation vs. moving camera in locomotion) and conflicting attention objectives that would degrade a single shared LAM, leading to more robust latent representations for humanoid loco-manipulation.

In essence, WholeBodyVLA's core innovation lies in its holistic approach: addressing both the data acquisition challenge (via unified latent learning from action-free videos) and the reliable execution challenge (via the LMO RL policy) within a single, unified VLA framework for large-space humanoid loco-manipulation.
4. Methodology
The WholeBodyVLA framework integrates two main components: unified latent learning for high-level Vision-Language-Action (VLA) policy training and a loco-manipulation-oriented (LMO) RL policy for low-level, stable locomotion execution. This section details how these components are designed and interact to enable large-space humanoid loco-manipulation.
4.1. Principles
The core idea behind WholeBodyVLA is to address the dual challenges of data scarcity and execution misalignment in humanoid loco-manipulation.
- Overcoming Data Scarcity with Latent Learning: Humanoid teleoperation data is expensive. The paper proposes to leverage abundant, low-cost action-free egocentric videos (human demonstrations) to learn abstract loco-manipulation priors. This is achieved through Latent Action Models (LAMs), which convert visual changes in videos into discrete latent actions. By supervising a VLA policy with these latent actions, the robot can learn complex behaviors without explicit robot-specific action labels from day one.
- Ensuring Reliable Execution with LMO RL Policy: High-level VLA decisions need to be faithfully executed by the robot's lower body. Traditional Reinforcement Learning (RL) controllers, often optimized for continuous velocity tracking, struggle with the precision and stability needed for the specific start-stop and directional control required by manipulation-aware locomotion. The LMO RL policy introduces a simplified discrete command interface and a targeted training curriculum to achieve highly precise and stable execution of fundamental loco-manipulation movements, bridging the gap between high-level intent and low-level physical realization.
4.2. Core Methodology In-depth (Layer by Layer)
The WholeBodyVLA pipeline consists of three main stages: LAM pretraining, VLA training, and real-world deployment with the LMO RL policy.
The overall pipeline of WholeBodyVLA is illustrated in Figure 2 (from the original paper).
Figure 2: Pipeline of WholeBodyVLA. LAM is pretrained on manipulation and manipulation-aware locomotion videos, yielding unified latent supervision for the VLM. Meanwhile, the LMO RL policy is trained for precise and stable locomotion under disturbances. At runtime, egocentric images and language instructions are encoded by the VLM into latent action tokens, which are decoded into (i) dual-arm joint actions and (ii) locomotion commands executed by the LMO policy at 50 Hz, enabling robust whole-body loco-manipulation.
4.2.1. Unified Latent Action Model (LAM)
The central idea here is to learn manipulation and locomotion primitives from egocentric videos using Latent Action Models (LAMs), which then provide latent supervision for VLA training. The paper observes that training a single LAM on mixed data (manipulation and locomotion) yields suboptimal performance due to fundamental differences in the visual modalities:
- Manipulation videos: Camera pose is almost static; image variations are dominated by arm motion, biasing the model to arm regions.
- Locomotion videos: Camera pose changes continuously; image variations arise from environment motion, forcing attention on the entire scene.
- Ambiguity: In locomotion videos, arm movement in the Field of View (FOV) might be misinterpreted as arm motion rather than camera motion (locomotion), leading to ambiguous latent encodings.

To counter this, WholeBodyVLA trains two separate LAMs: a manipulation LAM on manipulation data and a locomotion LAM on locomotion data. Both LAMs are then used jointly to supervise the VLA training.
4.2.1.1. LAM Training Details
Following prior work like Genie and UniVLA, a VQ-VAE (Vector Quantized Variational AutoEncoder) architecture is adopted for the LAMs. The encoder is built on DINOv2 features, a self-supervised vision transformer, to extract robust visual features.
Given a pair of frames $(o_t, o_{t+k})$ separated by $k$ steps:
- Continuous Latent Vector Generation: The LAM encoder first processes these frames to emit a continuous latent vector $z_t$: $ z_t = \mathcal{E}_i(o_t, o_{t+k}) $ Here, $o_t$ is the visual observation (image frame) at time $t$, and $o_{t+k}$ is the visual observation at the future time $t+k$. $\mathcal{E}_i$ is the LAM encoder, where the subscript $i$ indicates whether it belongs to the manipulation LAM or the locomotion LAM. The encoder learns to capture the visual changes between the two frames in a continuous vector representation.
- Quantization to Discrete Latent Action: The continuous latent vector $z_t$ is then quantized to the nearest entry in a learned codebook $\mathcal{C}_i$: $ c_t = \mathrm{quantize}(z_t) := \arg \min_{c \in \mathcal{C}_i} \| z_t - c \|_2, \quad c_t \in \mathcal{C}_i $ Here, $c_t$ is the discrete latent action token, chosen from the codebook $\mathcal{C}_i$ (a set of learned discrete vectors). This quantization step transforms continuous visual changes into a compact, discrete representation. The $\arg\min$ operator selects the codebook entry closest to $z_t$ in $\ell_2$ norm.
- Frame Reconstruction and Loss: To train both the encoder and the codebook, a LAM decoder $\mathcal{D}_i$ is used. The decoder receives the former frame $o_t$ and the quantized latent action $c_t$, and is trained to reconstruct the latter frame $o_{t+k}$: $ \hat{o}_{t+k} = \mathcal{D}_i(o_t, c_t) $ The reconstruction is optimized by minimizing a mean-squared error (MSE) loss: $ \mathcal{L}_{\mathrm{mse}} = \| o_{t+k} - \hat{o}_{t+k} \|_2^2 $ This loss measures how well the decoder can reconstruct the future frame given the current frame and the discrete latent action, forcing the latent action to effectively encode the temporal visual change.
- VQ-VAE Objective: The overall LAM training objective combines the MSE reconstruction loss with standard VQ-VAE terms to jointly train the encoder, decoder, and codebook: $ \mathcal{L}_{\mathrm{LAM}} = \mathcal{L}_{\mathrm{mse}} + \| \mathrm{sg}[c_t] - z_t \|_2^2 + \beta \| c_t - \mathrm{sg}[z_t] \|_2^2 $
  - $\mathcal{L}_{\mathrm{mse}}$: The reconstruction loss, as explained above.
  - $\| \mathrm{sg}[c_t] - z_t \|_2^2$: $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, meaning gradients do not flow through its argument. This term pulls the continuous encoder output $z_t$ towards the selected codebook entry $c_t$.
  - $\beta \| c_t - \mathrm{sg}[z_t] \|_2^2$: The commitment term, which updates the codebook entries toward the encoder's outputs; the hyperparameter $\beta$ is the commitment cost.

This loss function ensures that the LAM learns to encode temporal visual dynamics into a meaningful and reconstructible discrete latent space.
4.2.1.2. VLA Training
After the LAMs are pretrained, the VLA policy is trained. Its goal is to jointly predict both manipulation and locomotion latent actions given visual observations and task language instructions . This is formulated as maximum likelihood estimation (MLE) using a Cross-Entropy loss:
$
\min_{\theta} \; \mathbb{E} \big[ - \log \pi_{\theta}(c_t^{\mathrm{mani}}, c_t^{\mathrm{loco}} \mid o_t, \ell) \big]
$
Here, $\pi_{\theta}$ is the VLA policy parameterized by $\theta$. $c_t^{\mathrm{mani}}$ and $c_t^{\mathrm{loco}}$ are the ground-truth discrete latent actions (pseudo-action labels) obtained from the pretrained manipulation and locomotion LAMs, respectively. $o_t$ is the current visual observation, and $\ell$ is the language instruction. This joint prediction forces the VLA model to learn how locomotion and manipulation interact in a single, cohesive action space, thereby enabling task execution.
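As a sketch of how this objective could be realized in code, assuming (hypothetically) that the VLA exposes one classification head per codebook and that the joint likelihood factorizes over the two tokens given the observation and instruction, the loss reduces to two cross-entropy terms:

```python
import torch.nn.functional as F

def vla_latent_loss(vla, o_t, instruction, mani_tokens, loco_tokens):
    """Joint MLE over manipulation and locomotion latent-action tokens.

    mani_tokens / loco_tokens: discrete indices produced by the pretrained
    manipulation / locomotion LAMs (the pseudo-action labels).
    """
    # Hypothetical interface: the VLA returns one logit tensor per codebook.
    mani_logits, loco_logits = vla(o_t, instruction)   # (B, K_mani), (B, K_loco)

    # -log pi_theta(c^mani, c^loco | o_t, l), factorized into two CE terms.
    return F.cross_entropy(mani_logits, mani_tokens) + \
           F.cross_entropy(loco_logits, loco_tokens)
```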
4.2.1.3. Execution on Humanoids
For real-world deployment on humanoids, a lightweight decoder is introduced. This decoder grounds the predicted latent actions into robot-specific control commands:
$
a_t = f (\hat{c}_t^{\mathrm{mani}}, \hat{c}_t^{\mathrm{loco}}, s_t)
$
Here, $a_t$ represents the robot's executable actions at time $t$. $\hat{c}_t^{\mathrm{mani}}$ and $\hat{c}_t^{\mathrm{loco}}$ are the predicted manipulation and locomotion latent actions from the VLA, and $s_t$ is the current robot state (proprioceptive information). The decoder $f$ produces two types of outputs:
- Upper-body joint angles: For manipulation tasks.
- Locomotion command: For the low-level RL controller (specifically, the LMO RL policy), which translates this command into lower-body torques.

This division of labor allows the VLA to make high-level unified latent decisions, the decoder to translate them into embodiment-specific control signals, and the LMO RL policy to ensure stable and precise low-level execution for robust whole-body loco-manipulation.
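The snippet below illustrates one possible form of such a lightweight decoder. The architecture, feature dimensions, and output layout are assumptions for illustration; the paper only specifies that the decoder maps the two latent tokens plus the robot state to upper-body joint angles and a locomotion command.

```python
import torch
import torch.nn as nn

class LatentActionDecoder(nn.Module):
    """Grounds predicted latent actions into robot commands (sizes are illustrative)."""

    def __init__(self, latent_dim=128, state_dim=32, arm_dof=14):
        super().__init__()
        self.arm_dof = arm_dof
        self.mlp = nn.Sequential(
            nn.Linear(2 * latent_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, arm_dof + 4),  # arm joint targets + [s_x, s_y, s_psi, h*]
        )

    def forward(self, c_mani, c_loco, s_t):
        out = self.mlp(torch.cat([c_mani, c_loco, s_t], dim=-1))
        arm_joints = out[..., : self.arm_dof]   # upper-body joint angles
        loco_cmd = out[..., self.arm_dof :]     # locomotion command passed to the LMO policy
        return arm_joints, loco_cmd
```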
4.2.1.4. Manipulation-aware Locomotion Data Collection
To scale the benefits of unified latent learning, the paper devises an efficient, low-cost pipeline for collecting human egocentric manipulation-aware locomotion videos. This pipeline is characterized by:
- Low Cost and Efficiency: Only a single operator wearing a head-mounted camera (Intel RealSense D435i RGB-D or GoPro for a wider FOV) is required, avoiding expensive MoCap or teleoperation setups.
- Coverage of Humanoid Primitives: Operators are instructed to perform a diverse range of motions crucial for humanoids, including advancing, turning, and squatting.
- Goal-Directed Execution: Operators perform locomotion specifically to contact potential manipulation goals. This ensures that the collected locomotion data is directly aligned with loco-manipulation learning and is inherently manipulation-aware.

An overview of this data collection process is depicted in Figure 4 (from the original paper).
Figure 4: Low-cost egocentric data collection on locomotion task and pretraining pipeline for LAM. A single operator records egocentric video while executing eight canonical motion primitives toward potential manipulation goals. This task-oriented pipeline captures locomotion patterns that are diverse, structured, and directly relevant to humanoid locomanipulation. Then the collected manipulation-aware locomotion data along with the open-source in-place manipulation dataset are used to pretrain the latent action model in a VQ-VAE-style pipeline.
4.2.2. Loco-Manipulation-Oriented (LMO) RL Policy
The LMO RL policy addresses the decision-execution misalignment by replacing continuous random velocity-tracking objectives (common in existing RL controllers) with a discrete command interface tailored for loco-manipulation. This design explicitly optimizes for stability and precision under dynamic manipulation disturbances.
4.2.2.1. Observation Space
The LMO policy relies solely on proprioceptive egocentric states with a short history stack for its observations:
$
O_t = \Big[ u_t, \omega_t, \mathbf{g}_t, \mathbf{q}_t, \dot{\mathbf{q}}_t, \mathbf{a}_{t-1} \Big]
$
- $u_t$: The discrete high-level command from the VLA. This is the target intention for the low-level controller.
- $\omega_t$: The base angular velocity of the robot.
- $\mathbf{g}_t$: The gravity vector in the robot's base frame, indicating its orientation relative to gravity.
- $\mathbf{q}_t, \dot{\mathbf{q}}_t$: The joint positions and velocities of the robot's lower body (legs).
- $\mathbf{a}_{t-1}$: The previous action executed by the RL policy.

This compact, purely proprioceptive design avoids reliance on privileged environment information from the simulator and is sufficient for closed-loop stability.
4.2.2.2. Discrete Command Interface
The lower-body control is formulated as goal-conditioned regulation. The policy executes discrete high-level commands while maintaining balance. At each time step , the high-level planner (or VLA decoder) issues:
$
u_t = [ s_x, s_y, s_{\psi}, h^{\star} ] \in \{ -1, 0, 1 \}^{3} \times \mathbb{R}
$
- $s_x$: A discrete indicator for forward/backward locomotion (1 for forward, -1 for backward, 0 for stop).
- $s_y$: A discrete indicator for lateral locomotion (1 for right, -1 for left, 0 for stop).
- $s_{\psi}$: A discrete indicator for turning (1 for clockwise, -1 for counter-clockwise, 0 for stop).
- $h^{\star}$: The desired stance height of the robot (e.g., for squatting or standing tall).

This discrete interface explicitly enforces start-stop semantics and reduces trajectory variance, improving the training process for both the RL controller and the high-level planner, unlike continuous velocity-based formulations.
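For illustration, the command interface can be represented as a simple data structure like the following; the field names and the example stance height are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LocomotionCommand:
    """High-level command u_t = [s_x, s_y, s_psi, h*] issued to the LMO policy."""
    s_x: int       # forward/backward intent: +1 forward, -1 backward, 0 stop
    s_y: int       # lateral intent: +1 / -1 / 0
    s_psi: int     # turning intent: +1 / -1 / 0
    h_star: float  # desired stance height (continuous), e.g. lowered for squatting

# Example: advance while squatting to an (illustrative) 0.55 m stance height.
cmd = LocomotionCommand(s_x=1, s_y=0, s_psi=0, h_star=0.55)
```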
4.2.2.3. Reference Shaping
Since the intents $s_k$ are ternary ($\{-1, 0, 1\}$), they are converted into smooth velocity references to prevent abrupt accelerations and ensure predictable transitions: $ v_k^{\mathrm{ref}}(t) = v_k^{\mathrm{goal}} \, \sigma \big( \alpha (s_k - \bar{s}_k (t)) \big), \quad \bar{s}_k (t) \gets (1 - \lambda) \bar{s}_k (t - 1) + \lambda s_k $
- $v_k^{\mathrm{ref}}(t)$: The smoothed velocity reference for axis $k$ (forward/backward, lateral, and yaw) at time $t$.
- $v_k^{\mathrm{goal}}$: The goal speed magnitude for axis $k$. The sign of the reference is determined by the intent $s_k$.
- $\sigma$: A saturating nonlinearity that smooths the transition. The main text mentions tanh specifically, while the appendix uses a generic saturating nonlinearity $\sigma$; the generic $\sigma$ is used here for consistency with the appendix formula.
- $\alpha$: A scaling factor controlling the steepness of the saturating function.
- $\bar{s}_k(t)$: An exponentially smoothed flag for intent $s_k$, i.e., an exponential moving average of the current and past intents, making transitions gradual.
- $\lambda$: A smoothing factor for the exponential moving average.

This ensures predictable on/off transitions and reduces oscillations, crucial for stable loco-manipulation.
4.2.2.4. Two-Stage Curriculum
The LMO policy is trained using a two-stage curriculum to first acquire basic locomotion and then specialize it for precise and stable loco-manipulation.
Stage I (Basic Gait Acquisition):
- Goal: Learn a minimal gait that prevents falls and responds to discrete commands.
- Locomotion: For each axis $k$, if $s_k \neq 0$, a goal speed magnitude $v_k^{\mathrm{goal}}$ is sampled from a uniform distribution; if $s_k = 0$, the goal speed is zero.
- Upper Body: The arms track pose targets that are resampled at a fixed interval and interpolated for smooth motion.
- Disturbances: Joint limits are gradually relaxed by a curriculum factor, exposing the legs to progressively stronger, but task-agnostic, disturbances.
- Outcome: This stage enables the policy to develop a basic gait that prevents falling, establishing a stable foundation.
Stage II (Precision and Stability):
- Goal: Refine the gait for loco-manipulation-level precision and stability.
- Locomotion:
  - Fixed Cruising Speed: Per-axis cruising speeds are fixed to constants when an intent is given ($s_k \neq 0$). This prevents the fragmented gaits that arise from training across varying reference speeds.
  - Directional Accuracy Reward: Precision is enforced by penalizing yaw drift at episode termination, which is critical for tasks requiring precise orientation: $ \mathcal{I}_{\mathrm{dir}} = | \mathrm{wrap} ( \psi_{\mathrm{end}} - \psi_{\mathrm{start}} ) | $ Here, $\mathcal{I}_{\mathrm{dir}}$ is the yaw drift penalty, $\psi_{\mathrm{end}}$ is the yaw orientation at the end of the locomotion phase, and $\psi_{\mathrm{start}}$ is the yaw orientation at the start. $\mathrm{wrap}(\cdot)$ normalizes the angle difference to a range such as $(-\pi, \pi]$. An episode begins when any axis flag flips from 0 to $\pm 1$ and ends when it returns to 0 and the base stabilizes. Minimizing the expected value of $\mathcal{I}_{\mathrm{dir}}$ enforces precise initiation, steady cruising, and consistent braking.
- Manipulation Side (Structured Perturbations):
  - Realistic perturbations are injected by sampling short arm-motion segments from AgiBot World (a real-robot manipulation dataset). These segments are interpolated into continuous signals and replayed at varied rates with light noise.
  - The arm motion replay is defined by: $ \omega_{i+1} = \min(L, \omega_i + (\gamma + \delta_i) \Delta t), \quad \omega_0 = 0 $ where $\omega_i$ is the current time index into the arm trajectory, $L$ is the length of the segment, $\gamma$ is the base replay speed, $\delta_i$ is a random noise term, and $\Delta t$ is the simulation time step. Per-step arm targets are obtained by evaluating the interpolated segment at index $\omega_i$.
  - This forces the legs to compensate for structured inertial couplings (realistic arm movements influencing the robot's balance) rather than just unstructured disturbances.
- Stand-Still Penalty: For stationary episodes (all $s_k = 0$), a penalty is added to discourage unnecessary leg actions: $ \mathcal{T}_{\mathrm{stand}} = \| a_i^{\mathrm{leg}} \|_2^2 $ Here, $a_i^{\mathrm{leg}}$ represents the leg actions (e.g., torques or joint positions), and the penalty minimizes their magnitude to ensure stable standing.
Together, these designs in Stage II yield stable, repeatable gaits and reliable whole-body coordination, crucial for avoiding the fragmented motion patterns often induced by velocity-tracking objectives.
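A minimal sketch of the two Stage II penalty terms discussed above is given below; the angle-wrapping helper and any reward scaling are assumptions.

```python
import numpy as np

def wrap_angle(angle):
    """Wrap an angle difference into [-pi, pi)."""
    return (angle + np.pi) % (2.0 * np.pi) - np.pi

def yaw_drift_penalty(psi_start, psi_end):
    """I_dir = |wrap(psi_end - psi_start)|, evaluated at episode termination."""
    return abs(wrap_angle(psi_end - psi_start))

def stand_still_penalty(leg_actions):
    """T_stand = ||a_leg||_2^2, applied only when all intent flags are zero."""
    return float(np.sum(np.square(leg_actions)))
```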
4.2.2.5. Reward Functions and Domain Randomization
The LMO policy's training involves a detailed reward structure (Table 4 in Appendix B.1) and extensive domain randomization (Table 5 in Appendix B.1).
Reward Functions: Reward terms are categorized into:
- Intent Execution: Rewards for precisely tracking forward, lateral, and yaw intentions and maintaining the desired height, along with penalties for vertical and angular velocity deviations.
- Posture & Joints: Penalties for deviations from default hip/ankle poses, knee deviation during squatting, DoF acceleration/velocity/position limit violations, and torque limit violations. Rewards for roll/pitch stabilization.
- Locomotion Structure: Rewards for feet air time; penalties for foot clearance, foot lateral spacing (maintaining appropriate stance width), feet ground parallelism, feet parallelism, no-fly (ensuring at least one foot is on the ground), foot slip, and foot stumble.
- Energy & Smoothness: Penalties for action rate (how much actions change), smoothness (second-order action changes), joint power, torque usage, feet contact force, contact momentum, and an action vanish penalty.
- Stability (New in Stage II): Joint tracking error and the previously mentioned stand-still penalty are introduced or emphasized in Stage II to improve loco-manipulation robustness.

Domain Randomization: Parameters for dynamics (joint torque, actuation offset, link mass, COM/body displacement), contact properties (friction, restitution, payload), controller gains (Kp, Kd), and latency/sensor noise (action, DoF state, IMU lag) are randomized during training. Stage II increases the strength and frequency of perturbations, including push disturbances and manipulation-induced payload variations, to further enhance robustness. This helps the policy generalize to real-world variations.
The LMO policy runs at 50 Hz on an onboard computer, while the VLA backbone operates at ~10 Hz for perception and reasoning, with communication handled via ZeroMQ over Ethernet for low-latency command streaming.
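The paper does not detail the messaging code; the following is a hypothetical illustration of such a ZeroMQ PUB/SUB link between the VLA workstation (~10 Hz) and the onboard LMO loop (50 Hz). The addresses, message schema, and default stance height are assumptions.

```python
import time
import zmq

ctx = zmq.Context()

# --- VLA workstation side (~10 Hz): publish the latest locomotion command ----
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")                     # port is illustrative

def publish_command(s_x, s_y, s_psi, h_star):
    pub.send_json({"s_x": s_x, "s_y": s_y, "s_psi": s_psi, "h": h_star})

# --- Onboard side (50 Hz control loop): consume the most recent command ------
# In practice the two halves run as separate processes on different machines.
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.CONFLATE, 1)              # keep only the newest message
sub.setsockopt_string(zmq.SUBSCRIBE, "")
sub.connect("tcp://vla-workstation:5556")    # hostname is illustrative

def control_loop(lmo_policy, get_proprio, send_to_motors):
    cmd = {"s_x": 0, "s_y": 0, "s_psi": 0, "h": 0.7}   # default stance (assumed)
    while True:
        try:
            cmd = sub.recv_json(flags=zmq.NOBLOCK)     # non-blocking; reuse last command
        except zmq.Again:
            pass
        send_to_motors(lmo_policy(cmd, get_proprio())) # 50 Hz low-level action
        time.sleep(0.02)
```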
5. Experimental Setup
5.1. Datasets
The experiments utilized a combination of real-robot and human video datasets:
- AgiBot X2 Teleoperation Data:
  - Source: Collected via physical robot teleoperation on the prototype AgiBot X2 humanoid.
  - Characteristics: Operators used a Meta Quest Pro headset for egocentric VR teleoperation of the upper body and a joystick for locomotion commands. This provided action-labeled trajectories for fine-tuning the VLA policy.
  - Scale: Each of the three primary tasks (Bag Packing, Box Loading, Cart Pushing) was executed 50 times, generating diverse trajectories.
  - Domain: Real-world humanoid loco-manipulation tasks.
  - Purpose: Used for fine-tuning all VLA-based methods in Stage III, grounding the latent actions into robot-executable commands.
- Human Egocentric Manipulation-aware Locomotion Videos:
  - Source: Collected by the authors using a low-cost, efficient pipeline. A single operator wore a head-mounted camera (Intel RealSense D435i RGB-D or GoPro).
  - Characteristics: These are action-free videos (only visual observations) covering locomotion primitives (advancing, turning, squatting) performed goal-directed towards potential manipulation goals.
  - Scale: Approximately 300 hours of video data covering various scenes.
  - Domain: Human demonstrations of manipulation-aware locomotion.
  - Purpose: Used for pretraining the locomotion LAM in Stage I of WholeBodyVLA's unified latent learning.
- AgiBot World Dataset (Bu et al., 2025a):
  - Source: One of the largest real-robot manipulation datasets.
  - Characteristics: Focuses on in-place bimanual manipulation.
  - Domain: Real-robot manipulation data.
  - Purpose: Used for pretraining the manipulation LAM in Stage I of WholeBodyVLA's unified latent learning. Also used to sample short arm-motion segments for structured perturbations during LMO RL policy training (Stage II).
Why these datasets were chosen:
- Human videos: Address data scarcity for loco-manipulation knowledge by providing abundant, diverse priors at low cost.
- AgiBot World: Provides a rich source of real-robot manipulation data to learn precise arm movements.
- Teleoperation data: Essential for fine-tuning the VLA to the specific AgiBot X2 embodiment and grounding latent actions into robot-executable commands.

The combination allows learning from diverse, cheap action-free data and then fine-tuning on limited, expensive robot-specific data.
5.2. Evaluation Metrics
The paper employs several metrics to evaluate WholeBodyVLA's performance, focusing on success rates for complex tasks and precision/stability for low-level control.
-
Success Rate:
- Conceptual Definition: Measures the percentage of trials where the robot successfully completes a given task or subgoal. This is a primary indicator of overall task achievement. For multi-subgoal tasks, if an earlier subgoal fails, subsequent ones are automatically considered failures.
- Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
- Symbol Explanation:
  - Number of Successful Trials: The count of attempts where the robot completed the task/subgoal according to predefined criteria.
  - Total Number of Trials: The total number of attempts made for the task/subgoal (e.g., 25 trials per subgoal in the real-robot experiments).
- Locomotion Accuracy (Position / Orientation Error):
  - Conceptual Definition: Quantifies how precisely the robot tracks desired locomotion trajectories in terms of position and orientation. This is crucial for manipulation-aware locomotion, where accurate positioning is paramount. It is evaluated during a settling phase after a command, capturing stop-and-settle precision.
  - Mathematical Formula (for position error): $ \text{Position Error} = \sqrt{(x_{\text{end}} - x_{\text{ref}})^2 + (y_{\text{end}} - y_{\text{ref}})^2 + (z_{\text{end}} - z_{\text{ref}})^2} $ (This is a Euclidean distance. The paper reports mean ± standard deviation for 2D position error, so it likely refers to the horizontal plane for locomotion.)
  - Mathematical Formula (for yaw orientation error, as per $\mathcal{I}_{\mathrm{dir}}$ in Stage II): $ \text{Orientation Error (Yaw)} = | \mathrm{wrap}(\psi_{\text{end}} - \psi_{\text{ref}}) | $
  - Symbol Explanation:
    - $(x_{\text{end}}, y_{\text{end}}, z_{\text{end}})$: The final position of the robot's base.
    - $(x_{\text{ref}}, y_{\text{ref}}, z_{\text{ref}})$: The reference (target) position.
    - $\psi_{\text{end}}$: The final yaw orientation of the robot's base.
    - $\psi_{\text{ref}}$: The reference (target) yaw orientation.
    - $\mathrm{wrap}(\cdot)$: A function to normalize the angle difference to a range like $(-\pi, \pi]$.
- Manipulation Stability (Center-of-Mass Sway, CoMS):
  - Conceptual Definition: Measures the robot's balance during manipulation tasks, especially when standing or squatting, by quantifying the deviation of its Center of Mass (CoM) in the horizontal plane. Lower sway indicates better stability.
  - Mathematical Formula: $ \mathrm{CoMS} = \sqrt{ \frac{1}{T} \int_0^T \| \mathbf{c}(t) - \bar{\mathbf{c}} \|^2 \, dt } $
  - Symbol Explanation:
    - CoMS: Center-of-Mass Sway.
    - $T$: The total duration of the evaluation period.
    - $\mathbf{c}(t)$: The horizontal projection of the robot's CoM trajectory at time $t$.
    - $\bar{\mathbf{c}}$: The temporal mean of the horizontal CoM trajectory over the period $T$.
    - $\| \cdot \|^2$: The squared Euclidean norm (distance).
- Relative Reconstruction Gain (RRG):
  - Conceptual Definition: An internal metric used to assess the quality of the Latent Action Model (LAM) by comparing its ability to reconstruct future frames against a simple temporal baseline (copying the current frame). A higher RRG indicates more predictive latent codes.
  - Mathematical Formula: $ \mathrm{RRG} = \frac{ \mathrm{MSE}_{\mathrm{base}} - \mathrm{MSE}_{\mathrm{recon}} }{ \mathrm{MSE}_{\mathrm{base}} } $
  - Symbol Explanation:
    - RRG: Relative Reconstruction Gain.
    - $\mathrm{MSE}_{\mathrm{recon}}$: Mean Squared Error of the reconstructed future frame against the ground truth (i.e., $\hat{o}_{t+k}$ vs. $o_{t+k}$).
    - $\mathrm{MSE}_{\mathrm{base}}$: Mean Squared Error of a baseline that estimates the future frame by directly copying the current frame (i.e., $o_t$ vs. $o_{t+k}$).
5.3. Baselines
The paper compares WholeBodyVLA against a comprehensive set of baselines, including modular designs, other VLA frameworks, and ablated variants of its own design, to thoroughly evaluate each component's contribution.
- Modular Design:
  - Description: This baseline emulates a traditional modular pipeline. A human teleoperator controls only locomotion via a joystick and an FPV headset, relying on task instructions and scene context. Once navigation is complete, control is handed off to WholeBodyVLA for manipulation. After manipulation, control returns to the operator.
  - Representativeness: Represents a strong, almost oracle-like, modular approach where the navigation (locomotion) part benefits from human intelligence, providing an upper bound for such pipelines.
- GR00T w/LMO (Bjorck et al., 2025):
  - Description: GR00T N1.5 is a recent VLA system for whole-body control. For a fair comparison, its output is adapted: instead of directly predicting lower-body joints, GR00T predicts locomotion commands, which are then executed by WholeBodyVLA's LMO controller.
  - Representativeness: A state-of-the-art VLA framework adapted to isolate its high-level reasoning from low-level locomotion stability, allowing comparison of the VLA component.
- OpenVLA-OFT w/LMO (Kim et al., 2025):
  - Description: Another VLA framework, sharing the same Prismatic-7B initialization as WholeBodyVLA. It is trained to predict upper-body joint actions and locomotion commands, which are then executed by the LMO controller.
  - Representativeness: Provides a baseline with a similar architectural starting point, allowing a more direct comparison of WholeBodyVLA's unified latent learning and overall VLA training.

Ablated Variants of WholeBodyVLA: These variants isolate the contributions of WholeBodyVLA's key components.
- WholeBodyVLA w/o RL:
  - Description: In this variant, the VLA directly predicts lower-body joints without the LMO policy.
  - Purpose: To assess the necessity and impact of the LMO RL policy for low-level locomotion execution.
- WholeBodyVLA w/ Velocity-Based RL:
  - Description: This replaces the LMO policy with a conventional velocity-tracking RL controller (reproduced and refined from HOMIE (Ben et al., 2025)).
  - Purpose: To compare the LMO policy's discrete command interface and specialized curriculum against a standard velocity-based approach.
- WholeBodyVLA w/o LAM:
  - Description: The VLA is directly fine-tuned from Prismatic-7B without unified latent learning (i.e., no LAM pretraining).
  - Purpose: To evaluate the contribution of learning from action-free videos via unified latent learning.
- WholeBodyVLA w/ Manipulation LAM:
  - Description: This variant performs latent learning only on in-place manipulation data from AgiBot World, without locomotion-aware pretraining from human videos.
  - Purpose: To isolate the impact of locomotion-aware latent learning specifically.
- WholeBodyVLA w/ Shared LAM:
  - Description: Instead of separate manipulation and locomotion LAMs, a single shared LAM is trained on the mixed data without modality separation.
  - Purpose: To validate the design choice of using separate LAMs due to visual modality differences.
5.4. Hardware, Tasks, and Data Collection
- Hardware:
  - Robot Platform: Prototype AgiBot X2 humanoid.
    - Arms: 7 Degrees of Freedom (DoF) each, equipped with Omnipicker grippers.
    - Waist: 1 DoF.
    - Legs: 6 DoF each.
    - Perception: An Intel RealSense D435i RGB-D camera mounted on the head for egocentric vision.
  - Teleoperation for Data Collection:
    - Upper Body: Meta Quest Pro headset for egocentric VR teleoperation.
    - Locomotion Commands: Joystick.
  - Deployment: The VLA runs on an RTX 4090 GPU workstation; the RL policy is deployed on a NanoPi onboard computer. Communication is via ZeroMQ over Ethernet.

Figure 5 (from the original paper) illustrates the hardware setup for data collection.
Figure 5: Description of the hardware. VR and a joystick are used for data collection.
- Tasks: Three main task suites, each with multiple subgoals, designed to comprehensively test loco-manipulation capabilities:
  - Bag Packing:
    - Primitives: Bimanual grasping, lateral stepping, squatting, precise placement.
    - Scenario: Grasp a paper bag, sidestep to a carton, squat, and place the bag inside.
  - Box Loading:
    - Primitives: Squatting, bimanual grasping, turning, object placement, maintaining balance.
    - Scenario: Squat to grasp a box, stand up, turn to face a cart, and place the box onto the cart.
  - Cart Pushing:
    - Primitives: Grasping a handle, sustained forward locomotion, reliable heading control, stability under heavy loads.
    - Scenario: Grasp the handle of a 50 kg cart with both arms and push it several meters forward stably without lateral drift.

These tasks collectively evaluate dual-arm coordination, squat precision, turning accuracy, and stability under heavy loads.
5.5. Training Protocol
The training follows a standard VLA recipe of large-scale pretraining followed by real-robot fine-tuning.
- Pretraining for WholeBodyVLA (Two Steps):
  - Stage I (LAM Pretraining): Separate manipulation and locomotion LAMs are pretrained.
    - Locomotion LAM: Trained on the authors' collected low-cost egocentric locomotion videos (approx. 300 hours).
    - Manipulation LAM: Trained on the AgiBot World dataset (real-robot bimanual manipulation data).
    - Shared LAM (for ablation): Trained on mixed locomotion and manipulation data with balanced sampling.
    - Schedule: 30,000 steps, batch size 256.
  - Stage II (VLA Pretraining): The VLA (initialized from Prismatic-7B) is trained to predict both latent actions (from the pretrained LAMs) on the same corpus of human videos and AgiBot World data, using the LAM codes as pseudo-action labels.
    - Schedule: 20,000 steps, batch size 1024.
- Pretraining for VLA Baselines (GR00T, OpenVLA-OFT):
  - Their publicly released pretrained models are used.
- Finetuning Stage (Stage III):
  - All methods (WholeBodyVLA and baselines) are fine-tuned on the same AgiBot X2 teleoperation trajectories collected for all three tasks.
  - For the main experiments (Section 4.2), one model is fine-tuned on all three tasks jointly, rather than task-specific fine-tuning.
  - Schedule: 10,000 steps, batch size 64. Fine-tuning uses LoRA (Low-Rank Adaptation of Large Language Models) (Hu et al., 2022).
- LMO Controller Training:
  - The LMO controller (and its velocity-based baseline) is trained separately in simulation and remains fixed during teleoperation data collection and final deployment. This means its performance is independent of the VLA fine-tuning.
- Computational Resources:
  - LAM and VLA training: NVIDIA H100 GPUs.
  - RL policy training: Single NVIDIA H100 GPU.
5.6. Real-Robot Experiment Protocols
To ensure fair and reproducible comparisons in real-robot experiments:
- Adjudication: Two independent judges, blinded to the active policy and naive to the data collection process, adjudicated success/failure for each subgoal. Consensus labels were reached. Method order was randomized to mitigate subjective bias.
- Sequential Evaluation: For tasks with multiple subgoals, if the first subgoal failed, the subsequent subgoals were automatically counted as failures.
- Trials: Each task was evaluated over 25 trials. The mean score across all subgoals is reported.
6. Results & Analysis
This section delves into the experimental results, validating WholeBodyVLA's effectiveness, the contribution of its components, and its generalization capabilities.
6.1. Core Results Analysis
The paper evaluates WholeBodyVLA against various baselines across three complex loco-manipulation task suites: Bag Packing, Box Loading, and Cart Pushing. Each task is broken down into two subgoals to assess performance granularly.
The following are the results from Table 2 of the original paper:
| Method | Bag Packing: Grasp Bags | Bag Packing: Move & Squat | Box Loading: Squat & Grasp | Box Loading: Rise & Turn | Cart Pushing: Grab Handle | Cart Pushing: Push Ahead | Avg. Score |
|---|---|---|---|---|---|---|---|
| Modular Design | 22/25 | 12/25 | 9/25 | 9/25 | 22/25 | 22/25 | 64.0% |
| GR00T w/LMO | 20/25 | 10/25 | 6/25 | 4/25 | 12/25 | 11/25 | 42.0% |
| OpenVLA-OFT w/LMO | 19/25 | 6/25 | 12/25 | 12/25 | 22/25 | 14/25 | 56.7% |
| WholeBodyVLA (ours) | 23/25 | 13/25 | 19/25 | 17/25 | 23/25 | 22/25 | 78.0% |
| WholeBodyVLA w/ vel.-based RL | 22/25 | 1/25 | 16/25 | 3/25 | 24/25 | 15/25 | 54.0% |
| WholeBodyVLA w/o lam | 15/25 | 4/25 | 8/25 | 6/25 | 16/25 | 10/25 | 39.3% |
| WholeBodyVLA w/ manip. lam | 24/25 | 7/25 | 17/25 | 11/25 | 20/25 | 14/25 | 63.3% |
| WholeBodyVLA w/ shared lam | 18/25 | 11/25 | 16/25 | 16/25 | 20/25 | 18/25 | 66.0% |
Analysis:
- Overall Superiority: WholeBodyVLA (ours) achieves the highest average success rate of 78.0%, significantly outperforming all baselines. This represents a 14.0% improvement over the Modular Design (64.0%) and a 21.3% improvement over OpenVLA-OFT w/LMO (56.7%). This strong performance validates its ability to handle loco-manipulation tasks in practice.
- Task-Specific Strengths:
  - For Bag Packing, WholeBodyVLA leads all methods in both subgoals (23/25 Grasp Bags, 13/25 Move & Squat).
  - For Box Loading, it shows particular strength in Squat & Grasp (19/25) and Rise & Turn (17/25), indicating effective coordination of squatting, grasping, and turning under balance constraints.
  - For Cart Pushing, it achieves near-perfect Grab Handle (23/25) and excellent Push Ahead (22/25), demonstrating robustness under heavy loads and sustained locomotion.
- Baseline Performance:
  - Modular Design performs reasonably well (64.0%), especially in the first subgoals (e.g., Grasp Bags, Grab Handle), but struggles with the more complex locomotion-integrated subgoals (e.g., Move & Squat, Squat & Grasp, Rise & Turn), likely due to brittle skill boundaries and error accumulation.
  - GR00T w/LMO and OpenVLA-OFT w/LMO, despite using the LMO controller, perform worse than WholeBodyVLA, suggesting that the VLA's unified latent learning and overall architecture are critical. GR00T w/LMO struggles notably with Box Loading (6/25 Squat & Grasp, 4/25 Rise & Turn), highlighting the challenge of generalizing VLA reasoning to complex loco-manipulation.
6.2. How do Action-Free Videos Contribute to Loco-Manipulation?
To answer Q2, the paper compares WholeBodyVLA with ablated versions to demonstrate the impact of unified latent learning from action-free videos.
Analysis from Table 2:
- WholeBodyVLA w/o LAM: This variant, which skips latent pretraining from action-free videos, achieves only 39.3% average success, a substantial drop of 38.7% compared to the full WholeBodyVLA (78.0%). This stark difference strongly indicates that unified latent learning effectively extracts useful priors from action-free human videos, significantly enhancing downstream policy learning.
- WholeBodyVLA w/ manip. lam: This variant only performs latent learning for in-place manipulation (using AgiBot World) and lacks locomotion-aware pretraining. It achieves 63.3% average success, 14.7% lower than the full model. The largest performance gaps are observed in tasks requiring substantial locomotion before manipulation (e.g., Move & Squat: 7/25 vs 13/25, Rise & Turn: 11/25 vs 17/25). This confirms that locomotion-aware latent learning from human videos is crucial for robust loco-manipulation.
- WholeBodyVLA w/ shared lam: This variant trains a single LAM on mixed data without modality separation. It achieves 66.0% average success, 12.0% below the full model (78.0%). This result supports the design choice of decoupling the two LAMs (manipulation and locomotion) to handle their distinct visual characteristics, although the shared LAM still offers a significant improvement over w/o LAM.
Data Scaling under Generalization Settings (Figure 3):
The paper further investigates the impact of latent pretraining on generalization and data efficiency through data scaling curves.
This image is a chart showing WholeBodyVLA's real-world generalization. The top shows variations in robot start pose and scene appearance, together with data-scaling curves; the bottom compares different baselines on extended tasks.
Figure 3: Real-world generalization of WholeBodyVLA. Top: variations in robot start-pose and scene appearance, with data-scaling curves. Bottom: comparison on extended tasks with different baselines. See videos on https://opendrivelab.com/WholeBodyVLA.
- Locomotion Generalization (Figure 3(a) - Finetuning Data Scaling for Start-Pose Generalization): This plot shows average success rates for tasks primarily evaluating locomotion generalization (e.g., varying robot start-poses).
  - Models pretrained with more human egocentric videos (e.g., 50% or 100%) consistently achieve higher success rates across all amounts of teleoperation fine-tuning data.
  - Crucially, a model pretrained with 50% human video (and 100% AgiBot World data) can, with only 25 teleoperation trajectories for fine-tuning, match the performance of a variant pretrained on less than 25% human videos; the latter requires 200 trajectories to reach similar performance. This demonstrates that human video pretraining significantly reduces the reliance on expensive teleoperation data.
- Manipulation Generalization (Figure 3(b) - Finetuning Data Scaling for Scene Generalization): This plot focuses on manipulation generalization (e.g., varying objects, layouts, appearance).
  - Similar trends are observed: stronger latent pretraining (more AgiBot World data for the manipulation LAM) leads to higher success rates and reduces the required amount of fine-tuning data.

Conclusion on Action-Free Videos: These results unequivocally show that human egocentric videos and unified latent learning are vital. They extract valuable priors that lead to substantial improvements in VLA generalization and significantly reduce the need for costly teleoperation data, especially under a fixed fine-tuning budget.
6.3. How does LMO Contribute to Loco-Manipulation?
To address Q3, the paper evaluates the Loco-Manipulation-Oriented (LMO) RL policy's contribution to precision and stability.
Analysis from Table 2 (Real-Robot Results):
- WholeBodyVLA w/ vel.-based RL: This variant replaces the LMO policy with a conventional velocity-based RL controller. Its average success rate is 54.0%, which is 24.0% lower than the full WholeBodyVLA (78.0%).
- A significant portion of this gap (91.7%) stems from failures in the second subgoal of each task, which typically involves more complex and sustained locomotion (e.g., Move & Squat, Rise & Turn, Push Ahead). For instance, Move & Squat dropped from 13/25 to 1/25, and Rise & Turn from 17/25 to 3/25. This clearly indicates that the velocity-based controller struggles with the precision and stability required for the locomotion components of loco-manipulation.
Extended Tasks (Figure 3(c) - Bottom Panel):
- The velocity-based RL variant consistently fails substantially more often than WholeBodyVLA on extended tasks designed to stress locomotion and long-horizon execution (e.g., uneven-terrain traversal, long multi-step sequences, following extended floor markings for visual navigation). These failures are attributed to suboptimal behaviors like stumbling, path deviation, or turning while advancing, which are inherent limitations of the velocity-based controller.
Ablation Studies in MuJoCo (Table 3):
To further dissect the LMO design, controlled ablations are performed in simulation.
The following are the results from Table 3 of the original paper:
| Method | Forward & Backward (Pos. / Quat. Err.) | Left & Right (Pos. / Quat. Err.) | Turning (Pos. / Quat. Err.) | Standing (CoMS) | Squatting (CoMS) |
|---|---|---|---|---|---|
| LMO (ours) | 0.21±0.01 / 0.05±0.01 | 0.55±0.01 / 0.06±0.01 | 0.05±0.01 / 0.19±0.01 | 0.03±0.02 | 0.03±0.02 |
| LMO w/o Eq. 3 | 0.24±0.02 / 0.07±0.01 | 0.61±0.02 / 0.09±0.01 | 0.05±0.01 / 0.28±0.02 | 0.04±0.03 | 0.03±0.02 |
| LMO w/o stage 2 | 0.27±0.02 / 0.09±0.01 | 0.72±0.03 / 0.11±0.02 | 0.20±0.01 / 0.32±0.03 | 0.05±0.04 | 0.07±0.03 |
| LMO w/o stage 1 | 0.30±0.03 / 0.11±0.01 | 0.66±0.04 / 0.13±0.03 | 0.46±0.01 / 0.34±0.04 | 0.05±0.03 | 0.04±0.03 |
| Vel.-based policy | 0.24±0.04 / 0.12±0.02 | 0.60±0.05 / 0.17±0.06 | 0.26±0.01 / 0.20±0.06 | 0.06±0.04 | 0.05±0.04 |
Analysis of Ablations:
- LMO w/o Eq. 3 (removing the directional accuracy reward): Degrades turning precision significantly (the yaw orientation error for turning increases from 0.19 to 0.28), demonstrating the importance of this reward term for accurate directional control.
- LMO w/o stage 2 (disabling precision and stability refinement): Increases trajectory error across all locomotion primitives (e.g., the yaw error for turning jumps from 0.19 to 0.32) and significantly increases squatting sway (from 0.03 to 0.07), highlighting the necessity of targeted refinement for loco-manipulation tasks.
- LMO w/o stage 1 (disabling basic gait acquisition): This variant produces the largest errors overall (e.g., yaw error for turning at 0.34, position error for forward/backward at 0.30), indicating that without the initial stage to acquire stable gaits, the policy fails fundamentally.
- Vel.-based policy: Performs worse than the full LMO across most metrics, particularly in turning accuracy (0.20 vs 0.19 yaw error) and manipulation stability (CoMS for standing: 0.06 vs 0.03). This confirms that the discrete command interface and two-stage curriculum of LMO are more effective for precise trajectory tracking and stable whole-body coordination than velocity-based methods.

Conclusion on LMO: The LMO RL policy is critical for reliable long-range and multi-step loco-manipulation. Its discrete command interface, two-stage curriculum, and structured perturbations effectively mitigate the decision-execution misalignment and ensure stable, precise execution of core movements, which is paramount for manipulation-aware locomotion.
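As an illustration of the decision-execution alignment argument above, the sketch below shows what a discrete command interface between the VLA and a low-level policy could look like, assuming a small fixed vocabulary of locomotion primitives such as advancing, turning, and squatting. The exact command set, increments, and any reference-shaping details are not reproduced from the paper; this is a hypothetical sketch.

```python
# Hypothetical discrete locomotion command interface for a low-level policy.
# The command vocabulary and reference targets are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum, auto

class LocoCommand(Enum):
    STAND = auto()
    ADVANCE = auto()
    SIDESTEP_LEFT = auto()
    SIDESTEP_RIGHT = auto()
    TURN_LEFT = auto()
    TURN_RIGHT = auto()
    SQUAT = auto()

@dataclass
class LocoReference:
    """Reference quantities a tracking controller could consume."""
    forward_m: float = 0.0   # desired forward displacement (m)
    lateral_m: float = 0.0   # desired lateral displacement (m)
    yaw_rad: float = 0.0     # desired heading change (rad)
    height_m: float = 0.0    # desired base-height change (m)

def command_to_reference(cmd: LocoCommand, step: float = 0.2) -> LocoReference:
    """Map a discrete command to a small, bounded reference increment."""
    table = {
        LocoCommand.STAND: LocoReference(),
        LocoCommand.ADVANCE: LocoReference(forward_m=step),
        LocoCommand.SIDESTEP_LEFT: LocoReference(lateral_m=step),
        LocoCommand.SIDESTEP_RIGHT: LocoReference(lateral_m=-step),
        LocoCommand.TURN_LEFT: LocoReference(yaw_rad=0.3),
        LocoCommand.TURN_RIGHT: LocoReference(yaw_rad=-0.3),
        LocoCommand.SQUAT: LocoReference(height_m=-0.25),
    }
    return table[cmd]

print(command_to_reference(LocoCommand.TURN_LEFT))
```

A bounded, enumerable command set like this is what makes start-stop behavior and heading changes easy to track precisely, in contrast to a continuous velocity interface where small tracking errors accumulate over a long approach.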
6.4. Does WholeBodyVLA Generalize to Long-Horizon and Extended Loco-Manipulation Scenarios?
To address Q4, the paper evaluates WholeBodyVLA's extensibility to more challenging and diverse scenarios, as shown in the bottom row of Figure 3(c) and further detailed in Appendix C.1.
- Extended Tasks: These include traversing uneven terrain (e.g., steps, foam, gravel), executing long-horizon multi-step sequences (e.g., walking to a table, grasping two bags, placing them in a drawer, closing it), following floor markings for visual navigation, and everyday loco-manipulation activities like wiping a table and vacuum cleaning.
- Generalization Performance: As indicated in Figure 3(c), WholeBodyVLA consistently remains superior across all these challenging settings compared to the baselines. This demonstrates that the framework scales effectively beyond the benchmark tasks and preserves robust generalization capabilities. For instance, in terrain traversal and visual navigation, WholeBodyVLA significantly outperforms OpenVLA-OFT w/LMO and GR00T w/LMO, highlighting its robust locomotion and visual understanding. Its performance in wiping stains and vacuum cleaning shows its ability to adapt to practical household tasks.
6.4.1. Additional Visual Generalization under Object and Load Variations
Further generalization experiments (detailed in Appendix C.2 and Figure 6) perturb the three main loco-manipulation tasks by modifying manipulated objects and their loads while keeping the scene layout fixed.
This image is a chart showing success rates for Bag Packing, Box Loading, and Cart Pushing under unseen object and load variations. WholeBodyVLA outperforms GR00T and OpenVLA-OFT on all three tasks, indicating robustness under distribution shift.
Figure 6: Visual generalization under object and load variations. We evaluate Bag Packing, Box Loading, and Cart Pushing under unseen variations of bag appearance, box contents, and cart loads. WholeBodyVLA consistently outperforms GR00T and OpenVLA-OFT across all three tasks, indicating robustness to these distribution shifts.
Analysis:
- WholeBodyVLA maintains the highest success rates across all perturbed settings (e.g., different appearance/size/weight for bags, plastic containers instead of bags in boxes, 60 kg barbell plates instead of a carton on the cart). This demonstrates strong visual generalization and robustness to distribution shifts in object appearance and physical properties (loads), which is crucial for real-world deployment.
6.4.2. Effect of Proprioceptive State
To verify that WholeBodyVLA primarily relies on visual observations rather than proprioceptive state injected into the action decoder, an ablation WholeBodyVLA w/o state is tested.
The following are the results from Table 6 of the original paper:
| Method | Bag Packing: Grasp Bags | Bag Packing: Move & Squat | Box Loading: Squat & Grasp | Box Loading: Rise & Turn | Cart Pushing: Grab Handle | Cart Pushing: Push Ahead | Avg. Score |
|---|---|---|---|---|---|---|---|
| WholeBodyVLA w/o state | 21/25 | 12/25 | 22/25 | 14/25 | 24/25 | 22/25 | 76.7% |
| WholeBodyVLA | 23/25 | 13/25 | 19/25 | 17/25 | 23/25 | 22/25 | 78.0% |
The following are the results from Table 7 of the original paper:
| Method | Bag Packing*: Grasp Bags | Bag Packing*: Move & Squat | Box Loading*: Squat & Grasp | Box Loading*: Rise & Turn | Cart Pushing*: Grab Handle | Cart Pushing*: Push Ahead | Avg. Score |
|---|---|---|---|---|---|---|---|
| WholeBodyVLA w/o state | 12/25 | 5/25 | 14/25 | 12/25 | 21/25 | 21/25 | 56.7% |
| WholeBodyVLA | 15/25 | 9/25 | 15/25 | 15/25 | 22/25 | 20/25 | 64.0% |
Analysis:
- In the original setting (Table 6), removing the state input causes only a minor performance drop (76.7% vs 78.0%).
- Under visual variations (Table 7), both variants degrade, and the full model retains an edge overall (64.0% vs 56.7%, consistent with the per-subgoal counts). It also performs better on most visually perturbed subgoals, such as the second subgoals of Bag Packing* and Box Loading*.
- Despite some variance, the w/o-state variant still achieves comparable task completion rates. This implies that WholeBodyVLA has largely learned to accomplish loco-manipulation tasks from visual input, and its visual generalization does not solely depend on low-level proprioceptive information.
6.5. Failure Mode Analysis
To understand remaining limitations, a post-hoc failure analysis is conducted using start-pose generalization experiments. For each locomotion primitive (advance, sidestep, squat, turn), 50 failed rollouts are manually annotated.
This image is a chart showing WholeBodyVLA's failure modes in start-pose generalization. For each locomotion primitive ((a) advance, (b) sidestep, (c) squat, and (d) turn), 50 failed trials are collected, and Sankey diagrams break the causes down into locomotion versus grasp/placement errors, with finer reasons such as object/basket unreachable, wrong orientation, early stop, overshoot, collision, and stumble.
Figure 7: Failure modes of WholeBodyVLA in start-pose generalization. For each locomotion primitive—(a) advance, (b) sidestep, (c) squat, and (d) turn—we collect 50 failed trials from the start-pose generalization experiments (pick object and place into basket). Each Sankey diagram decomposes failures into locomotion vs. pick/place errors and finer-grained causes such as object/basket unreachable, wrong orientation, early stop, overshoot, collisions, stumble, and inaccurate grasp or placement.
Analysis:
- Dominant Failure Type: For horizontal locomotion primitives (advance, sidestep, turn), locomotion-related issues account for the majority of failures. Most unsuccessful episodes pass through the "object/basket unreachable" node. This suggests that even moderate stance or orientation deviations during approach propagate into infeasible pick-and-place attempts.
- Less Frequent Catastrophic Failures: Severe collisions or stumbles occur much less frequently.
- Squat Primitive: Failures are more evenly divided between locomotion (incorrect final height, contact during descent) and pick/place errors (imprecise arm trajectories, grasp alignment).
- Key Insight: The dominant failure modes are not catastrophic behaviors but rather small yet systematic stance and orientation errors during approach. Improving approach precision, especially for turning, lateral stepping, and squatting, is expected to directly reduce downstream manipulation failures.
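A small sketch of how such annotations could be tallied into the two-level breakdown shown in the Sankey diagrams, assuming each failed rollout is given one fine-grained cause label. The mapping and label strings below are illustrative assumptions, not the paper's exact taxonomy.

```python
# Illustrative tally of annotated failure causes into coarse/fine categories.
from collections import Counter

COARSE = {
    "object/basket unreachable": "locomotion",
    "wrong orientation": "locomotion",
    "early stop": "locomotion",
    "overshoot": "locomotion",
    "collision": "locomotion",
    "stumble": "locomotion",
    "inaccurate grasp": "pick/place",
    "inaccurate placement": "pick/place",
}

def tally(causes: list) -> tuple:
    """Return (coarse, fine) counts for a list of per-rollout cause labels."""
    fine = Counter(causes)
    coarse = Counter(COARSE[c] for c in causes)
    return coarse, fine

coarse, fine = tally(["object/basket unreachable", "stumble", "inaccurate grasp"])
print(coarse)  # Counter({'locomotion': 2, 'pick/place': 1})
```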
6.6. Task Execution Time
Average execution duration (in seconds) for successful trials is reported. The following are the results from Table 8 of the original paper:
| Method | Bag Packing: Grasp Bags | Bag Packing: Move & Squat | Box Loading: Squat & Grasp | Box Loading: Rise & Turn | Cart Pushing: Grab Handle | Cart Pushing: Push Ahead |
|---|---|---|---|---|---|---|
| Modular Design | 19.2 | 23.0 | 21.5 | 7.9 | 12.0 | 11.7 |
| GR00T w/LMO | 26.3 | 38.6 | 21.1 | 8.0 | 19.5 | 13.8 |
| OpenVLA-OFT w/LMO | 23.6 | 35.9 | 33.2 | 13.8 | 16.9 | 13.1 |
| WholeBodyVLA (ours) | 18.4 | 29.7 | 16.8 | 7.6 | 11.3 | 12.7 |
Analysis:
- WholeBodyVLA generally achieves comparable or faster execution times for many subgoals compared to baselines, especially for Grasp Bags (18.4 s vs. 19.2 s for Modular), Squat & Grasp (16.8 s vs. 21.5 s for Modular), and Rise & Turn (7.6 s vs. 7.9 s for Modular). This suggests that its unified and robust control leads to efficient task completion.
- For Move & Squat, WholeBodyVLA (29.7 s) is slower than Modular Design (23.0 s) but faster than GR00T w/LMO (38.6 s) and OpenVLA-OFT w/LMO (35.9 s). This might indicate that while WholeBodyVLA is more successful, its locomotion for this move-and-squat subgoal is more deliberate.
6.7. Shared Latent Action Space Across Embodiments
The paper highlights that its three-stage training pipeline naturally leads to a common latent action space for human and robot data.
This image is a schematic showing the effect of shared latent actions in cross-domain retrieval. For the same latent action (e.g., go forward, turn left, squat), the human and robot clips exhibit consistent semantics.
Figure 8: Cross-domain retrieval with shared latent actions. Human and robot clips retrieved for the same latent action (e.g., go forward, turn left, squat) exhibit consistent semantics across domains.
Analysis:
- Mechanism: Stage I trains VQ-VAE LAMs as inverse dynamics models purely from visual videos, where discrete codes summarize visual changes. Stage II uses these LAMs to supervise the VLA. Stage III then grounds these latents to robot-specific joint targets using teleoperation data.
- Evidence: Figure 8 visually demonstrates this by showing that, for the same latent action (e.g., "Go Forward," "Turn Left," "Squat"), the retrieved clips exhibit consistent semantics across both human-collected videos (left) and teleoperated robot demonstrations (right).
- Implication: This shared latent space is crucial for transfer learning and effectively leverages abundant action-free human data to inform robot behavior, supporting the strong performance gap between WholeBodyVLA and WholeBodyVLA w/o LAM (Table 2).
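A compact sketch of the latent-action idea summarized above, assuming a VQ-VAE-style inverse dynamics model that encodes a pair of frame features and summarizes their change with the nearest entry of a learned codebook. Dimensions, the encoder shape, and the omission of commitment losses and a straight-through estimator are simplifications; this is not the paper's implementation.

```python
# Illustrative VQ-VAE-style latent action model: a frame pair is summarized
# by the nearest codebook entry; the code index acts as a discrete "action".
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, feat_dim: int = 512, codebook_size: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor):
        """feat_t, feat_t1: (batch, feat_dim) visual features of two frames."""
        z = self.encoder(torch.cat([feat_t, feat_t1], dim=-1))
        # Nearest codebook entry summarizes the visual change between frames.
        dists = torch.cdist(z, self.codebook.weight)   # (batch, codebook_size)
        codes = dists.argmin(dim=-1)                    # discrete latent action
        quantized = self.codebook(codes)
        return codes, quantized

lam = LatentActionModel()
codes, _ = lam(torch.randn(4, 512), torch.randn(4, 512))
print(codes.shape)  # torch.Size([4])
```

Because the codes are defined purely by visual change, the same code can index a human clip and a robot clip with matching semantics, which is what the cross-domain retrieval in Figure 8 visualizes.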
6.8. LAM Quality Assessment with Relative Reconstruction Gain (RRG)
Beyond downstream VLA performance, the quality of the LAMs is directly assessed using Relative Reconstruction Gain (RRG).
The following are the results from Table 9 of the original paper:
| Method | Bag Packing: Grasp | Bag Packing: Move | Bag Packing: Squat | Bag Packing: Place | Box Loading: Squat | Box Loading: Grasp | Box Loading: Rise | Box Loading: Turn | Box Loading: Place | Cart Pushing: Grab | Cart Pushing: Push |
|---|---|---|---|---|---|---|---|---|---|---|---|
| manip. lam | 21.78 | 23.64 | 18.71 | 24.73 | 21.22 | 25.09 | 22.92 | 28.15 | 23.96 | 19.12 | 19.92 |
| shared lam | 19.70 | 23.58 | 20.62 | 25.69 | 19.41 | 17.38 | 23.43 | 27.68 | 23.61 | 18.49 | 17.79 |
| loco. lam | 16.39 | 25.77 | 29.46 | 20.60 | 22.72 | 20.24 | 25.40 | 30.74 | 24.81 | 15.27 | 20.27 |
Analysis:
- The RRG metric shows that the separately trained LAMs (manipulation LAM and locomotion LAM) generally achieve better downstream reconstruction performance than the shared LAM variant.
  - For manipulation-heavy primitives (e.g., Grasp, Place), manip. lam often has higher RRG (e.g., Bag Packing Grasp: 21.78 vs 19.70 for shared).
  - For locomotion-heavy primitives (e.g., Move, Squat, Turn), loco. lam tends to have higher RRG (e.g., Box Loading Squat: 22.72 vs 19.41 for shared; Box Loading Turn: 30.74 vs 27.68 for shared).
- The shared LAM variant underperforms the separate models on both manipulation and locomotion primitives in many cases. This quantitative result, along with the qualitative analysis from Table 2, suggests that there are inherent conflicts between locomotion-specific and manipulation-specific objectives when a single LAM is trained jointly on mixed data, justifying the WholeBodyVLA design of separate LAMs.
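This analysis does not reproduce the paper's exact definition of RRG. One plausible reading, consistent with "higher is better" and the name "Relative Reconstruction Gain", is the relative reduction in next-frame reconstruction error when the decoder is conditioned on the LAM's latent code, expressed in percent; the formula below is an assumption offered only for orientation, not the paper's stated definition:

$$\text{RRG} \;=\; 100 \times \frac{\mathcal{L}^{\text{rec}}_{\text{w/o code}} - \mathcal{L}^{\text{rec}}_{\text{w/ code}}}{\mathcal{L}^{\text{rec}}_{\text{w/o code}}}$$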
7. Conclusion & Reflections
7.1. Conclusion Summary
WholeBodyVLA pioneers a Vision-Language-Action (VLA) framework enabling humanoid robots to perform large-space loco-manipulation tasks autonomously in the real world. Its core innovation lies in two synergistic components: unified latent learning and a loco-manipulation-oriented (LMO) RL policy. Unified latent learning effectively addresses the data scarcity challenge by leveraging abundant, low-cost action-free egocentric videos to acquire rich loco-manipulation priors, using separate Latent Action Models (LAMs) for manipulation and locomotion. The LMO RL policy then ensures precise and stable execution of fundamental loco-manipulation movements by employing a discrete command interface, mitigating the decision-execution misalignment common in prior velocity-tracking controllers. Comprehensive experiments on the AgiBot X2 humanoid demonstrate superior performance over baselines, strong generalization to novel scenarios, and high extensibility across a broad range of tasks, marking a significant step towards general-purpose embodied agents.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Long-Horizon and Dexterous Tasks: While effective, WholeBodyVLA still faces challenges in reliably handling extremely long-horizon and highly dexterous tasks. These tasks often require more complex reasoning and finer motor control.
- Mapping and Memory: Future work will focus on incorporating lightweight mapping and memory capabilities. This would enable extended planning and allow the robot to build and utilize internal representations of its environment over longer durations, improving its ability to navigate and interact in complex spaces.
- Active Perception Strategies: To improve robustness in cluttered or dynamic environments, developing active perception strategies is crucial. This means the robot would intelligently choose where and how to look to gather the most relevant information for its current task, rather than relying on passive observation.

These directions aim to enhance WholeBodyVLA's scalability and generalization, paving the way for more versatile real-world humanoid loco-manipulation.
7.3. Personal Insights & Critique
This paper presents a robust and well-thought-out framework for a critical problem in humanoid robotics. The integration of unified latent learning and a specialized LMO RL policy is particularly insightful.
Inspirations and Applications:
- Data Efficiency: The approach of learning loco-manipulation priors from low-cost, action-free egocentric videos is a powerful paradigm for other data-scarce robotic domains. This can significantly democratize robot learning, moving away from expensive teleoperation or MoCap setups.
- Bridging the High-Low Gap: The LMO RL policy's discrete command interface and two-stage curriculum provide a valuable blueprint for designing low-level controllers that are explicitly tailored to the needs of high-level planners in complex, dynamic tasks. This decision-execution alignment is a common bottleneck in hierarchical control systems.
- Embodied AI: The strong generalization and extensibility across various loco-manipulation tasks (e.g., uneven terrain, visual navigation, household chores) demonstrate WholeBodyVLA's potential as a foundation for truly general-purpose embodied AI capable of operating in diverse human environments. Its ability to learn from human demonstrations and then adapt to real-robot specifics is a key step in this direction.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- "Lightweight" Decoder: While the paper mentions a "lightweight decoder" for grounding latent actions to robot-specific commands, the details of its learning process (e.g., how it maps discrete latent tokens to continuous joint angles and locomotion commands, and how robot state is incorporated) could be further elaborated. The state injection in the decoder, while shown not to be the primary driver of visual generalization, still contributes, suggesting some reliance.
- Discrete Command Granularity: The LMO policy's discrete command interface is effective for precise start-stop and directional control. However, for very fine-grained or highly dynamic continuous motions that might be needed in future dexterous loco-manipulation (e.g., smoothly adjusting speed while balancing on a narrow beam), a purely discrete interface might introduce limitations compared to a more continuous velocity or force control paradigm. The reference shaping helps, but the underlying discrete nature remains.
- Scalability of Human Video Collection: While stated as "low-cost," scaling human egocentric video collection to truly open-ended, diverse environments and tasks, with sufficient variability to cover all possible loco-manipulation scenarios, still presents a logistical challenge. The current dataset of 300 hours, while significant, might represent a fraction of what real-world general intelligence would demand.
- Sim-to-Real Gap for LMO: The LMO policy is trained in simulation. While domain randomization is extensive, the transition from simulation to the real AgiBot X2 robot for such a complex whole-body controller can still present a sim-to-real gap. Further analysis of the residual sim-to-real challenges for the LMO itself would be valuable.
- Interpretability of Latent Actions: While the paper shows shared latent actions across embodiments, a deeper analysis of the interpretability of these discrete tokens could enhance understanding and debugging. What specific visual features do the LAMs learn to associate with each discrete action token?

Overall, WholeBodyVLA is a very promising contribution. It systematically tackles key challenges in humanoid loco-manipulation and offers a robust, data-efficient, and effective solution that pushes the boundaries of autonomous humanoid control.