
WHOLEBODYVLA: TOWARDS UNIFIED LATENT VLA FOR WHOLE-BODY LOCO-MANIPULATION CONTROL

Published: 2025-12-11

TL;DR Summary

This study presents `WholeBodyVLA`, a unified latent vision-language-action framework enhancing humanoid robots' performance in loco-manipulation tasks. It learns from low-cost egocentric videos and employs a tailored reinforcement learning policy, achieving a 21.3% performance gain over the prior baseline on the AgiBot X2 humanoid.

Abstract

Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables Vision-Language-Action (VLA) systems to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of the first of its kind to enable large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming the prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of this paper is the development of WholeBodyVLA, a unified latent vision-language-action (VLA) framework designed for whole-body loco-manipulation control in humanoid robots.

1.2. Authors

The authors are: Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, and Hongyang Li. Their affiliations include Fudan University, OpenDriveLab & MMLab at The University of Hong Kong, AgiBot, and SII. Haoran Jiang and Jin Chen are noted as having equal contributions, while Zhihui Peng and Hongyang Li are indicated as project co-leads. These affiliations suggest a collaborative effort between academic institutions and robotics companies, pooling expertise in robotics, computer vision, and machine learning.

1.3. Journal/Conference

The paper does not explicitly state the journal or conference it was published in. However, given the nature of the research (humanoid robotics, vision-language-action models, reinforcement learning) and the affiliations (MMLab at HKU, OpenDriveLab), it is likely intended for a top-tier robotics or AI conference (e.g., CoRL, RSS, ICRA, IROS, NeurIPS, ICML) or a reputable journal in these fields. These venues are highly competitive and influential in their respective domains.

1.4. Publication Year

The paper was published on December 11, 2025 (2025-12-11, UTC).

1.5. Abstract

Humanoid robots face significant challenges in performing complex loco-manipulation tasks due to deficiencies in manipulation-aware locomotion in existing modular or end-to-end approaches. This limitation restricts robots to small workspaces. The paper identifies two root causes: (1) the scarcity of humanoid teleoperation data for acquiring loco-manipulation knowledge, and (2) the imprecise and unstable execution of locomotion commands by current RL controllers.

To address these, the authors propose WholeBodyVLA, a unified latent learning framework. This framework enables Vision-Language-Action (VLA) systems to learn from low-cost action-free egocentric videos, augmenting data collection with an efficient human data pipeline. To improve execution, they introduce a loco-manipulation-oriented (LMO) RL policy specifically designed for accurate and stable core loco-manipulation movements (advancing, turning, squatting).

WholeBodyVLA is presented as a pioneering framework for large-space humanoid loco-manipulation. Experimental results on the AgiBot X2 humanoid demonstrate superior performance, outperforming prior baselines by 21.3%, along with strong generalization and extensibility across diverse tasks.

The official PDF is available at https://opendrivelab.com/WholeBodyVLA/static/pdf/WholeBodyVLA.pdf. The paper appears to be a preprint or technical report associated with the OpenDriveLab and AgiBot projects.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the limited ability of humanoid robots to perform large-space, complex loco-manipulation tasks effectively. Loco-manipulation refers to tasks that require close coordination between locomotion (movement through an environment) and manipulation (dexterous interaction with objects).

This problem is critically important because humanoid robots are envisioned as general-purpose embodied agents capable of operating in human-centric environments. Achieving this vision necessitates seamless loco-manipulation. However, current approaches, whether modular (separating locomotion and manipulation stages) or end-to-end (single policy for both), suffer from manipulation-aware locomotion deficiencies. This means robots often fail to position themselves optimally for manipulation tasks, restricting their operational workspace.

The paper attributes these deficiencies to two main challenges:

  1. Data Scarcity: Learning robust loco-manipulation knowledge requires vast amounts of data. Humanoid teleoperation data, where skilled operators control a robot to perform tasks, is prohibitively expensive and time-consuming to collect at scale. Without sufficient data, models lack the experience to learn how locomotion should strategically support manipulation.

  2. Execution Imprecision and Instability: Even with a high-level plan, existing Reinforcement Learning (RL) controllers for humanoid locomotion often struggle with precision and stability in executing commands like starting, stopping, turning, or squatting. This decision-execution misalignment leads to failures (e.g., stumbling, path deviation) that prevent successful loco-manipulation.

    The paper's innovative idea is to tackle these issues by (1) leveraging readily available low-cost action-free egocentric videos to acquire rich loco-manipulation priors through a unified latent learning framework, and (2) developing a specialized Loco-Manipulation-Oriented (LMO) RL policy that provides more stable and precise low-level control for manipulation-relevant movements.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. Introduction of WholeBodyVLA: They present WholeBodyVLA, a novel Vision-Language-Action (VLA) framework that enables bipedal humanoids to perform end-to-end large-space loco-manipulation autonomously in real-world settings. This framework unifies high-level reasoning with low-level control, addressing the limitations of prior decoupled or partial loco-manipulation systems.

  2. Unified Latent Learning Framework: They propose unified latent learning, a method that allows joint locomotion-manipulation learning from abundant, low-cost action-free egocentric videos. This approach significantly alleviates the data scarcity problem by converting observation-only videos into discrete latent actions that serve as supervision for VLA training. The framework uses separate Latent Action Models (LAMs) for manipulation and locomotion to handle their distinct visual change patterns, which are then jointly used to supervise the VLA.

  3. Loco-Manipulation-Oriented (LMO) RL Policy: They introduce an LMO RL policy specifically designed to mitigate decision-execution misalignment. Unlike traditional velocity-tracking objectives, the LMO policy utilizes a simplified discrete command interface tailored for accurate and stable execution of fundamental loco-manipulation movements such as advancing, turning, and squatting, even under dynamic disturbances. This ensures that high-level VLA decisions are faithfully translated into reliable robot actions.

    The key conclusions and findings are:

  • WholeBodyVLA significantly outperforms prior modular and end-to-end baselines in real-world loco-manipulation tasks on the AgiBot X2 humanoid, achieving a 21.3% higher success rate than the best prior baseline and 24.0% better than another (from Table 2: 78.0% vs 64.0% and 54.0%).

  • The unified latent learning from action-free videos dramatically improves performance and reduces the reliance on expensive teleoperation data. Models pretrained with more human videos achieve comparable performance with significantly less fine-tuning data.

  • The LMO RL policy is crucial for reliable long-range and multi-step loco-manipulation, reducing the stumbling, path-deviation, and turning-while-advancing errors that plague velocity-based controllers.

  • WholeBodyVLA demonstrates strong generalization capabilities across changed start-poses, objects, layouts, appearances, and extends to long-horizon and challenging scenarios (e.g., uneven terrain traversal, visual navigation, vacuum cleaning, wiping).

    These findings address the core problem by providing a framework that can learn from more accessible data and execute actions more reliably, paving the way for more capable and autonomous humanoid robots.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand WholeBodyVLA, a foundational grasp of several robotics and machine learning concepts is essential for beginners:

  • Humanoid Robots: These are robots designed to resemble the human body, typically bipedal (two legs) with a torso, head, and two arms. Their human-like form makes them suitable for operating in environments designed for humans, but also poses significant challenges in terms of balance, locomotion, and dexterous manipulation. The AgiBot X2 humanoid used in this paper is an example, featuring multiple degrees of freedom (DoF) for arms, legs, and waist.
  • Locomotion: The ability of a robot to move through its environment. For humanoid robots, this includes walking forward/backward, lateral stepping, turning, squatting, and maintaining balance.
  • Manipulation: The ability of a robot to interact with objects in its environment, such as grasping, lifting, placing, pushing, or using tools. This often requires dexterous (skillful) control of robotic arms and end-effectors (like grippers).
  • Loco-manipulation: The coordinated execution of both locomotion and manipulation tasks. For example, walking towards a table, then squatting to pick up an object, and finally turning to place it on a cart. This requires seamless integration and awareness between movement and interaction.
  • Vision-Language-Action (VLA) Systems: These are AI systems that take visual observations (e.g., camera images) and language instructions (e.g., "pick up the red box") as input and output robot actions (e.g., motor commands). They aim to enable robots to understand high-level human commands and translate them into physical actions in the real world.
  • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired behaviors and penalties for undesired ones, learning a policy (a mapping from states to actions) that maximizes cumulative reward. RL controllers in robotics are policies that learn to generate low-level motor commands to achieve a desired behavior (e.g., walking stably).
  • Imitation Learning: A learning paradigm where a robot learns by observing demonstrations from an expert (e.g., a human teleoperating the robot). The goal is to imitate the expert's actions given the same observations. This is often used to bootstrap RL policies or learn complex skills that are difficult to define with explicit reward functions.
  • Teleoperation Data: Data collected by a human operator remotely controlling a robot. This provides direct action-labeled trajectories (observations paired with corresponding robot actions), which are valuable for imitation learning. However, it is typically expensive and time-consuming to gather.
  • Egocentric Videos / Action-Free Videos: Videos captured from the perspective of the robot (or human performing the task, simulating the robot's viewpoint). Action-free means these videos only contain visual observations, without explicit robot motor commands or human joint angles as labels. They are cheap and abundant but require methods to extract actionable information.
  • Latent Space / Latent Actions: In machine learning, a latent space is a lower-dimensional representation of data, often learned by neural networks. Latent actions are abstract, compressed representations of actions within this latent space. Instead of directly predicting raw motor commands, a model might predict a latent action token, which is then decoded into actual robot movements. This can help in learning from diverse data sources or simplifying complex action spaces.
  • Vector Quantized Variational AutoEncoder (VQ-VAE): An unsupervised deep learning model that learns to compress data into a discrete latent representation.
    • Autoencoder: A neural network that learns to encode data into a lower-dimensional latent space and then decode it back to its original form. The goal is reconstruction of the input.
    • Variational Autoencoder (VAE): A type of autoencoder that learns a probability distribution for the latent space, allowing for generative capabilities.
    • Vector Quantization (VQ): A technique that forces the latent representations to be discrete by mapping them to the closest vector in a fixed, learned codebook. This turns continuous latent variables into discrete tokens, which are easier to work with in some contexts (e.g., language models).
    • VQ-VAE is used in this paper to turn frame-to-frame visual changes from videos into compact, discrete latent tokens that represent actions.
  • Transformer Architecture: A neural network architecture originally developed for natural language processing, known for its self-attention mechanism. Transformers are highly effective at processing sequential data and capturing long-range dependencies. They are increasingly used in vision (e.g., Vision Transformers, DINOv2) and for VLA systems to process visual and language inputs. DINOv2 is a self-supervised vision transformer used for robust visual features.
  • Proprioceptive States: Sensory information about the robot's own body state, such as joint angles, joint velocities, base angular velocity, and gravity vector. These are internal measurements, as opposed to exteroceptive (external environment sensing like vision).

3.2. Previous Works

The paper discusses related work in two main categories: humanoid whole-body control and Vision-Language-Action models.

3.2.1. Humanoid Whole-Body Control

  • Loco-manipulation controllers:

    • Early works focused on isolated manipulation or motion imitation (e.g., Cheng et al., 2024; Ji et al., 2024; He et al., 2024; 2025b;a; Ze et al., 2025; Truong et al., 2025; Chen et al., 2025b; Xie et al., 2025; Wang et al., 2025; Cheng et al., 2025; Fu et al., 2025).
    • More recently, RL-based whole-body controllers emerged, often using velocity-tracking interfaces (e.g., Ben et al., 2025; Shi et al., 2025; Li et al., 2025a; Zhang et al., 2025a;b; Sun et al., 2025). These optimize for per-step errors against commanded speeds.
      • Limitations: While suitable for general cruising, velocity-tracking formulations often leave start-stop semantics implicit, lead to fragmented gaits across different speeds, and lack supervision for episode-level controllability (e.g., braking precision, heading fidelity) crucial for loco-manipulation. Upper-body influence is typically modeled as simple noise, not reflecting realistic inertial couplings from manipulation tasks.
      • Robustness Improvements: Some methods added robustness, like HOMIE (Ben et al., 2025) with PD-stabilized arms, AMO (Li et al., 2025a) with trajectory-optimization hybrids, FALCON (Zhang et al., 2025a) with force curriculum, R2S2 (Zhang et al., 2025b) with skill libraries, or ULC (Sun et al., 2025) with residual controllers. However, these still inherit the limitations of velocity-centric training, resulting in inconsistent gaits and unstable teleoperation trajectories.
  • High-level planners for humanoids:

    • These systems aim to process RGB vision or language inputs to provide higher-level commands to RL controllers.
    • LEVERB (Xue et al., 2025) embeds latent verbs into RL for low-level control.
    • R2S2 (Zhang et al., 2025b), Being-0 (Yuan et al., 2025), and HEAD (Chen et al., 2025a) use modular planners driven by vision-language models (VLMs) to sequence discrete locomotion and manipulation skills.
      • Limitations: These frameworks suffer from brittle skill boundaries (robots ending up in unstable configurations), error accumulation due to lack of closed-loop feedback, and reliance on cloud-based perception (introducing latency). Being-0 also uses GPT-4o and detectors as external modules.
    • VLA for humanoids:
      • Initial efforts like Humanoid-VLA (Ding et al., 2025) focus on locomotion, while GR00T (Bjorck et al., 2025) targets manipulation for humanoid embodiments, neglecting the other critical primitive for seamless loco-manipulation.
      • The Boston Dynamics demonstration (Boston Dynamics, 2025) shows impressive loco-manipulation but is constrained to a limited workspace and relies heavily on expensive Motion Capture (MoCap) data.
    • Overall Gap: These limitations highlight the need for unified frameworks that directly couple vision and language with whole-body control without brittle modular boundaries, a gap WholeBodyVLA aims to fill.

3.2.2. Vision-Language-Action Models

  • General VLA systems:

    • Recent advances, building on multimodal foundation models and imitation learning from large-scale real-robot trajectories, have led to powerful VLA systems (e.g., RT-2 (Brohan et al., 2023), OpenVLA (Kim et al., 2024), RDT (Liu et al., 2025), Pi0 (Black et al., 2024), Pi0.5 (Intelligence et al., 2025)).
    • Limitations: These models typically emphasize upper-body manipulation only and do not provide an end-to-end solution for autonomous whole-body control required by loco-manipulation tasks.
    • WholeBodyVLA aims to integrate locomotion and manipulation into a unified VLA for bipedal humanoids.
  • Latent action learning:

    • This approach addresses the data scarcity of action-labeled trajectories by learning from action-free videos. It compresses frame-to-frame visual changes into compact, discrete tokens that supervise policy learning.
    • Representative examples include Genie (Bruce et al., 2024), LAPA (Ye et al., 2025), IGOR (Chen et al., 2024), and UniVLA (Bu et al., 2025b). These studies demonstrate the effectiveness of using action-free videos as supervisory signals for VLA training.
    • Relevance to WholeBodyVLA: Given the even greater difficulty of obtaining large-scale humanoid loco-manipulation data, WholeBodyVLA leverages latent action learning in a unified manner for both locomotion and manipulation.

3.3. Technological Evolution

The field has seen a progression from isolated control of robot components (e.g., separate navigation for mobile robots, arm control for manipulators) to more integrated whole-body control for complex platforms like humanoids. Initially, whole-body control often relied on motion imitation from human MoCap data or simpler velocity-tracking RL policies, which were effective for general motions but struggled with precision and stability under real-world manipulation loads.

The advent of Vision-Language Models (VLMs) and multimodal foundation models has shifted the paradigm towards Vision-Language-Action (VLA) systems, enabling robots to understand semantic instructions and perceive their environment using rich visual input. However, initial VLA efforts for humanoids tended to focus either on locomotion or manipulation, not their seamless integration. Moreover, these systems still faced the hurdle of data scarcity for robot-specific action-labeled trajectories.

Latent action learning emerged as a crucial technique to bridge the data gap, allowing models to learn from readily available action-free videos. WholeBodyVLA fits into this timeline by combining these advancements: it builds on the VLA paradigm, uses latent action learning from action-free videos to overcome data scarcity for humanoid loco-manipulation, and crucially, introduces a specialized RL controller (LMO) to address the precision and stability issues inherent in low-level humanoid locomotion for manipulation-aware tasks. This represents an evolution towards truly end-to-end, general-purpose humanoid agents.

3.4. Differentiation Analysis

WholeBodyVLA differentiates itself from previous works primarily through two key innovations: unified latent learning and the loco-manipulation-oriented (LMO) RL policy.

  • Against Modular Pipelines (e.g., R2S2, Being-0, HEAD):

    • Differentiation: WholeBodyVLA offers an end-to-end framework, learning closed-loop feedback and joint optimization between locomotion and manipulation. Modular approaches typically use high-level planners to sequence discrete skills (e.g., navigate, then grasp), leading to brittle skill boundaries and error accumulation where the robot might end up in suboptimal or unstable configurations after locomotion, making subsequent manipulation difficult.
    • Innovation: WholeBodyVLA avoids these handoff issues by learning a cohesive action space where locomotion and manipulation interact, guided by unified latent actions.
  • Against End-to-End VLA for Humanoids (e.g., Humanoid-VLA, GR00T, Boston Dynamics):

    • Differentiation: Humanoid-VLA focuses on locomotion, GR00T on manipulation, and the Boston Dynamics demo, while impressive, relies on expensive MoCap data and is constrained to a limited workspace. WholeBodyVLA specifically targets unified loco-manipulation across large spaces.
    • Innovation: WholeBodyVLA leverages unified latent learning from low-cost action-free videos to mitigate the data scarcity problem, which is a major bottleneck for training end-to-end whole-body policies. This allows it to acquire loco-manipulation knowledge at a scale not feasible with teleoperation or MoCap alone.
  • Against Existing RL Loco-manipulation Controllers (e.g., HOMIE, AMO, FALCON, ULC):

    • Differentiation: These controllers often use continuous velocity-tracking objectives, which are good for general cruising but lack the precision and stability needed for start-stop semantics and fine-grained positional control critical for manipulation-aware locomotion. They also poorly model structured inertial couplings from real manipulation tasks.
    • Innovation: The LMO RL policy in WholeBodyVLA replaces velocity tracking with a discrete command interface specifically designed for fundamental loco-manipulation movements (advancing, turning, squatting). This ensures more faithful execution, predictable on/off transitions, and better stability under dynamic disturbances (like carrying loads or arm movements).
  • Against General Latent Action Learning (e.g., Genie, LAPA, IGOR, UniVLA):

    • Differentiation: While previous works demonstrated latent action learning effectiveness for tabletop manipulation or general tasks, WholeBodyVLA applies it specifically to the much more complex domain of humanoid whole-body loco-manipulation.

    • Innovation: It introduces a crucial distinction by training separate latent action models (LAMs) for manipulation and locomotion. This addresses the inherent modality differences (static camera in manipulation vs. moving camera in locomotion) and conflicting attention objectives that would degrade a single shared LAM, leading to more robust latent representations for humanoid loco-manipulation.

      In essence, WholeBodyVLA's core innovation lies in its holistic approach: addressing both the data acquisition challenge (via unified latent learning from action-free videos) and the reliable execution challenge (via the LMO RL policy) within a single, unified VLA framework for large-space humanoid loco-manipulation.

4. Methodology

The WholeBodyVLA framework integrates two main components: unified latent learning for high-level Vision-Language-Action (VLA) policy training and a loco-manipulation-oriented (LMO) RL policy for low-level, stable locomotion execution. This section details how these components are designed and interact to enable large-space humanoid loco-manipulation.

4.1. Principles

The core idea behind WholeBodyVLA is to address the dual challenges of data scarcity and execution misalignment in humanoid loco-manipulation.

  1. Overcoming Data Scarcity with Latent Learning: Humanoid teleoperation data is expensive. The paper proposes to leverage abundant, low-cost action-free egocentric videos (human demonstrations) to learn abstract loco-manipulation priors. This is achieved through Latent Action Models (LAMs) which convert visual changes in videos into discrete latent actions. By supervising a VLA policy with these latent actions, the robot can learn complex behaviors without explicit robot-specific action labels from day one.
  2. Ensuring Reliable Execution with LMO RL Policy: High-level VLA decisions need to be faithfully executed by the robot's lower-body. Traditional Reinforcement Learning (RL) controllers, often optimized for continuous velocity tracking, struggle with precision and stability for the specific start-stop and directional control required by manipulation-aware locomotion. The LMO RL policy introduces a simplified discrete command interface and a targeted training curriculum to achieve highly precise and stable execution of fundamental loco-manipulation movements, bridging the gap between high-level intent and low-level physical realization.

4.2. Core Methodology In-depth (Layer by Layer)

The WholeBodyVLA pipeline consists of three main stages: LAM pretraining, VLA training, and real-world deployment with the LMO RL policy.

The overall pipeline of WholeBodyVLA is illustrated in Figure 2 (from the original paper).

Figure 2: Pipeline of WholeBodyVLA. The LAM is pretrained on manipulation and manipulation-aware locomotion videos, yielding unified latent supervision for the VLM. Meanwhile, the LMO RL policy is trained for precise and stable locomotion under disturbances. At runtime, egocentric images and language instructions are encoded by the VLM into latent action tokens, which are decoded ($\sim$10 Hz) into (i) dual-arm joint actions and (ii) locomotion commands executed by LMO at 50 Hz, enabling robust whole-body loco-manipulation.

4.2.1. Unified Latent Action Model (LAM)

The central idea here is to learn manipulation and locomotion primitives from egocentric videos using Latent Action Models (LAMs), which then provide latent supervision for VLA training. The paper observes that training a single LAM on mixed data (manipulation and locomotion) yields suboptimal performance due to fundamental differences in the visual modalities:

  • Manipulation videos: Camera pose is almost static; image variations are dominated by arm motion, biasing the model to arm regions.

  • Locomotion videos: Camera pose changes continuously; image variations arise from environment motion, forcing attention on the entire scene.

  • Ambiguity: In locomotion videos, arm movement in the Field of View (FOV) might be misinterpreted as arm motion rather than camera motion (locomotion), leading to ambiguous latent encodings.

    To counter this, WholeBodyVLA trains two separate LAMs: a manipulation LAM on manipulation data and a locomotion LAM on locomotion data. Both LAMs are then used jointly to supervise the VLA training.

4.2.1.1. LAM Training Details

Following prior work like Genie and UniVLA, a VQ-VAE (Vector Quantized Variational AutoEncoder) architecture is adopted for the LAMs. The encoder is built on DINOv2 features, a self-supervised vision transformer, to extract robust visual features.

Given a pair of video frames $(o_t, o_{t+k})$:

  1. Continuous Latent Vector Generation: The LAM encoder $\mathcal{E}_i$ first processes the two frames to emit a continuous latent vector $z_t$: $ z_t = \mathcal{E}_i(o_t, o_{t+k}) $ Here, $o_t$ is the current visual observation (image frame) at time $t$, and $o_{t+k}$ is the visual observation at the future time $t+k$. $\mathcal{E}_i$ is the LAM encoder, where the subscript $i \in \{\mathrm{mani}, \mathrm{loco}\}$ indicates whether it belongs to the manipulation LAM or the locomotion LAM. The encoder learns to capture the visual change between the two frames as a continuous vector representation.

  2. Quantization to Discrete Latent Action: The continuous latent vector $z_t$ is then quantized to the nearest entry in a learned codebook $\mathcal{C}_i$: $ c_t = \mathrm{quantize}(z_t) := \arg\min_{c \in \mathcal{C}_i} \| z_t - c \|_2, \quad c_t \in \mathcal{C}_i $ Here, $c_t$ is the discrete latent action token, chosen from the codebook $\mathcal{C}_i$ (a set of learned discrete vectors). This quantization step transforms continuous visual changes into a compact, discrete representation; the $\arg\min$ operator selects the codebook entry $c$ closest to $z_t$ in $L_2$ norm.

  3. Frame Reconstruction and Loss: To train both the encoder $\mathcal{E}_i$ and the codebook $\mathcal{C}_i$, a LAM decoder $\mathcal{D}_i$ is used. The decoder receives the earlier frame $o_t$ and the quantized latent action $c_t$, and is trained to reconstruct the later frame $\hat{o}_{t+k}$: $ \hat{o}_{t+k} = \mathcal{D}_i(o_t, c_t) $ The reconstruction is optimized by minimizing a mean-squared error (MSE) loss: $ \mathcal{L}_{\mathrm{mse}} = \| o_{t+k} - \hat{o}_{t+k} \|_2^2 $ This loss measures how well the decoder can reconstruct the future frame given the current frame and the discrete latent action, forcing the latent action to encode the temporal visual change effectively.

  4. VQ-VAE Objective: The overall LAM training objective combines the MSE reconstruction loss with standard VQ-VAE terms to jointly train the encoder, decoder, and codebook (a minimal code sketch follows this list): $ \mathcal{L}_{\mathrm{LAM}} = \mathcal{L}_{\mathrm{mse}} + \Vert \mathrm{sg}[c_t] - z_t \Vert_2^2 + \beta \Vert c_t - \mathrm{sg}[z_t] \Vert_2^2 $

    • $\mathcal{L}_{\mathrm{mse}}$: The reconstruction loss, as explained above.
    • $\Vert \mathrm{sg}[c_t] - z_t \Vert_2^2$: The codebook loss. $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, so gradients do not flow through the bracketed term. This term pulls the continuous encoder output $z_t$ towards the selected codebook entry $c_t$.
    • $\beta \Vert c_t - \mathrm{sg}[z_t] \Vert_2^2$: The commitment loss. This term encourages the codebook entries to commit to the encoder's outputs, ensuring that the codebook learns to represent the actual latent actions; $\beta$ is a hyperparameter called the commitment cost. Together, these terms make the LAM encode temporal visual dynamics into a meaningful and reconstructible discrete latent space.
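
To make the VQ-VAE objective above concrete, here is a minimal PyTorch-style sketch. The actual LAMs operate on DINOv2 features with transformer encoders and decoders; the tiny linear layers, dimensions, and the straight-through gradient trick below are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the LAM's VQ-VAE objective (L_LAM above).
# Placeholder MLP encoder/decoder stand in for the DINOv2-based models.
import torch
import torch.nn.functional as F

class TinyLAM(torch.nn.Module):
    def __init__(self, obs_dim=512, latent_dim=32, codebook_size=64, beta=0.25):
        super().__init__()
        self.encoder = torch.nn.Linear(2 * obs_dim, latent_dim)       # E_i(o_t, o_{t+k})
        self.decoder = torch.nn.Linear(obs_dim + latent_dim, obs_dim)  # D_i(o_t, c_t)
        self.codebook = torch.nn.Embedding(codebook_size, latent_dim)  # C_i
        self.beta = beta

    def forward(self, o_t, o_tk):
        z_t = self.encoder(torch.cat([o_t, o_tk], dim=-1))
        idx = torch.cdist(z_t, self.codebook.weight).argmin(dim=-1)    # nearest code in L2
        c_t = self.codebook(idx)
        c_st = z_t + (c_t - z_t).detach()                              # straight-through estimator
        o_hat = self.decoder(torch.cat([o_t, c_st], dim=-1))
        loss = (
            F.mse_loss(o_hat, o_tk)                      # reconstruction term L_mse
            + F.mse_loss(z_t, c_t.detach())              # ||sg[c_t] - z_t||^2 (updates encoder)
            + self.beta * F.mse_loss(c_t, z_t.detach())  # beta * ||c_t - sg[z_t]||^2 (updates codebook)
        )
        return idx, loss
```

A forward pass on a batch of paired frame features, `idx, loss = lam(o_t, o_tk)`, returns the discrete code indices that later serve as pseudo-action labels for the VLA.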

4.2.1.2. VLA Training

After the LAMs are pretrained, the VLA policy $\pi_{\theta}$ is trained. Its goal is to jointly predict both manipulation and locomotion latent actions given the visual observation $o_t$ and the task language instruction $\ell$. This is formulated as maximum likelihood estimation (MLE) with a cross-entropy loss: $ \min_{\theta} \big[ - \log \pi_{\theta}(c_t^{\mathrm{mani}}, c_t^{\mathrm{loco}} \mid o_t, \ell) \big] $ Here, $\pi_{\theta}$ is the VLA policy parameterized by $\theta$; $c_t^{\mathrm{mani}}$ and $c_t^{\mathrm{loco}}$ are the ground-truth discrete latent actions (pseudo-action labels) obtained from the pretrained manipulation and locomotion LAMs, respectively; $o_t$ is the current visual observation; and $\ell$ is the language instruction. This joint prediction forces the VLA model to learn how locomotion and manipulation interact in a single, cohesive action space, thereby enabling task execution.
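
As a sketch of this objective, assuming the VLA backbone exposes one classification head per codebook (an architectural assumption, since the paper does not spell out the head layout), the joint loss is simply the sum of two cross-entropy terms over the code indices:

```python
# Joint latent-action supervision over the manipulation and locomotion codebooks.
import torch.nn.functional as F

def vla_latent_loss(mani_logits, loco_logits, c_mani, c_loco):
    """mani_logits: (B, |C_mani|), loco_logits: (B, |C_loco|); c_* are ground-truth code indices."""
    return F.cross_entropy(mani_logits, c_mani) + F.cross_entropy(loco_logits, c_loco)
```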

4.2.1.3. Execution on Humanoids

For real-world deployment on humanoids, a lightweight decoder $f$ is introduced. This decoder grounds the predicted latent actions into robot-specific control commands: $ a_t = f(\hat{c}_t^{\mathrm{mani}}, \hat{c}_t^{\mathrm{loco}}, s_t) $ Here, $a_t$ represents the robot's executable actions at time $t$; $\hat{c}_t^{\mathrm{mani}}$ and $\hat{c}_t^{\mathrm{loco}}$ are the manipulation and locomotion latent actions predicted by the VLA; and $s_t$ is the current robot state (proprioceptive information). The decoder produces two types of outputs:

  1. Upper-body joint angles: For manipulation tasks.

  2. Locomotion command: For the low-level RL controller (specifically, the LMO RL policy), which translates this command into lower-body torques.

    This division of labor allows the VLA to make high-level unified latent decisions, the decoder to translate them into embodiment-specific control signals, and the LMO RL policy to ensure stable and precise low-level execution for robust whole-body loco-manipulation.

4.2.1.4. Manipulation-aware Locomotion Data Collection

To scale the benefits of unified latent learning, the paper devises an efficient, low-cost pipeline for collecting human egocentric manipulation-aware locomotion videos. This pipeline is characterized by:

  1. Low Cost and Efficiency: Only a single operator wearing a head-mounted camera (Intel RealSense D435i RGB-D or GoPro for wider FOV) is required, avoiding expensive MoCap or teleoperation setups.

  2. Coverage of Humanoid Primitives: Operators are instructed to perform a diverse range of motions crucial for humanoids, including advancing, turning, and squatting.

  3. Goal-Directed Execution: Operators perform locomotion specifically to contact potential manipulation goals. This ensures that the collected locomotion data is directly aligned with loco-manipulation learning and inherently manipulation-aware.

    An overview of this data collection process is depicted in Figure 4 (from the original paper).

    Figure 4: Low-cost egocentric data collection for the locomotion task and the LAM pretraining pipeline. A single operator records egocentric video while executing eight canonical motion primitives toward potential manipulation goals. This task-oriented pipeline captures locomotion patterns that are diverse, structured, and directly relevant to humanoid loco-manipulation. The collected manipulation-aware locomotion data, together with the open-source in-place manipulation dataset, are then used to pretrain the latent action model in a VQ-VAE-style pipeline.

4.2.2. Loco-Manipulation-Oriented (LMO) RL Policy

The LMO RL policy addresses the decision-execution misalignment by replacing continuous random velocity-tracking objectives (common in existing RL controllers) with a discrete command interface tailored for loco-manipulation. This design explicitly optimizes for stability and precision under dynamic manipulation disturbances.

4.2.2.1. Observation Space

The LMO policy relies solely on proprioceptive egocentric states with a short history stack for its observations: $ O_t = \big[ u_t, \omega_t, \mathbf{g}_t, \mathbf{q}_t, \dot{\mathbf{q}}_t, \mathbf{a}_{t-1} \big] $

  • $u_t$: The discrete high-level command from the VLA. This is the target intention for the low-level controller.
  • $\omega_t \in \mathbb{R}^3$: The base angular velocity of the robot.
  • $\mathbf{g}_t \in \mathbb{R}^3$: The gravity vector expressed in the robot's base frame, indicating its orientation relative to gravity.
  • $\mathbf{q}_t, \dot{\mathbf{q}}_t$: The joint positions and velocities of the robot's lower body (legs).
  • $\mathbf{a}_{t-1}$: The previous action executed by the RL policy. This compact, purely proprioceptive design avoids reliance on privileged environment information from the simulator and is sufficient for closed-loop stability. A minimal sketch of assembling one such observation frame follows.
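
As referenced above, one observation frame can be assembled by concatenating the listed quantities; the joint dimensionality and the exact history-stack length are not given in this analysis, so the shapes below are placeholders.

```python
# One proprioceptive LMO observation frame; the deployed policy stacks a short history of these.
import numpy as np

def build_obs_frame(u_t, omega_t, g_t, q_t, qd_t, a_prev):
    """Concatenate [command, base angular velocity, gravity vector,
    leg joint positions, leg joint velocities, previous action]."""
    return np.concatenate([u_t, omega_t, g_t, q_t, qd_t, a_prev]).astype(np.float32)
```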

4.2.2.2. Discrete Command Interface

The lower-body control is formulated as goal-conditioned regulation: the policy executes discrete high-level commands while maintaining balance. At each time step $t$, the high-level planner (or VLA decoder) issues: $ u_t = [ s_x, s_y, s_{\psi}, h^{\star} ] \in \{ -1, 0, 1 \}^{3} \times \mathbb{R} $

  • $s_x$: A discrete indicator for forward/backward locomotion (1 for forward, -1 for backward, 0 for stop).

  • $s_y$: A discrete indicator for lateral locomotion (1 for right, -1 for left, 0 for stop).

  • $s_{\psi}$: A discrete indicator for turning (1 for clockwise, -1 for counter-clockwise, 0 for stop).

  • $h^{\star}$: The desired stance height of the robot (e.g., for squatting or standing tall).

    This discrete interface explicitly enforces start-stop semantics and reduces trajectory variance, improving training for both the RL controller and the high-level planner, unlike continuous velocity-based formulations. A small sketch of this command structure is shown below.
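
The sketch below expresses this command as a plain data type; the field names, the example stance height, and the validation logic are illustrative assumptions, with sign conventions mirrored from the list above.

```python
# Discrete LMO command u_t = [s_x, s_y, s_psi, h_star]; names are illustrative.
from dataclasses import dataclass

@dataclass
class LMOCommand:
    s_x: int       # +1 forward, -1 backward, 0 stop
    s_y: int       # +1 right, -1 left, 0 stop
    s_psi: int     # +1 clockwise, -1 counter-clockwise, 0 stop
    h_star: float  # desired stance height (continuous)

    def __post_init__(self):
        for flag in (self.s_x, self.s_y, self.s_psi):
            assert flag in (-1, 0, 1), "intent flags are ternary"

# Example: advance at an assumed standing height of 0.8 m.
forward_cmd = LMOCommand(s_x=1, s_y=0, s_psi=0, h_star=0.8)
```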

4.2.2.3. Reference Shaping

Since the intents $s_k$ are ternary ($-1, 0, 1$), they are converted into smooth velocity references to prevent abrupt accelerations and ensure predictable transitions: $ v_k^{\mathrm{ref}}(t) = v_k^{\mathrm{goal}} \, \sigma \big( \alpha (s_k - \bar{s}_k (t)) \big), \quad \bar{s}_k (t) \gets (1 - \lambda) \bar{s}_k (t - 1) + \lambda s_k $

  • $v_k^{\mathrm{ref}}(t)$: The smoothed velocity reference for axis $k$ (where $k \in \{x, y, \psi\}$ denotes forward/backward, lateral, and yaw, respectively) at time $t$.
  • $v_k^{\mathrm{goal}}$: The goal speed magnitude for axis $k$. The sign of $v_k^{\mathrm{goal}}$ is determined by the intent $s_k$.
  • $\sigma(\cdot)$: A saturating nonlinearity that smooths the transition; the main text instantiates it as tanh, while the appendix keeps it as a generic saturating function, the notation used here.
  • $\alpha$: A scaling factor controlling the steepness of the saturating function.
  • $\bar{s}_k(t)$: An exponentially smoothed flag for intent $s_k$, i.e., an exponential moving average of the current and past intents, making transitions gradual.
  • $\lambda$: The smoothing factor of the exponential moving average. This ensures predictable on/off transitions and reduces oscillations, which is crucial for stable loco-manipulation. A minimal sketch of this shaping rule follows.
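
The sketch below implements the smoothing-plus-saturation idea: the ternary intent is low-pass filtered into $\bar{s}_k$, and a saturating nonlinearity (tanh is assumed) turns it into a velocity reference that ramps smoothly toward the goal speed. Applying the nonlinearity to the smoothed flag itself is a simplified reading of the formula above, and the gains are placeholders.

```python
# Reference shaping for one axis k: exponential smoothing of the intent flag,
# then a saturating ramp toward the goal speed (simplified reading, placeholder gains).
import numpy as np

def shape_reference(s_k, s_bar_prev, v_goal, alpha=4.0, lam=0.05):
    """Returns (v_ref, s_bar) for one control step on axis k."""
    s_bar = (1.0 - lam) * s_bar_prev + lam * s_k   # exponentially smoothed intent flag
    v_ref = v_goal * np.tanh(alpha * s_bar)        # smooth, saturating velocity reference
    return v_ref, s_bar

# Example: holding the forward intent at +1 ramps the reference from 0 toward ~0.4 m/s.
s_bar, refs = 0.0, []
for _ in range(50):
    v_ref, s_bar = shape_reference(s_k=1, s_bar_prev=s_bar, v_goal=0.4)
    refs.append(v_ref)
```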

4.2.2.4. Two-Stage Curriculum

The LMO policy is trained using a two-stage curriculum to first acquire basic locomotion and then specialize it for precise and stable loco-manipulation.

Stage I (Basic Gait Acquisition):

  • Goal: Learn a minimal gait that prevents falls and responds to discrete commands.
  • Locomotion: For each axis $k \in \{x, y, \psi\}$, if $s_k \neq 0$, a goal speed magnitude $v_k^{\mathrm{goal}}$ is sampled from a uniform distribution $\mathcal{U}([0, v_k^{\mathrm{max}}])$. If $s_k = 0$, then $v_k^{\mathrm{goal}} = 0$.
  • Upper Body: The arms track pose targets that are resampled at a fixed interval and interpolated for smooth motion.
  • Disturbances: Joint limits are gradually relaxed by a curriculum factor, exposing the legs to progressively stronger, but task-agnostic, disturbances.
  • Outcome: This stage enables the policy to develop a basic gait that prevents falling, establishing a stable foundation.

Stage II (Precision and Stability):

  • Goal: Refine the gait for loco-manipulation-level precision and stability.
  • Locomotion:
    • Fixed Cruising Speed: Per-axis cruising speeds are fixed to constants ($v_k^{\mathrm{goal}} = \bar{v}_k$) whenever an intent is given ($s_k \neq 0$). This prevents the fragmented gaits that arise from training across varying reference speeds.
    • Directional Accuracy Reward: Precision is enforced by penalizing yaw drift at episode termination, which is critical for tasks requiring precise orientation: $ \mathcal{I}_{\mathrm{dir}} = | \mathrm{wrap} \big( \psi_{\mathrm{end}} - \psi_{\mathrm{start}} \big) | $ Here, $\mathcal{I}_{\mathrm{dir}}$ is the yaw drift penalty, $\psi_{\mathrm{end}}$ is the yaw orientation at the end of the locomotion phase, and $\psi_{\mathrm{start}}$ is the yaw orientation at the start; $\mathrm{wrap}(\cdot)$ normalizes the angle to a range such as $(-\pi, \pi]$. An episode begins when any axis flag flips from 0 to $\pm 1$ and ends when it returns to 0 and the base stabilizes. Minimizing the expected value of $\mathcal{I}_{\mathrm{dir}}$ enforces precise initiation, steady cruising, and consistent braking.
  • Manipulation Side (Structured Perturbations):
    • Realistic perturbations are injected by sampling short arm-motion segments from AgiBot World (a real-robot manipulation dataset). These segments are interpolated into continuous signals and replayed at varied rates with light noise.
    • The arm motion replay is defined by: $ \omega_{i+1} = \min(L, \omega_i + (\gamma + \delta_i) \Delta t), \quad \omega_0 = 0 $ where $\omega_i$ is the current time index into the arm trajectory, $L \sim \mathrm{Unif}[0.8, 2.5]$ is the length of the segment, $\gamma \sim \mathrm{Unif}[0.8, 1.5]$ is the base replay speed, $\delta_i \sim \mathrm{Unif}[-0.25, 0.25]$ is a random noise term, and $\Delta t$ is the simulation time step.
    • Per-step arm targets are given by $q_i^{\mathrm{tar}} = q^{\mathrm{arm}}(\omega_i) + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0, 0.05^2)$.
    • This forces the legs to compensate for structured inertial couplings (realistic arm movements influencing the robot's balance) rather than just unstructured disturbances.
  • Stand-Still Penalty: For stationary episodes ($s_x = s_y = s_{\psi} = 0$), a penalty is added to discourage unnecessary leg actions: $ \mathcal{T}_{\mathrm{stand}} = \| a_i^{\mathrm{leg}} \|_2^2 $ Here, $a_i^{\mathrm{leg}}$ represents the leg actions (e.g., torques or joint positions), and the penalty minimizes their magnitude to ensure stable standing. A sketch of the yaw-drift penalty and arm-motion replay appears after the summary paragraph below.

Together, these designs in Stage II yield stable, repeatable gaits and reliable whole-body coordination, crucial for avoiding the fragmented motion patterns often induced by velocity-tracking objectives.
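
The sketch below (referenced in the Stage II list) illustrates two of these ingredients: the episode-level yaw-drift penalty $\mathcal{I}_{\mathrm{dir}}$ and the time-warped replay index that turns recorded arm segments into structured perturbations. The sampling ranges follow the text; the wrap convention and control-loop details are simplified stand-ins for the simulator code.

```python
# Yaw-drift penalty and arm-motion replay index from Stage II (simplified sketch).
import numpy as np

def wrap_angle(angle):
    """Wrap an angle difference into the principal range around zero."""
    return (angle + np.pi) % (2.0 * np.pi) - np.pi

def yaw_drift_penalty(psi_start, psi_end):
    """I_dir = |wrap(psi_end - psi_start)|, evaluated at episode termination."""
    return abs(wrap_angle(psi_end - psi_start))

def advance_replay_index(w_i, dt, L, gamma):
    """w_{i+1} = min(L, w_i + (gamma + delta_i) * dt), with delta_i ~ U[-0.25, 0.25]."""
    delta_i = np.random.uniform(-0.25, 0.25)
    return min(L, w_i + (gamma + delta_i) * dt)

# One perturbation segment: L ~ U[0.8, 2.5], gamma ~ U[0.8, 1.5], w_0 = 0.
L, gamma, w = np.random.uniform(0.8, 2.5), np.random.uniform(0.8, 1.5), 0.0
while w < L:
    w = advance_replay_index(w, dt=0.02, L=L, gamma=gamma)  # query q_arm(w) + Gaussian noise here
```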

4.2.2.5. Reward Functions and Domain Randomization

The LMO policy's training involves a detailed reward structure (Table 4 in Appendix B.1) and extensive domain randomization (Table 5 in Appendix B.1).

Reward Functions: Reward terms are categorized into:

  • Intent Execution: Rewards for precisely tracking forward, lateral, and yaw intentions, and maintaining desired height, along with penalties for vertical and angular velocity deviations.

  • Posture & Joints: Penalties for deviations from default hip/ankle poses, knee deviation during squatting, DoF acceleration/velocity/position limit violations, and torque limit violations. Rewards for roll/pitch stabilization.

  • Locomotion Structure: Rewards for feet air time, penalties for foot clearance, foot lateral spacing (maintaining appropriate stance width), feet ground parallelism, feet parallelism, no-fly penalty (ensuring at least one foot is on the ground), foot slip, and foot stumble.

  • Energy & Smoothness: Penalties for action rate (how much actions change), smoothness (second-order action changes), joint power, torque usage, feet contact force, contact momentum, and action vanish penalty.

  • Stability (New in Stage II): Joint tracking error and the previously mentioned stand-still penalty are introduced or emphasized in Stage II to improve loco-manipulation robustness.

    Domain Randomization: Parameters for dynamics (joint torque, actuation offset, link mass, COM/body displacement), contact properties (friction, restitution, payload), controller gains (Kp, Kd), and latency/sensor noise (action, DoF state, IMU lag) are randomized during training. Stage II increases the strength and frequency of perturbations, including push disturbances and manipulation-induced payload variations, to further enhance robustness. This helps the policy generalize to real-world variations.

The LMO policy runs at 50 Hz on an onboard computer, while the VLA backbone operates at ~10 Hz for perception and reasoning, with communication handled via ZeroMQ over Ethernet for low-latency command streaming.
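
The paper does not specify the message format, but the asymmetric-rate pattern can be illustrated with a short pyzmq sketch: the VLA process publishes its latest locomotion command at roughly 10 Hz, while the onboard 50 Hz LMO loop simply reuses the most recent message between updates. The port, transport, and field names below are assumptions.

```python
# Illustrative ZeroMQ publisher streaming locomotion commands at ~10 Hz.
import time
import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")  # assumed port; the onboard LMO side would SUB-connect over Ethernet

while True:
    command = {"s_x": 1, "s_y": 0, "s_psi": 0, "h_star": 0.8}  # placeholder VLA output
    pub.send_json(command)
    time.sleep(0.1)  # ~10 Hz decision rate; the 50 Hz controller holds the last received command
```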

5. Experimental Setup

5.1. Datasets

The experiments utilized a combination of real-robot and human video datasets:

  • AgiBot X2 Teleoperation Data:

    • Source: Collected via physical robot teleoperation on the prototype AgiBot X2 humanoid.
    • Characteristics: Operators used a Meta Quest Pro headset for egocentric VR teleoperation of the upper body and a joystick for locomotion commands. This provided action-labeled trajectories for fine-tuning the VLA policy.
    • Scale: Each of the three primary tasks (Bag Packing, Box Loading, Cart Pushing) was executed 50 times, generating diverse trajectories.
    • Domain: Real-world humanoid loco-manipulation tasks.
    • Purpose: Used for fine-tuning all VLA-based methods in Stage III, grounding the latent actions into robot-executable commands.
  • Human Egocentric Manipulation-aware Locomotion Videos:

    • Source: Collected by the authors using a low-cost, efficient pipeline. A single operator wore a head-mounted camera (Intel RealSense D435i RGB-D or GoPro).
    • Characteristics: These are action-free videos (only visual observations) covering locomotion primitives (advancing, turning, squatting) performed goal-directed towards potential manipulation goals.
    • Scale: Approximately 300 hours of video data covering various scenes.
    • Domain: Human demonstrations of manipulation-aware locomotion.
    • Purpose: Used for pretraining the locomotion LAM in Stage I of WholeBodyVLA's unified latent learning.
  • AgiBot World Dataset (Bu et al., 2025a):

    • Source: One of the largest real-robot manipulation datasets.
    • Characteristics: Focuses on in-place bimanual manipulation.
    • Domain: Real-robot manipulation data.
    • Purpose: Used for pretraining the manipulation LAM in Stage I of WholeBodyVLA's unified latent learning. Also used to sample short arm-motion segments for structured perturbations during LMO RL policy training (Stage II).

Why these datasets were chosen:

  • Human videos: Address data scarcity for loco-manipulation knowledge by providing abundant, diverse priors at low cost.
  • AgiBot World: Provides a rich source of real-robot manipulation data to learn precise arm movements.
  • Teleoperation data: Essential for fine-tuning the VLA to the specific AgiBot X2 embodiment and grounding latent actions into robot-executable commands. The combination allows learning from diverse, cheap action-free data and then fine-tuning on limited, expensive robot-specific data.

5.2. Evaluation Metrics

The paper employs several metrics to evaluate WholeBodyVLA's performance, focusing on success rates for complex tasks and precision/stability for low-level control.

  1. Success Rate:

    • Conceptual Definition: Measures the percentage of trials where the robot successfully completes a given task or subgoal. This is a primary indicator of overall task achievement. For multi-subgoal tasks, if an earlier subgoal fails, subsequent ones are automatically considered failures.
    • Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
    • Symbol Explanation:
      • Number of Successful Trials: The count of attempts where the robot completed the task/subgoal according to predefined criteria.
      • Total Number of Trials: The total number of attempts made for the task/subgoal (e.g., 25 trials per subgoal in the real-robot experiments).
  2. Locomotion Accuracy (Position / Orientation Error):

    • Conceptual Definition: Quantifies how precisely the robot tracks desired locomotion trajectories in terms of position and orientation. This is crucial for manipulation-aware locomotion where accurate positioning is paramount. Evaluated during a settling phase after a command, capturing stop-and-settle precision.
    • Mathematical Formula (for position error): $ \text{Position Error} = \sqrt{(x_{\text{end}} - x_{\text{ref}})^2 + (y_{\text{end}} - y_{\text{ref}})^2 + (z_{\text{end}} - z_{\text{ref}})^2} $ (This is a Euclidean distance. The paper reports the mean $\pm$ standard deviation of the 2D position error, so it likely refers to the horizontal plane for locomotion.)
    • Mathematical Formula (for yaw orientation error, consistent with $\mathcal{I}_{\mathrm{dir}}$ in Stage II): $ \text{Orientation Error (Yaw)} = | \mathrm{wrap}(\psi_{\text{end}} - \psi_{\text{ref}}) | $
    • Symbol Explanation:
      • $(x_{\text{end}}, y_{\text{end}}, z_{\text{end}})$: The final position of the robot's base.
      • $(x_{\text{ref}}, y_{\text{ref}}, z_{\text{ref}})$: The reference (target) position.
      • $\psi_{\text{end}}$: The final yaw orientation of the robot's base.
      • $\psi_{\text{ref}}$: The reference (target) yaw orientation.
      • $\mathrm{wrap}(\cdot)$: A function that normalizes the angle difference to a range such as $(-\pi, \pi]$.
  3. Manipulation Stability (Center-of-Mass Sway, CoMS):

    • Conceptual Definition: Measures the robot's balance during manipulation tasks, especially when standing or squatting, by quantifying the deviation of its Center of Mass (CoM) in the horizontal plane. Lower sway indicates better stability.
    • Mathematical Formula: $ \mathrm{CoMS} = \sqrt{ \frac{1}{T} \int_0^T \| \mathbf{c}(t) - \bar{\mathbf{c}} \|^2 \, dt } $
    • Symbol Explanation:
      • CoMS: Center-of-Mass Sway.
      • $T$: The total duration of the evaluation period.
      • $\mathbf{c}(t) \in \mathbb{R}^2$: The horizontal projection of the robot's CoM trajectory at time $t$.
      • $\bar{\mathbf{c}} \in \mathbb{R}^2$: The temporal mean of the horizontal CoM trajectory over the period $T$.
      • $\| \cdot \|^2$: The squared Euclidean norm (distance).
  4. Relative Reconstruction Gain (RRG):

    • Conceptual Definition: An internal metric used to assess the quality of the Latent Action Model (LAM) by comparing its ability to reconstruct future frames against a simple temporal baseline (copying the current frame). A higher RRG indicates more predictive latent codes.
    • Mathematical Formula: $ \mathrm{RRG} = \frac{ \mathrm{MSE}_{\mathrm{base}} - \mathrm{MSE}_{\mathrm{recon}} }{ \mathrm{MSE}_{\mathrm{base}} } $
    • Symbol Explanation:
      • RRG: Relative Reconstruction Gain.
      • $\mathrm{MSE}_{\mathrm{recon}}$: Mean squared error of the reconstructed future frame $\hat{o}_{t+k}$ against the ground truth $o_{t+k}$, i.e., $\mathrm{MSE}(\hat{o}_{t+k}, o_{t+k})$.
      • $\mathrm{MSE}_{\mathrm{base}}$: Mean squared error of a baseline that estimates the future frame by directly copying the current frame $o_t$, i.e., $\mathrm{MSE}(o_t, o_{t+k})$.
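
As a small sketch, the two less standard metrics above can be computed directly from logged arrays; discretizing the CoMS integral as a mean over uniformly sampled CoM positions is an assumption about how the integral is evaluated.

```python
# Relative Reconstruction Gain (RRG) and Center-of-Mass Sway (CoMS) from logged data.
import numpy as np

def relative_reconstruction_gain(o_t, o_tk, o_hat_tk):
    """RRG = (MSE_base - MSE_recon) / MSE_base, with the copy-current-frame baseline."""
    mse_base = np.mean((o_t - o_tk) ** 2)
    mse_recon = np.mean((o_hat_tk - o_tk) ** 2)
    return (mse_base - mse_recon) / mse_base

def com_sway(com_xy):
    """CoMS: RMS deviation of the horizontal CoM trajectory (shape T x 2) from its temporal mean."""
    deviation = com_xy - com_xy.mean(axis=0)
    return np.sqrt(np.mean(np.sum(deviation ** 2, axis=1)))
```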

5.3. Baselines

The paper compares WholeBodyVLA against a comprehensive set of baselines, including modular designs, other VLA frameworks, and ablated variants of its own design, to thoroughly evaluate each component's contribution.

  1. Modular Design:

    • Description: This baseline emulates a traditional modular pipeline. A human teleoperator controls only locomotion via a joystick and FPV headset, relying on task instructions and scene context. Once navigation is complete, control is handed off to WholeBodyVLA for manipulation. After manipulation, control returns to the operator.
    • Representativeness: Represents a strong, almost oracle-like, modular approach where the navigation (locomotion) part benefits from human intelligence, providing an upper bound for such pipelines.
  2. GR00T w/LMO (Bjorck et al., 2025):

    • Description: GR00T N1.5 is a recent VLA system for whole-body control. For a fair comparison, its output is adapted: instead of directly predicting lower-body joints, GR00T predicts locomotion commands, which are then executed by WholeBodyVLA's LMO controller.
    • Representativeness: A state-of-the-art VLA framework adapted to isolate its high-level reasoning from low-level locomotion stability, allowing comparison of the VLA component.
  3. OpenVLA-OFT w/LMO (Kim et al., 2025):

    • Description: Another VLA framework that shares the same Prismatic-7B initialization as WholeBodyVLA. It is trained to predict upper-body joint actions and locomotion commands, which are then executed by the LMO controller.

    • Representativeness: Provides a baseline with a similar architectural starting point, allowing for a more direct comparison of WholeBodyVLA's unified latent learning and overall VLA training.

      Ablated Variants of WholeBodyVLA: These variants isolate the contributions of WholeBodyVLA's key components.

  4. WholeBodyVLA w/o RL:

    • Description: In this variant, the VLA directly predicts lower-body joints without the LMO policy.
    • Purpose: To assess the necessity and impact of the LMO RL policy for low-level locomotion execution.
  5. WholeBodyVLA w/ Velocity-Based RL:

    • Description: This replaces the LMO policy with a conventional velocity-tracking RL controller (reproduced and refined from HOMIE (Ben et al., 2025)).
    • Purpose: To compare the LMO policy's discrete command interface and specialized curriculum against a standard velocity-based approach.
  6. WholeBodyVLA w/o LAM:

    • Description: The VLA is directly fine-tuned from Prismatic-7B without unified latent learning (i.e., no LAM pretraining).
    • Purpose: To evaluate the contribution of learning from action-free videos via unified latent learning.
  7. WholeBodyVLA w/ Manipulation LAM:

    • Description: This variant performs latent learning only on in-place manipulation data from AgiBot World, without locomotion-aware pretraining from human videos.
    • Purpose: To isolate the impact of locomotion-aware latent learning specifically.
  8. WholeBodyVLA w/ Shared LAM:

    • Description: Instead of separate manipulation and locomotion LAMs, a single shared LAM is trained on the mixed data without modality separation.
    • Purpose: To validate the design choice of using separate LAMs due to visual modality differences.

5.4. Hardware, Tasks, and Data Collection

  • Hardware:

    • Robot Platform: Prototype Agibot X2 humanoid.

      • Arms: 7 Degrees of Freedom (DoF) each, equipped with Omnipicker grippers.
      • Waist: 1 DoF.
      • Legs: 6 DoF each.
      • Perception: An Intel RealSense D435i RGB-D camera mounted on the head for egocentric vision.
    • Teleoperation for Data Collection:

      • Upper Body: Meta Quest Pro headset for egocentric VR teleoperation.
      • Locomotion Commands: Joystick.
    • Deployment: The VLA runs on an RTX 4090 GPU workstation; the RL policy runs on a NanoPi onboard computer. The two communicate via ZeroMQ over Ethernet (a minimal messaging sketch follows the task descriptions below).

      Figure 5 (from the original paper) illustrates the hardware setup for data collection.

      Figure 5: Description of the hardware. VR and a joystick are used for data collection. The original figure is a schematic of the teleoperation setup (VR headset, joystick, and the corresponding humanoid), showing how locomotion commands and RGB observations flow into inverse-kinematics-based control.

  • Tasks: Three main task suites, each with multiple subgoals, designed to comprehensively test loco-manipulation capabilities:

    1. Bag Packing:
      • Primitives: Bimanual grasping, lateral stepping, squatting, precise placement.
      • Scenario: Grasp a paper bag, sidestep to a carton, squat, and place the bag inside.
    2. Box Loading:
      • Primitives: Squatting, bimanual grasping, turning, object placement, maintaining balance.
      • Scenario: Squat to grasp a box, stand up, turn to face a cart, and place the box onto the cart.
    3. Cart Pushing:
      • Primitives: Grasping a handle, sustained forward locomotion, reliable heading control, stability under heavy loads.
      • Scenario: Grasp the handle of a 50 kg cart with both arms and push it several meters forward stably, without lateral drift.

    Together, these tasks evaluate dual-arm coordination, squat precision, turning accuracy, and stability under heavy loads.
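The deployment split described above (VLA on a workstation, RL policy on the onboard computer, linked by ZeroMQ over Ethernet) can be sketched as a simple publisher/subscriber pair. The socket addresses, message schema, and field names below are illustrative assumptions, not the authors' actual interface.

```python
import json
import zmq

ctx = zmq.Context()

# Workstation side: publish locomotion commands and upper-body targets.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://0.0.0.0:5555")  # address/port are placeholders

def send_command(loco_cmd, upper_body_joints):
    """loco_cmd: e.g. a discrete (s_x, s_y, s_psi) triple; upper_body_joints: list of floats."""
    msg = {"loco": loco_cmd, "upper": upper_body_joints}
    pub.send_string(json.dumps(msg))

# Onboard (NanoPi) side: subscribe and forward to the RL controller.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://workstation:5555")  # hostname is a placeholder
sub.setsockopt_string(zmq.SUBSCRIBE, "")

def receive_command():
    """Block until the next command message arrives and decode it."""
    return json.loads(sub.recv_string())
```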

5.5. Training Protocol

The training follows a standard VLA recipe of large-scale pretraining followed by real-robot fine-tuning.

  • Pretraining for WholeBodyVLA (Two Steps):

    1. Stage I (LAM Pretraining): Separate manipulation LAM and locomotion LAMs are pretrained.
      • Locomotion LAM: Trained on the authors' collected low-cost egocentric locomotion videos (approx. 300 hours).
      • Manipulation LAM: Trained on the AgiBot World dataset (real-robot bimanual manipulation data).
      • Shared LAM (for ablation): Trained on mixed locomotion and manipulation data with balanced sampling.
      • Schedule: 30,000 steps, batch size 256.
    2. Stage II (VLA Pretraining): The VLA (initialized from Prismatic-7B) is trained to predict both manipulation and locomotion latent actions on the same corpus of human videos and AgiBot World data, using the pretrained LAMs' discrete codes as pseudo-action labels (see the sketch after this list).
      • Schedule: 20,000 steps, batch size 1024.
  • Pretraining for VLA Baselines (GR00T, OpenVLA-OFT):

    • Their publicly released pretrained models are used.
  • Finetuning Stage (Stage III):

    • All methods (WholeBodyVLA and baselines) are fine-tuned on the same AgiBot X2 teleoperation trajectories collected for all three tasks.
    • For the main experiments (Section 4.2), one model is fine-tuned on all three tasks jointly, rather than task-specific fine-tuning.
    • Schedule: 10,000 steps, batch size 64. Fine-tuning uses LoRA (Low-Rank Adaptation of Large Language Models) (Hu et al., 2022).
  • LMO Controller Training:

    • The LMO controller (and its velocity-based baseline) is trained separately in simulation and remains fixed during teleoperation data collection and final deployment. This means its performance is independent of the VLA fine-tuning.
  • Computational Resources:

    • LAM and VLA training: 8× NVIDIA H100 GPUs.
    • RL policy training: Single NVIDIA H100 GPU.
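Under the assumption that the frozen LAM assigns each video clip a discrete codebook index serving as a pseudo-action label, Stage II reduces to a standard classification objective on action-free data. A minimal PyTorch-style sketch; the module interfaces and shapes are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stage2_step(vla, lam_encoder, frames_t, frames_tk, instruction, optimizer):
    """One VLA pretraining step on action-free video.

    lam_encoder: frozen VQ-VAE LAM; maps (o_t, o_{t+k}) pairs to discrete code indices.
    vla: predicts logits over the LAM codebook given (o_t, instruction).
    All interfaces here are assumptions for illustration.
    """
    with torch.no_grad():
        # Pseudo-action label: latent code summarizing the visual change o_t -> o_{t+k}.
        code_idx = lam_encoder(frames_t, frames_tk)      # (B,) long tensor

    logits = vla(frames_t, instruction)                  # (B, codebook_size)
    loss = F.cross_entropy(logits, code_idx)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```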

5.6. Real-Robot Experiment Protocols

To ensure fair and reproducible comparisons in real-robot experiments:

  • Adjudication: Two independent judges, blinded to the active policy and naive to the data collection process, adjudicated success/failure for each subgoal. Consensus labels were reached. Method order was randomized to mitigate subjective bias.
  • Sequential Evaluation: For tasks with multiple subgoals, if the first subgoal failed, the subsequent subgoals were automatically counted as failures.
  • Trials: Each task was evaluated over 25 trials. The mean score across all subgoals is reported.
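Since each subgoal is attempted 25 times and scored sequentially, the reported average score is simply total successes divided by total subgoal attempts. A small sketch, using the WholeBodyVLA counts from Table 2 below:

```python
# Average score over all subgoals: sum of successes / (num_subgoals * trials).
# Counts below are the WholeBodyVLA row of Table 2 (out of 25 trials each).
subgoal_successes = [23, 13, 19, 17, 23, 22]
trials_per_subgoal = 25

avg_score = sum(subgoal_successes) / (len(subgoal_successes) * trials_per_subgoal)
print(f"{avg_score:.1%}")  # -> 78.0%
```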

6. Results & Analysis

This section delves into the experimental results, validating WholeBodyVLA's effectiveness, the contribution of its components, and its generalization capabilities.

6.1. Core Results Analysis

The paper evaluates WholeBodyVLA against various baselines across three complex loco-manipulation task suites: Bag Packing, Box Loading, and Cart Pushing. Each task is broken down into two subgoals to assess performance granularly.

The following are the results from Table 2 of the original paper:

| Method | Bag Packing: Grasp Bags | Bag Packing: Move & Squat | Box Loading: Squat & Grasp | Box Loading: Rise & Turn | Cart Pushing: Grab Handle | Cart Pushing: Push Ahead | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Modular Design | 22/25 | 12/25 | 9/25 | 9/25 | 22/25 | 22/25 | 64.0% |
| GR00T w/LMO | 20/25 | 10/25 | 6/25 | 4/25 | 12/25 | 11/25 | 42.0% |
| OpenVLA-OFT w/LMO | 19/25 | 6/25 | 12/25 | 12/25 | 22/25 | 14/25 | 56.7% |
| WholeBodyVLA (ours) | 23/25 | 13/25 | 19/25 | 17/25 | 23/25 | 22/25 | 78.0% |
| WholeBodyVLA w/ vel.-based RL | 22/25 | 1/25 | 16/25 | 3/25 | 24/25 | 15/25 | 54.0% |
| WholeBodyVLA w/o LAM | 15/25 | 4/25 | 8/25 | 6/25 | 16/25 | 10/25 | 39.3% |
| WholeBodyVLA w/ manip. LAM | 24/25 | 7/25 | 17/25 | 11/25 | 20/25 | 14/25 | 63.3% |
| WholeBodyVLA w/ shared LAM | 18/25 | 11/25 | 16/25 | 16/25 | 20/25 | 18/25 | 66.0% |

Analysis:

  • Overall Superiority: WholeBodyVLA (ours) achieves the highest average success rate of 78.0%, outperforming all baselines: 14.0 points above the Modular Design (64.0%) and 21.3 points above OpenVLA-OFT w/LMO (56.7%). This strong performance validates its ability to handle loco-manipulation tasks in practice.
  • Task-Specific Strengths:
    • For Bag Packing, WholeBodyVLA performs strongly in both subgoals (23/25 Grasp Bags, 13/25 Move & Squat).
    • For Box Loading, it shows particular strength in Squat & Grasp (19/25) and Rise & Turn (17/25), indicating effective coordination of squatting, grasping, and turning under balance constraints.
    • For Cart Pushing, it achieves near-perfect Grab Handle (23/25) and excellent Push Ahead (22/25), demonstrating robustness under heavy loads and sustained locomotion.
  • Baseline Performance:
    • Modular Design performs reasonably well (64.0%), especially in the first subgoals (e.g., Grasp Bags, Grab Handle), but struggles with the more complex locomotion-integrated subgoals (e.g., Move & Squat, Squat & Grasp, Rise & Turn), likely due to brittle skill boundaries and error accumulation.
    • GR00T w/LMO and OpenVLA-OFT w/LMO, despite using the LMO controller, perform worse than WholeBodyVLA, suggesting that the VLA's unified latent learning and overall architecture are critical. GR00T w/LMO struggles notably with Box Loading (6/25 Squat & Grasp, 4/25 Rise & Turn), highlighting the challenge of generalizing VLA reasoning to complex loco-manipulation.

6.2. How do Action-Free Videos Contribute to Loco-Manipulation?

To answer Q2, the paper compares WholeBodyVLA with ablated versions to demonstrate the impact of unified latent learning from action-free videos.

Analysis from Table 2:

  • WholeBodyVLA w/o LAM: This variant, which skips latent pretraining on action-free videos, achieves only 39.3% average success, a drop of 38.7 points from the full WholeBodyVLA (78.0%). This stark difference strongly indicates that unified latent learning extracts useful priors from action-free human videos and substantially enhances downstream policy learning.
  • WholeBodyVLA w/ manip. LAM: This variant performs latent learning only for in-place manipulation (using AgiBot World) and lacks locomotion-aware pretraining. It achieves 63.3% average success, 14.7 points below the full model. The largest gaps appear in tasks requiring substantial locomotion before manipulation (e.g., Move & Squat: 7/25 vs 13/25; Rise & Turn: 11/25 vs 17/25), confirming that locomotion-aware latent learning from human videos is crucial for robust loco-manipulation.
  • WholeBodyVLA w/ shared LAM: This variant trains a single LAM on the mixed data without modality separation. It achieves 66.0% average success, 12.0 points below the full model (78.0%). This supports the design choice of decoupling the manipulation and locomotion LAMs to handle their distinct visual characteristics, although the shared LAM still improves markedly over w/o LAM.

Data Scaling under Generalization Settings (Figure 3): The paper further investigates the impact of latent pretraining on generalization and data efficiency through data scaling curves.

Figure 3: Real-world generalization of WholeBodyVLA. Top: variations in robot start-pose and scene appearance, with data-scaling curves. Bottom: comparison on extended tasks with different baselines. See videos at https://opendrivelab.com/WholeBodyVLA.

  • Locomotion Generalization (Figure 3(a) - Finetuning Data Scaling for Start-Pose Generalization): This plot shows average success rates for tasks primarily evaluating locomotion generalization (e.g., varying robot start-poses).
    • Models pretrained with more human egocentric videos (e.g., 50% or 100%) consistently achieve higher success rates across all amounts of teleoperation fine-tuning data.
    • Crucially, a model pretrained with 50% of the human videos (and 100% of the AgiBot World data) needs only 25 teleoperation trajectories for fine-tuning to match a variant pretrained with less than 25% of the human videos, which requires 200 trajectories to reach similar performance. Human video pretraining therefore substantially reduces reliance on expensive teleoperation data.
  • Manipulation Generalization (Figure 3(b) - Finetuning Data Scaling for Scene Generalization): This plot focuses on manipulation generalization (e.g., varying objects, layouts, appearance).
    • Similar trends are observed: stronger latent pretraining (more AgiBot World data for manipulation LAM) leads to higher success rates and reduces the required amount of fine-tuning data.

      Conclusion on Action-Free Videos: These results unequivocally show that human egocentric videos and unified latent learning are vital. They extract valuable priors that lead to substantial improvements in VLA generalization and significantly reduce the need for costly teleoperation data, especially under a fixed fine-tuning budget.

6.3. How does LMO Contribute to Loco-Manipulation?

To address Q3, the paper evaluates the Loco-Manipulation-Oriented (LMO) RL policy's contribution to precision and stability.

Analysis from Table 2 (Real-Robot Results):

  • WholeBodyVLA w/ vel.-based RL: This variant replaces the LMO policy with a conventional velocity-based RL controller. Its average success rate is 54.0%, 24.0 points below the full WholeBodyVLA (78.0%).
  • Most of this gap (91.7% of the lost successes) stems from failures in the second subgoal of each task, which involves more complex and sustained locomotion (e.g., Move & Squat, Rise & Turn, Push Ahead). For instance, Move & Squat drops from 13/25 to 1/25 and Rise & Turn from 17/25 to 3/25, indicating that the velocity-based controller lacks the precision and stability required for the locomotion components of loco-manipulation.

Extended Tasks (Figure 3(c) - Bottom Panel):

  • The velocity-based RL variant consistently fails substantially more often than WholeBodyVLA on extended tasks designed to stress locomotion and long-horizon execution (e.g., uneven-terrain traversal, long multi-step sequences, following extended floor markings for visual navigation). These failures are attributed to suboptimal behaviors like stumbling, path deviation, or turning while advancing, which are inherent limitations of the velocity-based controller.

Ablation Studies in MuJoCo (Table 3): To further dissect the LMO design, controlled ablations are performed in simulation. The following are the results from Table 3 of the original paper:

Locomotion accuracy columns report position / quaternion error; the Standing and Squatting columns report manipulation stability (CoMS).

| Method | Forward & Backward | Left & Right | Turning | Standing | Squatting |
| --- | --- | --- | --- | --- | --- |
| LMO (ours) | 0.21±0.01 / 0.05±0.01 | 0.55±0.01 / 0.06±0.01 | 0.05±0.01 / 0.19±0.01 | 0.03±0.02 | 0.03±0.02 |
| LMO w/o Eq. 3 | 0.24±0.02 / 0.07±0.01 | 0.61±0.02 / 0.09±0.01 | 0.05±0.01 / 0.28±0.02 | 0.04±0.03 | 0.03±0.02 |
| LMO w/o stage 2 | 0.27±0.02 / 0.09±0.01 | 0.72±0.03 / 0.11±0.02 | 0.20±0.01 / 0.32±0.03 | 0.05±0.04 | 0.07±0.03 |
| LMO w/o stage 1 | 0.30±0.03 / 0.11±0.01 | 0.66±0.04 / 0.13±0.03 | 0.46±0.01 / 0.34±0.04 | 0.05±0.03 | 0.04±0.03 |
| Vel.-based policy | 0.24±0.04 / 0.12±0.02 | 0.60±0.05 / 0.17±0.06 | 0.26±0.01 / 0.20±0.06 | 0.06±0.04 | 0.05±0.04 |

Analysis of Ablations:

  • LMO w/o Eq. 3 (removing the directional accuracy reward $\mathcal{I}_{\mathrm{dir}}$): Significantly degrades turning precision (yaw orientation error for turning rises from 0.19 to 0.28), demonstrating the importance of this reward term for accurate directional control.

  • LMO w/o stage 2 (Disabling precision and stability refinement): Increases trajectory error across all locomotion primitives (e.g., yaw error for turning jumped from 0.19 to 0.32) and significantly increases squatting sway (from 0.03 to 0.07), highlighting the necessity of targeted refinement for loco-manipulation tasks.

  • LMO w/o stage 1 (Disabling basic gait acquisition): This variant produces the largest errors overall (e.g., yaw error for turning at 0.34, position error for forward/backward at 0.30), indicating that without the initial stage to acquire stable gaits, the policy fails fundamentally.

  • Vel.-based policy: Performs worse than the full LMO across most metrics, most notably in turning position error (0.26 vs 0.05) and standing stability (CoMS: 0.06 vs 0.03), with a smaller gap in turning yaw error (0.20 vs 0.19). This confirms that LMO's discrete command interface and two-stage curriculum yield more precise trajectory tracking and more stable whole-body coordination than velocity-based control.

    Conclusion on LMO: The LMO RL policy is critical for reliable long-range and multi-step loco-manipulation. Its discrete command interface, two-stage curriculum, and structured perturbations effectively mitigate the decision-execution misalignment and ensure stable, precise execution of core movements, which is paramount for manipulation-aware locomotion.
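As a rough illustration of how such metrics could be computed from a simulated rollout, the snippet below evaluates final position/heading error against a commanded target and center-of-mass sway during an in-place hold. It is a generic numpy sketch under assumed array layouts, not the paper's exact metric code (in particular, the paper reports quaternion error and its own CoMS definition).

```python
import numpy as np

def locomotion_accuracy(base_xy, base_yaw, target_xy, target_yaw):
    """Final-position and final-heading error of one rollout.

    base_xy: (T, 2) base positions; base_yaw: (T,) headings in radians.
    """
    pos_err = float(np.linalg.norm(base_xy[-1] - target_xy))
    # Wrap the heading difference into [-pi, pi] before taking its magnitude.
    yaw_err = float(np.abs(np.arctan2(np.sin(base_yaw[-1] - target_yaw),
                                      np.cos(base_yaw[-1] - target_yaw))))
    return pos_err, yaw_err

def com_sway(com_xy):
    """Horizontal center-of-mass sway while standing/squatting in place,
    taken here (as an assumption) as the std. dev. of CoM displacement
    from its mean position over the hold."""
    return float(np.std(np.linalg.norm(com_xy - com_xy.mean(axis=0), axis=1)))
```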

6.4. Does WholeBodyVLA Generalize to Long-Horizon and Extended Loco-Manipulation Scenarios?

To address Q4, the paper evaluates WholeBodyVLA's extensibility to more challenging and diverse scenarios, as shown in the bottom row of Figure 3 (c) and further detailed in Appendix C.1.

  • Extended Tasks: These include traversing uneven terrain (e.g., steps, foam, gravel), executing long-horizon multi-step sequences (e.g., walking to a table, grasping two bags, placing them in a drawer, closing it), following floor markings for visual navigation, and everyday loco-manipulation activities like wiping a table and vacuum cleaning.
  • Generalization Performance: As indicated in Figure 3(c), WholeBodyVLA consistently remains superior across all these challenging settings compared to the baselines. This demonstrates that the framework scales effectively beyond the benchmark tasks and preserves robust generalization capabilities. For instance, in terrain traversal and visual navigation, WholeBodyVLA significantly outperforms OpenVLA-OFT w/LMO and GR00T w/LMO, highlighting its robust locomotion and visual understanding. Its performance in wiping stains and vacuum cleaning shows its ability to adapt to practical household tasks.

6.4.1. Additional Visual Generalization under Object and Load Variations

Further generalization experiments (detailed in Appendix C.2 and Figure 6) perturb the three main loco-manipulation tasks by modifying manipulated objects and their loads while keeping the scene layout fixed.

Figure 6: Visual generalization under object and load variations. We evaluate Bag Packing, Box Loading, and Cart Pushing under unseen variations of bag appearance, box contents, and cart loads. WholeBodyVLA consistently outperforms GR00T and OpenVLA-OFT across all three tasks, indicating robustness to these distribution shifts.

Analysis:

  • WholeBodyVLA maintains the highest success rates across all perturbed settings (e.g., different appearance/size/weight for bags, plastic containers instead of bags in boxes, 60kg barbell plates instead of a carton on the cart). This demonstrates strong visual generalization and robustness to distribution shifts in object appearance and physical properties (loads), which is crucial for real-world deployment.

6.4.2. Effect of Proprioceptive State

To verify that WholeBodyVLA primarily relies on visual observations rather than proprioceptive state injected into the action decoder, an ablation WholeBodyVLA w/o state is tested.

The following are the results from Table 6 of the original paper:

| Method | Bag Packing: Grasp Bags | Bag Packing: Move & Squat | Box Loading: Squat & Grasp | Box Loading: Rise & Turn | Cart Pushing: Grab Handle | Cart Pushing: Push Ahead | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WholeBodyVLA w/o state | 21/25 | 12/25 | 22/25 | 14/25 | 24/25 | 22/25 | 76.7% |
| WholeBodyVLA | 23/25 | 13/25 | 19/25 | 17/25 | 23/25 | 22/25 | 78.0% |

The following are the results from Table 7 of the original paper:

| Method | Bag Packing*: Grasp Bags | Bag Packing*: Move & Squat | Box Loading*: Squat & Grasp | Box Loading*: Rise & Turn | Cart Pushing*: Grab Handle | Cart Pushing*: Push Ahead | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WholeBodyVLA w/o state | 12/25 | 5/25 | 14/25 | 12/25 | 21/25 | 21/25 | 76.7% |
| WholeBodyVLA | 15/25 | 9/25 | 15/25 | 15/25 | 22/25 | 20/25 | 64.0% |

Analysis:

  • In the original setting (Table 6), removing the state input causes only a minor drop (76.7% vs 78.0%).
  • Under visual variations (Table 7), the full model outperforms the w/o-state variant on most subgoals (e.g., 9/25 vs 5/25 on Move & Squat and 15/25 vs 12/25 on Rise & Turn). Note that the 76.7% listed for the w/o-state row is inconsistent with its own subgoal counts, which sum to 85/150 ≈ 56.7%; it appears to be carried over from Table 6 in error, whereas the full model's 64.0% matches its counts (96/150).
  • Despite some variance, the w/o-state variant still achieves comparable completion rates. This implies that WholeBodyVLA has largely learned to accomplish loco-manipulation tasks from visual input, and that its visual generalization does not hinge on low-level proprioceptive information.

6.5. Failure Mode Analysis

To understand remaining limitations, a post-hoc failure analysis is conducted using start-pose generalization experiments. For each locomotion primitive (advance, sidestep, squat, turn), 50 failed rollouts are manually annotated.

Figure 7: Failure modes of WholeBodyVLA in start-pose generalization. For each locomotion primitive—(a) advance, (b) sidestep, (c) squat, and (d) turn—50 failed trials are collected from the start-pose generalization experiments (pick object and place into basket). Each Sankey diagram decomposes failures into locomotion vs. pick/place errors and finer-grained causes such as object/basket unreachable, wrong orientation, early stop, overshoot, collisions, stumble, and inaccurate grasp or placement.

Analysis:

  • Dominant Failure Type: For horizontal locomotion primitives (advance, sidestep, turn), locomotion-related issues account for the majority of failures. Most unsuccessful episodes pass through the "object/basket unreachable" node. This suggests that even moderate stance or orientation deviations during approach propagate into infeasible pick-and-place attempts.
  • Less Frequent Catastrophic Failures: Severe collisions or stumbles occur much less frequently.
  • Squat Primitive: Failures are more evenly divided between locomotion (incorrect final height, contact during descent) and pick/place errors (imprecise arm trajectories, grasp alignment).
  • Key Insight: The dominant failure modes are not catastrophic behaviors but rather small but systematic stance and orientation errors during approach. Improving approach precision, especially for turning, lateral stepping, and squatting, is expected to directly reduce downstream manipulation failures.
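The failure decomposition behind these Sankey diagrams amounts to tallying annotated (category, cause) pairs per primitive into flow counts. A minimal sketch with hypothetical annotation records; the cause vocabulary mirrors Figure 7, but the data structure and labels are assumptions.

```python
from collections import Counter

# Each annotated failed rollout: (primitive, failure category, fine-grained cause).
annotations = [
    ("advance", "locomotion", "object/basket unreachable"),
    ("advance", "pick/place", "inaccurate grasp"),
    ("turn", "locomotion", "wrong orientation"),
    # ... 50 annotations per primitive in the actual analysis
]

def sankey_flows(annotations, primitive):
    """Return (source, target, count) triples for one primitive's Sankey diagram."""
    cat_counts = Counter(cat for prim, cat, _ in annotations if prim == primitive)
    cause_counts = Counter((cat, cause) for prim, cat, cause in annotations
                           if prim == primitive)
    flows = [(primitive, cat, n) for cat, n in cat_counts.items()]
    flows += [(cat, cause, n) for (cat, cause), n in cause_counts.items()]
    return flows

print(sankey_flows(annotations, "advance"))
```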

6.6. Task Execution Time

Average execution duration for successful trials is reported. The following are the results from Table 8 of the original paper:

All durations are in seconds.

| Method | Bag Packing: Grasp Bags | Bag Packing: Move & Squat | Box Loading: Squat & Grasp | Box Loading: Rise & Turn | Cart Pushing: Grab Handle | Cart Pushing: Push Ahead |
| --- | --- | --- | --- | --- | --- | --- |
| Modular Design | 19.2 | 23.0 | 21.5 | 7.9 | 12.0 | 11.7 |
| GR00T w/LMO | 26.3 | 38.6 | 21.1 | 8.0 | 19.5 | 13.8 |
| OpenVLA-OFT w/LMO | 23.6 | 35.9 | 33.2 | 13.8 | 16.9 | 13.1 |
| WholeBodyVLA (ours) | 18.4 | 29.7 | 16.8 | 7.6 | 11.3 | 12.7 |

Analysis:

  • WholeBodyVLA generally achieves comparable or faster execution times for many subgoals compared to baselines, especially for Grasp Bags (18.4s vs. 19.2s for Modular), Squat & Grasp (16.8s vs. 21.5s for Modular), and Rise & Turn (7.6s vs. 7.9s for Modular). This suggests that its unified and robust control leads to efficient task completion.
  • For Move & Squat, WholeBodyVLA (29.7s) is slower than the Modular Design (23.0s) but faster than GR00T w/LMO (38.6s) and OpenVLA-OFT w/LMO (35.9s). This suggests that while WholeBodyVLA succeeds more often, its locomotion during the move-and-squat segment is more deliberate.

6.7. Shared Latent Action Space Across Embodiments

The paper highlights that its three-stage training pipeline naturally leads to a common latent action space for human and robot data.

Figure 8: Cross-domain retrieval with shared latent actions. Human and robot clips retrieved for the same latent action (e.g., go forward, turn left, squat) exhibit consistent semantics across domains.

Analysis:

  • Mechanism: Stage I trains VQ-VAE LAMs as inverse dynamics models purely from visual videos, where discrete codes summarize visual changes. Stage II uses these LAMs to supervise the VLA. Stage III then grounds these latents to robot-specific joint targets using teleoperation data.
  • Evidence: Figure 8 visually demonstrates this by showing that for the same latent action (e.g., "Go Forward," "Turn Left," "Squat"), the retrieved clips exhibit consistent semantics across both human-collected videos (left) and teleoperated robot demonstrations (right).
  • Implication: This shared latent space is crucial for transfer learning and effectively leverages abundant action-free human data to inform robot behavior, supporting the strong performance gap between WholeBodyVLA and WholeBodyVLA w/o LAM (Table 2).
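A minimal sketch of such cross-domain retrieval, assuming a frozen LAM encoder that maps frame pairs to discrete codebook indices; the function names and the clip data structure are hypothetical.

```python
from collections import defaultdict

def build_code_index(clips, lam_encode):
    """Group clips from either domain by their discrete latent-action code.

    clips: iterable of dicts like {"domain": "human" | "robot",
                                   "o_t": ..., "o_tk": ..., "path": ...}
    lam_encode: frozen LAM encoder mapping (o_t, o_tk) -> int code index.
    """
    index = defaultdict(list)
    for clip in clips:
        code = lam_encode(clip["o_t"], clip["o_tk"])
        index[code].append((clip["domain"], clip["path"]))
    return index

def retrieve(index, code):
    """Return human and robot clips that share the same latent action code."""
    matches = index.get(code, [])
    human = [path for domain, path in matches if domain == "human"]
    robot = [path for domain, path in matches if domain == "robot"]
    return human, robot
```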

6.8. LAM Quality Assessment with Relative Reconstruction Gain (RRG)

Beyond downstream VLA performance, the quality of the LAMs is directly assessed using Relative Reconstruction Gain (RRG).

The following are the results from Table 9 of the original paper:

| Method | Bag Packing: Grasp | Bag Packing: Move | Bag Packing: Squat | Bag Packing: Place | Box Loading: Squat | Box Loading: Grasp | Box Loading: Rise | Box Loading: Turn | Box Loading: Place | Cart Pushing: Grab | Cart Pushing: Push |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| manip. LAM | 21.78 | 23.64 | 18.71 | 24.73 | 21.22 | 25.09 | 22.92 | 28.15 | 23.96 | 19.12 | 19.92 |
| shared LAM | 19.70 | 23.58 | 20.62 | 25.69 | 19.41 | 17.38 | 23.43 | 27.68 | 23.61 | 18.49 | 17.79 |
| loco. LAM | 16.39 | 25.77 | 29.46 | 20.60 | 22.72 | 20.24 | 25.40 | 30.74 | 24.81 | 15.27 | 20.27 |

Analysis:

  • The RRG metric shows that the separately trained LAMs (manipulation LAM and locomotion LAM) generally achieve better downstream reconstruction performance compared to the shared LAM variant.
    • For manipulation-heavy primitives (e.g., Grasp, Place), the manip. LAM often has higher RRG (e.g., Grasp Bags: 21.78 vs 19.70 for shared).
    • For locomotion-heavy primitives (e.g., Move, Squat, Turn), the loco. LAM tends to have higher RRG (e.g., Squat in Box Loading: 22.72 vs 19.41 for shared; Turn in Box Loading: 30.74 vs 27.68 for shared).
  • The shared LAM variant underperforms the separate models on both manipulation and locomotion primitives in many cases. This quantitative result, together with the real-robot comparison in Table 2, suggests inherent conflicts between locomotion-specific and manipulation-specific objectives when a single LAM is trained jointly on mixed data, justifying WholeBodyVLA's design of separate LAMs.

7. Conclusion & Reflections

7.1. Conclusion Summary

WholeBodyVLA pioneers a Vision-Language-Action (VLA) framework enabling humanoid robots to perform large-space loco-manipulation tasks autonomously in the real world. Its core innovation lies in two synergistic components: unified latent learning and a loco-manipulation-oriented (LMO) RL policy. Unified latent learning effectively addresses the data scarcity challenge by leveraging abundant, low-cost action-free egocentric videos to acquire rich loco-manipulation priors, using separate Latent Action Models (LAMs) for manipulation and locomotion. The LMO RL policy then ensures precise and stable execution of fundamental loco-manipulation movements by employing a discrete command interface, mitigating the decision-execution misalignment common in prior velocity-tracking controllers. Comprehensive experiments on the AgiBot X2 humanoid demonstrate superior performance over baselines, strong generalization to novel scenarios, and high extensibility across a broad range of tasks, marking a significant step towards general-purpose embodied agents.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Long-Horizon and Dexterous Tasks: While effective, WholeBodyVLA still faces challenges in reliably handling extremely long-horizon and highly dexterous tasks. These tasks often require more complex reasoning and finer motor control.

  • Mapping and Memory: Future work will focus on incorporating lightweight mapping and memory capabilities. This would enable extended planning and allow the robot to build and utilize internal representations of its environment over longer durations, improving its ability to navigate and interact in complex spaces.

  • Active Perception Strategies: To improve robustness in cluttered or dynamic environments, developing active perception strategies is crucial. This means the robot would intelligently choose where and how to look to gather the most relevant information for its current task, rather than relying on passive observation.

    These directions aim to enhance WholeBodyVLA's scalability and generalization, paving the way for more versatile real-world humanoid loco-manipulation.

7.3. Personal Insights & Critique

This paper presents a robust and well-thought-out framework for a critical problem in humanoid robotics. The integration of unified latent learning and a specialized LMO RL policy is particularly insightful.

Inspirations and Applications:

  • Data Efficiency: The approach of learning loco-manipulation priors from low-cost, action-free egocentric videos is a powerful paradigm for other data-scarce robotic domains. This can significantly democratize robot learning, moving away from expensive teleoperation or MoCap setups.
  • Bridging the High-Low Gap: The LMO RL policy's discrete command interface and two-stage curriculum provide a valuable blueprint for designing low-level controllers that are explicitly tailored to the needs of high-level planners in complex, dynamic tasks. This decision-execution alignment is a common bottleneck in hierarchical control systems.
  • Embodied AI: The strong generalization and extensibility across various loco-manipulation tasks (e.g., uneven terrain, visual navigation, household chores) demonstrate WholeBodyVLA's potential as a foundation for truly general-purpose embodied AI capable of operating in diverse human environments. Its ability to learn from human demonstrations and then adapt to real-robot specifics is a key step in this direction.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • "Lightweight" Decoder: While the paper mentions a "lightweight decoder" ff for grounding latent actions to robot-specific commands, the details of its learning process (e.g., how it maps discrete latent tokens to continuous joint angles and locomotion commands, and how robot state sts_t is incorporated) could be further elaborated. The state injection in the decoder, while shown to not be the primary driver of visual generalization, still contributes, suggesting some reliance.

  • Discrete Command Granularity: The LMO policy's discrete command interface ($s_x, s_y, s_\psi \in \{-1, 0, 1\}$) is effective for precise start-stop and directional control. However, for very fine-grained or highly dynamic continuous motions that future dexterous loco-manipulation may require (e.g., smoothly adjusting speed while balancing on a narrow beam), a purely discrete interface may be limiting compared to continuous velocity or force control. The reference shaping helps, but the underlying discrete nature remains.

  • Scalability of Human Video Collection: While stated as "low-cost," scaling human egocentric video collection to truly open-ended, diverse environments and tasks, with sufficient variability to cover all possible loco-manipulation scenarios, still presents a logistical challenge. The current dataset of 300 hours, while significant, might represent a fraction of what real-world general intelligence would demand.

  • Sim-to-Real Gap for LMO: The LMO policy is trained in simulation. While domain randomization is extensive, the transition from simulation to the real AgiBot X2 robot for such a complex whole-body controller can still present a sim-to-real gap. Further analysis on the residual sim-to-real challenges for the LMO itself would be valuable.

  • Interpretability of Latent Actions: While the paper shows shared latent actions across embodiments, a deeper analysis into the interpretability of these discrete tokens could enhance understanding and debugging. What are the specific visual features the LAMs learn to associate with each discrete action token?

    Overall, WholeBodyVLA is a very promising contribution. It systematically tackles key challenges in humanoid loco-manipulation and offers a robust, data-efficient, and effective solution that pushes the boundaries of autonomous humanoid control.
