X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
TL;DR Summary
The paper introduces X-VLA, a scalable Vision-Language-Action model utilizing a soft-prompted Transformer architecture. By integrating learnable embeddings for diverse robot data sources, X-VLA achieves state-of-the-art performance across simulations and real robots, demonstrating strong generalization, dexterity, and rapid adaptation across embodiments, environments, and tasks.
Abstract
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which in unity empower VLA models with effective exploitation of varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation-X-VLA-0.9B simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results on a wide axes of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
The title clearly states the paper's core contributions:
- X-VLA: The name of the proposed model.
- Soft-Prompted Transformer: The key architectural innovation, which uses soft prompts within a Transformer model.
- Scalable Cross-Embodiment Vision-Language-Action Model: The model's intended purpose and capability. It is a Vision-Language-Action (VLA) model designed to work across different robot types (cross-embodiment) and is built to be scalable, meaning its performance can improve with more data and a larger model size.
1.2. Authors
Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan.
The authors are affiliated with several prominent institutions, including the Institute for AI Industry Research (AIR) at Tsinghua University, Shanghai AI Laboratory, and Peking University. This collaboration between academic and research-focused labs suggests a strong background in both theoretical AI and its practical applications in robotics.
1.3. Journal/Conference
The paper was submitted to arXiv, a preprint server for academic papers, with a listed submission date of October 11, 2025. As a preprint, it has not yet undergone formal peer review for a conference or journal. However, arXiv is a standard platform for quickly disseminating cutting-edge research in fast-moving fields like AI and robotics.
1.4. Publication Year
2025 (as per the arXiv submission date).
1.5. Abstract
The abstract introduces the challenge of training generalist Vision-Language-Action (VLA) models on large-scale, diverse robotic datasets collected from different platforms (cross-embodiment). The core problem is the heterogeneity (i.e., differences in hardware, cameras, action spaces) across these datasets, which can confuse the model.
To solve this, the authors propose a novel Soft Prompt approach. This involves introducing a small set of learnable embeddings (prompts) for each distinct data source. These prompts act as embodiment-specific guides, allowing the VLA model to effectively use the features from different robots.
The resulting model, X-VLA, is built on a simple and scalable architecture using standard Transformer encoders and a flow-matching technique for action generation. The paper presents a 0.9 billion parameter version, X-VLA-0.9B, which is evaluated on 6 simulation benchmarks and 3 real-world robots. The results show that X-VLA-0.9B achieves state-of-the-art (SOTA) performance, demonstrating strong capabilities in dexterity and quick adaptation to new robots, environments, and tasks.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.10274
- PDF Link: https://arxiv.org/pdf/2510.10274v1.pdf
- Publication Status: This is a preprint available on arXiv and has not yet been officially published in a peer-reviewed venue.
2. Executive Summary
2.1. Background & Motivation
The long-standing goal in robotics is to create a single, general-purpose "generalist" agent that can understand human instructions and perform a wide variety of tasks on different types of robots. Recent advances in large-scale AI models, like Large Language Models (LLMs) and Vision-Language Models (VLMs), have paved the way for Vision-Language-Action (VLA) models, which extend these architectures to control robots.
The key to creating a powerful VLA model is to train it on massive, diverse datasets. However, a major bottleneck arises from the data heterogeneity of robotic datasets. Data is collected from various robots (embodiments) with different:
- Physical forms (e.g., single-arm, bimanual).
- Action spaces (e.g., joint control vs. end-effector control).
- Camera setups (e.g., fixed top-down view, wrist-mounted camera).
- Task distributions and environments.

This heterogeneity creates significant distributional shifts, confusing a single model trying to learn from all sources simultaneously. Existing methods often tackle this by assigning different "action heads" (the final layers of the model that output actions) for each robot type. However, this is a superficial fix that only addresses the output format and fails to handle deeper inconsistencies in visual and physical properties, hindering the model's ability to learn shared, generalizable knowledge.
The paper's entry point is to treat these different data sources as distinct but related "tasks." Inspired by prompt learning in NLP, the authors propose that the model can be explicitly guided to handle this heterogeneity using soft prompts.
2.2. Main Contributions / Findings
The paper makes several key contributions:
- A Novel Soft Prompt Mechanism for Robotics: It introduces the idea of assigning a unique set of learnable vector embeddings (soft prompts) to each data source. These prompts are fed into the model along with the standard inputs (images, language, state). They act as a "signature" for the data's origin, allowing the model to internally adjust its processing to account for embodiment-specific characteristics from the earliest stages of feature fusion. This is a parameter-efficient and scalable solution to the data heterogeneity problem.
- X-VLA: A Simple and Scalable Architecture: The paper proposes X-VLA, a VLA model built with a clean and effective architecture. It uses a pretrained VLM for vision-language understanding and a standard Transformer encoder for fusing multimodal information and generating actions via a flow-matching policy. Its simplicity makes it highly scalable.
- A Robust Two-Phase Training and Adaptation Recipe:
  - Phase I (Pretraining): The model is pretrained on a large, mixed-robot dataset, learning a general, embodiment-agnostic policy.
  - Phase II (Adaptation): For a new robot, a new soft prompt is introduced and trained, while the main model backbone is either frozen or fine-tuned. This allows for rapid and parameter-efficient adaptation to new domains.
- State-of-the-Art Performance: A 0.9B parameter version, X-VLA-0.9B, achieves new SOTA results across a comprehensive suite of 6 simulation benchmarks and 3 real-world robot platforms. It demonstrates superior generalization, dexterity (e.g., on a complex cloth-folding task), and data-efficient adaptation. Notably, it can achieve performance comparable to much larger models with significantly fewer tunable parameters during adaptation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Vision-Language-Action (VLA) Models
A Vision-Language-Action (VLA) model is a type of AI model designed to control a robot. It takes in multimodal inputs—typically vision (camera images), language (a human instruction, e.g., "pick up the red block"), and proprioception (the robot's own state, like joint angles or gripper position)—and outputs a sequence of actions for the robot to execute. VLAs are often built by extending large Vision-Language Models (VLMs), which are pretrained on vast amounts of internet data (images and text) and thus possess strong general-purpose reasoning and visual understanding capabilities. The goal is to transfer this "web knowledge" to the physical world of robotics.
3.1.2. Transformers
The Transformer is a neural network architecture that has become the foundation for most modern large-scale AI models. Its key innovation is the self-attention mechanism. Instead of processing data sequentially like older architectures (e.g., RNNs), self-attention allows the model to weigh the importance of all other elements in the input sequence when processing a given element. For a VLA model, this means it can dynamically decide which visual objects, words in the instruction, or parts of the robot's state are most relevant for predicting the next action. Transformers are highly parallelizable and scalable, making them ideal for training on massive datasets.
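As a concrete illustration of the self-attention computation just described, here is a minimal single-head sketch in PyTorch (our own toy code with made-up shapes, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a token sequence x of shape (batch, seq, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # project tokens to queries/keys/values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # pairwise relevance between all tokens
    weights = F.softmax(scores, dim=-1)                     # each token attends to every other token
    return weights @ v                                      # weighted mixture of value vectors

# Toy example: 4 tokens (e.g., an image patch, a word, a state, an action) with 8-dim features.
x = torch.randn(1, 4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # shape (1, 4, 8)
```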
3.1.3. Prompt Learning and Soft Prompts
Prompting is a technique for adapting a large pretrained model to a new task without retraining the entire model.
- Hard Prompts: This involves manually crafting a text string to guide the model. For example, to make a language model do translation, you could prefix the input with "Translate this English sentence to French: ...".
- Soft Prompts: Instead of a human-written text prompt, a soft prompt is a sequence of learnable numerical vectors (embeddings) that are prepended to the model's input. These prompt vectors are optimized via gradient descent for a specific downstream task, while the main model weights are often kept frozen. This is a form of parameter-efficient fine-tuning (PEFT). In this paper, soft prompts are used not for a new task, but to encode the identity of the robot/data source.
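The mechanics can be sketched as follows (a toy PyTorch example of our own, assuming a generic frozen Transformer encoder; it is not the paper's implementation):

```python
import torch
import torch.nn as nn

d_model, n_prompt = 64, 8
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
frozen_backbone = nn.TransformerEncoder(layer, num_layers=2)
for p in frozen_backbone.parameters():
    p.requires_grad_(False)                              # the pretrained weights stay fixed

soft_prompt = nn.Parameter(torch.randn(1, n_prompt, d_model) * 0.02)  # the only trainable tensor

def forward_with_prompt(input_tokens):
    # input_tokens: (batch, seq, d_model); prepend the learnable prompt tokens to the sequence.
    prompt = soft_prompt.expand(input_tokens.shape[0], -1, -1)
    return frozen_backbone(torch.cat([prompt, input_tokens], dim=1))

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)    # only the prompt is optimized

tokens = torch.randn(2, 10, d_model)                     # stand-in for image/language/state tokens
out = forward_with_prompt(tokens)                        # shape (2, 8 + 10, d_model)
```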
3.1.4. Flow Matching
Flow Matching is a modern technique for training generative models. Generative models learn to create new data samples (in this case, robot action sequences) that resemble a training dataset. The core idea is to learn a velocity field that defines a path from a simple distribution (like random noise) to the complex data distribution (expert actions). Imagine you have a random scatter of points (the noise sample) and a target shape (the expert action). Flow matching learns a function that tells each random point which direction to move at any point in time to eventually land on the target shape. The training objective minimizes the difference between the model's predicted velocity and the "true" velocity of a straight path from noise to data, which is simply the difference between the expert action and the noise sample. During inference, the model starts with random noise and iteratively applies the learned velocity to generate a coherent action sequence. This approach is often more stable and efficient to train than older diffusion models.
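A minimal sketch of this training objective (illustrative code of our own, assuming a generic `velocity_net(a_t, obs, t)` callable that returns a tensor shaped like the action chunk):

```python
import torch

def flow_matching_loss(velocity_net, obs, expert_actions):
    """expert_actions: (batch, horizon, action_dim); the network predicts a velocity of the same shape."""
    noise = torch.randn_like(expert_actions)               # random starting points, A^0 ~ N(0, I)
    t = torch.rand(expert_actions.shape[0], 1, 1)          # t ~ U(0, 1), broadcast over the chunk
    a_t = t * expert_actions + (1 - t) * noise             # point on the straight noise-to-data path
    target_velocity = expert_actions - noise               # constant velocity of that straight path
    pred_velocity = velocity_net(a_t, obs, t)              # model's predicted velocity at (A^t, o, t)
    return ((pred_velocity - target_velocity) ** 2).mean() # mean squared error between the two
```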
3.1.5. Cross-Embodiment Learning
This refers to the challenge of training a single AI policy that can operate on multiple robots with different physical bodies (embodiments). For example, one robot might be a 7-degree-of-freedom arm, while another might be a bimanual humanoid. A successful cross-embodiment policy must learn general principles of manipulation (e.g., "how to grasp an object") that are independent of the specific robot's hardware, while also being able to adapt its outputs to the unique kinematics and dynamics of each robot.
3.2. Previous Works
The authors position their work relative to several lines of research:
- Generalist VLA Models: Recent models like RT-1, RT-2, Octo, and π0 have shown that training on large, diverse datasets is crucial for generalization. Many of these models address data heterogeneity by using separate action decoder heads for different embodiments. X-VLA argues this is insufficient because it doesn't resolve heterogeneity at earlier feature-fusion stages.
- Handling Heterogeneity: The paper explicitly compares its soft prompt method to three other approaches (as shown in Figure 2):
  - Domain-specific action projection: The standard approach of using separate output layers for each robot. This is the baseline for many existing models.
  - HPT-style projection: Inspired by Wang et al. (2024c), this method applies separate projection layers to the input observations from different domains to map them into a shared space. The authors find this can corrupt the pretrained VLM features and lead to unstable training.
  - Language prompts: This involves providing a natural language description of the hardware (e.g., "Single Franka arm, top view camera") as an input to the model. The authors argue this is not scalable as it requires careful manual scripting for every new setup.
- Prompt Learning: The idea of using soft prompts is borrowed from NLP and vision, where it's used for multi-task learning (Liu et al., 2023c) and continual learning (Wang et al., 2022). This paper is one of the first to systematically apply this philosophy to absorb embodiment variations in robotics.
3.3. Technological Evolution
The field of robotic learning has evolved from training task-specific policies from scratch to building large "foundation models" for robotics.
- Early Imitation Learning: Focused on training a policy for a single task on a single robot using a small dataset.
- Scaling Data: Models like RT-1 showed that performance improves dramatically by scaling up the amount of real-world data for a single robot type.
- Transferring Web Knowledge: RT-2 and others demonstrated that fine-tuning large, pretrained VLMs on robot data allows the model to transfer abstract knowledge (e.g., recognizing objects, understanding spatial relations) from the web to robotics, enabling better generalization.
- Cross-Embodiment Training: The Open X-Embodiment project and models like Octo pushed for co-training a single policy on data from dozens of different robots. This led to the central problem of data heterogeneity.

This paper's work fits into the latest stage. It directly tackles the heterogeneity challenge that arises from cross-embodiment training, proposing a more principled and effective solution than the ad-hoc methods used previously.
3.4. Differentiation Analysis
The core innovation of X-VLA compared to its predecessors is how it handles heterogeneity:
- Early vs. Late Fusion: While most models like Octo or π0 handle heterogeneity at the end of the network (with separate action heads), X-VLA injects embodiment-specific information at the beginning of the policy network. This allows the entire model to be "aware" of the robot's context, leading to more consistent reasoning.
- Implicit vs. Explicit Guidance: Unlike language prompts, which provide explicit textual descriptions of the hardware, soft prompts learn an implicit numerical representation. This is more flexible and scalable, as the model discovers the relevant hardware features on its own without needing human-written templates.
- Preserving Pretrained Features: Compared to HPT-style input projections, which modify the observation features directly, soft prompts are simply concatenated with them. This is a less invasive approach that better preserves the valuable representations learned by the pretrained VLM, leading to more stable training.

In summary, X-VLA's soft prompt mechanism is a simple, elegant, and parameter-efficient method that marries the benefits of adaptability and preservation of general knowledge.
4. Methodology
4.1. Principles
The central principle of X-VLA is to use soft prompts to enable a single Transformer-based model to effectively process heterogeneous data from multiple robot embodiments. The intuition is that a small set of learnable parameters can act as a "switch" or "context vector" that signals to the network which robot it is currently controlling. This allows the shared backbone of the model to learn general, embodiment-agnostic skills (e.g., the concept of "picking up"), while the prompts help specialize the model's behavior to the specific hardware (e.g., the kinematics of a Franka arm vs. a UR5 arm).
4.2. Core Methodology In-depth (Layer by Layer)
The X-VLA methodology can be broken down into its architecture, its action generation policy, and its training recipe.
4.2.1. Handling Heterogeneity: The Case for Soft Prompts
Before detailing the final architecture, the authors first justify their choice of soft prompts by comparing it to other strategies. They conducted an empirical study on a mixed dataset to evaluate four methods for handling heterogeneity, with the results shown in Figure 4.
The following diagram (Figure 2 from the paper) illustrates the different approaches:
[Figure: bar charts comparing the success rates of the different heterogeneity-handling strategies on simple and dexterous manipulation tasks across robots; the best-performing method reaches a success rate of 0.90.]
- (a) Domain-specific action projection: A shared backbone processes all inputs, but separate linear layers (output heads) are used to project the final token representations into the action space of each specific robot. This is the most common but most limited approach.
- (b) HPT-style projection: In addition to separate output heads, this method adds domain-specific projection layers to the input features. The goal is to align observations from different robots into a common feature space before they enter the main backbone. The authors found this often destabilizes training.
- (c) Language prompts: A textual description of the hardware (e.g., "Single Franka, Camera Setup: Top View") is fed into the VLM along with the task instruction. This relies on the VLM's language understanding capabilities but is cumbersome to scale.
- (d) Soft prompts (Ours): A set of learnable embeddings is assigned to each data source. These prompts are concatenated with the other input tokens and fed into the shared Transformer backbone. The model learns to interpret these prompts automatically. The authors show this method leads to the most stable training and best performance.
The training curves from this preliminary study (Figure 9 from the paper) confirm this, showing that the soft prompt method achieves a lower and more stable validation error.
[Figure: success-rate curves on Simpler-WidowX for the Noisy Prompt (blue), UR5 Prompt (yellow), and Learned Prompt (two-step, green) strategies over training steps, illustrating the performance gap between prompt choices.]
4.2.2. X-VLA Architecture
The X-VLA model is designed for simplicity and scalability, relying on standard Transformer encoders. Its architecture is shown in detail in Figure 2 (labeled as Figure 10 in the appendix). The key is its streamlined encoding pipeline for different input modalities.
[Figure: schematic comparison of the four strategies for handling heterogeneity in cross-embodiment training, with existing solutions on the left and the soft prompt approach on the right.]
The pipeline is divided into two main streams:
- High-dimensional observation stream (Images and Language):
  - Inputs: Multi-view images and a language instruction.
  - Processing: The authors propose a disentangled encoding strategy. The main camera view (e.g., a fixed third-person view) and the language instruction are processed together by a pretrained VLM encoder (specifically, Florence-Large). This leverages the VLM's powerful joint vision-language reasoning.
  - Auxiliary views (e.g., wrist-mounted cameras) are processed by a separate, shared vision backbone. This separation is crucial: fixed views provide stable context for high-level planning, while wrist views, though noisy, provide critical detail for fine-grained manipulation. Encoding them separately prevents the fast-changing wrist view from disrupting the high-level reasoning process.
- Low-dimensional proprioceptive-action stream:
  - Inputs: The robot's proprioceptive state (e.g., joint positions) and the noisy action chunk used for flow-matching.
  - Processing: These are low-dimensional vectors with related physical meanings. They are concatenated together with a time embedding (required for the flow-matching process). This combined vector is then projected into the high-dimensional feature space of the Transformer using a simple linear layer.
Putting It All Together: The tokens from all streams—vision, language, proprioception, action, and the soft prompt—are concatenated into a single sequence. This sequence is then processed by a stack of standard Transformer encoder blocks. The self-attention mechanism within these blocks allows for all-to-all communication, enabling the model to effectively fuse information from all modalities to generate a refined action.
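A rough sketch of this token-assembly step (hypothetical module names, dimensions, and a simple prompt lookup table of our own; the actual X-VLA code may differ):

```python
import torch
import torch.nn as nn

class SoftPromptedEncoder(nn.Module):
    def __init__(self, n_sources=7, n_prompt_tokens=32, d_model=512):
        super().__init__()
        # One learnable soft prompt per pretraining data source / embodiment.
        self.prompts = nn.Parameter(torch.randn(n_sources, n_prompt_tokens, d_model) * 0.02)
        # Toy low-dimensional stream: proprioception (14) + flattened noisy action (10) + time (1).
        self.low_dim_proj = nn.Linear(14 + 10 + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, vl_tokens, wrist_tokens, proprio, noisy_action, t, source_id):
        # vl_tokens / wrist_tokens: (batch, seq, d_model) from the VLM and the auxiliary vision encoder.
        low_dim = self.low_dim_proj(torch.cat([proprio, noisy_action, t], dim=-1)).unsqueeze(1)
        prompt = self.prompts[source_id]                   # source_id: (batch,) -> (batch, n_prompt, d)
        tokens = torch.cat([prompt, vl_tokens, wrist_tokens, low_dim], dim=1)
        return self.backbone(tokens)                       # all-to-all fusion via self-attention

enc = SoftPromptedEncoder()
out = enc(torch.randn(2, 40, 512), torch.randn(2, 20, 512),
          torch.randn(2, 14), torch.randn(2, 10), torch.rand(2, 1),
          torch.tensor([0, 3]))                            # source ids pick the embodiment prompts
```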
4.2.3. Action Generation via Flow-Matching Policy
X-VLA uses a flow-matching policy to generate actions. Instead of directly predicting the action chunk $A$ from an observation $o$, it learns a velocity field $v_{\theta}$ that guides a random noise vector towards the expert action vector.
-
Training: The model is trained to predict the direction of the straight path between a noise sample $A^0$ and the ground-truth expert action chunk $A$. The interpolated action at time $t$ is given by $A^t = tA + (1-t)A^0$. The velocity of this straight path is simply the difference vector $A - A^0$. The model's predicted velocity is trained to match this target velocity. The loss function is the mean squared error between the predicted and target velocities:
$ \mathcal{L}_{\mathrm{BC}}^{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1),\, (o, A) \sim \mathcal{D}} \Big[ \big\| v_{\theta}(A^t, o, t) - (A - A^0) \big\|^2 \Big] $
Where:
- $\theta$: The parameters of the neural network (our X-VLA model).
- $o$: The multimodal observation (images, language, proprioception).
- $A$: The sequence of expert actions (the "action chunk").
- $A^0 \sim \mathcal{N}(0, I)$: A random sample from a standard normal distribution (noise).
- $t \sim \mathcal{U}(0, 1)$: A time variable sampled uniformly from [0, 1].
- $A^t$: The action at time $t$, linearly interpolated between the noise $A^0$ and the expert action $A$.
- $v_{\theta}(A^t, o, t)$: The velocity vector predicted by the model, conditioned on the current action $A^t$, observation $o$, and time $t$.
- $(A - A^0)$: The ground-truth velocity vector for the straight path from noise to data.
-
Inference (Action Generation): To generate an action, the model starts with a random noise vector $A^0$. It then iteratively refines this vector over a series of steps using an ODE solver like the Euler method, moving it along the learned velocity field:
$ A^{t + \Delta t} = A^t + v_{\theta}(A^t, o, t)\, \Delta t $
After a set number of steps (as $t$ goes from 0 to 1), the final vector $A^1$ is the generated action sequence.
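A minimal sketch of this iterative generation loop with Euler integration (toy code of our own, under the same generic `velocity_net(a_t, obs, t)` assumption as the training sketch above):

```python
import torch

@torch.no_grad()
def generate_action_chunk(velocity_net, obs, horizon=30, action_dim=10, n_steps=10):
    a = torch.randn(1, horizon, action_dim)      # start from pure noise, A^0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1, 1, 1), i * dt)        # current flow time in [0, 1)
        a = a + velocity_net(a, obs, t) * dt     # one Euler step along the learned velocity field
    return a                                     # approximation of the denoised action chunk A^1
```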
4.2.4. Customized Training and Adaptation Recipe
The paper proposes a specific recipe to maximize the model's effectiveness.
- Pretraining (Phase I):
  - The entire X-VLA backbone and the set of soft prompts for all pretraining data sources are trained jointly on the mixed-robot dataset using the flow-matching loss $\mathcal{L}_{\mathrm{BC}}^{\mathrm{FM}}$. This phase aims to produce an embodiment-agnostic foundation model.
- Adaptation (Phase II): When adapting the model to a new robot with a new hardware configuration:
  - Step 1: Prompt Warm-up: A new, randomly initialized soft prompt is introduced. For a short period, only this prompt and the action output layers are trained, while the pretrained backbone is kept frozen. This forces the new prompt to learn how to communicate with the existing generalist features of the backbone.
  - Step 2: Joint Policy Adaptation: After the warm-up, the backbone is unfrozen, and both the backbone and the new prompt are fine-tuned together on the new robot's data.
- Custom Learning Rate: During both pretraining and adaptation, a smaller learning rate is used for the pretrained VLM modules and the soft prompts. This is a crucial stabilization technique that prevents "catastrophic forgetting" of the valuable general knowledge stored in the VLM and allows for smoother optimization (see the parameter-group sketch below).
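The two-step adaptation and the custom learning-rate scheme can be sketched with standard PyTorch parameter groups (toy stand-in modules and made-up learning rates; this is our own illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class ToyXVLA(nn.Module):                         # stand-in modules for illustration only
    def __init__(self, d=64):
        super().__init__()
        self.vlm = nn.Linear(d, d)                # plays the role of the pretrained VLM encoder
        self.backbone = nn.Linear(d, d)           # plays the role of the Transformer policy backbone
        self.action_head = nn.Linear(d, 10)

model = ToyXVLA()
new_prompt = nn.Parameter(torch.randn(1, 32, 64) * 0.02)   # soft prompt for the new embodiment

# Step 1: prompt warm-up -- freeze the pretrained model, train only the new prompt and action head.
for p in model.parameters():
    p.requires_grad_(False)
for p in model.action_head.parameters():
    p.requires_grad_(True)
warmup_opt = torch.optim.AdamW([new_prompt, *model.action_head.parameters()], lr=1e-4)

# Step 2: joint adaptation -- unfreeze everything, with a smaller LR for the VLM and the prompt.
for p in model.parameters():
    p.requires_grad_(True)
joint_opt = torch.optim.AdamW([
    {"params": model.vlm.parameters(), "lr": 1e-5},          # protect pretrained VLM knowledge
    {"params": [new_prompt], "lr": 1e-5},                     # prompt also gets the smaller LR
    {"params": model.backbone.parameters(), "lr": 1e-4},      # main policy backbone
    {"params": model.action_head.parameters(), "lr": 1e-4},
])
```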
4.2.5. Enhanced Data Processing
To further reduce heterogeneity and improve supervision quality, the authors employ several data processing techniques:
- Aligned Action Representation: All action data, regardless of the original robot's control space, is standardized to a unified end-effector (EEF) format: the 3D Cartesian position (x, y, z), a continuous 6D representation for rotation (to avoid issues with quaternions or Euler angles), and a binary value for the gripper state (open/closed).
- Intention Abstraction through Temporal Downsampling: Instead of predicting every single action in a dense trajectory, the model is trained to predict a sequence of 30 "anchor points" that summarize the robot's intended path over a 4-second horizon. This abstracts away noisy, low-level movements and helps the model learn higher-level intent, as sketched after this list.
- Balanced Data Sampling: Instead of simply cycling through datasets (round-robin), the authors implement a robust shuffling strategy. Samples are shuffled both across different datasets and within trajectories in each dataset at every training iteration. This ensures the model sees a balanced mix of data, preventing it from overfitting to any single dominant data source.
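A small sketch of the first two steps, aligning actions into a position + 6D-rotation + gripper format and downsampling a dense trajectory into anchor points (our own NumPy illustration; the exact conventions, e.g., the column ordering of the rotation representation, are assumptions):

```python
import numpy as np

def rotation_matrix_to_6d(R):
    """Continuous 6D rotation representation: the first two columns of the 3x3 rotation matrix."""
    return R[:, :2].reshape(-1, order="F")                  # shape (6,)

def to_aligned_action(position, R, gripper_open):
    """Unified EEF action: 3D position + 6D rotation + binary gripper state (10 numbers in total)."""
    return np.concatenate([position, rotation_matrix_to_6d(R), [float(gripper_open)]])

def downsample_trajectory(actions, n_anchors=30):
    """Keep n_anchors evenly spaced 'anchor point' actions to abstract the intended path."""
    idx = np.linspace(0, len(actions) - 1, n_anchors).round().astype(int)
    return actions[idx]

# Example: a dense trajectory of 120 aligned actions reduced to 30 anchors.
dense = np.stack([to_aligned_action(np.zeros(3), np.eye(3), True) for _ in range(120)])
anchors = downsample_trajectory(dense)                      # shape (30, 10)
```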
5. Experimental Setup
5.1. Datasets
The authors use a comprehensive set of datasets for pretraining and evaluation to demonstrate the model's capabilities.
- Pretraining Data: The pretraining mixture consists of 290K episodes from three large-scale, heterogeneous datasets, covering 7 different hardware platforms.
  - Droid (Khazatsky et al., 2024): A large in-the-wild dataset collected on Franka Panda arms with diverse human teleoperators.
  - RoboMind (Wu et al., 2025): A benchmark with data from multiple embodiments, including Franka, UR, and Agilex robots.
  - AGIBOT (Bu et al., 2025): Another large-scale manipulation dataset. The composition of the pretraining data is shown in the chart below (Figure 8 from the paper).

[Figure: t-SNE visualization of the soft prompts learned for the 7 data sources; markers denote the different robot embodiments (e.g., Franka, AGIBOT), showing how the embodiments cluster in the embedding space.]

- Evaluation (Adaptation) Data: The model is evaluated on 6 simulation benchmarks and 3 real-world robot platforms.
  - Simulation Benchmarks: LIBERO (lifelong learning), Simpler (visual and physical variations), VLABench (long-horizon reasoning), RoboTwin-2.0 (bimanual manipulation), Calvin (long-horizon language-conditioned tasks), and NAVSIM (autonomous driving).
  - Real-World Platforms: A WidowX arm for pick-and-place, a bimanual AgileX platform for a dexterous cloth-folding task, and an AIRBOT for parameter-efficient fine-tuning experiments.
  - Soft-Fold Dataset: The authors collected a new high-quality dataset of 1,200 demonstrations for a challenging bimanual cloth-folding task on the AgileX robot.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate performance, depending on the task.
- Success Rate:
  - Conceptual Definition: This is the most common metric in robotics. It measures the percentage of trials in which the robot successfully completes the instructed task. A higher success rate indicates a more reliable and effective policy.
  - Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
  - Symbol Explanation: The terms are self-explanatory.
- PDMS (Planning-oriented Driving Metric Score):
  - Conceptual Definition: Used for the NAVSIM autonomous driving benchmark. It is a composite score that aggregates five key aspects of driving quality to provide a holistic measure of performance. A higher PDMS is better.
  - Components: The score is a weighted average of:
    - NC (No-Collision Rate): Percentage of scenarios completed without a collision.
    - DAC (Drivable Area Compliance): Whether the vehicle stays within the legal drivable area.
    - EP (Ego Progress): The distance the vehicle travels towards the goal.
    - TTC (Time-to-Collision Safety): A measure of safety, penalizing close calls with other agents.
    - Comfort: A metric penalizing excessive acceleration or jerk for a smoother ride.
- Validation Error ($\ell_1$ error):
  - Conceptual Definition: Used during pretraining to monitor model convergence. It measures the average absolute difference between the model's predicted actions and the ground-truth expert actions on a held-out validation set. A lower error indicates that the model is learning to imitate the expert more accurately.
  - Mathematical Formula: $ \ell_1 \text{ error} = \frac{1}{N} \sum_{i=1}^{N} |A_{\text{pred}, i} - A_{\text{gt}, i}| $
  - Symbol Explanation:
    - $N$: Total number of action dimensions in the validation set.
    - $A_{\text{pred}, i}$: The $i$-th dimension of the predicted action.
    - $A_{\text{gt}, i}$: The $i$-th dimension of the ground-truth action.
5.3. Baselines
The paper compares X-VLA against a wide range of recent, state-of-the-art specialist and generalist VLA models. These are listed in Table 2 and include:
- Generalist Models: Octo, GROOT-N1, π0, OpenVLA, and UniVLA. These are large-scale models trained on cross-embodiment data.
- Specialist/Advanced Models: FLOWER, MemoryVLA, TraceVLA, and ThinkAct. These models often incorporate specialized components like memory or explicit planning.
- Model Sizes: The baselines range widely in size, from ~100M parameters (Octo) to 9B parameters (UniVLA), providing a strong reference for how X-VLA-0.9B's performance compares in terms of parameter efficiency.
6. Results & Analysis
6.1. Core Results Analysis
The main simulation results are summarized in Table 2. X-VLA is compared against a large number of existing SOTA models across diverse benchmarks.
The following are the results from Table 2 of the original paper:
| Methods | Size | Simpler VM | Simpler VA | Simpler WidowX | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO Long | LIBERO Avg | Calvin ABC → D | RoboTwin-2.0 Easy | RoboTwin-2.0 Hard | VLABench Avg. PS | NAVSIM PDMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LBP (Liu et al., 2025a) | 0.2B | - | - | - | - | 88.6 | - | - | |||||||
| MoDE (Reuss et al., 2024) | 0.4B | 94.0 | 4.01 | - | - | ||||||||||
| SuSIE (Black et al., 2024b) | 1B | - | 76.3 | 2.69 | - | - | - | ||||||||
| GHIL-Glue (Hatch et al., 2025) | 1B | - | - - | 3.69 | |||||||||||
| SpatialVLA (Qu et al., 2025) | 4B | 75.1 | 70.7 | 42.7 | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 | . | |||||
| TraceVLA (Zheng et al., 2024b) | 7B | 46.2 | 49.1 | - | 84.6 | 85.2 | 75.1 | 54.1 | 74.8 | - | |||||
| ThinkAct (Huang et al., 2025) | 7B | 71.5 | 65.1 | 43.8 | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 | - | |||||
| FPC-VLA (Yang et al., 2025) | 7B | 78.0 | 65.8 | 64.6 | 86.2 | 87.0 | 92.0 | 82.2 | 86.9 | - | |||||
| MemoryVLA (Shi et al., 2025a) | 7B | 77.7 | 72.7 | 71.9 | 98.4 | 98.4 | 96.4 | 93.4 | 96.7 | ||||||
| Octo (Octo Model Team et al., 2024) | 0.1B | 16.8 | 1.10 | 23.4 | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 | ||||||
| GR-1 (Wu et al., 2023) | 0.2B | - | - | 3.06 | |||||||||||
| Seer (Tian et al., 2025) | 0.3B | - | - | - | 87.7 | 4.28 | |||||||||
| UniAct (Zheng et al., 2025) | 0.5B | - | 77.0 | 87.0 | 77.0 | 70.0 | 77.8 | ||||||||
| RDT (Liu et al., 2025b) | 1B | - | - | 34.5 | 13.7 | - | |||||||||
| FLOWER (Reuss et al., 2025) | 1B | 40.0 | 97.1 | 96.7 | 95.6 | 93.5 | 95.7 | 4.53 | - | ||||||
| SmolVLA (Shukor et al., 2025) | 2B | - | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 | - | - | ||||||
| GROOT-N1 (Bjorck et al., 2025) | 3B | 45.0 | 48.4 | - | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | - | - | 39.7 | - | ||
| π0 (Black et al., 2024a) | 3B | 58.8 | 56.8 | 27.8 | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 | - | 46.4 | 16.4 | 37.8 | - | |
| π0 +FAST (Pertsch et al., 2025) | 3B | 61.9 | 60.5 | 39.5 | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 | - | - | 34.1 | - | ||
| OpenVLA (Kim et al., 2024) | 7B | - | 8.30 | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | - | - | - | ||||
| OpenVLA-OFT (Kim et al., 2025) | 7B | 63.0 | 54.3 | 31.3 | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 | - | |||||
| DD-VLA (Liang et al., 2025) | 7B | 71.2 | 64.1 | 49.3 | 97.2 | 98.6 | 97.4 | 92.0 | 96.3 | - | - | ||||
| UniVLA (Wang et al., 2025a) | 9B | - | - | 69.8 | 95.4 | 98.8 | 93.6 | 94.0 | 95.4 | 4.41 | - | - | - | 81.7 | |
| Maximum of Existing SOTA | - | 78.0 | 72.7 | 71.9 | 98.4 | 98.8 | 97.9 | 94.5 | 97.1 | 4.53 | 46.4 | 16.4 | 39.7 | 81.7 | |
| X-VLA (Ours) | 0.9B | 80.4 | 75.7 | 95.8 | 98.2 | 98.6 | 97.8 | 97.6 | 98.1 | 4.43 | 70.0 | 39.0 | 51.1 | 87.3 | |
Key Takeaways:
- Dominant Performance: X-VLA-0.9B sets a new state of the art on the majority of the benchmarks. For instance, it achieves 80.4% on Simpler-VM and 95.8% on Simpler-WidowX, significantly outperforming the previous bests of 78.0% and 71.9%, respectively.
- Parameter Efficiency: X-VLA (0.9B parameters) achieves these results while being substantially smaller than many top competitors like UniVLA (9B), OpenVLA (7B), and π0 (3B). This highlights the architectural and methodological efficiency of the soft prompt approach.
- Broad Generalization: The model excels across a wide variety of tasks and domains, from single-arm manipulation (Simpler, LIBERO), to bimanual tasks (RoboTwin-2.0), to autonomous driving (NAVSIM), underscoring its versatility.
6.2. Ablation Studies / Parameter Analysis
The authors perform a meticulous ablation study to validate the contribution of each component of their proposed method.
The following are the results from Table 1 of the original paper:
| Type | Improvements | Val Error (PT) | Acc (AD) |
|---|---|---|---|
| Baseline Model (w/o PT) | Florence-base + Standard DiT-base | - | 4.1 |
| Pretraining Technique (Section 4.2.1) | +Custom LR (w/o PT) | - | 39.6 (+35.5) |
| | +Heterogeneous PT | 0.11 | 25.0 (-14.6) |
| Data Processing (Section 4.2.2) | +Action alignment, +Intention abstraction, +Balanced data sampling | 0.077 (-0.033) | 50.0 (+25.0) |
| Architecture Design (Section 4.1) | +Replace DiT with Transformer encoder | 0.071 (-0.006) | 47.9 (-2.1) |
| | +Encoding pipeline | 0.053 (-0.018) | 64.6 (+16.7) |
| | +Soft-prompt | 0.041 (-0.012) | 73.8 (+9.2) |
| | +Scaling up | 0.032 (-0.009) | 89.6 (+15.8) |
| Finetuning Technique (Section 4.2.1) | +Two-step adaptation | 0.032 | 95.8 (+6.2) |
Analysis of the Ablation Path:
- Baseline: A standard model without any pretraining achieves a meager 4.1% success rate on the Simpler-WidowX adaptation task.
- Impact of Pretraining: Adding the custom learning rate alone lifts the baseline to 39.6%, but naively pretraining the model on a heterogeneous dataset (+Heterogeneous PT) then degrades performance (39.6% -> 25.0%), highlighting the core problem of this paper: naive co-training fails due to data heterogeneity.
- Impact of Data Processing: Introducing the proposed data processing techniques (action alignment, etc.) provides a massive boost (25.0% -> 50.0%), showing that standardizing the data is crucial.
- Impact of Architecture: Replacing the standard DiT with the proposed X-VLA architecture and encoding pipeline further improves the adaptation accuracy to 64.6%.
- Impact of Soft Prompts: The addition of soft prompts is the final key ingredient for pretraining, reducing validation error and boosting adaptation accuracy significantly to 73.8%.
- Impact of Scaling Up: Scaling the model to 0.9B parameters provides another large jump in performance to 89.6%.
- Impact of Adaptation Technique: Finally, using the proposed two-step adaptation procedure during fine-tuning brings the final accuracy to 95.8%.
This table provides a clear and compelling story, demonstrating that each proposed component contributes positively and systematically to the final impressive result. It also shows a strong correlation: as the pretraining validation error decreases, the downstream adaptation performance increases.
6.3. Scaling Properties
Figure 5 shows that X-VLA's performance (measured by lower validation error) consistently improves with increases in (1) model size, (2) data diversity (number of sources), and (3) data volume. Crucially, the trend shows no sign of saturation, suggesting that X-VLA could become even more powerful with more compute and data.
[Figure: schematic of the X-VLA architecture and its data-flow paths, including the soft-prompt library, the input linear projections, and the standard self-attention Transformer blocks, along with the relationships among the multimodal tokens and input projections.]
6.4. In-Depth Analysis of Soft Prompts
- Qualitative Analysis (Figure 8): The t-SNE visualization of the learned soft prompts reveals that they form distinct clusters corresponding to the hardware configurations of the data sources. For example, all AGIBOT prompts cluster together, and all RoboMind-UR prompts cluster together. Interestingly, the two Franka setups from the Droid dataset (with left vs. right main views) are intermingled. This indicates the prompts are not just memorizing the dataset name, but capturing meaningful similarities and differences in the underlying hardware.

[Figure: tasks used in the WidowX pick-and-place experiments to evaluate different aspects of generalization, including visual, motion, physical, and semantic generalization; each task targets a specific capability.]

- Quantitative Analysis (Figure 9): This experiment on parameter-efficient fine-tuning (PEFT) is very insightful. It compares adapting to a new robot (WidowX) with different prompt strategies:
  - Noisy Prompt (Frozen): A random prompt leads to poor performance.
  - UR5 Prompt (Frozen): Using a pretrained prompt from a similar single-arm robot (UR5) gives a good starting performance but plateaus, as the hardware is not identical.
  - Learned Prompt (Two-step): Training a new prompt for WidowX using the paper's two-step method leads to the fastest convergence and highest final success rate. This confirms that prompts capture embodiment-specific information and that adapting them is key to specializing the model.

[Figure: the three robot hardware setups used for the real-world experiments: (a) WidowX, (b) AgileX, and (c) AIRBOT, covering different camera configurations and task domains to form a heterogeneous validation environment.]
6.5. Real-World Dexterous Manipulation
X-VLA-0.9B demonstrates remarkable dexterity on a real-world bimanual cloth-folding task, achieving a near-100% success rate and a throughput of 33 folds per hour. This is comparable to the closed-source π0 model, despite being trained on a much smaller, open-source dataset (1,200 demonstrations). This result highlights the model's strong fine-grained control and ability to learn complex, dynamic manipulation skills.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces X-VLA, a scalable and effective Vision-Language-Action model that successfully addresses the critical challenge of heterogeneity in cross-embodiment robot learning. The core contribution is a soft prompt mechanism, where small, learnable embeddings are used to encode embodiment-specific information for each data source. This simple yet powerful idea, integrated into a clean Transformer-based architecture, allows the model to learn a generalist, embodiment-agnostic policy during pretraining.
The proposed X-VLA-0.9B model, combined with a carefully designed training and adaptation recipe, achieves new state-of-the-art results on an extensive set of 6 simulation benchmarks and 3 real-world robots. It demonstrates superior capabilities in generalization, dexterity, and parameter-efficient adaptation. The work shows strong scaling potential, paving the way for future development of even more capable generalist robot policies.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Scaling Further: While X-VLA-0.9B is powerful, it is still modest in size compared to foundation models in other domains. Scaling to larger model sizes and training on even broader and more diverse robotics datasets is a clear next step.
- Richer Supervision Signals: The current model is trained primarily with imitation learning on action labels. This supervision is sparse. Future work could incorporate richer signals, such as 3D geometric cues, physical dynamics, or intermediate textual subgoals, to improve learning.
- Towards Seamless Deployment: The current model still requires a fine-tuning step (albeit an efficient one) for new embodiments. The ultimate goal is a truly "zero-shot" generalist model that can be deployed on a new robot without any retraining. This might be achieved by developing unified embodiment representations that abstract away hardware specifics.
7.3. Personal Insights & Critique
This paper is an excellent piece of academic research, characterized by a clear problem statement, an elegant solution, and extremely thorough empirical validation.
Strengths:
- Elegant Solution: The soft prompt mechanism is a prime example of a simple idea having a profound impact. It elegantly solves the heterogeneity problem without adding significant architectural complexity or requiring manual engineering, which is a significant step forward from previous approaches.
- Rigorous Experimentation: The breadth and depth of the experiments are a major strength. The comprehensive comparisons on numerous benchmarks, the meticulous ablation study, the in-depth analysis of the prompts, and the real-world validation (especially the challenging cloth-folding task) build a very convincing case for the method's effectiveness.
- Clarity and Structure: The paper is well-written and easy to follow. The ablation study in Table 1, in particular, tells a clear and compelling story of how each component contributes to the final result.
Critique and Reflections:
- Novelty of the Core Mechanism: The concept of a "soft prompt" or a "learnable task embedding" is not new in the broader machine learning literature. It has been used in multi-task learning and transfer learning for years. The primary novelty of this paper is its specific and highly successful application to the cross-embodiment robotics problem, and the rigorous framing and empirical validation within the VLA context. The authors do a good job of contextualizing this.
- "Prompt" Terminology: Calling these learnable embeddings "prompts" is a strategic choice that aligns the work with the popular and powerful paradigm of prompting large models. It effectively frames the work in a modern context, though one could also describe it as a form of conditional computation or multi-task learning with task-specific embeddings.
- Future Implications: The success of this approach suggests a promising path for building truly generalist robots. If a model can be conditioned on hardware via a small prompt, one can imagine a future where a robot's "driver" is simply a prompt file. The finding that prompts from similar robots can provide a warm start (Figure 9) hints at the possibility of a "prompt space" where one could interpolate between prompts to generalize to unseen robots in a zero-shot or few-shot manner. This is a very exciting direction for the field.