UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
TL;DR Summary
UniVLA proposes a novel framework for transferable robot policies, learning task-centric latent actions from vast unannotated videos using language instructions and DINO features. This method significantly improves cross-embodiment performance across diverse manipulation and navigation benchmarks, as well as real-robot deployments, while requiring a fraction of the pretraining compute and downstream data of prior VLA models.
Abstract
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
- Authors: Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li.
- Affiliations: The University of Hong Kong, OpenDriveLab, and AgiBot. The authors are affiliated with prominent academic and research institutions in the field of AI and robotics.
- Journal/Conference: The paper is a preprint submitted to a top-tier robotics conference, likely the Conference on Robot Learning (CoRL), as indicated by the formatting and references to other recent CoRL papers like OpenVLA.
- Publication Year: 2024 (based on arXiv submission date).
- Abstract: The paper addresses the challenge of creating generalist robot policies that can operate across diverse environments and embodiments. Current methods often rely heavily on action-annotated data, limiting their scalability and transferability. The authors propose UniVLA, a framework that learns a unified, task-centric action representation from videos using a latent action model. This allows the framework to leverage vast amounts of video data, including robot and human demonstrations, without needing explicit action labels. The key innovation is a method to disentangle task-relevant dynamics from irrelevant ones using language instructions and DINO feature space. The resulting generalist policy, pretrained on internet-scale videos, can be efficiently adapted to new robots. UniVLA achieves state-of-the-art results on multiple manipulation and navigation benchmarks, significantly outperforming models like OpenVLA with a fraction of the computational cost and downstream data.
- Original Source Link:
- Status: Preprint on arXiv.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Creating a "generalist" robot policy that can perform many tasks across different robot bodies (embodiments) and in varied environments is a central goal in robotics. A major bottleneck is the data requirement. Most existing approaches, known as Vision-Language-Action (VLA) models, require large datasets where every robot action is precisely labeled. This is expensive, hard to scale, and makes it difficult to learn from the vast amount of unlabeled video data available online (e.g., human videos).
- Gaps in Prior Work: Previous models are often tied to specific robot hardware and struggle to transfer knowledge between different robot types (e.g., a Franka arm vs. a WidowX arm) or even between robots and humans. Attempts to learn from video without action labels often fail because they capture all visual changes, including distracting, "task-irrelevant" dynamics like camera shake or background movement, which pollutes the learned action representation.
- Fresh Angle: UniVLA introduces a novel way to learn a unified and task-centric latent action space from videos alone. Instead of learning raw actions, it learns abstract, discrete "latent actions" that represent the intent of the action. The core innovation is a two-stage process that uses language instructions to explicitly separate meaningful, task-related movements from random, irrelevant visual noise.
- Main Contributions / Findings (What):
  - A New Framework (UniVLA): The paper proposes a complete recipe for training a generalist policy. It learns an embodiment-agnostic action space from diverse videos and then trains a VLA model to predict actions within this unified space, enabling scalable and efficient robot control.
  - Task-centric Latent Action Learning: A novel method is introduced to extract only the task-relevant dynamics from videos. By conditioning on language instructions, the model learns to decouple goal-oriented movements from background noise, creating a cleaner and more effective action representation for policy learning.
  - State-of-the-Art Performance and Efficiency: UniVLA achieves superior results on a wide range of benchmarks, including robot manipulation (LIBERO, CALVIN) and navigation (Room2Room), as well as in real-world deployments. It outperforms the leading model, OpenVLA, by a large margin (e.g., 18.7% higher average success rate on LIBERO) while using less than 1/20th of the pretraining compute and 1/10th of the downstream fine-tuning data.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Vision-Language-Action (VLA) Models: These are AI models, often based on Transformer architectures, that take visual input (camera images), language instructions (e.g., "pick up the red block"), and produce robot actions (e.g., motor commands) to complete a task.
- Cross-Embodiment Learning: This is the study of how to train a single AI policy that can control different types of robots, or even transfer knowledge from human demonstrations to a robot. The challenge lies in the different action spaces, sensors, and physical properties of each "embodiment."
- Latent Action Models: Instead of predicting concrete, low-level motor commands (e.g., joint angles), these models learn a compressed, abstract representation of actions in a "latent space." This can make learning more efficient and generalizable.
- Vector Quantized Variational Autoencoder (VQ-VAE): A type of deep learning model used to learn discrete representations. It compresses input data (like the dynamics between two video frames) into a code from a finite "codebook." This is key to UniVLA's discrete latent actions.
- DINOv2 Features: DINOv2 is a powerful vision model trained by Meta AI. It can extract rich, semantic features from images that capture information about objects and their spatial relationships, without being explicitly trained for it. UniVLA uses these features instead of raw pixels to focus on meaningful content and ignore noise like lighting changes.
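To make the VQ-VAE idea above concrete, here is a minimal PyTorch sketch of the nearest-neighbor codebook lookup (with a straight-through gradient) that underlies discrete latent actions; the codebook size and embedding dimension are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps continuous vectors to their nearest codebook entries."""

    def __init__(self, codebook_size: int = 16, dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_tokens, dim) continuous encoder outputs
        flat = z.reshape(-1, z.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)     # distances to every codebook entry
        indices = dist.argmin(dim=-1).view(z.shape[:-1])   # discrete token indices
        z_q = self.codebook(indices)                       # quantized embeddings
        z_q = z + (z_q - z).detach()                       # straight-through gradient trick
        return z_q, indices


# usage: 2 samples, 4 latent-action tokens each
quantizer = VectorQuantizer()
z_q, idx = quantizer(torch.randn(2, 4, 128))
print(idx.shape)  # torch.Size([2, 4])
```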
- Previous Works:
  - VLA Models (e.g., RT-1, Octo, OpenVLA): These models have shown success but are data-hungry, requiring massive datasets with ground-truth action labels. OpenVLA, for example, discretizes the robot's action space and treats actions as "words" in a language model's vocabulary, but this approach is still tied to labeled data and specific embodiments.
  - Latent Action Learning from Video (e.g., LAPA, Genie): These methods try to learn actions from video without labels by predicting future frames. However, their simple reconstruction objective means they learn to represent all visual changes, including irrelevant noise (camera shake, background motion), which harms policy performance. UniVLA directly addresses this limitation.
  - Cross-Embodiment Learning (e.g., flow representations, object-centric representations): These approaches try to find common ground between different embodiments, for instance by predicting how pixels or objects will move (flow). However, they often require extensive and diverse datasets or explicit annotations. UniVLA's unsupervised latent action approach is more data-efficient.
- Differentiation: UniVLA's primary innovation is its task-centric latent action decoupling. While LAPA also learns latent actions from video, it does so naively, mixing useful and useless information. UniVLA instead uses a two-stage process guided by language instructions to purify its latent action space, ensuring it only represents motions relevant to completing the specified task. This makes the learned representation more robust, efficient, and transferable.
4. Methodology (Core Technology & Implementation)
UniVLA's methodology is a three-part recipe: 1) Learn a unified, task-centric latent action space from diverse videos, 2) Pretrain a generalist policy to predict these latent actions, and 3) Fine-tune the policy for specific robots by decoding latent actions into real motor commands.
A. Task-centric Latent Action Learning
This is the core novelty of the paper. The goal is to create a set of discrete "pseudo actions" from video that can be used to supervise the main policy. This is achieved with a two-stage training pipeline for a latent action model, as shown in Image 1.
Image 1 (diagram): the two-stage training pipeline of UniVLA's latent action model built on DINOv2 features. On the left, Stage 1 encodes video frames with a spatial-temporal transformer, a latent action quantizer extracts task-irrelevant latent actions, and the task instruction is added for supervised training. On the right, Stage 2 copies the Stage-1 encoder weights and splits the latent representation into task-irrelevant and task-centric parts to sharpen task-focused action encoding, continuing the supervised training. Different colors and symbols distinguish task-centric from task-irrelevant latent actions.
- Principles: The model is based on an Inverse Dynamics Model (IDM), which infers the action $a_t$ that caused the transition from observation $o_t$ to $o_{t+1}$, and a Forward Dynamics Model (FDM), which predicts the future observation $o_{t+1}$ given the current observation $o_t$ and an action $a_t$. Instead of using raw pixels, the model operates on the rich, semantic feature space of DINOv2, which helps ignore irrelevant details like textures and lighting.
- Steps & Procedures:
  - Latent Action Quantization: The model takes two consecutive video frames, $o_t$ and $o_{t+1}$, extracts their DINOv2 features, and feeds them into an encoder $\mathcal{E}$ (a spatial-temporal transformer). The encoder outputs a continuous action representation, which is then discretized into a sequence of tokens using a VQ-VAE codebook. A decoder $\mathcal{D}$ then tries to reconstruct the features of the future frame $o_{t+1}$ using only the features of the current frame $o_t$ and the quantized latent action tokens $z_t$.
  - Latent Action Decoupling (Two-Stage Training): This is the key to making the actions "task-centric." A minimal code sketch of the two stages follows this list.
    - Stage 1: Learning Task-Irrelevant (TI) Dynamics. In the first stage, the model is trained to predict the future frame $o_{t+1}$ from the current frame $o_t$, a latent action $z^{TI}_t$, and the language instruction $\ell$. Because the decoder is given the high-level task instruction, it already "knows" what the goal is. The VQ-VAE codebook for $z^{TI}_t$ has limited capacity, so it learns to encode only the remaining, unpredictable visual information necessary for reconstruction, such as background motion, camera shake, or the appearance of new objects. This effectively isolates the task-irrelevant dynamics. The process can be written as $z^{TI}_t = q\big(\mathcal{E}(o_t, o_{t+1})\big)$, $\hat{o}_{t+1} = \mathcal{D}\big([o_t;\, z^{TI}_t;\, \ell]\big)$, where $\mathcal{E}$ is the encoder, $\mathcal{D}$ is the decoder, $q(\cdot)$ is the vector quantization process, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.
    - Stage 2: Learning Task-Centric (TC) Dynamics. In the second stage, the model from Stage 1 is copied. The codebook and weights for the task-irrelevant action $z^{TI}_t$ are frozen. A new, randomly initialized codebook is introduced to learn a second set of latent actions, $z^{TC}_t$. The language instruction is now removed from the decoder's input. The model's objective is still to reconstruct the future frame, but now using both $z^{TI}_t$ and $z^{TC}_t$. Since $z^{TI}_t$ already explains the irrelevant dynamics, the new codebook is forced to learn only the remaining, unexplained dynamics, which are precisely the task-centric movements of the robot/agent. The process becomes $z^{TC}_t = q'\big(\mathcal{E}(o_t, o_{t+1})\big)$, $\hat{o}_{t+1} = \mathcal{D}\big([o_t;\, z^{TI}_t;\, z^{TC}_t]\big)$. After this stage, the $z^{TC}_t$ tokens are used as the "pseudo action labels" for training the main policy.
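To make the two-stage recipe concrete, below is a minimal PyTorch sketch of a latent action model with separate TI/TC codebooks. This is a sketch under assumptions, not the paper's implementation: the encoder and decoder are simple MLP stand-ins for the spatial-temporal transformer, and the DINOv2 feature dimension (768), codebook size, number of latent tokens, and pooled language embedding are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def quantize(z, codebook):
    """Nearest-neighbor codebook lookup with a straight-through gradient."""
    flat = z.reshape(-1, z.size(-1))
    idx = torch.cdist(flat, codebook.weight).argmin(-1).view(z.shape[:-1])
    z_q = codebook(idx)
    return z + (z_q - z).detach(), idx


class LatentActionModel(nn.Module):
    def __init__(self, feat_dim=768, act_dim=128, n_codes=16, n_tokens=4, lang_dim=512):
        super().__init__()
        self.n_tokens = n_tokens
        # stand-in for the spatial-temporal transformer over DINOv2 features of (o_t, o_t+1)
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, n_tokens * act_dim))
        self.codebook_ti = nn.Embedding(n_codes, act_dim)  # stage-1 (task-irrelevant) codebook
        self.codebook_tc = nn.Embedding(n_codes, act_dim)  # stage-2 (task-centric) codebook
        # decoder reconstructs next-frame features from [o_t ; z_TI ; z_TC ; context]
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + 2 * n_tokens * act_dim + lang_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, o_t, o_next, lang_emb, stage):
        z = self.encoder(torch.cat([o_t, o_next], dim=-1)).view(o_t.size(0), self.n_tokens, -1)
        z_ti, _ = quantize(z, self.codebook_ti)
        if stage == 1:
            # language is visible to the decoder, so z_TI only has to explain leftover,
            # task-irrelevant dynamics (camera shake, background motion, ...)
            z_tc, ctx = torch.zeros_like(z_ti), lang_emb
        else:
            # stage 2: TI codes treated as fixed, language removed ->
            # z_TC must capture the remaining, task-centric motion
            z_ti = z_ti.detach()
            z_tc, _ = quantize(z, self.codebook_tc)
            ctx = torch.zeros_like(lang_emb)
        pred = self.decoder(torch.cat([o_t, z_ti.flatten(1), z_tc.flatten(1), ctx], dim=-1))
        return F.mse_loss(pred, o_next)


# usage with pooled DINOv2 features (dim assumed 768) and a pooled language embedding (dim 512)
model = LatentActionModel()
loss = model(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 512), stage=1)
loss.backward()
```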
B. Pretraining of Generalist Policy
Once the latent action model is trained, it's used to label a massive dataset of videos with task-centric latent action tokens. Then, a generalist policy is trained on these labels.
Image 2 (diagram): the autoregressive-Transformer pipeline in UniVLA that predicts task-centric latent actions. The inputs at the bottom are a third-person RGB image (DINOv2 features) and the task instruction (tokenizer encoding). After Transformer processing, the model outputs a sequence of latent action tokens, which an action decoder then maps into the action spaces of heterogeneous robots, enabling cross-platform action representation and execution.
- Architecture: The policy is built on Prismatic-7B, a powerful Vision-Language Model (VLM). It uses visual encoders (SigLIP and DINOv2), a projection layer to align vision and language features, and a LLaMA-2 large language model (LLM) backbone.
- Action Representation: Unlike OpenVLA, which re-purposes existing words in the tokenizer, UniVLA extends the LLaMA vocabulary with new special tokens, e.g., {ACT_1, ACT_2, ..., ACT_C}. Each latent action token from the codebook is mapped to one of these special tokens.
- Training Objective: The model is trained auto-regressively to predict the next latent action token, given the current visual observation $o_t$, the language instruction $\ell$, and previously predicted action tokens. The objective is to minimize the negative log-probability of the correct next token, $\mathcal{L} = -\sum_{i=1}^{K} \log p\big(a_i \mid o_t, \ell, a_{<i}\big)$, where $K$ is the number of latent action tokens per step. This approach is much more efficient than predicting in a high-dimensional continuous action space; a small sketch of the objective over the extended vocabulary is given below.
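The following is a minimal sketch of how that objective could be computed over the extended vocabulary; the base vocabulary size, codebook size $C$, number of action tokens $K$, and the mapping from codebook indices to hypothetical ACT_i token ids are assumptions for illustration, not UniVLA's actual configuration.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: the policy LM's vocabulary (size 32000 here) has been extended with
# C special latent-action tokens {ACT_1, ..., ACT_C}; all sizes are illustrative.
base_vocab, codebook_size, n_act_tokens = 32000, 16, 4
act_token_ids = torch.arange(base_vocab, base_vocab + codebook_size)  # vocab ids of ACT_1..ACT_C


def latent_action_nll(logits, target_codes):
    """Negative log-likelihood of the K ground-truth latent-action tokens.

    logits:       (batch, K, vocab) scores at the K action positions of the sequence
    target_codes: (batch, K) codebook indices produced by the latent action model
    """
    targets = act_token_ids[target_codes]  # map codebook index -> special-token vocab id
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


# usage with random stand-ins for the VLM's output logits
logits = torch.randn(2, n_act_tokens, base_vocab + codebook_size)
codes = torch.randint(0, codebook_size, (2, n_act_tokens))
print(latent_action_nll(logits, codes))
```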
C. Post-training for Deployment
To use the pretrained policy on a real robot, two additional components are introduced.
- Latent Action Decoding: The abstract latent actions predicted by the policy need to be translated into concrete motor commands. A lightweight action decoder head is added (see Image 2). It uses an attention mechanism to combine information from the visual embeddings and the predicted latent action embeddings. This context-aware decoding allows the same latent action to be translated into slightly different physical movements depending on the visual scene. The process can be written as $e_{\mathrm{act}} = \mathrm{MHA}\big(q,\, [E_{\mathrm{vis}}; E_{\mathrm{lat}}],\, [E_{\mathrm{vis}}; E_{\mathrm{lat}}]\big)$, where $\mathrm{MHA}$ is multi-head attention, $q$ are learnable queries, and $E_{\mathrm{vis}}$, $E_{\mathrm{lat}}$ are the visual and latent action embeddings. The resulting action embedding $e_{\mathrm{act}}$ is projected to the robot's specific action space. This head is trained efficiently using LoRA. A sketch of such a decoder head follows below.
- Learn from History Outputs: To improve performance on long tasks, the model is given its own past predictions as context. At each step, the latent action predicted in the previous step is fed back into the input prompt, similar to Chain-of-Thought reasoning in LLMs. This helps the policy maintain context and make more coherent sequences of decisions without the high cost of processing multiple historical images.
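Below is a hedged sketch of such a decoder head: learnable queries attend over the concatenated visual and latent-action embeddings, and the attended output is projected to a robot-specific action space. The hidden size, number of queries, and 7-DoF output are illustrative assumptions rather than the paper's configuration, and LoRA fine-tuning is omitted.

```python
import torch
import torch.nn as nn


class LatentActionDecoder(nn.Module):
    """Lightweight decoder head sketch: queries attend over visual + latent-action context."""

    def __init__(self, dim=512, n_queries=1, action_dim=7, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)  # learnable queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, action_dim)  # map to the robot-specific action space

    def forward(self, visual_emb, latent_act_emb):
        # visual_emb:     (batch, n_vis_tokens, dim) from the VLM's vision branch
        # latent_act_emb: (batch, n_act_tokens, dim) embeddings of the predicted ACT tokens
        ctx = torch.cat([visual_emb, latent_act_emb], dim=1)   # keys / values
        q = self.queries.expand(ctx.size(0), -1, -1)
        out, _ = self.attn(q, ctx, ctx)                        # context-aware decoding
        return self.proj(out.squeeze(1))                       # low-level robot action


decoder = LatentActionDecoder()
action = decoder(torch.randn(2, 16, 512), torch.randn(2, 4, 512))
print(action.shape)  # torch.Size([2, 7])
```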
5. Experimental Setup
- Datasets:
  - Pretraining: A diverse mix of data is used, leveraging UniVLA's ability to learn from action-free videos.
    - Robotic Manipulation: A subset of the Open X-Embodiment (OpenX) dataset, including Bridge-V2.
    - Navigation: The GNM dataset, featuring indoor and outdoor scenes.
    - Human Videos: The Ego4D dataset of egocentric human activities.
  The detailed mixture is shown in the transcribed table below. This is a manual transcription of Table A-I from the paper's appendix.

| Dataset | Percentage | Dataset | Percentage |
| :--- | :--- | :--- | :--- |
| Fractal | 13.9% | Austin Sailor Dataset | 2.6% |
| Kuka | 6.3% | Austin Sirius Dataset | 2.0% |
| Bridge | 6.8% | DLR EDAN Shared Control | 0.1% |
| Taco Play | 3.5% | IAMLab CMU Pickup Insert | 1.1% |
| Jaco Play | 0.6% | UTAustin Mutex | 2.6% |
| Berkeley Cable Routing | 0.3% | Berkeley Fanuc Manipulation | 0.9% |
| Roboturk | 2.8% | CMU Stretch | 0.2% |
| Viola | 1.1% | BC-Z | 8.8% |
| Berkeley Autolab UR5 | 1.4% | FMB Dataset | 8.4% |
| Toto | 2.4% | DobbE | 1.7% |
| Language Table | 5.2% | RECON | 8.9% |
| Stanford Hydra Dataset | 5.3% | CoryHall | 2.3% |
| Austin Buds Dataset | 0.3% | SACSoN | 3.5% |
| NYU Franka Play Dataset | 1.0% | Ego4D | 3.0% |
| Furniture Bench Dataset | 2.9% | UCSD Kitchen Dataset | <0.1% |

- Evaluation:
- Manipulation: LIBERO, CALVIN, and SimplerEnv benchmarks.
- Navigation: Room2Room (R2R) benchmark in the VLN-CE environment.
- Real-World: A custom set of tasks using a 7-DoF Piper arm.
- Evaluation Metrics:
  - Success Rate:
    - Conceptual Definition: The most common metric in robotics, it measures the percentage of trials where the robot successfully completes the assigned task according to predefined criteria.
    - Mathematical Formula: $\text{Success Rate} = \frac{N_{\text{success}}}{N_{\text{total}}} \times 100\%$
    - Symbol Explanation: $N_{\text{success}}$ is the number of successful trials and $N_{\text{total}}$ is the total number of evaluation trials.
  - Oracle Success Rate (for R2R):
    - Conceptual Definition: Used in Vision-and-Language Navigation (VLN). An episode is counted as successful if the agent's final stopping position is within a certain radius (e.g., 3 meters) of the goal, regardless of whether the agent decided to stop at the right time. It measures the quality of the path generated by the policy.
    - Mathematical Formula: This is a conditional success rate, where the condition is being near the goal at the final step. The formula is the same as the standard success rate, but with a different success condition.
  - Average Score (for Real-World):
    - Conceptual Definition: A custom, step-wise metric designed to capture partial success. For a complex task with multiple sub-goals, the policy is awarded points for each sub-goal it completes. This provides a more nuanced measure of performance than a simple binary success/fail. For instance, successfully picking up an object but failing to place it would still earn partial credit. A small computation sketch follows this list.
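For illustration, here is a tiny sketch of how the two headline metrics could be computed from logged rollouts; the episode format and the one-point-per-sub-goal scoring are assumptions for illustration, not the paper's exact protocol.

```python
# Each logged episode records how many sub-goals were completed out of the total.
episodes = [
    {"subgoals_done": 3, "subgoals_total": 3},   # full success
    {"subgoals_done": 1, "subgoals_total": 3},   # partial credit only
    {"subgoals_done": 0, "subgoals_total": 3},   # failure
]

# Success Rate: fraction of episodes where every sub-goal (i.e., the whole task) succeeded.
success_rate = 100.0 * sum(e["subgoals_done"] == e["subgoals_total"] for e in episodes) / len(episodes)

# Average Score: mean number of completed sub-goals per episode (one point per sub-goal).
average_score = sum(e["subgoals_done"] for e in episodes) / len(episodes)

print(f"Success rate: {success_rate:.1f}%  Average score: {average_score:.2f}")
```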
- Baselines: A comprehensive set of strong baselines was used for comparison.
  - Generalist Policies: OpenVLA, Octo, LAPA. These are the most direct competitors.
  - Specialist/Other Policies: Diffusion Policy (a strong single-task imitation learning method), MDT, MaIL.
  - Navigation Models: Seq2Seq, CMA, LLaVA-Nav, NaVid.
6. Results & Analysis
The paper's overall results are summarized in Image 3, showing consistent and significant outperformance over OpenVLA across all tested domains.
Image 3 (diagram): how UniVLA integrates multi-source video data (robot manipulation, human videos, indoor and outdoor navigation) through a task-centric latent action space to learn cross-embodiment vision-language-action policies. The bar charts on the right compare UniVLA with OpenVLA on several benchmarks (LIBERO, Room2Room, SimplerEnv, CALVIN, and a real robot), showing UniVLA clearly outperforming OpenVLA.
Core Results
- Manipulation Benchmark on LIBERO:
  This is a manual transcription of Table I from the paper.

| Method | Spatial | Object | Goal | Long | Average |
| :--- | :--- | :--- | :--- | :--- | :--- |
| LAPA* [87] | 73.8 | 74.6 | 58.8 | 55.4 | 65.7 |
| Diffusion Policy [17] | 78.3 | 92.5 | 68.3 | 51.1 | 72.4 |
| Octo [28] | 78.9 | 85.7 | 84.6 | 50.5 | 75.1 |
| MDT [68] | 78.5 | 84.7 | 73.5 | 64.8 | 76.1 |
| OpenVLA [39] | 74.3 | 88.4 | 79.2 | 53.7 | 76.5 |
| MaIL† [35] | - | 90.1 | 81.8 | 78.6 | 83.5 |
| UniVLA (Human) | 91.2 | 94.2 | 90.2 | 79.4 | 88.7 |
| UniVLA (Bridge) | 95.2 | 95.4 | 91.9 | 87.5 | 92.5 |
| UniVLA (Full) | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |

  - Analysis: UniVLA with full pretraining (UniVLA (Full)) achieves a striking 95.2% average success rate, which is 18.7% higher than OpenVLA. Remarkably, even when pretrained only on the Bridge-V2 dataset, UniVLA (92.5%) outperforms OpenVLA pretrained on the much larger OpenX dataset. Pretraining on human videos alone (UniVLA (Human)) is also sufficient to beat OpenVLA, demonstrating the framework's remarkable ability to transfer knowledge across embodiments. The task setups are visualized in Image 8.
- Navigation Benchmark on Room2Room (R2R):
  Image 10 (bar chart): Oracle Success Rate of Seq2Seq, CMA, LLaVA-Nav, OpenVLA, NaVid, and UniVLA. UniVLA reaches 47.1%, only slightly below NaVid's 49.1% and well above the remaining methods.
  - Analysis: As shown in Image 10, UniVLA achieves a 47.1% Oracle Success Rate, dramatically outperforming OpenVLA (17.5%) by 29.6%. Its performance is nearly on par with NaVid (49.1%), a state-of-the-art navigation model that uses the entire history of visual observations as input. UniVLA achieves this with only the current frame and a compact history of its own latent action predictions, showcasing extreme efficiency.
- Real-world Robot Deployment:
  Image 9 (bar charts): success rates on four real-robot manipulation tasks (store the screwdriver, clean the cutting board, fold towel twice, stack tower of Hanoi) for Diffusion Policy, OpenVLA, LAPA, and UniVLA, with UniVLA best on every task. Two side panels show average success rate and average score, where UniVLA leads with 81.7% and 2.63. Each task is illustrated with an image of the corresponding robot setup.
  - Analysis: The results in Image 9 show that UniVLA is highly effective in the real world, achieving an average success rate of 81.7%, which is 36.7% higher than the next-best generalist policy, LAPA. It excels in tasks requiring semantic understanding ("Stack tower of hanoi") and spatial reasoning ("Store the screwdriver"). While the specialist Diffusion Policy performs better on the highly structured "Fold towel twice" task, UniVLA still achieves a higher average score, indicating it makes more meaningful partial progress even on failed attempts.
- Generalizability Analysis:
  This is a manual transcription of Table II from the paper.

| Method | Lighting Variation (Succ. / Score) | Visual Distractor (Succ. / Score) | Novel Object (Succ. / Score) | Average ↑ (Succ. / Score) |
| :--- | :--- | :--- | :--- | :--- |
| Diffusion Policy [17] | 20.0 / 0.60 | 26.7 / 0.80 | 26.7 / 0.67 | 24.4 / 0.69 |
| OpenVLA [39] | 13.3 / 0.93 | 20.0 / 0.73 | 26.7 / 1.27 | 20.0 / 0.98 |
| LAPA [87] | 26.7 / 1.60 | 6.7 / 0.6 | 53.3 / 1.87 | 28.9 / 1.36 |
| UniVLA (Ours) | 66.7 / 2.33 | 53.3 / 2.40 | 86.7 / 2.73 | 68.9 / 2.49 |
- Analysis: UniVLA demonstrates exceptional robustness to unseen variations, such as changes in lighting, visual distractors, and novel objects. It maintains a high success rate (68.9%) where other models fail, underscoring the generalizability of its learned representations.
Discussion on Latent Action
- Qualitative Analysis: Figure 8 in the paper (and additional examples in Image 5) visualizes that the same latent action token corresponds to semantically similar behaviors across different datasets and embodiments (e.g., "Move Things Up"), confirming the model learns a meaningful, unified action space.
- Quantitative Analysis: This is a crucial ablation that validates the core hypothesis of the paper.
  This is a manual transcription of Table III from the paper.

| Latent Action | Spatial | Object | Goal | Long | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Genie [12] | 89.8 | 92.8 | 77.2 | 69.6 | 82.3 |
| Task-irrelevant | 68.0 | 90.4 | 67.2 | 0.2 | 56.5 |
| Task-centric | 91.2 | 94.2 | 90.2 | 79.4 | 88.7 |

  - Analysis: When the policy is trained using only the task-centric latent actions, it achieves the best performance (88.7%). Training on naive, unfiltered latent actions (Genie-style) performs worse (82.3%). Crucially, training on the task-irrelevant actions results in abysmal performance (56.5%), with a near-zero success rate on long-horizon tasks. This provides strong quantitative evidence that the proposed decoupling method successfully isolates useful, task-related information.
More Ablations
- Data Scalability: Figure 9 in the paper shows that as more diverse data (OpenX, Ego4D human videos) is added to pretraining, UniVLA's performance consistently improves, particularly on the challenging real-world and navigation tasks. This confirms the framework's scalability.
- Data Efficiency: The plots in Image 4 show that UniVLA is remarkably data-efficient during fine-tuning. With only 10% of the downstream demonstration data for LIBERO-Goal, it already outperforms OpenVLA trained on the full dataset. This highlights the power of its pretrained representations.
  Image 4 (two line plots): success rate versus the fraction of training demonstrations on LIBERO-Goal and LIBERO-Long. The UniVLA curve (Ours) stays above ATM, OpenVLA, and the previous SOTA at every data fraction, and its success rate rises steadily as more demonstrations are added.
- Latent Action Decoder:
  This is a manual transcription of Table IV from the paper.

| Action Decoder | Spatial | Object | Goal | Long | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Auto-regressive | 85.2 | 81.2 | 79.0 | 49.0 | 73.6 |
| Ours w/o Visual | 95.0 | 95.4 | 93.7 | 86.0 | 92.5 |
| Ours | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |

  - Analysis: The proposed parallel decoder head significantly outperforms the auto-regressive decoding used by OpenVLA and LAPA. Furthermore, adding visual context as a query to the decoder provides an additional boost in performance, confirming the benefits of the design.
- History Latent Actions:
  This is a manual transcription of Table V from the paper.

| Prompt Input | LIBERO-Goal (Manip.) | LIBERO-Long (Manip.) | R2R (Navi.) |
| :--- | :--- | :--- | :--- |
| Instruction-only | 95.0 | 88.1 | 30.6 |
| w/ History Action | 95.6 | 92.0 | 47.1 |

  - Analysis: This simple trick of feeding back the last predicted latent action provides a significant performance lift, especially in long-horizon tasks like LIBERO-Long (+3.9%) and R2R navigation (+16.5%), at a negligible computational cost. A minimal sketch of this prompting scheme is shown below.
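To illustrate the idea, here is a minimal sketch of how the previously predicted latent action could be fed back into the prompt; the template and token names are illustrative assumptions, not UniVLA's exact prompt format.

```python
# Build the policy prompt, optionally appending the last predicted latent-action tokens
# as a compact stand-in for visual history.
def build_prompt(instruction, last_latent_action=None):
    prompt = f"Instruction: {instruction}\n"
    if last_latent_action is not None:
        # feed back the previously predicted latent-action tokens as lightweight history
        history = " ".join(f"<ACT_{i}>" for i in last_latent_action)
        prompt += f"Previous latent action: {history}\n"
    return prompt + "Next latent action:"


print(build_prompt("fold the towel twice", last_latent_action=[3, 7, 1, 12]))
```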
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces UniVLA, a new framework for training generalist robot policies that is both scalable and efficient. By learning a unified, task-centric latent action space, it overcomes the data bottleneck of previous VLA models, enabling learning from diverse, action-free video data, including human demonstrations. The core innovation—decoupling task-relevant from irrelevant dynamics using language and DINOv2 features—is shown to be highly effective. UniVLA establishes a new state-of-the-art across a wide range of robotics benchmarks while requiring significantly less computation and data than prior methods.
- Limitations & Future Work:
- Latent Action Design: The granularity of the latent actions (fixed number of tokens, fixed codebook size) may not be optimal for all tasks. Future work could explore adaptive mechanisms. The framework also needs to be extended to more complex embodiments like dual-arm or dexterous-hand robots.
- Language Annotation: The current method relies on fine-grained language instructions. While it works, more expressive, high-level language could further improve latent action learning.
- Integration with World Models: The decoder of the latent action model is effectively a world model. This could be leveraged for planning (e.g., with tree search) and reinforcement learning in the future.
- In-Context Learning: The authors propose an exciting future direction: using the latent action model as a "video tokenizer" to encode entire human demonstration videos into compact token sequences. These could then be used as in-context examples for zero-shot skill acquisition by the VLM policy.
- Personal Insights & Critique:
- Strengths: The paper's central idea of decoupling latent actions is elegant and powerful. It provides a practical solution to a major problem in robot learning: how to effectively use the massive amounts of unlabeled video data in the wild. The demonstrated efficiency and performance gains are truly impressive and point towards a more sustainable path for scaling up robot policies. The use of DINOv2 features is also a very smart design choice, leveraging the power of modern foundation models.
- Critique: The performance of the system is implicitly tied to the quality of the pretrained DINOv2 vision encoder and the T5 text encoder. As these foundation models improve, UniVLA's performance will likely improve as well, but it also creates a dependency. While the framework is very general, the two-stage decoupling process might require careful tuning for new, unseen domains where the nature of "task-irrelevant" dynamics could be drastically different.
- Impact: UniVLA represents a significant step forward for generalist robotics. It shifts the focus from costly, supervised action labeling to clever, unsupervised representation learning. This approach could unlock the potential of web-scale data and accelerate progress towards creating robots that can learn new skills quickly and operate robustly in the complexities of the real world.