FASTER: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-wise Decoding
TL;DR Summary
FASTER introduces a learnable action tokenizer based on vector quantization to balance reconstruction accuracy and inference efficiency in autoregressive VLA models. It represents action chunks as single-channel images and combines a transformer backbone with residual vector quantization to capture global spatio-temporal structure. The subsequent FASTerVLA model further enhances performance and inference speed through block-wise autoregressive decoding and a lightweight action expert.
Abstract
Autoregressive vision-language-action (VLA) models have shown strong capabilities in robotic manipulation. However, their core component—action tokenization—often suffers from a trade-off between reconstruction accuracy and inference efficiency. We present Flexible Action Sequence Tokenization for efficient inference (FASTER), a vector-quantization-based learnable tokenizer framework. FASTER represents action chunks as single-channel images to capture global spatio-temporal relationships. Combining a transformer backbone with residual vector quantization, it models cross-dimensional dependencies and regulates code length, thereby preserving structured action dependencies while enabling flexible code organization for downstream VLA models. Building on FASTER, we propose FASTerVLA, which integrates a block-wise autoregressive decoding paradigm and an autoregressive action expert to fully exploit the strengths of autoregressive VLAs. FASTerVLA surpasses existing state-of-the-art VLA models in both performance and inference speed. We construct a systematic evaluation framework for action tokenization and, through comprehensive analysis, demonstrate the performance, efficiency, and flexibility of FASTER across models, tasks, and embodiments. Furthermore, extensive experiments show that FASTerVLA further enhances overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance across diverse simulated and real-world settings.
In-depth Reading
1. Bibliographic Information
1.1. Title
FASTER: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-wise Decoding
1.2. Authors
The paper is listed with "Anonymous authors" as it was under double-blind review at the time of its submission.
1.3. Journal/Conference
The paper was submitted for review, and the provided link points to OpenReview. OpenReview is a platform commonly used for peer review by top-tier machine learning and AI conferences such as ICLR (International Conference on Learning Representations), NeurIPS (Conference on Neural Information Processing Systems), and ICML (International Conference on Machine Learning). The high quality of the work suggests it was intended for one of these prestigious venues.
1.4. Publication Year
The paper cites preprints from 2025, indicating it was written and submitted in 2025 for review at a 2025/2026 conference cycle.
1.5. Abstract
The paper introduces FASTER (Flexible Action Sequence Tokenization for efficient inference), a learnable, vector-quantization-based tokenizer for robotic actions. The key problem it addresses is the trade-off between reconstruction accuracy and inference speed in existing action tokenization schemes for autoregressive Vision-Language-Action (VLA) models. FASTER represents action chunks as single-channel images and uses a transformer backbone with residual vector quantization to capture spatio-temporal dependencies efficiently. Building on this, the paper proposes FASTerVLA, a VLA model that incorporates a block-wise decoding paradigm and a specialized "action expert" to enhance performance and speed. The authors establish a systematic evaluation framework and demonstrate through extensive experiments that FASTER offers superior performance, efficiency, and flexibility across various models, tasks, and robot embodiments. FASTerVLA is shown to surpass previous state-of-the-art models in both task success and inference speed in simulated and real-world settings.
1.6. Original Source Link
- Official Link: https://openreview.net/pdf?id=k6nTUFoqeT
- Publication Status: The document is a preprint available on OpenReview, where it was submitted for peer review.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper lies within the domain of robotic control using Vision-Language-Action (VLA) models, which leverage the power of large language models to interpret visual and linguistic commands and generate robot actions. VLA models generally fall into two categories: diffusion-based and autoregressive.
- Diffusion-based models excel at generating precise, continuous actions but often struggle to fully utilize high-level visual and language context.
- Autoregressive models, which are architecturally similar to successful Vision-Language Models (VLMs), are better at instruction-following and generalization. However, they face a significant bottleneck: action tokenization.
Action tokenization is the process of converting continuous robot actions (e.g., joint angles, end-effector coordinates) into a sequence of discrete tokens that the model can predict. This process presents a critical trade-off:
- High Compression (Efficiency): To make inference fast, the action sequence should be represented by as few tokens as possible. Autoregressive models generate tokens one by one, so fewer tokens mean fewer forward passes and lower latency.
- High Fidelity (Accuracy): Compressing the action sequence too much can lead to a loss of information, resulting in inaccurate or jerky robot movements. The reconstructed action must be a faithful representation of the intended motion.
Prior methods failed to strike an optimal balance. For example, simple binning lacks compression, while DCT+BPE compression (used in the FAST tokenizer) produces variable-length token sequences, which complicates training and slows down inference. Existing Vector Quantization (VQ) based methods often suffered from poor reconstruction quality.
This paper's entry point is to design a new action tokenizer, FASTER, that satisfies four key principles: high compression efficiency, robust reconstruction quality, explicit modeling of the 2D spatio-temporal structure of actions, and flexibility across different robots and tasks.
2.2. Main Contributions / Findings
The paper makes three primary contributions:
- Proposal of FASTERVQ: A novel, learnable action tokenizer based on Residual Vector Quantization (RVQ). It treats action chunks as 2D "images" and uses a transformer-based autoencoder to compress them into a compact, fixed-length sequence of discrete codes. This design achieves a superior trade-off between reconstruction accuracy and compression ratio.
- Proposal of FASTerVLA: A powerful and efficient VLA model built upon FASTERVQ. It introduces two key architectural innovations:
  - Block-wise Autoregressive Decoding: Instead of generating one token at a time, the model generates a block of tokens in a single forward pass, dramatically accelerating inference.
  - Lightweight Action Expert: A small, specialized transformer head that handles action decoding, preventing the fine-tuning process from negatively impacting the powerful pretrained weights of the VLM backbone.
- Establishment of a Systematic Evaluation Framework: The paper conducts a comprehensive and systematic analysis of action tokenization, evaluating FASTER and other methods across a wide range of benchmarks, including nine distinct task suites, six robot embodiments (real and simulated), and multiple VLM backbones.

The key finding is that FASTerVLA is the first autoregressive VLA model to surpass state-of-the-art non-autoregressive (diffusion) models in both task performance and inference speed, resolving the long-standing efficiency-accuracy trade-off.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, one must be familiar with the following concepts:
- Vision-Language-Action (VLA) Models: These are AI models designed to control robots. They take multimodal inputs—typically a camera image (vision), a natural language command (language), and the robot's current state (e.g., joint angles)—and output a sequence of actions for the robot to execute. They are often built by adapting large pre-trained Vision-Language Models (VLMs).
- Autoregressive Models: These are models that generate sequences of data one element at a time, where each new element is conditioned on all the previously generated elements. For example, a language model generates the next word based on the sentence it has written so far. In this paper's context, the VLA model predicts the next action token based on the previously predicted tokens.
- Action Tokenization: This is the process of converting a continuous, high-dimensional robot action sequence into a sequence of discrete integer tokens. For example, a 2-second trajectory of a 7-DOF (degree-of-freedom) robot arm is a continuous signal. Tokenization maps this signal to a sequence like [12, 512, 34, 987, ...]. This is necessary for autoregressive models, which are designed to work with discrete vocabularies.
- Vector Quantization (VQ): A data compression technique. It works by creating a "codebook," which is a predefined set of vectors (codewords). To compress an input vector, VQ finds the closest codeword in the codebook and replaces the input with the index of that codeword (a minimal code sketch follows this list).
  - Encoder: Maps the input data to a latent vector $z$.
  - Codebook ($\mathcal{C}$): A collection of $K$ learnable embedding vectors, $\mathcal{C} = \{e_1, \dots, e_K\}$.
  - Quantizer: For a given latent vector $z$, it finds the nearest codebook vector, $q(z) = e_k$ where $k = \arg\min_j \|z - e_j\|_2$. The output is the index $k$.
  - Decoder: Takes the quantized vector $e_k$ and reconstructs the original data.
- Residual Vector Quantization (RVQ): An improvement over standard VQ that allows for more accurate representations. It uses multiple codebooks in stages.
  - The first stage quantizes the original data.
  - The second stage quantizes the "residual error" (the difference between the original data and the output of the first stage).
  - This process repeats for several stages, with each subsequent stage refining the representation by modeling the error of the previous one. The final representation is the sum of the quantized vectors from all stages. This creates a coarse-to-fine representation.
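To make the basic VQ step concrete, here is a minimal NumPy sketch of nearest-codeword quantization. The codebook size and latent dimension are illustrative assumptions, not values from the paper.

```python
import numpy as np

def vq_quantize(z, codebook):
    """Nearest-codeword lookup: map each latent vector to a codebook index."""
    # z: (N, d) latent vectors; codebook: (K, d) learnable codewords.
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)        # discrete tokens
    z_q = codebook[indices]               # quantized vectors fed to the decoder
    return indices, z_q

# Usage with illustrative sizes (not from the paper):
codebook = np.random.randn(16, 4)         # K = 16 codewords of dimension d = 4
latents = np.random.randn(8, 4)           # 8 latent vectors from some encoder
tokens, z_q = vq_quantize(latents, codebook)
```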
3.2. Previous Works
The paper positions itself relative to several key lines of research:
- Diffusion-based VLAs: Models like Diffusion Policy and π₀ (pi-zero) generate continuous actions using diffusion models. While they achieve high precision, the paper notes they can be less responsive to language and visual cues.
- Autoregressive VLAs:
  - RT-2 and OpenVLA: Early autoregressive VLAs that demonstrated the potential of transferring knowledge from web-scale VLMs to robotics. Their tokenization schemes were often simple.
  - Binning: A basic tokenization method used by models like RT-2. It divides the range of each action dimension into a fixed number of bins and assigns a token to each bin. This method does not perform any compression, leading to very long token sequences and slow inference (a minimal sketch of this scheme follows this list).
  - FAST (π₀-FAST): The previous state-of-the-art autoregressive VLA. Its tokenizer uses a two-step, non-learnable process:
    - Discrete Cosine Transform (DCT): A signal processing technique to transform the action sequence into the frequency domain, concentrating most of the important information into a few coefficients. This helps with compression.
    - Byte Pair Encoding (BPE): A standard text tokenization algorithm that iteratively merges frequent pairs of bytes to create a vocabulary. While effective, BPE produces variable-length token sequences, which makes training less stable and inference slow.
  - VQ-based Tokenizers (MiniVLA, VQ-VLA): These works introduced learnable VQ tokenizers for actions. However, the paper's analysis shows they suffer from poor reconstruction quality, which limits the performance of the downstream VLA model.
3.3. Technological Evolution
The evolution of action tokenization in robotics has moved from simple, non-compressive methods to more sophisticated, learnable approaches:
- Direct Regression: Predicting continuous values directly (not tokenization).
- Discretization (Binning): Simple, but leads to long sequences (RT-2).
- Non-Learnable Compression: Using signal processing techniques like DCT combined with text tokenizers like BPE (FAST). This improved compression but introduced issues with variable token lengths.
- Learnable Compression (VQ): Using neural networks to learn an optimal, compressed representation (MiniVLA, VQ-VLA). This promised a better trade-off, but initial attempts had reconstruction issues.
- FASTER (this paper): A more advanced learnable compression method using a hybrid transformer-convolutional architecture, RVQ, and a specialized loss function to achieve both high compression and high fidelity.
3.4. Differentiation Analysis
FASTER differentiates itself from previous works in several key ways:
- vs. FAST: FASTER uses a learnable VQ-based approach, whereas FAST uses a non-learnable pipeline. This allows FASTER to generate a fixed-length, compact representation that is more data-driven and stable for training.
- vs. MiniVLA / VQ-VLA: FASTER employs a more powerful transformer-based autoencoder (TAAE) and a hybrid loss function (reconstruction in both time and frequency domains). This results in significantly higher reconstruction fidelity, overcoming the main weakness of prior VQ-based methods.
- Architectural Innovation: FASTER represents action chunks as 2D single-channel images, allowing it to model spatio-temporal dependencies more effectively. This is inspired by techniques from image and audio processing and is novel in the context of action tokenization.
- System-level Innovation (FASTerVLA): The paper doesn't just propose a better tokenizer; it co-designs a VLA model that fully leverages it. The block-wise decoding and action expert are crucial system-level optimizations that make autoregressive inference faster than ever before, even surpassing non-autoregressive models.
4. Methodology
4.1. Principles
The core principle behind FASTER and FASTerVLA is to address the shortcomings of autoregressive VLA models from three complementary angles:
- Better Tokenization (FASTERVQ): Create a compact, fixed-length, and high-fidelity action representation using a learnable, data-driven tokenizer inspired by modern audio codecs.
- Faster Decoding (Block-wise Decoding): Overcome the sequential bottleneck of autoregressive generation by predicting multiple tokens in parallel.
- More Efficient Learning (Action Expert): Isolate the learning of action-specific patterns into a smaller module to protect the powerful, general-purpose representations of the pre-trained VLM backbone.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology is divided into two main parts: the FASTERVQ tokenizer and the FASTerVLA model.
4.2.1. FASTERVQ: The Action Tokenizer
The FASTERVQ tokenizer transforms a continuous action chunk into a discrete sequence of tokens. It consists of two stages: an action patchifier and a residual VQ autoencoder.
The following figure from the paper provides an overview of the FASTERVQ architecture.

1. Action Patchifier
This initial step preprocesses the raw action sequence to make it more suitable for the transformer model. Given an action chunk of horizon $H$, denoted $a_{t:t+H}$, where each action vector is $D$-dimensional:
- Temporal Partitioning: The sequence's time dimension (length $H$) is uniformly divided into groups of equal length.
- Dimensional Partitioning: The action dimension is non-uniformly partitioned into groups based on physical properties (e.g., end-effector position, rotation, and gripper state are grouped separately). This is crucial because different action dimensions have very different data distributions (e.g., gripper state is often binary, while position is continuous). Each group is then padded to match the size of the largest group.
- Patch Creation: This process reshapes the action chunk into a 2D, single-channel tensor, which is then flattened into a set of patches. Each patch represents a spatio-temporal segment of the action (a sketch of this reshaping follows this list).
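The following is a minimal NumPy sketch of such a patchifier. The group boundaries (position / rotation / gripper) and the number of temporal groups are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def patchify(actions, n_time_groups, dim_groups):
    """Reshape an (H, D) action chunk into spatio-temporal patches.

    dim_groups: list of index lists, e.g. [[0, 1, 2], [3, 4, 5], [6]] for
    end-effector position / rotation / gripper (an assumed split).
    """
    H, _ = actions.shape
    t_len = H // n_time_groups                 # length of each temporal group
    g_max = max(len(g) for g in dim_groups)    # pad every group to this width
    patches = []
    for t in range(n_time_groups):
        chunk = actions[t * t_len:(t + 1) * t_len]           # (t_len, D)
        for group in dim_groups:
            patch = chunk[:, group]                           # (t_len, |group|)
            patch = np.pad(patch, ((0, 0), (0, g_max - patch.shape[1])))
            patches.append(patch)
    return np.stack(patches)    # (n_time_groups * len(dim_groups), t_len, g_max)

# Example: a 16-step, 7-DoF chunk split into 4 temporal groups (assumed values).
chunk = np.random.randn(16, 7)
patches = patchify(chunk, n_time_groups=4, dim_groups=[[0, 1, 2], [3, 4, 5], [6]])
```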
2. Residual VQ Action Tokenizer
This component, named Transformer Action AutoEncoder (TAAE), is a hybrid autoencoder that compresses the action patches into discrete codes.
- Encoder ($\mathcal{E}$): A hybrid transformer-convolutional network that takes an action patch and downsamples it into a latent embedding $\mathbf{z}$.
- Residual Vector Quantization (RVQ): This is the core quantization step. It decomposes the latent embedding into a coarse-to-fine representation using $N_c$ quantization levels (see the sketch after this list).
  - Initialization: The first residual is the latent embedding itself: $\mathbf{r}_1 = \mathbf{z}$.
  - Iterative Quantization: For each level $i$ from 1 to $N_c$:
    - The current residual $\mathbf{r}_i$ is quantized by finding the nearest vector in the $i$-th codebook; the quantizer $\mathcal{Q}_i$ returns the index of the nearest codebook entry for each vector in $\mathbf{r}_i$.
    - The next residual is calculated as the error from the current quantization: $\mathbf{r}_{i+1} = \mathbf{r}_i - \mathcal{Q}_i(\mathbf{r}_i)$.
  - Final Quantized Embedding: The final, high-fidelity quantized representation is the sum of the quantized vectors from all levels: $ \mathbf{z}_q = \sum_{i=1}^{N_c} \mathcal{Q}_i(\mathbf{r}_i) $
  - Action Tokens: The indices collected from each quantization step form a discrete code tensor, which is flattened to form the final sequence of action tokens for the VLA model.
- Decoder ($\mathcal{D}$): Another transformer-based network that takes the quantized embedding $\mathbf{z}_q$ and reconstructs the original action patch.
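Below is a minimal NumPy sketch of the residual quantization loop described above. Codebook sizes and the number of levels are illustrative, and the plain nearest-neighbour search stands in for the learned quantizers used in the paper.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual VQ: quantize latents coarse-to-fine over several codebooks."""
    residual = z.copy()
    z_q = np.zeros_like(z)
    indices = []
    for codebook in codebooks:                         # one codebook per level
        dists = np.linalg.norm(residual[:, None, :] - codebook[None], axis=-1)
        idx = dists.argmin(axis=1)                     # nearest codeword per vector
        quantized = codebook[idx]
        z_q += quantized                               # sum over levels gives z_q
        residual -= quantized                          # error is passed to next level
        indices.append(idx)
    return z_q, np.stack(indices)                      # indices: (levels, N) tokens

# Illustrative usage: 3 levels, 4096-entry codebooks, 8-dim latents (assumed sizes).
codebooks = [np.random.randn(4096, 8) for _ in range(3)]
latents = np.random.randn(12, 8)
z_q, codes = rvq_encode(latents, codebooks)
```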
3. Training Objective
The tokenizer is trained end-to-end by minimizing a composite loss function that ensures both accurate reconstruction and stable codebook learning.
$ \mathcal{L} = \underbrace{\left\| a_{t:t+H}^P - \hat{a}_{t:t+H}^P \right\|_1}_{\text{L1 reconstruction loss (time domain)}} + \underbrace{\left\| \mathrm{DCT}(a_{t:t+H}^P) - \mathrm{DCT}(\hat{a}_{t:t+H}^P) \right\|_1}_{\text{L1 reconstruction loss (frequency domain)}} + \underbrace{\lambda \cdot \left\| z - \mathrm{sg}(z_q) \right\|_2^2}_{\text{Commitment loss}} $
- Symbol Explanation:
  - $a_{t:t+H}^P$: The original input action patch.
  - $\hat{a}_{t:t+H}^P$: The reconstructed action patch from the decoder.
  - $\mathrm{DCT}(\cdot)$: The Discrete Cosine Transform, which converts the signal to the frequency domain. The loss in this domain helps capture global trends in the action.
  - $z$: The output of the encoder (before quantization).
  - $z_q$: The quantized latent embedding.
  - $\mathrm{sg}(\cdot)$: The stop-gradient operator. This is a standard trick in VQ training. During backpropagation, the gradient from the decoder passes "straight through" the quantizer to the encoder, as if quantization were an identity function, while the commitment loss term pulls the encoder's output $z$ closer to the chosen codebook vector $z_q$, effectively teaching the encoder to produce outputs that are easy to quantize.
  - $\lambda$: A hyperparameter balancing the reconstruction and commitment losses.
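As a concrete reading of this objective, the following PyTorch sketch computes the three terms, assuming the DCT is applied along the time axis and using an illustrative weight λ = 0.25 (not a value from the paper).

```python
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size (n, n)."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # time index
    basis = torch.cos(torch.pi * (2 * i + 1) * k / (2 * n)) * (2.0 / n) ** 0.5
    basis[0] /= 2.0 ** 0.5
    return basis

def tokenizer_loss(a, a_hat, z, z_q, lam=0.25):
    """Time-domain L1 + frequency-domain L1 + commitment loss (with stop-gradient)."""
    # a, a_hat: (B, H, D) original / reconstructed action chunks
    # z, z_q:   encoder latents and their quantized counterparts
    dct_m = dct_matrix(a.shape[1]).to(a)
    time_l1 = (a - a_hat).abs().mean()
    dct_a = torch.einsum("kh,bhd->bkd", dct_m, a)
    dct_a_hat = torch.einsum("kh,bhd->bkd", dct_m, a_hat)
    freq_l1 = (dct_a - dct_a_hat).abs().mean()
    commit = lam * ((z - z_q.detach()) ** 2).mean()          # sg(z_q) via detach
    return time_l1 + freq_l1 + commit
```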
4.2.2. FASTerVLA: The VLA Model
FASTerVLA is an autoregressive VLA model built on a standard VLM architecture but with several key modifications for efficiency and performance.
The following figure from the paper provides an overview of the FASTerVLA architecture and decoding process.

1. Architecture
- VLM Backbone: It uses a standard VLM structure: a vision tower (e.g., ViT) to encode images, a projection layer to map visual features to the language model's embedding space, and a large transformer-based language model (e.g., PaliGemma).
- Lightweight Action Expert: Instead of training the entire VLM backbone to predict action tokens, a small, lightweight transformer head (the "action expert") is added on top. The large VLM backbone encodes the multimodal context (image, text, state) once. Then, the action expert autoregressively decodes the action tokens using these context features. This design has two benefits:
- It minimizes interference with the powerful, pre-trained VLM weights, preserving its generalization capabilities.
- Decoding is much faster since it only involves forward passes through the small expert, not the entire backbone.
- Other Details:
- Action embeddings are added to the VLM's vocabulary.
- Proprioceptive states (robot's own state) are discretized and treated as text tokens.
- Spacing augmentation is used during training (jittering the positions of action tokens) to prevent the model from overfitting to fixed positions.
2. Block-wise Autoregressive (BAR) Decoding
This is the key innovation for improving inference speed. Standard autoregressive models predict one token at a time. BAR predicts a block of tokens at once.
- Standard Autoregressive Loss: $ \mathcal{L}_{\mathrm{AR}} = - \sum_{i=1}^{N} \log p_{\theta}(c_i \mid c_{<i}) $, where $c_i$ is the $i$-th token and $c_{<i}$ are all preceding tokens.
- Block Partitioning: The full sequence of $N$ action tokens is partitioned into $J$ contiguous blocks of size $B$, so $N = J \cdot B$.
- Block-wise Autoregressive Loss: The model is trained to predict all tokens within the next block, conditioned on all previous blocks:
  $ \mathcal{L}_{\mathrm{BAR}} = - \sum_{j=1}^{J} \sum_{i=1}^{B} \log p_{\theta}(c_{j,i} \mid C_{<j}) $
  Within a block $j$, the prediction for each token is conditioned on the same context: all previous blocks $C_{<j}$. This allows the $B$ tokens in a block to be predicted in parallel in a single forward pass.
- Block-wise Causal Mask: During training, the standard causal attention mask is modified. Tokens within the same block are allowed to attend to each other, but a block as a whole can only attend to preceding blocks (as shown in Figure 3c).
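A minimal sketch of such a block-wise causal mask, assuming a flat token sequence whose length is a multiple of the block size:

```python
import torch

def block_causal_mask(num_tokens: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask: True where attention is allowed.

    Tokens may attend to every token in their own block and in all
    preceding blocks, but never to tokens in later blocks.
    """
    block_id = torch.arange(num_tokens) // block_size
    return block_id.unsqueeze(0) <= block_id.unsqueeze(1)   # (query, key)

# Example: 6 tokens in blocks of 3 -> the two diagonal 3x3 blocks and the
# lower-left block are visible; the upper-right block stays masked.
print(block_causal_mask(6, 3).int())
```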
3. Decoding Order
The decoding process is structured hierarchically, following the coarse-to-fine nature of RVQ. The model first generates all tokens for the first codebook level across the entire time horizon, then moves to the second codebook level to predict the residuals, and so on. This "codebook-first" order (Figure 3b) is more stable than a simple "time-first" order because it builds the action representation from coarse global structure to fine-grained details.
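To illustrate the difference between the two orderings, here is a small sketch that flattens an RVQ code tensor (levels × time tokens) in codebook-first versus time-first order; the tensor shape is an assumption for illustration.

```python
import numpy as np

codes = np.arange(12).reshape(3, 4)   # 3 RVQ levels x 4 temporal tokens (assumed)

# Codebook-first: all coarse level-1 codes over the horizon, then level-2, ...
codebook_first = codes.reshape(-1)    # [L1 t0..t3, L2 t0..t3, L3 t0..t3]

# Time-first: all levels for t0, then all levels for t1, ...
time_first = codes.T.reshape(-1)      # [t0 L1..L3, t1 L1..L3, ...]
```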
5. Experimental Setup
5.1. Datasets
The authors use a comprehensive set of nine benchmarks across six different robot embodiments to evaluate their methods in both simulated and real-world scenarios.
The following figure from the paper shows the diversity of the evaluation setups.
This figure is a schematic showing the different evaluation tasks and settings, including single-arm tasks, dual-arm manipulation, pre-training, and adaptation and zero-shot evaluations, designed to probe the model's capabilities along multiple dimensions.
- LIBERO: A benchmark for lifelong robot learning with 40 tasks across four suites (Spatial, Object, Goal, Long). Used for evaluating in-distribution performance and adaptability.
- Simpler: A benchmark for evaluating the pre-training performance of VLAs. The models are trained on the Bridge dataset.
- VLABench: A benchmark focused on evaluating the generalization capabilities of VLAs, with controlled tests for language understanding, visual disturbances, and commonsense transfer.
- GalaxeaManipSim: A simulator used for evaluating dual-arm manipulation tasks.
- Xarm Suite: A real-world dataset with an XArm robot, covering tasks like flexible object manipulation and long-horizon planning.
- R1Lite Suite: Real-world tasks with an R1Lite robot, including desktop manipulation (picking cubes, table bussing) and a whole-body control task (making the bed).
- Bridge: A large-scale manipulation dataset used for pre-training models that are then evaluated in a zero-shot setting on a WidowX robot arm.
- Droid: A very large-scale (90k+ trajectories) in-the-wild manipulation dataset used for pre-training, with zero-shot evaluation on a Franka robot.
5.2. Evaluation Metrics
The paper uses two primary metrics to evaluate performance:
5.2.1. Success Rate (SR)
- Conceptual Definition: This is the most common metric in robotics for task-oriented evaluation. It measures the percentage of trials in which the robot successfully completes the specified task according to predefined success criteria. A higher SR indicates better policy performance.
- Mathematical Formula: $ \text{SR}\,(\%) = \frac{\text{Number of successful trials}}{\text{Total number of trials}} \times 100 $
- Symbol Explanation:
- Number of successful trials: The count of evaluation episodes where the task goal was achieved.
- Total number of trials: The total number of episodes attempted.
5.2.2. Valid Reconstruction Rate (VRR)
- Conceptual Definition: This metric was proposed by the authors to specifically evaluate the quality of an action tokenizer. It measures the percentage of individual action steps in a trajectory that are reconstructed with an error below a tolerance threshold $\sigma$. A high VRR indicates that the tokenizer preserves the fine-grained details of the motion accurately.
- Mathematical Formula: $ \mathrm{VRR} = \frac{N_{\mathrm{valid}}}{N_{\mathrm{total}}} , \quad N_{\mathrm{valid}} = \sum_{i=1}^{N_{\mathrm{total}}} \mathbf{1}\left(\left\| \mathbf{a}_i^{\mathrm{recon}} - \mathbf{a}_i^{\mathrm{gt}} \right\|_1 < \sigma\right) $
- Symbol Explanation:
  - $N_{\mathrm{total}}$: The total number of action steps in the evaluation dataset.
  - $N_{\mathrm{valid}}$: The number of action steps deemed "validly" reconstructed.
  - $\mathbf{a}_i^{\mathrm{recon}}$: The $i$-th reconstructed action vector from the tokenizer's decoder.
  - $\mathbf{a}_i^{\mathrm{gt}}$: The $i$-th ground-truth action vector.
  - $\|\cdot\|_1$: The L1 norm (Manhattan distance).
  - $\mathbf{1}(\cdot)$: The indicator function, which is 1 if the condition inside is true, and 0 otherwise.
  - $\sigma$: A hyperparameter defining the tolerance for reconstruction error. The authors use a small default tolerance corresponding to about 1 cm of error.
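A minimal sketch of the VRR computation as defined above; the default tolerance here is an assumption chosen to match the "about 1 cm" description.

```python
import numpy as np

def valid_reconstruction_rate(recon, gt, sigma=1e-2):
    """Fraction of action steps whose per-step L1 error is below sigma."""
    # recon, gt: (N_total, D) reconstructed and ground-truth action vectors
    errors = np.abs(recon - gt).sum(axis=-1)       # L1 error per action step
    return float((errors < sigma).mean())
```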
5.3. Baselines
The paper compares FASTerVLA against a wide range of state-of-the-art VLA models, categorized as follows:
- Non-Autoregressive Models:
  - Diffusion Policy: A pioneering work using diffusion models for visuomotor control.
  - Octo-Base: A generalist robot policy model.
  - π₀ (pi-zero): A state-of-the-art VLA using flow matching (a diffusion-like generative process). This is a very strong baseline for performance.
- Autoregressive Models:
  - OpenVLA: An open-source VLA using a simple binning tokenizer.
  - MiniVLA: A VLA using a VQ-based tokenizer.
  - VQ-VLA: Another VLA using a VQ-based tokenizer.
  - π₀-FAST-R and π₀-FAST-D: The previous state-of-the-art autoregressive VLA, which uses the FAST tokenizer. This is the most direct and important baseline for comparison.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results strongly validate the paper's claims, addressing the four research questions posed by the authors.
6.1.1. FASTERVQ Balances Accuracy and Efficiency
The following figure from the paper compares FASTER against other tokenizers on reconstruction quality and compression rate.
This figure is a chart showing the VRR (valid reconstruction rate) of different tokenizers and FASTer's compression ratio. Subfigure (a) compares different tokenizers under different settings, subfigure (b) shows FASTer's performance, and subfigure (c) compares the compression rates of the different tokenizers.
Figure 5 shows that:
- Baseline VQ methods (MiniVLA, VQ-VLA) have very poor reconstruction accuracy (low VRR), even on in-distribution data. While they compress well, the reconstructed actions are too noisy to be useful for high-precision tasks.
- FASTER achieves a near-perfect VRR on in-distribution data and maintains very high VRR on out-of-distribution (OOD) data, indicating excellent generalization.
- FASTER trained on more data (FASTER(L)) shows even better reconstruction accuracy than the one trained on less data (FASTER(S)), demonstrating a positive scaling law for action tokenization. This suggests that, like LLMs, action tokenizers benefit from larger and more diverse training data.
6.1.2. FASTER is Flexible Across Tasks, Embodiments, and Backbones
- Cross-Task/Embodiment: Table 2 shows that FASTER(L), trained only on single-arm end-effector data, can achieve 100% VRR (at the default tolerance) when reconstructing dual-arm joint-position data for a completely different robot. This is a remarkable demonstration of flexibility, suggesting the tokenizer learns a general, underlying representation of robotic motion.
- Cross-Backbone: The following figure from the paper shows that FASTER consistently improves performance over the FAST tokenizer regardless of the VLM backbone used.

(Figure: comparison of FASTER and FAST across multiple VLM backbones on Libero.)
Notably, FASTER turned the weakest model (InternVL3.5-2B with FAST) into the strongest performing model overall, boosting its average success rate on Libero by a massive 17.3%. This shows that a better action representation can unlock the latent potential of powerful VLM backbones.
6.1.3. FASTerVLA Achieves State-of-the-Art Performance and Generalization
The following table from the paper presents the main performance results on the Libero and Simpler-Bridge benchmarks.
This figure is a chart of policy performance on the Libero and Simpler-Bridge benchmarks, comparing the success rates and task progress of different models (such as π₀, FAST, and FASTer) to assess their effectiveness and efficiency.
- SOTA Performance: FASTerVLA (specifically the Ours(BAR) variant) achieves the highest average success rate on both Libero (97.9%) and Simpler-Bridge (87.9%). Crucially, it outperforms the strongest non-autoregressive baseline, π₀ (94.2% on Libero), which was previously considered superior in manipulation precision. This is a significant milestone for autoregressive VLAs.
- Generalization Ability: The following figures from the paper show the model's performance on out-of-distribution and zero-shot tasks.

This figure is a bar chart of success rates across task categories (In-distribution, Cross-Category, Common-Sense, Semantic, Texture Disturbance, and Overall), showing FASTER (in purple) achieving higher success rates than the other models on most categories, highlighting its advantages in efficiency and performance.

This figure plots the task-completion percentages of π₀, FAST, FAST+, and FASTer at different levels of task progress; on the Widow and Droid settings, FASTer clearly outperforms the other methods.
On the VLABench generalization suite (Figure 7), FASTerVLA achieves the highest overall success rate and has the smallest performance drop compared to its in-distribution performance. On zero-shot real-world tasks (Figure 8, Widow and Droid), FASTerVLA also demonstrates a consistent performance advantage over other methods.
6.1.4. Tokenizer Vocabulary Distribution Affects VLA Generalization
The authors analyze why FASTER leads to better generalization.
The following are the results from Table 3 of the original paper:
| Metric | Fast | Fast+ | FASTer |
|---|---|---|---|
| N_vocab ↑ | 2048 | 2048 | 4096 |
| Usage [%] ↑ | 48.39 | 57.37 | 100.0 |
| N_active (σ = 10⁻²) ↓ | 8 | 16 | 3 |
| F_max [%] ↓ | 9.63 | 1.82 | 1.35 |
| Entropy_norm ↑ | 0.69 | 0.77 | 0.91 |
This table reveals that:
- The Fast tokenizer suffers from token collapse: only 48% of its vocabulary is ever used, and one single token appears nearly 10% of the time (F_max). This suggests a highly skewed, unbalanced distribution where the model likely over-relies on a few "common" action tokens.
- FASTER utilizes its entire vocabulary (100% usage) and has a much more balanced distribution, with a low max token frequency (F_max = 1.35%) and a normalized entropy close to 1 (0.91). The authors conclude that a balanced and fully utilized token vocabulary is crucial for good generalization, as it provides a richer and more expressive action representation for the VLA model to learn from.
6.1.5. Inference Efficiency
The authors report significant speedups. On the Libero benchmark, FASTerVLA runs at 140 ms per inference step, which is much faster than π₀-FAST (224-628 ms) and even slightly faster than the non-autoregressive π₀ (200 ms). On the more complex whole-body control task (R1Lite-WBC), FASTerVLA is on par with π₀ (around 290-300 ms) while being dramatically faster than π₀-FAST (1,100-3,000 ms). This confirms that block-wise decoding successfully bridges the efficiency gap.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Model | LIBERO | | | | | Simpler-Bridge | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Spatial | Object | Goal | Long | Average | Spoon | Carrot | Block | Eggplant | Average |
| Diffusion Policy (Chi et al., 2023) | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 | - | - | - | - | - |
| Octo-Base (Team et al., 2024) | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 | 12.5 | 8.3 | 0.0 | 43.1 | 16.0 |
| SpatialVLA (Qu et al., 2025) | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 | 16.7 | 25.0 | 29.2 | 100.0 | 42.7 |
| π0 (Black et al., 2024) | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | 66.7 | 58.3 | 58.3 | 88.3 | 66.7 |
| OpenVLA-OFT (Kim et al., 2025) | 96.2 | 98.2 | 95.6 | 92.0 | 95.5 | 12.5 | 4.2 | 8.3 | 0.0 | 6.25 |
| UniVLA (Bu et al., 2025) | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 | 54.2 | 66.7 | 50.0 | 4.2 | 43.8 |
| OpenVLA (Kim et al., 2024) | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | 32.0 | 30.0 | 18.0 | 38.0 | 29.5 |
| Palligemma + Naive Tokenizer | 55.8 | 64.8 | 64.4 | 31.2 | 54.1 | 66.7 | 29.2 | 12.5 | 54.2 | 40.9 |
| MiniVLA (Belkhale & Sadigh, 2024) | - | - | - | 77.0 | - | 68.0 | 44.0 | 70.0 | 14.0 | 49.0 |
| VQ-VLA (Wang et al., 2025c) | - | - | 75.2 | 60.0 | - | 12.5 | 8.0 | 6.0 | 0.0 | 6.3 |
| π0-FAST-R (Pertsch et al., 2025) | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 | 29.1 | 21.9 | 10.8 | 66.6 | 32.1 |
| π0-FAST-D (Pertsch et al., 2025) | 96.6 | 97.2 | 96.0 | 86.8 | 94.2 | 77.5 | 88.3 | 68.3 | 71.7 | 76.5 |
| Ours(Normal) | 99.4 | 98.8 | 94.8 | 88.6 | 95.4 | 97.5 | 83.3 | 65.0 | 78.3 | 81.0 |
| Ours(BAR) | 98.0 | 99.4 | 98.6 | 95.4 | 97.9 | 91.7 | 93.3 | 67.5 | 99.2 | 87.9 |
6.3. Ablation Studies / Parameter Analysis
The authors conducted thorough ablation studies in Appendix A.3 to validate their design choices.
The following are the results from Table 5 of the original paper:
| (a) Tokenizer | | | (b) Codebook Size | | | (c) Residual | |
|---|---|---|---|---|---|---|---|
| Arch. | SR | L1 | Size | SR | Utilization | #Res. | SR |
| CNN | 96.2 | 0.0027 | 512 | 95.4 | 100 | 1 | 93.4 |
| Transf. | 95.3 | 0.0036 | 1024 | 93.2 | 100 | 2 | 95.5 |
| TAAE | 97.9 | 0.0021 | 4096 | 97.9 | 99.6 | 3 | 97.9 |
| - | - | - | 8192 | 96.3 | 95.1 | 8 | 96.6 |
- FASTERVQ Ablations (Table 5):
  - Tokenizer Architecture: The hybrid Transformer Action AutoEncoder (TAAE) achieves the best success rate (97.9%) and lowest reconstruction loss, outperforming pure CNN or pure Transformer backbones.
  - Codebook Size: A codebook of size 4096 provides the best trade-off. Smaller sizes limit representational power, while a very large size (8192) slightly degrades performance, possibly due to optimization challenges.
  - Residual Depth: Using 3 residual codebooks is optimal. One level is too coarse, while too many (8) can introduce instability.
The following are the results from Table 6 of the original paper:
| Libero (AE) | | Simpler-Widow (AE) | | BAR Decoding | | |
|---|---|---|---|---|---|---|
| Variant | SR ↑ | Variant | SR ↑ | Decoding strategy | SR ↑ | Latency (ms) ↓ |
| No AE | 95.5 | No AE | 75.6 | Token-wise | 95.5 | 323 |
| AE (no pretrain) | 94.8 | AE (no pretrain) | 23.6 | Block-wise | 96.7 | 140 |
| AE (with pretrain) | 97.9 | AE (with pretrain) | 87.9 | Block-wise + AE | 97.7 | 140 |

- FASTerVLA Ablations (Table 6):
  - Action Expert (AE): The action expert significantly improves performance (97.9% vs. 95.5% on Libero), but only when pre-trained. Training it from scratch leads to collapse, highlighting the importance of proper initialization.
  - Block-wise Decoding (BAR): Block-wise decoding improves the success rate (96.7% vs. 95.5%) while more than halving the inference latency (140 ms vs. 323 ms).
  - Combined: The final model, combining the pre-trained Action Expert with block-wise decoding, achieves the best balance of high accuracy (97.7%) and low latency (140 ms).
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces FASTER, a comprehensive framework that pushes the boundaries of autoregressive Vision-Language-Action models. The authors propose a novel, codec-inspired action tokenizer (FASTERVQ) that achieves an excellent balance between reconstruction accuracy and compression. By coupling this tokenizer with a VLA model featuring block-wise autoregressive decoding and a lightweight action expert (FASTerVLA), they create a system that is not only highly performant but also remarkably efficient. The key takeaway is that with careful design of the tokenization and decoding schemes, autoregressive models can overcome their traditional latency issues to become a fast, powerful, and scalable solution for general-purpose robot manipulation, even outperforming their non-autoregressive counterparts.
7.2. Limitations & Future Work
The paper does not explicitly list its limitations, but some potential areas can be inferred:
- Data Dependency: While FASTERVQ shows impressive generalization, its performance is still fundamentally tied to the quality and diversity of the large-scale datasets it is trained on. Its ability to generalize to truly novel robot morphologies or action types not seen in training (e.g., legged locomotion vs. manipulation) remains an open question.
- Complexity of the Pipeline: The overall system involves multiple stages: pre-training the tokenizer, initializing the VLM, and then fine-tuning the VLA with an action expert. This multi-stage process can be complex and computationally expensive to replicate and tune.
- Hyperparameter Sensitivity: The patchifier step introduces several hyperparameters (e.g., the number of temporal and dimensional groups) that may need to be carefully chosen for different action spaces or robot embodiments. The paper does not provide an in-depth analysis of this sensitivity.

Future work could explore scaling the FASTERVQ tokenizer with even larger and more diverse datasets, investigating its applicability to other continuous data modalities beyond actions (e.g., audio, video), and exploring more dynamic block sizes for BAR decoding to further optimize inference speed.
7.3. Personal Insights & Critique
This paper is a strong piece of engineering and empirical research that makes a significant contribution to the field of robot learning.
- Strengths and Inspirations:
  - The holistic, system-level approach is a major strength. The authors didn't just design a better tokenizer in isolation; they co-designed the entire VLA system around it to maximize its benefits.
  - The idea of drawing inspiration from audio codecs for action tokenization is very insightful. Both domains deal with continuous time-series data with complex temporal structures, and the success of RVQ in audio clearly translates well to robotics.
  - The systematic and rigorous evaluation is commendable. The paper's claims are backed by extensive experiments across a wide variety of tasks, robots, and backbones, making the results highly credible.
  - The analysis of token vocabulary distribution (Table 3) provides a clear and compelling explanation for why FASTER leads to better generalization. This is a valuable insight for anyone working on tokenization for any modality.
- Potential Issues and Areas for Improvement:
  - While the performance is impressive, the system still relies on a large, pre-trained VLM, inheriting the massive computational and data requirements of such models.
  - The "action expert" design, while effective, feels like a practical workaround. A more integrated architecture where action generation is a more native capability of the backbone model could be a future goal.
  - The paper focuses on imitation learning. How this tokenization scheme would perform in an online reinforcement learning or planning context, where the model must deal with its own generated (and potentially out-of-distribution) action sequences, would be an interesting extension.

Overall, FASTER represents a significant step forward, making autoregressive VLAs a more practical and powerful option for real-world robotics. It convincingly demonstrates that the perceived trade-off between performance and efficiency is not fundamental but can be overcome with better modeling choices.