BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
TL;DR Summary
BridgeVLA introduces a novel vision-language-action model for 3D manipulation, addressing inefficiencies in existing models. By projecting 3D data to 2D images and using heatmaps for action prediction, it achieves state-of-the-art performance on various benchmarks.
Abstract
Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show the proposed method is able to learn 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all the compared baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it is able to achieve a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project Website: https://bridgevla.github.io/
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
1.2. Authors
- Peiyan Li (CASIA, ByteDance Seed, UCAS) - Project Lead
- Yixiang Chen (CASIA, UCAS)
- Hongtao Wu (ByteDance Seed) - Corresponding Author
- Xiao Ma (ByteDance Seed) - Project Lead
- Xiangnan Wu (CASIA)
- Yan Huang (CASIA, UCAS, FiveAges)
- Liang Wang (CASIA, UCAS)
- Tao Kong (ByteDance Seed)
- Tieniu Tan (CASIA, UCAS, NJU) - Corresponding Author
Affiliations Key:
- CASIA: Institute of Automation, Chinese Academy of Sciences
- UCAS: University of Chinese Academy of Sciences
- ByteDance Seed: A research division of ByteDance
- NJU: Nanjing University
1.3. Journal/Conference
Published at: arXiv (Preprint) Date: June 9, 2025 (v2) Status: The paper format suggests submission to a major robotics or machine learning conference (e.g., CoRL or NeurIPS), but it is currently cited as an arXiv preprint. The content represents state-of-the-art research in robot learning.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces BridgeVLA, a novel Vision-Language-Action (VLA) model designed for 3D robot manipulation. The core problem addressed is the inefficiency of current VLA models that either ignore 3D structure or fail to align 3D inputs with the pre-trained nature of Vision-Language Models (VLMs). Key Methodology:
- Input Alignment: Projects 3D point cloud data into multiple 2D orthographic images to match the VLM's native 2D input format.
- Output Alignment: Predicts 2D heatmaps (probability maps) instead of text tokens for actions, unifying input and output in the 2D pixel space.
- Scalable Pre-training: Introduces a pre-training phase where the VLM learns to predict heatmaps for object grounding (locating objects), bridging the gap between language generation and spatial action prediction.
Results: BridgeVLA achieves state-of-the-art (SOTA) performance on simulation benchmarks (RLBench, COLOSSEUM, GemBench) and demonstrates exceptional sample efficiency in the real world (96.8% success rate with only 3 training trajectories per task).
1.6. Original Source Link
- Original Source: https://arxiv.org/abs/2506.07961
- PDF Link: https://arxiv.org/pdf/2506.07961v2.pdf
2. Executive Summary
2.1. Background & Motivation
- The Problem: In robot learning, we want robots to understand language commands (e.g., "pick up the red cup") and manipulate objects in a 3D world.
- 2D VLA Models: Models like RT-2 use pre-trained Vision-Language Models (VLMs) to process 2D images. They are smart (understand semantics) but data-inefficient, often requiring hundreds of demonstrations because they ignore the 3D geometry of the world.
- 3D Policies: Models like RVT use 3D data (point clouds). They are data-efficient (understand geometry) but often lack the rich semantic knowledge embedded in large pre-trained VLMs.
- The Gap: Combining these two is hard. Previous attempts to feed 3D data into VLMs often resulted in a "distribution shift": the 3D data looks nothing like the internet images the VLM was trained on. Furthermore, converting actions into text tokens (like "move to x: 10, y: 20") ignores spatial relationships.
- Core Idea: Instead of forcing 3D data into a text-generation model, BridgeVLA aligns everything to the VLM's native 2D image space. It converts 3D inputs to 2D images and outputs actions as 2D heatmaps, making the pre-trained VLM feel "at home."
The following figure (Figure 1 from the original paper) provides an overview of this concept, showing how 3D inputs are projected and how the model predicts heatmaps:
This image is a schematic showing the structure and function of the BridgeVLA model, which aligns inputs and outputs by projecting 3D inputs into 2D. The model is pre-trained with 2D heatmaps and then fine-tuned in real-world and simulated environments to improve success rates in 3D manipulation.
2.2. Main Contributions & Findings
- Unified 3D VLA Framework: Proposed BridgeVLA, which projects 3D point clouds to multi-view 2D images (Input Alignment) and predicts 2D heatmaps for actions (Output Alignment).
- Scalable Pre-training Strategy: Developed a method to pre-train the VLM backbone on large-scale object detection datasets to output heatmaps, equipping it with spatial grounding capabilities before robot training.
- State-of-the-Art Performance:
- RLBench: Improved average success rate from 81.4% to 88.2%.
- COLOSSEUM: Improved robustness in changing environments from 56.7% to 64.0%.
- Real World: Achieved 96.8% success on 10+ tasks using only 3 demonstrations per task, solving the data efficiency problem.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand BridgeVLA, you need to grasp these concepts:
- Vision-Language Model (VLM): A large AI model (like GPT-4V or PaliGemma) trained on billions of image-text pairs. It can look at an image and answer questions about it.
- Vision-Language-Action (VLA) Model: A VLM adapted for robotics. Instead of outputting just text, it outputs robot actions (e.g., "move arm 10cm forward").
- 3D Point Cloud: A set of data points in space representing the external surface of an object. Unlike a 2D pixel (x, y, color), a point cloud point is (x, y, z, color).
- Orthographic Projection: A method of drawing 3D objects in 2D. Unlike perspective projection (how our eyes see, where parallel lines converge), orthographic projection keeps parallel lines parallel. It's like looking at a map (Top View) or an architectural blueprint (Front/Side View). This preserves true distances, which is crucial for robots measuring where to move.
- Heatmap vs. Token Prediction:
- Token Prediction: Treating coordinates as text (e.g., generating the digit tokens "1", "2", "8"). This breaks spatial continuity.
- Heatmap: An image where brightness represents probability. A bright spot at a pixel means "the robot should go here." This preserves spatial structure: nearby pixels are physically close.
- 6-DoF Pose: "Six Degrees of Freedom." To describe a robot gripper's pose, you need 3 numbers for position (x, y, z) and 3 for rotation (roll, pitch, yaw).
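To make the heatmap-versus-token contrast above concrete, here is a minimal, illustrative Python sketch (not from the paper; the image size, target pixel, and Gaussian width are arbitrary choices):

```python
import numpy as np

# Two ways to represent a target pixel (30, 40) in a 64x64 image.

# 1) Token-style: the coordinate becomes a string of digit tokens,
#    e.g., "3", "0", "4", "0". Nearby pixels such as (31, 40) share no structure with it.
token_target = list("3040")

# 2) Heatmap-style: a probability map whose mass sits near the target.
#    Nearby pixels get similar values, so spatial closeness is preserved.
H, W, sigma = 64, 64, 2.0
u, v = np.meshgrid(np.arange(W), np.arange(H))
heatmap = np.exp(-((u - 30) ** 2 + (v - 40) ** 2) / (2 * sigma ** 2))
heatmap /= heatmap.sum()  # normalize to a valid probability distribution

print(token_target, np.unravel_index(heatmap.argmax(), heatmap.shape))  # (row 40, col 30)
```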
3.2. Previous Works
- 2D VLA Models (e.g., RT-2 [9], OpenVLA [5]): These treat robot control as a language modeling problem. They take images and text as input and output actions as text tokens.
- Limitation: They discard depth information and require massive datasets to learn simple spatial movements.
- 3D Policies (e.g., Act3D [12], RVT [13]): These specialize in processing 3D point clouds, often using "voxel grids" (3D pixels) or multi-view projections.
- Limitation: They usually train from scratch or use simple visual encoders (like CLIP), missing out on the complex reasoning capabilities of generative VLMs.
- 3D VLAs (e.g., SpatialVLA [16]): Recent attempts to add 3D depth to VLAs.
- Limitation: They often inject 3D data as numerical encodings into the language model, which confuses the pre-trained weights that only know 2D images.
3.3. Technological Evolution
- Phase 1: Specialized 3D Policies. Models specifically designed for 3D geometry (PointNet, Voxels). Efficient but not smart.
- Phase 2: Generative 2D VLAs. Taking massive web-trained VLMs and fine-tuning them for robots. Smart but clumsy (needs lots of data).
- Phase 3 (This Paper): BridgeVLA. A hybrid approach. It uses the smart brain of a VLM but translates the 3D world into the VLM's native language (2D images and heatmaps) to get the best of both worlds: efficiency and intelligence.
3.4. Differentiation Analysis
- Vs. RVT-2: RVT-2 uses orthographic images but lacks a VLM backbone, limiting its understanding of complex language instructions (e.g., "pick the object that looks like a fruit"). BridgeVLA adds the VLM brain.
- Vs. SpatialVLA: SpatialVLA tries to force 3D position numbers into the VLM. BridgeVLA avoids this by projecting 3D to 2D images, keeping inputs aligned with what the VLM saw during its internet pre-training.
4. Methodology
4.1. Principles
The core design philosophy of BridgeVLA is Alignment. The authors argue that simply concatenating 3D data into a VLM breaks the model's pre-trained patterns. To fix this, they transform the robotics problem to look like a standard computer vision problem:
- Input: 3D Point Cloud → Multi-view 2D Images.
- Output: 3D Coordinate → 2D Heatmap (Probability Map).
The following figure (Figure 2 from the original paper) details this architecture, showing the pre-training (top) and fine-tuning (bottom) phases:
This image is a schematic of the BridgeVLA architecture, divided into two parts: 2D-heatmap pre-training and 3D action fine-tuning. The upper part uses 2D detection data to generate 2D heatmaps, while the lower part predicts actions from 3D point cloud data. The model achieves input alignment via orthographic projection and uses an MLP for output classification, covering rotation, gripper, and collision actions.
4.2. Core Methodology In-depth
4.2.1. Model Backbone (PaliGemma)
BridgeVLA is built on PaliGemma, a Vision-Language Model.
- Vision Encoder: SigLIP (processes images into feature tokens).
- Language Model: Gemma (processes text and image tokens to generate output).
- Crucial Modification: Standard VLMs output text. BridgeVLA modifies the output head to produce spatial heatmaps.
4.2.2. Phase 1: 2D-Heatmap Pre-training
Before the robot ever sees a task, the model is taught to "point" at objects in images using heatmaps. This bridges the gap between text generation and spatial action.
Step 1: Data Preparation They use the RoboPoint object detection dataset. For an image with a target object, a ground-truth heatmap is created.
Step 2: Generating Ground-Truth Heatmaps
Instead of a single pixel, each target is represented as a Gaussian distribution (a soft blob). For the $i$-th object, a probability map is defined at pixel $(u, v)$:

$$P_i(u, v) = \exp\!\left(-\frac{\|(u, v) - (u_i, v_i)\|^2}{2\sigma^2}\right)$$

- Symbol Explanation:
  - $(u, v)$: The coordinates of a pixel in the image.
  - $(u_i, v_i)$: The center pixel coordinates of the $i$-th target object's bounding box.
  - $\|\cdot\|^2$: Squared Euclidean distance (how far the pixel is from the center).
  - $\sigma$: Standard deviation, controlling the spread of the Gaussian blob (how large the "hot" spot is).

The map is truncated (set to 0) where probabilities fall below a threshold $\epsilon$. Finally, all object maps are averaged and normalized to create the final ground truth $H$ (the $1/N$ averaging factor cancels in the normalization):

$$H(u, v) = \frac{\sum_{i=1}^{N} P_i(u, v)}{\sum_{(u', v') \in \Omega} \sum_{i=1}^{N} P_i(u', v')}$$

- Symbol Explanation:
  - $N$: Number of objects of interest in the image.
  - $\Omega$: The set of all pixels in the image space.
  - The denominator ensures the sum of all pixel values in the heatmap equals 1, making it a valid probability distribution.
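As a sanity check of the recipe above, here is a short NumPy sketch that builds such a ground-truth heatmap; the values of $\sigma$ and the truncation threshold are arbitrary assumptions, not the paper's settings:

```python
import numpy as np

def gaussian_heatmap(centers, height, width, sigma=3.0, eps=1e-4):
    """Build a ground-truth heatmap from object bounding-box centers:
    a truncated Gaussian per object, averaged over objects and normalized to sum to 1.
    `centers` is a list of (u_i, v_i) pixel coordinates."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    maps = []
    for (u_i, v_i) in centers:
        p = np.exp(-(((u - u_i) ** 2 + (v - v_i) ** 2) / (2 * sigma ** 2)))
        p[p < eps] = 0.0          # truncate small probabilities
        maps.append(p)
    h = np.mean(maps, axis=0)      # average over the N objects
    return h / h.sum()             # normalize over all pixels

# Example: two target objects in a 224x224 image.
gt = gaussian_heatmap([(50, 60), (150, 100)], 224, 224)
assert abs(gt.sum() - 1.0) < 1e-6
```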
Step 3: Architecture for Heatmap Prediction The VLM processes the image and text. The output tokens (which usually become text) are:
- Rearranged: The output tokens corresponding to image patches are reshaped back into a 2D grid.
- Upsampled: A Convex Upsampling block (learnable interpolation) expands this small grid back to the original image resolution.
- Loss: The model is trained using a cross-entropy loss to match the predicted heatmap with the ground truth $H$.
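The following PyTorch sketch shows one plausible shape of this decoding path. It is an approximation under stated assumptions: the paper uses a learnable convex upsampling block, while here a 1×1 convolution plus bilinear interpolation stands in, and the 16×16 token grid and channel sizes are illustrative rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapHead(nn.Module):
    """Sketch: reshape image tokens into a 2D grid, upsample to image resolution,
    and emit per-pixel logits for a heatmap."""
    def __init__(self, token_dim=2048, grid=16, out_size=224):
        super().__init__()
        self.grid, self.out_size = grid, out_size
        self.proj = nn.Conv2d(token_dim, 64, kernel_size=1)   # reduce token channels
        self.head = nn.Conv2d(64, 1, kernel_size=3, padding=1)

    def forward(self, image_tokens):            # (B, grid*grid, token_dim)
        b, n, d = image_tokens.shape
        x = image_tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        x = F.relu(self.proj(x))
        # Stand-in for the paper's convex upsampling block.
        x = F.interpolate(x, size=(self.out_size, self.out_size),
                          mode="bilinear", align_corners=False)
        return self.head(x).flatten(1)           # (B, out_size*out_size) logits

def heatmap_loss(logits, gt_heatmap):
    """Cross-entropy between the predicted pixel distribution and the soft ground truth."""
    log_probs = F.log_softmax(logits, dim=1)
    return -(gt_heatmap.flatten(1) * log_probs).sum(dim=1).mean()
```

The soft cross-entropy treats the normalized ground-truth heatmap as a target distribution over pixels, matching the normalization used in pre-training.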
4.2.3. Phase 2: 3D Action Fine-tuning
Now the model learns robot actions. The goal is to learn a policy that maps the 3D observation and the language instruction to a robot end-effector action.
Step 1: Input Processing (3D to 2D)
- The raw observation is a 3D point cloud.
- The system renders this point cloud into 3 orthographic projection images: Top View, Front View, and Right View.
- These 3 images are fed into the VLM backbone (which is now pre-trained to understand spatial locations).
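A minimal NumPy sketch of the idea of orthographic rendering is shown below; the axis conventions, resolution, and naive per-point splatting (no occlusion handling) are simplifying assumptions, not the actual renderer used in the paper:

```python
import numpy as np

def orthographic_views(points, colors, bounds, img_size=224):
    """Render a colored point cloud into top / front / right orthographic images by
    dropping one axis per view and rasterizing the remaining two into a pixel grid.
    `points`: (N, 3) xyz in the workspace; `colors`: (N, 3) RGB in [0, 1];
    `bounds`: (min_xyz, max_xyz) of the workspace."""
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    norm = (points - lo) / (hi - lo)                            # normalize to [0, 1]
    views = {"top": (0, 1), "front": (0, 2), "right": (1, 2)}   # axes kept per view
    images = {}
    for name, (a, b) in views.items():
        img = np.zeros((img_size, img_size, 3))
        px = np.clip((norm[:, a] * (img_size - 1)).astype(int), 0, img_size - 1)
        py = np.clip((norm[:, b] * (img_size - 1)).astype(int), 0, img_size - 1)
        img[py, px] = colors                                    # naive splat, no z-buffer
        images[name] = img
    return images
```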
Step 2: Action Prediction (2D Heatmaps to 3D Action) The VLM outputs 3 heatmaps, one for each view.
- Translational Action (x, y, z):
  - The model needs to find a 3D point in the workspace to move to.
  - It considers a grid of 3D points in the robot's workspace.
  - For each 3D point, it projects it onto the 3 predicted heatmaps (Top, Front, Right) to get 3 probability scores.
  - These scores are multiplied (or averaged) to get a final score for that 3D point.
  - The 3D point with the highest score is selected as the target translation (see the sketch after this list).
- Rotational & Gripper Actions:
  - These are not spatial points, so heatmaps alone aren't enough.
  - Global Feature: Max-pooling over the output tokens of the images.
  - Local Feature: Extracting the token vector at the "peak" (hottest spot) of the heatmap.
  - Fusion: Concatenate Global + Local features → MLP (Multi-Layer Perceptron) → Predict Rotation (Euler angles), Gripper (Open/Close), and Collision flag.
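The sketch below (referenced in the translational-action item above) shows how a 3D target can be recovered from the three per-view heatmaps by scoring a grid of candidate points. The grid resolution, axis conventions, and the product rule for combining per-view scores are illustrative assumptions:

```python
import numpy as np

def best_translation(heatmaps, bounds, grid_res=50, img_size=224):
    """Score a grid of candidate 3D points by looking up each point's value in the
    top/front/right heatmaps at its projected pixel, then return the best point.
    `heatmaps`: dict of (img_size, img_size) arrays for "top", "front", "right"."""
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    axes = {"top": (0, 1), "front": (0, 2), "right": (1, 2)}   # same convention as rendering
    lin = [np.linspace(lo[i], hi[i], grid_res) for i in range(3)]
    xs, ys, zs = np.meshgrid(*lin, indexing="ij")
    cand = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)       # candidate 3D points
    norm = (cand - lo) / (hi - lo)
    score = np.ones(len(cand))
    for name, (a, b) in axes.items():
        px = np.clip((norm[:, a] * (img_size - 1)).astype(int), 0, img_size - 1)
        py = np.clip((norm[:, b] * (img_size - 1)).astype(int), 0, img_size - 1)
        score *= heatmaps[name][py, px]                          # multiply per-view scores
    return cand[score.argmax()]                                  # highest-scoring 3D point
```

Because each heatmap constrains two of the three coordinates, combining the three views pins down a single 3D point.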
Step 3: Coarse-to-Fine Refinement To achieve high precision, the model predicts in two passes:
- Predict a rough target using the full workspace.
- Crop the 3D point cloud around this rough target (Zoom in).
- Repeat the prediction on the zoomed-in view for high precision.
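A minimal sketch of the zoom-in step, assuming a cubic crop of arbitrary half-extent around the coarse prediction:

```python
import numpy as np

def crop_around(points, colors, center, half_extent=0.1):
    """Keep only points within a cube of side 2 * half_extent around the coarse target,
    so the cropped cloud can be re-rendered and re-predicted at finer resolution."""
    mask = np.all(np.abs(points - center) <= half_extent, axis=1)
    return points[mask], colors[mask]
```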
Step 4: Training Loss The total loss combines all action components:

$$\mathcal{L} = \mathcal{L}_{\text{heatmap}} + \mathcal{L}_{\text{rot}} + \mathcal{L}_{\text{gripper}} + \mathcal{L}_{\text{collision}}$$

- Symbol Explanation:
  - $\mathcal{L}_{\text{heatmap}}$: Cross-entropy loss between predicted heatmaps and ground-truth heatmaps (generated by projecting the expert's 3D target point onto the 2D views).
  - $\mathcal{L}_{\text{rot}}$: Cross-entropy loss for rotation (discretized into bins).
  - $\mathcal{L}_{\text{gripper}}$ / $\mathcal{L}_{\text{collision}}$: Binary cross-entropy losses (since these are Yes/No decisions).
5. Experimental Setup
5.1. Datasets
The authors use three simulation benchmarks and one real-world setup.
- RLBench:
- Type: Simulation (CoppeliaSim).
- Scale: 18 diverse tasks (e.g., "stack cups", "insert peg", "slide block").
- Data: 100 expert demonstrations per task.
- Why: A standard benchmark for measuring precise 3D manipulation.
- COLOSSEUM:
- Type: Simulation (Extension of RLBench).
- Focus: Robustness. It adds 12 types of perturbations not seen during training, such as changing lighting, table textures, or adding distractor objects.
- The following figure (Figure 6 from the original paper) visualizes these perturbations. It shows a robot performing multiple variants of the same tasks across different scenes, with objects varying in color, texture, and size; the tasks include placing a wine bottle, closing a laptop lid, and setting up a chessboard.
- GemBench:
- Type: Simulation.
- Focus: Generalization. It tests the robot on novel objects (Level 2), novel articulated objects (Level 3), and long-horizon tasks (Level 4) that were never seen during training.
- Real-Robot Setup:
- Hardware: Franka Research 3 arm + ZED 2i depth camera.
- Tasks: 13 tasks (e.g., "Put the soda can in the bottom shelf").
- Data Regime: Extremely low data—training with only 3 to 10 trajectories per task.
5.2. Evaluation Metrics
- Success Rate (SR):
- Definition: The percentage of evaluation episodes where the robot successfully completes the task.
- Formula: $\text{SR} = \frac{N_{\text{success}}}{N_{\text{total}}} \times 100\%$
- Explanation: $N_{\text{success}}$ is the count of successful trials, $N_{\text{total}}$ is the total number of trials (typically 25 per task in simulation).
5.3. Baselines
The paper compares against a comprehensive list of strong baselines:
- 2D Baselines: Image-BC (CNN/ViT), R3M, MVP. These check if 3D is necessary.
- 3D Policies:
- Act3D: Uses feature clouds.
- RVT / RVT-2: The direct predecessors using multi-view images but without VLM backbones.
- 3D Diffuser Actor: A strong diffusion-based 3D policy.
- VLA Models:
- SpatialVLA: A recent 3D VLA that injects 3D embeddings.
- π0: A flow-matching-based VLA trained on large-scale data.
6. Results & Analysis
6.1. Core Results Analysis (RLBench)
BridgeVLA achieves the highest performance on RLBench, proving its effectiveness in learning precise manipulation.
The following are the results from Table 1 of the original paper:
| Models | Avg. SR (%) | Avg. Rank | Close Jar | Drag Stick | Insert Peg | Meat off Grill | Open Drawer | Place Cups | Place Wine | Push Buttons | Put in Cupboard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Image-BC (CNN) | 11.3 | 11.72 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| Act3D | 65.0 | 4.89 | 92.0 | 92.0 | 27.0 | 94.0 | 93.0 | 3.0 | 80.0 | 99.0 | 51.0 |
| RVT-2 | 81.4 | 2.75 | 100.0 | 99.0 | 40.0 | 99.0 | 74.0 | 38.0 | 95.0 | 100.0 | 66.0 |
| BridgeVLA (Ours) | 88.2 | 2.03 | 100.0 | 100.0 | 88.0 | 100.0 | 100.0 | 58.4 | 88.0 | 98.4 | 73.6 |
(Note: Only a subset of tasks is shown for brevity, but the Average SR includes all 18 tasks.)
Analysis:
- BridgeVLA outperforms the previous best (RVT-2) by nearly 7 percentage points (81.4% → 88.2%).
- In precision-heavy tasks like Insert Peg, BridgeVLA more than doubles the success rate (40.0% → 88.0%). This suggests the VLM backbone + heatmap approach provides superior spatial reasoning.
6.2. Robustness Analysis (COLOSSEUM)
COLOSSEUM tests if the robot panics when the lights turn off or the table color changes.
The following are the results from Table 2 of the original paper (MO-COLOR, Light Color, and Distractor are visual-perturbation categories; only a subset of perturbation columns is shown):
| Models | Avg. SR (%) ↑ | Avg. Rank ↓ | All Perturbations | MO-COLOR | Light Color | Distractor | RLBench (Basic) |
|---|---|---|---|---|---|---|---|
| RVT-2 | 56.7 | 1.92 | 15.6 | 53.0 | 58.0 | 60.8 | 68.8 |
| BridgeVLA (Ours) | 64.0 | 1.07 | 18.7 | 60.5 | 69.7 | 51.8 | 73.1 |
Analysis:
- BridgeVLA is significantly more robust (64.0% vs 56.7%).
- It performs better in "Light Color" and "MO-COLOR", indicating the VLM backbone (PaliGemma) has strong visual generalization from its internet-scale pre-training.
6.3. Generalization Analysis (GemBench)
GemBench tests generalization to completely new objects.
The following are the results from Table 3 of the original paper:
| Method | Average | L1 (Placement) | L2 (Novel Objects) | L3 (Articulated) | L4 (Long-Horizon) |
|---|---|---|---|---|---|
| 3D Diffuser Actor | 44.0 | 91.9 | 43.4 | 37.0 | 0.0 |
| RVT-2 | 44.0 | 89.1 | 51.0 | 36.0 | 0.0 |
| 3D-LOTUS++ | 48.0 | 68.7 | 64.5 | 41.5 | 17.4 |
| BridgeVLA (Ours) | 50.0 | 91.1 | 65.0 | 43.8 | 0.0 |
Analysis:
- BridgeVLA wins in L2 and L3, showing it can handle new objects best.
- Limitation: Like most baselines, it fails L4 (Long-Horizon). This suggests the current VLA formulation is good for atomic actions but struggles with chaining many steps together.
6.4. Real-World Efficiency
In real-world tests (Table 4 & Figure 3), BridgeVLA shows its "killer feature":
- With 10 trajectories: 96.9% Success Rate.
- With 3 trajectories: 95.4% Success Rate.
- Baselines like π0 and SpatialVLA fail almost completely (0-5% success) with such low data, proving that BridgeVLA's architecture is uniquely data-efficient.
The following chart (Figure 3 from the original paper) summarizes the real-world performance gain over the baseline RVT-2:
This figure presents the real-robot experiments and results. A Franka Research 3 robot arm and a ZED 2i camera are used, and success rates are compared across settings, including the basic setting, distractors, and various generalization settings. The results show that BridgeVLA outperforms the baseline RVT-2 across tasks.
6.5. Ablation Studies
The authors verified their design choices:
- w/o Heatmap (Direct Regression): Performance drops massively (from 88.2% to 31.4%). Predicting coordinates (x, y, z) directly destroys the spatial benefits of the VLM.
- w/ Position Encoding: Adding explicit 3D position embeddings to the VLM input hurt performance (from 88.2% to 56.2%). This confirms the hypothesis: don't disturb the VLM's input distribution; keep the inputs as images.
- w/o Pre-training: Removing the heatmap pre-training phase caused the model to fail in generalization tasks (e.g., understanding "Pick the red block" if it only saw blue ones).
7. Conclusion & Reflections
7.1. Conclusion Summary
BridgeVLA successfully bridges the gap between the semantic intelligence of Vision-Language Models and the geometric efficiency of 3D robotic policies. By strictly aligning inputs (via orthographic projection) and outputs (via heatmaps) to the 2D image space, it allows the VLM to function naturally without catastrophic distribution shifts. The result is a model that is both smart (generalizes to new instructions/objects) and efficient (requires very few demonstrations).
7.2. Limitations & Future Work
- Top-Down Occlusion: The authors note performance drops in tasks like "Place Cups" because the gripper occludes the object in orthographic views. Future Work: Dynamic view selection.
- Long-Horizon Tasks: The model failed Level 4 in GemBench. Future Work: Integrating High-level planners (LLMs) to break down tasks.
- Pre-training Domain Gap: The pre-training data (RoboPoint) consists of ordinary third-person photos, while the fine-tuning inputs are orthographic projections of point clouds, creating a domain gap. Expanding pre-training diversity could help.
7.3. Personal Insights & Critique
- Simplicity Wins: The most striking insight is that removing explicit 3D inputs (like coordinate embeddings) and relying purely on projected images worked better. It teaches us a lesson in Transfer Learning: respecting the pre-trained model's original domain (images) is often better than forcing new modalities (3D coordinates) into it.
- The Power of Heatmaps: This paper reaffirms that for spatial tasks, spatial outputs (heatmaps) are superior to discrete tokens. This is a crucial takeaway for anyone building VLAs—don't treat coordinates as text.
- Sample Efficiency: The 3-shot learning capability is game-changing for real-world robotics, where collecting data is expensive. This suggests BridgeVLA could be a practical solution for factories or homes where valid data is scarce.