GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
TL;DR Summary
GraspVLA is a grasping foundation model pretrained on a billion frames of synthetic action data, addressing the field's reliance on costly real-world data. By creating the SynGrasp-1B dataset and integrating autoregressive perception with flow-matching action generation, it achieves efficient zero-shot generalization to real-world grasping.
Abstract
Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action models entirely with large-scale synthetic action data. We curate SynGrasp-1B, a billion-frame robotic grasping dataset generated in simulation with photorealistic rendering and extensive domain randomization. Building on this, we present GraspVLA, a VLA model pretrained on large-scale synthetic action data as a foundational model for grasping tasks. GraspVLA integrates autoregressive perception tasks and flow-matching-based action generation into a unified Chain-of-Thought process, enabling joint training on synthetic action data and Internet semantics data. This design helps mitigate sim-to-real gaps and facilitates the transfer of learned actions to a broader range of Internet-covered objects, achieving open-vocabulary generalization in grasping. Extensive evaluations across real-world and simulation benchmarks demonstrate GraspVLA's advanced zero-shot generalizability and few-shot adaptability to specific human preferences. We will release SynGrasp-1B dataset and pre-trained weights to benefit the community.
In-depth Reading
1. Bibliographic Information
1.1. Title
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
The title clearly states the paper's core contributions:
- GraspVLA: The name of the proposed model.
- Grasping Foundation Model: It positions the model as a foundational, general-purpose model specifically for the task of robotic grasping.
- Billion-scale Synthetic Action Data: It highlights the novel pre-training paradigm, which uses a massive amount of synthetic (simulated) data rather than real-world data.
1.2. Authors
Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, Heming Cui, Zhizheng Zhang, and He Wang.
The authors are affiliated with Peking University, the University of Hong Kong, and the company Galbot. The breadth of affiliations suggests a collaboration spanning academic research and industry. Several authors are part of the EPIC Lab (Extended Perception, Interaction and Control) at Peking University, which focuses on robotics, computer vision, and AI.
1.3. Journal/Conference
The paper was submitted to arXiv, a popular repository for pre-print articles in scientific fields. As of the time of this analysis, it is a pre-print, meaning it has not yet undergone formal peer review for publication in a conference or journal. However, arXiv is a standard platform for disseminating cutting-edge research quickly in fast-moving fields like AI and robotics.
1.4. Publication Year
The initial version was posted to arXiv in May 2025. The version analyzed here is v3, also dated May 2025.
1.5. Abstract
The abstract introduces the growing interest in embodied foundation models for their generalization capabilities. It points out a major bottleneck: the reliance on expensive and labor-intensive real-world data collection. To address this, the paper explores the feasibility of training a Vision-Language-Action (VLA) model entirely with large-scale synthetic data.
The authors make several key contributions:
- They create SynGrasp-1B, a one-billion-frame synthetic robotic grasping dataset with photorealistic rendering and extensive randomization.
- They propose GraspVLA, a VLA model pre-trained on this dataset.
- GraspVLA uses a Progressive Action Generation (PAG) mechanism, which integrates perception and action generation into a unified Chain-of-Thought (CoT) process. This allows the model to be jointly trained on their synthetic action data and general Internet vision-language data.
- This joint training strategy helps bridge the sim-to-real gap and enables open-vocabulary generalization, allowing the model to grasp objects not seen in the synthetic dataset.
- Extensive evaluations show GraspVLA has strong zero-shot generalization (performing tasks without any specific training) and few-shot adaptability (quickly learning from a few examples).
- The authors promise to release the dataset and model weights to the public.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2505.03233
- PDF Link: https://arxiv.org/pdf/2505.03233v3.pdf
- Publication Status: Pre-print on arXiv.
2. Executive Summary
2.1. Background & Motivation
The field of artificial intelligence has been revolutionized by foundation models—large-scale models like GPT-4 (for language) and CLIP (for vision) that are pre-trained on vast amounts of internet data. These models exhibit remarkable abilities to generalize to new, unseen tasks with little to no additional training. This success has inspired a new frontier in robotics: creating similar foundation models for physical interaction, known as Vision-Language-Action (VLA) models. A VLA model aims to understand a scene through cameras (Vision), interpret human commands (Language), and execute the corresponding physical task (Action).
However, a critical challenge separates robotics from language and vision: the lack of "internet-scale" action data. While text and images are abundant online, data of robots physically interacting with the world is scarce, expensive, and difficult to collect. Current efforts, like the Open X-Embodiment dataset, require immense coordination across many labs, robots, and human operators. This data bottleneck severely limits the scalability and progress of robotic foundation models.
This paper confronts this problem head-on by asking a fundamental question: Can we build a powerful and generalizable robot foundation model using only synthetic data? Synthetic data, generated in a simulator, is cheap, scalable, and offers perfect control over the environment and annotations. However, it has been historically plagued by the "sim-to-real" gap, where models trained in simulation fail to perform well in the real world due to differences in appearance and physics.
The paper's innovative entry point is to test this hypothesis on grasping, a cornerstone of robotic manipulation. They propose that by generating a dataset of unprecedented scale (one billion frames) with high-fidelity rendering and extensive randomization, and by designing a model architecture that can cleverly merge this synthetic action data with semantic knowledge from the internet, the sim-to-real gap can be overcome.
2.2. Main Contributions / Findings
The paper presents four main contributions that collectively demonstrate the viability of a synthetic-data-first approach for robotic learning:
- A New Pre-training Paradigm: The authors introduce a novel method for training VLAs that relies entirely on synthetic data for learning actions. This radically reduces the dependency on costly real-world robot data collection, potentially democratizing the development of robotic foundation models.
- The SynGrasp-1B Dataset: To enable this new paradigm, they curated a massive, billion-frame synthetic dataset for robotic grasping. SynGrasp-1B is the first of its scale and features over 10,000 unique objects, extensive randomization of scenes (lighting, textures, backgrounds), and photorealistic rendering. The sheer scale and diversity of this dataset are key to learning a generalizable grasping policy.
- Progressive Action Generation (PAG): They propose a novel mechanism, PAG, which structures action generation as a Chain-of-Thought process. The model first performs perception tasks (such as identifying and locating the target object) and then uses these intermediate "thoughts" to generate the final action. This design allows the model to be co-trained on two different types of data:
  - Synthetic Action Data (SynGrasp-1B): used to train the entire perception-to-action pipeline.
  - Internet Vision-Language Data (GRIT): used to train only the perception part of the pipeline.
  This synergy allows the model to learn "how to grasp" from simulation while learning "what objects are" from the internet, enabling it to grasp a wide variety of objects it has never seen before (open-vocabulary grasping).
- Demonstrated Foundation Model Capabilities: Through extensive real-world and simulation experiments, the authors show that GraspVLA achieves state-of-the-art zero-shot generalization. It can be deployed directly from simulation to the real world and successfully grasp objects across various challenging conditions without any real-world fine-tuning. Furthermore, it shows strong few-shot adaptability, quickly learning specialized grasping behaviors (such as avoiding touching the inside of a cup) from just a handful of new demonstrations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Foundation Models
A foundation model is a large-scale AI model trained on a massive, broad dataset that can be adapted to a wide range of downstream tasks. The key idea is pre-training and fine-tuning.
- Pre-training: The model learns general patterns, structures, and representations from a vast dataset (e.g., the entire web for a language model). This phase is computationally expensive but only needs to be done once.
- Fine-tuning (or Adaptation): The pre-trained model is then adapted for a specific task using a much smaller, task-specific dataset. Because the model has already learned general knowledge, it can adapt quickly and effectively, often with very few examples. This gives foundation models powerful zero-shot (performing a task with no examples) or few-shot (performing a task with a handful of examples) capabilities. Examples include OpenAI's GPT series (language) and Google's Vision Transformer (ViT) (vision).
3.1.2. Vision-Language-Action (VLA) Models
A Vision-Language-Action (VLA) model is the application of the foundation model concept to robotics. It's an end-to-end system that takes in:
- Vision: Raw pixel data from one or more cameras observing the robot's environment.
- Language: A natural language command from a human (e.g., "pick up the red apple").
- Action (output): A sequence of motor commands to control the robot (e.g., end-effector positions and gripper states).

The goal is to create a single, unified model that can perform a wide variety of physical tasks based on high-level instructions, much like a human.
3.1.3. Sim-to-Real Transfer and Domain Randomization
Training robot policies in simulation is fast, safe, and cheap. However, these policies often fail in the real world because of the sim-to-real gap—the mismatch between the simulator and reality in terms of:
- Visuals: Textures, lighting, shadows, and reflections can look different.
- Dynamics: Physics properties like friction, mass, and how objects deform or react to contact are difficult to model perfectly.
Domain Randomization is a key technique to bridge this gap. Instead of trying to make the simulation perfectly match one specific real-world scenario, it trains the model in a huge variety of simulated environments where parameters like lighting, object textures, camera positions, and physics properties are randomly changed in every training episode. The idea is that if the model learns to perform the task across this wide range of simulated domains, the real world will appear to it as just another variation it has already seen, making the policy more robust.
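To make this concrete, here is a minimal, hypothetical sketch of episode-level randomization in Python. The parameter names and value ranges are illustrative assumptions, not the paper's actual configuration.

```python
import random

# Placeholder asset lists; a real pipeline would reference thousands of assets.
TABLE_TEXTURES = ["wood_01", "marble_02", "brushed_metal_03"]
BACKGROUND_IMAGES = ["bg_000.png", "bg_001.png", "bg_002.png"]

def randomize_episode():
    """Sample a fresh visual/physical configuration for one simulated episode."""
    return {
        # Lighting: type and intensity drawn anew every episode
        "light_type": random.choice(["point", "directional", "dome"]),
        "light_intensity": random.uniform(300.0, 1500.0),
        # Appearance: table texture and background image
        "table_texture": random.choice(TABLE_TEXTURES),
        "background": random.choice(BACKGROUND_IMAGES),
        # Camera: jitter the nominal pose within a small radius (meters)
        "camera_offset": [random.uniform(-0.03, 0.03) for _ in range(3)],
        # Physics: per-episode friction and mass scaling
        "friction": random.uniform(0.4, 1.2),
        "mass_scale": random.uniform(0.8, 1.2),
    }
```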
3.1.4. Flow Matching
Flow Matching is a modern technique for training generative models. A generative model's goal is to learn how to create new data samples that look like they came from a training dataset. In this paper, the "data" to be generated is a sequence of robot actions.
At a high level, flow matching works by defining a "flow" or path from a simple, easy-to-sample distribution (like random noise) to the complex distribution of the real data (robot actions). The model, typically a neural network, is trained to predict the vector field that defines this flow. During inference, the model starts with random noise and follows the learned vector field step-by-step to generate a realistic action sequence. It is often considered a more stable and efficient alternative to other generative methods like Diffusion Models.
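As a worked illustration, the following PyTorch snippet sketches one training step of conditional flow matching with a linear (rectified-flow style) path. It shows the generic technique rather than the paper's exact formulation, and the `model(x_t, t, cond)` interface is an assumption.

```python
import torch

def flow_matching_step(model, actions, cond):
    """One conditional flow-matching training step with a linear noise-to-data path.

    actions: (B, T, D) ground-truth action chunks
    cond:    (B, C)    conditioning features (e.g., vision-language embeddings)
    model:   assumed callable model(x_t, t, cond) -> predicted velocity (B, T, D)
    """
    noise = torch.randn_like(actions)              # sample from the simple source distribution
    t = torch.rand(actions.shape[0], 1, 1)         # random timestep in [0, 1] per sample
    x_t = (1.0 - t) * noise + t * actions          # point along the straight path
    target_velocity = actions - noise              # constant velocity of the linear path
    pred_velocity = model(x_t, t, cond)            # network predicts the vector field
    return ((pred_velocity - target_velocity) ** 2).mean()
```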
3.1.5. Chain-of-Thought (CoT)
Chain-of-Thought (CoT) is a technique originally developed for Large Language Models (LLMs). When given a complex problem, instead of directly outputting the final answer, the model is prompted to first generate a series of intermediate reasoning steps. For example, to solve a math word problem, it would first write down the steps to solve it before giving the final number. This has been shown to dramatically improve performance on complex reasoning tasks. This paper adapts the CoT concept to action generation, where the "reasoning steps" are perception tasks like identifying and locating an object before generating the grasp motion.
3.2. Previous Works
The paper situates its work in the context of three main research areas:
- VLA Models: The paper cites several recent VLA models such as RT-2, OpenVLA, Octo, and π0. These models have shown impressive results by leveraging pre-trained vision-language backbones and training on large, real-world robotics datasets like Open X-Embodiment (OXE) and DROID. However, their primary reliance on real-world data makes them part of the paradigm this paper seeks to challenge. A concurrent work also targets zero-shot deployment but uses a different approach, leveraging multimodal web data and cross-embodiment real data, whereas GraspVLA focuses exclusively on synthetic action data.
- Synthetic Data in Robotics: Using synthetic data is not a new idea. Early works used simulation with domain randomization to train open-loop grasping models. More recent works like MimicGen use simulation to augment a small set of real-world human demonstrations, creating variations to improve robustness. Other methods use generative AI (such as text-to-image models) to create novel scenes. The key difference in GraspVLA is the scale and the "from-scratch" approach: it does not augment real demonstrations but instead generates a massive, self-contained dataset intended for pre-training a foundation model from the ground up.
- Grasping Methods: Grasping is a classic robotics problem. Traditional methods often use a modular pipeline: a grasp detection model (e.g., AnyGrasp) proposes stable grasp poses from sensor data (such as point clouds), and then a motion planner moves the robot to that pose. This is called open-loop because once the motion starts, it does not typically adjust based on new visual feedback. These systems can be fragile, especially with sensor noise (e.g., for transparent objects), and lack recovery behaviors. In contrast, GraspVLA is an end-to-end, closed-loop policy that directly maps pixels to actions at a high frequency, allowing it to continuously correct its movements based on what it sees.
3.3. Technological Evolution
The approach to robot learning has evolved significantly:
- Classical Robotics: Robots were programmed with explicit, hand-engineered rules and models of the world. This was brittle and did not scale to unstructured environments.
- Early Machine Learning: Reinforcement Learning (RL) and Imitation Learning (IL) were applied to learn policies from scratch, often in simulation or on a single task. Generalization was limited.
- Large-Scale Real-World Data: Inspired by foundation models in NLP/CV, the field shifted to collecting large, diverse, real-world datasets (OXE, DROID) to train generalist VLA models. This proved effective for generalization but hit the wall of data collection cost and scalability.
- Large-Scale Synthetic Data (This Paper's Contribution): GraspVLA represents the next logical step: attempting to bypass the real-world data bottleneck by leveraging massive-scale, high-quality synthetic data as the primary source for learning actions, positioning it as a potentially more scalable path forward for embodied AI.
3.4. Differentiation Analysis
Compared to prior work, GraspVLA's core innovations are:
- Data Paradigm Shift: It is the first work to demonstrate that a VLA model for a complex manipulation skill can be trained exclusively on synthetic action data and achieve strong zero-shot real-world performance. This is a departure from all major contemporary VLAs (RT-2, Octo, π0), which are heavily dependent on large-scale real-world robotic datasets.
- Unprecedented Scale: The SynGrasp-1B dataset, with its billion frames, is orders of magnitude larger than most existing robotics datasets (real or synthetic), testing the hypothesis that "scale is all you need" can apply to synthetic robot training.
- Novel Co-training Mechanism: The Progressive Action Generation (PAG) mechanism provides a principled way to fuse knowledge from two disparate sources: geometric/physical knowledge from synthetic action data and semantic/visual knowledge from internet-scale vision-language data. This is more structured than simply mixing datasets and hoping for the best: it explicitly teaches the model to ground language in vision (bounding box prediction) before attempting to act, mimicking a logical thought process.
4. Methodology
4.1. Principles
The core principle of GraspVLA is to effectively learn a generalizable grasping skill by combining the strengths of two data modalities:
- Synthetic Action Data: Provides dense, physically grounded information about how to interact with objects (geometry, stable grasp configurations, collision-free motions). This is learned from the SynGrasp-1B dataset.
- Internet Semantics Data: Provides broad knowledge about what objects are, their appearance, and how they are referred to in language. This is learned from a web-scale grounding dataset (GRIT).

To bridge these two worlds, the paper introduces Progressive Action Generation (PAG), which treats action generation as a Chain-of-Thought (CoT) process. Instead of directly mapping image-language inputs to an action, the model is forced to first generate intermediate representations related to perception. This breaks down the complex problem into a more manageable sequence: image and language input → 2D bounding box → grasp pose → action. This structured process allows for joint training where internet data can supervise the initial perception steps, and synthetic data can supervise the entire chain.
4.2. Core Methodology In-depth
4.2.1. The SynGrasp-1B Dataset Generation
The quality and scale of the synthetic dataset are paramount. The pipeline, shown in Figure 2, is designed for diversity, realism, and efficiency.
Figure 2 (schematic): the three main steps of the data generation pipeline: first generating object assets and layouts, then synthesizing grasp trajectories, and finally applying visual randomization and rendering. These steps ensure the diversity and realism of the grasping scenes.
- Object Assets and Layout Generation:
  - The process starts with over 10,000 3D object models from the Objaverse dataset, spanning 240 categories.
  - In each simulation episode, objects are randomly selected, scaled, and dropped onto a table to create cluttered, physically plausible scenes.
- Grasp Synthesis and Trajectory Generation:
  - An "expert" policy generates the successful grasping trajectories.
  - First, a grasp synthesis algorithm finds stable antipodal grasps (where two contact points, e.g., from fingers, apply opposing forces for a stable grip) on the target object.
  - Then, the CuRobo motion planner generates a smooth, collision-free trajectory for the robot arm to reach the grasp pose and lift the object.
  - Crucially, every trajectory is validated in the MuJoCo physics simulator to ensure it results in a successful lift.
- Visual Randomization and Rendering (see the sketch after this list):
  - To bridge the sim-to-real gap, extensive domain randomization is applied.
  - Trajectories are rendered into high-quality RGB images using NVIDIA's Isaac Sim, which supports photorealistic ray tracing.
  - Randomization includes:
    - Lighting: point, directional, and dome lights with random properties.
    - Materials: randomized textures for the table, walls, and robot.
    - Backgrounds: a diverse set of background images.
    - Camera Views: two camera viewpoints are used, and their positions are randomized within a small radius to simulate slight setup changes.
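The following Python skeleton summarizes how one such episode could be assembled. Every callable passed in (grasp synthesis, planning, physics validation, rendering) is a hypothetical placeholder standing in for the real components (e.g., CuRobo, MuJoCo, Isaac Sim), so this is a structural sketch rather than a working pipeline.

```python
import random

def generate_episode(objects, synthesize_grasp, plan_motion, simulate_lift,
                     randomize_visuals, render):
    """Sketch of one SynGrasp-1B-style episode; all callables are placeholders."""
    scene = random.sample(objects, k=min(len(objects), 4))   # cluttered table layout
    target = random.choice(scene)
    grasp_pose = synthesize_grasp(target)                    # stable antipodal grasp
    trajectory = plan_motion(grasp_pose, scene)              # collision-free reach and lift
    if not simulate_lift(trajectory, target):                # validate the lift in physics
        return None                                          # discard failed episodes
    visuals = randomize_visuals()                            # lighting/texture/camera draw
    frames = render(scene, trajectory, visuals)              # photorealistic RGB rendering
    return {"frames": frames, "actions": trajectory,
            "target": target, "grasp_pose": grasp_pose}
```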
4.2.2. GraspVLA Architecture
The GraspVLA model, illustrated in Figure 3, consists of two main parts connected by the PAG mechanism.
Figure 3 (schematic): the structure and function of the GraspVLA model. Through the progressive action generation mechanism, the model predicts the target object's 2D bounding boxes, grasp pose, and chunked actions from web data and synthetic data, combining a vision-language model with a flow-matching action expert.
- Vision-Language Model (VLM) Backbone: This part is responsible for perception and reasoning.
  - Vision Encoders: Two powerful, pre-trained, and frozen vision encoders, DINO-v2 and SigLIP. Freezing them leverages robust visual features learned from web-scale data without the high cost of training them; their features are fused.
  - Projector: A trainable MLP (multi-layer perceptron) that projects the fused vision features into the same space as the language model's embeddings.
  - Large Language Model (LLM): A trainable InternLM2 1.8B model. It processes the sequence of vision and language tokens to perform reasoning.
- Action Expert: This part is responsible for generating low-level robot control commands.
  - It is a conditional flow matching model. It takes the high-level reasoning output from the VLM (specifically, its internal key-value cache) as a condition and generates a continuous, multi-step chunk of end-effector actions. A component-level sketch follows this list.
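The sketch below illustrates the backbone's data flow (frozen dual encoders, trainable projector, trainable LLM). Module dimensions and wrapper names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VLMBackboneSketch(nn.Module):
    """Frozen dual vision encoders + trainable MLP projector feeding a trainable LLM.
    Feature sizes below are assumptions chosen only for illustration."""

    def __init__(self, dino, siglip, llm, d_vis=1024 + 1152, d_llm=2048):
        super().__init__()
        self.dino, self.siglip, self.llm = dino, siglip, llm
        for p in list(dino.parameters()) + list(siglip.parameters()):
            p.requires_grad = False                      # keep vision encoders frozen
        self.projector = nn.Sequential(                  # trainable projector
            nn.Linear(d_vis, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

    def vision_tokens(self, images):
        with torch.no_grad():
            fused = torch.cat([self.dino(images), self.siglip(images)], dim=-1)
        return self.projector(fused)                     # tokens in the LLM embedding space
```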
4.2.3. Progressive Action Generation (PAG)
PAG is the "glue" that holds the architecture together and enables joint training. It works as an autoregressive process within the LLM:
- Input Formatting: The model receives the instruction (e.g., "pick up the banana") and images from the two cameras.
- Step 1: Bounding Box Prediction (CoT Step 1): The LLM is prompted to first generate special tokens that represent the 2D bounding box of the target object in the camera images. This step is supervised using both the SynGrasp-1B dataset (which has perfect annotations) and the GRIT internet grounding dataset (which has human-annotated bounding boxes). This forces the model to learn to "ground" language to visual objects using a massive and diverse set of examples from the web.
- Step 2: Grasp Pose Prediction (CoT Step 2): After generating the bounding box, the model is prompted to generate the target 3D grasp pose for the end-effector. This step is supervised only with SynGrasp-1B data, as internet images lack this information. Before this step, recent proprioceptive data (the robot's own joint states) is injected as tokens to give the model awareness of its current physical state for more accurate 3D reasoning.
- Step 3: Action Generation: The flow matching action expert is conditioned on the full history of tokens generated by the LLM, including the input instruction and the intermediate "thoughts" (bounding box and grasp pose). It then generates a short sequence (a "chunk") of end-effector delta actions (changes in position and orientation). A minimal sketch of this inference order follows this list.
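The sketch below makes the three-step inference order explicit. The `vlm` and `action_expert` methods are hypothetical wrappers invented for illustration, not a real API.

```python
def progressive_action_generation(vlm, action_expert, images, instruction,
                                  proprioception, flow_steps=10):
    """Sketch of PAG inference: bounding box -> grasp pose -> action chunk."""
    # Step 1: decode 2D bounding-box tokens for the referred object
    ctx = vlm.encode(images, instruction)
    bbox_tokens = vlm.generate(ctx, prompt="<bbox>")
    # Step 2: inject recent proprioception, then decode 3D grasp-pose tokens
    ctx = vlm.extend(ctx, bbox_tokens, proprioception)
    grasp_tokens = vlm.generate(ctx, prompt="<grasp>")
    # Step 3: the flow-matching expert denoises an action chunk conditioned on
    # the full token history (in practice via the VLM's key-value cache)
    ctx = vlm.extend(ctx, grasp_tokens)
    return action_expert.sample(ctx.kv_cache, steps=flow_steps)  # delta EE actions
```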
4.2.4. Joint Training and Loss Functions
The model is trained with a single objective that combines losses for the VLM and the action expert. For a given batch of data, which is a mix of GRIT and SynGrasp-1B:
- VLM Loss ($\mathcal{L}_{\mathrm{VLM}}$): This is a standard language-modeling cross-entropy loss for predicting the next token in a sequence structured according to PAG. The formula is:
  $ \mathcal{L}_{\mathrm{VLM}} = - \sum_{n=1}^{N_{\mathrm{bbox}}} \log P_{\theta}(\mathbf{y}_{\mathrm{bbox},n} \mid \mathbf{x}, \mathbf{y}_{\mathrm{bbox},<n}) - \mathbf{1}_{\mathrm{synthetic}} \cdot \sum_{n=1}^{N_{\mathrm{grasp}}} \log P_{\theta}(\mathbf{y}_{\mathrm{grasp},n} \mid \mathbf{x}, \mathbf{y}_{\mathrm{bbox}}, \mathbf{y}_{\mathrm{grasp},<n}) $
  - Explanation:
    - The first term is the loss for predicting the sequence of bounding box tokens $\mathbf{y}_{\mathrm{bbox}}$. It is applied to all data in the batch (both synthetic and internet).
    - The second term is the loss for predicting the sequence of grasp pose tokens $\mathbf{y}_{\mathrm{grasp}}$. It is multiplied by an indicator function $\mathbf{1}_{\mathrm{synthetic}}$, which is 1 if the data sample is from SynGrasp-1B and 0 otherwise. This ensures the grasp pose is only trained on synthetic data.
    - $\mathbf{x}$ represents the input images and text.
    - $N_{\mathrm{bbox}}$ and $N_{\mathrm{grasp}}$ are the lengths of the token sequences for the bounding box and grasp pose, respectively.
    - $\theta$ represents the trainable parameters of the VLM.
- Action Expert Loss ($\mathcal{L}_{\mathrm{action}}$): This is the flow matching loss, supervised only on the SynGrasp-1B data. The formula is:
  $ \mathcal{L}_{\mathrm{action}} = \left\| v_{t}(\mathbf{A}_{t}, \mathbf{x}, \mathbf{y}_{\mathrm{bbox}}, \mathbf{y}_{\mathrm{grasp}}) - u_{t}(\mathbf{A}_{t} \mid \mathbf{A}_{0}) \right\|^{2} $
  - Explanation:
    - This is a mean squared error loss. It minimizes the difference between the predicted vector field $v_t$ and the ground-truth vector field $u_t$.
    - $t$ is a random timestep along the flow.
    - $\mathbf{A}_0$ is the ground-truth action chunk from the expert trajectory.
    - $\mathbf{A}_t$ is a "noised" version of the action, obtained by moving along the path towards the noise distribution.
    - The model predicts the direction of the flow at point $\mathbf{A}_t$, conditioned on the visual input $\mathbf{x}$ and the VLM's generated reasoning tokens $\mathbf{y}_{\mathrm{bbox}}$ and $\mathbf{y}_{\mathrm{grasp}}$.
- The final loss is the sum of $\mathcal{L}_{\mathrm{VLM}}$ and $\mathcal{L}_{\mathrm{action}}$. A code sketch of this combined objective appears below.
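The sketch below shows how the two losses could be combined in a single mixed batch, with the indicator realized as a boolean mask over samples. Tensor shapes and argument names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def joint_loss(bbox_logits, bbox_targets, grasp_logits, grasp_targets,
               pred_velocity, target_velocity, is_synthetic):
    """Combined objective: bbox tokens on all samples; grasp tokens and
    flow-matching velocity only on synthetic (SynGrasp-1B) samples.

    bbox_logits:  (B, Nb, V)   bbox_targets:  (B, Nb)
    grasp_logits: (B, Ng, V)   grasp_targets: (B, Ng)
    pred_velocity, target_velocity: (B, T, D)
    is_synthetic: (B,) boolean mask
    """
    l_bbox = F.cross_entropy(bbox_logits.flatten(0, 1), bbox_targets.flatten(),
                             reduction="none").view(bbox_targets.shape).mean(dim=1)
    l_grasp = F.cross_entropy(grasp_logits.flatten(0, 1), grasp_targets.flatten(),
                              reduction="none").view(grasp_targets.shape).mean(dim=1)
    l_action = ((pred_velocity - target_velocity) ** 2).mean(dim=(1, 2))
    mask = is_synthetic.float()                      # 1 for SynGrasp-1B, 0 for GRIT
    return (l_bbox + mask * (l_grasp + l_action)).mean()
```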
5. Experimental Setup
5.1. Datasets
- SynGrasp-1B (Training): The novel, billion-frame synthetic dataset created by the authors. It contains 10 million grasping trajectories across 10,680 objects and 240 categories, with extensive visual and physical randomization. It provides annotations for actions, camera parameters, bounding boxes, and 3D object/gripper poses.
- GRIT (Training): A large-scale internet dataset of grounded image-text pairs. It contains images with corresponding text descriptions and bounding boxes that link phrases in the text to regions in the image. This dataset is used to teach GraspVLA open-vocabulary object recognition.
- Real-World Test Objects: The authors created physical test sets divided into two groups to evaluate generalization:
  - Synthetic Categories: Objects from categories that are present in the SynGrasp-1B dataset (e.g., toy animals, common fruits).
  - Web Categories: Objects from categories that are not in SynGrasp-1B but are common on the internet and thus present in GRIT (e.g., chargers, towels, swimming goggles). This directly tests the open-vocabulary generalization enabled by PAG. Examples of the objects used are shown in Figure 4.
  Figure 4 (schematic): the robot setup used in the experiments (a), the synthetic object categories (b), the web object categories (c), and the five test sets (d) used to evaluate the model under different conditions.
- LIBERO Benchmark (Simulation Evaluation): A standard benchmark for lifelong robot learning, featuring a suite of manipulation tasks. The authors use it to evaluate GraspVLA's zero-shot performance in a controlled simulation environment different from the one used for training.
5.2. Evaluation Metrics
The paper uses two primary metrics to evaluate performance:
-
Success Rate:
- Conceptual Definition: This metric measures the raw task completion ability of the model. It is the percentage of trials in which the robot successfully grasps the specified object and lifts it by a certain height (15 cm in the real-world tests). The model is allowed up to 3 grasp attempts per trial.
- Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
- Symbol Explanation: A trial is successful if the task objective is met within the allowed attempts.
-
Success weighted by Path Length (SPL):
- Conceptual Definition: SPL is a metric borrowed from navigation tasks that evaluates not only success but also the efficiency of the agent's path. A high SPL score means the agent is both successful and takes short, direct paths. A low score can indicate either failure or an inefficient, hesitant policy that takes a long time to succeed.
- Mathematical Formula: The paper provides the formula as: $ \text{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{l_i}{\max(p_i, l_i)} $
- Symbol Explanation:
- $N$: The total number of evaluation trials.
- $S_i$: A binary indicator for the $i$-th trial, where $S_i = 1$ if the trial was a success and $S_i = 0$ if it was a failure.
- $l_i$: The length of the shortest path to success for trial $i$, defined as the minimum number of action steps taken by any successful method on that specific trial.
- $p_i$: The path length (number of action steps) taken by the policy being evaluated in trial $i$. The ratio $\frac{l_i}{\max(p_i, l_i)}$ penalizes paths that are longer than the best observed one. A small helper computing this metric is sketched below.
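For reference, a helper that computes this metric from per-trial records might look like the following sketch; the list inputs and example numbers are illustrative.

```python
def success_weighted_path_length(successes, path_lengths, shortest_lengths):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i).

    successes:        per-trial outcomes S_i (0 or 1)
    path_lengths:     steps taken by the evaluated policy, p_i
    shortest_lengths: shortest successful path observed for that trial, l_i
    """
    n = len(successes)
    return sum(
        s * (l / max(p, l))
        for s, p, l in zip(successes, path_lengths, shortest_lengths)
    ) / n

# Example: three trials, one failure -> (10/12 + 0 + 20/20) / 3 ≈ 0.61
print(success_weighted_path_length([1, 0, 1], [12, 30, 20], [10, 15, 20]))
```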
5.3. Baselines
GraspVLA is compared against a comprehensive set of baselines:
- VLA Generalists: State-of-the-art, large-scale transformer-based policies pre-trained on real-world data. To ensure a fair comparison, the authors fine-tune them on the SynGrasp-1B dataset.
  - π0: A flow-matching-based VLA from Physical Intelligence.
  - OpenVLA: An open-source VLA from Stanford.
  - Octo: A generalist robot policy from UC Berkeley.
- Imitation Learning Specialist:
  - Diffusion Policy: A strong baseline for single-task imitation learning that uses a diffusion model to generate actions. Since it does not support language, it is trained and tested on grasping a single object category (elephant).
- Traditional Grasping System:
  - AnyGrasp: A state-of-the-art open-loop grasp detection model. It is combined with Grounding DINO (an open-vocabulary object detector) to handle language instructions. This represents a strong, modular, non-end-to-end baseline.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Zero-Shot Real-World Generalization (vs. VLAs)
This experiment tests the core hypothesis: can a model pre-trained on synthetic data generalize to the real world without any real-world training? The results in Table 1 are compelling.
The following are the results from Table 1 of the original paper:
| Method | Synthetic Categories | | | | | | Web Categories | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | basic↑ | light↑ | b.g.↑ | dis.↑ | height↑ | SPL↑ | basic↑ | light↑ | b.g.↑ | dis.↑ | height↑ | SPL↑ |
| Diffusion Policy [75] | 30.0 | 16.6 | 16.6 | 13.3 | 13.3 | 12.3 | - | - | - | - | - | - |
| Octo [26] | 16.6 | 3.3 | 0.0 | 0.0 | 3.3 | 3.2 | 0.0 | 3.3 | 0.0 | 0.0 | 0.0 | 0.4 |
| OpenVLA [6] | 20.0 | 13.3 | 16.6 | 0.0 | 13.3 | 8.8 | 3.3 | 6.6 | 13.3 | 0.0 | 6.6 | 4.1 |
| π0 (w/ π0 pre-train) [7] | 66.6 | 63.3 | 60.0 | 60.0 | 56.6 | 42.3 | 33.3 | 36.6 | 30.0 | 26.6 | 26.6 | 17.8 |
| π0 (w/o π0 pre-train) [7] | 80.0 | 76.6 | 80.0 | 86.6 | 76.6 | 51.8 | 40.0 | 40.0 | 36.6 | 36.6 | 33.3 | 36.9 |
| Ours | 93.3 | 96.6 | 93.3 | 93.3 | 90.0 | 87.2 | 93.3 | 90.0 | 93.3 | 86.6 | 86.6 | 84.7 |
- Massive Performance Gap: GraspVLA achieves success rates around 90% across all test conditions, dramatically outperforming all other VLA baselines, which struggle to exceed 80% and often perform much worse, especially on web categories.
- Effective Open-Vocabulary Grasping: The most striking result is GraspVLA's high performance on Web Categories (objects it has never seen in action data). Its roughly 90% success rate is comparable to its performance on Synthetic Categories, showing that the PAG mechanism successfully transfers semantic knowledge from internet data to the grasping skill learned in simulation. The baselines struggle significantly here, indicating they fail to generalize outside their training distribution.
- Robustness to Variations: GraspVLA maintains its high performance across changes in lighting (light), background (b.g.), presence of distractors (dis.), and table height (height), demonstrating the robustness learned from extensive domain randomization.
- High Efficiency (SPL): GraspVLA achieves very high SPL scores (87.2 and 84.7), indicating it performs grasps efficiently and without hesitation. The baselines have much lower SPL scores, suggesting their policies are less direct, even when successful.
6.1.2. Zero-Shot Generalization in LIBERO Benchmark
The following are the results from Table 2 of the original paper:
| Method | Long | Goal | Object |
|---|---|---|---|
| OpenVLA (fine-tuned) | 33.7 | 56.6 | 65.4 |
| π0 (fine-tuned) | 62.7 | 79.4 | 93.8 |
| Ours (zero-shot) | 82.0 | 91.2 | 94.1 |
This result is remarkable. GraspVLA, deployed zero-shot (no fine-tuning on LIBERO data), surpasses OpenVLA and π0 even though both were explicitly fine-tuned on the LIBERO dataset. This strongly suggests that pre-training on a massive, diverse synthetic dataset (SynGrasp-1B) leads to a more general and robust model than fine-tuning on a smaller, in-domain real-world dataset.
6.1.3. Comparison with Specialized Grasping System (AnyGrasp)
The following are the results from Table 3 of the original paper:
| Method | Language-Conditioned | | Arbitrary Grasping | | Speed |
|---|---|---|---|---|---|
| | overall | grasp | common | transparent | |
| AnyGrasp | 91.6 | 96.6 | 100.0 | 10.0 | 37 Hz |
| Ours | 93.3 | 93.3 | 93.3 | 86.6 | 5 Hz |
- Strength on Challenging Materials: While AnyGrasp is perfect on common, opaque objects, it fails catastrophically on transparent objects (10% success). This is a known weakness of depth-sensor-based methods, as depth cameras struggle with transparent and reflective surfaces. GraspVLA, being RGB-only, is immune to this issue and maintains high performance (86.6%).
- End-to-End Advantage: In the language-conditioned task, GraspVLA slightly outperforms the AnyGrasp + Grounding DINO pipeline, likely because its end-to-end, multi-view design provides better object grounding and closed-loop control than a modular pipeline prone to cascading errors.
- Speed Trade-off: The primary drawback of GraspVLA is its inference speed. At 5 Hz, it is significantly slower than AnyGrasp's 37 Hz, a consequence of its large VLM backbone. This makes it more suitable for static scenes than highly dynamic ones.
6.2. Data Presentation (Tables)
6.2.1. Scaling Law
The chart in Figure 5 demonstrates a clear scaling law.
Figure 5 (chart): success rate versus number of training frames. As the number of training frames increases, success rates on both synthetic categories (orange line) and web categories (blue line) improve, approaching 90% at 1B frames.
As the number of training frames from SynGrasp-1B increases from 12M to 1B, the success rate on both synthetic and web categories steadily improves. The performance on web categories, in particular, continues to climb, suggesting that more data is crucial for generalizing to novel objects and that the model may not have saturated even at 1 billion frames. This validates the paper's core premise that massive scale is a key ingredient for success.
6.2.2. Efficient Post-Training
Table 4 shows GraspVLA's ability to adapt to new, specialized tasks with very few demonstrations (few-shot learning), a hallmark of a true foundation model.
The following are the results from Table 4 of the original paper:
| Method | Training Data | | Task 1 | | Task 2 | | Task 3 | |
|---|---|---|---|---|---|---|---|---|
| | BBox | traj. | overall | grasp | overall | grasp | overall | grasp |
| OpenVLA | - | - | 0 | 0 | 0 | 20 | 0 | 0 |
| π0 | - | - | 10 | 20 | 0 | 30 | 0 | 0 |
| Ours | - | - | 40 | 90 | 0 | 80 | 0 | 20 |
| DP | ✓ | - | - | 20 | 60 | 10 | 30 | |
| OpenVLA | ✓ | 0 | 0 | 20 | 30 | 0 | 20 | |
| π0 | ✓ | 60 | 80 | 60 | 70 | 50 | 60 | |
| Ours | ✓ | - | 90 | 100 | - | - | - | - |
| Ours(scratch) | ✓ | ✓ | 10 | 30 | 10 | 30 | 0 | 20 |
| Ours | ✓ | 90 | 100 | 80 | 90 | 90 | 90 |
- GraspVLA (pre-trained) consistently and significantly outperforms all baselines (OpenVLA, π0) and a model trained from scratch (Ours (scratch)) when fine-tuned on the three specialized tasks.
- Task 1 (New vocabulary): GraspVLA achieves 90% success. Notably, it reaches this performance even when trained with only bounding box annotations (BBox only), without full trajectory data (traj.). This drastically reduces the annotation burden for adapting to new objects.
- Task 2 (New grasp specification): Grasping a mug without touching the inside. GraspVLA achieves 80% success, demonstrating it can learn fine-grained constraints.
- Task 3 (Sequential grasping): GraspVLA learns to sequentially grasp bottles in a cluttered scene with 90% success.

These results confirm that the general-purpose knowledge learned during pre-training provides a powerful starting point for rapid specialization.
6.3. Ablation Studies / Parameter Analysis
Table 5 analyzes the contribution of each component of the Progressive Action Generation (PAG) mechanism.
The following are the results from Table 5 of the original paper:
| Method | Synthetic | | Web | |
|---|---|---|---|---|
| | SR | SPL | SR | SPL |
| vanilla | 66.6 | 39.3 | 53.3 | 27.7 |
| + PAG-2D | 80.0 | 59.2 | 76.7 | 48.9 |
| + PAG-3D | 93.3 | 90.2 | 93.3 | 91.7 |
- vanilla: A baseline that co-trains on both datasets but without the explicit CoT structure of PAG. Its performance is mediocre, especially on web categories (53.3% success rate).
- + PAG-2D: Adding the 2D bounding box prediction as an intermediate step significantly boosts performance, especially on web categories (from 53.3% to 76.7%). This confirms that forcing the model to first locate the object helps bridge the semantic gap.
- + PAG-3D (full model): Adding the final step of 3D grasp pose prediction provides another major boost, bringing success rates above 90% for both categories. More importantly, the SPL score nearly doubles, jumping from roughly 50 to roughly 90. This indicates that explicitly predicting the 3D grasp pose before generating motor commands is crucial for eliminating hesitation and producing efficient, direct motions.

This ablation study clearly demonstrates that the structured reasoning imposed by PAG is essential to the model's success.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents GraspVLA, a grasping foundation model that successfully challenges the dominant paradigm of relying on expensive real-world data. By pre-training on SynGrasp-1B, a novel billion-scale synthetic dataset, GraspVLA achieves state-of-the-art zero-shot performance when transferred directly to the real world. The key to its success is the Progressive Action Generation (PAG) mechanism, a Chain-of-Thought process that enables joint training on synthetic action data and internet-scale semantic data. This synergy allows the model to learn the physics of grasping from simulation and the semantics of open-vocabulary objects from the web. Extensive experiments validate that GraspVLA not only generalizes robustly to unseen objects and environments but also adapts efficiently to specialized tasks with few-shot fine-tuning, establishing it as a powerful and scalable foundation model for grasping.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations and areas for future research:
- Hardware Specificity: The current model is trained and evaluated only on the Franka Panda arm with a specific dual-camera setup. While adaptable, extending it to new robots and camera configurations requires generating new synthetic data.
- Ambiguous Instructions: The model struggles with instructions that require higher-level semantic reasoning, such as "pick up the leftmost object" or "pick up food" (when multiple food items are present). This may require scaling the VLM further or architectural improvements.
- Deformable Objects: The grasp synthesis in the dataset is based on force-closure for rigid bodies, a common limitation. The model cannot reason about how to grasp deformable objects like cloth or soft bags, though it may succeed by chance. Integrating soft-body simulation is a potential future direction.
- Task Specificity: The work is currently focused only on grasping. The authors plan to extend the data generation pipeline and model to other manipulation tasks like pick-and-place, pushing, and more complex non-prehensile actions.
- Inference Latency: The autoregressive nature of PAG introduces latency (~200 ms), which is adequate for static scenes but may be too slow for dynamic environments. The authors suggest exploring model distillation and quantization to improve speed.
7.3. Personal Insights & Critique
This paper marks a significant and exciting development in the field of robotic learning.
- A Path to Scalability: The most impactful contribution is demonstrating a viable, cost-effective alternative to the real-world data bottleneck. If pre-training on massive synthetic datasets can consistently produce high-performing, generalizable policies, it could dramatically accelerate progress in robotics by making the development of foundation models accessible to a much wider research community.
- The Power of Structured Reasoning: The PAG mechanism is an elegant and effective solution for fusing heterogeneous data sources. The idea of structuring policy learning as a Chain-of-Thought process, forcing the model to "perceive, then reason, then act", is intuitive and powerful. This design pattern could be highly influential and applicable to other complex robotics tasks beyond grasping.
- Critique and Open Questions:
  - Is Grasping a "Sweet Spot"? Grasping is a fundamental but relatively constrained task. The physics are well understood, and success can often be determined geometrically. The true test of this "synthetic-first" paradigm will be its ability to scale to more contact-rich, dynamic, and long-horizon tasks like in-hand manipulation, tool use, or assembly, where subtle physical dynamics are more critical and harder to simulate accurately.
  - Expert Data Generation Bottleneck: The quality of the learned policy is fundamentally limited by the quality of the expert policy used to generate the synthetic data. The current pipeline uses a modular system of grasp synthesis and motion planning, which may not produce the full diversity of behaviors a human might use, especially for complex objects. Future work might need to explore more advanced data generation methods, perhaps using reinforcement learning or generative models to create more varied expert trajectories.
  - The Role of the LLM: While the LLM is used to structure the reasoning process, its full capabilities for commonsense reasoning about physics and tasks seem underutilized. The model still fails on some semantic ambiguities. Future architectures might integrate the LLM's reasoning abilities more deeply into the action generation process itself.

Overall, GraspVLA is a landmark paper that provides a strong proof-of-concept for a new, highly scalable direction in robot learning. It successfully combines the scale of modern foundation models with the practicality of simulation, paving the way for more capable and accessible embodied intelligence.