Papers
Tag Filter: Vision-Language-Action Model
REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation
Published:12/23/2025
Vision-Language-Action Model · Robotic Manipulation Benchmark · Robotic Generalization Evaluation · High-Fidelity Simulation Environment · Task Perturbation Factors
REALM introduces a high-fidelity simulation environment for evaluating the generalization of Vision-Language-Action models, featuring 15 perturbation factors and over 3,500 objects. Validation shows strong alignment between simulated and real-world performance.
Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting
Published:9/26/2025
Fine-Tuning of Visual Language Models · Vision-Language-Action Model · Catastrophic Forgetting Prevention · Low-Rank Adaptation Method · Robot Teleoperation Data
The paper presents VLM2VLA, a method for fine-tuning vision-language models (VLMs) into vision-language-action models (VLAs) without catastrophic forgetting by representing low-level robot actions in natural language, achieving zero-shot generalization in real experiments.
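As a side note to the VLM2VLA entry, the sketch below illustrates the general idea of expressing a low-level robot action as natural-language text and parsing it back; the phrasing, units, and 7-DoF action space are assumptions for illustration, not the paper's actual format.

# Hypothetical sketch: render a 7-DoF delta action as text and parse it back.
# Phrasing and units are illustrative assumptions, not VLM2VLA's actual format.
import re

def action_to_text(dx, dy, dz, droll, dpitch, dyaw, gripper):
    """Describe a low-level end-effector action in natural language."""
    grip = "close" if gripper > 0.5 else "open"
    return (f"move x {dx:+.3f} m, y {dy:+.3f} m, z {dz:+.3f} m; "
            f"rotate roll {droll:+.3f}, pitch {dpitch:+.3f}, yaw {dyaw:+.3f} rad; "
            f"gripper {grip}")

def text_to_action(text):
    """Parse the description back into a numeric action vector."""
    values = [float(v) for v in re.findall(r"[-+]\d+\.\d+", text)]
    values.append(1.0 if text.strip().endswith("close") else 0.0)
    return values

print(text_to_action(action_to_text(0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0)))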
ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
Published:3/16/2025
Vision-Language-Action Model · Robotic Video Synthesis · Real-to-Sim-to-Real Approach · Robot Dataset Scaling · Robotic Manipulation Tasks
ReBot enhances robot learning with a real-to-sim-to-real video synthesis method that addresses data-scaling challenges. It replays real robot movements in simulators and combines them with inpainted real-world backgrounds, significantly improving VLA model performance.
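To make the ReBot summary above concrete, here is a minimal sketch of the compositing step it describes: pasting a simulator-rendered robot onto an inpainted real-world background using the render's segmentation mask. Array names and the mask source are hypothetical; the paper's full pipeline involves more than this blend.

# Illustrative compositing sketch (not ReBot's full pipeline): overlay the
# simulator-rendered robot on an inpainted real background via its mask.
import numpy as np

def composite(sim_rgb, sim_mask, real_background):
    """sim_rgb: (H, W, 3) rendered robot frame; sim_mask: (H, W) boolean robot mask;
    real_background: (H, W, 3) real frame with the original robot inpainted out."""
    alpha = sim_mask[..., None].astype(sim_rgb.dtype)
    return alpha * sim_rgb + (1.0 - alpha) * real_background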
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Published:8/27/2025
Vision-Language-Action Model · Robotic Manipulation · Long-Term Memory and Anticipatory Action · Memory-Conditioned Diffusion Models · Short-Term Memory and Cognition Fusion
MemoryVLA is a memory-centric Vision-Language-Action framework for non-Markovian robotic manipulation that integrates working memory and episodic memory. It significantly enhances performance across 150 tasks, achieving up to a 26% increase in success rate in simulation and real-world evaluations.
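The following sketch illustrates the memory-conditioned idea in the MemoryVLA summary: keep a buffer of past perceptual embeddings and let the current observation cross-attend to it before predicting an action. The buffer size, attention layout, and linear action head are assumptions; the paper's perceptual-cognitive memory bank and diffusion action expert are considerably richer.

# Illustrative sketch only: a ring buffer of past embeddings retrieved via
# cross-attention before action prediction.
import torch
import torch.nn as nn

class MemoryConditionedPolicy(nn.Module):
    def __init__(self, dim=256, mem_size=64, action_dim=7):
        super().__init__()
        self.register_buffer("memory", torch.zeros(mem_size, dim))
        self.mem_size = mem_size
        self.ptr = 0
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, action_dim)

    @torch.no_grad()
    def write(self, embedding):
        """Store the current step's embedding in the ring buffer."""
        self.memory[self.ptr % self.mem_size] = embedding
        self.ptr += 1

    def forward(self, embedding):
        """Cross-attend from the current embedding to memory, then predict an action."""
        q = embedding.view(1, 1, -1)        # (batch, 1, dim)
        kv = self.memory.unsqueeze(0)       # (batch, mem_size, dim)
        fused, _ = self.attn(q, kv, kv)
        return self.head(fused.squeeze(1) + embedding.view(1, -1)).squeeze(0)

policy = MemoryConditionedPolicy()
obs = torch.randn(256)
action = policy(obs)
policy.write(obs)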
Real-Time Execution of Action Chunking Flow Policies
Published:6/9/2025
Real-Time Action Chunking Policy Execution · Vision-Language-Action Model · High-Frequency Control Tasks · Kinetix Simulator · Action Chunking Algorithm
This paper introduces a novel algorithm called real-time chunking (RTC) to address inference-latency issues in the real-time control of vision-language-action models, showing improved task throughput and high success rates in dynamic and real-world tasks.
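The sketch below shows the latency setup that real-time chunking targets: while inference for the next action chunk runs in the background, the controller keeps executing the current chunk and switches over once the new chunk is ready. The `policy` and `env` interfaces are hypothetical, and RTC's actual contribution, inpainting the overlapping actions of flow policies so the switch stays consistent, is not reproduced here.

# Illustrative asynchronous chunk-switching loop; `policy` returns a chunk of
# `horizon` actions and `env` exposes observe()/step(). Hypothetical interfaces.
from concurrent.futures import ThreadPoolExecutor

def run_realtime(policy, env, horizon=8, steps=100):
    pool = ThreadPoolExecutor(max_workers=1)
    chunk = policy(env.observe())          # initial chunk of `horizon` actions
    i = 0                                  # index into the current chunk
    pending = None                         # in-flight inference for the next chunk
    for _ in range(steps):
        # Kick off inference for the next chunk while the current one executes.
        if pending is None:
            pending = pool.submit(policy, env.observe())
        env.step(chunk[i])
        i += 1
        # Switch to the freshly computed chunk as soon as it is ready
        # (or block for it when the current chunk runs out).
        if pending.done() or i >= horizon:
            chunk, i, pending = pending.result(), 0, None
    pool.shutdown()

Note that the new chunk is computed from an observation that is already several control steps old by the time it is executed, which is exactly the mismatch the paper's inpainting scheme addresses.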
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for Spatio-Temporally Coherent Robotic Manipulation
Published:11/21/2025
Vision-Language-Action Model · Spatiotemporal Coherent Robotic Manipulation · 4D-aware Visual Representation · Multimodal Action Representation · VLA Dataset Extension
The VLA-4D model introduces 4D awareness into Vision-Language-Action models for coherent robotic manipulation, integrating spatial and temporal information to ensure smooth and consistent actions in robot tasks.
See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations
Published:12/8/2025
Vision-Language-Action Model · Robotic Manipulation Policy Learning · One-Shot Learning from Demonstration · Learning from Human Video Demonstrations · Expert Demonstration Video Generation
ViVLA is a general robotic manipulation model that learns new tasks from a single expert video demonstration. By processing the video alongside robot observations, it distills expertise for improved performance in unseen tasks, showing significant experimental gains.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Published:3/19/2025
Generalist Humanoid Robot Foundation Model · Vision-Language-Action Model · Diffusion Transformer Module · Humanoid Robot Manipulation Tasks · Multimodal Data Training
GR00T N1 is an open foundation model for humanoid robots that integrates a reasoning module and a motion-generation module. Trained end-to-end on a pyramid of heterogeneous data, it outperforms existing imitation-learning methods in simulation benchmarks, demonstrating high performance.
FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models
Published:10/2/2025
Vision-Language-Action Model · Robotic Manipulation Failure Recovery · Failure Generation and Recovery System · Robotic Manipulation Dataset · Large-Scale Robot Training Data
The paper introduces FailSafe, a system that generates diverse failure scenarios and executable recovery actions for Vision-Language-Action (VLA) models, achieving up to a 22.6% performance improvement in robotic failure recovery across various tasks.
RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction
Published:5/18/2025
Robotic Failure Analysis and Correction Framework · Vision-Language-Action Model · Task Understanding for Failure Correction · RoboFAC Dataset · Open-World Robotic Manipulation
The RoboFAC framework enhances robotic failure analysis and correction for Vision-Language-Action models in open-world scenarios. It includes a large dataset and a model capable of task understanding, with experimental results showing significant performance improvements.
TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
Published:10/8/2025
Vision-Language-Action Model · Spatial Reasoning Mechanism · Target Identification Memory · Long-Horizon Consistency Modeling · Autoregressive Reasoning Model
TrackVLA++ is a Vision-Language-Action model that enhances embodied visual tracking by introducing a spatial reasoning mechanism and a Target Identification Memory. It effectively addresses tracking failures under severe occlusions and achieves state-of-the-art performance.
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Published:9/12/2025
Vision-Language-Action Model · Reinforcement Learning for Math Reasoning · RL Training for Large Language Models · Multi-Environment Rendering · Efficient Reinforcement Learning Framework
SimpleVLA-RL is introduced to enhance the training of Vision-Language-Action models with reinforcement learning, addressing data scarcity and generalization issues. Results show state-of-the-art performance when applied to OpenVLA-OFT, reducing reliance on labeled data.
Spatial Forcing: Implicit Spatial Representation Alignment for Vision-Language-Action Models
Published:10/14/2025
Vision-Language-Action Model · Enhanced Spatial Understanding Capabilities · Implicit Spatial Representation Alignment · Alignment with 3D Foundation Models · Precise Execution of Robotic Tasks
This paper introduces Spatial Forcing (SF), an implicit alignment method that enhances spatial understanding in Vision-Language-Action (VLA) models. By aligning visual embeddings with 3D foundation models, SF improves the precision of robotic task execution in 3D environments without relying on explicit 3D inputs.
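A rough sketch of the kind of implicit alignment objective described above: project the VLA's intermediate visual tokens and pull them toward features from a frozen 3D foundation model with a cosine loss added to the action loss. The projector, layer choice, and loss weighting are assumptions rather than the paper's exact recipe.

# Hypothetical alignment loss in the spirit of Spatial Forcing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAlignmentLoss(nn.Module):
    def __init__(self, vla_dim=1024, geo_dim=768):
        super().__init__()
        self.proj = nn.Linear(vla_dim, geo_dim)  # trainable projector

    def forward(self, vla_tokens, geo_tokens):
        """vla_tokens: (B, N, vla_dim) intermediate VLA visual tokens;
        geo_tokens: (B, N, geo_dim) frozen 3D-foundation-model features
        for the same image patches."""
        pred = F.normalize(self.proj(vla_tokens), dim=-1)
        target = F.normalize(geo_tokens.detach(), dim=-1)
        return 1.0 - (pred * target).sum(dim=-1).mean()

# Usage: total_loss = action_loss + lambda_align * align_loss(vla_tokens, geo_tokens)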
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Published:10/12/2025
Vision-Language-Action Model · Cross-Embodiment Learning · Soft Prompt Learning · Generalist Robotic Platforms · Large-Scale Heterogeneous Datasets
The paper introduces X-VLA, a scalable Vision-Language-Action model built on a soft-prompted Transformer architecture. By integrating learnable embeddings for diverse robot data sources, X-VLA achieves state-of-the-art performance across simulations and real robots, demonstrating scalable cross-embodiment learning.
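As a concrete illustration of the soft-prompt idea in the X-VLA summary, the sketch below looks up a set of learnable prompt vectors per embodiment (data source) and prepends them to the transformer's input tokens. Prompt length and injection point are assumptions, not the paper's exact design.

# Illustrative per-embodiment soft prompts; sizes and placement are assumptions.
import torch
import torch.nn as nn

class SoftPromptBank(nn.Module):
    def __init__(self, num_embodiments, prompt_len=16, dim=1024):
        super().__init__()
        self.prompt_len = prompt_len
        self.prompts = nn.Embedding(num_embodiments * prompt_len, dim)

    def forward(self, tokens, embodiment_id):
        """tokens: (B, N, dim) input tokens; embodiment_id: (B,) long tensor of source ids."""
        base = embodiment_id.unsqueeze(1) * self.prompt_len                   # (B, 1)
        idx = base + torch.arange(self.prompt_len, device=tokens.device)      # (B, P)
        return torch.cat([self.prompts(idx), tokens], dim=1)                  # (B, P+N, dim)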
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Published:1/17/2025
Vision-Language-Action Model · Frequency-space Action Sequence Tokenization · High-Frequency Robot Action Data · Autoregressive Sequence Models · Robot Action Tokenization
This paper introduces Frequency-space Action Sequence Tokenization (FAST), which enhances autoregressive vision-language-action models for high-frequency robot tasks. FAST handles challenging dexterous actions effectively, and the authors release it as a versatile tokenizer that reduces training time.
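The sketch below conveys the frequency-space idea behind FAST: apply a discrete cosine transform to each action dimension over a chunk and round the scaled coefficients to integer tokens. FAST additionally compresses the quantized coefficients with byte-pair encoding, which is omitted here, and the scale and rounding scheme are assumptions.

# Minimal DCT-based action tokenization sketch (BPE stage omitted).
import numpy as np
from scipy.fft import dct, idct

def tokenize(chunk, scale=10.0):
    """chunk: (T, action_dim) array of actions -> flat list of integer tokens."""
    coeffs = dct(chunk, type=2, axis=0, norm="ortho")      # per-dimension DCT
    return np.round(coeffs * scale).astype(int).ravel().tolist()

def detokenize(tokens, horizon, action_dim, scale=10.0):
    """Invert the tokenizer back to an approximate action chunk."""
    coeffs = np.array(tokens, dtype=float).reshape(horizon, action_dim) / scale
    return idct(coeffs, type=2, axis=0, norm="ortho")

chunk = np.random.randn(16, 7) * 0.05                       # 16 steps, 7-DoF actions
recon = detokenize(tokenize(chunk), 16, 7)
print(np.abs(chunk - recon).max())                          # small reconstruction error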
ADriver-I: A General World Model for Autonomous Driving
Published:11/23/2023
Driving World Models · Multimodal Large Language Model · Vision-Language-Action Model · Diffusion Models · nuScenes Dataset
ADriver-I integrates vision-action pairs using an MLLM and diffusion models to autoregressively predict control signals and future scenes, enabling iterative autonomous driving and significantly improving performance.
TrackVLA: Embodied Visual Tracking in the Wild
Published:5/29/2025
Vision-Language-Action Model · Embodied Visual Tracking · Diffusion Model for Trajectory Planning · LLM Backbone · Embodied Visual Tracking Benchmark (EVT-Bench)
TrackVLA integrates vision, language, and action via a shared LLM, jointly optimizing target recognition and trajectory planning with diffusion models. Trained on the large-scale EVT-Bench dataset, it achieves state-of-the-art, robust embodied visual tracking in complex real-world scenarios.
Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning
Published:5/18/2025
Vision-Language-Action Model · Robotic Action Learning · LLM-guided motion planning · Reinforcement Learning Training · Simulated Humanoid Control
This work introduces PDC, enabling simulated humanoids to perform multiple tasks using egocentric vision alone. Reinforcement learning yields emergent human-like behaviors such as active search, advancing vision-driven dexterous control without privileged states.
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
Published:3/11/2025
Vision-Language-Action Model · Robotic Physical Reachability Representation · Multi-Robot Multimodal Dataset Phys100K · Robotic Vision-Language Reasoning · Environmental Perception and Spatial Representation
PhysVLM integrates a unified Space-Physical Reachability Map into vision-language models, enabling accurate physical-reachability reasoning for robots. It enhances embodied visual reasoning without compromising vision-language capabilities and is validated on the large-scale multi-robot Phys100K dataset.
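To illustrate the reachability notion in the PhysVLM summary, here is a toy reachability map: a top-down grid over the workspace marked reachable wherever a simple reach-envelope test passes. The paper's unified Space-Physical Reachability Map abstracts reachability across robot configurations; this radius check is only a stand-in.

# Toy reachability grid; parameters and the radius test are illustrative only.
import numpy as np

def reachability_map(base_xy, min_reach, max_reach, extent=2.0, res=0.02):
    """Return a 2D boolean grid: True where the arm can reach (top-down view)."""
    xs = np.arange(-extent, extent, res)
    ys = np.arange(-extent, extent, res)
    gx, gy = np.meshgrid(xs, ys, indexing="xy")
    dist = np.hypot(gx - base_xy[0], gy - base_xy[1])
    return (dist >= min_reach) & (dist <= max_reach)

reach = reachability_map(base_xy=(0.0, 0.0), min_reach=0.2, max_reach=0.85)
print(reach.shape, reach.mean())   # fraction of the workspace that is reachable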
WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent
Published:10/8/2025
Multimodal Large Language Model · Vision-Language-Action Model · RL Training for Large Language Models · Complex Information Retrieval Benchmark · Visual-Language Reasoning
WebWatcher is a multimodal deep-research agent that enhances visual-language reasoning via synthetic trajectories and reinforcement learning, validated on the new BrowseComp-VL benchmark for complex visual-text retrieval tasks, surpassing existing baselines.