PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability

Published: 2025-03-11

TL;DR Summary

PhysVLM integrates a unified Space-Physical Reachability Map into vision-language models, enabling accurate physical reachability reasoning for robots. It enhances embodied visual reasoning without compromising vision-language capabilities, validated on the large-scale multi-robot dataset Phys100K and the EQA-phys benchmark.

Abstract

Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1% performance improvement.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability

1.2. Authors

Weijie Zhou^{1,2}, Manli Tao^{2}, Chaoyang Zhao^{2,3,†}, Haiyun Guo^{2}, Honghui Dong^{1}, Ming Tang^{2}, Jinqiao Wang^{1,2,3,4,†}

  • ^{1} School of Traffic and Transportation, Beijing Jiaotong University
  • ^{2} Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
  • ^{3} ObjectEye Inc.
  • ^{4} Guangdong Provincial Key Laboratory of Intellectual Property & Big Data, Guangdong Polytechnic Normal University
  • ^{†} Corresponding authors: chaoyang.zhao@nlpr.ia.ac.cn, jqwang@nlpr.ia.ac.cn

1.3. Journal/Conference

This paper is published as a preprint on arXiv, specifically arXiv:2503.08481. arXiv is an open-access repository for preprints of scientific papers in various fields, including computer science. While it is not a peer-reviewed journal or conference proceeding itself, it serves as a platform for researchers to share their work rapidly before or during the formal peer-review process. Papers on arXiv often represent cutting-edge research and can be highly influential.

1.4. Publication Year

The paper was published on arXiv on 2025-03-11.

1.5. Abstract

Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1% performance improvement.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the inability of current Vision-Language Models (VLMs) to accurately understand and reason about a robot's physical reachability in embodied tasks. While VLMs have made significant strides in general environmental perception, they often fail to generate practical or accurate responses when robotic physical constraints are involved. For instance, a VLM might instruct a robot to grasp an object that is physically out of its reach, leading to task failure or even damage.

This problem is crucial in the field of robotics because effective task execution, planning, and safe human-robot interaction fundamentally rely on a robot's awareness of its physical limitations within an environment. Without this understanding, VLMs applied to robotics are prone to producing hallucinations or impractical actions, limiting their utility in real-world applications.

The specific challenges or gaps in prior research that this paper identifies are:

  1. Lack of unified and efficient representation: Robots vary greatly in design, kinematics, and operational envelopes. A general VLM struggles to directly learn and generalize these diverse physical characteristics.

  2. Integration without compromise: Introducing new modality information (like physical reachability) into existing VLM architectures often risks degrading their general vision-language capabilities. The challenge is to integrate this specific knowledge seamlessly.

    The paper's entry point and innovative idea revolve around creating a unified and robot-agnostic representation of physical reachability, termed the Space-Physical Reachability Map (S-P Map), and then designing a VLM architecture (PhysVLM) that can effectively process this representation alongside visual and linguistic inputs.

2.2. Main Contributions / Findings

The paper makes several primary contributions to address the identified challenges:

  • Unified and Robot-Agnostic Reachability Representation (S-P Map): They propose the Space-Physical Reachability Map (S-P Map), a novel representation that abstracts a robot's physical reachability into a generalized spatial form. This map is independent of specific robot configurations, allowing the VLM to learn features related to reachability rather than robot-specific parameters, thus promoting generalization across diverse robots.

  • Novel VLM Architecture (PhysVLM): They introduce PhysVLM, a vision-language model that extends traditional VLM architectures. It incorporates an additional feature encoder specifically designed to process the S-P Map, enabling the model to reason about physical reachability while preserving its general vision-language capabilities. This dual-branch architecture ensures seamless integration.

  • Large-Scale Multi-Robot Dataset (Phys100K): To facilitate training and evaluation, the authors constructed Phys100K, a large-scale dataset comprising data from multiple robots and diverse environments, including both simulated and real-world scenarios. This dataset is crucial for teaching VLMs about physical reachability.

  • Challenging Evaluation Benchmark (EQA-phys): They developed EQA-phys, a new embodied question answering (EQA) benchmark specifically designed to test a model's understanding of physical reachability. It includes tasks for six different robots in both simulated and real-world settings, making it a robust testbed for robot-aware VLMs.

  • Demonstrated Superior Performance: Experimental results show that PhysVLM significantly outperforms existing VLMs and advanced embodied VLMs on the EQA-phys benchmark, achieving a 14% improvement over GPT-4o. It also surpasses RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks, indicating that the integration of physical reachability does not compromise general visual reasoning abilities.

  • S-P Map Compatibility: The S-P Map is shown to be highly compatible with other VLMs. Its integration into GPT-4o-mini resulted in a 7.1% performance improvement, highlighting its transferability and utility.

    These findings collectively demonstrate a significant step towards equipping VLMs with a crucial understanding of robotic physical constraints, which is essential for more reliable, practical, and safe task execution in embodied AI and robotics.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the contributions of PhysVLM, a beginner should understand several fundamental concepts:

  • Vision-Language Models (VLMs): VLMs are a class of artificial intelligence models that can process and understand information from both visual data (like images or videos) and textual data (like natural language). They are designed to bridge the gap between human language and visual perception, enabling tasks such as image captioning, visual question answering (VQA), and zero-shot image classification. State-of-the-art VLMs typically consist of a vision encoder (e.g., a Vision Transformer) that extracts features from images, a language encoder (e.g., a Transformer-based Large Language Model or LLM) for text, and a mechanism to align or fuse these features, often followed by an LLM for generating responses.
  • Embodied AI/Robotics: Embodied AI refers to intelligent agents (often robots) that exist and interact within a physical (or simulated physical) environment. Unlike disembodied AI that operates purely in a digital realm, embodied agents must perceive, reason about, and act upon their physical surroundings. This often involves tasks like navigation, manipulation, grasping, and task planning, where physical constraints and real-world physics are paramount.
  • Physical Reachability: In robotics, physical reachability refers to the set of all points in space that a robot's end-effector (e.g., a gripper or tool) can physically access. This is determined by the robot's mechanical design (e.g., number of joints, link lengths), joint limits (e.g., maximum and minimum angles), and the robot's base position. An object is "reachable" if its location falls within this reachable workspace. Understanding reachability is critical because a robot cannot perform tasks on objects outside this space.
  • Forward Kinematics: This is a fundamental concept in robotics that describes how to calculate the position and orientation of a robot's end-effector given the angles (or displacements) of its joints. For a robot arm, forward kinematics takes the joint configurations as input and outputs the spatial coordinates of the end-effector.
  • Denavit-Hartenberg (DH) parameters: The Denavit-Hartenberg (DH) convention is a standardized notation for describing the spatial relationship between adjacent links and joints in a robotic arm. It uses four parameters (link length $a_i$, link twist $\alpha_i$, joint offset $d_i$, and joint angle $\theta_i$) to establish a coordinate frame for each link, simplifying the process of deriving forward kinematics equations for complex multi-joint robots. A short numerical sketch of DH-based forward kinematics follows this list.
  • Point Clouds: A point cloud is a set of data points in a three-dimensional coordinate system. In the context of robotics and computer vision, point clouds are often generated by RGB-D cameras (which capture both color and depth information) or LiDAR sensors. Each point in the cloud represents a single measurement of a spatial location on the surface of an object or environment, typically used to create a 3D model of the scene.
  • Voxel Grid: A voxel grid is a three-dimensional grid that discretizes a continuous space into small, cubic units called voxels (volumetric pixels). Similar to how pixels represent 2D images, voxels represent 3D volumes. In robotics, voxel grids are often used to represent the environment, particularly for collision detection, path planning, and mapping reachable workspaces, as each voxel can store information (e.g., occupied, free, reachable).
  • Vision Transformer (ViT): ViT is a type of Transformer model (originally developed for natural language processing) adapted for computer vision tasks. Instead of processing images as a grid of pixels, ViT divides an image into fixed-size patches, treats these patches as sequences of tokens, and then feeds them into a standard Transformer encoder. This allows ViT models to capture long-range dependencies in images and achieve state-of-the-art performance on various visual tasks, often requiring extensive pre-training on large datasets.
  • Multi-Layer Perceptron (MLP): An MLP is a class of feedforward artificial neural networks consisting of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (neuron) in one layer is connected to every node in the next layer with associated weights. MLPs are used for tasks like classification, regression, and feature transformation, often serving as projection layers or non-linear activation functions in larger neural networks.
  • Large Language Model (LLM): LLMs are advanced Transformer-based neural networks trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like text generation, question answering, summarization, and translation. In VLMs, an LLM often serves as the decoder, taking fused visual and language features to produce coherent textual responses.
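
To make the forward kinematics and DH concepts above concrete, here is a minimal NumPy sketch. The dh_transform and forward_kinematics helpers and the two-joint DH table are illustrative assumptions for a toy planar arm, not taken from the paper or from any real robot's specification.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Standard Denavit-Hartenberg homogeneous transform for a single joint."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(joint_angles, dh_table):
    """Chain the per-joint transforms to get the base-to-end-effector matrix T."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(joint_angles, dh_table):
        T = T @ dh_transform(theta, d, a, alpha)
    return T

# Toy 2-DoF planar arm: (d, a, alpha) per joint; link lengths 0.4 m and 0.3 m.
dh_table = [(0.0, 0.4, 0.0), (0.0, 0.3, 0.0)]
T = forward_kinematics([np.pi / 4, -np.pi / 6], dh_table)
print("End-effector position:", T[:3, 3])
```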

3.2. Previous Works

The paper contextualizes its contributions by referencing several prior works, highlighting both the advancements and existing limitations.

3.2.1. VLMs in Robotics

  • Embodied Question Answering (EQA) (e.g., [10, 13, 38]): EQA tasks require an agent (robot) to interact with an environment to answer questions. This often involves navigation, perception, and reasoning within a physical space. While promising, existing EQA models typically focus on high-level visual and semantic understanding without deeply integrating physical constraints.
  • RoboVQA ([29]): Offers a large dataset for robotic visual question answering. This benchmark is relevant for evaluating VLMs in robotic contexts, but the paper implies it doesn't fully capture the nuances of physical reachability.
  • 3D-VLA ([44]): Integrates 3D perception with a generative world model for embodied reasoning. This represents a step towards understanding 3D environments for embodied AI.
  • SpatialVLM ([5]): Enhances VLMs' spatial understanding using extensive 3D data. This work attempts to improve VLMs' ability to reason about space, but may not explicitly model robot-specific physical reachability in a unified way.
  • Robot Task Planning (e.g., [3, 30, 40]): Involves sequencing subtasks to achieve goals. VLMs have been used to assist in this.
    • Code as Policies (CaP) ([20]): Uses LLMs (like Codex) to generate planning code for robots.
    • SayCan ([2]): Combines LLMs (PaLM) with robotic affordances to create feasible action plans based on a robot's capabilities.
  • Limitations of Prior VLM-Robot Integration: The paper notes that these methods often assume all objects are within the robot's operational area, ignoring physical reachability and potentially leading to suboptimal or infeasible plans. This highlights the central gap PhysVLM aims to fill.

3.2.2. Understanding Physical Reachability

Previous efforts have explored representing reachable workspaces, but their integration with VLMs for complex embodied tasks is limited.

  • Voxel Grids with Open-Vocabulary Detection:
    • ReKep ([15]): Generates keypoint proposals and constraints using voxel grids and VLMs. It uses voxel grids to model the environment and identify constraints.
    • VoxPoser ([14]): Synthesizes robot trajectories by integrating OWL-ViT (an open-vocabulary object detection model) and VLMs with voxel-based environment representations.
    • Limitation: While these approaches use voxel grids for environment modeling, they often do not explicitly represent or integrate the robot's own physical reachability into the VLM's reasoning process. They might identify objects, but not whether the robot can reach them.
  • Explicit Workspace Representation:
    • Reachability maps ([41]): Model a robot's spatial capabilities. These maps explicitly define the volume of space a robot can access.
    • Occupancy grids ([16]): Account for obstacles to ensure safe navigation. Occupancy grids are similar to voxel grids but typically focus on marking areas as occupied or free for collision avoidance.
    • Advanced Control Methods ([31, 12]): Methods like online model predictive control with offline workspace analysis and Reachability Expression-based Motion Planning (REMP) address workspace constraints in control.
  • Overall Limitation: Despite these advancements, the paper argues that integrating explicit physical reachability into visual reasoning for complex embodied tasks remains limited. A key reason cited is the lack of large-scale datasets that include robotic physical parameters in VLM pretraining.

3.3. Technological Evolution

The field has evolved from general VLMs excelling at image captioning and VQA to specialized embodied VLMs aiming to assist robots in understanding their environments and performing tasks. Early embodied VLMs focused on visual perception and language grounding for navigation or high-level planning. However, a significant gap emerged: these models, while adept at understanding "what" is in the environment, often lacked the "can I reach it" or "how can I interact with it physically" reasoning crucial for physical robots. This paper's work represents an evolution towards physically-grounded embodied VLMs, moving beyond purely visual or semantic understanding to incorporate intrinsic robotic capabilities.

3.4. Differentiation Analysis

Compared to the main methods in related work, PhysVLM offers several core differences and innovations:

  • Unified, Robot-Agnostic Representation: Unlike methods that might model reachability for a specific robot or rely on complex voxel grid processing that isn't directly integrated into a VLM's input stream, the S-P Map provides a generalized spatial representation. This abstraction allows PhysVLM to focus on which areas are reachable rather than the specific kinematics of individual robots, enabling zero-shot generalization to unseen robots.

  • Direct Integration into VLM Architecture: PhysVLM explicitly integrates the S-P Map as a distinct input modality through a dedicated feature encoder. This differs from methods that might use VLMs to interpret affordances or generate code, but don't feed an explicit, processed reachability map directly into the VLM's core reasoning pipeline. This direct integration is key to making physical reachability a first-class citizen in the VLM's inference process.

  • Preservation of General VLM Capabilities: The dual-branch architecture ensures that the addition of physical reachability information does not degrade the VLM's general vision-language understanding. This is a significant advantage over potential approaches that might try to fine-tune a general VLM with reachability data, which could lead to catastrophic forgetting or reduced performance on broader VLM tasks.

  • Dedicated Large-Scale Dataset and Benchmark: The creation of Phys100K and EQA-phys directly addresses the lack of large-scale datasets and benchmarks for physical reachability in VLM pretraining. This is a crucial enabler for developing and evaluating physically-aware VLMs.

    In essence, PhysVLM innovates by providing a practical, scalable, and generalizable solution to integrate robot physical reachability into the powerful framework of VLMs, moving beyond implicit assumptions or separate affordance computations.

4. Methodology

4.1. Principles

The core idea behind PhysVLM is to enable Vision-Language Models (VLMs) to perform visual reasoning while explicitly accounting for robotic physical reachability. The theoretical basis is that by providing a VLM with a unified, spatial representation of what a robot can physically reach, along with standard visual and linguistic inputs, the model can generate responses that are both contextually relevant and physically feasible. This is achieved through a novel representation, the Space-Physical Reachability Map (S-P Map), which abstracts robot-specific kinematics into a general form, and a dual-branch neural network architecture that seamlessly integrates the S-P Map information without compromising the VLM's general vision-language capabilities. The intuition is that just as humans consider their physical limitations when planning actions, a robot's AI should do the same.

4.2. Core Methodology In-depth (Layer by Layer)

PhysVLM is designed as a large-scale vision-language model for visual reasoning in embodied tasks subject to physical constraints. Its design is centered around integrating three main inputs: instruction text, a visual input (RGB image), and a specialized Space-Physical Reachability Map (S-P Map). The output is a response that aligns with visual context and the robot's physical capabilities, independently of specific robot configurations.

The overall architecture of PhysVLM is illustrated in Figure 2 from the original paper, which is presented below:

Figure 2 (from the original paper). Overview of the PhysVLM architecture, showing how the vision encoder, constraint encoder, and large language model are integrated: the S-P Map encodes the robot's physical reachability, enabling reasoning about the reachable workspace.

As depicted, PhysVLM consists of a vision branch, a physical reachability branch (or constraint encoder), and a Large Language Model (LLM) decoder.

4.2.1. S-P Map Encoding

The S-P Map is a crucial component that unifies the representation of physical reachability across diverse robots, abstracting robot-specific parameters into a generalized spatial form. This allows the model to focus on the spatial regions that are physically reachable rather than the detailed kinematics of a particular robot.

The S-P Map is constructed using the following functional relationship: $ \text{S-P Map} = F\left( \mathcal{P}_{\mathrm{raw}}, \{ \theta_i^{\mathrm{min}}, \theta_i^{\mathrm{max}} \}, \mathrm{DH}, \mathbf{E} \right) $ Where:

  • S-P Map: The generated Space-Physical Reachability Map, representing the abstracted physical reachability.

  • $F$: A function that maps the raw inputs to the S-P Map.

  • $\mathcal{P}_{\mathrm{raw}}$: The raw point cloud data captured from the robot's RGB-D camera. A point cloud is a set of 3D points representing the surface of objects in the scene.

  • $\{ \theta_i^{\mathrm{min}}, \theta_i^{\mathrm{max}} \}$: The minimum and maximum joint angle limits for each joint $i$ of the robot. These define the robot's range of motion.

  • $\mathrm{DH}$: The Denavit-Hartenberg parameters. These are a set of four parameters ($\theta_i, d_i, a_i, \alpha_i$) that describe the geometric relationship between two adjacent links in a robot arm, crucial for defining its kinematic structure.

  • $\mathbf{E}$: The extrinsic calibration matrix. This matrix transforms coordinates from the camera's frame of reference to the robot's base frame of reference, aligning the camera's view with the robot's operational space.

    The process to generate the S-P Map involves several steps (a condensed code sketch of this pipeline follows the list):

  1. Robot Kinematics and Workspace Discretization:

    • For a robot arm with $n$ degrees of freedom, each joint $i$ is described by DH parameters: $\theta_i$ (joint angle), $d_i$ (offset along the $Z$-axis), $a_i$ (link length), and $\alpha_i$ (twist angle).
    • The homogeneous transformation matrix for each joint $i$ is calculated using the standard Denavit-Hartenberg transformation function $G$: $ \mathbf{T}_i = G(\theta_i, d_i, a_i, \alpha_i) $ Where:
      • $\mathbf{T}_i$: The $4 \times 4$ homogeneous transformation matrix for joint $i$.
      • $G$: The Denavit-Hartenberg transformation function, which computes the transformation matrix from link $i-1$ to link $i$.
      • $\theta_i, d_i, a_i, \alpha_i$: The Denavit-Hartenberg parameters for joint $i$.
    • The transformation matrix from the base frame to the end-effector frame (the tool attached to the robot's arm) is obtained by multiplying the individual joint transformation matrices: $ \mathbf{T} = \mathbf{T}_1 \mathbf{T}_2 \cdots \mathbf{T}_n $ Where:
      • $\mathbf{T}$: The overall transformation matrix from the robot's base to its end-effector.
      • $\mathbf{T}_j$: The transformation matrix for joint $j$.
    • To define the reachable workspace, joint angles $\theta_i$ are sampled from their respective motion ranges $[\theta_i^{\mathrm{min}}, \theta_i^{\mathrm{max}}]$. These joint configurations $\{\theta_1, \theta_2, \ldots, \theta_n\}$ are then fed into the forward kinematics equations to compute the corresponding end-effector positions.
    • These end-effector positions define the reachable workspace, which is discretized into a voxel grid $\mathcal{W}_{\mathrm{voxel}}$. This is precomputed offline for efficiency. $ \mathcal{W}_{\mathrm{voxel}} = \left\{ \mathbf{p} \,\middle|\, \mathbf{p} = \mathbf{T}(\theta_1, \theta_2, \ldots, \theta_n) \cdot \mathbf{p}_0 \right\} $ Where:
      • $\mathcal{W}_{\mathrm{voxel}}$: The voxel grid representing the precomputed reachable workspace.
      • $\mathbf{p}$: A point in 3D space, representing a potential end-effector position.
      • $\mathbf{T}(\theta_1, \theta_2, \ldots, \theta_n)$: The forward kinematics transformation matrix for a given joint configuration.
      • $\mathbf{p}_0$: The origin point in the end-effector frame, typically $(0,0,0,1)^T$ in homogeneous coordinates.
  2. Point Cloud Transformation and Filtering:

    • The raw point cloud data $\mathcal{P}_{\mathrm{raw}}$ is captured from the robot's RGB-D camera in the camera's coordinate system.
    • This point cloud is then transformed into the robot's coordinate system using the extrinsic calibration matrix $\mathbf{E}$: $ \mathcal{P} = \mathbf{E} \cdot \mathcal{P}_{\mathrm{raw}} $ Where:
      • $\mathcal{P}$: The point cloud transformed into the robot's coordinate system.
      • $\mathbf{E}$: The extrinsic calibration matrix.
      • $\mathcal{P}_{\mathrm{raw}}$: The raw point cloud from the camera.
    • To determine physical feasibility, each point in the transformed point cloud $\mathcal{P}$ is checked against the precomputed reachable workspace $\mathcal{W}_{\mathrm{voxel}}$ using a voxel grid lookup. This filters the point cloud to include only points that are within the robot's physical reach: $ \mathcal{P}_{\mathrm{valid}} = \left\{ \mathbf{p} \in \mathcal{P} \,\middle|\, \mathbf{p} \in \mathcal{W}_{\mathrm{voxel}} \right\} $ Where:
      • $\mathcal{P}_{\mathrm{valid}}$: The subset of points from $\mathcal{P}$ that fall within the reachable workspace $\mathcal{W}_{\mathrm{voxel}}$.
  3. S-P Map Finalization:

    • The valid point cloud $\mathcal{P}_{\mathrm{valid}}$ is transformed back into the camera coordinate system.
    • Using the camera's intrinsic parameters, these points are projected onto the image plane.
    • Regions on the original depth map that correspond to these projected valid points are marked as physically reachable.
    • For areas that are not reachable, a gray mask is applied, and their boundaries are outlined. This visual representation, the S-P Map, clearly highlights reachable and unreachable regions, providing an abstracted and unified representation of physical reachability independent of the specific robot configuration.
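
The following is a condensed sketch of this offline-voxelization and online-filtering pipeline. The voxel size, joint-sampling density, toy DH table, and identity extrinsics are illustrative assumptions, and the sketch stops at the per-point reachability mask; the actual S-P Map additionally projects the valid points back onto the image plane and draws the gray mask over unreachable regions.

```python
import numpy as np

VOXEL = 0.05  # voxel edge length in meters (illustrative choice)

def dh_transform(theta, d, a, alpha):
    """Standard DH homogeneous transform for one joint."""
    ct, st, ca, sa = np.cos(theta), np.sin(theta), np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def reachable_voxels(dh_table, joint_limits, samples_per_joint=100):
    """Offline: sample joint configurations, run forward kinematics, and collect
    the voxel indices of the resulting end-effector positions."""
    grids = [np.linspace(lo, hi, samples_per_joint) for lo, hi in joint_limits]
    voxels = set()
    for thetas in np.stack(np.meshgrid(*grids), axis=-1).reshape(-1, len(dh_table)):
        T = np.eye(4)
        for theta, (d, a, alpha) in zip(thetas, dh_table):
            T = T @ dh_transform(theta, d, a, alpha)
        voxels.add(tuple(np.floor(T[:3, 3] / VOXEL).astype(int)))
    return voxels

def reachability_mask(points_cam, E, W_voxel):
    """Online: transform camera-frame points into the robot frame and mark each
    point reachable if its voxel lies in the precomputed workspace."""
    pts_h = np.c_[points_cam, np.ones(len(points_cam))]   # homogeneous coordinates
    pts_robot = (E @ pts_h.T).T[:, :3]
    keys = np.floor(pts_robot / VOXEL).astype(int)
    return np.array([tuple(k) in W_voxel for k in keys])  # True = reachable

# Usage with made-up numbers: a toy 2-DoF arm and two scene points.
dh_table = [(0.0, 0.4, 0.0), (0.0, 0.3, 0.0)]
joint_limits = [(-np.pi, np.pi), (-np.pi / 2, np.pi / 2)]
W = reachable_voxels(dh_table, joint_limits)
E = np.eye(4)  # identity extrinsics for the sketch
points = np.array([[0.6, 0.0, 0.0], [2.0, 0.0, 0.5]])
print(reachability_mask(points, E, W))  # boolean mask; the distant second point lies far beyond the arm's reach
```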

4.2.2. Model Architecture

PhysVLM employs a dual-branch architecture to handle visual information and physical reachability constraints independently before fusion; a schematic code sketch of this design follows the component list below.

  • Vision Branch:

    • Processes egocentric RGB images (images taken from the robot's perspective).
    • Utilizes a pre-trained Vision Transformer (ViT), specifically SigLip-400M [42], to extract high-level visual features.
    • A Max Pooling layer is applied to reduce computational overhead (downsample features).
    • A two-layer Multi-Layer Perceptron (MLP) transforms these visual features into token representations suitable for multimodal fusion with language.
  • Physical Reachability Branch:

    • Processes the generated S-P Map.
    • Also employs the SigLip-400M model for feature extraction from the S-P Map (treated as an image-like input).
    • Follows with Max Pooling.
    • A feature fusion layer combines the features extracted from the S-P Map with the visual features from the vision branch.
    • A two-layer MLP then refines these fused features into reachability-specific tokens.
  • Language Decoder:

    • The Qwen-2.5-Instruct-3B model [33, 39] serves as the Large Language Model (LLM) decoder.
    • It uses the Qwen-2.5 tokenizer to process natural language instructions (prompts or questions).
    • The decoder integrates multimodal tokens from the vision branch (visual features), the S-P Map branch (reachability features), and the language inputs (instructions).
    • Its role is to generate coherent and contextually relevant textual responses that account for both the visual context and the robot's physical reachability information.
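
A schematic sketch of the dual-branch design is given below in PyTorch-style code. The module and dimension names (vision_encoder, sp_encoder, vis_dim, llm_dim, the stand-in encoders and LLM) are illustrative assumptions for exposition; they are not the authors' implementation of SigLip-400M or Qwen-2.5-Instruct-3B.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhysVLMSketch(nn.Module):
    """Dual-branch sketch: a vision branch, an S-P Map branch fused with visual
    features, and an LLM decoder that consumes both token streams plus text."""
    def __init__(self, vision_encoder, sp_encoder, llm, vis_dim=1152, llm_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a SigLip-style ViT
        self.sp_encoder = sp_encoder                  # separate encoder for the S-P Map
        self.vis_proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                      nn.Linear(llm_dim, llm_dim))   # 2-layer MLP
        self.fuse = nn.Linear(2 * vis_dim, vis_dim)   # fuse S-P Map and visual features
        self.sp_proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                     nn.Linear(llm_dim, llm_dim))    # 2-layer MLP
        self.llm = llm

    @staticmethod
    def _pool(tokens):
        # Max-pool along the token dimension to reduce the number of patch tokens.
        return F.max_pool1d(tokens.transpose(1, 2), kernel_size=2).transpose(1, 2)

    def forward(self, image, sp_map, text_tokens):
        v = self._pool(self.vision_encoder(image))    # (B, N/2, vis_dim)
        s = self._pool(self.sp_encoder(sp_map))       # (B, N/2, vis_dim)
        vis_tokens = self.vis_proj(v)
        sp_tokens = self.sp_proj(self.fuse(torch.cat([s, v], dim=-1)))
        # The LLM decoder receives image tokens, S-P Map tokens, and text tokens.
        return self.llm(vis_tokens, sp_tokens, text_tokens)

# Smoke test with stand-in components (random features, dummy decoder).
dummy_enc = lambda x: torch.randn(x.shape[0], 64, 1152)
dummy_llm = lambda v, s, t: torch.zeros(v.shape[0], 8)
model = PhysVLMSketch(dummy_enc, dummy_enc, dummy_llm)
out = model(torch.randn(2, 3, 384, 384), torch.randn(2, 3, 384, 384),
            torch.zeros(2, 16, dtype=torch.long))
print(out.shape)  # torch.Size([2, 8])
```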

4.2.3. Training Data Construction

The training data for PhysVLM is a combination of the custom Phys100K dataset and existing general VQA datasets.

  • Phys100K Dataset:
    • A large-scale multi-robot dataset focused on physical reachability questions and answers.
    • Aggregates data from:
      • RoboVQA (20K samples): Existing robotic VQA dataset.
      • ScanNet [8] (10K samples): Dataset of richly-annotated 3D indoor scenes.
      • OpenX-Embodiment [27] (60K samples): Large robotic learning dataset.
      • PyBullet (10K additional samples): A physics simulator used to generate data for four robotic arms (UR5, FR5, CR5, FRANKA).
    • Depth Map Generation: For datasets like RoboVQA or ScanNet that might lack depth maps, these are generated using DepthAnything-v2. Depth maps are essential for S-P Map creation.
    • Object Segmentation: Grounding DINO [23] and SAM2 [34] are used to obtain 2D bounding boxes and segmentation results for objects in the images, which are helpful for identifying objects and their locations.
    • PyBullet Data Specifics: In PyBullet, precise robot configurations are known, allowing direct generation of the S-P Map using the method described above. Reachability labels for objects are obtained via simulated motion.
    • Pseudo-labeling for other datasets: The S-P Map's abstraction allows generating pseudo-labels for reachability on datasets without precise robot parameters. This is done by approximating reachability based on segmentation results and depth values, marking regions and objects as "reachable" or "unreachable".
  • General VQA Datasets:
    • LLaVA-Pretrain: A foundational dataset for VLM pre-training.
    • ShareGPT4V: Another large-scale VLM pre-training dataset.
    • RoboVQA: Used for both general VQA and as a source for Phys100K.
  • Question-Answer Pair Generation:
    • Embodied QA: GPT-4 is used to generate question-answer pairs for ScanNet and RoboVQA. These cover categories such as Function Reasoning, World Knowledge, Object Recognition, Object Localization, Attribute Recognition, Spatial Reasoning, Object State Recognition, and Hallucination. Prompts for GPT-4 are provided in the appendix (not included in the given text).
    • Tasks Involving Physical Reachability: Fixed task templates are used with the "reachable" label to generate Q&A pairs. For example: USER: <image>\n<sp_map>\nIs the [Object] in the robot's reachable space? ASSISTANT: Yes, it is. Here, <image> and <sp_map> are placeholders for the image patch tokens and S-P Map patch tokens, respectively, and [Object] represents the relevant object category. Examples are shown in Figure 3 from the original paper; a template-filling sketch follows Figure 3 below.

The following are the details of the Phys100K Dataset and EQA-Phys Benchmark from Figure 3 of the original paper:

Figure 3 (from the original paper). Details of the Phys100K dataset and EQA-phys benchmark: physical reachability tasks across multiple robots, the composition of the data sources, and example question-answer pairs illustrating reachability reasoning.
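
As a concrete illustration of the fixed-template Q&A generation described above, here is a minimal sketch; the template strings and function name are paraphrased for exposition, and the exact prompt formatting in Phys100K may differ.

```python
# Hypothetical template filler for reachability Q&A pairs.
TEMPLATE_Q = "USER: <image>\n<sp_map>\nIs the {obj} in the robot's reachable space?"
TEMPLATE_A = {True: "ASSISTANT: Yes, it is.", False: "ASSISTANT: No, it isn't."}

def make_reachability_qa(objects_with_labels):
    """objects_with_labels: list of (object_name, is_reachable) pairs,
    e.g. from PyBullet motion checks or pseudo-labels."""
    return [(TEMPLATE_Q.format(obj=name), TEMPLATE_A[reachable])
            for name, reachable in objects_with_labels]

print(make_reachability_qa([("blue mug", False), ("yellow apple", True)]))
```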

4.2.4. Training Pipeline

PhysVLM employs a two-stage training process to effectively utilize the S-P Map and ensure generalization; a short freeze/unfreeze sketch follows the stage descriptions below.

  • Stage 1: Multimodal Feature Alignment:

    • Objective: To build a foundational understanding of visual inputs and physical reachability, independent of specific robot configurations.
    • Data: LLaVA-Pretrain and OpenX-Embodiment datasets from Phys100K.
    • Training Focus: Only the projection layers (presumably the MLPs in both branches that convert features into tokens for the LLM) are trained in this stage. This allows the model to learn how to represent and align the features from images and S-P Maps with the LLM's token space.
  • Stage 2: Full Model Training for Complex Reasoning:

    • Objective: To enhance PhysVLM's ability to handle complex visual reasoning tasks with physical reachability constraints, ensuring generalization across diverse environments and robots.
    • Data: Full Phys100K dataset, ShareGPT4V, and RoboVQA.
    • Training Focus: All parameters of the model (including the SigLip encoders and LLM) are unfrozen and trained end-to-end.
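
A minimal sketch of the freeze/unfreeze logic implied by this two-stage schedule is shown below; the submodule names (vis_proj, sp_proj) are the illustrative ones used in the architecture sketch above, not the authors' parameter names.

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int):
    """Stage 1: train only the projection layers that map visual / S-P Map
    features into the LLM token space. Stage 2: unfreeze everything."""
    projection_prefixes = ("vis_proj", "sp_proj")  # illustrative submodule names
    for name, param in model.named_parameters():
        if stage == 1:
            param.requires_grad = name.startswith(projection_prefixes)
        else:
            param.requires_grad = True

# configure_stage(model, 1)  # stage 1: batch size 128, lr 1e-3 (per the paper)
# configure_stage(model, 2)  # stage 2: batch size 64, lr 1e-5, end-to-end
```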

4.2.5. Implementation Details

  • Training Resources: Trained for 48 hours using eight A800 GPUs.
  • Training Epochs: Each of the two stages lasts one epoch.
  • Batch Sizes:
    • Stage 1: 128
    • Stage 2: 64
  • Learning Rates:
    • Stage 1: $1 \times 10^{-3}$
    • Stage 2: $1 \times 10^{-5}$
  • Final Model: The final model is referred to as PhysVLM-3B, indicating it uses a 3-billion parameter LLM.

4.2.6. EQA-phys Benchmark

A new embodied question answering (EQA) benchmark, EQA-phys, is introduced to specifically evaluate the model's ability to perform QA tasks constrained by physical limitations.

  • Simulator Dataset:
    • Comprises 200 samples and 1,000 questions from the PyBullet validation set.
    • Includes data generated from four different robot arms: UR5, FR5, CR5, and FRANKA.
  • Real-World Evaluation Set (Zero-Shot):
    • A zero-shot evaluation set based on real-world data.
    • Features UR3 and XArm6 robots in two distinct scenarios.
    • Contains 60 samples and 300 questions.
    • All questions and answers are manually annotated by domain experts to ensure accuracy and relevance to physical reachability.
    • This component tests the model's ability to generalize to unseen robots and environments without specific training data for these real-world setups.

5. Experimental Setup

5.1. Datasets

The experiments utilize a combination of newly constructed and existing datasets to train and evaluate PhysVLM's capabilities in physical reachability reasoning and general embodied visual reasoning.

  • Phys100K Dataset: This is the large-scale multi-robot dataset constructed by the authors for training PhysVLM. As detailed in the methodology section, it aggregates data from RoboVQA (20K samples), ScanNet (10K samples), OpenX-Embodiment (60K samples), and 10K samples from PyBullet simulations. Its purpose is to provide diverse scenarios and robot configurations for learning physical reachability.
    • Domain: Robotic manipulation, object interaction, physical reachability scenarios.
    • Characteristics: Multi-robot, multi-environment (simulated and real-world origins), includes RGB images, depth maps (generated if not originally present), and segmentation masks. Crucially, it includes S-P Maps and reachability labels (either direct from simulation or pseudo-labeled).
    • Data Sample: Figure 3 (shown above in Section 4.2.3) provides examples of question-answer pairs from Phys100K for tasks involving physical reachability. For instance, a USER prompt might include an <image> and an <sp_map> (which shows reachable areas) and ask: Is the [Object] in the robot's reachable space?, with the ASSISTANT responding Yes, it is. or No, it isn't.
  • EQA-phys Benchmark: This is a newly introduced challenging benchmark specifically designed to test physical reachability understanding.
    • Source: Combines a simulator dataset (200 samples, 1000 questions from PyBullet validation set, covering UR5, FR5, CR5, FRANKA robots) and a real-world zero-shot evaluation set (60 samples, 300 questions, manually annotated, featuring UR3 and XArm6 robots).
    • Domain: Embodied Question Answering with strong physical reachability constraints.
    • Purpose: To assess the model's ability to integrate visual reasoning with robotic physical reachability, especially its zero-shot generalization to unseen robots and environments.
  • OpenEQA [25] and RoboVQA-val [29] Benchmarks: These are existing benchmarks used to evaluate the model's general visual reasoning ability in embodied tasks.
    • RoboVQA-val: A standard benchmark for robotic visual question answering.

    • OpenEQA: An embodied QA benchmark that provides diverse scenarios for VLMs. It includes data from ScanNet and HM3D (Habitat-Matterport 3D Dataset).

    • Purpose: To demonstrate that PhysVLM's focus on physical reachability does not detract from its performance on general embodied VQA tasks.

      These datasets were chosen to provide a comprehensive evaluation: Phys100K for training a physically-aware VLM, EQA-phys for rigorously testing the core contribution (physical reachability), and RoboVQA-val and OpenEQA for validating general VLM capabilities in robotic contexts.

5.2. Evaluation Metrics

The paper uses different evaluation metrics depending on the task type, ensuring a comprehensive assessment of PhysVLM's performance; a short sketch that computes these metrics follows the list below.

  • LLM Scoring (for EQA-phys and tasks involving physical reachability):

    • Conceptual Definition: This metric is a subjective, human-like evaluation of the quality and correctness of the model's generated responses, particularly when the task involves reasoning about nuanced concepts like physical reachability. It aims to capture whether the response is truly accurate, practical, and useful in a robotic context, beyond simple keyword matching.
    • Methodology: Following existing studies [24, 25], LLM scoring typically involves using a powerful Large Language Model (like GPT-4) as an impartial judge to score the responses.
    • Scoring System:
      • 5 points: Assigned for completely correct responses.
      • 1 point: Assigned for incorrect responses.
    • Final Calculation: The average score across all questions is calculated and expressed as a percentage. This provides a scalar measure of performance reflecting the model's ability to provide correct and meaningful answers.
    • Mathematical Formula: Let $N$ be the total number of questions and $S_k$ be the score (5 or 1) assigned to the response for question $k$. The LLM Score (as a percentage) is calculated as: $ \text{LLM Score (\%)} = \frac{\sum_{k=1}^{N} S_k}{N \times 5} \times 100 $ Where:
      • $N$: Total number of question-answer pairs evaluated.
      • $S_k$: The score assigned to the $k$-th response by the LLM judge (either 5 for correct or 1 for incorrect).
      • 5: The maximum possible score for a single question.
  • BLEU (Bilingual Evaluation Understudy) Score (for RoboVQA-val):

    • Conceptual Definition: BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. In VQA or image captioning, it's used to compare a candidate generated text to one or more reference texts. It measures the overlap of n-grams (contiguous sequences of n items from a given sample of text) between the candidate and reference texts, with a penalty for brevity. Higher BLEU scores indicate closer resemblance to human-quality reference translations or answers.
    • Mathematical Formula: The BLEU score is calculated as follows: $ \text{BLEU} = \text{BP} \cdot \exp \left( \sum_{n=1}^{N} w_n \log p_n \right) $ Where:
      • $\text{BP}$ (Brevity Penalty): This factor penalizes candidate sentences that are too short compared to the reference sentences. $ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} $ Here, $c$ is the length of the candidate sentence and $r$ is the effective reference corpus length.
      • $N$: The maximum n-gram order considered (e.g., for BLEU-4, $N = 4$).
      • $w_n$: The weight for the $n$-gram precision (typically $1/N$ for uniform weights).
      • $p_n$: The modified n-gram precision. This is the ratio of the number of matched n-grams in the candidate to the total number of n-grams in the candidate, clipped at the maximum count occurring in any reference. $ p_n = \frac{\sum_{\text{sentence } \in \text{candidate}} \sum_{n\text{-gram } \in \text{sentence}} \min(\text{Count}(\text{n-gram}), \text{MaxRefCount}(\text{n-gram}))}{\sum_{\text{sentence } \in \text{candidate}} \sum_{n\text{-gram } \in \text{sentence}} \text{Count}(\text{n-gram})} $ Where:
        • $\text{Count}(\text{n-gram})$: The count of an n-gram in the candidate sentence.
        • $\text{MaxRefCount}(\text{n-gram})$: The maximum count of that n-gram in any single reference sentence.
    • Symbol Explanation:
      • $\text{BP}$: Brevity Penalty, a factor applied to penalize short generated sentences.
      • $c$: Length of the candidate (generated) sentence.
      • $r$: Effective reference corpus length (sum of reference lengths closest to candidate lengths).
      • $N$: The highest order of n-grams considered (e.g., 1 for unigrams, 2 for bigrams, up to 4 for BLEU-4).
      • $w_n$: Weight assigned to the precision of $n$-grams. Commonly, $w_n = 1/N$.
      • $p_n$: Modified n-gram precision, which measures the accuracy of $n$-grams, giving partial credit to n-grams that appear in multiple references.
      • $\text{Count}(\text{n-gram})$: The number of times a specific n-gram appears in the candidate sentence.
      • $\text{MaxRefCount}(\text{n-gram})$: The maximum number of times a specific n-gram appears in any single reference sentence.
  • Success Rate (for Robot Task Planning):

    • Conceptual Definition: This metric directly measures the operational effectiveness of the robot's generated plan. For task planning, it evaluates whether the robot successfully achieves the specified goal based on the plan provided by the VLM.
    • Methodology: Each task type is executed a fixed number of times (e.g., 10 times in this paper), and the success or failure of each attempt is recorded.
    • Final Calculation: The success rate is the percentage of successful task executions out of the total number of attempts.
    • Mathematical Formula: Let $N_{\text{total}}$ be the total number of task execution attempts and $N_{\text{success}}$ be the number of successful attempts. $ \text{Success Rate (\%)} = \frac{N_{\text{success}}}{N_{\text{total}}} \times 100 $ Where:
      • $N_{\text{success}}$: Number of times the robot successfully completed the task.
      • $N_{\text{total}}$: Total number of attempts to complete the task.
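
To make these metrics concrete, the sketch below computes the LLM score and success rate directly from their formulas and uses NLTK's sentence-level BLEU for the n-gram metric; the example sentences are invented, and NLTK is assumed to be installed.

```python
from nltk.translate.bleu_score import sentence_bleu  # assumes nltk is installed

def llm_score_percent(scores):
    """scores: per-question judge scores (5 = correct, 1 = incorrect)."""
    return sum(scores) / (len(scores) * 5) * 100

def success_rate_percent(n_success, n_total):
    return n_success / n_total * 100

# LLM score: 3 correct + 2 incorrect -> (15 + 2) / 25 * 100 = 68.0
print(llm_score_percent([5, 5, 5, 1, 1]))

# Success rate: 7 successes out of 10 attempts -> 70.0
print(success_rate_percent(7, 10))

# BLEU-1..4 for an invented candidate/reference pair (tokenized word lists).
reference = [["the", "robot", "cannot", "reach", "the", "white", "box"]]
candidate = ["the", "robot", "can", "not", "reach", "the", "white", "box"]
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)  # uniform n-gram weights, mirroring Table 2's columns
    print(f"BLEU-{n}: {sentence_bleu(reference, candidate, weights=weights):.3f}")
```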

5.3. Baselines

The authors compare PhysVLM against a range of baseline models falling into two main categories:

  • API-accessible VLMs: These are powerful, general-purpose VLMs that are typically accessed via an API and are not specifically designed for embodied AI or robotics. They serve as a strong baseline for general vision-language understanding.

    • GPT-4o-mini [1]
    • Claude 3.5 [28]
    • GPT-4o [1]
    • The paper also evaluates these models when augmented with the S-P Map as an additional input, to showcase the S-P Map's compatibility and benefit.
  • Embodied VLMs: These are VLMs designed with embodied AI applications in mind, often incorporating elements like 3D perception or spatial reasoning.

    • SpatialVLM [5]: Enhances VLMs' spatial understanding using 3D data. The paper uses the 3B version, which has a similar parameter count to PhysVLM-3B.

    • SpatialBot [4]: Another model focused on precise spatial understanding. The paper uses the 3B version.

    • 3D-VLA [44]: Integrates 3D perception with a generative world model for embodied reasoning. Its executable version was unavailable, so reported results are used for comparison on RoboVQA-val.

    • RoboMamba [22]: A multimodal state space model for efficient robot reasoning and manipulation. Its executable version was unavailable, so reported results are used for comparison on RoboVQA-val.

      The selection of baselines provides a robust comparison, pitting PhysVLM against both state-of-the-art general VLMs and specialized embodied VLMs, some of which have comparable model sizes.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Results on EQA-phys

The EQA-phys benchmark is specifically designed to test a model's ability to integrate visual reasoning with robotic physical reachability.

The following are the results on EQA-phys from Table 1 of the original paper:

| Category | Model | UR3 (Real) | XArm6 (Real) | UR5 (Sim) | FR5 (Sim) | CR5 (Sim) | Franka (Sim) | All |
|---|---|---|---|---|---|---|---|---|
| API-based VLMs | GPT-4o-mini | 54.3 | 56.0 | 49.4 | 55.4 | 54.6 | 47.1 | 52.8 |
| API-based VLMs | Claude-3.5 | 56.2 | 60.5 | 54.0 | 58.1 | 55.7 | 54.3 | 56.4 |
| API-based VLMs | GPT-4o | 56.7 | 61.5 | 55.7 | 58.3 | 57.5 | 52.6 | 57.0 |
| API-based VLMs | GPT-4o-mini + S-P Map | 60.0 (↑5.7) | 60.5 (↑4.5) | 57.0 (↑7.6) | 59.1 (↑3.7) | 59.2 (↑4.6) | 53.3 (↑6.2) | 59.8 (↑7.0) |
| API-based VLMs | Claude-3.5 + S-P Map | 65.3 (↑9.1) | 67.3 (↑6.8) | 54.9 (↑0.9) | 58.3 (↑0.2) | 58.2 (↑2.5) | 58.1 (↑3.8) | 60.3 (↑3.4) |
| API-based VLMs | GPT-4o + S-P Map | 66.6 (↑9.9) | 68.1 (↑6.6) | 55.8 (↑0.1) | 60.7 (↑1.4) | 59.4 (↑1.9) | 57.6 (↑5.0) | 61.3 (↑4.1) |
| Embodied VLMs | SpatialVLM | 56.3 | 55.1 | 54.6 | 59.1 | 52.0 | 47.5 | 54.1 |
| Embodied VLMs | SpatialBot | 51.1 | 50.2 | 50.0 | 48.1 | 53.3 | 54.4 | 51.1 |
| Embodied VLMs | PhysVLM-3B | 64.1 | 63.0 | 71.4 | 75.7 | 74.0 | 78.1 | 71.0 |

Table 1: Results on EQA-phys (LLM Score %)

  • Baseline Performance: Both API-based and embodied VLMs without explicit reachability understanding achieve suboptimal scores, generally in the 50-60% range (e.g., GPT-4o at 57.0%, SpatialVLM at 54.1%). This confirms the paper's premise that these models struggle with tasks requiring an understanding of physical reachability.
  • PhysVLM's Superiority: PhysVLM-3B significantly outperforms all baselines, achieving an average score of 71.0%. This represents a 14% improvement over GPT-4o (71.0 - 57.0 = 14.0). This strong performance validates that PhysVLM effectively integrates visual reasoning with an understanding of physical reachability. The model demonstrates particularly high scores in simulated environments (e.g., UR5 71.4%, FR5 75.7%, CR5 74.0%, FRANKA 78.1%).
  • Impact of S-P Map on API-based VLMs: Integrating the S-P Map into API-based VLMs (e.g., GPT-4o + S-P Map) leads to substantial performance gains (e.g., GPT-4o improves from 57.0% to 61.3%, a 4.1% increase; GPT-4o-mini improves by 7.0%). This highlights the S-P Map's effectiveness as a general, robot-agnostic representation of reachability, which can be leveraged by various VLMs. The improvements are particularly notable for real-world robots (e.g., GPT-4o + S-P Map on UR3: 66.6%, an increase of 9.9%).
  • Zero-Shot Generalization: PhysVLM-3B achieves scores of over 63% on the zero-shot real-world evaluations (UR3: 64.1%, XArm6: 63.0%), despite encountering new environments and different robot parameters. This indicates its ability to generalize, which is attributed to the S-P Map's unified representation and the model's independent encoding branches that learn generalizable visual features.

6.1.2. Results on Embodied QA

These experiments evaluate PhysVLM's general visual reasoning capabilities in embodied tasks, demonstrating that integrating physical constraints does not diminish its broader VLM performance.

The following are the results on the RoboVQA-val set from Table 2 of the original paper:

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|
| SpatialVLM* | 5.1 | 3.0 | 1.9 | 1.2 |
| SpatialBot* | 12.4 | 9.3 | 8.0 | 7.2 |
| 3D-VLA | 48.3 | 38.5 | 31.7 | 26.8 |
| RoboMamba | 54.9 | 44.2 | 39.5 | 36.3 |
| PhysVLM-3B | 65.3 | 62.4 | 50.9 | 43.5 |

Table 2: Embodied QA results on the RoboVQA-val set (BLEU Score %) (* indicates models not pre-trained on the RoboVQA dataset)

  • RoboVQA-val Performance: PhysVLM-3B achieves the best performance across all BLEU scores on the RoboVQA-val benchmark. Notably, it surpasses RoboMamba by 7.2% in BLEU-4 (43.5 vs. 36.3), indicating superior general embodied visual reasoning. Even against models that were pre-trained on RoboVQA (whereas SpatialVLM and SpatialBot were not), PhysVLM performs strongly.

    The following are the results on the OpenEQA benchmark from Table 3 of the original paper:

| Model | EM-EQA (ScanNet) | EM-EQA (HM3D) | All |
|---|---|---|---|
| SpatialVLM | 42.9 | 44.3 | 43.8 |
| SpatialBot | 45.3 | 51.0 | 49.1 |
| GPT-4V | 57.4 | 51.3 | 55.3 |
| GPT-4o* | 68.2 | 65.2 | 66.7 |
| PhysVLM-3B | 60.7 | 51.2 | 57.4 |

Table 3: Embodied QA results on the OpenEQA benchmark (LLM Score %) (* indicates that only the first 200 samples were tested due to API limitations)

  • OpenEQA Performance: On the OpenEQA benchmark, PhysVLM-3B ranks second overall (57.4%), outperforming SpatialVLM, SpatialBot, and even GPT-4V (55.3%). It is only surpassed by GPT-4o (66.7%), which is a much larger and more powerful general VLM. This demonstrates that PhysVLM maintains strong general visual reasoning capabilities while incorporating physical reachability.

6.1.3. Results on Robot Task Planning

This section evaluates how PhysVLM's understanding of physical reachability translates into more effective robot task planning.

The following are the task planning results from Table 4 of the original paper:

| Model | All objects in range | Part of objects in range |
|---|---|---|
| GPT-4o-mini | 70.5 | 23.2 |
| Claude-3.5 | 73.6 | 32.1 |
| GPT-4o | 75.9 | 35.8 |
| SpatialVLM | 64.4 | 21.5 |
| SpatialBot | 65.6 | 25.3 |
| PhysVLM-3B | 69.2 | 48.4 |

Table 4: Task planning results (Success Rate %)

  • Objects within Range: When all objects are within the robot's physical reach, PhysVLM performs comparably to other models (e.g., 69.2% vs. GPT-4o's 75.9%). In these cases, directly interacting with objects (grabbing/placing) is feasible for most models, as physical reachability isn't a distinguishing constraint.
  • Objects Partially Out of Range: The critical advantage of PhysVLM emerges when some objects are outside the robot's immediate physical reach. In these challenging scenarios, PhysVLM achieves a success rate of 48.4%, which is significantly higher than all baselines (e.g., GPT-4o at 35.8%, Claude-3.5 at 32.1%, SpatialVLM at 21.5%). This demonstrates that PhysVLM can reason that the robot needs to perform intermediate steps (e.g., "move closer") before attempting to grasp or place an object, leading to more robust and successful task plans. This is a direct consequence of its explicit understanding of robotic physical reachability.

6.2. Data Presentation (Tables)

All tables from the original paper have been transcribed and presented in the subsections above.

6.3. Ablation Studies / Parameter Analysis

The ablation studies provide insights into the contribution of individual components of PhysVLM.

6.3.1. Effectiveness of S-P Map

This study evaluates the importance of the S-P Map by comparing performance with and without its input, and also with a simple depth map.

The following are the ablation study results on the S-P Map from Table 5 of the original paper:

| ID | S-P Map | Depth Map | EQA-phys (Real) | EQA-phys (Sim) |
|---|---|---|---|---|
| 1 | ✓ | | 63.5 | 74.8 |
| 2 | | ✓ | 58.1 | 62.4 |
| 3 | | | 54.2 | 58.8 |

Table 5: Ablation study on the S-P Map (LLM Score %)

  • S-P Map vs. No S-P Map (ID 1 vs. ID 3): Removing the S-P Map entirely (ID 3) leads to a significant performance drop compared to using it (ID 1). The overall average score decreases by 16% in simulation (from 74.8% to 58.8%) and 9.3% in real-world evaluations (from 63.5% to 54.2%). This confirms that the S-P Map is crucial for the model to handle robotic physical reachability. Without it, the model struggles significantly.
  • S-P Map vs. Depth Map (ID 1 vs. ID 2): Replacing the S-P Map with a raw Depth Map (ID 2) also significantly degrades performance on zero-shot tasks (real-world EQA-phys drops from 63.5% to 58.1%; simulation drops from 74.8% to 62.4%). This indicates that raw depth information alone is insufficient. The S-P Map's abstraction of physical reachability (marking reachable regions) is superior to just providing depth values, as depth doesn't inherently convey kinematic reachability.

6.3.2. Effectiveness of an Additional Feature Encoder

This study examines whether having an independent feature encoder for the S-P Map is beneficial compared to sharing weights with the visual feature encoder.

The following are the ablation study results on the effectiveness of an additional feature encoder from Table 6 of the original paper:

| Encoder | EQA-phys | OpenEQA |
|---|---|---|
| Independent | 71.0 | 57.4 |
| Shared | 68.2 | 56.5 |

Table 6: Ablation study on the effectiveness of an additional feature encoder (LLM Score %)

  • Independent vs. Shared Encoder: When the feature encoder for the S-P Map is independent, PhysVLM achieves better performance on both EQA-phys (71.0%) and OpenEQA (57.4%). If the encoder shares weights with the visual feature encoder, performance drops to 68.2% on EQA-phys and 56.5% on OpenEQA.
  • Reasoning: The paper attributes this to the distinct nature of S-P Map features compared to general image features. The S-P Map focuses on abstracted reachability information (e.g., a gray mask for unreachable areas), which is different from the rich visual details in an RGB image. Sharing weights forces a single encoder to learn potentially conflicting feature representations, leading to suboptimal performance. Furthermore, the training data contains significantly more image-text pairs than S-P Map data, meaning a shared encoder would be heavily biased towards learning visual features, potentially neglecting the nuances of S-P Map features.

6.3.3. Effectiveness of Training Data

This study investigates the impact of different data sources within the Phys100K dataset on overall performance.

The following are the ablation study results on the effectiveness of training data from Table 7 of the original paper:

| Part of Phys100K | EQA-phys (Real) | EQA-phys (Sim) |
|---|---|---|
| All | 63.5 | 74.8 |
| w/o PyBullet | 62.1 | 65.4 |
| w/o other datasets | 58.6 | 71.5 |

Table 7: Ablation study on the effectiveness of training data (LLM Score %)

  • Full Phys100K (All): Using all components of Phys100K yields the best performance (63.5% on real-world EQA-phys, 74.8% on simulated EQA-phys).
  • Excluding PyBullet Data (w/o PyBullet): Removing the PyBullet data (which provides precise robot configurations and direct S-P Map generation) leads to a notable decrease in performance, especially in simulation (from 74.8% to 65.4%). This highlights the critical role of high-quality, ground-truth simulated data for learning physical reachability.
  • Excluding Other Embodied Datasets (w/o Other Datasets): Removing the other embodied datasets (RoboVQA, ScanNet, OpenX-Embodiment, which rely more on pseudo-labeling) also degrades performance, particularly in real-world scenarios (58.6%). This suggests that the diversity and scale of these broader embodied AI datasets are important for PhysVLM's generalization capabilities, even if they sometimes use pseudo-labels for reachability.
  • Conclusion: The ablation studies confirm that all components of Phys100K contribute to PhysVLM's overall performance, emphasizing the need for a comprehensive and diverse training dataset for robust physical reachability understanding.
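
The three rows of Table 7 can be read as data-mixture configurations over the Phys100K sources named above. The sketch below expresses them as a hypothetical config; the source names follow the text, while the descriptions and structure are illustrative only.

```python
# Hypothetical data-mixture configs mirroring the Table 7 ablations.
# Source names follow the text; descriptions and structure are illustrative.
PHYS100K_SOURCES = {
    "pybullet": "simulated scenes with exact robot configurations and S-P Maps",
    "robovqa": "pseudo-labeled reachability on RoboVQA data",
    "scannet": "pseudo-labeled reachability on ScanNet scenes",
    "openx_embodiment": "pseudo-labeled reachability on OpenX-Embodiment data",
}

ABLATIONS = {
    "all": list(PHYS100K_SOURCES),                                     # Table 7, row 1
    "w/o pybullet": [s for s in PHYS100K_SOURCES if s != "pybullet"],  # row 2
    "w/o other datasets": ["pybullet"],                                # row 3
}

for name, sources in ABLATIONS.items():
    print(f"{name:>20}: {sources}")
```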

6.4. Qualitative Results

The qualitative results provide visual examples to reinforce the quantitative findings, showing how PhysVLM's reasoning contrasts with GPT-4o and SpatialBot.

The following is the visual comparison of PhysVLM (ours), GPT-4o, and SpatialBot from Figure 4 of the original paper:

Figure 4. Visual comparison of PhysVLM (ours), GPT-4o, and SpatialBot. For each scenario, the figure shows the visual inputs (RGB image, S-P Map, depth map, and point cloud) alongside each model's answer, highlighting the accuracy of PhysVLM's physical-reachability reasoning.

Figure 4 illustrates several scenarios:

  • Scenario 1 (Left Column): The task is to identify if the "white box" is reachable.
    • GPT-4o (using only the image) incorrectly states, "Yes, it is." It fails to account for the robot's arm limitations.
    • SpatialBot (using depth maps and images) also incorrectly states, "Yes, it is." Although it has depth information, it lacks an explicit understanding of the robot's reachable workspace.
    • PhysVLM (using image and S-P Map) correctly states, "No, it is not." The S-P Map clearly shows that the white box lies outside the reachable region.
  • Scenario 2 (Middle Column): The task is to identify if the "blue mug" is reachable.
    • GPT-4o incorrectly states, "Yes, it is."
    • SpatialBot incorrectly states, "Yes, it is."
    • PhysVLM correctly states, "No, it is not."
  • Scenario 3 (Right Column): The task is to identify if the "yellow apple" is reachable.
    • GPT-4o incorrectly states, "Yes, it is."

    • SpatialBot incorrectly states, "Yes, it is."

    • PhysVLM correctly states, "Yes, it is." In this case, the S-P Map confirms the apple is within reach.

      The qualitative results clearly show that traditional VLMs (GPT-4o) and even embodied VLMs that leverage depth maps (SpatialBot) struggle with physical reachability. They often make incorrect assertions about what a robot can interact with, highlighting a critical gap in their embodied visual reasoning. In contrast, PhysVLM, by explicitly processing the S-P Map, consistently provides accurate assessments of physical reachability. The examples also demonstrate that even GPT-4o can be improved by integrating the S-P Map, making its responses more accurate regarding physical reachability. This visual evidence strongly supports the quantitative findings that PhysVLM offers a superior understanding of robotic physical constraints.
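
The comparison also makes clear what each model is conditioned on. Below is a hypothetical sketch of how such a reachability query could be posed, pairing the RGB observation with its S-P Map and a templated question; the message schema, file names, and prompt wording are assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical query format for a reachability question: the model receives the
# RGB observation together with its S-P Map and a templated prompt.
def build_reachability_query(object_name: str) -> dict:
    return {
        "images": ["observation.png", "sp_map.png"],  # RGB view + S-P Map overlay
        "prompt": (
            "The second image marks regions the robot arm cannot reach in gray. "
            f"Is the {object_name} within the robot's physical reach? "
            "Answer yes or no, then explain briefly."
        ),
    }


print(build_reachability_query("white box"))
print(build_reachability_query("yellow apple"))
```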

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces PhysVLM, a novel Vision-Language Model (VLM) specifically designed to understand and reason about robotic physical reachability in embodied tasks. The core innovation lies in the Space-Physical Reachability Map (S-P Map), a unified and robot-agnostic representation that abstracts diverse robot kinematics into a generalized spatial form, allowing the model to focus on reachability features. PhysVLM integrates the S-P Map through a dedicated feature encoder into a dual-branch VLM architecture, ensuring that physical constraints are considered without compromising general vision-language capabilities. The authors also developed Phys100K, a large-scale multi-robot dataset, and EQA-phys, a challenging benchmark for evaluating physical reachability tasks. Experimental results consistently demonstrate PhysVLM's superior performance, achieving a 14% improvement over GPT-4o on EQA-phys and outperforming advanced embodied VLMs like RoboMamba and SpatialVLM on RoboVQA-val and OpenEQA. The S-P Map also proved compatible with other VLMs, significantly boosting their performance.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation:

  • Reduced Zero-Shot Performance on Real Robots: While PhysVLM shows good zero-shot generalization to real robots compared to baselines, its performance on real-world EQA-phys tasks is still lower than in simulated environments. This is attributed to the domain gap between simulated and real-world data, which is a common challenge in robotics.

    Based on this limitation, the authors suggest the following future work:

  • Expanding Datasets: Future efforts will focus on creating larger and more diverse datasets, especially for real-world scenarios, to bridge the domain gap and improve generalization.

  • Enhancing Real-World Performance: Further research will aim to specifically improve PhysVLM's performance in real-world environments, likely through techniques that mitigate the sim-to-real gap.

  • Improving Understanding of Physical Accessibility in VLA Models: The work will extend to integrating this physical reachability understanding into full vision-language-action (VLA) models, allowing for more robust and informed robotic decision-making beyond just question answering.

7.3. Personal Insights & Critique

This paper presents a highly valuable contribution to the field of embodied AI and robotics. The concept of a Space-Physical Reachability Map (S-P Map) is particularly insightful, as it effectively disentangles the complex kinematics of individual robots from the abstract notion of "reachability" that VLMs need to understand. This robot-agnostic representation is key to achieving generalization, which is often a bottleneck in applying AI to diverse robotic platforms.

Potential Applications: The reachability awareness endowed by PhysVLM has immediate and significant implications for practical robotic deployments.

  • Industrial Automation: Robots can make more informed decisions on assembly lines, avoiding attempts to pick up components out of reach, thus reducing errors and increasing efficiency.
  • Assistive Robotics: Personal robots can better understand their physical capabilities when helping humans, ensuring tasks like fetching objects are only attempted if feasible, leading to safer and more reliable assistance.
  • Teleoperation/Remote Control: Operators could receive real-time feedback on robot reachability, preventing commands that would lead to impossible or unsafe actions.
  • Cross-Platform Adaptability: The unified S-P Map representation means that a PhysVLM trained on a diverse set of robots could potentially be deployed on a new, unseen robot type with minimal adaptation, which is a significant advantage for real-world deployment.

Critique and Areas for Improvement:

  • Domain Gap: The acknowledged domain gap between simulation and reality is a persistent problem. While the S-P Map helps, it doesn't entirely eliminate it. Future work should explore more sophisticated domain adaptation techniques or sim-to-real transfer learning specifically tailored for S-P Maps (e.g., using generative models to create more realistic S-P Maps from synthetic data).

  • Real-time S-P Map Generation: The paper states voxel grid generation is precomputed offline. For dynamic environments or on-the-fly task planning where the robot's base might move, the S-P Map would need to be re-generated in real time. The computational cost of generating the S-P Map (especially point cloud processing and voxel grid lookups) for every frame or new robot pose could be a performance bottleneck, especially on resource-constrained robots. Further research into efficient, real-time S-P Map generation or approximation methods would be beneficial.

  • Reliance on Depth Maps: The S-P Map generation relies on RGB-D cameras and depth map accuracy (even if generated using DepthAnything-v2). Depth map quality can vary greatly with lighting conditions, reflective surfaces, or transparent objects. Errors in depth perception would directly translate to inaccuracies in the S-P Map, potentially leading to hallucinated reachability or missed opportunities. Robustness to depth sensing imperfections is an important consideration.

  • Complexity of Reachability: The current S-P Map seems to focus on end-effector reachability in free space. However, physical reachability in complex tasks also involves collision avoidance (the robot's body hitting obstacles), joint singularities, and self-collision. While the S-P Map could potentially be extended to encode these more complex constraints, the current representation might be simplified. Integrating these richer physical accessibility concepts into the S-P Map could further enhance VLM reasoning.

  • Action Space Integration: The current PhysVLM is a VLM that generates textual responses. While it improves task planning by providing more accurate knowledge, the ultimate goal for embodied AI is often direct action generation. Bridging PhysVLM's improved reasoning directly to low-level robot control or action primitives in an end-to-end VLA model is the natural next step and poses its own set of challenges, as highlighted in the future work.

    Overall, PhysVLM makes a significant stride towards enabling VLMs to operate more intelligently and reliably in the physical world, laying crucial groundwork for the next generation of robot-aware AI.
