
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Published: 12/19/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work introduces VSI-Bench to evaluate multimodal large language models' spatial reasoning from videos, revealing emerging spatial awareness and local world models, with cognitive map generation enhancing spatial distance understanding beyond standard linguistic reasoning techniques.

Abstract

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

1.2. Authors

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie.

  • New York University
  • Yale University
  • Stanford University

1.3. Journal/Conference

Published on arXiv (a preprint server) on December 18, 2024. While arXiv hosts preprints, the authors and institutions involved (NYU, Yale, Stanford) are highly reputable, suggesting this work is likely intended for a top-tier machine learning or computer vision conference (e.g., NeurIPS, CVPR, ICLR) or journal.

1.4. Publication Year

2024

1.5. Abstract

Humans excel at remembering spaces from sequential visual observations. This paper investigates whether Multimodal Large Language Models (MLLMs), trained on extensive video datasets, can also "think in space" from videos. The authors introduce VSI-Bench, a novel video-based visual-spatial intelligence benchmark comprising over 5,000 question-answer pairs. Evaluations reveal that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. By probing models to express their spatial reasoning both linguistically and visually, the study identifies spatial reasoning capabilities as the primary bottleneck for higher performance, despite the emergence of local world models and spatial awareness within these models. Intriguingly, standard linguistic reasoning techniques like chain-of-thought, self-consistency, and tree-of-thoughts fail to improve performance. However, explicitly generating cognitive maps during question-answering significantly enhances MLLMs' ability in spatial distance tasks.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper addresses is the gap in Multimodal Large Language Models' (MLLMs) ability to comprehend and recall spatial information from dynamic, sequential visual input, specifically videos, in a manner akin to human visual-spatial intelligence. Humans effortlessly build mental models of spaces after observing them, remembering objects, their positions, and sizes, and using this knowledge for tasks like navigation or estimation.

This problem is crucial in the current field due to the increasing relevance of MLLMs in embodied AI applications, such as robotics, autonomous driving, and augmented/virtual reality (AR/VR). For these agents to effectively operate and interact with the 3D world, they require robust visual-spatial intelligence. While MLLMs have made significant strides in linguistic and general visual understanding from static images or short video clips, their capacity for complex 3D spatial reasoning from continuous video streams remains under-explored and constitutes a critical bottleneck for real-world deployment. Previous benchmarks often focus on content-level understanding or temporal extensions of 2D image analysis, lacking a deep focus on 3D spatial cognition and memory.

The paper's entry point is to thoroughly investigate this challenge by introducing a comprehensive video-based benchmark (VSI-Bench) and systematically probing MLLMs' internal mechanisms for spatial reasoning, both linguistically and visually. The innovative idea is to not only measure performance but also to understand how MLLMs "think in space" and identify specific areas for improvement.

2.2. Main Contributions / Findings

The paper makes several significant contributions and presents key findings:

  • Novel Video-Based Visual-Spatial Intelligence Benchmark (VSI-Bench): The authors introduce VSI-Bench, a new benchmark comprising over 5,000 question-answer pairs derived from 288 real indoor-scene videos. This benchmark is designed to evaluate MLLMs' capabilities in various aspects of visual-spatial intelligence, including configurational, measurement estimation, and spatiotemporal tasks. It leverages existing 3D reconstruction datasets to enable accurate object-level annotations and spatial queries.

  • MLLMs Exhibit Emerging but Subhuman Visual-Spatial Intelligence: Evaluations on VSI-Bench show that state-of-the-art MLLMs possess competitive visual-spatial intelligence, significantly outperforming chance baselines. However, a substantial performance gap remains between MLLMs and human performance, especially in tasks requiring complex spatial configuration understanding.

  • Spatial Reasoning is the Primary Bottleneck: Through linguistic self-explanations from models, the study identifies that spatial reasoning capabilities (specifically relational reasoning and egocentric-allocentric transformation) are the main factor behind MLLMs' performance limitations on VSI-Bench, accounting for over 70% of errors. This highlights that while MLLMs have strong visual perception, linguistic intelligence, and temporal processing abilities, their core spatial reasoning is still developing.

  • Emergence of Local World Models and Spatial Awareness: Visual probing using cognitive maps reveals that MLLMs are capable of building strong local world models and exhibiting spatial awareness for adjacent objects. However, their accuracy significantly deteriorates as the distance between objects increases, indicating difficulty in forming unified global world models from sequential video observations.

  • Failure of Prevailing Linguistic Reasoning Techniques: Surprisingly, common linguistic prompting techniques such as Chain-of-Thought (CoT), Self-Consistency, and Tree-of-Thoughts (ToT), which are effective in other language and general visual tasks, consistently fail to improve, and sometimes even degrade, MLLMs' performance on VSI-Bench. This suggests that spatial reasoning is fundamentally different and cannot be solved by merely enhancing linguistic capabilities.

  • Enhancement through Explicit Cognitive Map Generation: In contrast to linguistic prompting, explicitly instructing MLLMs to generate and utilize cognitive maps during question-answering significantly enhances their ability to answer spatial distance questions, improving performance by 10%. This finding underscores the importance of building explicit mental spatial representations for MLLMs to tackle visual-spatial reasoning tasks effectively.

    These findings collectively solve the problem of understanding the current state and limitations of MLLMs in visual-spatial intelligence, providing clear directions for future research in developing more spatially aware and intelligent embodied AI systems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the contributions of this paper, a foundational understanding of several key concepts is essential:

  • Multimodal Large Language Models (MLLMs):

    • Conceptual Definition: MLLMs are advanced artificial intelligence models that combine the capabilities of Large Language Models (LLMs) with modalities beyond text, most commonly vision. They are designed to understand, process, and generate content based on multiple types of data inputs, such as text, images, and videos. This integration allows them to perform complex tasks that require understanding relationships between different data forms, like answering questions about an image or describing a video.
    • Functionality: Typically, an MLLM consists of a vision encoder (e.g., a Convolutional Neural Network or Vision Transformer) that extracts visual features from images or video frames, and an LLM (e.g., a Transformer-based architecture like GPT or LLaMA) that processes both the visual features and textual input (prompts). A "connector" or "aligner" module bridges the gap between the vision encoder's output and the LLM's input space, enabling the LLM to interpret visual information in conjunction with language.
    • Example: If shown a video, an MLLM might identify objects, track their movement, and then answer a natural language question about what happened or where an object is located.
  • Visual-Spatial Intelligence:

    • Conceptual Definition: Visual-spatial intelligence refers to the ability to perceive, understand, reason about, and mentally manipulate spatial relationships and visual information. It involves comprehending 3D space, recognizing patterns, mentally rotating objects, and navigating environments. In humans, it's a cognitive ability crucial for tasks ranging from everyday navigation to complex engineering design.
    • Key Capabilities (as per the paper's taxonomy in Figure 2):
      • Visual Perception: The ability to accurately recognize objects, their properties (e.g., size, color), and their presence within a visual scene. This is the foundational layer.
      • Linguistic Intelligence: The ability to understand and generate human language, which is necessary to interpret questions and formulate answers.
      • Temporal Processing: The ability to understand and reason about sequences of events and changes over time, particularly important when processing videos.
      • Spatial Reasoning: The core component for understanding spatial relationships. This is further broken down into:
        • Relational Reasoning: The ability to identify relationships between objects based on distance, direction, and relative size (e.g., "the cup is on the table," "the chair is closer to the door than the window"). This often involves visuospatial common sense (e.g., knowing typical object sizes to estimate others).
        • Egocentric-Allocentric Transformation: The ability to switch between a self-centered (egocentric) view (what the camera/person sees) and an environment-centered (allocentric) view (a bird's-eye map-like understanding of the entire space). This is vital for tasks like route planning and perspective-taking. It relies on visuospatial working memory to hold and manipulate spatial information.
  • Cognitive Maps:

    • Conceptual Definition: Originating from psychology and neuroscience (e.g., Tolman's work), a cognitive map is a mental representation of an environment, enabling an organism to acquire, store, and utilize spatial information. It's an internal, abstract "map" of the world that includes locations of objects, landmarks, and pathways.
    • Functionality: For humans, cognitive maps allow for navigation, pathfinding, and understanding spatial relationships even when parts of the environment are not directly visible. The paper explores if MLLMs can generate similar internal representations.
  • Linguistic Prompting Techniques: These are strategies used to guide or enhance the reasoning capabilities of LLMs and MLLMs by structuring the input prompt.

    • Chain-of-Thought (CoT):
      • Conceptual Definition: A prompting technique that encourages LLMs to explain their reasoning process step-by-step before providing a final answer. This mimics human thought processes and has been shown to improve complex reasoning tasks.
      • Example: Instead of asking "What is 2+2?", one might prompt "Let's think step by step. What is 2+2?". The model would then output "2+2 equals 4. Therefore, the answer is 4."
    • Self-Consistency:
      • Conceptual Definition: An advanced CoT technique where an LLM generates multiple distinct reasoning paths and answers for a given question (often by sampling with a higher temperature). The final answer is then determined by taking a majority vote among these diverse outputs. The idea is that consistent answers across multiple reasoning paths are more likely to be correct.
    • Tree-of-Thoughts (ToT):
      • Conceptual Definition: A more complex prompting method that extends CoT by exploring multiple reasoning paths in a tree-like structure. It involves generating several "thoughts" or intermediate steps at each stage of reasoning, evaluating them, and then choosing the most promising path to continue. This allows for more deliberate problem-solving, planning, and backtracking if a path leads to a dead end.
  • Evaluation Metrics:

    • Accuracy ($\mathcal{ACC}$):
      • Conceptual Definition: A fundamental metric in classification tasks, representing the proportion of correct predictions (i.e., instances where the model's output exactly matches the ground truth) out of the total number of predictions made. It's suitable for tasks with discrete, categorical answers.
      • Mathematical Formula: $ \mathcal{ACC} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
      • Symbol Explanation:
        • $\mathcal{ACC}$: Accuracy.
        • Number of Correct Predictions: The count of model predictions that exactly match the ground truth.
        • Total Number of Predictions: The total number of questions asked or instances evaluated.
    • Mean Relative Accuracy ($\mathcal{MRA}$):
      • Conceptual Definition: A metric designed for numerical prediction tasks, especially when exact matches are rare and the degree of proximity to the ground truth matters. It calculates accuracy based on whether the relative error falls within an acceptable range, averaged over multiple confidence thresholds. This provides a more nuanced evaluation than simple exact-match accuracy for continuous values.
      • Mathematical Formula: $ \mathcal{MRA} = \frac{1}{10} \sum_{\theta \in \mathcal{C}} \mathbb{1} \left( \frac{|\hat{y} - y|}{y} < 1 - \theta \right) $
      • Symbol Explanation:
        • $\mathcal{MRA}$: Mean Relative Accuracy.
        • $\frac{1}{10}$: A normalization factor, as the set $\mathcal{C}$ contains 10 confidence thresholds.
        • $\sum_{\theta \in \mathcal{C}}$: Summation over all confidence thresholds $\theta$ in the predefined set $\mathcal{C}$.
        • $\mathcal{C} = \{0.5, 0.55, \ldots, 0.95\}$: The set of 10 confidence thresholds, ranging from 0.5 to 0.95 in steps of 0.05.
        • $\mathbb{1}(\cdot)$: The indicator function. It returns 1 if the condition inside its parentheses is true, and 0 otherwise.
        • $\hat{y}$: The model's numerical prediction for a given question.
        • $y$: The ground-truth numerical answer for that same question.
        • $|\hat{y} - y|$: The absolute difference between the model's prediction and the ground truth.
        • $|\hat{y} - y| / y$: The relative error rate of the prediction.
        • $1 - \theta$: The maximum allowed relative error for a prediction to be considered correct at confidence threshold $\theta$. For example, if $\theta = 0.9$, then $1 - \theta = 0.1$, meaning the relative error must be less than 10% for the prediction to be counted as correct.
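
To make these metrics concrete, here is a minimal Python sketch of both computations as defined above; the function names and the worked example are illustrative, not taken from the paper's code.

    import numpy as np

    def accuracy(preds, gts):
        """Exact-match accuracy for Multiple-Choice Answer (MCA) tasks."""
        correct = sum(p == g for p, g in zip(preds, gts))
        return correct / len(gts)

    def mean_relative_accuracy(pred, gt, thresholds=np.arange(0.5, 1.0, 0.05)):
        """Mean Relative Accuracy (MRA) for a single Numerical Answer (NA) prediction.

        A prediction counts as correct at threshold theta if its relative error
        |pred - gt| / gt is below 1 - theta; MRA averages this indicator over
        the 10 thresholds C = {0.50, 0.55, ..., 0.95}.
        """
        rel_err = abs(pred - gt) / gt
        return float(np.mean([rel_err < (1.0 - theta) for theta in thresholds]))

    # Example: a room-size estimate of 27 m^2 against a ground truth of 30 m^2
    # has a 10% relative error, so it passes the thresholds up to 0.85.
    print(mean_relative_accuracy(27.0, 30.0))  # -> 0.8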

3.2. Previous Works

The paper grounds its work by referencing previous research in visual-spatial intelligence, MLLMs with visual-spatial awareness, and benchmarking MLLMs on video.

  • Foundation in Visual-Spatial Intelligence: The paper draws on cognitive psychology literature ([11, 26, 57, 62]) to define and categorize visual-spatial intelligence capabilities. This includes classical theories of spatial cognition and human abilities like mental rotation [74] and working memory [2]. The paper explicitly states its focus on real-world environments and differentiates its scope from pen-paper tasks.

  • MLLMs with Visual-Spatial Awareness: Prior work has begun to explore grounding MLLMs in the real world for spatial understanding.

    • Early LLMs [3, 9, 67, 68, 77, 81, 82] laid the groundwork for powerful language reasoning.
    • MLLMs [1, 4, 15, 34, 42, 49, 78] combine these LLMs with vision encoders [30, 65, 69] to achieve visual understanding [86, 91, 102, 103].
    • Recent efforts directly address spatial intelligence in MLLMs [10, 13, 16, 29, 41, 48, 94, 107]. For instance, SpatialVLM [13] and SpatialRGPT [16] aim to endow VLMs with spatial reasoning. However, many of these works primarily focus on understanding spatial information from 2D images [70, 76, 93] or solely language [58, 72, 90, 92]. The crucial differentiation of the current paper is its use of real-world videos, which offers a much richer and more continuous spatial context than static images, mirroring human perception more closely.
  • Benchmarking MLLMs on Video: As MLLMs evolve, there's a growing interest in evaluating their video understanding capabilities [23, 24, 43, 44, 47, 51, 54, 55, 63, 85, 96].

    • Video-MME [24]: A comprehensive benchmark for various video-related tasks, including recognition and perception. The current paper uses Video-MME for comparison when evaluating Chain-of-Thought techniques, highlighting that while CoT helps in general video understanding (Table 2), it fails for spatial reasoning.
    • EgoSchema [55] and OpenEQA [54]: These benchmarks evaluate MLLMs' understanding using egocentric videos, which is a closer parallel to VSI-Bench. However, the paper argues that most prior works, despite their significance, focus on content-level understanding [24, 43, 55, 63], which is essentially a temporal extension of 2D image understanding without explicit 3D spatial consideration.
    • Differentiation: VSI-Bench goes beyond content-level understanding by requiring core spatial capabilities like visual working memory and implicit scene reconstruction from videos, emphasizing 3D spatial reasoning rather than just temporal event recognition.

3.3. Technological Evolution

The field has evolved from foundational Large Language Models (LLMs) excelling in text-based reasoning to Multimodal Large Language Models (MLLMs) that integrate vision. Initially, MLLMs focused on static image understanding, then expanded to video content understanding (e.g., event recognition, action classification). The current paper represents a significant step towards enabling MLLMs to perform complex 3D spatial reasoning from continuous video streams, moving beyond simple object detection or temporal event identification. This evolution aims to bridge the gap between AI perception and cognitive understanding of the physical world, paving the way for more intelligent embodied agents.

3.4. Differentiation Analysis

Compared to main methods in related work, the core differences and innovations of this paper's approach are:

  1. Focus on 3D Visual-Spatial Intelligence from Videos: Unlike many MLLM benchmarks that emphasize content understanding or temporal events in videos, VSI-Bench specifically targets 3D spatial reasoning, requiring models to perceive, remember, and recall the layout, measurements, and spatiotemporal relationships of objects within a 3D environment from egocentric video. This is distinct from tasks solvable with 2D image understanding or language-only spatial descriptions.
  2. Novel Benchmark (VSI-Bench): The creation of VSI-Bench itself is a key innovation. It's a high-quality, large-scale, video-based benchmark drawing from 3D reconstruction datasets (ScanNet, ARKitScenes), ensuring accurate ground truth for spatial properties. Its tasks are carefully designed to probe specific aspects of visual-spatial intelligence, including challenging measurement estimation and egocentric-allocentric transformation tasks.
  3. Dual-Coding Inspired Probing: The paper's methodology of probing MLLMs' internal thinking in space through both linguistic self-explanations and visual cognitive maps is innovative. This dual approach provides a richer understanding of how MLLMs are reasoning, rather than just what their final answer is. This is inspired by human cognitive theories.
  4. Counter-Intuitive Finding on Linguistic Prompting: A significant differentiation is the finding that standard linguistic reasoning techniques (Chain-of-Thought, Self-Consistency, Tree-of-Thoughts) fail to improve performance on spatial reasoning tasks, and can even degrade it. This challenges the common assumption that enhancing linguistic reasoning universally improves MLLM performance and suggests that spatial reasoning requires distinct, perhaps visual-centric, mechanisms.
  5. Effectiveness of Explicit Cognitive Map Generation: The demonstration that explicitly generating cognitive maps enhances MLLMs' spatial distance ability is a novel and actionable insight. It proposes a concrete strategy—building internal spatial representations—as a promising pathway to improve spatial reasoning, contrasting with the ineffectiveness of purely linguistic approaches.

4. Methodology

The paper's methodology centers on three main pillars: constructing a dedicated benchmark (VSI-Bench), evaluating MLLMs on it, and then deeply probing their "thinking" process using both linguistic and visual techniques.

4.1. Principles

The core idea is to move beyond simple visual recognition or temporal event understanding and rigorously evaluate if MLLMs can build a mental model of a 3D space from sequential visual observations (videos), similar to how humans do. This involves testing their ability to remember object locations, estimate distances, understand directions, and plan routes within an environment they've "seen" only through a camera. The theoretical basis is rooted in cognitive psychology's understanding of visual-spatial intelligence, particularly concepts like relational reasoning and egocentric-allocentric transformation. By probing models linguistically (through self-explanations) and visually (through cognitive maps), the authors aim to identify the strengths and weaknesses of current MLLMs in spatial cognition and to uncover potential internal world models.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. VSI-Bench: A Video-Based Visual-Spatial Intelligence Benchmark

VSI-Bench is introduced as a novel benchmark to quantitatively evaluate the visual-spatial intelligence of MLLMs from egocentric video.

  • Scale and Scope: It comprises over 5,000 question-answer pairs derived from 288 real indoor-scene videos. These videos are sourced from validation sets of public indoor 3D scene reconstruction datasets: ScanNet [19], ScanNet++ [97], and ARKitScenes [5]. This choice provides diverse environments (residential, professional, industrial) and accurate object-level annotations, which are crucial for generating precise ground truth for spatial questions. The use of video data is central, as it captures continuous, temporal input, enabling richer spatial understanding than static images.
  • Quality Control: The benchmark underwent iterative human review to minimize ambiguity in questions and correct any propagated errors from source datasets.
  • Task Taxonomy (Figure 3): VSI-Bench includes eight tasks categorized into three types:
    • Configurational Tasks: Test understanding of spatial layouts.
      • Object Count: "How many cabinet(s) are in this room?" (MCA)
      • Relative Distance: "Which of these objects is the closest to the {category}?" (MCA)
      • Relative Direction: "If I am standing by the {positioning object} and facing the {orienting object}, is the {querying object} to my front-left, front-right, back-left, or back-right?" (MCA)
      • Route Plan: "You are a robot beginning at the door and facing the floor. You want to navigate to the window. You will perform the following actions..." (MCA, fill-in-the-blank for turns)
    • Measurement Estimation Tasks: Require precise quantitative estimations.
      • Object Size: "What is the height of the stool, in cm?" (NA)
      • Room Size: "What is the size of this room (in square meters)?" (NA)
      • Absolute Distance: "What is the distance between the keyboard and the TV, in meters?" (NA)
    • Spatiotemporal Tasks: Evaluate memory of changes over time within a space.
      • Appearance Order: "What will be the first-time appearance order of the following categories in the video?" (MCA)

        The following figure (Figure 3 from the original paper) provides a visual overview of VSI-Bench tasks and a 3D camera trajectory.

        Figure 3. Overview of VSI-Bench: example question-answer pairs across the spatial-perception tasks (e.g., object counting, relative distance, appearance order), illustrated with video frames, 3D layouts, and the camera trajectory.

4.2.2. Benchmark Construction Pipeline (Figure 4)

The construction of VSI-Bench involves a sophisticated pipeline to generate high-quality question-answer (QA) pairs at scale:

  1. Data Collection and Unification:

    • Source Datasets: ScanNet, ScanNet++, and ARKitScenes are aggregated.
    • Video Processing: ScanNet frames are converted to 24 FPS videos. ScanNet++ and ARKitScenes are subsampled to 30 FPS. All videos are standardized to 640 × 480 pixels. ARKitScenes videos are normalized for consistent upward orientation.
    • Unified Meta-information: Data from diverse datasets is standardized into a common format for each scene, including:
      • dataset: Original source dataset.
      • video path: Path to the processed video.
      • room size: Calculated using the Alpha shape algorithm on the scene's point cloud. The Alpha shape algorithm is a computational geometry method used to define the shape of a finite set of points, effectively creating a "tight" boundary around the points to represent the room's perimeter. (A sketch of this computation follows the pipeline description below.)
      • room center: Geometric center of the minimal bounding box of the scene's point cloud.
      • object counts: Number of instances for each object category.
      • object bounding boxes: Unified to the OrientedBoundingBox format in Open3D [106].
    • Category Curation: Rare or extremely small object categories are excluded. Category remapping is applied to ensure vocabulary consistency.
  2. Question-Answer Generation:

    • Template-Based Auto-annotation: QA pairs for seven of the eight tasks are primarily auto-annotated using the unified meta-information and predefined question templates.
    • Human Annotation for Route Plan: This task, due to its complexity, is human-annotated.
    • Question Templates (Table 4): Specific templates are designed for each task.
      • For example, Object Counting: "How many {category}(s) are in this room?"
      • Relative Distance: "Measuring from the closest point of each object, which of these objects ({choice a}, {choice b},{choice c}, {choice d}) is the closest to the {category}?"
    • Numerical Answer (NA) Options: For NA tasks, multiple-choice options are generated by sampling within a factor of the ground truth and re-sampling if options are too close.
    • Ambiguity Handling: Rules are implemented to identify and filter ambiguous questions (e.g., if multiple objects are too close for relative distance, or if timestamps are too close for appearance order).
  3. Human-in-the-loop Quality Review:

    • Iterative Verification: A bidirectional quality assurance protocol is used. Humans manually filter scenes during data collection, verify meta-information, and review generated QA pairs via a web interface.

    • Error Correction: Evaluators flag ambiguous or erroneous questions. Errors are traced back to their source (data sample, meta-information, question template, or QA generation rule), and corrective actions are taken (removal, modification). This iterative process ensures high quality.

      The following figure (Figure 4 from the original paper) visualizes the benchmark construction pipeline.

      Figure 4. The VSI-Bench construction pipeline: data collection and unification of meta-information, question-answer pair generation, and human-in-the-loop quality review, yielding the final filtered video QA pairs.
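
Returning to the room-size computation in step 1, the following is a minimal sketch under stated assumptions: it projects the scene point cloud onto the floor plane and takes the area of its alpha shape using the third-party alphashape package (which returns shapely geometry). The paper does not specify its exact implementation or alpha value, so both are illustrative.

    import numpy as np
    import alphashape  # assumed third-party dependency (pip install alphashape)

    def estimate_room_size(points_xyz, alpha=1.0):
        """Approximate room area (m^2) from an (N, 3) scene point cloud."""
        # Project to the floor (x-y) plane and compute a tight alpha-shape boundary.
        floor_points = [(float(x), float(y)) for x, y, _ in np.asarray(points_xyz)]
        hull = alphashape.alphashape(floor_points, alpha)
        return hull.area

    # Toy example: points sampled on the floor of a 4 m x 5 m rectangular room.
    xs, ys = np.meshgrid(np.linspace(0, 4, 20), np.linspace(0, 5, 25))
    cloud = np.column_stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)])
    print(round(estimate_room_size(cloud), 1))  # roughly the 20 m^2 floor area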

The following are the results from [Table 4] of the original paper:

  • Object Counting: How many {category}(s) are in this room?
  • Relative Distance: Measuring from the closest point of each object, which of these objects ({choice a}, {choice b}, {choice c}, {choice d}) is the closest to the {category}?
  • Relative Direction: To create a comprehensive test of relative direction, three difficulty levels were created:
    • Easy: If I am standing by the {positioning object} and facing the {orienting object}, is the {querying object} to the left or the right of the {orienting object}?
    • Medium: If I am standing by the {positioning object} and facing the {orienting object}, is the {querying object} to my left, right, or back? An object is to my back if I would have to turn at least 135 degrees in order to face it.
    • Hard: If I am standing by the {positioning object} and facing the {orienting object}, is the {querying object} to my front-left, front-right, back-left, or back-right? Directions refer to the quadrants of a Cartesian plane (assuming I am at the origin and facing the positive y-axis).
  • Appearance Order: What will be the first-time appearance order of the following categories in the video: {choice a}, {choice b}, {choice c}, {choice d}?
  • Object Size: What is the length of the longest dimension (length, width, or height) of the {category}, measured in centimeters?
  • Absolute Distance: Measuring from the closest point of each object, what is the direct distance between the {object 1} and the {object 2} (in meters)?
  • Room Size: What is the size of this room (in square meters)? If multiple rooms are shown, estimate the size of the combined space.
  • Route Plan: You are a robot beginning at {the bed facing the tv}. You want to navigate to {the toilet}. You will perform the following actions (Note: for each [please fill in], choose either 'turn back,' 'turn left,' or 'turn right.'): {1. Go forward until the TV 2. [please fill in] 3. Go forward until the shower 4. [please fill in] 5. Go forward until the toilet.} You have reached the final destination.
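
As an illustration of the template-based auto-annotation step, here is a minimal sketch of how an object-counting QA pair could be produced from a scene's unified meta-information; the dictionary schema and field names are simplified assumptions, not the authors' actual data format.

    OBJECT_COUNT_TEMPLATE = "How many {category}(s) are in this room?"

    def generate_object_count_qas(scene_meta, min_count=1):
        """Turn per-scene object counts into numerical-answer QA pairs."""
        qas = []
        for category, count in scene_meta["object_counts"].items():
            if count < min_count:  # skip categories removed during curation
                continue
            qas.append({
                "dataset": scene_meta["dataset"],
                "video_path": scene_meta["video_path"],
                "question": OBJECT_COUNT_TEMPLATE.format(category=category),
                "answer": count,
                "question_type": "object_counting",
            })
        return qas

    # Example unified meta-information for one scene (illustrative values).
    scene_meta = {
        "dataset": "ScanNet",
        "video_path": "videos/scene0011_00.mp4",
        "object_counts": {"cabinet": 3, "chair": 6, "lamp": 1},
    }
    for qa in generate_object_count_qas(scene_meta):
        print(qa["question"], "->", qa["answer"])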

4.2.3. Probing MLLMs' Linguistic Thinking: Self-Explanations

To understand why MLLMs succeed or fail, the paper prompts the best-performing model, Gemini-1.5 Pro, to articulate its internal reasoning in natural language.

  • Method: After an MLLM predicts an answer, it is prompted with "Please explain your answer step by step." This is distinct from Chain-of-Thought (CoT) prompting, where the model generates reasoning before the answer. Here, the explanation is an ex-post-facto articulation of its internal state.
  • Case Studies (Figure 6): Analysis of these self-explanations reveals MLLMs' capabilities, such as accurate timestamped descriptions in videos and formation of step-by-step reasoning processes. The models sometimes construct a global coordinate system, hinting at an implicit world model.
  • Error Analysis (Figure 7): A subset of incorrect answers (163 samples from VSI-Bench (tiny)) is manually reviewed and categorized into four types:
    1. Visual perception error: Due to unrecognized objects or misclassified categories.
    2. Linguistic intelligence error: Due to defects in logical reasoning, mathematical calculation, or language understanding.
    3. Relational reasoning error: Errors in understanding spatial relationships (distance, direction, size).
    4. Egocentric-allocentric transformation error: Resulting from incorrect allocentric spatial layout or improper perspective-taking.
    • Finding: Approximately 71% of errors are attributed to spatial reasoning (combining relational reasoning and egocentric-allocentric transformation errors), confirming it as the primary bottleneck.

      The following figure (Figure 6 from the original paper) shows an example of MLLM's self-explanation in both a success and an error case.

      Figure 6. Example of the MLLM reasoning about spatial relations in a video: using the relative positions of a refrigerator, dishwasher, and dining table, together with timestamps and spatial coordinates, the model shows human-like step-by-step reasoning and hints at an internal spatial world model.

The following figure (Figure 7 from the original paper) summarizes the human-conducted error analysis.

Figure 7. Human-conducted analysis of errors by type. Over 70% of errors stem from faulty spatial reasoning capabilities.

4.2.4. Limits of CoT Methods in Visuospatial Tasks

The paper investigates whether prevailing linguistic prompting techniques can improve MLLMs' visual-spatial capabilities on VSI-Bench.

  • Techniques Evaluated:

    1. Zero-Shot Chain-of-Thought (CoT): The phrase "Let's think step by step" is appended to the prompt. Greedy decoding (temperature 0, top-p 1, top-k 1) is used. After the model generates its reasoning and answer, an additional dialogue turn explicitly extracts the answer to mitigate fuzzy matching errors.
    2. Self-Consistency w/ CoT: MLLMs generate multiple answers using Zero-shot CoT but with diverse reasoning encouraged by setting temperature to 0.7, top-p to 1, and top-k to 40. Five independent runs are performed, and the majority consensus determines the final prediction. (A sketch of this voting protocol appears after Table 2 below.)
    3. Tree-of-Thoughts (ToT): The problem-solving is divided into two steps.
      • Plan Generation: The MLLM generates 3 distinct plans. A voting mechanism (3 times, majority-selected plan) chooses the best plan.
      • Answer Prediction: Using the selected plan, the MLLM generates 3 candidate answers. Another voting mechanism (3 times, majority vote) determines the final prediction.
  • Finding (Figure 8): Surprisingly, all three linguistic reasoning techniques lead to performance degradation on VSI-Bench. Zero-Shot CoT and ToT reduce average performance by ~4%, while Self-Consistency drops by 1.1%. This contrasts with general video understanding benchmarks such as Video-MME (Table 2), where Zero-Shot CoT improves Gemini-1.5 Pro from 77.2 to 79.8.

    The following figure (Figure 8 from the original paper) illustrates the relative improvements of these methods.

    Figure 8. Relative improvements of CoT, Self-Consistency, and Tree-of-Thought compared to the baseline. All three prevailing prompting techniques fail on average on this benchmark, and in some cases degrade performance on individual tasks, indicating that strengthening linguistic reasoning alone cannot solve the visual-spatial intelligence problems it poses.

The following are the results from [Table 2] of the original paper:

Case | Performance
Gemini-1.5 Pro (w/o CoT) | 77.2
Gemini-1.5 Pro (w/ CoT) | 79.8
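
For concreteness, below is a minimal sketch of the Self-Consistency protocol described above (five sampled Zero-Shot CoT runs followed by a majority vote). The query_mllm function is a hypothetical stand-in for whatever model API is used, and the sampling parameters mirror those reported in the paper (temperature 0.7, top-p 1, top-k 40).

    from collections import Counter

    def query_mllm(video, prompt, temperature, top_p, top_k):
        """Hypothetical wrapper around an MLLM API call; returns the model's text output."""
        raise NotImplementedError("Plug in a real MLLM client here.")

    def self_consistency_answer(video, question, n_runs=5):
        """Zero-Shot CoT with self-consistency: sample several reasoning paths,
        extract a final answer from each, and return the majority-voted answer."""
        cot_prompt = f"{question}\nLet's think step by step."
        answers = []
        for _ in range(n_runs):
            reasoning = query_mllm(video, cot_prompt,
                                   temperature=0.7, top_p=1.0, top_k=40)
            # A second dialogue turn extracts a clean final answer from the
            # free-form reasoning, mirroring the answer-extraction step above.
            final = query_mllm(video, f"{reasoning}\nTherefore, the final answer is:",
                               temperature=0.0, top_p=1.0, top_k=1)
            answers.append(final.strip())
        return Counter(answers).most_common(1)[0][0]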

4.2.5. Probing MLLMs' Visual Thinking: Cognitive Maps

Inspired by human mental representations of space, the paper investigates if MLLMs form internal cognitive maps.

  • Method: The best-performing MLLM (Gemini-1.5 Pro) is prompted to predict object center positions within a 10 × 10 grid based on the video input.

  • Cognitive Map Prompt:

    # Cognitive Map Prompt
    
    [Task]
    
    This video captures an indoor scene. Your objective is to identify specific objects within the video, understand the spatial arrangement of the scene, and estimate the center point of each object, assuming the entire scene is represented by a 10x10 grid.
    
    [Rule]
    
    1. We provide the categories to care about in this scene: {categories_of_interest}. Focus ONLY on these categories.
    
    2. Estimate the center location of each instance within the provided categories, assuming the entire scene is represented by a 10x10 grid.
    
    3. If a category contains multiple instances, include all of them.
    
    4. Each object's estimated location should accurately reflect its real position in the scene, preserving the relative spatial relationships among all objects.
    
       [Output]
    
    Present the estimated center locations for each object as a list within a dictionary. STRICTLY follow this JSON format: {"category name": [(x_1, y_1), ...]}
    

    For general probing, categories_of_interest includes all potential categories. For benchmark tasks (e.g., relative distance), it's restricted to categories mentioned in the question.

  • Evaluation: The Euclidean distance between all pairs of objects is calculated for both MLLM-predicted and ground truth maps. A predicted distance is considered correct if it deviates by no more than one grid unit from the ground truth distance. (A sketch of this computation appears at the end of this subsection.)

  • Finding (Figure 9, Figure 10): MLLMs achieve 64% accuracy in positioning adjacent objects, indicating strong local spatial awareness. However, this accuracy dramatically declines as object distance increases, suggesting MLLMs form a series of local world models rather than a unified global world model.

    The following figure (Figure 9 from the original paper) shows visualization of cognitive maps from MLLM and GT.

    Figure 9. Visualization of cognitive maps from the MLLM and the ground truth (GT); colored squares mark object locations on the grid.

The following figure (Figure 10 from the original paper) illustrates the locality of the MLLM's predicted cognitive maps.

Figure 10. Locality of the MLLM's predicted cognitive maps: map-distance accuracy decreases dramatically as the distance between objects increases.
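
Below is a minimal sketch of the map-distance evaluation just described: pairwise Euclidean distances on the 10 × 10 grid, with a predicted pair counted as correct when its distance deviates from the ground-truth distance by at most one grid unit. The dictionary format loosely follows the JSON output requested by the cognitive map prompt (one instance per category for simplicity), and the helper names are illustrative.

    import math
    from itertools import combinations

    def pairwise_distances(cog_map):
        """cog_map maps an object name to its (x, y) coordinate on the 10x10 grid."""
        return {
            (a, b): math.dist(cog_map[a], cog_map[b])
            for a, b in combinations(sorted(cog_map), 2)
        }

    def map_distance_accuracy(pred_map, gt_map, tol=1.0):
        """Fraction of object pairs whose predicted distance is within `tol`
        grid units of the ground-truth distance."""
        pred_d, gt_d = pairwise_distances(pred_map), pairwise_distances(gt_map)
        shared = [pair for pair in gt_d if pair in pred_d]
        correct = sum(abs(pred_d[p] - gt_d[p]) <= tol for p in shared)
        return correct / len(shared) if shared else 0.0

    # Toy example with three objects placed on the 10x10 grid.
    pred = {"sofa": (2, 3), "tv": (2, 7), "lamp": (8, 8)}
    gt = {"sofa": (2, 2), "tv": (2, 7), "lamp": (9, 9)}
    print(map_distance_accuracy(pred, gt))  # -> 0.33...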

4.2.6. Better Distance Reasoning via Cognitive Maps

The paper tests if explicitly generating and using cognitive maps can improve MLLMs' spatial reasoning.

  • Method: Gemini-1.5 Pro is prompted to first generate a cognitive map based on the video and question, and then use this predicted map to answer the question, specifically for the relative distance task. (A sketch of this two-step flow follows Table 3 below.)

  • Finding (Table 3): Using MLLM-generated cognitive maps improves relative distance accuracy by 10%. When provided with ground truth cognitive maps, the gain is even higher (20-32%), underscoring the potential of accurate mental maps for spatial reasoning and highlighting that cognitive map generation is a crucial component but not the only one.

    The following are the results from [Table 3] of the original paper:

    (a) Effect of using a cognitive map:

    Case | Rel. Dist. Acc.
    w/o Cog. map | 46.0
    w/ Cog. map | 56.0
    w/ Cog. map (GT) | 66.0

    (b) Effect of cognitive map source and grid size:

    Cog. Map Src. | Size | Rel. Dist. Acc.
    MLLM | 10 × 10 | 56.0
    MLLM | 20 × 20 | 54.0
    GT | 10 × 10 | 66.0
    GT | 20 × 20 | 78.0
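
    To illustrate the two-step prompting flow used here (first elicit a cognitive map, then answer the relative-distance question conditioned on it), the following is a hedged sketch; query_mllm is a hypothetical API wrapper, and the prompt wording is an assumption rather than the paper's verbatim prompt.

        import json

        def answer_with_cognitive_map(query_mllm, video, question, categories):
            """Two-step flow: (1) ask the MLLM for a 10x10-grid cognitive map of the
            relevant categories, (2) feed the predicted map back in when answering."""
            map_prompt = (
                "Estimate the center of each of these objects on a 10x10 grid of the scene: "
                + ", ".join(categories)
                + '. Reply ONLY with JSON like {"category name": [[x, y]]}.'
            )
            cog_map = json.loads(query_mllm(video, map_prompt))  # predicted cognitive map
            answer_prompt = (
                "Using this cognitive map of the scene: " + json.dumps(cog_map) + "\n"
                "Answer the question: " + question
            )
            return query_mllm(video, answer_prompt)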

5. Experimental Setup

5.1. Datasets

The primary dataset used for evaluation and analysis is the newly introduced VSI-Bench.

  • VSI-Bench:
    • Scale: Over 5,000 question-answer pairs derived from 288 real indoor-scene videos.

    • Source: Videos are sourced from the validation sets of public indoor 3D scene reconstruction datasets: ScanNet [19], ScanNet++ [97], and ARKitScenes [5]. These datasets provide high-fidelity video scans and accurate object-level 3D annotations, which are crucial for generating spatial ground truths.

    • Characteristics:

      • Diverse Environments: Includes residential spaces, professional settings (offices, labs), and industrial spaces (factories), covering multiple geographic regions.
      • Video-based: Captures continuous, temporal input, mirroring human observation and enabling richer spatial understanding than static images.
      • Object-level Annotations: Leverages 3D annotations for precise spatial queries.
      • High Quality: Iteratively reviewed to minimize ambiguity and correct errors.
    • Domain: Indoor scenes, focusing on visual-spatial intelligence within typical human-centric environments.

    • Data Sample Example: As shown in Figure 1 of the paper (and Figure 13, Figure 14 in the appendix), a data sample for VSI-Bench consists of a video segment and a question related to the spatial configuration, measurement, or spatiotemporal aspects of the objects within that video. For example:

      • Video Input: A sequence of frames showing a room (e.g., a living room with a sofa, TV, fireplace).

      • Question: "What is the distance between the keyboard and the TV, in meters?" (Absolute Distance)

      • Question: "How many cabinet(s) are in this room?" (Object Counting)

      • Question: "If I am standing by the sofa and facing the suitcase, is the microwave to my front-left, front-right, back-left, or back-right?" (Relative Direction)

        The following figures (Figure 13, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19 from the original paper) provide visual examples of VSI-Bench questions and video frames.

        Figure 13. VSI-Bench examples (Part 1): multi-view frames of indoor test scenes (stairways, living rooms, kitchens).

        Figure 14. VSI-Bench examples (Part 2): further indoor scenes, such as a bathroom with sink, washing machine, and toilet, used for spatial questions.

        Figure 15. Further VSI-Bench example cases, including object-size estimation of a table (with bounding boxes and model reasoning), a relative-direction case where the MLLM answers "left" while the ground truth is "right", object localization (fireplace, stool), and a robot route-planning scenario in a bathroom (sink, mirror, bathtub).

        Figure 16. Zero-Shot CoT examples: annotated frames (sofa, fireplace, TV) with the model's step-by-step reasoning.

        Figure 17. Self-Consistency w/ CoT examples: multiple sampled answers for object-size and room-size estimation, aggregated by majority vote.

        Figure 18. Tree-of-Thought examples: the question-to-plans-to-answers branching structure used during spatial reasoning.

        Figure 19. Additional predicted cognitive map examples compared against ground truth maps, with colored squares denoting object categories.

    • Purpose: VSI-Bench was chosen specifically to evaluate detailed 3D spatial reasoning from videos, a capability often overlooked by existing benchmarks. Its reliance on 3D reconstruction datasets ensures the availability of accurate ground truth for complex spatial queries, making it highly effective for validating MLLMs' performance in this domain.

5.2. Evaluation Metrics

The paper uses two primary metrics: Accuracy ($\mathcal{ACC}$) for Multiple-Choice Answer tasks and Mean Relative Accuracy ($\mathcal{MRA}$) for Numerical Answer tasks.

  • 1. Accuracy ($\mathcal{ACC}$)

    • Conceptual Definition: This metric quantifies the proportion of predictions that exactly match the ground truth answer. It is widely used for classification tasks or questions with discrete, predefined answer options (e.g., multiple-choice questions). For VSI-Bench, it's applied to Multiple-Choice Answer (MCA) tasks. The paper mentions possible fuzzy matching for $\mathcal{ACC}$ but primarily implies exact matching based on standard practice [24, 31, 99].
    • Mathematical Formula: $ \mathcal{ACC} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • $\mathcal{ACC}$: The accuracy score.
      • Number of Correct Predictions: The count of instances where the model's predicted answer exactly matches the true answer.
      • Total Number of Predictions: The total number of questions or instances evaluated in the dataset.
  • 2. Mean Relative Accuracy ($\mathcal{MRA}$)

    • Conceptual Definition: Introduced for Numerical Answer (NA) tasks, where predicting an exact continuous value is very difficult. Instead of strict equality, $\mathcal{MRA}$ assesses the degree of proximity between the model's prediction and the ground truth. It does this by averaging the "relative accuracy" over a range of confidence thresholds. A prediction is considered relatively correct if its relative error rate falls below a certain threshold. This provides a more robust and discriminative measure for continuous numerical predictions.
    • Mathematical Formula: $ \mathcal{MRA} = \frac{1}{10} \sum_{\theta \in \mathcal{C}} \mathbb{1} \left( \frac{|\hat{y} - y|}{y} < 1 - \theta \right) $
    • Symbol Explanation:
      • $\mathcal{MRA}$: The Mean Relative Accuracy score.
      • $\frac{1}{10}$: A normalization factor, dividing by the total number of confidence thresholds (10 in this case).
      • $\sum_{\theta \in \mathcal{C}}$: A summation over each confidence threshold $\theta$ in the set $\mathcal{C}$.
      • $\mathcal{C} = \{0.5, 0.55, \ldots, 0.95\}$: The set of 10 confidence thresholds (e.g., for $\theta = 0.5$, a prediction is considered correct if its relative error is less than $1 - 0.5 = 0.5$, or 50%).
      • $\mathbb{1}(\cdot)$: The indicator function. It outputs 1 if the condition inside its parentheses is true, and 0 otherwise.
      • $\hat{y}$: The numerical value predicted by the model.
      • $y$: The true, ground-truth numerical value.
      • $|\hat{y} - y|$: The absolute difference between the predicted value and the ground-truth value.
      • $|\hat{y} - y| / y$: The relative error rate of the prediction, i.e., the error as a fraction of the true value.
      • $1 - \theta$: The maximum allowable relative error for a prediction to be counted as correct at a given threshold $\theta$. For example, if $\theta = 0.9$, then $1 - \theta = 0.1$, meaning the relative error must be less than 10%.

5.3. Baselines

The paper compares the performance of MLLMs against several baselines to contextualize their visual-spatial intelligence.

  • 1. Chance Level (Random):
    • For MCA tasks, this baseline represents the accuracy achieved by randomly selecting an answer option. It's calculated based on the number of options for each question.
    • It is inapplicable for NA tasks as there are no discrete options to randomly select from.
  • 2. Chance Level (Frequency):
    • This baseline represents the performance if an MLLM were to always select the most frequent answer for each specific task in the dataset.
    • It helps to identify if models are gaining performance simply by exploiting long-tailed answer distributions or imbalanced multiple-choice distributions, rather than actual reasoning.
  • 3. Human Level Performance:
    • A subset of 400 questions (VSI-Bench (tiny), 50 questions per task) was independently answered by human evaluators.
    • Humans were allowed unlimited time and multiple video reviews. Their performance was evaluated using $\mathcal{ACC}$ and $\mathcal{MRA}$. This serves as the upper bound for desired performance.
  • 4. Benchmark Models (15 Video-Supporting MLLMs):
    • Proprietary Models (API access):
      • Gemini 1.5 Pro [78]
      • Gemini 1.5 Flash [78]
      • Gemini 2.0 Flash (reported in the benchmark tables; a newer Gemini variant)
      • GPT-4o [34]
    • Open-source Models:
      • InternVL2-2B, InternVL2-8B, InternVL2-40B [14]
      • VILA-1.5-8B, VILA-1.5-40B [45]
      • LongViLA-8B [91]
      • LongVA-7B [101]
      • LLaVA-Video-7B, LLaVA-Video-72B [104]
      • LLaVA-OneVision-0.5B, LLaVA-OneVision-7B, LLaVA-OneVision-72B [40]
    • Evaluation Settings: All MLLMs were evaluated under zero-shot settings using their default prompts and greedy decoding (temperature 0, top-p 1, top-k 1 for reproducibility) unless specified otherwise (e.g., for self-consistency).
  • 5. Socratic LLMs with Frame Captions:
    • This is a composite baseline using GPT-4o as the reasoning LLM, but instead of directly processing video, it receives frame captions generated by LLaVA-Video-72B as the captioner. This setup is inspired by previous works like OpenEQA [54] and HourVideo [12], which explore using an MLLM to describe visual content for a language model to reason upon.
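
A minimal sketch of this Socratic setup, under stated assumptions: caption_frames and answer_with_llm are hypothetical wrappers around the captioning MLLM (LLaVA-Video-72B in the paper) and the reasoning LLM (GPT-4o), since the exact interfaces are not given in this analysis.

    def socratic_baseline(frames, question, caption_frames, answer_with_llm):
        """Socratic pipeline: an MLLM captions sampled frames, then a text-only
        LLM reasons over those captions to answer the spatial question."""
        captions = caption_frames(frames)  # one caption per sampled frame
        context = "\n".join(
            f"Frame {i}: {caption}" for i, caption in enumerate(captions)
        )
        prompt = (
            "You are given frame-by-frame captions of a video of an indoor scene.\n"
            f"{context}\n\nQuestion: {question}\nAnswer concisely."
        )
        return answer_with_llm(prompt)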

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Overall Model Performance on VSI-Bench

The following are the results from [Table 1] of the original paper:

Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order
(Numerical Answer tasks: Obj. Count, Abs. Dist., Obj. Size, Room Size; Multiple-Choice Answer tasks: Rel. Dist., Rel. Dir., Route Plan, Appr. Order)

Baseline
Chance Level (Random) | - | - | - | - | - | 25.0 | 36.1 | 28.3 | 25.0
Chance Level (Frequency) | 34.0 | 62.1 | 32.0 | 29.9 | 33.1 | 25.1 | 47.9 | 28.4 | 25.2

VSI-Bench (tiny) Perf.
Human Level | 79.2 | 94.3 | 47.0 | 60.4 | 45.9 | 94.7 | 95.8 | 95.8 | 100.0
Gemini-1.5 Flash | 45.7 | 50.8 | 33.6 | 56.5 | 45.2 | 48.0 | 39.8 | 32.7 | 59.2
Gemini-1.5 Pro | 48.8 | 49.6 | 28.8 | 58.6 | 49.4 | 46.0 | 48.1 | 42.0 | 68.0
Gemini-2.0 Flash | 45.4 | 52.4 | 30.6 | 66.7 | 31.8 | 56.0 | 46.3 | 24.5 | 55.1

Proprietary Models (API)
GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5
Gemini-1.5 Flash | 42.1 | 49.8 | 30.8 | 53.5 | 54.4 | 37.7 | 41.0 | 31.5 | 37.8
Gemini-1.5 Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6

Open-source Models
InternVL2-2B | 26.5 | 25.7 | 24.0 | 20.0 | 29.2 | 32.1 | 44.1 | 30.4 | 6.3
InternVL2-8B | 37.5 | 31.3 | 29.0 | 48.9 | 44.2 | 38.0 | 33.4 | 28.9 | 46.4
InternVL2-40B | 37.0 | 41.3 | 26.2 | 48.2 | 27.5 | 47.6 | 32.7 | 27.8 | 44.7
LongVILA-8B | 21.6 | 29.1 | 9.1 | 16.7 | 0.0 | 29.6 | 30.7 | 32.5 | 25.5
VILA-1.5-8B | 28.9 | 17.4 | 21.8 | 50.3 | 18.8 | 32.1 | 34.8 | 31.0 | 24.8
VILA-1.5-40B | 31.2 | 22.4 | 24.8 | 48.7 | 22.7 | 40.5 | 25.7 | 31.5 | 32.9
LongVA-7B | 29.2 | 38.0 | 16.6 | 38.9 | 22.2 | 33.1 | 43.3 | 25.4 | 15.7
LLaVA-Video-7B | 35.6 | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6
LLaVA-Video-72B | 40.9 | 48.9 | 22.8 | 57.4 | 35.3 | 42.4 | 36.7 | 35.0 | 48.6
LLaVA-OneVision-0.5B | 28.0 | 46.1 | 28.4 | 15.4 | 28.3 | 28.9 | 36.9 | 34.5 | 5.8
LLaVA-OneVision-7B | 32.4 | 47.7 | 20.2 | 47.4 | 12.3 | 42.5 | 35.2 | 29.4 | 24.4
LLaVA-OneVision-72B | 40.2 | 43.5 | 23.9 | 57.6 | 37.5 | 42.5 | 39.9 | 32.5 | 44.6
  • Human Level Performance: Humans achieve 79% average accuracy, significantly outperforming the best MLLM by 33%. Humans excel particularly in configurational and spatiotemporal tasks (94-100%). However, the gap is narrower for measurement tasks (object size, room size, absolute distance), suggesting MLLMs might have a relative strength in quantitative estimation, which is less intuitive for humans to do precisely without tools.
  • Proprietary MLLMs: Gemini-1.5 Pro leads, showing competitive results and substantially surpassing chance baselines. It approaches human performance in absolute distance and room size estimation. This is remarkable given MLLMs are trained on 2D data, while humans have years of 3D world experience.
  • Open-source MLLMs: Top-tier open-source models (LLaVA-Video-72B, LLaVA-OneVision-72B) are competitive, trailing Gemini-1.5 Pro by only 4-5%. However, a majority (7 out of 12) perform below the Chance Level (Frequency) baseline, indicating significant limitations in their visual-spatial intelligence.

6.1.2. How MLLMs Think Linguistically: Self-Explanations

Analysis of Gemini-1.5 Pro's self-explanations (Figure 6) reveals advanced video understanding (accurate timestamped descriptions, step-by-step reasoning) and even hints at a global coordinate system construction, suggesting an implicit world model. However, error analysis (Figure 7) highlights that 71% of errors stem from spatial reasoning issues, specifically relational reasoning and egocentric-allocentric transformation. For example, in a route plan task, the model might follow the camera's egocentric pan direction instead of inferring the allocentric route (Figure 6, right). This confirms spatial reasoning as the primary bottleneck.

6.1.3. Limits of CoT Methods in Visuospatial Tasks

The study shows that prevailing linguistic reasoning techniques (Chain-of-Thought, Self-Consistency, Tree-of-Thoughts) surprisingly fail to improve performance on VSI-Bench (Figure 8).

  • Zero-Shot CoT and ToT reduce average performance by ~4%.
  • Self-Consistency also drops by 1.1%.
  • While these methods showed improvement on a general video understanding benchmark (VideoMME, Table 2), their ineffectiveness here suggests that spatial reasoning cannot be solved by merely enhancing linguistic capabilities. This finding is critical as it challenges the general applicability of these popular prompting methods.

6.1.4. How MLLMs Think Visually: Cognitive Maps

  • Locality of Cognitive Maps: Probing MLLMs by asking them to generate cognitive maps (Figure 9) reveals a strong local spatial awareness. MLLMs achieve 64% accuracy in positioning adjacent objects within their maps. However, this accuracy dramatically decreases with increasing object distance (Figure 10), indicating that MLLMs tend to form a series of local world models rather than a unified, globally consistent spatial model from videos.

  • Better Distance Reasoning via Cognitive Maps: Explicitly generating cognitive maps before answering relative distance questions significantly enhances MLLMs' spatial distance ability.

    • Using MLLM-generated maps improves relative distance accuracy by 10% (Table 3a).
    • When provided with ground truth cognitive maps, performance improves even more (20-32% gain over baseline), highlighting the potential of accurate mental spatial representations. This suggests that building an internal spatial world model could be a valuable pretext task or solution for MLLMs.

6.1.5. Input Sequencing and Repetition Analysis

The paper explores how the order of video and question presentation, and video repetition, affect MLLM performance (Table 8):

  • Input Sequence: Contrary to human intuition, presenting the question before the video leads to a 2.5% decrease in Gemini-1.5 Pro's overall performance compared to the default video-first order (46.3 vs. 48.8, Table 8a). This is surprising, as humans typically benefit from knowing the question before viewing so they can direct their attention.
  • Video Repetition: MLLMs benefit from multiple video views. Repeating the video input twice (e.g., [Video] [Context] [Video]) provides a 2.1% performance gain. This suggests that current autoregressive MLLMs don't fully exploit their ability to "revisit" video content internally and could benefit from explicit re-exposure to visual information.
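
As a small illustration of the input orderings compared in Table 8, here is a sketch of how the multimodal prompt could be assembled; the (type, content) part representation is a generic assumption rather than any specific API.

    def build_prompt(video, question, order="video_first", video_repeats=1):
        """Assemble the multimodal input as an ordered list of (type, content) parts.

        order="video_first":    [video, question]   (the default compared in Table 8a)
        order="question_first": [question, video]
        video_repeats=2:        append the video once more, giving the
                                [Video][Context][Video] layout from Table 8b.
        """
        if order == "question_first":
            parts = [("text", question), ("video", video)]
        else:
            parts = [("video", video), ("text", question)]
        for _ in range(video_repeats - 1):
            parts.append(("video", video))
        return parts

    print(build_prompt("scene0011_00.mp4", "How many chairs are in this room?", video_repeats=2))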

6.2. Data Presentation (Tables)

The following are the results from [Table 1] of the original paper:

(Obj. Count, Abs. Dist., Obj. Size and Room Size are numerical-answer tasks; Rel. Dist., Rel. Dir., Route Plan and Appr. Order are multiple-choice tasks.)

| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Baseline** | | | | | | | | | |
| Chance Level (Random) | – | – | – | – | – | 25.0 | 36.1 | 28.3 | 25.0 |
| Chance Level (Frequency) | 34.0 | 62.1 | 32.0 | 29.9 | 33.1 | 25.1 | 47.9 | 28.4 | 25.2 |
| **VSI-Bench (tiny) Perf.** | | | | | | | | | |
| Human Level | 79.2 | 94.3 | 47.0 | 60.4 | 45.9 | 94.7 | 95.8 | 95.8 | 100.0 |
| Gemini-1.5 Flash | 45.7 | 50.8 | 33.6 | 56.5 | 45.2 | 48.0 | 39.8 | 32.7 | 59.2 |
| Gemini-1.5 Pro | 48.8 | 49.6 | 28.8 | 58.6 | 49.4 | 46.0 | 48.1 | 42.0 | 68.0 |
| Gemini-2.0 Flash | 45.4 | 52.4 | 30.6 | 66.7 | 31.8 | 56.0 | 46.3 | 24.5 | 55.1 |
| **Proprietary Models (API)** | | | | | | | | | |
| GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5 Flash | 42.1 | 49.8 | 30.8 | 53.5 | 54.4 | 37.7 | 41.0 | 31.5 | 37.8 |
| Gemini-1.5 Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| **Open-source Models** | | | | | | | | | |
| InternVL2-2B | 26.5 | 25.7 | 24.0 | 20.0 | 29.2 | 32.1 | 44.1 | 30.4 | 6.3 |
| InternVL2-8B | 37.5 | 31.3 | 29.0 | 48.9 | 44.2 | 38.0 | 33.4 | 28.9 | 46.4 |
| InternVL2-40B | 37.0 | 41.3 | 26.2 | 48.2 | 27.5 | 47.6 | 32.7 | 27.8 | 44.7 |
| LongVILA-8B | 21.6 | 29.1 | 9.1 | 16.7 | 0.0 | 29.6 | 30.7 | 32.5 | 25.5 |
| VILA-1.5-8B | 28.9 | 17.4 | 21.8 | 50.3 | 18.8 | 32.1 | 34.8 | 31.0 | 24.8 |
| VILA-1.5-40B | 31.2 | 22.4 | 24.8 | 48.7 | 22.7 | 40.5 | 25.7 | 31.5 | 32.9 |
| LongVA-7B | 29.2 | 38.0 | 16.6 | 38.9 | 22.2 | 33.1 | 43.3 | 25.4 | 15.7 |
| LLaVA-Video-7B | 35.6 | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 |
| LLaVA-Video-72B | 40.9 | 48.9 | 22.8 | 57.4 | 35.3 | 42.4 | 36.7 | 35.0 | 48.6 |
| LLaVA-OneVision-0.5B | 28.0 | 46.1 | 28.4 | 15.4 | 28.3 | 28.9 | 36.9 | 34.5 | 5.8 |
| LLaVA-OneVision-7B | 32.4 | 47.7 | 20.2 | 47.4 | 12.3 | 42.5 | 35.2 | 29.4 | 24.4 |
| LLaVA-OneVision-72B | 40.2 | 43.5 | 23.9 | 57.6 | 37.5 | 42.5 | 39.9 | 32.5 | 44.6 |
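The Avg. column appears to be the unweighted mean of the eight per-task scores; a quick arithmetic check using GPT-4o's row:

```python
# Avg. is consistent with the unweighted mean of the eight task scores.
gpt4o_scores = [46.2, 5.3, 43.8, 38.2, 37.0, 41.3, 31.5, 28.5]
print(round(sum(gpt4o_scores) / len(gpt4o_scores), 1))  # -> 34.0, matching the Avg. column
```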

The following are the results from [Table 2] of the original paper:

| Case | Performance |
| --- | --- |
| Gemini-1.5 Pro (w/o CoT) | 77.2 |
| Gemini-1.5 Pro (w/ CoT) | 79.8 |

The following are the results from [Table 3] of the original paper:

(a)

| Case | Rel. Dist. Acc. |
| --- | --- |
| w/o Cog. map | 46.0 |
| w/ Cog. map | 56.0 |
| w/ Cog. map (GT) | 66.0 |

(b)

| Cog. Map Src. | Size | Rel. Dist. Acc. |
| --- | --- | --- |
| MLLM | 10 × 10 | 56.0 |
| MLLM | 20 × 20 | 54.0 |
| GT | 10 × 10 | 66.0 |
| GT | 20 × 20 | 78.0 |

The following are the results from [Table 5] of the original paper:

| Distance | [1.0, 2.1] | (2.1, 3.3] | (3.3, 4.4] | (4.4, 5.5] | (5.5, 6.6] | (6.6, 7.8] | (7.8, 8.9] | (8.9, 10.0] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini-1.5 Pro | 0.64 | 0.48 | 0.35 | 0.35 | 0.28 | 0.12 | 0.06 | 0.00 |
| LLaVA-Video-72B | 0.59 | 0.45 | 0.42 | 0.30 | 0.15 | 0.23 | 0.16 | 0.00 |
| LLaVA-Video-7B | 0.50 | 0.43 | 0.34 | 0.29 | 0.19 | 0.18 | 0.14 | 0.00 |
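Table 5 reports cognitive-map locality: the fraction of object pairs placed correctly, bucketed by the pair's ground-truth distance. A minimal sketch of computing such a curve (the function name, bin edges, and the externally supplied correctness flags are illustrative assumptions):

```python
import numpy as np


def locality_curve(pair_dists, correct_flags, edges):
    """Bucket object pairs by ground-truth distance and report, per bucket,
    the fraction of pairs whose predicted map placement was judged correct."""
    pair_dists = np.asarray(pair_dists, dtype=float)
    correct_flags = np.asarray(correct_flags, dtype=float)
    accs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (pair_dists > lo) & (pair_dists <= hi)
        accs.append(round(float(correct_flags[mask].mean()), 2) if mask.any() else None)
    return accs


# Eight equal-width bins over [1.0, 10.0], approximating the bin labels in Table 5.
edges = np.linspace(1.0, 10.0, 9)
print(locality_curve([1.5, 2.0, 3.0, 9.5], [1, 1, 0, 0], edges))
```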

The following are the results from [Table 6] of the original paper:

| Models | LLaVA-Video-72B | LLaVA-Video-7B |
| --- | --- | --- |
| w/o Cog. Map | 36.0 | 40.0 |
| w/ Cog. Map | 42.0 | 32.0 |

The following are the results from [Table 7] of the original paper:

| GPT-4o | Standard | Socratic | Blind |
| --- | --- | --- | --- |
| Avg. | 34.0 | 29.3 | 14.5 |

The following are the results from [Table 8] of the original paper:

(a) Input Sequence

| Order | Avg. |
| --- | --- |
| Video first | 48.8 |
| Question first | 46.3 |

(b) Video Repetition Time

| # Times | Avg. |
| --- | --- |
| 1 | 48.8 |
| 2 | 50.9 |

The following are the results from [Table 9] of the original paper:

| Methods | # of Frames |
| --- | --- |
| **Proprietary Models (API)** | |
| GPT-4o | 16 |
| Gemini-1.5 Flash | – |
| Gemini-1.5 Pro | – |
| **Open-source Models** | |
| InternVL2-2B | 32 |
| InternVL2-8B | 32 |
| InternVL2-40B | 32 |
| LongVILA-8B | 32 |
| VILA-1.5-8B | 32 |
| VILA-1.5-40B | 32 |
| LongVA-7B | 32 |
| LLaVA-Video-7B | 32 |
| LLaVA-Video-72B | 32 |
| LLaVA-OneVision-0.5B | 32 |
| LLaVA-OneVision-7B | 32 |
| LLaVA-OneVision-72B | 32 |

The following are the results from [Table 10] of the original paper:

| Prompt | Models | QA Type | Text |
| --- | --- | --- | --- |
| Pre-Prompt | All | – | These are frames of a video. |
| Post-Prompt | Open-source Models | NA | Please answer the question using a single word or phrase. |
| Post-Prompt | Open-source Models | MCA | Answer with the option's letter from the given choices directly. |
| Post-Prompt | Proprietary Models | NA | Do not respond with anything other than a single number! |
| Post-Prompt | Proprietary Models | MCA | Answer with the option's letter from the given choices directly. |
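Given these templates, the full text prompt for a question can be assembled as in the sketch below; the helper name and dictionary keys are assumptions for illustration, and the family-to-prompt mapping mirrors the table above.

```python
PRE_PROMPT = "These are frames of a video."

# Post-prompts keyed by (model family, answer type), mirroring Table 10.
POST_PROMPTS = {
    ("open-source", "NA"): "Please answer the question using a single word or phrase.",
    ("open-source", "MCA"): "Answer with the option's letter from the given choices directly.",
    ("proprietary", "NA"): "Do not respond with anything other than a single number!",
    ("proprietary", "MCA"): "Answer with the option's letter from the given choices directly.",
}


def build_text_prompt(question: str, model_family: str, qa_type: str) -> str:
    """Assemble pre-prompt, question, and the matching post-prompt."""
    return f"{PRE_PROMPT}\n{question}\n{POST_PROMPTS[(model_family, qa_type)]}"


print(build_text_prompt("How many chairs are in this room?", "proprietary", "NA"))
```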

The following are the results from [Table 11] of the original paper:

| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models (API)** | | | | | | | | | |
| Gemini-1.5 Flash | 41.6 | 49.1 | 30.3 | 52.7 | 53.7 | 37.1 | 40.8 | 31.4 | 37.1 |
| Gemini-1.5 Pro | 44.9 | 55.1 | 30.3 | 63.1 | 43.3 | 50.0 | 45.9 | 35.7 | 35.7 |
| **Open-source Models** | | | | | | | | | |
| InternVL2-2B | 27.0 | 22.4 | 24.9 | 21.1 | 34.1 | 32.9 | 43.5 | 30.0 | 7.1 |
| InternVL2-8B | 34.1 | 22.6 | 28.3 | 47.6 | 39.6 | 35.7 | 30.4 | 30.0 | 38.6 |
| InternVL2-40B | 35.5 | 34.4 | 26.9 | 45.6 | 31.3 | 41.4 | 31.7 | 32.9 | 40.0 |
| LongVILA-8B | 21.0 | 28.7 | 8.6 | 16.3 | 0.0 | 28.6 | 30.5 | 31.4 | 24.3 |
| VILA-1.5-8B | 28.4 | 17.3 | 21.6 | 49.9 | 18.6 | 31.4 | 34.4 | 30.0 | 24.3 |
| VILA-1.5-40B | 30.8 | 21.4 | 24.4 | 48.3 | 21.9 | 40.0 | 25.0 | 30.0 | 35.7 |
| LongVA-7B | 29.0 | 38.1 | 16.9 | 38.1 | 21.7 | 32.9 | 42.8 | 25.7 | 15.7 |
| LLaVA-Video-7B | 34.9 | 47.9 | 13.4 | 46.7 | 23.9 | 42.9 | 41.9 | 32.9 | 30.0 |
| LLaVA-Video-72B | 40.5 | 48.3 | 22.6 | 56.7 | 34.6 | 41.4 | 36.5 | 35.7 | 48.6 |
| LLaVA-OneVision-0.5B | 27.6 | 45.1 | 27.9 | 14.7 | 27.9 | 28.6 | 37.0 | 34.3 | 5.7 |
| LLaVA-OneVision-7B | 32.1 | 46.9 | 19.9 | 46.9 | 12.1 | 41.4 | 35.1 | 30.0 | 24.3 |
| LLaVA-OneVision-72B | 39.6 | 42.7 | 23.7 | 56.7 | 36.9 | 41.4 | 39.5 | 31.4 | 44.3 |

The following are the results from [Table 12] of the original paper:

| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models (API)** | | | | | | | | | |
| GPT-4o | 35.6 | 36.2 | 4.6 | 47.2 | 40.4 | 40.0 | 46.2 | 32.0 | 38.0 |
| Gemini-1.5 Flash | 45.7 | 50.8 | 33.6 | 56.5 | 45.2 | 48.0 | 39.8 | 32.7 | 59.2 |
| Gemini-1.5 Pro | 48.8 | 49.6 | 28.8 | 58.6 | 49.4 | 46.0 | 48.1 | 42.0 | 68.0 |
| Gemini-2.0 Flash | 45.4 | 52.4 | 30.6 | 66.7 | 31.8 | 56.0 | 46.3 | 24.5 | 55.1 |
| **Open-source Models** | | | | | | | | | |
| InternVL2-2B | 25.5 | 30.6 | 20.4 | 26.0 | 29.6 | 28.0 | 39.2 | 28.0 | 2.0 |
| InternVL2-8B | 32.9 | 26.4 | 25.4 | 43.8 | 41.6 | 30.0 | 32.2 | 20.0 | 44.0 |
| InternVL2-40B | 37.6 | 40.8 | 23.8 | 48.0 | 26.0 | 46.0 | 30.1 | 42.0 | 44.0 |
| LongVILA-8B | 19.1 | 23.4 | 10.8 | 11.4 | 0.0 | 20.0 | 33.1 | 28.0 | 26.0 |
| VILA-1.5-8B | 31.4 | 12.2 | 23.4 | 51.4 | 18.6 | 36.0 | 41.5 | 42.0 | 26.0 |
| VILA-1.5-40B | 32.3 | 14.6 | 21.0 | 48.0 | 20.6 | 42.0 | 22.0 | 40.0 | 50.0 |
| LongVA-7B | 31.8 | 41.2 | 17.4 | 39.6 | 25.4 | 30.0 | 52.8 | 34.0 | 14.0 |
| LLaVA-Video-7B | 35.7 | 49.0 | 12.8 | 48.6 | 21.4 | 40.0 | 43.5 | 34.0 | 36.0 |
| LLaVA-Video-72B | 39.3 | 41.4 | 26.6 | 55.6 | 31.6 | 36.0 | 25.6 | 42.0 | 56.0 |
| LLaVA-OneVision-0.5B | 27.7 | 44.0 | 23.0 | 18.8 | 28.4 | 30.0 | 33.4 | 36.0 | 8.0 |
| LLaVA-OneVision-7B | 33.8 | 48.2 | 22.0 | 44.4 | 14.0 | 44.0 | 31.9 | 34.0 | 32.0 |
| LLaVA-OneVision-72B | 41.6 | 38.0 | 31.6 | 54.4 | 35.2 | 44.0 | 39.7 | 32.0 | 58.0 |

The following are the results from [Table 13] of the original paper:

| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models (API)** | | | | | | | | | |
| GPT-4o | 14.5 | 0.1 | 5.2 | 36.7 | 0.0 | 10.8 | 23.2 | 26.9 | 13.1 |
| Gemini-1.5 Flash | 19.9 | 25.0 | 30.3 | 52.5 | 0.0 | 0.0 | 21.2 | 29.9 | 0.2 |
| Gemini-1.5 Pro | 32.3 | 30.6 | 11.5 | 51.5 | 33.1 | 33.8 | 44.6 | 33.5 | 20.2 |
| **Open-source Models** | | | | | | | | | |
| InternVL2-2B | 17.8 | 5.4 | 23.7 | 9.2 | 0.0 | 26.9 | 41.2 | 27.9 | 7.9 |
| InternVL2-8B | 27.6 | 31.9 | 26.8 | 38.3 | 0.7 | 27.1 | 39.2 | 33.0 | 23.6 |
| InternVL2-40B | 24.4 | 5.4 | 29.1 | 39.2 | 0.7 | 30.3 | 37.7 | 27.9 | 24.7 |
| LongVILA-8B | 20.2 | 47.4 | 12.6 | 8.7 | 0.6 | 24.3 | 27.0 | 27.4 | 13.9 |
| VILA-1.5-8B | 21.5 | 7.4 | 7.6 | 45.7 | 0.0 | 25.4 | 39.1 | 29.4 | 17.6 |
| VILA-1.5-40B | 25.5 | 5.3 | 27.6 | 46.5 | 0.7 | 30.2 | 37.1 | 31.5 | 25.0 |
| LongVA-7B | 21.9 | 5.1 | 18.1 | 27.4 | 26.1 | 23.4 | 39.8 | 26.9 | 8.7 |
| LLaVA-Video-7B | 25.2 | 14.8 | 14.6 | 32.5 | 26.1 | 26.8 | 45.0 | 33.0 | 8.5 |
| LLaVA-Video-72B | 29.1 | 19.0 | 25.4 | 46.3 | 26.1 | 29.0 | 38.8 | 33.0 | 15.5 |
| LLaVA-OneVision-0.5B | 28.6 | 38.4 | 30.1 | 32.0 | 24.3 | 22.0 | 41.8 | 34.5 | 5.4 |
| LLaVA-OneVision-7B | 25.3 | 13.8 | 8.5 | 45.5 | 26.1 | 28.6 | 41.2 | 27.9 | 11.1 |
| LLaVA-OneVision-72B | 28.9 | 8.2 | 23.8 | 54.1 | 26.1 | 30.4 | 38.1 | 33.0 | 17.1 |

The following are the results from [Table 14] of the original paper:

| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models (API)** | | | | | | | | | |
| GPT-4o | 19.5 | 46.1 | 0.1 | 7.1 | 38.2 | 26.2 | 18.0 | 4.6 | 15.4 |
| Gemini-1.5 Flash | 22.2 | 24.9 | 0.5 | 1.0 | 54.4 | 37.7 | 19.9 | 1.5 | 37.7 |
| Gemini-1.5 Pro | 13.0 | 25.5 | 19.5 | 12.6 | 10.6 | 17.5 | 1.7 | 2.5 | 14.4 |
| **Open-source Models** | | | | | | | | | |
| InternVL2-2B | 8.7 | 20.3 | 0.3 | 10.8 | 29.2 | 5.2 | 2.9 | 2.5 | -1.6 |
| InternVL2-8B | 9.9 | -0.6 | 2.2 | 10.6 | 43.5 | 10.9 | -5.8 | -4.1 | 22.8 |
| InternVL2-40B | 12.6 | 35.9 | -2.9 | 9.0 | 26.8 | 17.3 | -5.0 | 9.9 | 20.0 |
| LongVILA-8B | 1.4 | -18.2 | -3.5 | 7.9 | -0.6 | 5.3 | 3.7 | 5.1 | 11.5 |
| VILA-1.5-8B | 7.3 | 10.0 | 14.2 | 4.6 | 18.8 | 6.7 | -4.4 | 1.5 | 7.2 |
| VILA-1.5-40B | 5.7 | 17.1 | -2.8 | 2.2 | 22.0 | 10.4 | -11.4 | 0.0 | 7.9 |
| LongVA-7B | 7.2 | 32.9 | -1.5 | 11.5 | -3.9 | 9.7 | 3.5 | -1.5 | 7.1 |
| LLaVA-Video-7B | 10.5 | 33.8 | -0.6 | 15.2 | -1.9 | 16.7 | -2.7 | 1.0 | 22.1 |
| LLaVA-Video-72B | 11.7 | 29.9 | -2.6 | 11.1 | 9.2 | 13.3 | -2.0 | 2.0 | 33.0 |
| LLaVA-OneVision-0.5B | -0.5 | 7.8 | -1.7 | -16.6 | 4.0 | 6.9 | -5.0 | 0.0 | 0.3 |
| LLaVA-OneVision-7B | 7.0 | 33.9 | 11.7 | 1.9 | -13.9 | 13.9 | -6.0 | 1.5 | 13.3 |
| LLaVA-OneVision-72B | 11.4 | 35.4 | 0.1 | 3.5 | 11.4 | 12.1 | 1.8 | -0.5 | 27.4 |

6.3. Ablation Studies / Parameter Analysis

6.3.1. Impact of Number of Sampled Frames

The paper investigates the impact of the number of sampled frames on MLLM performance. The following figure (Figure 11 from the original paper) shows the analysis of different # sampled frames.

Figure 11. Analysis of different # sampled frames. The chart compares how the number of sampled frames affects VSI-Bench performance for five models: the InternVL2 series, LV-7B, and GPT-4o.

As shown in Figure 11, the number of sampled frames only marginally affects performance for the tested models (InternVL2 series, LV-7B, GPT-4o). This suggests that frame sampling strategies are not the primary bottleneck for visual-spatial intelligence in these models, reinforcing the idea that architectural or reasoning limitations are more critical.
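For reference, frame subsampling in such evaluations is usually uniform over the clip; the sketch below shows one common scheme (an assumption for illustration, not necessarily the exact sampling used by the paper or the listed models).

```python
import numpy as np


def uniform_frame_indices(num_video_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread evenly across the whole clip."""
    if num_samples >= num_video_frames:
        return list(range(num_video_frames))
    # Take the midpoint of each of `num_samples` equal temporal segments.
    edges = np.linspace(0, num_video_frames, num_samples + 1)
    return [int((lo + hi) / 2) for lo, hi in zip(edges[:-1], edges[1:])]


print(uniform_frame_indices(900, 8))  # e.g. a 30 s clip at 30 FPS, sampled to 8 frames
```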

6.3.2. Blind Evaluation (Vision Disabled)

The paper compares MLLMs' performance with and without video input (i.e., "blind" or "vision disabled"). The following figure (Figure 12 from the original paper) shows performance comparisons between Vision Enabled (w/ video), Vision Disabled (w/o video) and Chance Level (Freq.).

Figure 12. Performance comparisons between Vision Enabled (w/ video), Vision Disabled (w/o video) and Chance Level (Frequency). Enabled-Disabled indicates the gap between Vision Enabled and Vision Disabled, and Disabled-Chance the gap between Vision Disabled and Chance Level; tasks are sorted by the Enabled-Disabled gap.

  • Video is essential: The consistently positive "Enabled-Disabled" gaps in Figure 12 (Vision Enabled minus Vision Disabled) and the general degradation in "Disabled-Chance" (Vision Disabled minus Chance Level (Frequency)) demonstrate that video input is crucial and beneficial for VSI-Bench; blind models often perform below the Chance Level (Frequency) baseline. A small helper for computing these per-task gaps is sketched after this list.
  • Persistent Difficulty for Some Tasks: Even with vision enabled, MLLMs struggle to improve significantly beyond Chance Level in absolute distance estimation, route plan, and relative direction tasks. This highlights the inherent difficulty of these specific spatial reasoning challenges.
  • Common-Sense Knowledge: On object size estimation, "Vision Disabled" models already significantly outperform Chance Level. This is likely due to common-sense knowledge learned during large language model pre-training, where models may infer typical object sizes from text.
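The two gaps plotted in Figure 12 are plain per-task differences; the helper below (hypothetical function and task names) illustrates the computation, using GPT-4o's object-size scores from Tables 1 and 13 together with the frequency baseline.

```python
def vision_gaps(enabled: dict, disabled: dict, chance: dict) -> dict:
    """Per-task gaps as in Figure 12: Enabled-Disabled and Disabled-Chance."""
    return {
        task: {
            "enabled_minus_disabled": round(enabled[task] - disabled[task], 1),
            "disabled_minus_chance": round(disabled[task] - chance[task], 1),
        }
        for task in enabled
    }


# GPT-4o on object size: 43.8 with video (Table 1), 36.7 blind (Table 13), 29.9 chance (frequency).
print(vision_gaps({"obj_size": 43.8}, {"obj_size": 36.7}, {"obj_size": 29.9}))
```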

6.3.3. Vision Enabled vs. Vision Disabled Performance Improvement

Table 14 (Section 6.2) quantifies the performance improvement of MLLMs when visual signals are available, relative to their vision-disabled counterparts in Table 13. Most MLLMs show improvements, with notable gains in object count, room size, relative distance, and appearance order. This further confirms the value of video input for these specific spatial tasks.

6.4. Socratic LLMs with Frame Captions

Using a Socratic approach (GPT-4o reasoning over frame captions generated by LLaVA-Video-72B) surprisingly degrades performance by 4.7% compared to standard GPT-4o (Table 7: 34.0 vs. 29.3). This suggests that linguistic descriptions of visual content are not a substitute for direct visual understanding, especially for complex spatial reasoning; for VSI-Bench tasks, direct visual processing by the MLLM is superior to relying on an intermediate captioning step.
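A minimal sketch of such a Socratic pipeline is shown below; `caption_model` and `llm` are hypothetical callables standing in for the frame captioner (LLaVA-Video-72B in the paper) and the text-only reasoner (GPT-4o), and the prompt wording is illustrative.

```python
def socratic_answer(caption_model, llm, video_frames, question: str) -> str:
    """Socratic baseline: caption each frame with a VLM, then let a text-only
    LLM answer from the captions alone, without direct visual access."""
    captions = [
        f"Frame {i}: {caption_model(frame, 'Describe this frame in detail.')}"
        for i, frame in enumerate(video_frames)
    ]
    prompt = (
        "The following are captions of frames sampled from a video:\n"
        + "\n".join(captions)
        + f"\n\nBased only on these captions, answer: {question}"
    )
    return llm(prompt)
```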

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents a rigorous investigation into the visual-spatial intelligence of Multimodal Large Language Models (MLLMs) when observing spaces through sequential video observations. The introduction of VSI-Bench, a novel video-based benchmark with over 5,000 question-answer pairs across diverse indoor scenes, quantitatively demonstrates that MLLMs exhibit emerging, though subhuman, visual-spatial capabilities. A deep dive into MLLMs' "thinking" reveals that while they possess strong perceptual, temporal, and linguistic abilities, spatial reasoning (specifically relational reasoning and egocentric-allocentric transformation) remains the primary bottleneck for achieving higher performance. Intriguingly, widely successful linguistic prompting techniques like Chain-of-Thought, Self-Consistency, and Tree-of-Thoughts fail to improve performance on VSI-Bench, and in some cases, even degrade it. In contrast, explicitly prompting MLLMs to generate cognitive maps enhances their ability to perform spatial distance reasoning, indicating that building explicit mental spatial representations is a promising avenue for improving visual-spatial intelligence. The study also highlights that MLLMs tend to form local world models rather than unified global world models from video.

7.2. Limitations & Future Work

The authors identify several limitations and suggest future research directions:

  • Subhuman Performance: Despite competitive results, a significant gap remains between MLLMs and human performance, especially in configurational and spatiotemporal tasks.
  • Spatial Reasoning Bottleneck: The primary limitation is the models' spatial reasoning capabilities (specifically relational reasoning and egocentric-allocentric transformation).
  • Local vs. Global World Models: MLLMs demonstrate strong local spatial awareness but struggle with global spatial consistency and long-range spatial relationships, indicating difficulty in forming a unified spatial model from discrete video frames.
  • Failure of Linguistic Prompting: The ineffectiveness of existing linguistic Chain-of-Thought methods for spatial reasoning is a clear limitation of these techniques' generalizability.
  • Future Work Suggestions:
    • Task-specific fine-tuning: Tailoring MLLMs to VSI-Bench tasks could improve performance.
    • Self-supervised learning objectives for spatial reasoning: Developing pre-training tasks that specifically encourage the learning of spatial relationships and 3D world understanding.
    • Visuospatial-tailored prompting techniques: Designing new prompting strategies that are better suited to guide spatial reasoning, perhaps by integrating visual or spatial steps more explicitly.
    • Building accurate mental maps: The success of cognitive maps suggests this is a valuable pretext task or a crucial component for improving MLLMs' visual-spatial reasoning.
    • Addressing the local-to-global gap: Research into how MLLMs can build more unified and globally consistent spatial representations from sequential video input.

7.3. Personal Insights & Critique

This paper provides a highly valuable contribution to the understanding of MLLMs' capabilities in a critical, yet often overlooked, domain: 3D visual-spatial intelligence.

  • Innovation and Significance: The most significant innovation is the introduction of VSI-Bench. Its design, leveraging 3D reconstruction datasets and focusing on true 3D spatial reasoning tasks (e.g., absolute distance, relative direction, route planning) from video, fills a crucial gap in MLLM evaluation. The dual-coding inspired probing, using both linguistic self-explanations and visual cognitive maps, offers a rich, interpretable lens into MLLM internal processes, moving beyond black-box performance metrics. The counter-intuitive finding that standard Chain-of-Thought methods degrade spatial reasoning performance is a critical insight, challenging prevailing assumptions about LLM reasoning and highlighting the unique nature of spatial cognition. Conversely, the success of explicitly generating cognitive maps is a promising and actionable direction, suggesting that models need to actively construct and utilize internal spatial representations.

  • Transferability and Applicability: The findings have direct implications for embodied AI, robotics, autonomous driving, and AR/VR. For agents operating in the physical world, robust spatial understanding and memory are non-negotiable. The idea of using cognitive maps as a pretext task or an explicit reasoning step could be directly integrated into these applications, allowing MLLMs to build and leverage internal world models for navigation, object interaction, and scene understanding. Future embodied agents could benefit from architectures that prioritize spatial consistency and the explicit formation of mental maps.

  • Potential Issues, Unverified Assumptions, or Areas for Improvement:

    • "Thinking in space" Definition: While the paper provides a taxonomy of visual-spatial intelligence, the term "thinking in space" remains somewhat metaphorical for MLLMs. The probing reveals what they can do and where they struggle, but the underlying mechanisms of how these computations translate to human-like spatial thought are still largely unknown. A deeper theoretical framework connecting MLLM activations to cognitive spatial processes would be valuable.

    • Scaling Cognitive Map Generation: The current method for generating cognitive maps involves a specific prompt to predict object centers on a grid. While effective for probing, integrating this explicit generation into a seamless, end-to-end MLLM for dynamic spatial reasoning in real-time applications might require more sophisticated and efficient mechanisms. Can MLLMs learn to implicitly generate and update these maps without explicit instruction, or are explicit prompts always necessary for such high-level spatial tasks?

    • Generalizability of CoT Failure: The observation that linguistic CoT fails for spatial reasoning is profound. Is this specific to the nature of VSI-Bench (e.g., requiring rapid, holistic spatial inference rather than sequential linguistic decomposition), or does it indicate a more fundamental limitation of current MLLMs in translating between linguistic and spatial reasoning modalities for complex tasks? Further research could investigate if visuospatial-CoT (e.g., "think visually step by step" by generating intermediate visual representations) could be effective.

    • Human-Level Performance Gap: The substantial gap between MLLMs and human performance (especially in configurational tasks) highlights that current models lack a truly robust, intuitive grasp of 3D space. This points to the need for perhaps fundamentally new architectures or training paradigms that prioritize 3D spatial consistency and perception.

    • Egocentric-Allocentric Transformation: This remains a major challenge. How can MLLMs robustly build and update an allocentric map from egocentric video streams, especially when facing occlusions or complex camera movements? This is crucial for real-world navigation and manipulation.

    • Dataset Complexity: While VSI-Bench is excellent, 3D reconstruction datasets can still have their own limitations (e.g., reconstruction errors, limited object categories compared to the real world). Ensuring the benchmark scales and remains robust with even greater real-world complexity is an ongoing challenge.

Overall, this paper serves as a vital diagnostic tool and a call to action for the AI community to devote more attention to robust 3D visual-spatial intelligence in MLLMs, moving beyond superficial video understanding to genuine spatial cognition.
