
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

Published: 05/21/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper investigates the mechanisms of Multimodal Chain-of-Thought (MCoT) in Large Vision-Language Models (LVLMs), highlighting how visual thoughts enhance performance and interpretability. Four forms of visual thought expressions are defined, demonstrating their varying impact on MCoT performance.

Abstract

Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format, depending only on clarity and conciseness of expression. Furthermore, to explore visual thoughts systematically, we define four distinct forms of visual thought expressions and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries between the input image and reasoning to deeper transformer layers, enabling more advanced visual information transmission. We hope that the visual thoughts can inspire further breakthroughs for future MCoT research.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

1.2. Authors

Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, Libo Qin. The authors are affiliated with various academic institutions and industry labs, including Central South University, Harbin Institute of Technology, Chinese University of Hong Kong, Shanghai AI Laboratory, National University of Singapore, Peking University, and ByteDance Seed (China). This diverse set of affiliations suggests a collaborative effort between multiple research groups and potentially industry.

1.3. Journal/Conference

The paper is listed as an arXiv preprint, indicating it has not yet undergone formal peer review and publication in a journal or conference proceedings at the time of this analysis. However, given its publication date (May 21, 2025), it might be targeting a major machine learning or artificial intelligence conference such as NeurIPS, ICML, or ICLR, which are highly reputable venues in the field.

1.4. Publication Year

2025

1.5. Abstract

This paper investigates the mechanisms behind Multimodal Chain-of-Thought (MCoT) reasoning in Large Vision-Language Models (LVLMs), which enhance performance and interpretability in multimodal tasks. Existing MCoT methods are categorized into Textual-MCoT (T-MCoT) for text-only output and Interleaved-MCoT (I-MCoT) for interleaved image-text outputs. The authors propose that visual thoughts are the core mechanism driving MCoT's effectiveness, functioning as a means to convey image information to the reasoning process, irrespective of the MCoT format, with clarity and conciseness being key factors. The study defines and systematically analyzes four forms of visual thought expressions, finding that their clarity and conciseness influence performance gains. Furthermore, it reveals that visual thoughts act as intermediaries, transmitting visual information to deeper transformer layers, thus facilitating more advanced reasoning. The research aims to provide a unified understanding of MCoT and inspire future breakthroughs.

https://arxiv.org/abs/2505.15510 (Preprint on arXiv) PDF Link: https://arxiv.org/pdf/2505.15510v2.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the lack of a unified understanding of Multimodal Chain-of-Thought (MCoT) reasoning in Large Vision-Language Models (LVLMs). While MCoT has shown significant success in improving LVLM performance and interpretability across various multimodal tasks, the underlying mechanisms driving these improvements are not fully understood.

This problem is important because, despite the advancements in both Textual-MCoT (T-MCoT) (generating text from multimodal input) and Interleaved-MCoT (I-MCoT) (generating interleaved image-text outputs), there's an ongoing debate about their relative merits and a lack of a cohesive framework to explain their effectiveness. This gap hinders the identification of optimal MCoT paradigms and the derivation of generalizable insights. The field needs to know why MCoT works and how different MCoT approaches achieve their results.

The paper's entry point is the hypothesis that visual thoughts are the common, underlying factor explaining MCoT's effectiveness. These visual thoughts are defined as intermediate, logic-driven cross-modal representations that help bridge raw pixel data with linguistic rationales, enabling efficient and context-aware access to visual information during reasoning. This innovative idea provides a unified perspective to analyze disparate MCoT approaches.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Unified Explanation of MCoT Effectiveness: It introduces the concept of visual thoughts as the core mechanism that unifies the understanding of how MCoT enhances LVLMs. Visual thoughts convey visual information to the reasoning process, acting as a "cache" for distilled visual information, regardless of the MCoT format.

  • Systematic Categorization and Analysis of Visual Thought Expressions: The paper defines and comprehensively analyzes four distinct forms of visual thought expressions: Natural Language (N-LANG), Structured Language (S-LANG), Edited Image (E-IMG), and Generative Image (G-IMG). It demonstrates that these forms differ in clarity and conciseness, which in turn leads to varying levels of MCoT improvement.

  • Empirical Validation of Visual Thought Importance: Experimental results show that removing visual thoughts significantly impairs performance, often worse than reasoning directly from the query, thereby emphasizing their essential role.

  • Insight into Internal Mechanisms: The study delves into the internal nature of visual thoughts, revealing that they serve as critical intermediaries. They transmit visual information from the input image to deeper transformer layers, facilitating more advanced cognitive processing and enhanced cross-modal interaction within LVLMs. This is observed through attention distribution shifts and information flow analysis.

  • Guidance for MCoT Application: The findings suggest that different categories of MCoT (T-MCoT vs. I-MCoT) are better suited for different types of tasks (coarse-grained vs. fine-grained/complex visual operations), depending on the efficiency of visual thought transmission.

    These findings solve the problem of understanding the disparate mechanisms of MCoT paradigms by providing a unified conceptual framework. They offer practical guidance for designing and applying MCoT strategies, highlighting the importance of clarity, conciseness, and the efficient transmission of visual thoughts for optimal reasoning performance.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following foundational concepts:

  • Large Vision-Language Models (LVLMs): These are advanced artificial intelligence models that combine capabilities of Large Language Models (LLMs) with computer vision models. They are designed to process and understand both visual (images, videos) and textual data simultaneously, enabling them to perform multimodal tasks such as visual question answering, image captioning, and multimodal reasoning. They learn to align representations from different modalities.

  • Chain-of-Thought (CoT) Reasoning: Originating in Large Language Models (LLMs), CoT is a prompting technique that encourages models to generate a series of intermediate reasoning steps before arriving at a final answer. Instead of directly outputting an answer, the model articulates its thought process, which often involves breaking down complex problems into simpler, manageable steps. This technique has been shown to significantly improve reasoning abilities, especially for complex tasks like mathematical problem-solving or commonsense reasoning.

  • Multimodal Chain-of-Thought (MCoT): An extension of CoT reasoning to multimodal contexts, particularly for LVLMs. MCoT involves generating step-by-step reasoning paths that integrate information from both visual and textual inputs. It aims to enhance LVLMs' reasoning capabilities in multimodal tasks by making their decision-making process more transparent and robust.

  • Transformer Architecture: The dominant neural network architecture for sequence processing, widely used in both LLMs and LVLMs. Transformers rely heavily on the self-attention mechanism (explained below) to weigh the importance of different parts of the input sequence. They consist of encoder and decoder blocks, each typically comprising multi-head self-attention layers and feed-forward neural networks. Understanding how Transformers process input tokens (including visual tokens) and distribute attention is crucial for interpreting the internal rationale discussed in this paper.

  • Attention Mechanism: A core component of the Transformer architecture. It allows the model to dynamically weigh the importance of different input elements (tokens) when processing another element. In a self-attention mechanism, the model attends to different parts of its own input sequence to compute a representation for each token. In a cross-modal context (like LVLMs), it enables the model to attend to relevant parts of an image when generating text, or vice-versa. The basic formula for attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
    • $QK^T$ calculates similarity scores between queries and keys.
    • $\mathrm{softmax}$ normalizes these scores to produce attention weights.
    • $\sqrt{d_k}$ is a scaling factor to prevent large values in the dot product, where $d_k$ is the dimension of the keys.
    • Multiplying by $V$ combines values based on these weights.
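As a minimal, generic sketch (not tied to any particular LVLM implementation), the scaled dot-product attention above can be written in a few lines of NumPy:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: Q, K, V are (seq_len, d_k) arrays."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                        # weighted combination of values

# toy example: 4 tokens with dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)                   # shape (4, 8)
```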

3.2. Previous Works

The paper frames its discussion by categorizing previous MCoT methods into two main paradigms, which it then seeks to unify:

  • Textual-MCoT (T-MCoT): This approach extends the traditional text-based CoT framework to multimodal inputs, where the reasoning rationale generated is purely textual.

    • Examples: Some methods require LVLMs to describe visual elements before generating an answer (e.g., Zheng et al. [55], Yao et al. [49]). Others integrate structured information, like json-format scene graphs derived from images, into the reasoning process (e.g., Mitra et al. [30], Mondal et al. [31]).
    • Mechanisms: These models typically generate textual descriptions or structured representations of visual content, which then serve as intermediate steps for text-based reasoning to produce a final textual answer. Chen et al. [4] introduce M3CoT for multi-domain multi-step multimodal CoT.
  • Interleaved-MCoT (I-MCoT): This is a more recent approach where the reasoning process generates a sequence of interleaved image-text outputs. This means the model can generate new images or modify existing ones as part of its thought process, alongside textual explanations.

    • Examples: Methods like o3-mini [33] and Visual Sketchpad [15] employ external tools (e.g., code interpreters, specialized visual models) to modify images for reasoning. Other approaches use image generation models to create new images to enhance reasoning (e.g., Meng et al. [27], Li et al. [19]).
    • Mechanisms: These models go beyond text generation, allowing for visual manipulation, augmentation, or creation within the reasoning chain. This is often achieved by integrating external visual tools or fine-tuning LVLMs for multimodal generation capabilities. Cheng et al. [7] introduced CoMT to assess I-MCoT capabilities.

3.3. Technological Evolution

The evolution of multimodal reasoning in AI has progressed from simple image-text matching and captioning to complex multi-step reasoning. Initially, models focused on basic tasks like Visual Question Answering (VQA) where a single image and question yield a short text answer. With the rise of Large Language Models (LLMs) and their reasoning capabilities through Chain-of-Thought (CoT), researchers sought to extend this reasoning power to multimodal inputs. This led to Textual-MCoT, where visual information is first converted into a textual description or structured representation, and then LLM-like reasoning takes over. The limitation of purely textual rationales for inherently visual tasks prompted the development of Interleaved-MCoT, which allows for a more "human-like" thought process involving both visual and textual intermediate steps. This paper's work fits within this technological timeline by offering a unified conceptual framework (visual thoughts) that bridges the understanding of both T-MCoT and I-MCoT, explaining why they are effective and how they transmit visual information. It positions visual thoughts as a crucial, distilled form of visual information that acts as a cache, enabling deeper reasoning regardless of the specific MCoT implementation.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's core differentiation and innovation lie in its unified perspective and mechanistic understanding.

  • Unified Framework: Previous works largely treated T-MCoT and I-MCoT as distinct paradigms, often debating their superiority. This paper proposes visual thoughts as a common underlying mechanism, arguing that both approaches leverage this concept to improve performance. This shifts the focus from which MCoT format is better to how effectively visual information (as visual thoughts) is conveyed and processed.

  • Focus on Internal Mechanisms: While many MCoT papers focus on designing new prompting strategies or integrating external tools, this work delves deeper into the internal workings of LVLMs. It analyzes attention mechanisms and information flow to demonstrate how visual thoughts facilitate deeper visual information transmission within the Transformer layers. This provides a crucial mechanistic explanation beyond mere empirical performance gains.

  • Categorization and Analysis of Visual Thought Expressions: The paper systematically defines and evaluates four specific strategies for expressing visual thoughts (N-LANG, S-LANG, E-IMG, G-IMG), offering a structured way to understand their trade-offs in terms of clarity, conciseness, and reasoning cost. This systematic analysis allows for a more principled approach to MCoT design.

  • "Visual Cache" Analogy: The paper introduces the compelling analogy of visual thoughts as an internal cache that stores distilled, instruction-relevant visual information, contrasting it with the raw image as external memory. This analogy provides an intuitive way to understand the efficiency and effectiveness benefits.

    In essence, while related work primarily develops and applies MCoT techniques, this paper steps back to analyze and explain their fundamental drivers, offering a higher-level theoretical and empirical understanding that can inform future MCoT research.

4. Methodology

4.1. Principles

The core idea of the method used is that Multimodal Chain-of-Thought (MCoT) approaches boost Large Vision-Language Models (LVLMs) by integrating visual thoughts into their reasoning process. Visual thoughts are intermediate, logic-driven cross-modal representations that act as a "cache" for distilled visual information. Instead of continuously reprocessing the entire raw image (which is analogous to slow, resource-intensive external memory access), visual thoughts extract and store only the instruction-relevant visual cues. This allows for rapid, context-aware access, reduces computational overhead, and enables deeper, multi-step multimodal reasoning. The effectiveness of these visual thoughts depends on their clarity and conciseness of expression, regardless of whether they are purely textual or interleaved image-text.

4.2. Core Methodology In-depth (Layer by Layer)

The paper formally defines visual thoughts and categorizes them into four distinct expressions, demonstrating how they integrate into the MCoT reasoning process.

4.2.1. Definition of Visual Thoughts

A visual thought (denoted $\mathcal{R}^{\mathcal{E}_{VT}}$) is defined as a reasoning step that conveys information derived from the visual input $\mathcal{V}_I$ and all previous reasoning steps $\mathbf{R}_{<i} = \{\mathcal{R}_1, \mathcal{R}_2, ..., \mathcal{R}_{i-1}\}$. These steps are guided by the task question $\mathcal{Q}_T$ and explicit instructions $\mathcal{T}_{\mathcal{E}_{VT}}$ that request a specific MCoT expression $\mathcal{E}_{VT}$.

The model generates the next reasoning step $\mathcal{R}_i$ based on a conditional probability, deciding whether to generate a visual thought or a derivative reasoning step. The formal representation is given by:

$$
\mathcal{R}_i =
\begin{cases}
\underset{\mathcal{R}^{\mathcal{E}_{VT}}}{\operatorname{argmax}} \; \dot{\pi}\left(\mathbf{R}_{<i}, \mathcal{V}_I \longmapsto \mathcal{R}^{\mathcal{E}_{VT}} \mid \mathcal{V}_I, \mathcal{Q}_T, \mathcal{T}_{\mathcal{E}_{VT}}, \mathbf{R}_{<i}\right), & \text{if } \dot{\pi} \ge \pi, \\
\underset{\mathcal{R}^{D}}{\operatorname{argmax}} \; \pi\left(\mathbf{R}_{<i}, \mathcal{R}^{\mathcal{E}_{VT}} \longmapsto \mathcal{R}^{D} \mid \mathcal{V}_I, \mathcal{Q}_T, \mathcal{T}_{\mathcal{E}_{VT}}, \mathbf{R}_{<i}\right), & \text{if } \dot{\pi} < \pi,
\end{cases}
$$

Where:

  • $\mathcal{R}_i$: The $i$-th reasoning step being generated.

  • $\mathbf{R}_{<i}$: All reasoning steps generated prior to $\mathcal{R}_i$.

  • $\mathcal{V}_I$: The initial visual input (image).

  • $\mathcal{Q}_T$: The task question.

  • $\mathcal{T}_{\mathcal{E}_{VT}}$: Explicit instructions defining the desired MCoT expression format for the visual thought.

  • $\mathcal{R}^{\mathcal{E}_{VT}}$: A visual thought expressed in a particular format $\mathcal{E}_{VT}$.

  • $\mathcal{R}^D$: A derivative reasoning step, which follows a visual thought and grasps visual information from it.

  • $\dot{\pi}(\cdot)$: The probability of generating a visual thought (i.e., a step that explicitly conveys visual information).

  • $\pi(\cdot)$: The probability of generating a derivative reasoning step.

  • $x \longmapsto y$: Indicates the flow of reasoning information from step $x$ to step $y$; $y$ is derived from $x$.

    This formula implies that at each step, the model evaluates whether to generate a visual thought (if $\dot{\pi} \ge \pi$) or a subsequent derivative reasoning step based on a previously generated visual thought (if $\dot{\pi} < \pi$).
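The case distinction above can be read as a simple control loop that alternates between caching new visual information and reasoning over it. The sketch below is illustrative only; the `model` interface and its scoring/generation methods are hypothetical stand-ins for $\dot{\pi}$, $\pi$, and the corresponding generation steps, not an API from the paper.

```python
def generate_reasoning_chain(image, question, instructions, model, max_steps=8):
    """Illustrative sketch of the visual-thought / derivative-step alternation.

    `model` is a hypothetical interface exposing:
      - score_visual_thought(...) / score_derivative(...): stand-ins for pi_dot and pi
      - generate_visual_thought(...): produces R^{E_VT} (caption, scene graph, image, ...)
      - generate_derivative_step(...): produces R^D from earlier visual thoughts
    """
    steps = []  # R_{<i}
    for _ in range(max_steps):
        pi_dot = model.score_visual_thought(image, question, instructions, steps)
        pi = model.score_derivative(image, question, instructions, steps)
        if pi_dot >= pi:
            # convey fresh visual information into the chain (a visual thought)
            step = model.generate_visual_thought(image, question, instructions, steps)
        else:
            # reason over the visual information already cached in earlier steps
            step = model.generate_derivative_step(question, instructions, steps)
        steps.append(step)
        if getattr(step, "is_final_answer", False):
            break
    return steps
```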

The following figure (Figure 2 from the original paper) illustrates the concept of visual thoughts as an internal visual cache:

Figure 2: Comparison of multimodal reasoning from a computer-system perspective: (a) visual thoughts as an internal visual cache versus (b) direct access to raw images as external storage.

Figure 2 (a) shows visual thoughts as an internal cache, rapidly accessing distilled visual information. Figure 2 (b) depicts direct access to raw images as external storage, which is slower and requires reprocessing the entire visual input.

4.2.2. Categories of Visual Thoughts

Visual thoughts can be expressed in different modalities, aligning with T-MCoT (textual expressions) and I-MCoT (visual/interleaved expressions).

The following figure (Figure 3 from the original paper) shows examples of visual thoughts in textual and visual expressions:

Figure 3: Visual Thoughts in textual expression (a) and visual expression (b). Specifically, the textual expression includes N-LANG and S-LANG, while the visual expression includes E-IMG and G-IMG.

Figure 3 illustrates N-LANG and S-LANG as textual expressions and E-IMG and G-IMG as visual expressions.

4.2.2.1. Textual Multimodal Chain-of-Thought (T-MCoT)

In T-MCoT, visual thoughts are generated as textual tokens, making $\mathcal{E}_{VT} = \mathcal{E}_{text}$. The generation of such a visual thought $\mathcal{R}_i$ is formally expressed as:

$$
\mathcal{R}_i = \underset{\mathcal{R}^{\mathcal{E}_{text}}}{\operatorname{argmax}} \; \pi_{text}\left(\mathbf{R}_{<i}, \mathcal{V}_I \longmapsto \mathcal{R}^{\mathcal{E}_{text}} \mid \mathcal{V}_I, \mathcal{Q}, \mathcal{T}_{\mathcal{E}_{text}}, \mathbf{R}_{<i}\right),
$$

Where:

  • $\pi_{text}(\cdot)$: Denotes the probability of generating a rationale $\mathcal{R}^{\mathcal{E}_{text}}$ from textual tokens.
  • All other symbols are as defined for the general visual thought equation.

Expression 1: Natural Language (N-LANG) N-LANG involves expressing visual thoughts through natural language descriptions. This facilitates effective visual information transfer by creating richer visual descriptions based on the question, enhancing vision-language alignment.

The reasoning process for N-LANG is formally defined as:

$$
\mathcal{R}_i = \underset{\mathcal{R}^{N\text{-}\mathrm{LANG}}}{\operatorname{argmax}} \; \pi_{text}\left(\mathbf{R}_{<i}, \mathcal{V}_I \longmapsto \mathcal{R}^{N\text{-}\mathrm{LANG}} \mid \mathcal{V}_I, \mathcal{Q}, \mathcal{T}_{N\text{-}\mathrm{LANG}}, \mathbf{R}_{<i}\right).
$$

Where:

  • $\mathcal{R}^{N\text{-}\mathrm{LANG}}$: A visual thought expressed in natural language.

  • $\mathcal{T}_{N\text{-}\mathrm{LANG}}$: Instructions specific to generating N-LANG visual thoughts.

  • Other symbols are as defined previously.

    Implementation: LVLMs are prompted to generate image captions as a precursor to subsequent reasoning steps.

Expression 2: Structured Language (S-LANG) S-LANG incorporates structured language, such as scene graphs, into reasoning pipelines. This approach has shown superior performance in tasks requiring precise object relationship understanding, particularly in mathematical contexts.

The formal expression for S-LANG visual thoughts is:

$$
\mathcal{R}_i = \underset{\mathcal{R}^{S\text{-}\mathrm{LANG}}}{\operatorname{argmax}} \; \pi_{text}\left(\mathbf{R}_{<i}, \mathcal{V}_I \longmapsto \mathcal{R}^{S\text{-}\mathrm{LANG}} \mid \mathcal{V}_I, \mathcal{Q}, \mathcal{T}_{S\text{-}\mathrm{LANG}}, \mathbf{R}_{<i}\right).
$$

Where:

  • $\mathcal{R}^{S\text{-}\mathrm{LANG}}$: A visual thought expressed in structured language.

  • $\mathcal{T}_{S\text{-}\mathrm{LANG}}$: Instructions specific to generating S-LANG visual thoughts.

  • Other symbols are as defined previously.

    Implementation: LVLMs are prompted to generate a scene graph (e.g., in JSON format) from input images and queries, which then guides the reasoning process.
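To make the S-LANG format concrete, a hypothetical JSON scene graph (an illustrative example, not taken from the paper's prompts) might look like this:

```python
import json

# Hypothetical S-LANG visual thought: a JSON scene graph distilled from the image,
# keeping only objects, attributes, and relationships relevant to the question.
scene_graph = {
    "objects": [
        {"id": "cat_1", "name": "cat", "attributes": ["black", "sitting"]},
        {"id": "sofa_1", "name": "sofa", "attributes": ["red", "two-seat"]},
    ],
    "relationships": [
        {"subject": "cat_1", "predicate": "on top of", "object": "sofa_1"}
    ],
}

# The serialized graph is appended to the prompt as the visual thought R^{S-LANG}.
visual_thought = json.dumps(scene_graph, indent=2)
print(visual_thought)
```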

4.2.2.2. Interleaved Multimodal Chain-of-Thought (I-MCoT)

In I-MCoT, visual thoughts are conveyed through visual expressions, meaning image tokens are integral. This expands on T-MCoT by integrating image editing and generation. The generation of such an image-based visual thought $\mathcal{R}_i$ is mathematically represented as:

$$
\mathcal{R}_i = \underset{\mathcal{R}^{\mathcal{E}_{image}}}{\operatorname{argmax}} \; \pi_{image}\left(\mathbf{R}_{<i}, \mathcal{V}_I \longmapsto \mathcal{R}^{\mathcal{E}_{image}} \mid \mathcal{V}_I, \mathcal{Q}, \mathcal{T}_{\mathcal{E}_{image}}, \mathbf{R}_{<i}\right),
$$

Where:

  • $\pi_{image}(\cdot)$: Denotes the probability of the model incorporating an image-based rationale step.

  • $\mathcal{R}^{\mathcal{E}_{image}}$: A visual thought expressed through an image.

  • $\mathcal{T}_{\mathcal{E}_{image}}$: Instructions specific to generating image-based visual thoughts.

  • Other symbols are as defined previously.

    Implementation: This framework uses two types of image-based visual thoughts: Edited Image and Generative Image.

Expression 3: Edited Image (E-IMG) E-IMG processes the original image by performing various visual operations such as grounding, depth estimation, and segmentation. By conveying edited image tokens, E-IMG enhances the LVLM's ability to interpret visual data and improves reasoning capabilities.

The formal definition for E-IMG visual thoughts is:

$$
\mathcal{R}_i = \underset{\mathcal{R}^{E\text{-}\mathrm{IMG}}}{\operatorname{argmax}} \; \pi_{image}\left(\mathbf{R}_{<i}, \mathcal{V}_I \underset{-}{\longrightarrow} \mathcal{R}^{E\text{-}\mathrm{IMG}} \mid \mathcal{V}_I, \mathcal{Q}, \mathcal{T}_{E\text{-}\mathrm{IMG}}, \mathbf{R}_{<i}\right),
$$

Where:

  • $\mathcal{R}^{E\text{-}\mathrm{IMG}}$: A visual thought generated by editing the original image.

  • $\mathcal{T}_{E\text{-}\mathrm{IMG}}$: Instructions specific to generating E-IMG visual thoughts.

  • $x \underset{-}{\longrightarrow} y$: A modified arrow indicating the process of editing $x$ to derive $y$.

  • Other symbols are as defined previously.

    Implementation: LVLMs are provided with edited images produced by vision tools (e.g., Grounding DINO, Semantic-SAM, DepthAnything), and these edited results are incorporated into subsequent reasoning.

Expression 4: Generative Image (G-IMG) G-IMG requires prompting generative models to create logically related images. These generated images serve as visual thoughts, leveraging the advancements in LVLMs and image generation capabilities.

The formal definition for G-IMG visual thoughts is:

$$
\mathcal{R}_i = \underset{\mathcal{R}^{G\text{-}\mathrm{IMG}}}{\operatorname{argmax}} \; \pi_{image}\left(\mathbf{R}_{<i}, \mathcal{V}_I \underset{-}{\longrightarrow} \mathcal{R}^{G\text{-}\mathrm{IMG}} \mid \mathcal{V}_I, \mathcal{Q}, \mathcal{T}_{G\text{-}\mathrm{IMG}}, \mathbf{R}_{<i}\right),
$$

Where:

  • $\mathcal{R}^{G\text{-}\mathrm{IMG}}$: A visual thought generated by an image generation model.

  • $\mathcal{T}_{G\text{-}\mathrm{IMG}}$: Instructions specific to generating G-IMG visual thoughts.

  • $x \underset{-}{\longrightarrow} y$: A modified arrow indicating the process of generating $y$ based on $x$.

  • Other symbols are as defined previously.

    Implementation: DALL-E 3 is used as a tool to generate novel images based on input queries, which then act as supplementary inputs to aid reasoning.

5. Experimental Setup

5.1. Datasets

The experiments utilize a selection of benchmarks from both math and commonsense categories to comprehensively evaluate the proposed visual thought concept and its different expressions.

  • IsoBench [10]: This benchmark is chosen for math tasks, including reasoning challenges related to chess, math problems, and graphs. It is particularly used to explore the effectiveness of visual thoughts in pure-text problems requiring spatial imagination. For IsoBench, function graphs are leveraged as image-form visual thoughts.

  • MMVP [39]: A commonsense benchmark that assesses LVLMs' capabilities such as visual grounding and object detection. It is used to test coarse-grained perception tasks, where identifying salient objects is crucial (e.g., N-LANG spotting "a butterfly").

  • V*Bench [47]: Also a commonsense benchmark, it evaluates fine-grained identification. It is divided into sub-tasks like position and attributes. V*Bench-position is used to assess reasoning about object relationships (e.g., S-LANG inferring relative positions of objects), while V*Bench-attributes requires detailed image analysis (e.g., E-IMG for magnifying and annotating areas of interest).

  • M3CoT [4]: This benchmark focuses on multi-domain multi-step multi-modal chain-of-thought reasoning in commonsense tasks, involving physical, social, and temporal reasoning. It requires multiple interaction rounds and is well-suited for G-IMG to test and refine reasoning hypotheses through iterative image generation.

  • CoMT [7]: A novel benchmark specifically designed for chain of multi-modal thought on LVLMs. It emphasizes complex multimodal operations like visual deletion, selection, and update during reasoning. This benchmark is used to evaluate scenarios requiring more complex visual operations and is where visual thoughts yield substantial performance improvements. For CoMT, interleaved-modal rationales are directly used to construct image-form visual thoughts.

    The choice of these diverse datasets ensures that the evaluation covers a wide range of visual reasoning complexities and task types, allowing for a thorough analysis of how different visual thought expressions perform under various conditions.

5.2. Evaluation Metrics

The paper primarily uses accuracy as the main quantitative evaluation metric for task performance. Additionally, for human evaluation of visual thought quality, three distinct metrics are introduced.

  1. Accuracy:

    • Conceptual Definition: Accuracy is a measure of correctness, typically defined as the proportion of total predictions that were correct. In the context of the benchmarks used, it quantifies how often the LVLM produces the correct answer for a given multimodal task (e.g., visual question answering, reasoning problems).
    • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • $\text{Number of Correct Predictions}$: The count of instances where the model's output matches the ground-truth answer.
      • $\text{Total Number of Predictions}$: The total number of instances (questions/tasks) evaluated.
  2. Human Evaluation Metrics for Visual Thought Quality: For Section 4.3, a human evaluation was conducted on a sample of 100 instances from MMVP, V*Bench, M3CoT, and CoMT. Three metrics were used, each scored on a 3-point ordinal scale (1: Low, 2: Medium, 3: High).

    • Image Relevance:

      • Conceptual Definition: This metric assesses how well the generated visual thought aligns with the semantic content and context of the input image. It evaluates if the visual thought accurately reflects the objects, relationships, or scenes depicted, maintaining faithfulness to the original visual information.
      • Scoring Scale:
        • Low (1): The visual thought misrepresents or incorrectly describes the image content.
        • Medium (2): The visual thought accurately captures the main content of the image.
        • High (3): The visual thought not only accurately represents the image but also provides a comprehensive description.
    • Expression Clarity:

      • Conceptual Definition: This criterion measures how clearly the visual thought conveys the intended reasoning or logic in response to the input query. It evaluates if visual elements (e.g., spatial arrangements, symbols, visual cues) are intuitively understandable and unambiguous, allowing for easy comprehension of the underlying rationale.
      • Scoring Scale:
        • Low (1): The visual thought fails to convey any visual logic relevant to the query.
        • Medium (2): The visual thought partially captures the visual logic relevant to the query.
        • High (3): The visual thought fully and clearly expresses the visual logic associated with the query.
    • Concise Expression:

      • Conceptual Definition: This metric evaluates the efficiency and succinctness of the visual thought in communicating information. An effective visual thought should avoid unnecessary visual complexity while preserving essential content, thereby enhancing interpretability and reducing cognitive load.
      • Scoring Scale:
        • Low (1): The visual thought is verbose, redundant, or difficult to understand.

        • Medium (2): The visual thought conveys its visual content and logic in a generally clear manner.

        • High (3): The visual thought presents the visual content and logic clearly, concisely, and in an easily understandable way.

          Additionally, the paper uses Spearman's rank correlation coefficient ($\rho$) and Pearson's correlation coefficient ($r$) to assess the statistical relationship between these human-evaluated qualities (fidelity, clarity, conciseness) and model accuracy. The p-value ($P$) is also reported to indicate the statistical significance of these correlations.
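A minimal sketch of how these statistics could be computed with SciPy, assuming per-setting human quality scores and corresponding accuracies are available; the variable names and toy values below are illustrative, not the paper's data:

```python
import numpy as np
from scipy import stats

def accuracy(predictions, ground_truth):
    """Fraction of predictions matching the ground-truth answers."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# toy example: mean human clarity scores (1-3) and accuracies for several settings
clarity = np.array([1.4, 1.9, 2.3, 2.6, 2.8])
acc     = np.array([0.42, 0.51, 0.55, 0.60, 0.63])

rho, p_spearman = stats.spearmanr(clarity, acc)   # rank correlation
r, p_pearson    = stats.pearsonr(clarity, acc)    # linear correlation
print(f"Spearman rho={rho:.2f} (p={p_spearman:.3f}), Pearson r={r:.2f} (p={p_pearson:.3f})")
```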

5.3. Baselines

The paper's method is primarily compared against different configurations of Multimodal Chain-of-Thought (MCoT) with and without the explicit incorporation of visual thoughts, and variations in how visual thoughts are expressed.

The main baselines and comparison points are:

  • w/o VT (Without Visual Thoughts): This is the primary baseline. It represents prompting LVLMs without any additional visual thoughts or explicit instructions to generate them. The model directly generates the reasoning path and final answer based solely on the given query, explicitly avoiding visual descriptions during the process. This baseline is crucial for demonstrating the effectiveness of visual thoughts.

  • Direct: In Appendix B.3, this baseline refers to direct prompting with GPT-4o to generate responses without any explicit CoT instructions or visual content avoidance. It serves to show that w/o VT performance degradation is not merely due to contextual perturbations but actual omission of visual information.

  • w/o CoT (Without Chain-of-Thought): In Appendix C.2, this baseline represents a vanilla setting where the model generates an answer directly without any CoT reasoning steps, implemented using GPT-4o. This is used to understand the broader role of CoT itself compared to the specific impact of visual thoughts within CoT.

  • Caption Only: This baseline involves providing GPT-4o generated textual descriptions of the original image as context, but without considering the questions. It distinguishes the cache-like function of visual thoughts from simple image captions.

  • Comparison among four visual thought categories: N-LANG, S-LANG, E-IMG, and G-IMG. These are compared against each other and against the w/o VT baseline to assess their individual effectiveness, reasoning costs, and suitability for different task types.

  • Diverse-VT: In Appendix E, a strategy that ensembles diverse visual thoughts reasoning paths, inspired by self-consistency, is introduced as a future direction, demonstrating a potential upper bound or enhanced approach.

    These baselines are representative because they cover different levels of multimodal reasoning: from direct answer generation (Direct, w/o CoT) to reasoning without explicit visual caching (w/o VT), to different forms of explicit visual caching (N-LANG, S-LANG, E-IMG, G-IMG), allowing for a comprehensive analysis of the visual thought concept.

5.4. Model Settings

The experiments were conducted using a range of Large Vision-Language Models (LVLMs), including both open-source and proprietary models, to ensure broad applicability and generalizability of the findings.

  • LLaVA-1.5 [21]: An open-source LVLM. Specifically, LLaVA-1.5-7B and LLaVA-1.5-13B models were used for attention and information flow analysis (Section 5.1 and 5.2).
  • Qwen2-VL [42]: Another open-source LVLM. Qwen2-VL-2B, Qwen2-VL-3B, and Qwen2-VL-7B models were utilized, particularly for attention analysis (Section 5.1) and verification of visual thought effectiveness in pure-text problems (Appendix B.4).
  • GPT-4o-mini [32]: A proprietary model from OpenAI.
  • GPT-4o [32]: The advanced proprietary multimodal model from OpenAI, used extensively across various experiments, including Caption Only generation (Appendix B.1) and as a baseline for w/o CoT (Appendix C.2).

Parameter Settings:

  • Temperature: For both the GPT series models (GPT-4o-mini, GPT-4o) and the open-source models (LLaVA-1.5, Qwen2-VL), the temperature parameter was adjusted within the range [0, 2]. A higher temperature leads to more diverse and less deterministic outputs, while a lower temperature makes the outputs more focused and deterministic.
  • Compute Resources: All open-source models completed inference on 2 A6000 48G GPUs. This specifies the hardware used, indicating substantial computational resources.

Prompt Design: The paper details a two-stage prompting framework for visual thought integration:

  1. Stage 1: Visual Thought Generation: The model is prompted to generate the visual thought based on the question and initial image.

  2. Stage 2: Visual Thought Reasoning: The generated visual thought (image or text) is then provided back to the model, along with the original query, as additional context to guide the generation of the reasoning path and final answer.

    Specific prompt templates are provided in Appendix C.1 for:

  • w/o VT: Instructs the model to generate reasoning and answer, explicitly avoiding visual descriptions.
  • N-LANG: Stage 1 prompts for a comprehensive caption based on the query. Stage 2 uses this caption for reasoning, again avoiding further visual descriptions.
  • S-LANG: Stage 1 prompts for a JSON-format scene graph (objects, attributes, relationships) relevant to the question. Stage 2 uses this scene graph for reasoning.
  • E-IMG: Stage 1 prompts for a series of image processing steps (e.g., segment_and_mark(), detection(objects)) using vision tools (GroundingDINO, Semantic-SAM, DepthAnything). Stage 2 uses the extra annotated image (output of tools) as input for reasoning.
  • G-IMG: Stage 1 prompts for a detailed text prompt for text-to-image generation. DALL·E 3 is used as the generative model. Stage 2 uses the additional synthesized image as input for reasoning.
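The two-stage framework can be sketched generically as below. The prompt strings, the `lvlm_generate` wrapper, and the `vision_tool`/`image_generator` callables are assumptions standing in for the actual model APIs, external tools, and the Appendix C.1 templates, not the paper's exact prompts:

```python
STAGE1_TEMPLATES = {  # condensed paraphrases of the prompting intent, not verbatim prompts
    "N-LANG": "Write a comprehensive caption of the image focused on: {q}",
    "S-LANG": "Produce a JSON scene graph (objects, attributes, relationships) relevant to: {q}",
    "E-IMG":  "List the image-editing actions (e.g. detection, segmentation) useful for: {q}",
    "G-IMG":  "Write a text-to-image prompt for an image that would help answer: {q}",
}

def mcot_with_visual_thought(image, question, expression, lvlm_generate,
                             vision_tool=None, image_generator=None):
    """Generic two-stage sketch: (1) produce a visual thought, (2) reason over it.

    `expression` is one of "N-LANG", "S-LANG", "E-IMG", "G-IMG".
    `lvlm_generate(prompt, images)` is a hypothetical wrapper around the LVLM.
    """
    # Stage 1: visual thought generation
    stage1_prompt = STAGE1_TEMPLATES[expression].format(q=question)
    stage1_output = lvlm_generate(stage1_prompt, images=[image])

    if expression in ("N-LANG", "S-LANG"):
        visual_thought_text, extra_images = stage1_output, []
    elif expression == "E-IMG":
        # stage-1 output is an action series; an external vision tool edits the image
        extra_images = [vision_tool(image, actions=stage1_output)]
        visual_thought_text = ""
    else:  # "G-IMG"
        # stage-1 output is a text-to-image prompt; a generative model synthesizes an image
        extra_images = [image_generator(stage1_output)]
        visual_thought_text = ""

    # Stage 2: reasoning conditioned on the original query plus the visual thought
    stage2_prompt = (f"Question: {question}\n"
                     f"Visual thought: {visual_thought_text}\n"
                     "Reason step by step using the visual thought, then give the final answer.")
    return lvlm_generate(stage2_prompt, images=[image] + extra_images)
```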

5.5. External Tools

For I-MCoT (E-IMG and G-IMG), external tools are used:

  • E-IMG: Grounding DINO [24], Semantic-SAM [20], DepthAnything [48] are employed to edit images based on action series.

  • G-IMG: DALL-E 3 [1] is used to generate novel images based on tailored image generation prompts.

    These external tools facilitate the creation of the visual components of visual thoughts for I-MCoT methods.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Effectiveness Verification of Visual Thoughts

The paper first verifies that visual thoughts are essential for MCoT's effectiveness.

The following figure (Figure 4 from the original paper) illustrates the effectiveness verification for visual thoughts:

Figure 4: Effectiveness Verification for Visual Thoughts. More details are in Appendix B.

Figure 4 (b) shows a comparison of accuracy across different tasks under three conditions: "Image-form visual thoughts" (original I-MCoT), "w/o visual thoughts" (cache cleared), and "Text-form visual thoughts" (cache restored with text descriptions). The results consistently show that omitting visual thoughts leads to a decrease in accuracy, sometimes even worse than reasoning directly from the query. Conversely, including visual thoughts consistently improves reasoning performance. This highlights their critical role in conveying visual information.

Furthermore, image-form visual thoughts consistently outperform text-form visual thoughts across different complexities, especially for harder scenarios (47.83% on CoMT-Selection). This suggests that images are a superior modality for transmitting detailed visual information for subsequent visual logic propagation.

The following figure (Figure 12 from the original paper) provides more results on the effectiveness verification for visual thoughts, specifically on pure text problems requiring spatial imagination:

Figure 12: Effectiveness Verification for Visual Thoughts on pure-text problems.

Figure 12 (a) further supports the claim that the absence of visual thoughts leads to a noticeable drop in accuracy, even for problems that are initially textual but require spatial imagination. This is consistent with the findings in Figure 4. Figure 12 (b) demonstrates that visual thoughts encode visual information more efficiently than captions in complex scenarios. The gain in accuracy is modest for simple scenes (7.24%), rises for moderate complexity (19.54%), and exceeds 50% for highly complex images where captions struggle. This confirms the efficiency of visual thoughts in scaling with image complexity.

The following figure (Figure 11 from the original paper) further substantiates the performance degradation without visual thoughts:

Figure 11: Results of Direct and w/o Visual Thought prompting on GPT-4o.

Figure 11 compares "Direct" prompting (vanilla GPT-4o without specific instructions) against "w/o VT" prompting (explicitly avoiding visual content). The "w/o VT" setting shows a noticeable performance drop even when the surrounding context is complete, confirming that the degradation isn't just a contextual perturbation but a result of entirely omitting visual information.

6.1.2. Performance of Different Categories of Visual Thoughts

The paper evaluates the four categories of visual thoughts (N-LANG, S-LANG, E-IMG, G-IMG).

The following are the results from Table 1 of the original paper:

| Model | MMVP | V*Bench position | V*Bench attributes | M3CoT physical | M3CoT social | M3CoT temporal | CoMT deletion | CoMT selection | CoMT update | AVG. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-7B [21] | | | | | | | | | | |
| w/o VT | 45.00 | 43.42 | 29.57 | 44.44 | 59.50 | 26.83 | 21.00 | 16.00 | 23.50 | 34.36 |
| N-LANG | 52.33 | 52.63 | 34.78 | 46.67 | 60.33 | 32.52 | 21.50 | 17.50 | 29.00 | 38.58 |
| S-LANG | 51.33 | 52.63 | 35.65 | 51.11 | 61.57 | 31.71 | 22.00 | 20.50 | 29.00 | 39.50 |
| E-IMG | 49.33 | 50.00 | 36.52 | 48.89 | 64.05 | 34.15 | 25.50 | 23.00 | 29.50 | 40.10 |
| G-IMG | 49.67 | 48.68 | 34.78 | 55.56 | 63.22 | 39.02 | 29.50 | 25.00 | 35.00 | 42.27 |
| Qwen2-VL-7B [42] | | | | | | | | | | |
| w/o VT | 70.00 | 55.26 | 68.70 | 80.00 | 75.21 | 74.80 | 26.00 | 18.00 | 37.00 | 56.11 |
| N-LANG | 71.33 | 61.84 | 73.04 | 83.33 | 79.75 | 81.30 | 28.00 | 19.57 | 40.50 | 59.85 |
| S-LANG | 71.00 | 68.42 | 70.43 | 85.56 | 78.10 | 79.67 | 28.50 | 20.00 | 42.00 | 60.41 |
| E-IMG | 71.00 | 65.79 | 72.17 | 85.56 | 80.99 | 67.48 | 28.50 | 23.50 | 45.50 | 60.05 |
| G-IMG | 65.00 | 59.21 | 51.30 | 84.44 | 80.17 | 82.93 | 29.50 | 25.50 | 44.50 | 58.06 |
| GPT-4o-mini [32] | | | | | | | | | | |
| w/o VT | 72.67 | 44.74 | 36.52 | 78.89 | 70.12 | 80.49 | 10.00 | 19.50 | 24.00 | 48.55 |
| N-LANG | 75.33 | 52.17 | 52.63 | 84.44 | 79.34 | 81.30 | 27.50 | 20.00 | 27.00 | 55.66 |
| S-LANG | 74.33 | 61.84 | 54.55 | 84.44 | 74.65 | 81.30 | 26.00 | 20.00 | 33.00 | 56.55 |
| E-IMG | 73.58 | 52.63 | 70.18 | 84.44 | 78.93 | 83.74 | 29.00 | 21.50 | 33.00 | 58.33 |
| G-IMG | 72.67 | 50.00 | 53.04 | 86.67 | 76.86 | 87.80 | 30.00 | 20.00 | 40.50 | 57.73 |
| GPT-4o [32] | | | | | | | | | | |
| w/o VT | 74.33 | 53.95 | 54.78 | 88.89 | 76.86 | 79.67 | 26.50 | 19.50 | 37.00 | 56.83 |
| N-LANG | 85.33 | 57.89 | 63.48 | 88.89 | 78.93 | 83.74 | 33.50 | 25.50 | 37.50 | 61.64 |
| S-LANG | 84.33 | 63.16 | 64.35 | 90.00 | 78.51 | 82.93 | 29.50 | 18.00 | 42.00 | 61.42 |
| E-IMG | 83.00 | 59.21 | 65.22 | 90.00 | 78.10 | 86.18 | 34.00 | 28.50 | 50.00 | 63.80 |
| G-IMG | 78.00 | 59.21 | 59.13 | 92.22 | 78.93 | 86.18 | 33.50 | 28.50 | 46.50 | 62.46 |

Table 1 shows that all four visual thought strategies (N-LANG, S-LANG, E-IMG, G-IMG) consistently improve performance across different tasks and LVLMs compared to the w/o VT baseline. This broadly confirms the effectiveness of incorporating visual thoughts.

The following figure (Figure 5 from the original paper) shows the proportion of performance improvement rate across tasks:

Figure 5: The proportion of performance improvement rate across tasks.

Figure 5 indicates that visual thoughts yield the most substantial performance improvements in tasks from the CoMT benchmark. This is attributed to CoMT focusing on complex multimodal operations like visual deletion, selection, and update, rather than simpler perception tasks. This suggests visual thoughts are particularly beneficial for complex reasoning scenarios.

Comparison of T-MCoT vs. I-MCoT:

  • T-MCoT (N-LANG, S-LANG) generally performs better on coarse-grained perception tasks (e.g., MMVP, V*Bench-position).
  • I-MCoT (E-IMG, G-IMG) excels in scenarios requiring fine-grained visual operations (e.g., V*Bench-attributes) and tasks demanding complex visual operations (e.g., M3CoT, CoMT). This implies that the choice of MCoT category should be adapted to the task's characteristics.

Reasoning Costs: The following are the results from Table 2 of the original paper:

| Expression | # Text Tokens | # Image Tokens |
| --- | --- | --- |
| N-LANG | 139.02 | - |
| S-LANG | 364.37 | - |
| E-IMG | 91.65 | 1,112.51 |
| G-IMG | 89.18 | 393.00 |

Table 2 compares the average number of text and image tokens generated for each visual thought expression. N-LANG and especially S-LANG (due to extensive structured snippets) have higher average text token counts. However, E-IMG and G-IMG involve a significantly higher average number of image tokens. This substantially increases the reasoning burden on LVLMs in terms of time and expense, indicating that visual thoughts with visual expression (I-MCoT) generally incur higher reasoning costs than textual expressions (T-MCoT). This highlights a trade-off between expressive power and computational efficiency.

6.1.3. Specific Scenarios for Different Visual Thoughts

The paper identifies optimal scenarios for each visual thought expression:

  • N-LANG: Excels in coarse-grained, perception-oriented reasoning tasks. It efficiently extracts macro features and provides a rapid overview of an entire scene (e.g., MMVP where identifying salient objects like "a butterfly" is important).
  • S-LANG: Strong in reasoning about object relationships. By converting images into detailed scene graphs, it precisely models spatial and semantic relationships (e.g., V*Bench-position for inferring relative positions).
  • E-IMG: Achieves strong results in detailed image analysis. It refines visual content through operations like magnification and annotation, aiding fine-grained feature detection (e.g., V*Bench-attributes for accurate attribute predictions).
  • G-IMG: Well-suited for multi-step reasoning through iterative image generation. It dynamically generates images to test hypotheses and deepen understanding of complex concepts over multiple interaction rounds (e.g., M3CoT).

6.1.4. Ablation Study on w/o CoT

The following are the results from Table 3 of the original paper:

| Model | MMVP | V*Bench position | V*Bench attributes | M3CoT physical | M3CoT social | M3CoT temporal | CoMT deletion | CoMT selection | CoMT update | AVG. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o VT | 74.33 | 53.95 | 54.78 | 88.89 | 76.86 | 79.67 | 26.50 | 19.50 | 37.00 | 56.83 |
| N-LANG | 85.33 | 57.89 | 63.48 | 88.89 | 78.93 | 83.74 | 33.50 | 25.50 | 37.50 | 61.64 |
| S-LANG | 84.33 | 63.16 | 64.35 | 90.00 | 78.51 | 82.93 | 29.50 | 18.00 | 42.00 | 61.42 |
| E-IMG | 83.00 | 59.21 | 65.22 | 90.00 | 78.10 | 86.18 | 34.00 | 28.50 | 50.00 | 63.80 |
| G-IMG | 78.00 | 59.21 | 59.13 | 92.22 | 78.93 | 86.18 | 33.50 | 28.50 | 46.50 | 62.46 |
| w/o CoT | 77.67 | 57.89 | 63.48 | 87.78 | 76.03 | 58.54 | 32.50 | 21.50 | 41.50 | 57.43 |

Table 3 (using GPT-4o) compares w/o CoT (without any CoT reasoning) against w/o VT (without visual thoughts, but potentially still with CoT) and the visual thought methods. For benchmarks requiring extensive multi-step reasoning like M3CoT, the w/o CoT variant shows significant performance degradation. This indicates that while visual thoughts improve performance within a CoT framework, the CoT reasoning process itself is fundamental for complex tasks. The trends of w/o VT and w/o CoT are not entirely consistent, suggesting distinct but complementary roles.

6.2. Factors Influencing Visual Thought Effectiveness

The paper explores the core factors influencing the effectiveness of visual thoughts.

The following figure (Figure 6 from the original paper) presents the correlation between accuracy and visual thought quality:

Figure 6: Analysis of the correlation between accuracy and visual thought quality. $\rho$: Spearman's correlation coefficient; $r$: Pearson's correlation coefficient; $P$: p-value of the related hypothesis tests.

  • Fidelity vs. Accuracy (Figure 6a): The correlation between visual thought fidelity (how accurately it preserves the content of the original image) and model accuracy is low (Spearman's $\rho < 0.15$, Pearson's $r < 0.15$, $p > 0.4$). This suggests that visual thoughts function more as a condensed cache of visual information rather than a faithful replica. They distill information relevant to reasoning, rather than merely reproducing the image.

  • Clarity vs. Accuracy (Figure 6b): A strong positive correlation is observed between the clarity of logical representation in visual thoughts and model accuracy (Spearman's $\rho > 0.8$, Pearson's $r > 0.8$, $p < 0.01$). This indicates that clearer visual logic enables the model to reason more effectively using the information stored in the visual thought cache.

  • Conciseness vs. Accuracy (Figure 6c): A strong correlation is also found between the conciseness of visual logic expression and reasoning accuracy. This implies that removing redundant or extraneous elements from visual thoughts enhances retrieval efficiency from the internal visual cache, further boosting effectiveness alongside clarity.

    The following figure (Figure 7 from the original paper) shows the factors correlation of I-MCoT:

    Figure 7: Factor correlations of I-MCoT.

  • External Noise (Figure 7): For I-MCoT, where visual thoughts (edited images, generative images) are produced by external tools, the noise introduced by these tools negatively impacts reasoning performance. Both E-IMG and G-IMG show a clear negative correlation between accuracy and noise level (manipulated by adjusting tool parameters like Box threshold, text threshold, guidance scale). Excessive noise can even cause reasoning performance to fall below native reasoning (Direct), emphasizing the critical need to manage external noise for effective I-MCoT.

6.3. Internal Rationale Behind the Visual Thought

The paper investigates the internal mechanisms of visual thoughts through attention and information flow analysis in LVLMs.

6.3.1. Visual Attention Analysis

The following figure (Figure 8 from the original paper) shows the attention distribution of visual thought and image in MCoT:

Figure 8: Attention distribution of Visual Thought and Image in MCoT.

  • Attention Shift (Figure 8a): In MCoT without visual thoughts (No-VIS), the model pays high attention to the original image. However, when visual thoughts are incorporated, the attention to the raw input image significantly decreases across all visual thought expressions. Instead, the model shifts its attention to the various visual thoughts. This redistribution of attention is crucial for the flow of visual information and supports the logical framework.
  • Deeper Layer Transmission (Figure 8a): A striking observation is that in deeper layers of the model, attention to the original image without visual thoughts diminishes sharply (approaching zero after the 12th layer). In contrast, with visual thoughts, attention to all expressions substantially increases and remains comparable to earlier layers, even beyond the 12th layer. This suggests that visual thoughts play a crucial role in transferring visual information into deeper layers of the model, enabling enhanced cross-modal interaction and more sophisticated reasoning.
  • Architectural Impact over Scale (Figure 8b): Comparing LLaVA (7B, 13B) and Qwen2-VL (3B, 7B) models, attention toward visual thoughts generally exceeds that toward the original image, especially in the latter 50% of layers. The increase in attention gains with deeper model layers indicates that the influence of visual thoughts depends more on model architecture (e.g., number of layers) than on parameter scale. For LLaVA, N-LANG and S-LANG show relatively low attention due to weaker language capabilities generating overly simplistic descriptions.
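A layer-wise comparison of this kind can be sketched as follows, assuming the per-layer attention maps and the token index ranges of the image, the visual thought, and the reasoning tokens are already known; the tensor shapes and indices below are illustrative, not the authors' analysis code:

```python
import torch

def layerwise_attention_share(attentions, image_idx, vt_idx, query_idx):
    """For each layer, average the attention that reasoning-token queries pay to
    (a) the original image tokens and (b) the visual-thought tokens.

    attentions: list of tensors, one per layer, each (num_heads, seq_len, seq_len)
    image_idx, vt_idx, query_idx: lists of token positions (illustrative indexing)
    """
    img_scores, vt_scores = [], []
    for attn in attentions:
        a = attn.mean(dim=0)                       # average over heads -> (seq, seq)
        q = a[query_idx]                           # rows: reasoning-token queries
        img_scores.append(q[:, image_idx].sum(dim=-1).mean().item())
        vt_scores.append(q[:, vt_idx].sum(dim=-1).mean().item())
    return img_scores, vt_scores

# toy check with random attention-like tensors for a 4-layer, 2-head, 12-token model
torch.manual_seed(0)
attn_maps = [torch.softmax(torch.randn(2, 12, 12), dim=-1) for _ in range(4)]
img, vt = layerwise_attention_share(attn_maps, image_idx=list(range(0, 4)),
                                    vt_idx=list(range(4, 8)), query_idx=list(range(8, 12)))
```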

6.3.2. Visual Information Flow Analysis

The paper analyzes the information flow by disturbing the model's internal processing and using saliency scores.

The following figure (Figure 9 from the original paper) illustrates the disturbance analysis of information flow within the LLaVA model:

Figure 9: The Disturbance Analysis of Information Flow within the LLaVA model.

  • Querying Visual Information from Cache (Figure 9): Without visual thoughts (No-VIS), the model may make incorrect choices (e.g., A instead of C). Incorporating visual thoughts leads to correct answers. Critically, intercepting the information flow from the query to the visual thought cache before layer 1 significantly prevents the model from choosing the correct answer. In contrast, disturbing the flow directly from the image has no effect. This confirms that querying visual information from the visual thought cache is the key mechanism for improving predictions.

    The following figure (Figure 10 from the original paper) shows the information flow distribution of visual thought and image in MCoT within LLaVA-1.5-7B and Qwen2-VL-2B:

    Figure 10: Information flow distribution of Visual Thought and Image in MCoT within LLaVA-1.5-7B and Qwen2-VL-2B.

  • Visual Thought as Intermediary (Figure 10): The information flow from visual thoughts to reasoning stages is considerably stronger than the direct flow from the image to reasoning. This emphasizes that visual thoughts mediate and organize visual data, optimizing it for reasoning. Most of the original image's input information is first routed to visual thoughts during derivation steps, which then feed into the reasoning stage. This two-stage process (raw pixels \to visual thoughts \to deeper reasoning) highlights visual thoughts as a crucial bridge, enabling LVLMs to generate better MCoT.
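One common way to estimate such information flow is attention-times-gradient saliency; the sketch below illustrates that general idea under assumed tensor shapes and token ranges, and may differ from the paper's exact procedure:

```python
import torch

def saliency_flow(attn, grad, src_idx, dst_idx):
    """Estimate information flow from source tokens to destination tokens in one layer
    as the element-wise product of attention weights and their gradients
    (a common attention-saliency heuristic; illustrative, not the authors' exact code).

    attn, grad: (num_heads, seq_len, seq_len) attention weights and dLoss/dAttn
    src_idx, dst_idx: token positions of the source span and the reasoning span
    """
    saliency = (attn * grad).abs().mean(dim=0)      # (seq_len, seq_len), averaged over heads
    return saliency[dst_idx][:, src_idx].sum().item()

# toy example comparing image->reasoning vs visual-thought->reasoning flow in one layer
torch.manual_seed(0)
attn = torch.softmax(torch.randn(2, 12, 12), dim=-1)
grad = torch.randn_like(attn)                        # would come from loss.backward() in practice
flow_img = saliency_flow(attn, grad, src_idx=list(range(0, 4)), dst_idx=list(range(8, 12)))
flow_vt  = saliency_flow(attn, grad, src_idx=list(range(4, 8)), dst_idx=list(range(8, 12)))
```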

6.4. Future Work: Diverse-VT

The following are the results from Table 4 of the original paper:

| Model | MMVP | V*Bench position | V*Bench attributes | M3CoT physical | M3CoT social | M3CoT temporal | CoMT deletion | CoMT selection | CoMT update | AVG. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o VT | 74.33 | 53.95 | 54.78 | 88.89 | 76.86 | 79.67 | 26.50 | 19.50 | 37.00 | 56.83 |
| N-LANG (Maj@4) | 85.00 | 57.89 | 63.48 | 88.89 | 78.93 | 83.74 | 33.50 | 25.50 | 37.50 | 61.60 |
| S-LANG (Maj@4) | 83.67 | 63.16 | 64.35 | 90.00 | 78.51 | 82.93 | 29.50 | 18.00 | 42.00 | 61.35 |
| E-IMG (Maj@4) | 82.33 | 59.21 | 65.22 | 92.22 | 78.10 | 86.18 | 33.50 | 28.00 | 49.00 | 63.75 |
| G-IMG (Maj@4) | 78.00 | 59.21 | 63.48 | 92.22 | 78.93 | 86.18 | 33.50 | 28.50 | 46.50 | 62.95 |
| Diverse-VT | 85.00 | 63.16 | 65.22 | 92.22 | 78.93 | 86.18 | 34.00 | 28.50 | 50.00 | 64.80 |

Table 4 presents the results of Diverse-VT, a strategy inspired by self-consistency that ensembles diverse visual thoughts reasoning paths. Diverse-VT (implemented on GPT-4o) achieves the best overall performance (64.80% AVG), surpassing individual visual thought categories. This demonstrates the feasibility and benefit of integrating multiple visual thoughts to further enhance visual information transmission and reasoning performance in MCoT.
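Conceptually, Diverse-VT resembles self-consistency applied over visual-thought paths: answer the same question with several visual-thought expressions and take a majority vote over the final answers. A minimal sketch, with the ensembling and tie-breaking details assumed rather than taken from the paper:

```python
from collections import Counter

def diverse_vt_answer(question, image, answer_with_expression,
                      expressions=("N-LANG", "S-LANG", "E-IMG", "G-IMG")):
    """Majority vote over answers obtained with different visual-thought expressions.

    `answer_with_expression(question, image, expression)` is a hypothetical callable
    that runs the full two-stage MCoT pipeline and returns a final answer string.
    """
    answers = [answer_with_expression(question, image, e) for e in expressions]
    (best, _), = Counter(answers).most_common(1)   # ties resolved by first occurrence
    return best
```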

7. Conclusion & Reflections

7.1. Conclusion Summary

This study introduces the concept of visual thoughts as a unified mechanism explaining the effectiveness of Multimodal Chain-of-Thought (MCoT) approaches in Large Vision-Language Models (LVLMs). Visual thoughts are crucial for transferring visual information from input images to the reasoning process and deeper transformer layers, acting as a distilled visual cache. The paper defines and evaluates four strategies for expressing visual thoughts: Natural Language (N-LANG), Structured Language (S-LANG), Edited Image (E-IMG), and Generative Image (G-IMG), demonstrating that their clarity and conciseness significantly impact MCoT performance. Furthermore, internal analyses of attention distribution and information flow reveal that visual thoughts serve as critical intermediaries, transmitting visual information more effectively and deeply than raw images alone. The findings provide a comprehensive understanding of MCoT's underlying mechanisms and offer insights for future innovations.

7.2. Limitations & Future Work

The authors acknowledge the following limitations:

  • Exclusion of Multiple Rounds of Visual Thought Interactions: To facilitate variable control and streamline analysis, the current study excludes multiple rounds of visual thought interactions. This implies that the full potential of iterative visual reasoning, particularly relevant for I-MCoT and complex tasks, was not explored.

  • Difficulty in Generating I-MCoT: Due to the challenges in generating high-quality I-MCoT during Any-to-Any LVLM inference (which often leads to poor logical quality), DALL-E 3 was used as an external tool for G-IMG generation instead of directly relying on the LVLMs for interleaved image generation. This might not fully capture the nuances of an LVLM natively generating images as part of its thought process.

    Based on these limitations and findings, the authors suggest future research directions:

  • Exploring Multiple Rounds of Visual Thought Interactions: This is a direct extension of the acknowledged limitation, suggesting that future work should investigate how multiple, iterative interactions with visual thoughts can enhance reasoning in complex scenarios.

  • Enhancing MCoT Performance: Inspired by strategies like self-consistency, the paper's Diverse-VT experiment (ensembling diverse visual thoughts) is presented as a promising direction. This suggests exploring methods to combine or leverage multiple visual thought pathways to further improve reasoning.

  • Improving I-MCoT Generation: Addressing the difficulty in generating high-quality I-MCoT natively within LVLMs could be a significant area of research.

  • Further Breakthroughs for MCoT Research: The authors hope that the concept of visual thoughts will inspire broader breakthroughs in understanding and developing MCoT.

7.3. Personal Insights & Critique

This paper provides a highly valuable conceptual framework for understanding Multimodal Chain-of-Thought, moving beyond mere empirical observation to a mechanistic explanation. The analogy of visual thoughts as a "distilled visual cache" is intuitive and powerful, immediately clarifying why intermediate steps involving visual information are beneficial. The systematic categorization of visual thought expressions (N-LANG, S-LANG, E-IMG, G-IMG) is rigorous and provides a clear taxonomy for future research.

One of the most impactful findings is the internal analysis, demonstrating how visual thoughts facilitate deeper visual information transmission within the transformer layers and that model architecture (number of layers) plays a more significant role than parameter scale. This offers concrete guidance for LVLM design, suggesting that architects should prioritize mechanisms that effectively integrate and propagate visual thoughts into deeper processing stages. The emphasis on clarity and conciseness as key factors for visual thought effectiveness is also a practical takeaway for prompt engineering and visual thought generation strategies.

A potential area for deeper exploration, stemming from the acknowledged limitations, is the dynamic generation of I-MCoT. While the paper used DALL-E 3 as an external tool for G-IMG, the ultimate goal for truly integrated I-MCoT would be for LVLMs to natively generate high-quality visual steps as part of their internal thought process without relying on external, potentially noisy, tools. Investigating training paradigms that enable this native multimodal generation capability could be a frontier. Additionally, while the paper highlights the computational cost of I-MCoT, future work could explore efficient I-MCoT generation methods or adaptive strategies that decide whether to incur the cost based on task complexity or required granularity.

The concept of visual thoughts is highly transferable. It could be applied to other multimodal domains beyond images and text, such as video-language models, where visual thoughts might take the form of distilled temporal visual events or motion patterns. It could also inspire new evaluation benchmarks specifically designed to measure the clarity and conciseness of internal representations within black-box models, pushing the boundaries of interpretability. The rigorous analysis in this paper sets a strong foundation for a more principled approach to multimodal reasoning.
