ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
TL;DR Summary
The paper introduces ROVER, a benchmark for evaluating reciprocal cross-modal reasoning in unified multimodal models. Its 1,312 tasks assess how one modality guides, verifies, or refines the other, revealing a sharp dissociation between models' physical and symbolic reasoning.
Abstract
Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
1.2. Authors
Yongyuan Liang*, Wei Chow*, Fn Li, Ziqiao Ma*, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang. (Note: * indicates equal contribution)
1.3. Journal/Conference
Venue: arXiv (preprint). Status: Published on November 3, 2025 (UTC). Reputation: arXiv is a highly influential repository for preprints in Computer Science, particularly in AI and Machine Learning, often hosting cutting-edge research before formal conference proceedings.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces ROVER, a benchmark designed to evaluate reciprocal cross-modal reasoning in Unified Multimodal Models (UMMs). Unlike existing benchmarks that assess text and image capabilities in isolation, ROVER tests a model's ability to use one modality to guide, verify, or refine the other. The benchmark consists of 1,312 tasks across 1,876 images, divided into two settings:
- Verbally-Augmented Reasoning for Visual Generation (ROVER-IG): Using text reasoning to guide image synthesis.
- Visually-Augmented Reasoning for Verbal Generation (ROVER-TG): Using generated intermediate images to aid text-based question answering.
Experiments on 17 models reveal that interleaved models (capable of generating mixed sequences of text and image) significantly outperform others. However, models show a dissociation between physical reasoning (where they succeed) and symbolic reasoning (where they fail).
1.6. Original Source Link
https://arxiv.org/abs/2511.01163v1 (Preprint)
2. Executive Summary
2.1. Background & Motivation
Core Problem: The field of Artificial Intelligence has seen the rise of Unified Multimodal Models (UMMs), also known as omnimodal models, which can understand and generate both text and images. However, current evaluation methods treat these modalities separately. Text benchmarks focus on language logic, while visual benchmarks focus on pixel quality or simple instruction following.
Importance: True multimodal intelligence requires reciprocal reasoning—the ability to weave text and images together in a reasoning chain. For example, a model should be able to "think" in text to plan a complex image, or sketch an intermediate diagram to solve a geometry problem. Existing benchmarks fail to capture this interplay, leaving a gap in understanding how well UMMs can leverage one modality to enhance the other.
Innovation: ROVER is the first benchmark explicitly designed to test this bidirectional support (reciprocity) between modalities during the generation and reasoning process, rather than just checking the final output against a static ground truth.
2.2. Main Contributions & Findings
Contributions:
- ROVER Benchmark: A rigorous, human-annotated dataset with over 1,300 tasks targeting reciprocal reasoning.
- Two Evaluation Settings:
- ROVER-IG: Tests if models can use "verbal thinking" to improve image generation.
- ROVER-TG: Tests if models can use "visual thinking" (generating images) to improve text answers.
- Comprehensive Evaluation Protocol: A multi-dimensional scoring system using a calibrated VLM-as-judge (Vision-Language Model) to assess reasoning process, alignment, and visual quality.
Key Findings:
- Cross-Modal Reasoning Drives Quality: In visual generation, models that use interleaved reasoning (generating text rationales before images) significantly outperform those that don't.
- Physical vs. Symbolic Dissociation: UMMs are good at "literal" visual reasoning (e.g., simulating physics or perception) but fail at "symbolic" visual reasoning (e.g., drawing auxiliary lines for geometry or abstract puzzles). In symbolic tasks, attempting to use visual reasoning often hurts performance.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Unified Multimodal Models (UMMs) / Omnimodal Models: These are AI architectures designed to process and generate sequences containing both text and images. Unlike older models that specialized in one (e.g., LLMs for text, diffusion models for images), UMMs often tokenize images (converting image patches into discrete codes similar to words) to process them alongside text in a single Transformer stream. This allows them to "read" and "write" images interchangeably with text.
- Reciprocal Cross-Modal Reasoning: This refers to the cyclical support between modalities.
  - Text-to-Image Reasoning: The model generates a textual plan or logical chain (e.g., "First, I need to visualize the object falling...") to guide the generation of the final image.
  - Image-to-Text Reasoning: The model generates a visual aid (e.g., a diagram) to help itself answer a text-based question.
- Chain-of-Thought (CoT): A prompting technique where the model is encouraged to output intermediate reasoning steps ("Let's think step by step") before the final answer. In this paper, CoT is extended to include visual steps.
- Interleaved Generation: The capability of a model to generate a sequence of outputs that switches between text and images (e.g., Text → Image → Text); a minimal sketch of such a sequence follows this list.
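To make the idea of a single interleaved token stream concrete, here is a minimal sketch of how such a sequence might be assembled. The tokenizer objects, the `<boi>`/`<eoi>` marker tokens, and the method names are illustrative assumptions, not the internals of any specific model.

```python
# Minimal, hypothetical sketch of an interleaved text/image token stream.
# Tokenizers, marker tokens, and method names are illustrative assumptions,
# not the API of any particular unified multimodal model.

BOI, EOI = "<boi>", "<eoi>"  # assumed begin/end-of-image marker tokens

def build_interleaved_sequence(segments, text_tok, image_tok):
    """Flatten a Text -> Image -> Text exchange into one token list."""
    tokens = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(text_tok.encode(payload))        # subword IDs
        elif kind == "image":
            tokens.append(text_tok.token_to_id(BOI))
            tokens.extend(image_tok.encode(payload))       # discrete visual codes (e.g., VQ indices)
            tokens.append(text_tok.token_to_id(EOI))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return tokens

# Usage: a reasoning chain that "thinks" in text, emits an image, then concludes in text.
# segments = [("text", "The ice cube will melt, so..."),
#             ("image", melted_ice_image),
#             ("text", "Final state: a puddle of water.")]
```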
3.2. Previous Works
- Visual Understanding Benchmarks: MMBench, MathVista, MMMU. These focus on inputting an image and outputting text. They test perception but not the generation of images for reasoning.
- Visual Generation Benchmarks: GenEval, Pick-a-Pic, ImageReward. These focus on text-to-image synthesis quality or alignment. They do not evaluate the reasoning process behind the generation.
- Reasoning-Guided Editing: Benchmarks like ReasonPix2Pix and ReasonEdit involve editing images based on instructions.
- Unified Benchmarks: Recent works like Unified-Bench and RISEBench have started to look at unified capabilities but often lack deep process evaluation or reciprocal reasoning assessment.
3.3. Differentiation Analysis
ROVER vs. The Rest:
- Process vs. Outcome: While previous benchmarks measure if the final image looks good (outcome), ROVER measures if the reasoning leading to it was sound (process).
- Reciprocity: ROVER is unique in requiring the model to generate visual artifacts to help its own text reasoning (ROVER-TG), simulating how humans use scratchpads or diagrams.
- Hybrid Evaluation: ROVER combines VLM-based judging with human verification to assess the alignment between the generated rationale and the generated output, a dimension often ignored.
The following figure (Figure 1 from the original paper) illustrates the two reciprocal settings of ROVER:
This figure is a schematic of unified multimodal models under the ROVER benchmark. The left side shows the ROVER-Image Gen workflow, where verbally-augmented reasoning produces an output image from the input image; the right side shows the ROVER-Text Gen workflow, where visually-augmented reasoning produces the textual answer. The figure also includes notation such as the symbol for the ground-truth output answer.
4. Methodology
4.1. Principles
The core principle of ROVER is that true omnimodal intelligence is not just about doing Task A (Text) and Task B (Vision) separately, but using Task A to improve Task B, and vice versa. The methodology is built around defining complex tasks that require this interaction and establishing a rigorous rubric to score it.
4.2. ROVER-IG: Verbally-Augmented Reasoning for Visual Generation
This setting evaluates if a model can use language-based reasoning chains to guide faithful image synthesis.
4.2.1. Taxonomy & Domains
The benchmark spans 4 Conceptual Domains and 7 Reasoning Subtasks.
- Domains:
  - Natural Science: Phenomena, experimental processes.
  - Culture & Art: Artistic styles, artifacts.
  - Common Sense: Everyday scenarios.
  - Logic & Math: Abstract puzzles, geometry.
- Reasoning Subtasks:
  - Temporal: Predicting sequences or changes over time (e.g., "How will this flower look in a week?").
  - Spatial: Geometric relationships, perspective changes (e.g., "View this scene from top-down").
  - Causal: Cause-effect mechanisms (e.g., "What happens to the apple after soaking in salt water?").
  - Imaginative: Creative integration (e.g., "Reimagine this studio as a cyberpunk workspace").
  - Quantitative: Numerical changes (e.g., "Remove exactly 3 fruits").
  - Puzzle: Abstract pattern discovery.
  - Geometry: Mathematical principles in visualization.
The following figure (Figure 2 from the original paper) visualizes the ROVER-IG taxonomy with examples:
This figure is a schematic showing the seven reasoning subtasks (temporal, spatial, causal, imaginative, quantitative, puzzle, and geometry) across the four domains (Natural Science, Culture & Art, Common Sense, Logic & Math), with example tasks for each. The tasks assess how well unified multimodal models visualize and reason when generating images.
4.2.2. Task Structure
Each task instance consists of the following fields (a hypothetical schema is sketched after this list):
- Input Image: The starting state.
- Reasoning Prompt: An instruction requiring a chain of constraints (e.g., "Identify the object, determine its material properties, and show its state after being dropped").
- Target Description: A detailed text description of the expected visual output.
- Domain Keywords: Concepts guiding the reasoning.
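As a concrete illustration of this structure, the following is a hypothetical schema for one ROVER-IG task; the field names and example values are assumptions for readability, not the benchmark's actual file format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RoverIGTask:
    """Hypothetical container mirroring the fields listed above."""
    input_image: str                  # path/URL of the starting-state image
    reasoning_prompt: str             # chained instruction the model must satisfy
    target_description: str           # text description of the expected visual output
    domain_keywords: List[str] = field(default_factory=list)  # concepts guiding the reasoning
    domain: str = "Natural Science"   # one of the 4 conceptual domains
    subtask: str = "Causal"           # one of the 7 reasoning subtasks

example = RoverIGTask(
    input_image="apple_on_table.jpg",
    reasoning_prompt=("Identify the object, determine its material properties, "
                      "and show its state after being dropped."),
    target_description="A bruised apple lying on the floor beside the table.",
    domain_keywords=["gravity", "bruising", "impact"],
)
```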
4.2.3. Evaluation Protocol (Metric Construction)
Since measuring reasoning in images is hard for standard metrics (like FID), the authors use a VLM-as-Judge approach (specifically GPT-4.1) calibrated with expert human annotations.
The scoring is multidimensional; a minimal sketch of the judging loop follows the list:
- Reasoning Process (RP):
  - Evaluates the text rationale generated by the model.
  - Criteria: Logical structure, domain knowledge application, completeness.
  - Score: 0-100 (normalized from 1-5 scale).
- Reasoning Visual (RV):
  - Evaluates the final generated image against the Target Description.
  - Criteria: Does the image reflect the correct reasoning outcome?
- Reasoning Alignment (Align.):
  - Measures consistency between the text rationale and the image.
  - Key Question: Did the model actually draw what it said it would draw? (e.g., if the text says "The ice melts," does the image show water?)
- Visual Consistency (VC):
  - Ensures non-target elements (background, unrelated objects) remain unchanged.
- Image Quality (IQ):
  - Assesses visual fidelity, artifacts, and structural coherence.
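A minimal sketch of the judging loop is shown below. The rubric text, the `call_judge` client, and the linear 1-5 to 0-100 rescaling are assumptions used for illustration; only the five dimensions themselves come from the paper.

```python
import json

DIMENSIONS = ["RP", "RV", "Align", "VC", "IQ"]  # the five ROVER-IG dimensions above

RUBRIC = (
    "Score the candidate output on a 1-5 scale for each dimension:\n"
    "RP: logical structure, domain knowledge, completeness of the rationale.\n"
    "RV: does the final image reflect the correct reasoning outcome?\n"
    "Align: does the image depict what the rationale said it would?\n"
    "VC: are non-target elements (background, unrelated objects) preserved?\n"
    "IQ: visual fidelity and structural coherence.\n"
    'Return JSON such as {"RP": 4, "RV": 3, "Align": 4, "VC": 5, "IQ": 4}.'
)

def judge_ig_sample(task, rationale, output_image, call_judge):
    """Score one ROVER-IG output with a VLM judge.

    `call_judge` stands in for whatever client wraps the judge model
    (the paper uses GPT-4.1); its signature here is an assumption.
    """
    prompt = (
        f"{RUBRIC}\n\nReasoning prompt: {task['reasoning_prompt']}\n"
        f"Target description: {task['target_description']}\n"
        f"Model rationale: {rationale}"
    )
    raw = call_judge(prompt=prompt, images=[task["input_image"], output_image])
    scores = json.loads(raw)                                   # expects the JSON object requested above
    return {d: (scores[d] - 1) / 4 * 100 for d in DIMENSIONS}  # 1-5 -> 0-100
```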
4.3. ROVER-TG: Visually-Augmented Reasoning for Verbal Generation
This setting evaluates if a model can generate intermediate images ("visual thoughts") to improve its performance on text-based questions.
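A minimal sketch of this two-stage "visual thought" loop is given below, assuming a hypothetical `model` wrapper with `generate_image`/`generate_text` methods; the method names and prompts are illustrative, not the paper's protocol.

```python
def answer_with_visual_thought(model, question, input_images):
    """Two-stage inference: sketch an intermediate image, then answer with it in context."""
    # Step 1: ask the model to draw whatever intermediate state or diagram would help.
    visual_thought = model.generate_image(
        prompt=("Before answering, draw the intermediate state or diagram "
                f"that would help solve this question:\n{question}"),
        images=input_images,
    )
    # Step 2: answer the question conditioned on the self-generated image.
    answer = model.generate_text(
        prompt=f"Using your sketch, answer the question:\n{question}",
        images=list(input_images) + [visual_thought],
    )
    return visual_thought, answer
```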
4.3.1. Taxonomy
This setting focuses on 3 problem domains where visual aids are naturally helpful:
- World Model:
  - Task: Robot manipulation & physical dynamics.
  - Requirement: Predict intermediate states of an environment given actions.
- Logic & Math:
  - Task: Geometry & Puzzles.
  - Requirement: Draw auxiliary lines or diagrams to solve a problem (e.g., "Find x in this triangle").
- Visual Perception:
  - Task: Multi-view reasoning, Jigsaw puzzles.
  - Requirement: Generate missing pieces or alternative views to verify understanding.
The following figure (Figure 3 from the original paper) shows the ROVER-TG workflow:
This figure illustrates example tasks for different reasoning abilities, including agent-informed state change, geometry, physics-informed state change, jigsaw, multi-view reasoning, and puzzle tasks. Each task shows the input and the visual reasoning step, covering scenarios such as robot manipulation, geometry problems, and physical state changes, highlighting the importance of cross-modal reasoning.
4.3.2. Evaluation Protocol
The evaluation again uses a VLM judge across 3 dimensions; a minimal sketch follows the list:
- Interleaved Reasoning Quality (IR):
  - Evaluates the generated intermediate image.
  - Criteria: Is the diagram/simulation physically and logically correct? Is it relevant to the question?
- Final Answer Accuracy (Acc.):
  - Measures if the final text answer matches the ground truth.
- Reasoning-Answer Alignment (Align.):
  - Quantifies if the generated image actually helped.
  - Criteria: Is there a causal link between the visual aid and the correct text answer?
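The three checks could be wired up as in the sketch below; `call_judge` is again a placeholder client assumed to return a 1-5 rating as text, and the prompts are paraphrases of the criteria above rather than the paper's exact rubric.

```python
def judge_tg_sample(question, ground_truth, visual_thought, answer, call_judge):
    """Score one ROVER-TG output on IR, Acc., and Align. (1-5 each)."""
    ir = int(call_judge(  # Interleaved Reasoning quality
        prompt=("Rate 1-5: is this intermediate image physically and logically "
                f"correct, and relevant to the question?\nQuestion: {question}"),
        images=[visual_thought],
    ))
    acc = int(call_judge(  # Final Answer Accuracy against the ground truth
        prompt=("Rate 5 if the answer matches the ground truth, 1 otherwise.\n"
                f"Answer: {answer}\nGround truth: {ground_truth}"),
        images=[],
    ))
    align = int(call_judge(  # Reasoning-Answer Alignment: did the sketch support the answer?
        prompt=("Rate 1-5: does this answer causally follow from the intermediate image?\n"
                f"Answer: {answer}"),
        images=[visual_thought],
    ))
    return {"IR": ir, "Acc": acc, "Align": align}
```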
5. Experimental Setup
5.1. Datasets
- Source: Images were curated from large-scale web datasets and specific domains (robotics datasets, physical simulation videos, logic puzzles).
- Scale:
- Total: 1,312 tasks grounded in 1,876 images.
- ROVER-IG: 908 tasks (1,009 images).
- ROVER-TG: 404 tasks.
- Curation: Tasks were generated collaboratively by domain experts and LLMs, then verified by humans to ensure they require complex reasoning, not just simple recognition.
5.2. Models Evaluated
The paper evaluates 17 Unified Multimodal Models across distinct categories:
- Closed-Source UMMs:
  - Nano Banana (Gemini 2.5 Flash Image)
  - Gemini 2.0 Flash
  - GPT-5
- Open-Source UMMs:
  - BAGEL / BAGEL-Think
  - UniCoT
  - Step1X-Edit
  - Ovis-U1, Emu2-Gen, OmniGen2, etc.
- Reasoning Language Models (Baseline):
  - GPT-4.1 (used as a text-only baseline to see if text reasoning alone is sufficient).
- Image Editing Models (Baseline):
  - Qwen-Image-Edit, FLUX.1 Kontext, UltraEdit.
5.3. Evaluation Metrics Summary
The metrics (RP, RV, Align, IR, Acc) are calculated by feeding the model's output (text + image) together with the ground truth and rubric into the VLM judge. The judge outputs a 1-5 score and a rationale; the score is then normalized to 0-100.
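For concreteness, a simple linear rescaling is assumed below (the exact mapping is not spelled out in this summary): a rubric score of 4 becomes 75 and a 5 becomes 100.

```python
def normalize(raw, lo=1, hi=5):
    """Map a judge score on the 1-5 rubric onto 0-100 (assumed linear rescaling)."""
    return (raw - lo) / (hi - lo) * 100

assert normalize(4) == 75.0 and normalize(5) == 100.0
```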
6. Results & Analysis
6.1. ROVER-IG: Visual Generation Results
Core Finding: Cross-modal reasoning capability dictates visual generation quality.
The following table (Table 2 from the original paper) presents the results for Verbally-Augmented Visual Generation.
| Verb.-Aug. Reasoning for Visual Generation | Natural Science | | | Culture & Art | | | Common Sense | | | Logic & Math | | | Overall | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | RP | Align. | RV | RP | Align. | RV | RP | Align. | RV | RP | Align. | RV | RP | Align. | RV |
| Closed-source Unified Models | |||||||||||||||
| Nano Banana | 64.8 | 88.8 | 77.3 | 68.1 | 81.9 | 76.6 | 61.8 | 85.0 | 74.8 | 78.6 | 66.1 | 55.1 | 67.0 | 82.3 | 73.2 |
| Gemini 2.0 Flash | 64.1 | 88.4 | 68.8 | 62.8 | 78.7 | 71.9 | 57.8 | 74.4 | 66.1 | 74.5 | 63.2 | 42.6 | 64.8 | 78.6 | 62.3 |
| GPT-5 | 61.7 | 87.9 | 71.3 | 63.4 | 80.2 | 72.6 | 56.3 | 77.2 | 65.3 | 75.4 | 60.2 | 45.8 | 64.2 | 76.4 | 63.7 |
| Open-source Unified Models | |||||||||||||||
| BAGEL-Think | 58.1 | 64.2 | 54.0 | 53.2 | 78.0 | 63.7 | 50.1 | 69.4 | 55.9 | 57.7 | 26.2 | 20.8 | 54.3 | 64.4 | 52.7 |
| BAGEL | - | - | 35.9 | - | - | 49.2 | - | - | 42.0 | - | - | 27.1 | - | - | 40.5 |
| Step1X-Edit v1.2 | 29.7 | 59.7 | 46.2 | 31.4 | 71.6 | 50.6 | 28.7 | 61.0 | 46.1 | 77.5 | 35.5 | 18.4 | 37.0 | 60.3 | 43.5 |
| UniCoT | 52.4 | 68.9 | 38.2 | 57.3 | 69.2 | 63.9 | 53.1 | 64.3 | 56.3 | 50.3 | 23.1 | 21.5 | 50.7 | 56.3 | 47.4 |
Analysis:
- The Power of "Think": Compare BAGEL-Think (which uses interleaved reasoning text) vs. BAGEL (direct generation). BAGEL-Think scores 52.7 in overall Reasoning Visual (RV), while BAGEL scores only 40.5. This indicates that explicitly generating a text rationale before the image drastically improves the image's adherence to reasoning constraints.
- Closed vs. Open Source: Closed-source models (Nano Banana, GPT-5) dominate, with Nano Banana achieving an overall RV of 73.2. This suggests a strong correlation between general model scale/capability and reciprocal reasoning.
- Domain Difficulty: All models struggle significantly in Logic & Math (RV scores drop to ~20-50), compared to Science or Art. This indicates current UMMs find abstract/symbolic visual generation much harder than rendering natural objects.
6.2. ROVER-TG: Verbal Generation Results
Core Finding: Visual reasoning is a double-edged sword. It helps in physical tasks but hurts in symbolic ones.
The following table (Table 3 from the original paper) shows results for Visually-Augmented Verbal Generation.
| Verb.+Vis. Reasoning for Verbal Generation | Reasoning Modalities | World Model | | | Logic & Math | | | Visual Perception | | | Overall | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | IR | Align. | Acc. | IR | Align. | Acc. | IR | Align. | Acc. | IR | Align. | Acc. |
| Closed-source Unified Models | | | | | | | | | | | | | |
| Nano Banana | Verb.+Vis. | 35.3 | 62.0 | 40.6 | 14.8 | 61.2 | 44.9 | 66.5 | 56.8 | 50.0 | 38.8 | 60.0 | 43.6 |
| Gemini 2.0 Flash | Verb.+Vis. | 27.1 | 46.7 | 35.6 | 11.4 | 47.9 | 30.4 | 49.5 | 46.8 | 43.0 | 29.3 | 47.1 | 36.3 |
| | Verb. | - | - | 36.9 | - | - | 42.0 | - | - | 43.7 | - | - | 40.8 |
| GPT-5 | Verb.+Vis. | 32.8 | 61.5 | 39.2 | 13.2 | 58.7 | 45.6 | 62.7 | 54.9 | 45.5 | 36.2 | 60.9 | 43.4 |
| | Verb. | - | - | 33.2 | - | - | 32.6 | - | - | 43.6 | - | - | 42.8 |
Analysis:
- The "Verb." vs. "Verb.+Vis." Comparison: This compares standard text reasoning against reasoning augmented with generated images.
- Success Case: In World Model tasks (simulating physics), generating images often helps. E.g., GPT-5 scores 39.2 (Verb+Vis) vs 33.2 (Verb only). The model successfully simulates the outcome to answer better.
- Failure Case: In Logic & Math, generating images often hurts. Gemini 2.0 Flash scores 30.4 with visual reasoning, but 42.0 with text only.
- Why? The IR (Interleaved Reasoning) scores for Logic & Math are abysmal (e.g., 14.8, 11.4). This means the models fail to draw the correct auxiliary lines or diagrams. A wrong diagram misleads the model, resulting in a lower final accuracy than if it had just reasoned with text.
6.3. Dissociation of Reasoning
The results highlight a fundamental split in capabilities:
- Physical/Literal Reasoning: Models are good at this. They can imagine "an apple turning brown" (Causal) or "a robot arm moving" (World Model) because these rely on patterns seen in training data (perceptual concepts).
- Symbolic/Abstract Reasoning: Models fail here. They cannot robustly generate "auxiliary lines to solve x" or "abstract puzzle pieces" because these require symbolic abstraction, which UMMs struggle to map to pixels.
The following figure (Figure 5 from the original paper) visually demonstrates these failures and successes across tasks:
This figure is a schematic showing different tasks and their corresponding visual reasoning processes, including the input, the visual reasoning step, and the final answer. The Logic & Math part shows figures and formulas; the World Model part involves robot-arm manipulation; the Visual Perception part compares images to identify missing parts.
6.4. Comparison with Specialized Editors
The paper also compares UMMs with specialized Image Editing models (like Qwen-Image-Edit). Result: UMMs outperform editors on ROVER-IG. Why? Editors are great at pixel manipulation (changing color, style) but lack the deep reasoning to understand complex causal or temporal instructions (e.g., "show the object after 5 minutes of decay"). UMMs use their LLM backbone to understand the "why" and "how" before generating the "what".
7. Conclusion & Reflections
7.1. Conclusion Summary
ROVER establishes that reciprocal cross-modal reasoning is a distinct and critical capability for the next generation of AI. The paper demonstrates that simply stitching together a strong text model and a strong image model is not enough; the model must be trained to interleave reasoning across modalities. The benchmark reveals that while current top-tier models (like Nano Banana/GPT-5) show promise in physical reasoning, the entire field lags significantly in symbolic visual reasoning.
7.2. Limitations & Future Work
- Symbolic Gap: The authors explicitly note the failure in logic/math tasks. Future work needs to bridge the gap between pixel generation and symbolic logic (possibly via vector graphics or code generation rather than pixel diffusion).
- Evaluation Cost: The VLM-as-judge method is scalable but relies on the judge's own capabilities. While verified against humans, it may still harbor biases or hallucinations in complex reasoning scenarios.
7.3. Personal Insights & Critique
- The "Think" Paradigm shift: This paper strongly reinforces the trend seen in LLMs (like OpenAI's o1) that "thinking time" (or token generation) improves performance. Here, it extends to "visual thinking." The fact that
BAGEL-ThinkbeatsBAGELis a compelling argument for inference-time compute in multimodal generation. - Visual Hallucination as a Blocker: The finding that bad visual reasoning hurts downstream text performance is intuitive but critical. It suggests that until image generation is highly reliable (faithful to physics and logic), using it as a "scratchpad" might be detrimental. This parallels early CoT in LLMs where wrong reasoning steps led to wrong answers.
- Benchmark Longevity: ROVER's focus on process rather than just pixel quality makes it a more enduring benchmark. As image generators become perfect at rendering textures, the challenge will shift entirely to whether the content makes logical sense, which is exactly what ROVER measures.