Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
TL;DR Summary
This study assesses whether video generation models like Veo-3 can function as zero-shot reasoners, introducing the Chain-of-Frame (CoF) reasoning concept and creating the MME-CoF benchmark. Results reveal strong short-horizon coherence but significant limitations in long-horizon causal reasoning, strict geometric constraints, and abstract logic.
Abstract
Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
1. Bibliographic Information
1.1. Title
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
1.2. Authors
- Ziyu Guo*, Xinyan Chen*, Renrui Zhang*, Ruichuan An*, Yu Qi* (Equal Contribution)
- Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li
- Pheng-Ann Heng (Corresponding Author)
Affiliations:
- CUHK (The Chinese University of Hong Kong): IMIXR & MMLab
- Peking University
- Northeastern University
1.3. Journal/Conference
The paper was released as a preprint on arXiv on October 30, 2025. Judging from its references (e.g., related work marked "CVPR 2025 Highlight"), it is positioned within the computer vision research community and likely targets a top-tier venue such as CVPR, ICCV, or ECCV.
1.4. Publication Year
2025
1.5. Abstract
This paper investigates whether state-of-the-art video generation models (specifically Veo-3) possess zero-shot reasoning capabilities beyond simple image synthesis. The authors propose that video generation can be viewed as a "Chain-of-Frame" (CoF) reasoning process. They conduct a comprehensive empirical study across 12 reasoning dimensions (e.g., spatial, geometric, physics, embodied logic). To standardize this, they curate MME-CoF, a compact benchmark. The study concludes that while models show promise in short-term spatial coherence and fine-grained grounding, they fail significantly in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Thus, they are not yet reliable as standalone reasoners but hold potential as visual engines for reasoning agents.
1.6. Original Source Link
- Original Source: https://arxiv.org/abs/2510.26802v1
- PDF Link: https://arxiv.org/pdf/2510.26802v1.pdf
- Project Page: https://video-cof.github.io
2. Executive Summary
2.1. Background & Motivation
The field of video generation has seen explosive growth with models like Sora, Veo, and Kling capable of producing high-fidelity, temporally coherent videos.
- Core Problem: While these models generate realistic visuals, it remains unclear if they truly "understand" or "reason" about the world, or if they merely mimic surface-level patterns found in their training data.
- Why Important: If video models possess genuine reasoning capabilities (understanding physics, geometry, causality), they could serve as generalist "World Models" or unified vision systems, similar to how Large Language Models (LLMs) serve as foundation models for text.
- Innovative Perspective: The authors propose examining video generation through the lens of Chain-of-Frame (CoF) reasoning. Just as LLMs use "Chain-of-Thought" (step-by-step textual reasoning) to solve problems, video models might use the sequential generation of frames to solve visual problems step by step in time and space.
The following figure (Figure 1 from the original paper) provides an overview of this study, mapping the "Video Models" center to 12 distinct reasoning dimensions:
Figure 1 caption (translated): A schematic overview of the study. A central "Video Models" node connects to 12 reasoning dimensions, including 3D geometry, physics, and object counting reasoning, illustrating both the potential and the challenges of current video models across diverse reasoning scenarios.
2.2. Main Contributions / Findings
- Comprehensive Empirical Study: The first rigorous investigation into the reasoning potential of Veo-3, evaluating it across 12 distinct categories such as Physics-based Reasoning, 3D Geometry, and Embodied Reasoning.
- MME-CoF Benchmark: The creation of a standardized, compact benchmark containing 59 carefully curated tasks designed to probe "Chain-of-Frame" capabilities.
- Key Findings:
  - Strengths: Models excel at short-horizon spatial coherence (keeping an object consistent over a few frames), fine-grained grounding (identifying specific attributes like color or texture), and locally consistent dynamics.
  - Weaknesses: Models fail at long-horizon causal reasoning (maintaining logic over a long sequence), strict geometric constraints (e.g., rotating a shape without distorting it), and abstract logic (e.g., solving a maze or math problem visually).
  - Conclusion: Current video models are not ready to be standalone zero-shot reasoners. They operate more like "System 1" (intuitive, pattern-matching) thinkers than "System 2" (logical, rigorous) thinkers.
The following figure (Figure 2 from the original paper) illustrates the MME-CoF benchmark concept and the performance gap across different models:
Figure 2 caption (translated): Panel (a) shows the MME-CoF evaluation radar chart. Performance varies across models and reasoning tasks, but most models exhibit limited reasoning ability across all tasks.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Video Generation Models: AI systems designed to create video content from text prompts (Text-to-Video). They typically use Diffusion Models (which iteratively denoise random static to create clear frames) or Autoregressive Models (which predict the next patch of pixels based on previous ones).
- Zero-Shot Reasoning: The ability of a model to solve a task it has not been explicitly trained for. For example, asking a video generator to "solve a maze" when it was only trained to "generate videos of scenery."
- Chain-of-Thought (CoT): A technique in Large Language Models (LLMs) where the model is encouraged to produce intermediate reasoning steps (e.g., "First I calculate X, then I use X to find Y") before giving a final answer. This significantly improves performance on complex tasks.
- Chain-of-Frame (CoF): The visual analog proposed in this paper. Instead of text steps, the model generates a sequence of video frames where each frame represents a step in the reasoning process (e.g., Frame 1: Hand approaches object; Frame 2: Hand grasps object; Frame 3: Hand lifts object).
3.2. Previous Works
- Video Understanding: Traditional research focused on analyzing existing videos (e.g., MViT, VideoMAE) for tasks like classification or event localization. Recent work uses LLMs to caption or answer questions about videos (Video-LLMs).
- Video Generation: The paper cites closed-source leaders like Sora (OpenAI) and Veo (Google DeepMind), and open-source attempts like Stable Video Diffusion and Hunyuan-Video.
- Reasoning with Video:
- Video-R1 and VideoChat-R1: Recent concurrent works that try to enhance reasoning in video understanding models using Reinforcement Learning (RL).
- World Models: Research exploring if video generators can simulate physical laws (e.g., gravity, collision), effectively acting as a simulation engine for the real world.
3.3. Differentiation Analysis
- vs. Video Understanding: Most benchmarks (like Video-MME) test if a model can describe a video. This paper tests if a model can generate a video that represents a solution to a logic problem.
- vs. Standard Generation Metrics: Standard metrics (like FVD) measure visual quality (is the video blurry?). This paper measures reasoning correctness (did the ball follow the correct trajectory defined by physics?).
- Innovation: This is the first systematic "stress test" of the reasoning capabilities of state-of-the-art video generators using a structured benchmark (MME-CoF).
4. Methodology
4.1. Principles
The core principle is Empirical Probing. Since models like Veo-3 are closed-source "black boxes," the authors cannot analyze their internal weights. Instead, they design specific inputs (prompts) representing reasoning tasks and analyze the outputs (videos) to infer the model's internal capabilities.
The hypothesis is Chain-of-Frame (CoF): If a video model can accurately generate a sequence of frames portraying a complex logical process (e.g., a multi-step maze solution), it implies the model has encoded the underlying rules of that logic.
4.2. Task Taxonomy (12 Dimensions)
The authors categorize visual reasoning into 12 distinct tasks. Below is a breakdown of key categories with examples:
- Visual Detail Reasoning: Can the model identify and maintain fine attributes?
- Task: "Zoom in on the person with the white handbag."
- Visual Trace Reasoning: Can the model visualize a path or trajectory?
- Task: "Draw a path from point A to point B in this maze."
- Physics-based Reasoning: Does the video obey physical laws?
- Task: "Show a ball rolling down a ramp and colliding with a block."
- 3D Geometry Reasoning: Can the model manipulate 3D shapes consistently?
- Task: "Fold this 2D net into a 3D cube."
- 2D Geometry Reasoning: Can the model draw geometric constructions?
- Task: "Connect points A, B, and C to form a triangle."
- Real-world Spatial Reasoning: Understanding viewpoints and directions.
- Task: "Show the view from the balcony looking Southwest."
- Object Counting: Accurately representing quantities.
- Task: "Pan across the table showing exactly 4 apples."
- Others: Rotation, Table/Chart Reasoning, GUI Navigation, Embodied Reasoning (robotics planning), and Medical Reasoning.
4.3. Prompt Design Protocol
To ensure the test is fair and specifically targets reasoning (rather than artistic interpretation), the authors use a strict prompt design strategy:
- Imperative Phrasing: Use commands like "Animate step-by-step," "Zoom in," "Trace the path."
- Explicit Constraints: Rules like "Static camera," "No zoom," "No glitches," "Constant speed."
- Visual Logic: The prompt translates a textual logic problem (e.g., "What is the result of moving X to Y?") into a visual instruction (e.g., "Animate object X moving to position Y").
Example of Prompt Transformation:
- Text Problem: "Is the motorcycle to the left or right of the dog?"
- Video Prompt: "Smoothly zoom in on the dog near the lower right corner, then highlight the motorcycle parked near it... Static shot."
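To make the protocol concrete, here is a minimal sketch of such a prompt transformation in Python. The template, the function name `build_video_prompt`, and the default constraint list are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the prompt-design protocol described above.
# The function name and template are illustrative; the paper does not
# release an official prompt builder.

DEFAULT_CONSTRAINTS = [
    "Static camera",   # explicit constraints keep the test about reasoning,
    "No glitches",     # not cinematography
    "Constant speed",
]

def build_video_prompt(visual_instruction: str,
                       constraints: list[str] = DEFAULT_CONSTRAINTS) -> str:
    """Turn a reasoning problem, already rephrased as an imperative visual
    instruction, into a constraint-laden video-generation prompt."""
    constraint_clause = ". ".join(constraints)
    return f"{visual_instruction} {constraint_clause}. Animate step-by-step."

# Example: the spatial-relation question above, rewritten as a visual instruction.
prompt = build_video_prompt(
    "Smoothly zoom in on the dog near the lower right corner, "
    "then highlight the motorcycle parked near it."
)
print(prompt)
```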
4.4. Evaluation Methodology
The evaluation is conducted in a Zero-Shot setting (no training examples provided).
4.4.1. Qualitative Evaluation (Human Expert)
Experts rate generated videos on a 3-level scale:
- Good: Accurate reasoning, stable visuals, no artifacts.
- Moderate: Roughly correct, but with minor flaws (blur, slight jitter, incomplete actions).
- Bad: Reasoning fails (e.g., the ball goes up instead of down), hallucinations, severe artifacts.
4.4.2. Quantitative Evaluation (Automated via Gemini-2.5-Pro)
To scale the evaluation for the MME-CoF benchmark, the authors use Gemini-2.5-Pro (a powerful Multimodal LLM) as a judge. The judge is given the prompt, the generated video, and specific criteria to score from 0 to 4.
The scoring dimensions are:
- Instruction Alignment: Did the video follow the steps?
- Temporal Consistency: Is the motion smooth across frames?
- Visual Stability: Is the camera steady? Are there glitches?
- Content Fidelity: Are objects preserved without hallucination?
- Focus Relevance: Did the camera focus on the correct target?
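A minimal sketch of this LLM-as-a-judge loop is shown below. It assumes a generic `call_judge(instruction, video_path)` wrapper around whichever multimodal API is used; the exact Gemini-2.5-Pro invocation and the rubric wording are illustrative, not the authors' released evaluation code.

```python
import json
from typing import Callable

# The five scoring dimensions from the paper, each rated 0-4 by the judge.
DIMENSIONS = [
    "instruction_alignment",
    "temporal_consistency",
    "visual_stability",
    "content_fidelity",
    "focus_relevance",
]

RUBRIC = (
    "You are given a reasoning prompt and a generated video. "
    "Score each dimension from 0 to 4 and reply with a JSON object "
    f"containing exactly these keys: {', '.join(DIMENSIONS)}."
)

def judge_video(prompt: str,
                video_path: str,
                call_judge: Callable[[str, str], str]) -> dict[str, float]:
    """Ask a multimodal judge model to score one generated video.

    `call_judge` is a placeholder for the actual API call (e.g., to
    Gemini-2.5-Pro); it must return the judge's raw text response.
    """
    instruction = f"{RUBRIC}\n\nReasoning prompt:\n{prompt}"
    raw = call_judge(instruction, video_path)
    scores = json.loads(raw)  # assumes the judge replies with valid JSON
    return {dim: float(scores[dim]) for dim in DIMENSIONS}
```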
5. Experimental Setup
5.1. Datasets (MME-CoF Construction)
The authors curated the MME-CoF benchmark by selecting high-quality reasoning cases from existing static benchmarks and adapting them for video generation.
- Sources:
  - V*Bench: for visual details.
  - MVoT, FrozenLake: for trace reasoning/planning.
  - MMMU, ScienceQA: for physics and science tasks.
  - Robobench: for embodied reasoning.
  - ChartQA: for table/chart understanding.
- Scale: 59 total entries across 12 categories.
- Format: Text Prompt -> Video Generation Task.
Figure 19 from the paper shows the distribution of these categories:
Figure 19 caption (translated): A chart showing the distribution of reasoning categories in MME-CoF. Each category's share is shown as a colored sector, e.g., Object Counting (10.2%) and 2D Geometry (11.9%), reflecting the range of zero-shot reasoning abilities being probed.
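For clarity, one benchmark entry can be pictured as a small record like the sketch below. The `CoFEntry` class and its field names are hypothetical; MME-CoF's released format may differ.

```python
from dataclasses import dataclass

@dataclass
class CoFEntry:
    """Hypothetical schema for one MME-CoF entry (field names assumed)."""
    category: str          # one of the 12 reasoning dimensions
    source_benchmark: str  # e.g., "V*Bench", "ChartQA"
    input_image: str       # path of the reference image shown to the model
    video_prompt: str      # the imperative, constraint-laden CoF prompt

# Example entry adapted from the visual-detail task described earlier.
example = CoFEntry(
    category="Visual Detail Reasoning",
    source_benchmark="V*Bench",
    input_image="example_scene.jpg",  # placeholder path
    video_prompt="Zoom in on the person with the white handbag. Static camera.",
)
```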
5.2. Models Evaluated
The study evaluates four leading video generation models (six variants in total):
- Veo-3 (Google DeepMind): tested in the `preview` and `fast` versions.
- Sora-2 (OpenAI): tested in the `base` and `pro` versions.
- Kling-v1 (Kuaishou).
- Seedance-1.0-pro.
5.3. Evaluation Settings
- Resolution: .
- Frame Rate: 24 FPS.
- Duration: 8 seconds (for Veo/Sora), 5 seconds (for Kling/Seedance).
- Sampling: 6 video samples generated per prompt to calculate a robust success rate.
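As a concrete illustration of the sampling protocol, the sketch below computes a per-prompt success rate from repeated generations. The helper callables `generate_video` and `is_successful` are hypothetical placeholders; the paper's exact aggregation script is not provided here.

```python
# Minimal sketch of the repeated-sampling protocol: generate several videos
# per prompt, label each one, and report the fraction judged successful.

from typing import Callable

SAMPLES_PER_PROMPT = 6  # the paper samples 6 videos per prompt

def success_rate(prompt: str,
                 generate_video: Callable[[str], str],
                 is_successful: Callable[[str], bool],
                 n: int = SAMPLES_PER_PROMPT) -> float:
    """Fraction of n generated videos that a rater marks as successful."""
    videos = [generate_video(prompt) for _ in range(n)]
    return sum(is_successful(v) for v in videos) / n
```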
6. Results & Analysis
6.1. Qualitative Deep-Dive (Veo-3)
The authors provide a detailed "Success vs. Failure" analysis for Veo-3.
6.1.1. Visual Detail & Grounding
- Success: When objects are large and salient (e.g., a person with a handbag), Veo-3 tracks them well.
- Failure: When objects are small, occluded, or in a cluttered scene (e.g., a specific motorcycle in a crowd), the model fails to "attend" to the right pixels.
Figure 3 illustrates this, showing success with the handbag (Success Rate 33%) but failure with smaller objects:
Figure 3 caption (translated): Veo-3's visual detail reasoning. The left side shows the first frame of the input image and the right side shows frames of the generated reasoning video, illustrating the model's ability to localize targets and preserve visual attributes, as well as its common failures on small, occluded, or cluttered targets.
6.1.2. Visual Trace (Path Following)
- Observation: The model struggles with multi-step logic. In maze tasks, it often "drifts" toward visually prominent areas rather than following the logical path.
- Failure Mode: It mimics the look of a path but ignores the constraints (the walls of the maze).
See Figure 4 for examples of path tracing failures:
Figure 4 caption (translated): The input image and the first frames of the reasoning video. The left and right sides show the input image and multiple frames from the reasoning process; characters and directional arrows indicate the reasoning and action steps.
6.1.3. Physics & Dynamics
- Observation: Videos look realistic at a glance (System 1) but violate the laws of physics (System 2).
- Example: A block sliding on a track might speed up or slow down arbitrarily, violating energy conservation. A bouncing ball might change direction without hitting a wall.
- Takeaway: The model has learned "things move," not "F = ma."
Figure 11 highlights these physics-based reasoning failures:
Figure 11 caption (translated): Veo-3's reasoning in a physical scene. The left side shows the input image and the right side shows the first frame of the generated reasoning video, depicting a mechanical apparatus in motion, with the physically implausible actions annotated.
6.1.4. Geometry (2D & 3D)
- 2D: When asked to connect dots, the model often just draws a generic shape (like a bird) that looks conceptually correct but fails to connect the specific dots provided.
- 3D: When rotating objects, the model often distorts the object's structure (non-rigid deformation), failing to maintain geometric consistency.
Figure 9 shows 2D geometry failures (e.g., connecting dots incorrectly):
Figure 9 caption (translated): Veo-3's performance on 2D geometry reasoning. The left side shows the input image and the right side shows the first frame of the reasoning video, indicating Veo-3's potential at recognizing simple shapes but its lack of the constraint awareness needed for accurate geometric operations.
6.2. Quantitative Results (MME-CoF)
The following are the results from Table 2 of the original paper, showing the overall model-level performance judged by Gemini-2.5-Pro (Scale 0-4).
| Model | Overall | Instruction Alignment | Temporal Consistency | Visual Stability | Content Fidelity | Focus Relevance |
|---|---|---|---|---|---|---|
| Kling-v1 [38] | 0.64 ± 0.91 | 0.01 ± 0.09 | 0.15 ± 0.75 | 2.43 ± 1.86 | 0.21 ± 0.79 | 0.43 ± 1.07 |
| Seedance-1.0-pro [19] | 1.41 ± 1.51 | 0.30 ± 0.86 | 1.65 ± 1.57 | 2.00 ± 1.72 | 1.13 ± 1.65 | 1.98 ± 1.75 |
| Veo-3.0-fast [21] | 1.44 ± 1.51 | 0.56 ± 1.09 | 1.37 ± 1.51 | 1.88 ± 1.73 | 1.10 ± 1.52 | 2.27 ± 1.69 |
| Veo-3.0-preview [21] | 1.45 ± 1.50 | 0.54 ± 1.06 | 1.43 ± 1.53 | 1.89 ± 1.71 | 1.12 ± 1.49 | 2.26 ± 1.73 |
| Sora-2-pro [56] | 1.66 ± 1.53 | 0.48 ± 0.96 | 1.36 ± 1.59 | 2.39 ± 1.65 | 1.64 ± 1.72 | 2.44 ± 1.73 |
| Sora-2 [56] | 1.72 ± 1.59 | 0.59 ± 1.12 | 1.52 ± 1.69 | 2.32 ± 1.68 | 1.62 ± 1.75 | 2.52 ± 1.71 |
Analysis of Table 2:
- Low Scores: The highest overall score is only 1.72 (Sora-2) out of 4.0. This quantitatively confirms that current models are poor reasoners.
- Visual Stability vs. Instruction Alignment: Models score relatively high on Visual Stability (~2.3-2.4), meaning they generate visually clean videos. However, they score extremely low on Instruction Alignment (< 0.6), meaning they generate the wrong content for the reasoning task.
The following are the results from Table 3 of the original paper, showing per-category scores:
| Category | Kling-v1 | Seedance-1.0 Pro | Veo-3.0 Fast | Veo-3.0 Preview | Sora-2 | Sora-2 Pro |
|---|---|---|---|---|---|---|
| Visual Detail | 0.72 ± 0.69 | 1.37 ± 1.39 | 1.10 ± 1.24 | 1.59 ± 1.68 | 1.14 ± 1.32 | 1.08 ± 1.89 |
| Visual Trace | 0.49 ± 0.65 | 1.23 ± 1.13 | 1.43 ± 1.26 | 1.48 ± 1.24 | 1.51 ± 1.37 | 1.75 ± 1.31 |
| Real-world Spatial | 0.77 ± 0.76 | 1.79 ± 1.53 | 2.07 ± 1.54 | 2.10 ± 1.46 | 1.84 ± 1.43 | 1.77 ± 1.35 |
| 3D Geometry | 0.61 ± 0.58 | 1.95 ± 1.64 | 1.71 ± 1.54 | 1.54 ± 1.43 | 1.37 ± 1.49 | 1.42 ± 1.45 |
| 2D Geometry | 0.49 ± 0.67 | 0.96 ± 1.11 | 1.18 ± 1.15 | 1.27 ± 1.20 | 1.77 ± 1.45 | 1.77 ± 1.21 |
| Physics-based | 0.60 ± 0.62 | 1.27 ± 1.25 | 1.44 ± 1.39 | 1.44 ± 1.35 | 2.13 ± 1.32 | 2.10 ± 1.33 |
| Rotation | 0.22 ± 0.34 | 2.30 ± 1.46 | 1.83 ± 1.44 | 1.60 ± 1.29 | 1.62 ± 1.37 | 1.44 ± 1.28 |
| Table & Chart | 0.87 ± 0.72 | 0.71 ± 1.18 | 0.82 ± 1.30 | 0.96 ± 1.44 | 1.84 ± 1.61 | 1.48 ± 1.59 |
| GUI | 1.09 ± 0.51 | 0.70 ± 0.76 | 1.11 ± 1.09 | 1.18 ± 0.89 | 1.88 ± 1.64 | 1.52 ± 1.48 |
| Object Counting | 0.64 ± 0.58 | 1.15 ± 0.97 | 2.03 ± 1.42 | 1.84 ± 1.42 | 2.06 ± 1.48 | 1.86 ± 1.41 |
| Embodied | 0.80 ± 0.00 | 1.82 ± 1.67 | 1.33 ± 1.57 | 1.18 ± 1.46 | 1.30 ± 1.51 | 1.40 ± 1.42 |
| Medical | 1.15 ± 1.17 | 1.56 ± 1.41 | 0.27 ± 0.39 | 0.30 ± 0.58 | 2.08 ± 1.56 | 1.81 ± 1.42 |
Analysis of Table 3:
- Specialization:
- Sora-2 performs significantly better in Physics-based (2.13) and Medical (2.08) reasoning compared to Veo.
- Veo-3 shows strength in Real-world Spatial (2.10) tasks.
- Seedance surprisingly leads in Rotation (2.30) and 3D Geometry (1.95).
- Medical Failure: Veo-3 fails catastrophically in Medical Reasoning (0.30), likely due to a lack of domain-specific training data or aggressive safety filters preventing anatomical generation.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously demonstrates that while modern video models (Veo-3, Sora-2) are exceptional visual synthesizers, they are immature visual reasoners.
- They exhibit emergent capabilities in simple tasks like short-term tracking and basic rotation.
- However, they lack the systematic logic required for rigorous reasoning. They operate on probability and pattern matching, which crumbles under strict constraints (physics, complex geometry, long causal chains).
- The MME-CoF Benchmark provides a necessary yardstick to measure this specific gap between "looking real" and "being logical."
7.2. Limitations & Future Work
- Author-identified Limitations: The study focuses on zero-shot generation. It does not explore if fine-tuning on reasoning tasks would unlock these capabilities.
- Safety Filters: The authors noted that aggressive content filters (e.g., in medical or embodied contexts) might have artificially lowered scores by blocking valid generations.
- Future Directions:
- Developing "System 2" video models that incorporate explicit physics engines or logical verifiers.
- Using video models as World Simulators to train embodied agents (robots), but only after improving their causal reliability.
7.3. Personal Insights & Critique
- The "Hallucination" Feature: In LLMs, hallucination is a bug. In video models, "creativity" (hallucination) is often a feature for art but a fatal flaw for reasoning. This paper highlights the tension between generative diversity and logical consistency.
- Complementary Role: The authors suggest video models could be "complementary visual engines." This is a powerful insight: an LLM could act as the "brain" (logic) and direct the video model as the "imagination" (visualizer), checking the output for consistency in a feedback loop (see the sketch after this list).
- Benchmark Value: MME-CoF is small (59 entries). While curated, its small size might make it prone to overfitting if widely adopted. A larger-scale automated version of this benchmark would be a valuable next step.
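Below is a minimal sketch of such an LLM-director loop, under the assumption of three placeholder callables (`propose_prompt`, `generate_video`, `verify`) that stand in for real model APIs; this is a speculative design, not a system described in the paper.

```python
# Hypothetical sketch of the "LLM brain + video imagination" loop suggested
# above. All three callables are placeholders for real model APIs.

from typing import Callable

def reason_with_video(task: str,
                      propose_prompt: Callable[[str, str], str],
                      generate_video: Callable[[str], str],
                      verify: Callable[[str, str], bool],
                      max_rounds: int = 3) -> str | None:
    """Let an LLM iteratively direct a video model and check its output.

    Returns the path of the first video that passes verification, or None.
    """
    feedback = ""
    for _ in range(max_rounds):
        prompt = propose_prompt(task, feedback)   # LLM plans the next attempt
        video = generate_video(prompt)            # video model "imagines" it
        if verify(task, video):                   # LLM/VLM checks consistency
            return video
        feedback = f"Previous attempt with prompt '{prompt}' failed verification."
    return None
```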