Paper status: completed

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Published: 11/26/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

LongVT introduces an end-to-end framework that enhances long-video reasoning via an interleaved Multimodal Chain-of-Tool-Thought, leveraging LMMs' inherent temporal grounding. It releases the VideoSIAH data suite for training and evaluation and significantly improves performance across multiple benchmarks.

Abstract

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

1.2. Authors

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

Affiliations: The authors are affiliated with:

  • MiroMind AI
  • Nanyang Technological University (NTU)
  • Hong Kong University of Science and Technology (HKUST(GZ))
  • Tsinghua University (THU)
  • LMMs-Lab Team

1.3. Journal/Conference

The paper is published as a preprint on arXiv (arXiv:2511.20785). arXiv is an open-access repository for preprints of scientific papers in various fields, including computer science. While it is not a peer-reviewed journal or conference proceeding, it is a widely recognized platform for disseminating early research findings and facilitating rapid scientific communication. Many papers first appear on arXiv before undergoing formal peer review and publication in conferences or journals.

1.4. Publication Year

2025

1.5. Abstract

Large multimodal models (LMMs) show promise in video reasoning with textual Chain-of-Thought (CoT), but they suffer from hallucinations, especially in long videos where evidence is sparse and spread out. Inspired by human long-video comprehension (global skimming then local examination), this paper introduces LongVT, an end-to-end agentic framework enabling "Thinking with Long Videos" through interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). LongVT leverages LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on specific clips and resample finer-grained frames. This global-to-local reasoning loop continues until answers are supported by visual evidence. To address the scarcity of fine-grained question-answering (QA) data for long video reasoning, the authors curate and will release VideoSIAH, a data suite for training and evaluation. The training dataset comprises 247.9K samples for tool-integrated cold-start supervised fine-tuning (SFT), 1.6K for agentic reinforcement learning (RL), and 15.4K for agentic reinforcement fine-tuning (RFT). The evaluation benchmark consists of 1,280 QA pairs meticulously curated via a semi-automatic pipeline with human-in-the-loop validation. With a carefully designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks. Codes, data, and model checkpoints are publicly available.

https://arxiv.org/abs/2511.20785 (Preprint) PDF Link: https://arxiv.org/pdf/2511.20785v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem LongVT aims to solve is the limitation of current large multimodal models (LMMs) in reliably reasoning over long-form videos. While LMMs have shown potential in video reasoning, particularly with textual Chain-of-Thought (CoT), they are prone to hallucinations—generating information that is not supported by the visual evidence. This issue is exacerbated in long videos (exceeding 15 minutes) because the crucial evidence needed to answer a question might be sparse, subtle, and temporally dispersed across hours of footage. Existing LMM approaches often rely on R1-style paradigms (Supervised Fine-Tuning followed by Group Relative Policy Optimization (GRPO) based reinforcement learning), which are largely language-centric and struggle with deep visual reasoning. Their uniform frame sampling further hinders adaptive capture of key visual evidence, often missing fine-grained or decisive moments critical for accurate long-video understanding.

This problem is important because understanding long-form videos is a major challenge in multimodal artificial intelligence, underpinning real-world applications like event spotting in sports, long-range film analysis, and complex video question answering. The existing methods lack the capability to perform human-like visual operations to guide reasoning, such as skimming and zooming in on relevant segments. The paper's innovative idea is to enable LMMs to perform human-like visual operations by integrating a native video cropping tool into their reasoning process, allowing for a global-to-local inspection strategy.

2.2. Main Contributions / Findings

The paper makes three primary contributions:

  1. An End-to-End Agentic Paradigm for Long-Video Reasoning: LongVT introduces an agentic framework that natively interleaves multimodal tool-augmented CoT with on-demand clip inspection over hours-long videos. This allows LMMs to transition from passive frame consumption to active, evidence-seeking reasoning, thereby enabling more effective and reliable long-video understanding. This framework enables self-correction and hypothesis-verification loops, inspired by human cognitive processes.
  2. VideoSIAH Data Suite for Evidence-Sparse Long-Video Reasoning: To address the scarcity of fine-grained QA data, the paper constructs VideoSIAH, a large-scale, diverse, and high-quality data suite. This includes a training dataset with tool-integrated reasoning traces (247.9K SFT samples, 1.6K RL samples, 15.4K RFT samples) and a dedicated evaluation benchmark, VideoSIAH-Eval (1,280 human-in-the-loop validated QA pairs), specifically designed for video segment-in-a-haystack scenarios where evidence is sparse.
  3. Meticulously Designed Three-Stage Training Strategy and Comprehensive Validation: The paper proposes a robust three-stage training pipeline:
    • Cold-start Supervised Fine-Tuning (SFT): To establish foundational capabilities like temporal window proposal, tool invocation, and multimodal evidence composition.
    • Agentic Reinforcement Learning (RL): To optimize a novel joint answer-temporal grounding reward function, refining tool-using rollouts.
    • Agentic Reinforcement Fine-Tuning (RFT): To distill high-quality RL trajectories into supervised data, stabilizing agentic behaviors and consolidating long-horizon reasoning.
    Through extensive empirical validation and ablation studies, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks, narrowing the performance gap with proprietary LMMs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand LongVT, a beginner should be familiar with the following concepts:

  • Large Multimodal Models (LMMs): These are advanced artificial intelligence models that can process and understand information from multiple modalities, typically text and images/videos. They extend the capabilities of large language models (LLMs) by adding visual understanding, allowing them to answer questions about images, describe video content, or perform complex reasoning tasks involving both text and visual inputs.
  • Chain-of-Thought (CoT): A prompting technique used with large language models to enable complex reasoning. Instead of directly asking for an answer, CoT involves instructing the model to "think step-by-step" or "show your work." This encourages the model to break down a complex problem into intermediate steps, which often leads to more accurate and verifiable answers, reducing hallucinations (incorrect or fabricated information). In multimodal settings (Multimodal Chain-of-Thought), this involves reasoning over visual inputs as well.
  • Hallucinations: In the context of AI, hallucinations refer to instances where a model generates content (text or visual descriptions) that is plausible but factually incorrect, not supported by its input data, or completely fabricated. This is a significant challenge for LMMs, especially when dealing with ambiguous or sparse information in long videos.
  • Temporal Grounding: The task of identifying the precise start and end times (a temporal span or time window) within a video that corresponds to a given natural language query or event description. For example, given the query "the player scoring a goal," temporal grounding would identify the exact video segment where the goal occurs.
  • Agentic Frameworks / AI Agents: An agentic framework refers to an AI system designed to operate autonomously in an environment, making decisions and taking actions to achieve a goal. AI Agents often involve planning, memory, tool use, and self-reflection. In the context of LMMs, an agentic framework allows the model to interact with its environment (e.g., a video) by calling specialized tools (like crop_video) to gather more information, rather than just passively processing a pre-defined input.
  • Supervised Fine-Tuning (SFT): A common technique in machine learning where a pre-trained model (like an LMM) is further trained on a labeled dataset for a specific downstream task. The model learns to map inputs to desired outputs based on explicit supervision (correct examples). In LongVT, cold-start SFT is used to teach the base LMM foundational skills like tool invocation and initial reasoning patterns.
  • Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties for those actions. The goal is to learn a policy that maximizes cumulative reward. RL is used in LongVT to optimize the model's decision-making process, such as when to use a tool and how to interpret its output, by providing feedback (rewards) based on the correctness of answers and temporal grounding.
  • Reinforcement Fine-Tuning (RFT): A stage of training that often follows RL. It involves converting high-quality trajectories (sequences of actions and observations) generated during RL into supervised training examples. These self-distilled examples are then used to further fine-tune the model in a supervised manner, stabilizing the beneficial behaviors learned during RL and improving performance.
  • Group Relative Policy Optimization (GRPO): An RL algorithm often used for training Large Language Models (LLMs) to align with human preferences or perform complex reasoning. It builds on Proximal Policy Optimization (PPO) by sampling multiple responses (rollouts) for each prompt, evaluating them with a reward model, and using a group baseline to reduce variance in the advantage estimates. This allows for more stable and efficient learning, particularly for open-ended generation tasks.
  • Intersection over Union (IoU): A common evaluation metric used to quantify the overlap between two bounding boxes or temporal intervals. For temporal grounding, IoU measures the ratio of the intersection duration of the predicted and ground-truth intervals to their union duration. A higher IoU indicates better overlap. Mathematical Formula: $\mathrm{IoU} = \frac{|[t_s, t_e] \cap [t'_s, t'_e]|}{|[t_s, t_e] \cup [t'_s, t'_e]|}$ Symbol Explanation (a small code sketch follows this list):
    • $[t_s, t_e]$: The predicted temporal interval, where $t_s$ is the start time and $t_e$ is the end time.
    • $[t'_s, t'_e]$: The ground-truth temporal interval, where $t'_s$ is the start time and $t'_e$ is the end time.
    • $\cap$: Represents the intersection of the two intervals (the duration where they overlap).
    • $\cup$: Represents the union of the two intervals (the total duration covered by either interval).
    • $|\cdot|$: Denotes the length or duration of a temporal interval.
  • LLM-as-a-Judge: A technique where a powerful Large Language Model (LLM) is used to evaluate the quality of responses generated by other models. Instead of relying on human annotators or rule-based systems, the LLM-as-a-Judge assesses responses based on criteria like correctness, coherence, and relevance. This is particularly useful for open-ended tasks where traditional metrics are difficult to apply.
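A minimal sketch of the temporal IoU computation described above; the function name and the (start, end) interval representation are illustrative, not taken from the paper.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) intervals given in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))   # overlap duration
    union = (pe - ps) + (ge - gs) - inter         # total covered duration
    return inter / union if union > 0 else 0.0

# Example: predicted span [763.0, 995.0] vs. ground-truth span [800.0, 1000.0]
print(temporal_iou((763.0, 995.0), (800.0, 1000.0)))  # ~0.82
```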

3.2. Previous Works

The paper contextualizes its work within two main streams of research:

  • RL-Based Multimodal Reasoning:

    • Early Inspiration: OpenAI o1 [17] and DeepSeek-R1 [11] extended GRPO-style RL from text-only reasoning to multimodal domains. These foundational works showed the potential of RL for improving reasoning capabilities.
    • Vision-centric: Methods like [15, 30, 59] applied RL to improve image question answering (QA), grounding [7, 27, 35], and segmentation [26].
    • Video-centric: More directly relevant, works such as Video-R1 [8] and [44] tackle video QA, while [47] focuses on temporal grounding, and [23] addresses spatiotemporal grounding. Recent efforts [4] have scaled RL to long videos.
    • Audio and Omnimodal: RL has also been applied to audio QA [20, 48] and broader omnimodal reasoning [62].
    • Key takeaway: These works collectively demonstrate that RL-based reasoning significantly improves cross-modal understanding.
  • Tool-Augmented Agentic LMMs:

    • Images: Recent methods [38, 50, 54, 61] enhance image reasoning by interleaving pixel-level operations (e.g., zooming, drawing auxiliary lines, generative imagery) to process finer details and reduce hallucinations. DeepEyes [61] is specifically mentioned as having similar training dynamics concerning reflection tokens.
    • Videos: VITAL [57] is a concurrent work that also explores tool-augmented RL for improving video QA and temporal grounding.

3.3. Technological Evolution

The field has evolved from language-centric models that primarily process text, to multimodal models that can handle both text and visual information. Initially, multimodal models often relied on supervised fine-tuning (SFT) and simple Chain-of-Thought (CoT) prompting, which proved effective for short videos but struggled with the scale and complexity of long-form content, leading to hallucinations.

The next evolutionary step involved incorporating Reinforcement Learning (RL) (R1-style paradigms) to enhance reasoning and align model behavior with desired outcomes, moving beyond token-level likelihood optimization. Concurrently, the concept of AI agents emerged, where models are equipped with tools to interact dynamically with their environment, much like humans use tools to gather more information.

LongVT fits within this timeline by pushing the boundaries of agentic LMMs specifically for long-form video understanding. It combines RL-based reasoning with tool augmentation, allowing models to actively seek out visual evidence, mirroring human cognitive strategies for long video comprehension (global skim, local inspection). This moves beyond passive processing towards active, interactive, and self-correcting reasoning, especially addressing the segment-in-a-haystack problem where crucial evidence is sparse.

3.4. Differentiation Analysis

Compared to prior work, particularly VITAL [57], LongVT introduces several key innovations:

  • Target Task and Dataset:

    • VITAL focuses on general video QA and temporal grounding.
    • LongVT specifically targets video segment-in-a-haystack reasoning, where evidence is extremely sparse and temporally dispersed in hours-long footage. To address this, it contributes a large-scale, high-quality dataset, VideoSIAH, and a dedicated benchmark, VideoSIAH-Eval. This dataset is designed to explicitly trigger tool-integrated reasoning and reveal emergent human-like self-reflection capabilities.
  • Training Paradigm:

    • VITAL uses tool-augmented RL but the specifics of its training stages are not detailed in the comparison.
    • LongVT proposes a novel three-stage closed-loop training paradigm:
      1. Cold-start SFT: To provide a robust foundation for tool-calling and reasoning, which the paper empirically shows is indispensable (Figure 14).
      2. Agentic RL: For enhancing generalization.
      3. Agentic RFT: A dedicated stage that leverages high-quality rollout traces (self-distilled from RL) for iterative self-refinement and stabilization of agentic behaviors.
  • Reward Function:

    • Prior works often rely on multi-task objectives (e.g., Video-R1 [8], [23]) or explicit tool rewards (e.g., VITAL [57], DeepEyes [61]).
    • LongVT shows that single-task RL with a decoupled temporal-grounding reward (specifically IoU) can achieve state-of-the-art performance. This decoupled reward is integrated into a joint answer-temporal grounding reward function, unifying answer correctness and temporal precision without requiring explicit tool invocation bonuses. The paper also ablates the necessity of tool reward, finding it less critical than strong SFT and IoU rewards.
  • Emphasis on Emergent Behavior: LongVT explicitly designs its training and data to foster emergent human-like self-reflection and hypothesis-verification behaviors, which is a core part of its interleaved Multimodal Chain-of-Tool-Thought (iMCoTT).

4. Methodology

4.1. Principles

The core idea behind LongVT is to mimic human comprehension of long videos: first skim globally to identify potentially relevant segments, and then zoom in on those segments for detailed examination. This global-to-local reasoning strategy is implemented through an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought (iMCoTT).

The theoretical basis and intuition are that LMMs can leverage their inherent temporal grounding capabilities to dynamically interact with the video. This involves proposing precise temporal windows (hypotheses), using a native video cropping tool (crop_video()) to resample finer-grained frames within that window (verification), and then refining the reasoning based on the new visual evidence. This hypothesis-verification cycle allows the model to self-correct when initial retrievals are insufficient or inaccurate, similar to how a human would re-inspect a part of a video. The goal is to ground answers in retrieved visual evidence, reducing hallucinations that often plague LMMs when dealing with sparse or dispersed evidence in long videos.

The following figure (Figure 4 from the original paper) visualizes the overall framework of LongVT, showing its iterative hypothesis-verification cycle and the role of the crop_video tool:

Figure 4. The overall framework of LongVT. LongVT enables Thinking with Long Videos through an iterative hypothesis-verification cycle. This is incentivized via cold-start SFT, enabling the model to skim global frames and proactively invoke the crop_video tool to resample fine-grained evidence. In cases where the initial retrieval (e.g., at $T_1$) proves insufficient, the model leverages learned self-correction to reinvoke the tool (e.g., at $T_2$) with refined parameters. Crucially, this entire decision-making trajectory is consolidated via agentic RL, which optimizes the policy against the joint answer-temporal grounding reward ($\mathbf{R_{acc}} + \mathbf{R_{format}} + \mathbf{R_{time}}$), enhancing the model's generalization ability to further align with human-like verification strategies.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. VideoSIAH: Fine-Grained Data Suite for Evidence-Sparse Long-Video Reasoning

LongVT's training and evaluation heavily rely on a newly constructed dataset called VideoSIAH. This dataset is designed to tackle the unique challenges of long-video reasoning, where LMMs must locate sparse, fine-grained, and causally decisive moments within hours-long content. Existing datasets are often too coarse-grained and do not sufficiently supervise the learning of temporal hypothesis formation, verification, or revision.

4.2.1.1. Data Pipeline

The VideoSIAH dataset is curated using a semi-automatic, human-in-the-loop pipeline that generates temporally grounded reasoning traces aligned with human cognitive processes.

The following figure (Figure 2 from the original paper) illustrates the data pipeline for VideoSIAH:

Figure 2. VideoSIAH Data Pipeline. We construct high-quality, temporally grounded reasoning traces for long videos. We automatically detect scenes, merge short segments, generate detailed captions using Qwen2.5-VL-72B, and create initial QA pairs. Text-based filtering removes low-quality QAs, and multimodal filtering (with GLM-4.5V) ensures visual consistency. Annotator feedback refines prompts for QA generation, filtering, and iMCoTT construction, ensuring high-fidelity data. The tool-integrated iMCoTT reasoning traces are generated only for the cold-start SFT stage, whereas RL training operates solely on the filtered QA pairs.

The steps are as follows:

  1. Automatic Scene Detection & Segmentation: Long videos are first processed to detect scene changes. Consecutive segments shorter than 10 seconds are merged to create semantically stable units (a toy sketch of this merging step follows this list).
  2. Detailed Caption Generation: For each stable video segment, Qwen2.5-VL-72B [1] is employed to generate detailed descriptions. These captions capture salient objects, spatial relations, and evolving events, forming the semantic basis for QA pair generation.
  3. Initial QA Generation: QA pairs are created from these detailed captions. These questions cover a wide range of aspects including temporal events, spatial layouts, motion, object attributes, and scene transitions, ensuring broad coverage and diversity.
  4. Filtering Stages:
    • Text-based QA Filtering: Low-quality or ill-posed questions (e.g., those with answer leakage, where the question implicitly contains the answer) are removed using linguistic heuristics and cross-model agreement checks.
    • Multimodal QA Filtering: GLM-4.5V [12] is used to verify the consistency of answers against the actual video segments. This step eliminates hallucinated or visually unsupported claims.
  5. Prompt-Feedback Refinement Loop: Human annotators provide feedback, which is used to refine the prompting rules for QA generation, filtering, and iMCoTT (interleaved Multimodal Chain-of-Tool-Thought) construction. This iterative refinement process ensures high-fidelity, temporally grounded, and scalable data collection with minimal manual annotation.
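As a rough illustration of the segmentation step (step 1 above), the following sketch merges consecutive short scenes into stable units; the data layout and function name are assumptions for illustration, not the paper's implementation.

```python
def merge_short_segments(segments: list[tuple[float, float]],
                         min_len: float = 10.0) -> list[tuple[float, float]]:
    """Merge consecutive scene segments shorter than `min_len` seconds into
    their predecessor, yielding longer, semantically stable units."""
    merged: list[tuple[float, float]] = []
    for start, end in segments:
        if merged and (end - start) < min_len:
            merged[-1] = (merged[-1][0], end)   # absorb the short scene into the previous unit
        else:
            merged.append((start, end))
    return merged

# Toy scene-detector output (seconds): the 6.5 s scene is merged into its neighbour.
print(merge_short_segments([(0.0, 12.0), (12.0, 31.0), (31.0, 37.5), (37.5, 62.0)]))
```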

4.2.1.2. Dataset Curation

VideoSIAH is divided into different splits for Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Reinforcement Fine-Tuning (RFT).

  • SFT Data Curation: This data aims to enhance both tool-calling capability and general reasoning. It includes three major categories:

    1. Tool-augmented Multi-round Data: These are QA pairs with iMCoTT traces that involve multiple tool calls. For hours-long videos, a single tool call might not capture the correct temporal segment, necessitating multi-round tool-calling. The probability of selecting a video sample for multi-round curation is defined adaptively based on its length (a numeric sketch appears after this list): $P_{\mathrm{multi}} = 1 - \frac{L_{\mathrm{max}} - \mathrm{clip}(L_{\mathrm{video}}, L_{\mathrm{max}}, L_{\mathrm{min}})}{L_{\mathrm{max}} - L_{\mathrm{min}}}$ Symbol Explanation:
      • $P_{\mathrm{multi}}$: The probability of choosing a given data sample for multi-round generation.
      • $L_{\mathrm{video}}$: The length of the video in question.
      • $L_{\mathrm{max}}$: The maximum video length threshold.
      • $L_{\mathrm{min}}$: The minimum video length threshold.
      • $\mathrm{clip}(x, a, b)$: A function that restricts the value of $x$ to the range $[b, a]$: if $x < b$ it returns $b$; if $x > a$ it returns $a$; otherwise it returns $x$. This formula ensures that longer videos have a higher probability of undergoing multi-round data generation, thereby improving temporal coverage and reasoning completeness.
    2. Image Reasoning Data: A mixture of diverse image-based reasoning datasets is included to strengthen fundamental perceptual capabilities.
    3. Video Reasoning Data: General video QA datasets are also incorporated.
  • RL Data Curation: This split is built from filtered segment-in-a-haystack QA pairs.

    1. Length-balanced Subset: QA pairs are grouped by video duration (short, medium, long), and a length-balanced subset is sampled to ensure diverse video durations are covered.
    2. Difficulty-aware Filter: For each question, $K$ rollouts (generated responses) are drawn from the current policy. Items are discarded if all $K$ trajectories answer correctly (too easy) or all $K$ fail (too hard), focusing RL on a middle band of difficulty that provides more informative reward signals.
  • RFT Data Curation: This data is used for post-RL refinement.

    1. High-Quality Trajectory Filtering: Trajectories from early RL runs are kept only if they meet two criteria:
      • The model produces the correct final answer.
      • The predicted temporal span achieves an Intersection over Union (IoU) of at least 0.3 with the ground-truth window.
    2. Supervised Training Examples: These filtered, high-quality trajectories are converted into supervised examples for RFT. This provides high-precision in-distribution supervision, stabilizing optimization and strengthening grounding and tool-calling behavior.
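A small numeric sketch of the length-adaptive multi-round selection probability $P_{\mathrm{multi}}$ defined above. The length thresholds used here are illustrative placeholders; the paper does not specify their values in this section.

```python
def multi_round_prob(video_len: float, l_min: float, l_max: float) -> float:
    """P_multi = 1 - (L_max - clip(L_video, L_max, L_min)) / (L_max - L_min):
    longer videos get a higher chance of multi-round tool-call curation."""
    clipped = min(max(video_len, l_min), l_max)        # clip(L_video, L_max, L_min)
    return 1.0 - (l_max - clipped) / (l_max - l_min)

# Illustrative thresholds only: L_min = 15 min, L_max = 60 min.
for minutes in (10, 30, 60, 90):
    print(minutes, round(multi_round_prob(minutes * 60, 15 * 60, 60 * 60), 2))
# -> 10 0.0 | 30 0.33 | 60 1.0 | 90 1.0
```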

4.2.1.3. Dataset Statistics

The following are the dataset statistics of VideoSIAH, as presented in Table 1 of the original paper:

| Split | Source | Purpose | Samples | Total |
| --- | --- | --- | --- | --- |
| SFT (w/o tool) | LongVideo-Reason CoT [4] | Reasoning-augmented Open-ended QA | 5,238 | 228,835 |
| | Video-R1 CoT [8] | Reasoning-augmented Video QA | 165,575 | |
| | Image-based CoT | Reasoning-augmented Image QA | 58,022 | |
| SFT (w/ tool) | Gemini-distilled iMCoTT | Tool-augmented Open-ended QA | 12,766 | 19,161 |
| | Qwen-distilled iMCoTT | Tool-augmented Temporal Grounding | 6,395 | |
| RL | Gemini-distilled | Open-ended QA over Long Videos | 1,667 | 17,020 |
| RFT | Self-distilled iMCoTT | Agentic Behaviors | 15,353 | |

Table 1. Dataset Statistics of VideoSIAH. Our proposed dataset contains non-tool SFT data, tool-augmented SFT data, RL QAs, and self-distilled RFT traces.

VideoSIAH-Eval Benchmark: This dedicated evaluation benchmark consists of 244 videos and 1,280 carefully filtered QA pairs with human-in-the-loop validation. It is designed for long-form video reasoning, with an average video duration of approximately 1,688 seconds. About 71.84% of videos are in the 15-30 minute range, and 28.16% are longer than 30 minutes.

4.2.2. Training Strategy

LongVT employs a three-stage training pipeline to elicit robust "Thinking with Long Videos" behaviors.

4.2.2.1. Cold-Start Supervised Fine-Tuning (SFT)

This initial stage is crucial for equipping the base LMM with fundamental capabilities necessary for tool-augmented reasoning. The paper empirically shows that without this cold-start SFT, the model struggles significantly in later RL stages, often failing to improve or even collapsing. The SFT stage teaches the model:

  1. Proposing Temporal Windows: The ability to identify and suggest a precise time window where relevant events might occur.
  2. Invoking Video Tools: The skill to correctly call the crop_video() tool and understand its function.
  3. Composing Multimodal Evidence: How to integrate the finer-grained frames returned by the tool into its reasoning process to formulate an answer.
  4. Self-Correcting: The capacity to recognize when an initial temporal window is suboptimal and to reinvoke the tool with refined parameters.

4.2.2.2. Agentic Reinforcement Learning (RL)

In this stage, the model acts as a tool-using agent. It learns when to inspect the video, how long to crop the video segment, and how to integrate the retrieved evidence into its reasoning. GRPO [34] is employed for optimization. A key innovation here is the joint answer-temporal grounding reward function, which unifies answer accuracy, format compliance, and temporal grounding precision.

For reference, the supervised SFT stage optimizes the standard Next-Token Prediction objective, minimizing the negative log-likelihood of target tokens. For a sequence of tokens $x = (x_1, x_2, \dots, x_T)$ and a model parameterized by $\theta$ that defines conditional probabilities $p_\theta(x_t \mid x_{<t})$, the loss function is: $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$ Symbol Explanation:

  • $\mathcal{L}(\theta)$: The negative log-likelihood loss function for the model with parameters $\theta$.
  • $x$: A sequence of tokens $(x_1, x_2, \dots, x_T)$.
  • $T$: The total number of tokens in the sequence.
  • $t$: An index iterating over the tokens from 1 to $T$.
  • $p_\theta(x_t \mid x_{<t})$: The probability of the token $x_t$ given all preceding tokens $x_{<t} = (x_1, \dots, x_{t-1})$, as predicted by the model with parameters $\theta$. This loss encourages the model to assign higher probability to the correct next token in the sequence.

The three-part reward modeling for RL is defined as follows:

  • Answer Accuracy ($\mathbf{R_{acc}^{(k)}}$): An LLM-as-a-Judge [53] is used to assess the quality of the generated answer $\hat{a}^{(k)}$ against the ground-truth answer $a^\star$. The judge provides a categorical verdict $J^{(k)}$: $J^{(k)} = \mathrm{Judge}_{\mathrm{LLM}}\left(\hat{a}^{(k)}, a^\star\right) \in \{\mathrm{F}, \mathrm{P}, \mathrm{I}\}$ Symbol Explanation:
    • $J^{(k)}$: The categorical verdict for the $k$-th rollout.
    • $\mathrm{Judge}_{\mathrm{LLM}}$: The function performed by the LLM-as-a-Judge.
    • $\hat{a}^{(k)}$: The answer generated by the model for the $k$-th rollout.
    • $a^\star$: The ground-truth answer.
    • $\mathrm{F}$: Fully consistent (semantically equivalent to $a^\star$).
    • $\mathrm{P}$: Partially consistent (contains some correct information but is incomplete or imprecise).
    • $\mathrm{I}$: Inconsistent (incorrect or contradictory).
    The accuracy reward is then normalized: $\mathbf{R}_{\mathrm{acc}}^{(k)} = \begin{cases} 1, & \text{if } J^{(k)} = \mathrm{F}, \\ 0.5, & \text{if } J^{(k)} = \mathrm{P}, \\ 0, & \text{if } J^{(k)} = \mathrm{I}. \end{cases}$ Symbol Explanation:
    • $\mathbf{R_{acc}^{(k)}}$: The accuracy reward for the $k$-th rollout.
    • $J^{(k)}$: The verdict from the LLM-as-a-Judge for the $k$-th rollout.
  • Format Compliance ($\mathbf{R_{format}^{(k)}}$): This reward ensures that the model's output adheres to a required schema $\mathcal{S}$. $\mathbf{R}_{\mathrm{format}}^{(k)} = \begin{cases} 1, & \text{if } y^{(k)} \text{ matches } \mathcal{S}, \\ 0, & \text{otherwise}. \end{cases}$ Symbol Explanation:
    • $\mathbf{R_{format}^{(k)}}$: The format compliance reward for the $k$-th rollout.
    • $y^{(k)}$: The full textual output of the $k$-th rollout.
    • $\mathcal{S}$: The required output schema (e.g., specific XML tags or JSON structure).
  • Temporal Overlap ($\mathbf{R_{time}^{(k)}}$): This uses IoU to reward temporal localization precision. For a predicted temporal span $[t_s, t_e]$ and ground truth $[t'_s, t'_e]$: $\mathrm{IoU} = \frac{|[t_s, t_e] \cap [t'_s, t'_e]|}{|[t_s, t_e] \cup [t'_s, t'_e]|}$ Symbol Explanation (as defined in Section 3.1):
    • $[t_s, t_e]$: Predicted temporal interval.
    • $[t'_s, t'_e]$: Ground-truth temporal interval.
    • $\cap$: Intersection.
    • $\cup$: Union.
    • $|\cdot|$: Duration.
    The temporal reward is simply the IoU value: $\mathbf{R}_{\mathrm{time}}^{(k)} = \mathrm{IoU}^{(k)}.$ Symbol Explanation:
    • $\mathbf{R_{time}^{(k)}}$: The temporal overlap reward for the $k$-th rollout.
    • $\mathrm{IoU}^{(k)}$: The IoU calculated for the $k$-th rollout's predicted temporal span. This form rewards accurate temporal grounding, with a value of 1 for a perfect match and 0 for no overlap.
  • Overall Reward ($\mathbf{R^{(k)}}$): The total reward for a rollout is the sum of these three components (a minimal numeric sketch appears at the end of this subsection): $\mathbf{R}^{(k)} = \mathbf{R}_{\mathrm{acc}}^{(k)} + \mathbf{R}_{\mathrm{format}}^{(k)} + \mathbf{R}_{\mathrm{time}}^{(k)}.$ Symbol Explanation:
    • $\mathbf{R^{(k)}}$: The total reward for the $k$-th rollout.
    • $\mathbf{R_{acc}^{(k)}}$: The accuracy reward.
    • $\mathbf{R_{format}^{(k)}}$: The format compliance reward.
    • $\mathbf{R_{time}^{(k)}}$: The temporal overlap reward.

  For RL training, Group Relative Policy Optimization (GRPO) [34] is used. For each prompt $x \in \mathcal{D}$ (where $\mathcal{D}$ is the dataset of prompts), a group of $K$ responses (rollouts) is drawn from the behavior policy $\pi_{\theta_{\mathrm{old}}}$: $y^{(k)} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x), \quad k = 1, \dots, K, \qquad y^{(k)} = (y_1^{(k)}, \dots, y_{T_k}^{(k)}), \qquad T_k = \mathrm{len}(y^{(k)}).$ Symbol Explanation:
    • $y^{(k)}$: The $k$-th generated response (rollout) sequence.
    • $\pi_{\theta_{\mathrm{old}}}$: The behavior policy with parameters $\theta_{\mathrm{old}}$.
    • $x$: The input prompt from the dataset $\mathcal{D}$.
    • $K$: The number of sampled rollouts in a group.
    • $y_t^{(k)}$: The $t$-th token in the $k$-th rollout.
    • $T_k$: The length of the $k$-th rollout.
  A group baseline $b$ and advantages $A^{(k)}$ are calculated: $b = \frac{1}{K} \sum_{k=1}^{K} R^{(k)}, \qquad A^{(k)} = R^{(k)} - b.$ Symbol Explanation:
    • $b$: The group baseline, which is the average reward of all rollouts in the group.
    • $A^{(k)}$: The advantage for the $k$-th rollout, representing how much better (or worse) its reward is compared to the group baseline.
    • $R^{(k)}$: The scalar reward for the $k$-th rollout, as defined above.
  The policy maximizes a length-normalized, token-conditional KL-regularized objective: $\mathcal{I}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y^{(k)}\} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \Big[ \frac{1}{K} \sum_{k=1}^{K} \frac{1}{T_k} \sum_{t=1}^{T_k} A^{(k)} \log \pi_{\theta}\big(y_t^{(k)} \mid x, y_{<t}^{(k)}\big) \Big] - \beta\, \mathbb{E}_{x \sim \mathcal{D}} \Big[ \frac{1}{K} \sum_{k=1}^{K} \frac{1}{T_k} \sum_{t=1}^{T_k} D_{\mathrm{KL}}\Big( \pi_{\theta}(\cdot \mid x, y_{<t}^{(k)}) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x, y_{<t}^{(k)}) \Big) \Big]$ Symbol Explanation:
    • $\mathcal{I}(\theta)$: The objective function to be maximized for the current policy with parameters $\theta$.
    • $\mathbb{E}_{x \sim \mathcal{D}}$: Expectation over prompts $x$ sampled from the dataset $\mathcal{D}$.
    • $\mathbb{E}_{\{y^{(k)}\} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}$: Expectation over the group of rollouts $y^{(k)}$ sampled from the behavior policy $\pi_{\theta_{\mathrm{old}}}$ given prompt $x$.
    • $\log \pi_{\theta}(y_t^{(k)} \mid x, y_{<t}^{(k)})$: The log-probability of generating token $y_t^{(k)}$ under the current policy $\pi_\theta$, given the prompt $x$ and preceding tokens $y_{<t}^{(k)}$. This term encourages the policy to generate actions that lead to high rewards.
    • $\beta$: A hyperparameter controlling the strength of the KL regularization term.
    • $D_{\mathrm{KL}}(\pi_{\theta}(\cdot \mid x, y_{<t}^{(k)}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x, y_{<t}^{(k)}))$: The Kullback-Leibler (KL) divergence between the current policy $\pi_\theta$ and a frozen reference policy $\pi_{\mathrm{ref}}$. This term penalizes large deviations from the reference policy, helping to stabilize training and prevent aggressive policy updates.
    • $\pi_{\mathrm{ref}}$: A frozen reference policy, typically the SFT model or an earlier version of the RL policy, used to prevent the policy from drifting too far during RL.
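To make the reward shaping and the group-relative advantages above concrete, here is a minimal sketch. The judge verdict and format check are treated as precomputed inputs, and all names are illustrative rather than the paper's code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between (start, end) intervals, as in Section 3.1."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def total_reward(verdict: str, format_ok: bool, pred_span, gt_span) -> float:
    """R = R_acc + R_format + R_time for one rollout.
    `verdict` is the LLM-as-a-Judge label: 'F', 'P', or 'I'."""
    r_acc = {"F": 1.0, "P": 0.5, "I": 0.0}[verdict]
    r_format = 1.0 if format_ok else 0.0
    r_time = temporal_iou(pred_span, gt_span)
    return r_acc + r_format + r_time

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: subtract the group-mean baseline b from each reward."""
    b = sum(rewards) / len(rewards)
    return [r - b for r in rewards]

# Four rollouts for one prompt, ground-truth span [800, 1000] seconds.
gt = (800.0, 1000.0)
rewards = [
    total_reward("F", True, (760.0, 990.0), gt),    # correct answer, well grounded
    total_reward("P", True, (0.0, 300.0), gt),      # partial answer, wrong window
    total_reward("I", False, (100.0, 200.0), gt),   # wrong answer, bad format
    total_reward("F", True, (805.0, 998.0), gt),    # correct answer, tight grounding
]
print([round(a, 2) for a in group_relative_advantages(rewards)])
```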

4.2.2.3. Agentic Reinforcement Fine-tuning (RFT)

This final stage is designed to stabilize the agentic behaviors learned during RL and consolidate multimodal reasoning. It is motivated by findings that RFT is crucial for strengthening reasoning capabilities in LLMs. The process involves:

  1. Filtering High-Quality Trajectories: As described in the RFT Data Curation section, trajectories from RL runs that achieve both a correct final answer and an IoU of at least 0.3 for temporal grounding are selected (a filtering sketch follows this list).
  2. Self-Distillation: These filtered trajectories are converted into supervised training examples.
  3. Post-RL Refinement: The model (initialized with the best-performing RL checkpoint) is then fine-tuned on this self-generated, well-grounded dataset. This in-distribution supervision helps the model internalize robust grounding and tool-calling patterns, leading to further performance gains beyond what SFT or RL alone can achieve.
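A hedged sketch of the trajectory filter described in steps 1-2; the trajectory field names are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def keep_for_rft(traj: dict, iou_threshold: float = 0.3) -> bool:
    """Keep an RL trajectory for RFT only if (i) its final answer is judged correct
    and (ii) its predicted temporal span reaches IoU >= 0.3 with the ground truth."""
    answer_ok = traj["judge_verdict"] == "F"                          # assumed field name
    span_ok = temporal_iou(traj["pred_span"], traj["gt_span"]) >= iou_threshold
    return answer_ok and span_ok

example = {"judge_verdict": "F", "pred_span": (760.0, 990.0), "gt_span": (800.0, 1000.0)}
print(keep_for_rft(example))  # True: correct answer and IoU ~0.79 >= 0.3
```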

4.2.3. Overall Framework

The overall LongVT framework operates in an iterative "hypothesis-verification" cycle, where the model skims global frames, proposes a temporal window, invokes the crop_video() tool to resample finer-grained evidence, and self-corrects if the initial retrieval is insufficient. This entire decision-making process is refined through the three-stage training pipeline, optimizing against a joint answer-temporal grounding reward to align with human-like verification strategies.
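The loop below sketches this hypothesis-verification cycle in miniature. Only the tool name crop_video comes from the paper; the policy, the frame-sampling logic, and all signatures and return types are stand-ins for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    reasoning: str
    tool_call: Optional[dict]      # e.g. {"start_time": 763.0, "end_time": 995.0}
    answer: Optional[str]

def crop_video(duration: float, start: float, end: float, num_frames: int = 32) -> list[float]:
    """Stub for the native cropping tool: resample `num_frames` timestamps within [start, end]."""
    step = (end - start) / max(num_frames - 1, 1)
    return [round(start + i * step, 2) for i in range(num_frames)]

def toy_policy(context: list) -> Step:
    """Stand-in for the LMM: first proposes a temporal window, then commits to an answer."""
    if len(context) <= 2:  # only the global skim and the question so far
        return Step("Global skim: the event seems to occur near the end.",
                    {"start_time": 763.0, "end_time": 995.0}, None)
    return Step("The cropped clip confirms the hypothesis.", None, "a small white dog")

def answer_with_tool(duration: float, question: str, policy=toy_policy, max_rounds: int = 3) -> str:
    global_frames = [round(i * duration / 63, 2) for i in range(64)]   # uniform global skim
    context: list = [global_frames, question]
    for _ in range(max_rounds):
        step = policy(context)
        if step.tool_call is None:          # model is confident enough to answer
            return step.answer
        clip = crop_video(duration, step.tool_call["start_time"], step.tool_call["end_time"])
        context += [step.reasoning, clip]   # interleave thought text with new visual evidence
    return "unanswered"

print(answer_with_tool(1688.0, "What does the man consistently keep in his arms?"))
```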

5. Experimental Setup

5.1. Datasets

LongVT leverages a combination of existing and newly curated datasets for both training and evaluation.

Training Datasets: The training process uses a diverse mixture of data, as detailed in Table 1 (provided in Section 4.2.1.3).

  • VideoSIAH (Self-Curated):

    • SFT (w/o tool): 228,835 samples from LongVideo-Reason CoT [4], Video-R1 CoT [8], and Image-based CoT.
    • SFT (w/ tool): 19,161 samples from Gemini-distilled iMCoTT and Qwen-distilled iMCoTT. These are tool-augmented examples for open-ended QA and temporal grounding.
    • RL: 1,667 Gemini-distilled open-ended QA samples over long videos.
    • RFT: 15,353 Self-distilled iMCoTT samples for agentic behaviors.
  • Image-based CoT Data (Detailed Breakdown): The following are the detailed statistics of Image-based CoT Data for Cold-Start SFT, as presented in Table 5 of the original paper:

    Source Purpose Samples
    LLaVA-CoT [51] General Visual Reasoning 54,591
    OpenVLThinker [6] Complex Reasoning 2,829
    We-Math 2.0 [32] Mathematical Reasoning 602

    Table 5. Detailed Statistics of Image-based CoT Data for Cold-Start SFT.

    These datasets are chosen to provide a strong foundation in general visual reasoning, complex logical inference, and mathematical problem-solving, which are crucial for underpinning robust temporal reasoning.

Evaluation Benchmarks: The models are evaluated on four challenging benchmarks:

  • VideoMME [9]: A comprehensive evaluation benchmark for multi-modal LLMs in video analysis. Videos have an average duration of ≈1018 seconds.

  • VideoMMMU [13]: Evaluates knowledge acquisition from multi-discipline professional videos. Videos have an average duration of ≈506 seconds. It has sub-categories like adaptation, comprehension, and perception.

  • LVBench [46]: An extreme long video understanding benchmark. Videos have an average duration of ≈4101 seconds.

  • VideoSIAH-Eval (Self-Curated): Designed specifically for evidence-sparse long-video reasoning. It comprises 244 videos and 1,280 carefully filtered QA pairs with human-in-the-loop validation. Average video duration is ≈1688 seconds. Its design aims to overcome data contamination and option bias found in other benchmarks, as discussed in Section 8 of the paper.

    The following figure (Figure 6 from the original paper) shows the category distribution for VideoSIAH-Eval:

    Figure 6. VideoSIAH-Eval Statistics. (a) shows the distribution of video categories, and (b) shows the proportion of question categories, highlighting the diversity of our proposed benchmark.

This figure demonstrates the diversity of video categories (Travel & Events, Gaming, Education, Sports, Food, Uncategorized) and question categories (Object Recognition, Temporal Reasoning, Action Recognition, Plot Synopsis, Counting, Spatial Relationship, Mathematical Reasoning, Emotion Recognition), confirming the benchmark's comprehensive nature.

Data Sample Example: The following figure (Figure 10 from the original paper) shows a representative sample from both SFT and RFT stages, demonstrating the structure of iMCoTT with tool calls:

Figure 10. SFT/RFT Data Example. This example shows an iMCoTT reasoning trace, where the model iteratively refines its understanding of the video content by cropping specific segments (e.g., [763.00s - 995.00s]) to find evidence for the question: "Across the series of festive snack demonstrations...what does the man consistently keep in his arms?" The answer is grounded in the retrieved video segment: "a small white dog."

The following figure (Figure 9 from the original paper) presents the evaluation prompts used in LLM-as-a-Judge for measuring answer accuracy during RL:

Figure 9. LLM-as-a-Judge Reward Prompt. This prompt template is used to evaluate the consistency between a model's generated answer and the ground-truth answer, providing a score of 1, 0.5, or 0.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

  • Average Score:

    1. Conceptual Definition: This metric represents the average performance of a model across multiple evaluation benchmarks. It provides a generalized measure of the model's overall capability. For the context of this paper, it's an average of the scores obtained on VideoMME, VideoMMMU (average of its sub-metrics), LVBench, and VideoSIAH-Eval.
    2. Mathematical Formula: $ \text{Average Score} = \frac{1}{N} \sum_{i=1}^{N} \text{Score}_i $
    3. Symbol Explanation:
      • $\text{Average Score}$: The final average score across all benchmarks.
      • $N$: The total number of benchmarks included in the average (in this case, 4: VideoMME, VideoMMMU, LVBench, and VideoSIAH-Eval, where VideoMMMU's score is itself an average of its sub-metrics).
      • $\text{Score}_i$: The score obtained by the model on the $i$-th benchmark.
  • Accuracy (for LLM-as-a-Judge):

    1. Conceptual Definition: This metric quantifies the semantic consistency between a model's generated answer and the ground-truth answer. It goes beyond simple keyword matching by evaluating if the core meaning is preserved. The scores reflect whether the answer is fully correct, partially correct, or incorrect/inconsistent. This is used as part of the answer accuracy reward in RL.
    2. Mathematical Formula: The LLM-as-a-Judge assigns one of three categorical verdicts, which are then mapped to numerical scores: $\mathbf{R}_{\mathrm{acc}}^{(k)} = \begin{cases} 1, & \text{if } J^{(k)} = \mathrm{F} \text{ (Fully consistent)}, \\ 0.5, & \text{if } J^{(k)} = \mathrm{P} \text{ (Partially consistent)}, \\ 0, & \text{if } J^{(k)} = \mathrm{I} \text{ (Inconsistent)}. \end{cases}$
    3. Symbol Explanation:
      • $\mathbf{R_{acc}^{(k)}}$: The accuracy reward for the $k$-th rollout.
      • $J^{(k)}$: The categorical verdict from the LLM-as-a-Judge for the $k$-th rollout, which can be $\mathrm{F}$, $\mathrm{P}$, or $\mathrm{I}$.
  • Intersection over Union (IoU) (for Temporal Grounding):

    1. Conceptual Definition: IoU is a standard metric for evaluating the overlap between two temporal intervals (predicted and ground truth). It is calculated as the ratio of the duration of their intersection to the duration of their union. A higher IoU value indicates a better temporal alignment between the predicted and actual event boundaries. This is used as part of the temporal overlap reward in RL.
    2. Mathematical Formula: $\mathrm{IoU} = \frac{|[t_s, t_e] \cap [t'_s, t'_e]|}{|[t_s, t_e] \cup [t'_s, t'_e]|}$
    3. Symbol Explanation:
      • $[t_s, t_e]$: The predicted temporal interval (start time $t_s$, end time $t_e$).
      • $[t'_s, t'_e]$: The ground-truth temporal interval (start time $t'_s$, end time $t'_e$).
      • $\cap$: Denotes the intersection operation between two intervals, yielding the duration of their overlap.
      • $\cup$: Denotes the union operation between two intervals, yielding the total duration covered by either interval.
      • $|\cdot|$: Represents the duration (length) of an interval.
  • IoU@0.3, IoU@0.5, IoU@0.7 (for Temporal Grounding Benchmarks like Charades-STA):

    1. Conceptual Definition: These metrics measure the percentage of predictions for which the IoU score with the ground-truth interval meets or exceeds a specific threshold (e.g., 0.3, 0.5, or 0.7). A higher value indicates that a larger proportion of predictions are well-aligned with the ground truth at that specific level of strictness. IoU@0.7 is stricter than IoU@0.3.
    2. Mathematical Formula: No specific formula is given in the paper, but conceptually: $\text{IoU@Threshold} = \frac{\text{Number of predictions with IoU} \ge \text{Threshold}}{\text{Total number of predictions}} \times 100\%$ (a small code sketch computing these metrics follows this list)
    3. Symbol Explanation:
      • $\text{IoU@Threshold}$: The percentage of predictions meeting the IoU threshold.
      • $\text{Threshold}$: The specific IoU value (e.g., 0.3, 0.5, 0.7) that a prediction's IoU must meet or exceed.
  • mIoU (Mean IoU) (for Temporal Grounding Benchmarks):

    1. Conceptual Definition: This metric is the average IoU score calculated over all predictions in a dataset. It provides a single summary measure of the average temporal alignment quality across the entire set of events.
    2. Mathematical Formula: No specific formula is given in the paper, but conceptually: $ \text{mIoU} = \frac{1}{\text{N}} \sum_{i=1}^{\text{N}} \text{IoU}_i $
    3. Symbol Explanation:
      • $\text{mIoU}$: The mean Intersection over Union.
      • $N$: The total number of predictions.
      • $\text{IoU}_i$: The IoU score for the $i$-th prediction.
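A minimal sketch of how IoU@{0.3, 0.5, 0.7} and mIoU could be computed from paired predicted and ground-truth spans; the names and the percent scaling are illustrative, not the paper's evaluation code.

```python
def temporal_iou(pred, gt):
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Recall at each IoU threshold plus mean IoU, both reported in percent."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    at_t = {f"IoU@{t}": 100.0 * sum(i >= t for i in ious) / len(ious) for t in thresholds}
    miou = 100.0 * sum(ious) / len(ious)
    return at_t, miou

preds = [(5.0, 12.0), (30.0, 40.0), (0.0, 5.0)]
gts = [(6.0, 13.0), (32.0, 42.0), (3.0, 13.0)]
print(grounding_metrics(preds, gts))  # per-threshold recall (%) and mIoU (%)
```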

5.3. Baselines

LongVT's performance is compared against both open-source and proprietary LMMs.

Open-Source LMMs:

  • Qwen2.5-VL-7B [1]: The base model used for LongVT's development, evaluated as a baseline.
  • Video-R1-7B [8]: An R1-style model specifically designed for reinforcing video reasoning in MLLMs.
  • VideoRFT-7B [44]: Focuses on incentivizing video reasoning via reinforced fine-tuning.
  • Video-Thinker-7B [45]: A model that aims to spark "thinking with videos" using reinforcement learning.

Proprietary LMMs:

  • GPT-4o [16]: OpenAI's flagship multimodal model.

  • Gemini 1.5 Pro [40]: Google's advanced multimodal model, known for its long context window.

    Note on VITAL [57]: The paper explicitly states that direct comparisons to VITAL, another concurrent tool-augmented video-centric LMM, are not included because its model checkpoints are not publicly available, which hinders fair and reproducible experiments.

5.4. Experimental Details

  • Base Model: Qwen2.5-VL-7B [1] is used as the foundational model across all experiments.
  • Frame Sampling Regimes:
    • Sparse Frame Sampling: 64 uniformly sampled video frames are used.
    • Dense Frame Sampling: Either 512 or 768 uniformly sampled frames are used, with the better result reported.
  • Prompting:
    • Reasoning Prompt: Indicated by (✓) for standard reasoning-style prompts or (✗) for direct question-answering prompts.
    • Tool Calling: Indicated by (✓) if native tool calling is enabled in the prompt or (✗) if disabled.
  • Evaluation Framework: The LMMs-Eval framework [58] is used for unified evaluation.
  • Inference Setup: Inference runs on a standard Model Context Protocol server paired with an online inference engine [19] that supports continuous batching of asynchronous requests. Special delimiter tags are injected into the generation stream to parse reasoning steps, tool invocations, and final answers (a toy parser is sketched below). Performance is quantified using a hybrid scoring mechanism that combines deterministic rule-based validators with semantic evaluation via LLM-as-a-Judge [53].
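As an illustration of how such a delimiter-tagged stream might be parsed, here is a toy parser; the tag names (<think>, <tool_call>, <answer>) are hypothetical, since the exact delimiters are not listed here.

```python
import re

# Hypothetical delimiter tags; the actual tag names used by LongVT are not specified here.
TAG_PATTERN = re.compile(r"<(think|tool_call|answer)>(.*?)</\1>", re.DOTALL)

def parse_stream(text: str) -> list[tuple[str, str]]:
    """Split a generation into ordered (tag, content) segments."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]

rollout = (
    "<think>The event likely occurs near the end of the video.</think>"
    '<tool_call>{"name": "crop_video", "start_time": 763.0, "end_time": 995.0}</tool_call>'
    "<think>The cropped frames show a small white dog.</think>"
    "<answer>a small white dog</answer>"
)
print(parse_stream(rollout))
```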

Implementation Details: The following are the detailed hyperparameters across training stages, as presented in Table 6 of the original paper:

Component SFT RL RFT
Optimizer AdamW [29] AdamW AdamW
Learning Rate (LR) 5e-5 1e-6 5e-5
LR Scheduler cosine constant cosine
Weight Decay 0.0 1e-2 0.0
No. of Training Steps 3000 160 1600
No. of Warmup Steps 300 0 160
Max Length 51200 52384 51200
Dynamic Batch Size True False True
Remove Padding True True True
Liger Kernel True False True
No. of GPUs 32 64 64
No. of Frames 512 512 512

Table 6. Detailed Hyperparameters across Training Stages. Unless otherwise specified, all experiments are conducted on NVIDIA A800-SXM4-80GB GPUs.

  • SFT: Initialized with Qwen2.5-VL-7B-Instruct [1] using the LMMs-Engine [28] framework. Employs an online stream packing strategy with iterable datasets, concatenating input samples to fill a fixed buffer (51,200 tokens) to optimize throughput and minimize memory. Training continues until convergence.
  • RL: Built upon the verl library [36], extended for multi-turn, multimodal tool-augmented rollouts via SGLang [60]. Training uses a global batch size of 16 with 16 rollouts per prompt, a maximum of 16,384 new tokens, a total prompt length of 36,000 tokens, and a constant sampling temperature of 1.0 for exploration. Early stopping is applied when reward metrics saturate.
  • RFT: Uses the same efficient training infrastructure as SFT, initialized with the best-performing RL checkpoint. The training corpus consists of high-quality, self-distilled trajectories from RL rollouts. Computational resources are scaled up to 64 GPUs to accommodate the augmented dataset and accelerate refinement.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that LongVT achieves state-of-the-art performance among open-source video-centric LMMs, particularly excelling in long-video reasoning tasks.

The following are the performance comparison with existing video-centric LMMs across various long video understanding and reasoning benchmarks, as presented in Table 2 of the original paper:


| Model | Reasoning Prompt | Tool Calling | VideoMME [9] (≈1018 s, w/ subtitle) | VideoMMMU [13] (≈506 s) adaptation | comprehension | perception | LVBench [46] (≈4101 s) | VideoSIAH-Eval (≈1688 s) | Average Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary LMMs | | | | | | | | | |
| GPT-4o [16] | ✗ | ✗ | 77.2 | 66.0† | 62.0† | 55.7† | 30.8† | 17.4 | 51.5 |
| Gemini 1.5 Pro [40] | ✗ | ✗ | 81.3* | 59.0* | 53.3 | 49.3 | 33.1* | - | 55.2 |
| Open-Source LMMs with Sparse Frame Sampling | | | | | | | | | |
| Qwen2.5-VL-7B [1] | ✗ | ✗ | 62.6 | 37.3 | 28.0 | 36.7 | 30.7 | 28.1 | 37.2 |
| Video-R1-7B [8] | ✓ | ✗ | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | 42.6 |
| VideoRFT-7B [44] | ✓ | ✗ | 60.9 | 36.7 | 42.0 | 53.0 | 34.7 | 26.5 | 42.3 |
| Video-Thinker-7B [45] | ✓ | ✗ | 61.0 | 34.3 | 44.7 | 53.0 | 52.2 | 10.4 | 42.6 |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 12.5 | 37.7 | 46.0 | 58.3 | 36.0 | 26.8 | 36.2 |
| LongVT-7B-RL (Ours) | ✓ | ✓ | 66.1 | 32.7 | 44.7 | 50.0 | 37.8 | 31.0 | 43.7 |
| Open-Source LMMs with Dense Frame Sampling | | | | | | | | | |
| Qwen2.5-VL-7B [1] | ✗ | ✗ | 64.3 | 35.7 | 44.3 | 56.7 | 40.9 | 33.8 | 46.0 |
| Video-R1-7B [8] | ✓ | ✗ | 60.5 | 37.3 | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
| VideoRFT-7B [44] | ✓ | ✗ | 49.2 | 37.7 | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
| Video-Thinker-7B [45] | ✓ | ✗ | 60.8 | 37.7 | 42.7 | 55.3 | 54.3 | 6.6 | 42.9 |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| LongVT-7B-RL (Ours) | ✓ | ✓ | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| LongVT-7B-RFT (Ours) | ✓ | ✓ | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |

Table 2. Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks. The numbers with ≈ denote average video duration in seconds. Benchmarking results are official or reproduced from [9, 3].

Key Observations from Table 2:

  • Overall Superiority: LongVT-7B-RFT achieves the highest average score (47.7) among all open-source LMMs in the dense frame sampling setting, outperforming the next best (Qwen2.5-VL-7B) by 1.7 points. This indicates the effectiveness of the proposed iMCoTT and three-stage training.
  • Dense Frame Sampling Advantage: LongVT models (especially RL and RFT) show significantly stronger performance with dense frame sampling compared to sparse, highlighting the importance of finer-grained visual information when available.
  • Performance on VideoSIAH-Eval: On the challenging VideoSIAH-Eval benchmark, designed for fine-grained evidence retrieval from hours-long videos, LongVT-7B-RFT achieves 42.0, substantially outperforming the second-best open-source model (Qwen2.5-VL-7B with 33.8) by over 8 points. This is a strong validation of LongVT's ability in its targeted task.
  • Closing the Gap with Proprietary Models: LongVT-7B-RFT's average score of 47.7 is roughly 3.8 points behind GPT-4o (51.5) and 7.5 points behind Gemini 1.5 Pro (55.2) (if considering Gemini 1.5 Pro as the absolute best). This significantly narrows the performance gap between open-source and proprietary LMMs for long-video reasoning.
  • Impact of Tool Calling: Models with Tool Calling enabled (LongVT variants) generally perform better, especially in dense sampling. For example, LongVT-7B-SFT with dense sampling (44.1) significantly improves over sparse (36.2), and LongVT-7B-RFT pushes this further to 47.7. The initial low score of LongVT-7B-SFT in sparse sampling (12.5) seems anomalous or indicates that SFT alone might struggle without sufficient frames or further RL refinement.

6.2. Data Presentation (Tables)

The following tables present the comprehensive ablation studies on data recipes, training strategies, and the decoupled temporal grounding reward, as reported in Table 3 of the original paper:

| Setting | VideoMME [9] w/ sub. | VideoMMMU [13] adapt. | VideoMMMU compr. | VideoMMMU percep. | LVBench [46] (test) | VideoSIAH-Eval (test) | Average |
|---|---|---|---|---|---|---|---|
| **Data Recipe** | | | | | | | |
| SFT w/o self-curated iMCoTT | 8.4 | 33.6 | 41.6 | 46.0 | 15.1 | 4.1 | 24.8 |
| SFT w/ self-curated iMCoTT (LongVT-7B-SFT) | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| RL w/o self-curated QAs | 55.1 | 30.6 | 42.0 | 45.6 | 38.4 | 30.8 | 40.4 |
| RL w/ self-curated QAs (LongVT-7B-RL) | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| **Training Stage** | | | | | | | |
| SFT only (LongVT-7B-SFT) | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| RL only | 52.7 | 35.3 | 43.0 | 55.1 | 37.1 | 28.2 | 41.9 |
| SFT+RL (LongVT-7B-RL) | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| SFT+RL+RFT (LongVT-7B-RFT) | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |

Decoupled Temporal Grounding Reward (evaluated on Charades-STA [10]):

| Setting | IoU@0.3 | IoU@0.5 | IoU@0.7 | mIoU | Average |
|---|---|---|---|---|---|
| RL w/o Decoupled Reward | 31.5 | 19.9 | 9.1 | 21.2 | 20.4 |
| RL w/ Recall Reward | 32.0 | 20.4 | 9.6 | 21.6 | 20.9 |
| RL w/ IoU Reward | 41.0 | 25.8 | 11.7 | 27.2 | 26.4 |

Table 3. Comprehensive Ablation Studies on Data Recipes, Training Stages, and the Decoupled Temporal Grounding Reward.

The following table presents the data contamination study results for the Qwen-VL series, as reported in Table 4 of the original paper:

| Setting | VideoMME [9] w/o subtitle | VideoMMMU [13] adapt. | VideoMMMU compr. | VideoMMMU percep. | VideoSIAH-Eval (test) |
|---|---|---|---|---|---|
| **Qwen2.5-VL-7B-Instruct [1]** | | | | | |
| Original | 64.3 | 35.7 | 44.3 | 56.7 | 33.8 |
| No Visual | 40.1 | 25.7 | 38.3 | 39.3 | 12.7 |
| Rearranged Choices | 56.0 | 29.7 | 40.3 | 67.0 | - |
| **Qwen3-VL-8B-Instruct [43]** | | | | | |
| Original | 69.3 | 40.7 | 60.3 | 71.3 | 46.6 |
| No Visual | 44.1 | 33.7 | 39.3 | 46.7 | 0.00 |
| Rearranged Choices | 69.0 | 36.3 | 47.7 | 69.3 | - |

Table 4. Data Contamination Study. For MCQ benchmarks, the Rearranged Choices column reports the accuracy when the answer mapping is randomized. For VideoSIAH-Eval, Rearranged Choices is not applicable.

The following table presents the inference latency comparison across various long-video understanding and reasoning benchmarks, as reported in Table 7 of the original paper:

| Model | VideoMMMU [13] | LVBench [46] | VideoMME [9] | VideoSIAH-Eval | Average |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B [1] | 2108.6 | 2014.7 | 3031.6 | 1834.3 | 2247.3 |
| Video-R1-7B [8] | 1341.8 | 1550.6 | 2483.3 | 1900.3 | 1819.0 |
| VideoRFT-7B [44] | 1937.9 | 2154.3 | 3544.2 | 2052.6 | 2422.3 |
| Video-Thinker-7B [45] | 3153.8 | 3834.9 | 2475.1 | 1899.2 | 2840.8 |
| LongVT-7B-RFT (Ours) | 1329.8 | 1509.3 | 2754.0 | 1891.1 | 1871.1 |

Table 7. Inference Latency (seconds) on Various Long-Video Understanding and Reasoning Benchmarks. All inference is performed on 8 NVIDIA A800-SXM4-80GB GPUs.

6.3. Ablation Studies / Parameter Analysis

The paper conducts extensive ablation studies to understand the contribution of different components and design choices.

6.3.1. Fine-Grained Reasoning Data Matters

  • Impact of Self-Curated iMCoTT (SFT stage): As shown in Table 3 (Data Recipe section), removing the self-curated iMCoTT data during SFT (SFT w/o self-curated iMCoTT vs. SFT w/ self-curated iMCoTT) leads to a drastic drop in average score (from 44.1 to 24.8) and particularly on VideoSIAH-Eval (from 34.8 to 4.1). This indicates that the VideoSIAH data, specifically designed for tool-integrated reasoning, is critical for shaping the model's ability to handle long-form videos.
  • Impact of Self-Curated QAs (RL stage): Similarly, in the RL stage, removing self-curated QAs (RL w/o self-curated QAs vs. RL w/ self-curated QAs) results in a significant performance drop on VideoSIAH-Eval (from 35.9 to 30.8) and a lower average score (from 46.6 to 40.4). This emphasizes that the quality and specificity of the RL training data are crucial for improving answer accuracy, temporal localization, and systematic tool use.

6.3.2. Recall Encourages Coverage; IoU Demands Precision

  • Temporal Grounding Reward Choice: The paper ablates different temporal grounding reward functions in RL, specifically comparing Recall and IoU. As shown in Table 3 (Decoupled Temporal Grounding Reward section), RL w/ IoU Reward achieves an mIoU of 27.2 on Charades-STA [10], significantly outperforming RL w/o Decoupled Reward (21.2) and RL w/ Recall Reward (21.6).

  • Hypothesis on Recall: The paper hypothesizes that Recall can be reward-hacked by simply enlarging the predicted temporal span to encompass the ground-truth interval, which monotonically raises Recall without necessarily improving boundary precision. IoU, by contrast, implicitly penalizes span inflation through its union term, leading to tighter timestamp proposals and more disciplined tool use. This is further supported by Figure 3 (left panel), where the Recall curve plateaus, consistent with this reward-hacking behavior; a minimal sketch contrasting the two reward formulations is given after the figure below.

    The following figure (Figure 3 from the original paper) shows the effects of time reward ablation and tool reward ablation:

    Figure 3. Training Dynamics. (a) Time Reward Ablation: Evolution of Accuracy, IoU, and Recall metrics on Charades-STA for different reward functions. (b) Tool Reward Ablation: Effect of explicit tool-call reward on tool usage and accuracy during training.
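
To make the contrast concrete, here is a minimal sketch of the two reward formulations, assuming the decoupled temporal grounding reward scores a single predicted (start, end) window against the annotated ground-truth interval; the function names and example timestamps are illustrative, not the authors' exact implementation.

```python
def recall_reward(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Fraction of the ground-truth interval covered by the predicted window."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    return inter / max(ge - gs, 1e-6)


def iou_reward(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union between the predicted and ground-truth windows."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / max(union, 1e-6)


# Inflating the predicted span keeps Recall saturated at 1.0 while IoU shrinks,
# so only IoU penalizes the "enlarge the window" shortcut described above.
gt = (344.0, 372.0)
for pred in [(340.0, 380.0), (300.0, 500.0), (0.0, 1688.0)]:
    print(pred, round(recall_reward(pred, gt), 3), round(iou_reward(pred, gt), 3))
```

Running the loop above yields Recall of 1.0 for all three predictions while IoU drops from 0.7 to roughly 0.02, which is exactly the span-inflation failure mode the ablation attributes to the Recall reward.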

6.3.3. Is Tool Reward Really Necessary?

  • SFT's Role: As seen in Figure 3 (right panel), the baseline Qwen2.5-VL-7B collapses to near-zero tool calls without SFT. After cold-start SFT (LongVT-7B-SFT), tool-call frequency substantially increases and continues to rise during RL, indicating SFT is essential for establishing basic tool-calling competence.
  • Tool Reward's Limited Benefit: Surprisingly, explicitly adding a tool reward (a binary bonus for tool invocation) brings little benefit. In later RL stages, the configuration without the tool reward even shows slightly higher tool-use frequency, and accuracy remains largely unchanged. This suggests that once SFT grounds the tool's semantics, the model learns when to invoke it based on the overall answer accuracy and temporal grounding rewards, without needing an additional specific tool invocation bonus, which might even suppress exploration. The final recipe thus discards the tool reward.
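
As an illustration of the discussion above, a hedged sketch of how the rollout-level reward could be composed is given below, reusing `iou_reward` from the previous sketch; the weighting scheme, the binary accuracy term, and the `tool_bonus` parameter are assumptions for exposition, not the paper's exact reward coefficients. Setting `tool_bonus = 0.0` mirrors the final recipe that discards the explicit tool reward.

```python
def rollout_reward(
    answer_correct: bool,
    pred_window: tuple[float, float] | None,
    gt_window: tuple[float, float],
    used_tool: bool,
    w_time: float = 0.5,       # illustrative weight on the temporal grounding term
    tool_bonus: float = 0.0,   # 0.0 corresponds to dropping the explicit tool reward
) -> float:
    """Compose answer accuracy, IoU-based grounding, and an optional tool-call bonus."""
    r_acc = 1.0 if answer_correct else 0.0
    r_time = iou_reward(pred_window, gt_window) if pred_window is not None else 0.0
    r_tool = tool_bonus if used_tool else 0.0
    return r_acc + w_time * r_time + r_tool
```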

6.3.4. SFT Builds Competence; RL Optimizes Decisions; RFT Stabilizes Behaviors

  • Importance of SFT: Table 3 (Training Stage section) shows that RL only (without SFT) yields the lowest scores across all benchmarks (average 41.9), significantly worse than SFT only (LongVT-7B-SFT, average 44.1). This confirms that SFT is indispensable for teaching the model the basic tool-use paradigm—selecting temporal windows, inspecting content, and incorporating evidence. Without SFT, the model exhibits poor tool-use ability and behavioral inconsistencies, becoming confused by tool outputs rather than integrating them as evidence.
  • RL for Generalization: SFT+RL (LongVT-7B-RL) significantly improves over SFT only (LongVT-7B-SFT), raising the average score from 44.1 to 46.6. This demonstrates RL's role in optimizing the model's decision-making (when to inspect, how long to crop, how to integrate evidence) and enhancing its generalization to held-out videos and unseen question templates.
  • RFT for Stabilization: SFT+RL+RFT (LongVT-7B-RFT) achieves the highest average score of 47.7, further improving upon SFT+RL. Notably, on VideoSIAH-Eval, RFT pushes the score from 35.9 to 42.0. This indicates that RFT, by distilling high-reward trajectories back into supervised data, effectively stabilizes agentic behaviors and consolidates long-horizon reasoning and temporal grounding, realizing the full benefits of temporal-grounding feedback.

6.3.5. Reflection Trajectory: From Verbose Self-Correction to Internalized Tool Usage

The paper analyzes the evolution of the model's internal thought process by tracking the proportion of reflection tokens. The following figure (Figure 7 from the original paper) visualizes this trend:

Figure 7. Trend of Reflection-Related Words and the Corresponding Word Cloud across All Rollouts.

There are three distinct phases:

  1. Verbose Self-Correction (Steps 0-50): Initially, the reflection density is high. The model generates extensive verbal self-correction and iterative reasoning to compensate for poor localization accuracy and sub-optimal tool usage.
  2. Efficiency Optimization (Steps 50-80): As the policy matures and intrinsic grounding capability improves, reflection density significantly drops. The model autonomously prunes unnecessary linguistic fillers, learning that prolonged reflection is redundant, maximizing reward efficiency.
  3. Internalized Proficiency (After 80 Steps): The reflection curve stabilizes at a concise baseline. This indicates a shift towards selective reasoning, where explicit reflection is invoked only when ambiguity needs to be resolved. The core semantics of tool interaction have been internalized. The word cloud (right panel of Figure 7) confirms that remaining reflection tokens are semantically grounded (e.g., "segment," "confirm"), serving functional roles for temporal reasoning rather than generic linguistic fillers.
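
A minimal sketch of how the reflection-density curve in Figure 7 could be computed is shown below; the reflection lexicon and the rollout data format are assumptions for illustration, not the authors' exact tracking procedure.

```python
import re

# Assumed reflection lexicon; the paper's exact word list is not specified here.
REFLECTION_WORDS = {"wait", "recheck", "re-check", "verify", "confirm", "actually", "segment"}


def reflection_density(rollout_text: str) -> float:
    """Proportion of tokens in one rollout that belong to the reflection lexicon."""
    tokens = re.findall(r"[a-zA-Z\-]+", rollout_text.lower())
    if not tokens:
        return 0.0
    return sum(tok in REFLECTION_WORDS for tok in tokens) / len(tokens)


def density_per_step(rollouts_by_step: dict[int, list[str]]) -> dict[int, float]:
    """Average reflection density over all rollouts collected at each RL step."""
    return {
        step: sum(map(reflection_density, texts)) / len(texts)
        for step, texts in rollouts_by_step.items()
        if texts
    }
```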

6.3.6. What Motivates VideoSIAH? Unveiling the Data Contamination in Qwen-VL Series

The paper conducts a data contamination study on Qwen-VL series models (Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct) to demonstrate the necessity of VideoSIAH-Eval.

  • "No Visual" Performance: As shown in Table 4, both Qwen2.5-VL and Qwen3-VL achieve surprisingly high scores on VideoMME and VideoMMMU even without any video frames (e.g., Qwen2.5-VL scores 40.1 on VideoMME without subtitles, far exceeding random guessing for 4-option MCQs). This strongly indicates severe leakage and potential memorization of textual information or correlations in these benchmarks. In stark contrast, Qwen3-VL's score on VideoSIAH-Eval drops to 0.00 in the "No Visual" setting, confirming that VideoSIAH-Eval is clean and non-contaminated, forcing models to rely on visual grounding.

  • "Rearranged Choices" Reveals Overfitting: For MCQ-based benchmarks, Qwen2.5-VL's performance significantly drops when answer choices are rearranged (e.g., from 64.3 to 56.0 on VideoMME). This suggests models might memorize specific option mappings rather than genuinely understanding the content. VideoSIAH-Eval uses an open-ended QA format, making it immune to this option bias.

    These findings underscore that VideoSIAH-Eval provides a more robust and reliable assessment of genuine long-video reasoning capabilities.
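
For concreteness, the two contamination probes can be sketched as simple dataset transforms; the field names (`frames`, `options`, `answer`) and the four-letter option format are hypothetical, chosen only to illustrate the "No Visual" and "Rearranged Choices" settings.

```python
import random


def no_visual_probe(sample: dict) -> dict:
    """Text-only probe: keep the question and options, drop all video frames."""
    probe = dict(sample)
    probe["frames"] = []  # the model must answer from textual priors alone
    return probe


def rearranged_choices_probe(sample: dict, seed: int = 0) -> dict:
    """Shuffle the option-letter mapping so memorized letter answers stop working."""
    rng = random.Random(seed)
    texts = [opt.split(". ", 1)[1] for opt in sample["options"]]  # "A. Pink" -> "Pink"
    answer_text = texts["ABCD".index(sample["answer"])]           # remember the correct option text
    rng.shuffle(texts)
    letters = "ABCD"[: len(texts)]
    probe = dict(sample)
    probe["options"] = [f"{l}. {t}" for l, t in zip(letters, texts)]
    probe["answer"] = letters[texts.index(answer_text)]           # re-derive the correct letter
    return probe


# Example with a hypothetical sample format:
sample = {
    "question": "What color is the basin?",
    "options": ["A. Pink", "B. Blue", "C. Green", "D. White"],
    "answer": "B",
    "frames": ["f_0001.jpg", "f_0002.jpg"],
}
print(rearranged_choices_probe(sample)["answer"])
```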

6.4. Inference Efficiency Analysis

The paper investigates inference latency to address the concern that multi-turn agentic frameworks might be inherently slower.

  • Counter-Intuitive Efficiency: As shown in Table 7, LongVT-7B-RFT demonstrates competitive, and in some cases, superior inference efficiency. It achieves the lowest latency on VideoMMMU (1329.8 seconds) and LVBench (1509.3 seconds) among the compared models, while remaining competitive on VideoMME and VideoSIAH-Eval.
  • Reason for Efficiency: This efficiency, despite multi-turn tool interactions, is attributed to the precision of LongVT's reasoning. Unlike baselines that might hallucinate or generate verbose, uncertainty-driven descriptions by "blindly rephrasing" uncertain visual memories, LongVT proactively seeks and grounds its answers in retrieved frames. This evidence-based approach circumvents the need for lengthy, potentially incorrect textual generation, leading to more concise and faster token generation overall.
  • Human-like Viewing: The efficiency aligns with human-like viewing habits, where one does not watch an entire video frame by frame but strategically samples and encodes relevant segments. LongVT's ability to invoke crop_video and focus on relevant clips avoids the prohibitive computational cost and context overflow of encoding extremely long sequences in full. A sketch of this crop-and-resample idea follows below.
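
The following is a minimal sketch of the crop-and-resample idea behind crop_video: only the requested temporal window is decoded, at a higher frame rate and under a fixed frame budget. The parameter names and the fps/budget values are illustrative assumptions, not LongVT's actual tool configuration.

```python
def crop_and_resample(duration_s: float, start_s: float, end_s: float,
                      fps: float = 2.0, max_frames: int = 64) -> list[float]:
    """Return timestamps (seconds) to decode for the requested window only."""
    start_s = max(0.0, start_s)
    end_s = min(duration_s, end_s)
    if end_s <= start_s:
        return []
    n = min(max_frames, max(1, int((end_s - start_s) * fps)))
    step = (end_s - start_s) / n
    return [start_s + (i + 0.5) * step for i in range(n)]


# e.g. for a 1688 s video, a 28 s crop costs ~56 decoded frames instead of thousands.
print(len(crop_and_resample(1688.0, 344.0, 372.0)))
```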

6.5. Qualitative Examples

The paper provides several qualitative examples to illustrate LongVT's reasoning process and self-correction capabilities.

  • Self-Correction in Single-Turn (Figure 11): The example shows a single-turn case where the model initially identifies the basin color as pink but then uses internal monologue to re-check visual evidence, realizing a hallucination. It then successfully self-corrects and outputs the correct answer (Blue). This highlights the model's ability to reflect and revise its hypothesis based on visual input.

    The following figure (Figure 11 from the original paper) illustrates a self-correction case:

    Figure 11. Self-Correction in Single-Turn. The model initially hallucinates the basin color as pink, but then re-inspects the visual evidence, self-corrects, and outputs the correct answer (Blue).

  • Multi-Turn Refinement (Figure 12): This example demonstrates multi-turn tool interactions where the model iteratively refines its temporal window. An initial crop_video might miss the target event, leading the model to adjust start and end times for a subsequent tool call until the relevant visual evidence (e.g., a US flag) is successfully identified.

    The following figure (Figure 12 from the original paper) shows a multi-turn refinement example:

    Figure 12. Multi-Turn Refinement. The model's initial tool call (80s-100s) misses the US flag. Through self-correction, it refines the parameters and calls the tool again with the correct window (344s-372s) to successfully identify the US flag.

  • Comparison with Textual CoT (Figure 13): LongVT is compared against a standard textual CoT baseline for identifying colors of sports cars in a Honey promotion scene.

    • The textual CoT baseline hallucinates unseen visual details (e.g., incorrect object appearance or colors), demonstrating its vulnerability without active visual verification.

    • LongVT (using iMCoTT) follows an active verify-and-correct procedure. It calls crop_video around a hypothesized time, detects that the retrieved segment lacks the queried objects (luxury sports cars), adjusts the crop region based on its reasoning, and then successfully locates the correct evidence to produce the accurate answer ("One is white and the other is yellow"). This showcases LongVT's superior grounding and self-correction.

      The following figure (Figure 13 from the original paper) compares Thinking with Textual CoT vs. Thinking with iMCoTT (Ours):

      Figure 13. Thinking with Textual CoT vs. Thinking with iMCoTT (Ours). The top panel shows that a standard Textual CoT baseline hallucinates unseen visual details and outputs an incorrect answer ("White and Yellow"). The bottom panel demonstrates LongVT's iMCoTT workflow: it initially calls crop_video at 90s-120s, realizes the mislocalization, self-corrects, and then makes another tool call at 174s-190s to successfully identify the correct luxury sports car images and provide the accurate answer.
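
The verify-and-correct workflow illustrated in Figure 13 can be summarized as a simple control loop. The sketch below is a hypothetical rendering in which `propose_action` and `crop_video` stand in for the policy step and the native tool; it is not the authors' actual interface.

```python
from typing import Callable


def think_with_long_video(
    question: str,
    propose_action: Callable[[list], dict],      # hypothetical policy step: context -> action dict
    crop_video: Callable[[float, float], list],  # hypothetical native tool: (start_s, end_s) -> frames
    max_turns: int = 4,
) -> str:
    """Global-to-local loop: hypothesize a window, crop, inspect, refine, then answer."""
    context: list = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        action = propose_action(context)
        if action["type"] == "answer":                        # evidence judged sufficient
            return action["text"]
        frames = crop_video(action["start"], action["end"])   # zoom in and resample frames
        context.append({"role": "tool", "content": frames})   # feed visual evidence back in
    # Budget exhausted: force a final answer from whatever evidence was gathered.
    return propose_action(context + [{"role": "user", "content": "Answer now."}])["text"]
```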

6.6. Failure Case Analysis

The paper presents a representative failure case to emphasize the importance of the cold-start SFT stage. The following figure (Figure 14 from the original paper) illustrates this:

Figure 14. RL-Only Failure Case. The model correctly recognizes the need to inspect the glass coffee table via a tool call. However, after receiving resampled frames, it fails to integrate the evidence to answer the specific question ("which video-game device"). Instead, it reverts to generic video captioning, restating superficial scene descriptions.

In this example, an RL-only variant (without cold-start SFT) correctly invokes the crop_video tool to inspect a glass coffee table for video-game devices. However, after receiving the resampled frames, the model fails to perform the specific reasoning required by the question. Instead of identifying the video-game device, it becomes confused by the context shift and reverts to generic video captioning, merely restating superficial scene descriptions. This behavior underscores that cold-start SFT is essential for teaching the model the intended semantics of tool usage and how to effectively integrate tool outputs into its reasoning process, preventing such behavioral inconsistencies.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces LongVT, an innovative end-to-end agentic framework that empowers LMMs to reliably reason over long-form videos. By adopting a human-inspired global-to-local reasoning strategy, LongVT implements interleaved Multimodal Chain-of-Tool-Thought (iMCoTT) where LMMs actively use a native video cropping tool to inspect specific temporal segments and gather finer-grained visual evidence. This approach transforms long-video understanding from passive frame consumption to active, evidence-seeking reasoning with self-correction capabilities.

A key contribution is VideoSIAH, a newly curated, large-scale, fine-grained data suite and evaluation benchmark, specifically designed to address evidence-sparse long-video reasoning tasks and overcome data contamination issues found in existing benchmarks.

The effectiveness of LongVT is attributed to its meticulously designed three-stage training pipeline: cold-start Supervised Fine-Tuning (SFT) for foundational capabilities, Agentic Reinforcement Learning (RL) with a novel joint answer-temporal grounding reward for optimizing decisions, and Agentic Reinforcement Fine-Tuning (RFT) for stabilizing learned behaviors. Through extensive empirical validation, LongVT consistently outperforms strong baselines across four challenging benchmarks, significantly narrowing the performance gap between open-source and proprietary LMMs in long-video understanding.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation:

  • Memory Footprint and Context Window: While the multi-turn tool interactions do not significantly increase inference latency, the memory footprint of recursive reasoning remains a bottleneck. As the number of interaction turns increases (especially for ultra-long or infinite video streams), the accumulation of history tokens (including dense visual features returned by tools) can rapidly exhaust the context budget of the underlying LMM. This poses a risk of Out-of-Memory (OOM) errors during training and performance degradation due to truncation.

    To address this, the authors suggest a promising future direction:

  • Multi-Agent Collaboration: Inspired by advancements in multi-agent reinforcement learning (e.g., MATPO [31]), they envision a hierarchical framework. In this setup, a "Manager Agent" would orchestrate high-level planning and dispatch sub-tasks to specialized "Worker Agents." Each Worker Agent could be responsible for inspecting distinct temporal segments or executing specific tool calls. Workers would then summarize their observations into concise natural language updates for the Manager Agent, effectively decoupling context management from reasoning. This scalable, divide-and-conquer architecture could theoretically support infinite-horizon reasoning loops without context overflow.

7.3. Personal Insights & Critique

This paper presents a compelling and well-executed approach to a critical problem in multimodal AI: enabling reliable reasoning over long-form videos. The human-inspired global-to-local reasoning paradigm, coupled with native tool calling, is a highly intuitive and effective way to tackle the evidence-sparse nature of long-video content.

Inspirations drawn from this paper:

  • Native Tool Calling as a Game Changer: The idea of treating an LMM's temporal grounding ability as a native tool (crop_video()) is quite innovative. It integrates the visual processing seamlessly into the agentic reasoning loop, rather than relying on external, separate modules. This "native" integration is likely a key factor in its success.
  • Importance of Data Curation: The meticulous design and curation of the VideoSIAH dataset, especially its focus on segment-in-a-haystack scenarios and the rigorous human-in-the-loop validation and data contamination study, highlight the critical role of high-quality, task-specific data in pushing the boundaries of LMM capabilities. This emphasizes that model architecture and training strategies are only as good as the data they learn from.
  • Three-Stage Training Strategy: The cold-start SFT, agentic RL, and agentic RFT pipeline is a robust and well-justified approach. The ablation studies clearly demonstrate the indispensable role of each stage, particularly cold-start SFT for building initial competence and RFT for stabilizing agentic behaviors. This comprehensive training recipe provides valuable insights for developing complex agentic LMMs.
  • Efficiency of Active Reasoning: The finding that LongVT's multi-turn agentic reasoning can be as, or more, efficient than single-turn baselines is counter-intuitive and significant. It suggests that actively seeking and grounding evidence can lead to more concise and less hallucinated outputs, ultimately saving computational resources by avoiding verbose, uncertain generation.

Potential issues, unverified assumptions, or areas for improvement:

  • Dependence on LLM-as-a-Judge: While LLM-as-a-Judge is a powerful evaluation tool for open-ended tasks, its reliability is still subject to the capabilities and potential biases of the underlying LLM used as the judge. Its "ground truth" is a derived consistency score, not absolute human judgment. The consistency scores could vary with different judge models or prompting strategies.
  • Scalability of crop_video Tool: While the paper addresses context window limits with multi-agent future work, the crop_video tool itself might become a bottleneck for extremely fine-grained temporal grounding or very rapid event changes. The resampling frequency and efficiency of the underlying vision encoder for dense frames could become a limiting factor for real-time applications or extremely long videos.
  • Complexity of iMCoTT Debugging: Although iMCoTT enhances transparency by making reasoning steps explicit, debugging failures in a multi-turn, tool-augmented RL system can still be complex. Pinpointing whether a failure stems from poor temporal grounding, incorrect tool invocation, or faulty multimodal reasoning within an RL loop can be challenging.
  • Prompt Sensitivity: As with most LLMs, the performance of the agentic reasoning might be sensitive to the exact prompt templates used for the initial query and for tool invocation/response generation. The paper provides templates, but real-world deployment might require careful prompt engineering.

Transferability to other domains: The core principles of LongVT (global-to-local reasoning, native tool calling for dynamic information retrieval, and iterative self-correction) are highly transferable.

  • Image Understanding: Applying this to extremely high-resolution images or panoramic images, where LMMs could use a "zoom" tool to inspect specific regions for details.

  • Audio/Speech Processing: An LMM could use tools to "focus" on specific temporal segments of an audio track, perhaps to isolate a speaker or an event, and then analyze the finer-grained audio features.

  • Document Analysis: For long documents or scientific papers, an LMM could "skim" abstracts and headings, then use a "section-read" or "figure-inspect" tool to zoom in on relevant paragraphs or figures for detailed understanding.

  • Robotics/Embodied AI: An agentic model controlling a robot could use "sensor-focus" tools (e.g., zoom camera, listen closely) to gather more precise information from its environment before making a decision, mirroring the hypothesis-verification loop.

    Overall, LongVT represents a significant step towards more intelligent and reliable multimodal agents that can actively engage with complex, long-form data.
