LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
TL;DR Summary
LongVT introduces an end-to-end agentic framework that enhances long-video reasoning via an interleaved Multimodal Chain-of-Tool-Thought, leveraging LMMs' native temporal grounding as a video cropping tool. It also releases the VideoSIAH data suite for training and evaluation, and significantly improves performance across multiple long-video benchmarks.
Abstract
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
1.2. Authors
Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
Affiliations: The authors are affiliated with:
- MiroMind AI
- Nanyang Technological University (NTU)
- Hong Kong University of Science and Technology (HKUST(GZ))
- Tsinghua University (THU)
- LMMs-Lab Team
1.3. Journal/Conference
The paper is published as a preprint on arXiv (arXiv:2511.20785). arXiv is an open-access repository for preprints of scientific papers in various fields, including computer science. While it is not a peer-reviewed journal or conference proceeding, it is a widely recognized platform for disseminating early research findings and facilitating rapid scientific communication. Many papers first appear on arXiv before undergoing formal peer review and publication in conferences or journals.
1.4. Publication Year
2025
1.5. Abstract
Large multimodal models (LMMs) show promise in video reasoning with textual Chain-of-Thought (CoT), but they suffer from hallucinations, especially in long videos where evidence is sparse and spread out. Inspired by human long-video comprehension (global skimming then local examination), this paper introduces LongVT, an end-to-end agentic framework enabling "Thinking with Long Videos" through interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). LongVT leverages LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on specific clips and resample finer-grained frames. This global-to-local reasoning loop continues until answers are supported by visual evidence. To address the scarcity of fine-grained question-answering (QA) data for long video reasoning, the authors curate and will release VideoSIAH, a data suite for training and evaluation. The training dataset comprises 247.9K samples for tool-integrated cold-start supervised fine-tuning (SFT), 1.6K for agentic reinforcement learning (RL), and 15.4K for agentic reinforcement fine-tuning (RFT). The evaluation benchmark consists of 1,280 QA pairs meticulously curated via a semi-automatic pipeline with human-in-the-loop validation. With a carefully designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks. Codes, data, and model checkpoints are publicly available.
1.6. Original Source Link
https://arxiv.org/abs/2511.20785 (Preprint) PDF Link: https://arxiv.org/pdf/2511.20785v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem LongVT aims to solve is the limitation of current large multimodal models (LMMs) in reliably reasoning over long-form videos. While LMMs have shown potential in video reasoning, particularly with textual Chain-of-Thought (CoT), they are prone to hallucinations—generating information that is not supported by the visual evidence. This issue is exacerbated in long videos (exceeding 15 minutes) because the crucial evidence needed to answer a question might be sparse, subtle, and temporally dispersed across hours of footage. Existing LMM approaches often rely on R1-style paradigms (Supervised Fine-Tuning followed by Group Relative Policy Optimization (GRPO) based reinforcement learning), which are largely language-centric and struggle with deep visual reasoning. Their uniform frame sampling further hinders adaptive capture of key visual evidence, often missing fine-grained or decisive moments critical for accurate long-video understanding.
This problem is important because understanding long-form videos is a major challenge in multimodal artificial intelligence, underpinning real-world applications like event spotting in sports, long-range film analysis, and complex video question answering. The existing methods lack the capability to perform human-like visual operations to guide reasoning, such as skimming and zooming in on relevant segments. The paper's innovative idea is to enable LMMs to perform human-like visual operations by integrating a native video cropping tool into their reasoning process, allowing for a global-to-local inspection strategy.
2.2. Main Contributions / Findings
The paper makes three primary contributions:
- An End-to-End Agentic Paradigm for Long-Video Reasoning: LongVT introduces an agentic framework that natively interleaves multimodal tool-augmented CoT with on-demand clip inspection over hours-long videos. This allows LMMs to transition from passive frame consumption to active, evidence-seeking reasoning, thereby enabling more effective and reliable long-video understanding. The framework supports self-correction and hypothesis-verification loops, inspired by human cognitive processes.
- VideoSIAH Data Suite for Evidence-Sparse Long-Video Reasoning: To address the scarcity of fine-grained QA data, the paper constructs VideoSIAH, a large-scale, diverse, and high-quality data suite. It includes a training dataset with tool-integrated reasoning traces (247.9K SFT samples, 1.6K RL samples, and 15.4K RFT samples) and a dedicated evaluation benchmark, VideoSIAH-Eval (1,280 human-in-the-loop validated QA pairs), specifically designed for video segment-in-a-haystack scenarios where evidence is sparse.
- Meticulously Designed Three-Stage Training Strategy and Comprehensive Validation: The paper proposes a robust three-stage training pipeline:
  - Cold-start Supervised Fine-Tuning (SFT): To establish foundational capabilities such as temporal window proposal, tool invocation, and multimodal evidence composition.
  - Agentic Reinforcement Learning (RL): To optimize a novel joint answer-temporal grounding reward function, refining tool-using rollouts.
  - Agentic Reinforcement Fine-Tuning (RFT): To distill high-quality RL trajectories into supervised data, stabilizing agentic behaviors and consolidating long-horizon reasoning.

Through extensive empirical validation and ablation studies, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks, narrowing the performance gap with proprietary LMMs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand LongVT, a beginner should be familiar with the following concepts:
- Large Multimodal Models (LMMs): These are advanced artificial intelligence models that can process and understand information from multiple modalities, typically text and images/videos. They extend the capabilities of large language models (LLMs) by adding visual understanding, allowing them to answer questions about images, describe video content, or perform complex reasoning tasks involving both text and visual inputs.
- Chain-of-Thought (CoT): A prompting technique used with large language models to enable complex reasoning. Instead of directly asking for an answer, CoT involves instructing the model to "think step-by-step" or "show your work." This encourages the model to break down a complex problem into intermediate steps, which often leads to more accurate and verifiable answers, reducing hallucinations (incorrect or fabricated information). In multimodal settings (Multimodal Chain-of-Thought), this involves reasoning over visual inputs as well.
- Hallucinations: In the context of AI, hallucinations refer to instances where a model generates content (text or visual descriptions) that is plausible but factually incorrect, not supported by its input data, or completely fabricated. This is a significant challenge for LMMs, especially when dealing with ambiguous or sparse information in long videos.
- Temporal Grounding: The task of identifying the precise start and end times (a temporal span or time window) within a video that corresponds to a given natural language query or event description. For example, given the query "the player scoring a goal," temporal grounding would identify the exact video segment where the goal occurs.
- Agentic Frameworks / AI Agents: An agentic framework refers to an AI system designed to operate autonomously in an environment, making decisions and taking actions to achieve a goal. AI agents often involve planning, memory, tool use, and self-reflection. In the context of LMMs, an agentic framework allows the model to interact with its environment (e.g., a video) by calling specialized tools (like crop_video) to gather more information, rather than just passively processing a pre-defined input.
- Supervised Fine-Tuning (SFT): A common technique in machine learning where a pre-trained model (like an LMM) is further trained on a labeled dataset for a specific downstream task. The model learns to map inputs to desired outputs based on explicit supervision (correct examples). In LongVT, cold-start SFT is used to teach the base LMM foundational skills like tool invocation and initial reasoning patterns.
- Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties for those actions. The goal is to learn a policy that maximizes cumulative reward. RL is used in LongVT to optimize the model's decision-making process, such as when to use a tool and how to interpret its output, by providing feedback (rewards) based on the correctness of answers and temporal grounding.
- Reinforcement Fine-Tuning (RFT): A training stage that often follows RL. It involves converting high-quality trajectories (sequences of actions and observations) generated during RL into supervised training examples. These self-distilled examples are then used to further fine-tune the model in a supervised manner, stabilizing the beneficial behaviors learned during RL and improving performance.
- Group Relative Policy Optimization (GRPO): An RL algorithm often used for training Large Language Models (LLMs) to align with human preferences or perform complex reasoning. It builds on Proximal Policy Optimization (PPO) by sampling multiple responses (rollouts) for each prompt, evaluating them with a reward model, and using a group baseline to reduce variance in the advantage estimates. This allows for more stable and efficient learning, particularly for open-ended generation tasks.
- Intersection over Union (IoU): A common evaluation metric used to quantify the overlap between two bounding boxes or temporal intervals. For temporal grounding, IoU measures the ratio of the intersection duration of the predicted and ground-truth intervals to their union duration. A higher IoU indicates better overlap (a minimal code sketch of this computation follows this list). Mathematical Formula: $ \mathrm{IoU} = \frac{|[t_s, t_e] \cap [t'_s, t'_e]|}{|[t_s, t_e] \cup [t'_s, t'_e]|} $ Symbol Explanation:
  - $[t_s, t_e]$: The predicted temporal interval, where $t_s$ is the start time and $t_e$ is the end time.
  - $[t'_s, t'_e]$: The ground-truth temporal interval, where $t'_s$ is the start time and $t'_e$ is the end time.
  - $\cap$: The intersection of the two intervals (the duration where they overlap).
  - $\cup$: The union of the two intervals (the total duration covered by either interval).
  - $|\cdot|$: The length (duration) of a temporal interval.
- LLM-as-a-Judge: A technique where a powerful Large Language Model (LLM) is used to evaluate the quality of responses generated by other models. Instead of relying on human annotators or rule-based systems, the LLM-as-a-Judge assesses responses based on criteria like correctness, coherence, and relevance. This is particularly useful for open-ended tasks where traditional metrics are difficult to apply.
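A minimal Python sketch of the temporal IoU defined above (the function name, the (start_sec, end_sec) interval representation, and the example ground-truth window are illustrative assumptions, not code from the paper):

```python
def temporal_iou(pred, gt):
    """Compute IoU between two temporal intervals given as (start_sec, end_sec)."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap duration (zero if the intervals are disjoint).
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    # Union duration = sum of both lengths minus the overlap.
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

# Illustrative numbers: prediction [763, 995] vs. an arbitrary ground truth [800, 1000].
print(round(temporal_iou((763.0, 995.0), (800.0, 1000.0)), 2))  # -> 0.82
```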
3.2. Previous Works
The paper contextualizes its work within two main streams of research:
- RL-Based Multimodal Reasoning:
  - Early Inspiration: OpenAI o1 [17] and DeepSeek-R1 [11] extended GRPO-style RL from text-only reasoning to multimodal domains. These foundational works showed the potential of RL for improving reasoning capabilities.
  - Vision-centric: Methods like [15, 30, 59] applied RL to improve image question answering (QA), grounding [7, 27, 35], and segmentation [26].
  - Video-centric: More directly relevant, works such as Video-R1 [8] and [44] tackle video QA, while [47] focuses on temporal grounding, and [23] addresses spatiotemporal grounding. Recent efforts [4] have scaled RL to long videos.
  - Audio and Omnimodal: RL has also been applied to audio QA [20, 48] and broader omnimodal reasoning [62].
  - Key takeaway: These works collectively demonstrate that RL-based reasoning significantly improves cross-modal understanding.
- Tool-Augmented Agentic LMMs:
  - Images: Recent methods [38, 50, 54, 61] enhance image reasoning by interleaving pixel-level operations (e.g., zooming, drawing auxiliary lines, generative imagery) to process finer details and reduce hallucinations. DeepEyes [61] is specifically mentioned as having similar training dynamics concerning reflection tokens.
  - Videos: VITAL [57] is a concurrent work that also explores tool-augmented RL for improving video QA and temporal grounding.
3.3. Technological Evolution
The field has evolved from language-centric models that primarily process text, to multimodal models that can handle both text and visual information. Initially, multimodal models often relied on supervised fine-tuning (SFT) and simple Chain-of-Thought (CoT) prompting, which proved effective for short videos but struggled with the scale and complexity of long-form content, leading to hallucinations.
The next evolutionary step involved incorporating Reinforcement Learning (RL) (R1-style paradigms) to enhance reasoning and align model behavior with desired outcomes, moving beyond token-level likelihood optimization. Concurrently, the concept of AI agents emerged, where models are equipped with tools to interact dynamically with their environment, much like humans use tools to gather more information.
LongVT fits within this timeline by pushing the boundaries of agentic LMMs specifically for long-form video understanding. It combines RL-based reasoning with tool augmentation, allowing models to actively seek out visual evidence, mirroring human cognitive strategies for long video comprehension (global skim, local inspection). This moves beyond passive processing towards active, interactive, and self-correcting reasoning, especially addressing the segment-in-a-haystack problem where crucial evidence is sparse.
3.4. Differentiation Analysis
Compared to prior work, particularly VITAL [57], LongVT introduces several key innovations:
- Target Task and Dataset: VITAL focuses on general video QA and temporal grounding. LongVT specifically targets video segment-in-a-haystack reasoning, where evidence is extremely sparse and temporally dispersed in hours-long footage. To address this, it contributes a large-scale, high-quality dataset, VideoSIAH, and a dedicated benchmark, VideoSIAH-Eval. This dataset is designed to explicitly trigger tool-integrated reasoning and reveal emergent human-like self-reflection capabilities.
- Training Paradigm: VITAL uses tool-augmented RL, but the specifics of its training stages are not detailed in the comparison. LongVT proposes a novel three-stage closed-loop training paradigm:
  - Cold-start SFT: To provide a robust foundation for tool-calling and reasoning, which the paper empirically shows is indispensable (Figure 14).
  - Agentic RL: For enhancing generalization.
  - Agentic RFT: A dedicated stage that leverages high-quality rollout traces (self-distilled from RL) for iterative self-refinement and stabilization of agentic behaviors.
- Reward Function: Prior works often rely on multi-task objectives (e.g., Video-R1 [8], [23]) or explicit tool rewards (e.g., VITAL [57], DeepEyes [61]). LongVT shows that single-task RL with a decoupled temporal-grounding reward (specifically IoU) can achieve state-of-the-art performance. This decoupled reward is integrated into a joint answer-temporal grounding reward function, unifying answer correctness and temporal precision without requiring explicit tool invocation bonuses. The paper also ablates the necessity of a tool reward, finding it less critical than strong SFT and the IoU reward.
- Emphasis on Emergent Behavior: LongVT explicitly designs its training and data to foster emergent human-like self-reflection and hypothesis-verification behaviors, which is a core part of its interleaved Multimodal Chain-of-Tool-Thought (iMCoTT).
4. Methodology
4.1. Principles
The core idea behind LongVT is to mimic human comprehension of long videos: first skim globally to identify potentially relevant segments, and then zoom in on those segments for detailed examination. This global-to-local reasoning strategy is implemented through an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought (iMCoTT).
The theoretical basis and intuition are that LMMs can leverage their inherent temporal grounding capabilities to dynamically interact with the video. This involves proposing precise temporal windows (hypotheses), using a native video cropping tool (crop_video()) to resample finer-grained frames within that window (verification), and then refining the reasoning based on the new visual evidence. This hypothesis-verification cycle allows the model to self-correct when initial retrievals are insufficient or inaccurate, similar to how a human would re-inspect a part of a video. The goal is to ground answers in retrieved visual evidence, reducing hallucinations that often plague LMMs when dealing with sparse or dispersed evidence in long videos.
The following figure (Figure 4 from the original paper) visualizes the overall framework of LongVT, showing its iterative hypothesis-verification cycle and the role of the crop_video tool:
This figure is a schematic diagram of the LongVT workflow, covering the global skim and finer-grained reasoning stages. It shows video frames being resampled at specific time points, together with the thinking steps and reward mechanism that lead to the final predicted answer.
Figure 4. The overall framework of LongVT. LongVT enables Thinking with Long Videos through an iterative hypothesis-verification cycle. This is incentivized via cold-start SFT, enabling the model to skim global frames and proactively invoke the crop_video tool to resample fine-grained evidence. In cases where the initial retrieval proves insufficient, the model leverages learned self-correction to reinvoke the tool with refined parameters. Crucially, this entire decision-making trajectory is consolidated via agentic RL, which optimizes the policy against the joint answer-temporal grounding reward, enhancing the model's generalization ability to further align with human-like verification strategies.
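To make the hypothesis-verification loop concrete, here is a minimal Python sketch of one possible agent loop around a crop_video-style tool. Only the tool name crop_video comes from the paper; the function signatures, the context format, the 64-frame global skim, and the stopping logic are illustrative assumptions rather than LongVT's actual implementation.

```python
def crop_video(video_path, start_sec, end_sec, num_frames=32):
    """Assumed tool interface: uniformly resample `num_frames` frames from [start_sec, end_sec].

    Actual frame decoding is omitted; each "frame" is represented here by its timestamp.
    """
    step = (end_sec - start_sec) / num_frames
    return [("frame_at_sec", start_sec + (i + 0.5) * step) for i in range(num_frames)]


def answer_with_long_video(model_step, video_path, duration_sec, question, max_tool_calls=3):
    """One global-to-local reasoning loop.

    `model_step` is an assumed callable that, given the running multimodal context,
    returns either ("tool_call", (t_s, t_e)), a proposed temporal window, or
    ("answer", text) once the retrieved evidence is judged sufficient.
    """
    context = [("question", question),
               ("frames", crop_video(video_path, 0.0, duration_sec, num_frames=64))]  # global skim
    for _ in range(max_tool_calls):
        kind, payload = model_step(context)
        if kind == "tool_call":                                            # hypothesis
            t_s, t_e = payload
            context.append(("frames", crop_video(video_path, t_s, t_e)))   # verification: zoom in
        else:                                                              # grounded answer found
            return payload
    # Tool budget exhausted: force a final answer from the accumulated evidence.
    return model_step(context + [("instruction", "answer now")])[1]
```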
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. VideoSIAH: Fine-Grained Data Suite for Evidence-Sparse Long-Video Reasoning
LongVT's training and evaluation heavily rely on a newly constructed dataset called VideoSIAH. This dataset is designed to tackle the unique challenges of long-video reasoning, where LMMs must locate sparse, fine-grained, and causally decisive moments within hours-long content. Existing datasets are often too coarse-grained and do not sufficiently supervise the learning of temporal hypothesis formation, verification, or revision.
4.2.1.1. Data Pipeline
The VideoSIAH dataset is curated using a semi-automatic, human-in-the-loop pipeline that generates temporally grounded reasoning traces aligned with human cognitive processes.
The following figure (Figure 2 from the original paper) illustrates the data pipeline for VideoSIAH:
Figure 2. VideoSIAH Data Pipeline. We construct high-quality, temporally grounded reasoning traces for long videos. We automatically detect scenes, merge short segments, generate detailed captions using Qwen2.5-VL-72B, and create initial QA pairs. Text-based filtering removes low-quality QAs, and multimodal filtering (with GLM-4.5V) ensures visual consistency. Annotator feedback refines prompts for QA generation, filtering, and iMCoTT construction, ensuring high-fidelity data. The tool-integrated iMCoTT reasoning traces are generated only for the cold-start SFT stage, whereas RL training operates solely on the filtered QA pairs.
The steps are as follows:
- Automatic Scene Detection & Segmentation: Long videos are first processed to detect scene changes. Consecutive segments shorter than 10 seconds are merged to create semantically stable units.
- Detailed Caption Generation: For each stable video segment, Qwen2.5-VL-72B [1] is employed to generate detailed descriptions. These captions capture salient objects, spatial relations, and evolving events, forming the semantic basis for QA pair generation.
- Initial QA Generation: QA pairs are created from these detailed captions. The questions cover a wide range of aspects, including temporal events, spatial layouts, motion, object attributes, and scene transitions, ensuring broad coverage and diversity.
- Filtering Stages:
  - Text-based QA Filtering: Low-quality or ill-posed questions (e.g., those with answer leakage, where the question implicitly contains the answer) are removed using linguistic heuristics and cross-model agreement checks.
  - Multimodal QA Filtering: GLM-4.5V [12] is used to verify the consistency of answers against the actual video segments. This step eliminates hallucinated or visually unsupported claims.
- Prompt-Feedback Refinement Loop: Human annotators provide feedback, which is used to refine the prompting rules for QA generation, filtering, and iMCoTT (interleaved Multimodal Chain-of-Tool-Thought) construction. This iterative refinement process ensures high-fidelity, temporally grounded, and scalable data collection with minimal manual annotation.
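As a concrete illustration of the segment-merging rule in the scene-detection step above (consecutive segments shorter than 10 seconds are merged), here is a small Python sketch; the (start, end) segment representation is an assumption, not the paper's pipeline code:

```python
def merge_short_segments(segments, min_len_sec=10.0):
    """Merge each segment shorter than `min_len_sec` into the previous one.

    `segments` is a list of (start_sec, end_sec) tuples from a scene detector,
    assumed sorted and contiguous. Returns a list of semantically stable units.
    """
    merged = []
    for start, end in segments:
        if merged and (end - start) < min_len_sec:
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, end)   # absorb the short segment into its neighbor
        else:
            merged.append((start, end))
    return merged

# Example: a 4-second scene between two longer ones gets absorbed.
print(merge_short_segments([(0.0, 25.0), (25.0, 29.0), (29.0, 60.0)]))
# -> [(0.0, 29.0), (29.0, 60.0)]
```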
4.2.1.2. Dataset Curation
VideoSIAH is divided into different splits for Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Reinforcement Fine-Tuning (RFT).
- SFT Data Curation: This data aims to enhance both tool-calling capability and general reasoning. It includes three major categories:
  - Tool-augmented Multi-round Data: These are QA pairs with iMCoTT traces that involve multiple tool calls. For hours-long videos, a single tool call might not capture the correct temporal segment, necessitating multi-round tool-calling. The probability of selecting a video sample for multi-round curation is defined adaptively based on its length:
    $ P_{\mathrm{multi}} = 1 - \frac{L_{\mathrm{max}} - \mathrm{clip}(L_{\mathrm{video}}, L_{\mathrm{max}}, L_{\mathrm{min}})}{L_{\mathrm{max}} - L_{\mathrm{min}}} $
    Symbol Explanation:
    - $P_{\mathrm{multi}}$: The probability of choosing a given data sample for multi-round generation.
    - $L_{\mathrm{video}}$: The length of the video in question.
    - $L_{\mathrm{max}}$: The maximum video length threshold.
    - $L_{\mathrm{min}}$: The minimum video length threshold.
    - $\mathrm{clip}(L_{\mathrm{video}}, L_{\mathrm{max}}, L_{\mathrm{min}})$: A function that restricts $L_{\mathrm{video}}$ to the range $[L_{\mathrm{min}}, L_{\mathrm{max}}]$: it returns $L_{\mathrm{max}}$ if $L_{\mathrm{video}} > L_{\mathrm{max}}$, $L_{\mathrm{min}}$ if $L_{\mathrm{video}} < L_{\mathrm{min}}$, and $L_{\mathrm{video}}$ otherwise.
    This formula ensures that longer videos have a higher probability of undergoing multi-round data generation (a video at or above $L_{\mathrm{max}}$ receives $P_{\mathrm{multi}} = 1$, while one at or below $L_{\mathrm{min}}$ receives $P_{\mathrm{multi}} = 0$), thereby improving temporal coverage and reasoning completeness.
  - Image Reasoning Data: A mixture of diverse image-based reasoning datasets is included to strengthen fundamental perceptual capabilities.
  - Video Reasoning Data: General video QA datasets are also incorporated.
- RL Data Curation: This split is built from filtered segment-in-a-haystack QA pairs.
  - Length-balanced Subset: QA pairs are grouped by video duration (short, medium, long), and a length-balanced subset is sampled to ensure diverse video durations are covered.
  - Difficulty-aware Filter: For each question, rollouts (generated responses) are drawn from the current policy. Items are discarded if all trajectories answer correctly (too easy) or all fail (too hard), focusing RL on a middle band of difficulty that provides more informative reward signals.
- RFT Data Curation: This data is used for post-RL refinement.
  - High-Quality Trajectory Filtering: Trajectories from early RL runs are kept only if they meet two criteria: (1) the model produces the correct final answer, and (2) the predicted temporal span achieves an Intersection over Union (IoU) of at least 0.3 with the ground-truth window.
  - Supervised Training Examples: These filtered, high-quality trajectories are converted into supervised examples for RFT. This provides high-precision in-distribution supervision, stabilizing optimization and strengthening grounding and tool-calling behavior.
4.2.1.3. Dataset Statistics
The following are the dataset statistics of VideoSIAH, as presented in Table 1 of the original paper:
| Split | Source | Purpose | Samples | Total |
| SFT (w/o tool) | LongVideo-Reason CoT [4] | Reasoning-augmented Open-ended QA | 5,238 | 228,835 |
| | Video-R1 CoT [8] | Reasoning-augmented Video QA | 165,575 | |
| | Image-based CoT | Reasoning-augmented Image QA | 58,022 | |
| SFT (w/ tool) | Gemini-distilled iMCoTT | Tool-augmented Open-ended QA | 12,766 | 19,161 |
| | Qwen-distilled iMCoTT | Tool-augmented Temporal Grounding | 6,395 | |
| RL | Gemini-distilled | Open-ended QA over Long Videos | 1,667 | 17,020 |
| RFT | Self-distilled iMCoTT | Agentic Behaviors | 15,353 | |
Table 1. Dataset Statistics of VideoSIAH. Our proposed dataset contains non-tool SFT data, tool-augmented SFT data, RL QAs, and self-distilled RFT traces.
VideoSIAH-Eval Benchmark: This dedicated evaluation benchmark consists of 244 videos and 1,280 carefully filtered QA pairs with human-in-the-loop validation. It is designed for long-form video reasoning, with an average video duration of approximately 1,688 seconds. About 71.84% of videos are in the 15-30 minute range, and 28.16% are longer than 30 minutes.
4.2.2. Training Strategy
LongVT employs a three-stage training pipeline to elicit robust "Thinking with Long Videos" behaviors.
4.2.2.1. Cold-Start Supervised Fine-Tuning (SFT)
This initial stage is crucial for equipping the base LMM with fundamental capabilities necessary for tool-augmented reasoning. The paper empirically shows that without this cold-start SFT, the model struggles significantly in later RL stages, often failing to improve or even collapsing. The SFT stage teaches the model:
- Proposing Temporal Windows: The ability to identify and suggest a precise time window where relevant events might occur.
- Invoking Video Tools: The skill to correctly call the crop_video() tool and understand its function.
- Composing Multimodal Evidence: How to integrate the finer-grained frames returned by the tool into its reasoning process to formulate an answer.
- Self-Correcting: The capacity to recognize when an initial temporal window is suboptimal and to reinvoke the tool with refined parameters.
4.2.2.2. Agentic Reinforcement Learning (RL)
In this stage, the model acts as a tool-using agent. It learns when to inspect the video, how long to crop the video segment, and how to integrate the retrieved evidence into its reasoning. GRPO [34] is employed for optimization. A key innovation here is the joint answer-temporal grounding reward function, which unifies answer accuracy, format compliance, and temporal grounding precision.
The training objective for SFT is next-token prediction, where the model minimizes the negative log-likelihood of target tokens. For a sequence of tokens $x = (x_1, \dots, x_T)$ and a model parameterized by $\theta$ that defines conditional probabilities $p_\theta(x_t \mid x_{<t})$, the loss function is:
$
\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$
Symbol Explanation:
- $\mathcal{L}(\theta)$: The negative log-likelihood loss function for the model with parameters $\theta$.
- $x$: A sequence of tokens $(x_1, \dots, x_T)$.
- $T$: The total number of tokens in the sequence.
- $t$: An index iterating through the tokens from $1$ to $T$.
- $p_\theta(x_t \mid x_{<t})$: The probability of token $x_t$ given all preceding tokens $x_{<t}$, as predicted by the model with parameters $\theta$. This loss encourages the model to assign higher probability to the correct next token in the sequence.
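A minimal PyTorch-style sketch of this token-level negative log-likelihood (standard teacher-forcing cross-entropy; illustrative, not LongVT's training code):

```python
import torch
import torch.nn.functional as F

def next_token_nll(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: [T, V] scores per position; targets: [T] ground-truth token ids.

    Cross-entropy of the correct next token equals -log p_theta(x_t | x_<t),
    so summing it over positions gives the SFT loss above.
    """
    return F.cross_entropy(logits, targets, reduction="sum")

# Tiny example with a vocabulary of 5 tokens and a 3-token target sequence.
logits = torch.randn(3, 5)
targets = torch.tensor([1, 4, 2])
print(next_token_nll(logits, targets))
```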
The three-part reward modeling for RL is defined as follows:
- Answer Accuracy ($\mathbf{R}_{\mathrm{acc}}$): An LLM-as-a-Judge [53] is used to assess the quality of the generated answer $\hat{a}^{(k)}$ against the ground-truth answer $a^{\star}$. This judge provides a categorical verdict $J^{(k)}$:
  $ J^{(k)} = \mathrm{Judge}_{\mathrm{LLM}}\left( \hat{a}^{(k)}, a^{\star} \right) \in \{ \mathrm{F}, \mathrm{P}, \mathrm{I} \} $
  Symbol Explanation:
  - $J^{(k)}$: The categorical verdict for the $k$-th rollout.
  - $\mathrm{Judge}_{\mathrm{LLM}}$: The evaluation function performed by the LLM-as-a-Judge.
  - $\hat{a}^{(k)}$: The answer generated by the model for the $k$-th rollout.
  - $a^{\star}$: The ground-truth answer.
  - $\mathrm{F}$: Fully consistent (semantically equivalent to $a^{\star}$).
  - $\mathrm{P}$: Partially consistent (contains some correct information but is incomplete or imprecise).
  - $\mathrm{I}$: Inconsistent (incorrect or contradictory).
  The accuracy reward is then normalized:
  $ \mathbf{R}_{\mathrm{acc}}^{(k)} = \begin{cases} 1, & \text{if } J^{(k)} = \mathrm{F}, \\ 0.5, & \text{if } J^{(k)} = \mathrm{P}, \\ 0, & \text{if } J^{(k)} = \mathrm{I}. \end{cases} $
  Symbol Explanation:
  - $\mathbf{R}_{\mathrm{acc}}^{(k)}$: The accuracy reward for the $k$-th rollout.
  - $J^{(k)}$: The verdict from the LLM-as-a-Judge for the $k$-th rollout.
- Format Compliance ($\mathbf{R}_{\mathrm{format}}$): This reward ensures that the model's output adheres to a required schema $\mathcal{S}$.
  $ \mathbf{R}_{\mathrm{format}}^{(k)} = \begin{cases} 1, & \text{if } y^{(k)} \text{ matches } \mathcal{S}, \\ 0, & \text{otherwise}. \end{cases} $
  Symbol Explanation:
  - $\mathbf{R}_{\mathrm{format}}^{(k)}$: The format compliance reward for the $k$-th rollout.
  - $y^{(k)}$: The full textual output of the $k$-th rollout.
  - $\mathcal{S}$: The required output schema (e.g., specific XML tags or JSON structure).
- Temporal Overlap ($\mathbf{R}_{\mathrm{time}}$): This uses IoU to reward temporal localization precision. For a predicted temporal span $[t_s, t_e]$ and ground truth $[t'_s, t'_e]$:
  $ \mathrm{IoU} = \frac{|[t_s, t_e] \cap [t'_s, t'_e]|}{|[t_s, t_e] \cup [t'_s, t'_e]|} $
  Symbol Explanation (as defined in Section 3.1):
  - $[t_s, t_e]$: Predicted temporal interval.
  - $[t'_s, t'_e]$: Ground-truth temporal interval.
  - $\cap$: Intersection.
  - $\cup$: Union.
  - $|\cdot|$: Duration.
  The temporal reward is simply the IoU value:
  $ \mathbf{R}_{\mathrm{time}}^{(k)} = \mathrm{IoU}^{(k)}. $
  Symbol Explanation:
  - $\mathbf{R}_{\mathrm{time}}^{(k)}$: The temporal overlap reward for the $k$-th rollout.
  - $\mathrm{IoU}^{(k)}$: The IoU calculated for the $k$-th rollout's predicted temporal span. This form rewards accurate temporal grounding, with a value of 1 for a perfect match and 0 for no overlap.
- Overall Reward ($\mathbf{R}$): The total reward for a rollout is the sum of these three components:
  $ \mathbf{R}^{(k)} = \mathbf{R}_{\mathrm{acc}}^{(k)} + \mathbf{R}_{\mathrm{format}}^{(k)} + \mathbf{R}_{\mathrm{time}}^{(k)}. $
  Symbol Explanation:
  - $\mathbf{R}^{(k)}$: The total reward for the $k$-th rollout.
  - $\mathbf{R}_{\mathrm{acc}}^{(k)}$: The accuracy reward.
  - $\mathbf{R}_{\mathrm{format}}^{(k)}$: The format compliance reward.
  - $\mathbf{R}_{\mathrm{time}}^{(k)}$: The temporal overlap reward.

For RL training, Group Relative Policy Optimization (GRPO) [34] is used. For each prompt $x \sim \mathcal{D}$ (where $\mathcal{D}$ is the dataset of prompts), a group of $K$ responses (rollouts) is drawn from the behavior policy:
$ y^{(k)} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x), \quad k = 1, \dots, K, \qquad y^{(k)} = (y_1^{(k)}, \dots, y_{T_k}^{(k)}), \qquad T_k = \mathrm{len}(y^{(k)}). $
Symbol Explanation:
- $y^{(k)}$: The $k$-th generated response (rollout) sequence.
- $\pi_{\theta_{\mathrm{old}}}$: The behavior policy with parameters $\theta_{\mathrm{old}}$.
- $x$: The input prompt from the dataset $\mathcal{D}$.
- $K$: The number of sampled rollouts in a group.
- $y_t^{(k)}$: The $t$-th token in the $k$-th rollout.
- $T_k$: The length of the $k$-th rollout.
A group baseline and advantages are calculated:
$ b = \frac{1}{K} \sum_{k=1}^{K} R^{(k)}, \qquad A^{(k)} = R^{(k)} - b, $
Symbol Explanation:
- $b$: The group baseline, which is the average reward of all rollouts in the group.
- $A^{(k)}$: The advantage for the $k$-th rollout, representing how much better (or worse) its reward is compared to the group baseline.
- $R^{(k)}$: The scalar reward for the $k$-th rollout, as defined above.
The policy maximizes a length-normalized, token-conditional KL-regularized objective:
$ \mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y^{(k)}\} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \frac{1}{K} \sum_{k=1}^{K} \frac{1}{T_k} \sum_{t=1}^{T_k} A^{(k)} \log \pi_{\theta}\big( y_t^{(k)} \mid x, y_{<t}^{(k)} \big) \right] - \beta\, \mathbb{E}_{x \sim \mathcal{D}} \left[ \frac{1}{K} \sum_{k=1}^{K} \frac{1}{T_k} \sum_{t=1}^{T_k} D_{\mathrm{KL}}\Big( \pi_{\theta}(\cdot \mid x, y_{<t}^{(k)}) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x, y_{<t}^{(k)}) \Big) \right] $
Symbol Explanation:
- $\mathcal{J}(\theta)$: The objective function to be maximized for the current policy with parameters $\theta$.
- $\mathbb{E}_{x \sim \mathcal{D}}$: Expectation over prompts sampled from the dataset $\mathcal{D}$.
- $\{y^{(k)}\} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)$: Expectation over the group of rollouts sampled from the behavior policy given prompt $x$.
- $\log \pi_{\theta}(y_t^{(k)} \mid x, y_{<t}^{(k)})$: The log-probability of generating token $y_t^{(k)}$ under the current policy $\pi_{\theta}$, given the prompt and preceding tokens $y_{<t}^{(k)}$. This term encourages the policy to generate actions that lead to high rewards.
- $\beta$: A hyperparameter controlling the strength of the KL regularization term.
- $D_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}})$: The Kullback-Leibler (KL) divergence between the current policy and a frozen reference policy. This term penalizes large deviations from the reference policy, helping to stabilize training and prevent aggressive policy updates.
- $\pi_{\mathrm{ref}}$: A frozen reference policy, typically the SFT model or an earlier version of the RL policy, used to prevent the policy from drifting too far during RL.
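Under the definitions above, the joint reward for one rollout is easy to assemble once the judge verdict, the format check, and the predicted window are available. A minimal Python sketch (helper names and the example windows are assumptions; the F/P/I verdict labels and the R_acc + R_format + R_time sum follow the formulas above):

```python
def accuracy_reward(verdict: str) -> float:
    """Map the LLM-as-a-Judge verdict to a normalized score."""
    return {"F": 1.0, "P": 0.5, "I": 0.0}[verdict]

def temporal_reward(pred_window, gt_window) -> float:
    """IoU between predicted and ground-truth (start_sec, end_sec) windows."""
    (ps, pe), (gs, ge) = pred_window, gt_window
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def total_reward(verdict: str, format_ok: bool, pred_window, gt_window) -> float:
    """R = R_acc + R_format + R_time, i.e., the joint answer-temporal grounding reward."""
    return accuracy_reward(verdict) + (1.0 if format_ok else 0.0) + temporal_reward(pred_window, gt_window)

# Example rollout: partially correct answer, valid format, decent localization.
print(total_reward("P", True, (763.0, 995.0), (800.0, 1000.0)))  # ~0.5 + 1.0 + 0.82
```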
4.2.2.3. Agentic Reinforcement Fine-tuning (RFT)
This final stage is designed to stabilize the agentic behaviors learned during RL and consolidate multimodal reasoning. It is motivated by findings that RFT is crucial for strengthening reasoning capabilities in LLMs.
The process involves:
- Filtering High-Quality Trajectories: As described in the RFT Data Curation section, trajectories from RL runs that achieve both correct final answers and IoU >= 0.3 for temporal grounding are selected.
- Self-Distillation: These filtered trajectories are converted into supervised training examples.
- Post-RL Refinement: The model (initialized with the best-performing RL checkpoint) is then fine-tuned on this self-generated, well-grounded dataset. This in-distribution supervision helps the model internalize robust grounding and tool-calling patterns, leading to further performance gains beyond what SFT or RL alone can achieve.
4.2.3. Overall Framework
The overall LongVT framework operates in an iterative "hypothesis-verification" cycle, where the model skims global frames, proposes a temporal window, invokes the crop_video() tool to resample finer-grained evidence, and self-corrects if the initial retrieval is insufficient. This entire decision-making process is refined through the three-stage training pipeline, optimizing against a joint answer-temporal grounding reward to align with human-like verification strategies.
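To summarize the optimization signal that drives this loop, here is a minimal sketch of GRPO's group-relative advantage from Section 4.2.2.2 (illustrative only; variable names are assumptions, and the clipped, KL-regularized policy update itself is omitted):

```python
def group_relative_advantages(rewards):
    """Given the scalar rewards R^(k) of K rollouts for the same prompt,
    subtract the group-mean baseline b to obtain advantages A^(k) = R^(k) - b."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Example: 4 rollouts of one prompt; rollouts better than the group average
# get positive advantages (reinforced), the others negative (discouraged).
rewards = [2.32, 1.0, 2.9, 0.5]            # e.g., R_acc + R_format + R_time per rollout
print(group_relative_advantages(rewards))  # [0.64, -0.68, 1.22, -1.18]
```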
5. Experimental Setup
5.1. Datasets
LongVT leverages a combination of existing and newly curated datasets for both training and evaluation.
Training Datasets: The training process uses a diverse mixture of data, as detailed in Table 1 (provided in Section 4.2.1.3).
- VideoSIAH (Self-Curated):
  - SFT (w/o tool): 228,835 samples from LongVideo-Reason CoT [4], Video-R1 CoT [8], and Image-based CoT.
  - SFT (w/ tool): 19,161 samples from Gemini-distilled iMCoTT and Qwen-distilled iMCoTT. These are tool-augmented examples for open-ended QA and temporal grounding.
  - RL: 1,667 Gemini-distilled open-ended QA samples over long videos.
  - RFT: 15,353 self-distilled iMCoTT samples for agentic behaviors.
- Image-based CoT Data (Detailed Breakdown): The following are the detailed statistics of Image-based CoT Data for Cold-Start SFT, as presented in Table 5 of the original paper:

| Source | Purpose | Samples |
| LLaVA-CoT [51] | General Visual Reasoning | 54,591 |
| OpenVLThinker [6] | Complex Reasoning | 2,829 |
| We-Math 2.0 [32] | Mathematical Reasoning | 602 |

Table 5. Detailed Statistics of Image-based CoT Data for Cold-Start SFT.
These datasets are chosen to provide a strong foundation in general visual reasoning, complex logical inference, and mathematical problem-solving, which are crucial for underpinning robust temporal reasoning.
Evaluation Benchmarks: The models are evaluated on four challenging benchmarks:
- VideoMME [9]: A comprehensive evaluation benchmark for multi-modal LLMs in video analysis. Videos have an average duration of ≈1018 seconds.
- VideoMMMU [13]: Evaluates knowledge acquisition from multi-discipline professional videos. Videos have an average duration of ≈506 seconds. It has sub-categories of adaptation, comprehension, and perception.
- LVBench [46]: An extreme long video understanding benchmark. Videos have an average duration of ≈4101 seconds.
- VideoSIAH-Eval (Self-Curated): Designed specifically for evidence-sparse long-video reasoning. It comprises 244 videos and 1,280 carefully filtered QA pairs with human-in-the-loop validation. Average video duration is ≈1688 seconds. Its design aims to overcome the data contamination and option bias found in other benchmarks, as discussed in Section 8 of the paper.

The following figure (Figure 6 from the original paper) shows the category distribution for VideoSIAH-Eval:
Figure 6. VideoSIAH-Eval Statistics. (a) shows the distribution of video categories, and (b) shows the proportion of question categories, highlighting the diversity of our proposed benchmark.
This figure demonstrates the diversity of video categories (Travel & Events, Gaming, Education, Sports, Food, Uncategorized) and question categories (Object Recognition, Temporal Reasoning, Action Recognition, Plot Synopsis, Counting, Spatial Relationship, Mathematical Reasoning, Emotion Recognition), confirming the benchmark's comprehensive nature.
Data Sample Example:
The following figure (Figure 10 from the original paper) shows a representative sample from both SFT and RFT stages, demonstrating the structure of iMCoTT with tool calls:
Figure 10. SFT/RFT Data Example. This example shows an iMCoTT reasoning trace, where the model iteratively refines its understanding of the video content by cropping specific segments (e.g., [763.00s - 995.00s]) to find evidence for the question: "Across the series of festive snack demonstrations...what does the man consistently keep in his arms?" The answer is grounded in the retrieved video segment: "a small white dog."
The following figure (Figure 9 from the original paper) presents the evaluation prompts used in LLM-as-a-Judge for measuring answer accuracy during RL:
Figure 9. LLM-as-a-Judge Reward Prompt. This prompt template is used to evaluate the consistency between a model's generated answer and the ground-truth answer, providing a score of 1, 0.5, or 0.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
- Average Score:
  - Conceptual Definition: This metric represents the average performance of a model across multiple evaluation benchmarks. It provides a generalized measure of the model's overall capability. In this paper, it is the average of the scores obtained on VideoMME, VideoMMMU (itself an average of its sub-metrics), LVBench, and VideoSIAH-Eval.
  - Mathematical Formula: $ \text{Average Score} = \frac{1}{N} \sum_{i=1}^{N} \text{Score}_i $
  - Symbol Explanation:
    - $\text{Average Score}$: The final average score across all benchmarks.
    - $N$: The total number of benchmarks included in the average (here, 4: VideoMME, VideoMMMU, LVBench, and VideoSIAH-Eval, where VideoMMMU's score is itself an average of its sub-metrics).
    - $\text{Score}_i$: The score obtained by the model on the $i$-th benchmark.
- Accuracy (for LLM-as-a-Judge):
  - Conceptual Definition: This metric quantifies the semantic consistency between a model's generated answer and the ground-truth answer. It goes beyond simple keyword matching by evaluating whether the core meaning is preserved. The scores reflect whether the answer is fully correct, partially correct, or incorrect/inconsistent. This is used as part of the answer accuracy reward in RL.
  - Mathematical Formula: The LLM-as-a-Judge assigns one of three categorical verdicts, which are then mapped to numerical scores:
    $ \mathbf{R}_{\mathrm{acc}}^{(k)} = \begin{cases} 1, & \text{if } J^{(k)} = \mathrm{F} \text{ (Fully consistent)}, \\ 0.5, & \text{if } J^{(k)} = \mathrm{P} \text{ (Partially consistent)}, \\ 0, & \text{if } J^{(k)} = \mathrm{I} \text{ (Inconsistent)}. \end{cases} $
  - Symbol Explanation:
    - $\mathbf{R}_{\mathrm{acc}}^{(k)}$: The accuracy reward for the $k$-th rollout.
    - $J^{(k)}$: The categorical verdict from the LLM-as-a-Judge for the $k$-th rollout, which can be $\mathrm{F}$, $\mathrm{P}$, or $\mathrm{I}$.
- Intersection over Union (IoU) (for Temporal Grounding):
  - Conceptual Definition: IoU is a standard metric for evaluating the overlap between two temporal intervals (predicted and ground truth). It is calculated as the ratio of the duration of their intersection to the duration of their union. A higher IoU value indicates better temporal alignment between the predicted and actual event boundaries. This is used as part of the temporal overlap reward in RL.
  - Mathematical Formula: $ \mathrm{IoU} = \frac{|[t_s, t_e] \cap [t'_s, t'_e]|}{|[t_s, t_e] \cup [t'_s, t'_e]|} $
  - Symbol Explanation:
    - $[t_s, t_e]$: The predicted temporal interval (start time $t_s$, end time $t_e$).
    - $[t'_s, t'_e]$: The ground-truth temporal interval (start time $t'_s$, end time $t'_e$).
    - $\cap$: The intersection of the two intervals, yielding the duration of their overlap.
    - $\cup$: The union of the two intervals, yielding the total duration covered by either interval.
    - $|\cdot|$: The duration (length) of an interval.
- IoU@0.3, IoU@0.5, IoU@0.7 (for Temporal Grounding Benchmarks like Charades-STA):
  - Conceptual Definition: These metrics measure the percentage of predictions whose IoU with the ground-truth interval meets or exceeds a specific threshold (e.g., 0.3, 0.5, or 0.7). A higher value indicates that a larger proportion of predictions are well-aligned with the ground truth at that level of strictness; IoU@0.7 is stricter than IoU@0.3.
  - Mathematical Formula: No specific formula is given in the paper, but conceptually: $ \text{IoU@Threshold} = \frac{\text{Number of predictions with IoU} \ge \text{Threshold}}{\text{Total number of predictions}} \times 100\% $
  - Symbol Explanation:
    - $\text{IoU@Threshold}$: The percentage of predictions meeting the IoU threshold.
    - $\text{Threshold}$: The specific IoU value (e.g., 0.3, 0.5, 0.7) that a prediction's IoU must meet or exceed.
- mIoU (Mean IoU) (for Temporal Grounding Benchmarks):
  - Conceptual Definition: This metric is the average IoU score over all predictions in a dataset. It provides a single summary measure of the average temporal alignment quality across the entire set of events.
  - Mathematical Formula: No specific formula is given in the paper, but conceptually: $ \text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \text{IoU}_i $
  - Symbol Explanation:
    - $\text{mIoU}$: The mean Intersection over Union.
    - $N$: The total number of predictions.
    - $\text{IoU}_i$: The IoU score for the $i$-th prediction.
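A small Python sketch showing how these aggregate grounding metrics can be computed from per-sample IoU values (illustrative; not the paper's evaluation code):

```python
def grounding_metrics(ious, thresholds=(0.3, 0.5, 0.7)):
    """Compute IoU@threshold percentages and mIoU from a list of per-sample IoUs."""
    n = len(ious)
    report = {f"IoU@{t}": 100.0 * sum(i >= t for i in ious) / n for t in thresholds}
    report["mIoU"] = sum(ious) / n
    return report

# Example with four predictions.
print(grounding_metrics([0.82, 0.15, 0.45, 0.60]))
# {'IoU@0.3': 75.0, 'IoU@0.5': 50.0, 'IoU@0.7': 25.0, 'mIoU': 0.505}
```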
5.3. Baselines
LongVT's performance is compared against both open-source and proprietary LMMs.
Open-Source LMMs:
- Qwen2.5-VL-7B [1]: The base model used for LongVT's development, evaluated as a baseline.
- Video-R1-7B [8]: An R1-style model specifically designed for reinforcing video reasoning in MLLMs.
- VideoRFT-7B [44]: Focuses on incentivizing video reasoning via reinforced fine-tuning.
- Video-Thinker-7B [45]: A model that aims to spark "thinking with videos" using reinforcement learning.

Proprietary LMMs:
- GPT-4o [16]: OpenAI's flagship multimodal model.
- Gemini 1.5 Pro [40]: Google's advanced multimodal model, known for its long context window.

Note on VITAL [57]: The paper explicitly states that direct comparisons to VITAL, another concurrent tool-augmented video-centric LMM, are not included because its model checkpoints are not publicly available, which hinders fair and reproducible experiments.
5.4. Experimental Details
- Base Model: Qwen2.5-VL-7B [1] is used as the foundational model across all experiments.
- Frame Sampling Regimes:
  - Sparse Frame Sampling: 64 uniformly sampled video frames are used.
  - Dense Frame Sampling: Either 512 or 768 uniformly sampled frames are used, with the better result reported.
- Prompting:
  - Reasoning Prompt: Indicated by (✓) for standard reasoning-style prompts or (✗) for direct question-answering prompts.
  - Tool Calling: Indicated by (✓) if native tool calling is enabled in the prompt or (✗) if disabled.
- Evaluation Framework: The LMMs-Eval framework [58] is used for unified evaluation.
- Inference Setup: A standard Model Context Protocol server paired with an online inference engine [19] supporting continuous batching for asynchronous requests. Special delimiter tags are injected into the generation stream to parse reasoning steps, tool invocations, and final answers. Performance is quantified using a hybrid scoring mechanism combining deterministic rule-based validators with semantic evaluation via LLM-as-a-Judge [53].
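For reference, the uniform sampling used in the sparse (64-frame) and dense (512/768-frame) regimes can be sketched as follows (a generic illustration, not the evaluation harness's actual code):

```python
def uniform_frame_indices(total_frames: int, num_samples: int):
    """Pick `num_samples` frame indices spread evenly across a video with `total_frames` frames."""
    if total_frames <= num_samples:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the frame at the center of each of the `num_samples` equal-length bins.
    return [int((i + 0.5) * step) for i in range(num_samples)]

# Sparse vs. dense sampling for a ~30-minute video at 30 fps (54,000 frames).
sparse = uniform_frame_indices(54_000, 64)
dense = uniform_frame_indices(54_000, 512)
print(len(sparse), sparse[:3], len(dense), dense[:3])
```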
Implementation Details: The following are the detailed hyperparameters across training stages, as presented in Table 6 of the original paper:
| Component | SFT | RL | RFT |
| Optimizer | AdamW [29] | AdamW | AdamW |
| Learning Rate (LR) | 5e-5 | 1e-6 | 5e-5 |
| LR Scheduler | cosine | constant | cosine |
| Weight Decay | 0.0 | 1e-2 | 0.0 |
| No. of Training Steps | 3000 | 160 | 1600 |
| No. of Warmup Steps | 300 | 0 | 160 |
| Max Length | 51200 | 52384 | 51200 |
| Dynamic Batch Size | True | False | True |
| Remove Padding | True | True | True |
| Liger Kernel | True | False | True |
| No. of GPUs | 32 | 64 | 64 |
| No. of Frames | 512 | 512 | 512 |
Table 6. Detailed Hyperparameters across Training Stages. Unless otherwise specified, all experiments are conducted on NVIDIA A800-SXM4-80GB GPUs.
- SFT: Initialized with Qwen2.5-VL-7B-Instruct [1] using the LMMs-Engine [28] framework. Employs an online stream packing strategy with iterable datasets, concatenating input samples to fill a fixed buffer (51,200 tokens) to optimize throughput and minimize memory. Training continues until convergence.
- RL: Built upon the verl library [36], extended for multi-turn and multimodal tool-augmented rollouts via SGLang [60]. Global batch size of 16, with 16 rollouts per prompt. Maximum of 16,384 new tokens and 36,000 total prompt length. Constant temperature of 1.0 for exploration. Early stopping is used when reward metrics saturate.
- RFT: Uses the same efficient training infrastructure as SFT, initialized with the best-performing RL checkpoint. The training corpus consists of high-quality, self-distilled trajectories from RL rollouts. Computational resources are scaled up to 64 GPUs to accommodate the augmented dataset and accelerate refinement.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that LongVT achieves state-of-the-art performance among open-source video-centric LMMs, particularly excelling in long-video reasoning tasks.
The following are the performance comparison with existing video-centric LMMs across various long video understanding and reasoning benchmarks, as presented in Table 2 of the original paper:
| Model | Reasoning Prompt | Tool Calling | VideoMME (≈1018 sec) [9] w/ subtitle | VideoMMMU (≈506 sec) [13] | LVBench [46] (≈4101 sec) | VideoSIAH-Eval (≈1688 sec) | Average Score | ||
| adaptation | comprehension | perception | |||||||
| Proprietary LMMs | |||||||||
| GPT-4o [16] | X | X | 77.2 | 66.0† | 62.0† | 55.7† | 30.8† | 17.4 | 51.5 |
| Gemini 1.5 Pro [40] | X | X | 81.3* | 59.0* | 53.3 | 49.3 | 33.1* | - | 55.2 |
| Open-Source LMMs with Sparse Frame Sampling | |||||||||
| Qwen2.5-VL-7B [1] | × | X | 62.6 | 37.3 | 28.0 | 36.7 | 30.7 | 28.1 | 37.2 |
| Video-R1-7B [8] | ✓ | × | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | 42.6 |
| VideoRFT-7B [44] | ✓ | X | 60.9 | 36.7 | 42.0 | 53.0 | 34.7 | 26.5 | 42.3 |
| Video-Thinker-7B [45] | ✓ | X | 61.0 | 34.3 | 44.7 | 53.0 | 52.2 | 10.4 | 42.6 |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 12.5 | 37.7 | 46.0 | 58.3 | 36.0 | 26.8 | 36.2 |
| LongVT-7B-RL (Ours) | ✓ | ✓ | 66.1 | 32.7 | 44.7 | 50 | 37.8 | 31.0 | 43.7 |
| Open-Source LMMs with Dense Frame Sampling | |||||||||
| Qwen2.5-VL-7B [1] | X | X | 64.3 | 35.7 | 44.3 | 56.7 | 40.9 | 33.8 | 46.0 |
| Video-R1-7B [8] | ✓ | X | 60.5 | 37.3 | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
| VideoRFT-7B [44] | ✓ | X | 49.2 | 37.7 | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
| Video-Thinker-7B [45] | ✓ | X | 60.8 | 37.7 | 42.7 | 55.3 | 54.3 | 6.6 | 42.9 |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| LongVT-7B-RL (Ours) | ✓ | ✓ | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| LongVT-7B-RFT (Ours) | ✓ | ✓ | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |
Table 2. Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks. The numbers with ≈ denote average video duration in seconds. Benchmarking results are official or reproduced from [9, 3].
Key Observations from Table 2:
- Overall Superiority: LongVT-7B-RFT achieves the highest average score (47.7) among all open-source LMMs in the dense frame sampling setting, outperforming the next best (Qwen2.5-VL-7B) by 1.7 points. This indicates the effectiveness of the proposed iMCoTT and three-stage training.
- Dense Frame Sampling Advantage: LongVT models (especially RL and RFT) show significantly stronger performance with dense frame sampling compared to sparse, highlighting the importance of finer-grained visual information when available.
- Performance on VideoSIAH-Eval: On the challenging VideoSIAH-Eval benchmark, designed for fine-grained evidence retrieval from hours-long videos, LongVT-7B-RFT achieves 42.0, substantially outperforming the second-best open-source model (Qwen2.5-VL-7B with 33.8) by over 8 points. This is a strong validation of LongVT's ability on its targeted task.
- Closing the Gap with Proprietary Models: LongVT-7B-RFT's average score of 47.7 is roughly 3.8 points behind GPT-4o (51.5) and 7.5 points behind Gemini 1.5 Pro (55.2, the strongest proprietary result). This significantly narrows the performance gap between open-source and proprietary LMMs for long-video reasoning.
- Impact of Tool Calling: Models with tool calling enabled (the LongVT variants) generally perform better, especially in dense sampling. For example, LongVT-7B-SFT with dense sampling (44.1) significantly improves over sparse (36.2), and LongVT-7B-RFT pushes this further to 47.7. The low score of LongVT-7B-SFT in sparse sampling (12.5) appears anomalous, or indicates that SFT alone struggles without sufficient frames or further RL refinement.
6.2. Data Presentation (Tables)
The following are the comprehensive ablation studies on data recipes, training strategies, and the decoupled temporal grounding reward, as presented in Table 3 of the original paper:
| Setting | VideoMME [9] (w/ subtitle) | VideoMMMU [13] adaptation | VideoMMMU comprehension | VideoMMMU perception | LVBench [46] (test) | VideoSIAH-Eval (test) | Average Score |
| Data Recipe | |||||||
| SFT w/o self-curated iMCoTT | 8.4 | 33.6 | 41.6 | 46.0 | 15.1 | 4.1 | 24.8 |
| SFT w/ self-curated iMCoTT (LongVT-7B-SFT) | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| RL w/o self-curated QAs | 55.1 | 30.6 | 42.0 | 45.6 | 38.4 | 30.8 | 40.4 |
| RL w/ self-curated QAs (LongVT-7B-RL) | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| Training Stage | |||||||
| SFT only (LongVT-7B-SFT) | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| RL only | 52.7 | 35.33 | 43.0 | 55.1 | 37.1 | 28.2 | 41.9 |
| SFT+RL (LongVT-7B-RL) | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| SFT+RL+RFT (LongVT-7B-RFT) | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |
| Decoupled Temporal Grounding Reward | |||||||
| Setting | Charades-STA [10] IoU@0.3 | IoU@0.5 | IoU@0.7 | mIoU | Average Score |
| RL w/o Decoupled Reward | 31.5 | 19.9 | 9.1 | 21.2 | 20.4 | ||
| RL w/ Recall Reward | 32.0 | 20.4 | 9.6 | 21.6 | 20.9 | ||
| RL w/IoU Reward | 41.0 | 25.8 | 11.7 | 27.2 | 26.4 | ||
Table 3. Comprehensive Ablation Studies on data recipes, training strategies, and the decoupled temporal grounding reward.
The following are the data contamination study results for Qwen-VL series, as presented in Table 4 of the original paper:
| Setting | VideoMME [9] (w/o subtitle) | VideoMMMU [13] adaptation | VideoMMMU comprehension | VideoMMMU perception | VideoSIAH-Eval (test) |
| Qwen2.5-VL-7B-Instruct [1] | |||||
| Original | 64.3 | 35.7 | 44.3 | 56.7 | 33.8 |
| No Visual | 40.1 | 25.7 | 38.3 | 39.3 | 12.7 |
| Rearranged Choices | 56.0 | 29.7 | 40.3 | 67.0 | - |
| Qwen3-VL-8B-Instruct [43] | |||||
| Original | 69.3 | 40.7 | 60.3 | 71.3 | 46.6 |
| No Visual | 44.1 | 33.7 | 39.3 | 46.7 | 0.00 |
| Rearranged Choices | 69.0 | 36.3 | 47.7 | 69.3 | - |
Table 4. Data Contamination Study. For the MCQ benchmarks, the Rearranged Choices rows report accuracy when the answer-to-option mapping is randomized. For the open-ended VideoSIAH-Eval, Rearranged Choices is not applicable.
The following are the inference latency comparison across various long video understanding and reasoning benchmarks, as presented in Table 7 of the original paper:
| Model | VideoMMMU [13] | LVBench [46] | VideoMME [9] | VideoSIAH-Eval | Average |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B [1] | 2108.6 | 2014.7 | 3031.6 | 1834.3 | 2247.3 |
| Video-R1-7B [8] | 1341.8 | 1550.6 | 2483.3 | 1900.3 | 1819.0 |
| VideoRFT-7B [44] | 1937.9 | 2154.3 | 3544.2 | 2052.6 | 2422.3 |
| Video-Thinker-7B [45] | 3153.8 | 3834.9 | 2475.1 | 1899.2 | 2840.8 |
| LongVT-7B-RFT (Ours) | 1329.8 | 1509.3 | 2754.0 | 1891.1 | 1871.1 |
Table 7. Inference Latency (seconds) on Various Long Video Understanding and Reasoning Benchmarks. All inference was run on 8 NVIDIA A800-SXM4-80GB GPUs.
6.3. Ablation Studies / Parameter Analysis
The paper conducts extensive ablation studies to understand the contribution of different components and design choices.
6.3.1. Fine-Grained Reasoning Data Matters
- Impact of Self-Curated iMCoTT (SFT stage): As shown in Table 3 (Data Recipe section), removing the self-curated iMCoTT data during SFT (SFT w/o self-curated iMCoTT vs. SFT w/ self-curated iMCoTT) leads to a drastic drop in average score (from 44.1 to 24.8), and particularly on VideoSIAH-Eval (from 34.8 to 4.1). This indicates that the VideoSIAH data, specifically designed for tool-integrated reasoning, is critical for shaping the model's ability to handle long-form videos.
- Impact of Self-Curated QAs (RL stage): Similarly, in the RL stage, removing the self-curated QAs (RL w/o self-curated QAs vs. RL w/ self-curated QAs) results in a significant performance drop on VideoSIAH-Eval (from 35.9 to 30.8) and a lower average score (from 46.6 to 40.4). This emphasizes that the quality and specificity of the RL training data are crucial for improving answer accuracy, temporal localization, and systematic tool use.
6.3.2. Recall Encourages Coverage; IoU Demands Precision
- Temporal Grounding Reward Choice: The paper ablates different temporal grounding reward functions in RL, specifically comparing Recall and IoU. As shown in Table 3 (Decoupled Temporal Grounding Reward section), RL w/ IoU Reward achieves an mIoU of 27.2 on Charades-STA [10], significantly outperforming RL w/o Decoupled Reward (21.2) and RL w/ Recall Reward (21.6).
- Hypothesis on Recall: The paper hypothesizes that Recall can be reward-hacked by simply enlarging the predicted temporal span to encompass the ground-truth interval, which monotonically raises Recall without necessarily improving boundary precision. IoU, by contrast, implicitly penalizes span inflation through its union term, leading to tighter timestamp proposals and more disciplined tool use (a minimal sketch of the two reward formulations follows Figure 3 below). This is further supported by Figure 3 (left panel), which shows that Recall accuracy plateaus, validating the reward-hacking behavior.

The following figure (Figure 3 from the original paper) shows the effects of time reward ablation and tool reward ablation:
Figure 3. Training Dynamics. (a) Time Reward Ablation: Evolution of Accuracy, IoU, and Recall metrics on Charades-STA for different reward functions. (b) Tool Reward Ablation: Effect of explicit tool-call reward on tool usage and accuracy during training.
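To make the reward-hacking argument concrete, here is a minimal, illustrative Python sketch (not the authors' released code) of the two reward formulations; the example spans and helper names are assumptions chosen purely for illustration:

```python
# Illustrative sketch: contrasting Recall- and IoU-based temporal grounding rewards
# for a single predicted clip against a ground-truth span.
def temporal_recall(pred_start, pred_end, gt_start, gt_end):
    """Fraction of the ground-truth span covered by the prediction."""
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    return inter / max(gt_end - gt_start, 1e-6)

def temporal_iou(pred_start, pred_end, gt_start, gt_end):
    """Intersection over union; the union term penalizes span inflation."""
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / max(union, 1e-6)

gt = (344.0, 372.0)        # hypothetical ground-truth event span (seconds)
tight = (340.0, 375.0)     # reasonably tight proposal
inflated = (0.0, 3600.0)   # "reward hacking": cover the whole hour

print(temporal_recall(*tight, *gt), temporal_iou(*tight, *gt))        # 1.00, 0.80
print(temporal_recall(*inflated, *gt), temporal_iou(*inflated, *gt))  # 1.00, ~0.008
```

The inflated span maximizes Recall while collapsing IoU, which is exactly the behavior the decoupled IoU reward is meant to discourage.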
6.3.3. Is Tool Reward Really Necessary?
- SFT's Role: As seen in Figure 3 (right panel), the baseline Qwen2.5-VL-7B collapses to near-zero tool calls without SFT. After cold-start SFT (LongVT-7B-SFT), tool-call frequency substantially increases and continues to rise during RL, indicating that SFT is essential for establishing basic tool-calling competence.
- Tool Reward's Limited Benefit: Surprisingly, explicitly adding a tool reward (a binary bonus for tool invocation) brings little benefit. In later RL stages, the configuration without the tool reward even shows slightly higher tool-use frequency, and accuracy remains largely unchanged. This suggests that once SFT grounds the tool's semantics, the model learns when to invoke it based on the overall answer accuracy and temporal grounding rewards, without needing an additional tool-invocation bonus, which might even suppress exploration. The final recipe thus discards the tool reward (a composite-reward sketch follows this list).
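Assuming the composite structure described above (answer correctness plus temporal grounding, with an optional tool-call bonus), a hedged sketch of how the ablated configurations differ might look as follows; the weights and function signature are illustrative assumptions, not the paper's exact formulation:

```python
# Illustrative sketch of a composite rollout reward: answer correctness plus temporal IoU,
# with an optional binary tool-call bonus that the final recipe drops.
def rollout_reward(answer_correct: bool, temporal_iou_score: float,
                   made_tool_call: bool, use_tool_bonus: bool = False,
                   w_ans: float = 1.0, w_time: float = 1.0, w_tool: float = 0.1) -> float:
    reward = w_ans * float(answer_correct) + w_time * temporal_iou_score
    if use_tool_bonus:
        # Ablated configuration: adds little benefit and may suppress exploration.
        reward += w_tool * float(made_tool_call)
    return reward
```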
6.3.4. SFT Builds Competence; RL Optimizes Decisions; RFT Stabilizes Behaviors
- Importance of SFT: Table 3 (Training Stage section) shows that RL only (without SFT) yields the lowest scores across all benchmarks (average 41.9), significantly worse than SFT only (LongVT-7B-SFT, average 44.1). This confirms that SFT is indispensable for teaching the model the basic tool-use paradigm: selecting temporal windows, inspecting content, and incorporating evidence. Without SFT, the model exhibits poor tool-use ability and behavioral inconsistencies, becoming confused by tool outputs rather than integrating them as evidence.
- RL for Generalization: SFT+RL (LongVT-7B-RL) significantly improves over SFT only (LongVT-7B-SFT), raising the average score from 44.1 to 46.6. This demonstrates RL's role in optimizing the model's decision-making (when to inspect, how long to crop, how to integrate evidence) and enhancing its generalization ability to held-out videos and unseen question templates.
- RFT for Stabilization: SFT+RL+RFT (LongVT-7B-RFT) achieves the highest average score of 47.7, further improving upon SFT+RL (LongVT-7B-RL, 46.6). Notably, on VideoSIAH-Eval, RFT pushes the score from 35.9 to 42.0. This indicates that RFT, by distilling high-reward trajectories back into supervised data, effectively stabilizes agentic behaviors and consolidates long-horizon reasoning and temporal grounding, realizing the full benefits of temporal-grounding feedback (a minimal distillation sketch follows this list).
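The RFT step described above (distilling high-reward trajectories back into supervised data) can be sketched as a simple filter-and-convert pass; the data fields, threshold, and function name below are assumptions for illustration rather than the released pipeline:

```python
# Minimal sketch of the RFT idea: keep only high-reward rollouts and turn them
# back into supervised fine-tuning examples for another training pass.
def build_rft_dataset(rollouts, reward_fn, threshold=0.8):
    """rollouts: list of dicts with 'prompt', 'trajectory' (interleaved thoughts and
    tool calls), and whatever fields reward_fn needs. Returns SFT-style examples."""
    sft_examples = []
    for rollout in rollouts:
        if reward_fn(rollout) >= threshold:          # filter by trajectory-level reward
            sft_examples.append({
                "prompt": rollout["prompt"],
                "target": rollout["trajectory"],      # distill the full iMCoTT trace as supervision
            })
    return sft_examples
```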
6.3.5. Reflection Trajectory: From Verbose Self-Correction to Internalized Tool Usage
The paper analyzes the evolution of the model's internal thought process by tracking the proportion of reflection tokens. The following figure (Figure 7 from the original paper) visualizes this trend:
Figure 7. Trend of Reflection-Related Words and the Corresponding Word Cloud across All Rollouts.
There are three distinct phases:
- Verbose Self-Correction (Steps 0-50): Initially, reflection density is high. The model generates extensive verbal self-correction and iterative reasoning to compensate for poor localization accuracy and sub-optimal tool usage.
- Efficiency Optimization (Steps 50-80): As the policy matures and intrinsic grounding capability improves, reflection density drops significantly. The model autonomously prunes unnecessary linguistic fillers, learning that prolonged reflection is redundant, thereby maximizing reward efficiency.
- Internalized Proficiency (After 80 Steps): The reflection curve stabilizes at a concise baseline. This indicates a shift towards selective reasoning, where explicit reflection is invoked only when ambiguity needs to be resolved; the core semantics of tool interaction have been internalized. The word cloud (right panel of Figure 7) confirms that the remaining reflection tokens are semantically grounded (e.g., "segment," "confirm"), serving functional roles for temporal reasoning rather than acting as generic linguistic fillers. A minimal counting sketch follows this list.
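As a rough illustration of how such a reflection-density curve could be computed, the following sketch counts reflection-related words per rollout; the keyword list and normalization are assumptions, not the authors' exact procedure:

```python
# Illustrative sketch: estimate reflection density as the fraction of
# reflection-related words among all tokens in a rollout.
REFLECTION_WORDS = {"wait", "recheck", "re-check", "confirm", "verify",
                    "however", "actually", "mistake", "segment", "instead"}

def reflection_density(rollout_text: str) -> float:
    tokens = rollout_text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(".,!?") in REFLECTION_WORDS)
    return hits / len(tokens)

# Averaging this quantity over all rollouts at each RL step yields the kind of
# trend curve visualized in Figure 7.
```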
6.3.6. What Motivates VideoSIAH? Unveiling the Data Contamination in Qwen-VL Series
The paper conducts a data contamination study on Qwen-VL series models (Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct) to demonstrate the necessity of VideoSIAH-Eval.
- "No Visual" Performance: As shown in Table 4, both Qwen2.5-VL and Qwen3-VL achieve surprisingly high scores on VideoMME and VideoMMMU even without any video frames (e.g., Qwen2.5-VL scores 40.1 on VideoMME without subtitles, far exceeding random guessing for 4-option MCQs). This strongly indicates severe leakage and potential memorization of textual information or correlations in these benchmarks. In stark contrast, Qwen3-VL's score on VideoSIAH-Eval drops to 0.00 in the "No Visual" setting, confirming that VideoSIAH-Eval is clean and non-contaminated, forcing models to rely on visual grounding.
- "Rearranged Choices" Reveals Overfitting: For MCQ-based benchmarks, Qwen2.5-VL's performance drops significantly when answer choices are rearranged (e.g., from 64.3 to 56.0 on VideoMME). This suggests models might memorize specific option mappings rather than genuinely understanding the content. VideoSIAH-Eval uses an open-ended QA format, making it immune to this option bias.

These findings underscore that VideoSIAH-Eval provides a more robust and reliable assessment of genuine long-video reasoning capabilities (a sketch of the two probes follows below).
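For concreteness, the two contamination probes could be implemented roughly as follows; the dictionary fields (`frames`, `options`, `answer_index`) are assumed data-format conventions, not the benchmark's actual schema:

```python
import random

# Illustrative sketch of the "No Visual" and "Rearranged Choices" probes.
def no_visual_probe(question: dict) -> dict:
    """Drop all frames so the model must answer from text alone; a contamination-free
    benchmark should then collapse toward chance performance."""
    probe = dict(question)
    probe["frames"] = []
    return probe

def rearranged_choices_probe(question: dict, seed: int = 0) -> dict:
    """Shuffle the MCQ option order and remap the gold index accordingly; a model that
    memorized option mappings rather than content will lose accuracy."""
    probe = dict(question)
    options = list(question["options"])          # answer texts without option letters
    gold_text = options[question["answer_index"]]
    random.Random(seed).shuffle(options)
    probe["options"] = options
    probe["answer_index"] = options.index(gold_text)
    return probe
```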
6.4. Inference Efficiency Analysis
The paper investigates inference latency to address the concern that multi-turn agentic frameworks might be inherently slower.
- Counter-Intuitive Efficiency: As shown in Table 7, LongVT-7B-RFT demonstrates competitive, and in some cases superior, inference efficiency. It achieves the lowest latency on VideoMMMU (1329.8 seconds) and LVBench (1509.3 seconds) among the compared models, while remaining competitive on VideoMME and VideoSIAH-Eval.
- Reason for Efficiency: This efficiency, despite multi-turn tool interactions, is attributed to the precision of LongVT's reasoning. Unlike baselines that might hallucinate or generate verbose, uncertainty-driven descriptions by "blindly rephrasing" uncertain visual memories, LongVT proactively seeks and grounds its answers in retrieved frames. This evidence-based approach avoids lengthy, potentially incorrect textual generation, leading to more concise outputs and faster token generation overall.
- Human-like Viewing: The efficiency aligns with human-like viewing habits, where one does not watch an entire video frame by frame but strategically samples and encodes relevant segments. LongVT's ability to crop_video and focus on relevant parts helps avoid the prohibitive computational cost and context overflow of encoding extremely long sequences in their entirety.
6.5. Qualitative Examples
The paper provides several qualitative examples to illustrate LongVT's reasoning process and self-correction capabilities.
- Self-Correction in Single-Turn (Figure 11): This example shows a single-turn case where the model initially identifies the basin color as pink but then uses internal monologue to re-check the visual evidence, realizing a hallucination. It then successfully self-corrects and outputs the correct answer (Blue). This highlights the model's ability to reflect on and revise its hypothesis based on visual input.

The following figure (Figure 11 from the original paper) illustrates a self-correction case:

Figure 11. Self-Correction in Single-Turn. The model initially hallucinates the basin color as pink, but then re-inspects the visual evidence, self-corrects, and outputs the correct answer (Blue).

- Multi-Turn Refinement (Figure 12): This example demonstrates multi-turn tool interactions where the model iteratively refines its temporal window. An initial crop_video call might miss the target event, leading the model to adjust the start and end times for a subsequent tool call until the relevant visual evidence (e.g., a US flag) is successfully identified.

The following figure (Figure 12 from the original paper) shows a multi-turn refinement example:

Figure 12. Multi-Turn Refinement. The model's initial tool call (80s-100s) misses the US flag. Through self-correction, it refines the parameters and calls the tool again with the correct window (344s-372s) to successfully identify the US flag.

- Comparison with Textual CoT (Figure 13): LongVT is compared against a standard textual CoT baseline for identifying the colors of sports cars in a Honey promotion scene.
  - The textual CoT baseline hallucinates unseen visual details (e.g., incorrect object appearance or colors), demonstrating its vulnerability without active visual verification.
  - LongVT (using iMCoTT) follows an active verify-and-correct procedure. It calls crop_video around a hypothesized time, detects that the retrieved segment lacks the queried objects (luxury sports cars), adjusts the crop region based on its reasoning, and then successfully locates the correct evidence to produce the accurate answer ("One is white and the other is yellow"). This showcases LongVT's superior grounding and self-correction (see the minimal loop sketch after Figure 13).

The following figure (Figure 13 from the original paper) compares Thinking with Textual CoT vs. Thinking with iMCoTT (Ours). (Image description, translated from the original: a schematic of the LongVT framework's interleaved Multimodal Chain-of-Tool-Thought workflow, illustrating the global-to-local reasoning process in which video clips are cropped and refined step by step until usable visual evidence is obtained.)

Figure 13. Thinking with Textual CoT vs. Thinking with iMCoTT (Ours). The top panel shows that a standard Textual CoT baseline hallucinates unseen visual details and outputs an incorrect answer ("White and Yellow"). The bottom panel demonstrates LongVT's iMCoTT workflow: it initially calls crop_video at 90s-120s, realizes the mislocalization, self-corrects, and then makes another tool call at 174s-190s to successfully identify the correct luxury sports car images and provide the accurate answer.
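The verify-and-correct behavior in Figures 12-13 can be summarized as a simple global-to-local loop. The sketch below is a hypothetical rendering of that loop: `sample_frames`, the `lmm.generate` interface, and the tool-call fields are assumed names, not the released API.

```python
# Hypothetical sketch of the global-to-local iMCoTT loop: reason over sparsely sampled
# frames, optionally call a crop_video-style tool to resample a clip densely, feed the
# resampled frames back as evidence, and repeat until an answer is committed.
def imcott_loop(lmm, video, question, max_turns=4, sparse_fps=0.1, dense_fps=2.0):
    context = [{"role": "user",
                "frames": sample_frames(video, fps=sparse_fps),   # global skim
                "text": question}]
    for _ in range(max_turns):
        step = lmm.generate(context)           # interleaved thought + optional tool call
        if step.tool_call is None:             # model is confident: stop and answer
            return step.answer
        start, end = step.tool_call["start"], step.tool_call["end"]
        clip_frames = sample_frames(video, fps=dense_fps, start=start, end=end)
        context.append({"role": "assistant", "text": step.thought,
                        "tool_call": step.tool_call})
        context.append({"role": "tool", "frames": clip_frames})   # local evidence fed back
    return lmm.generate(context).answer        # fall back to a final answer
```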
6.6. Failure Case Analysis
The paper presents a representative failure case to emphasize the importance of the cold-start SFT stage. The following figure (Figure 14 from the original paper) illustrates this:
Figure 14. RL-Only Failure Case. The model correctly recognizes the need to inspect the glass coffee table via a tool call. However, after receiving resampled frames, it fails to integrate the evidence to answer the specific question ("which video-game device"). Instead, it reverts to generic video captioning, restating superficial scene descriptions.
In this example, an RL-only variant (without cold-start SFT) correctly invokes the crop_video tool to inspect a glass coffee table for video-game devices. However, after receiving the resampled frames, the model fails to perform the specific reasoning required by the question. Instead of identifying the video-game device, it becomes confused by the context shift and reverts to generic video captioning, merely restating superficial scene descriptions. This behavior underscores that cold-start SFT is essential for teaching the model the intended semantics of tool usage and how to effectively integrate tool outputs into its reasoning process, preventing such behavioral inconsistencies.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces LongVT, an innovative end-to-end agentic framework that empowers LMMs to reliably reason over long-form videos. By adopting a human-inspired global-to-local reasoning strategy, LongVT implements interleaved Multimodal Chain-of-Tool-Thought (iMCoTT) where LMMs actively use a native video cropping tool to inspect specific temporal segments and gather finer-grained visual evidence. This approach transforms long-video understanding from passive frame consumption to active, evidence-seeking reasoning with self-correction capabilities.
A key contribution is VideoSIAH, a newly curated, large-scale, fine-grained data suite and evaluation benchmark, specifically designed to address evidence-sparse long-video reasoning tasks and overcome data contamination issues found in existing benchmarks.
The effectiveness of LongVT is attributed to its meticulously designed three-stage training pipeline: cold-start Supervised Fine-Tuning (SFT) for foundational capabilities, Agentic Reinforcement Learning (RL) with a novel joint answer-temporal grounding reward for optimizing decisions, and Agentic Reinforcement Fine-Tuning (RFT) for stabilizing learned behaviors. Through extensive empirical validation, LongVT consistently outperforms strong baselines across four challenging benchmarks, significantly narrowing the performance gap between open-source and proprietary LMMs in long-video understanding.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation:
- Memory Footprint and Context Window: While the multi-turn tool interactions do not significantly increase inference latency, the memory footprint of recursive reasoning remains a bottleneck. As the number of interaction turns increases (especially for ultra-long or infinite video streams), the accumulation of history tokens (including dense visual features returned by tools) can rapidly exhaust the context budget of the underlying LMM. This poses a risk of Out-of-Memory (OOM) errors during training and performance degradation due to truncation.

To address this, the authors suggest a promising future direction:

- Multi-Agent Collaboration: Inspired by advancements in multi-agent reinforcement learning (e.g., MATPO [31]), they envision a hierarchical framework in which a "Manager Agent" orchestrates high-level planning and dispatches sub-tasks to specialized "Worker Agents." Each Worker Agent could be responsible for inspecting distinct temporal segments or executing specific tool calls. Workers would then summarize their observations into concise natural language updates for the Manager Agent, effectively decoupling context management from reasoning. This scalable, divide-and-conquer architecture could theoretically support infinite-horizon reasoning loops without context overflow. A purely illustrative sketch of this envisioned scheme follows below.
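As a purely illustrative sketch of this envisioned (and explicitly not yet implemented) manager/worker scheme, under assumed agent interfaces (`worker_lmm.describe`, `manager_lmm.plan_and_answer`, `sample_frames`):

```python
from concurrent.futures import ThreadPoolExecutor

# Workers inspect disjoint temporal segments and return short text summaries, so the
# manager reasons over summaries instead of accumulating dense visual tokens.
def worker_inspect(worker_lmm, video, segment, question):
    frames = sample_frames(video, fps=2.0, start=segment[0], end=segment[1])  # assumed helper
    return worker_lmm.describe(frames, question)        # concise natural-language summary

def manager_answer(manager_lmm, worker_lmm, video, question, segments):
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(
            lambda seg: worker_inspect(worker_lmm, video, seg, question), segments))
    # The manager only ever sees short summaries, decoupling context size from video length.
    return manager_lmm.plan_and_answer(question, summaries)
```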
7.3. Personal Insights & Critique
This paper presents a compelling and well-executed approach to a critical problem in multimodal AI: enabling reliable reasoning over long-form videos. The human-inspired global-to-local reasoning paradigm, coupled with native tool calling, is a highly intuitive and effective way to tackle the evidence-sparse nature of long-video content.
Inspirations drawn from this paper:
- Native Tool Calling as a Game Changer: The idea of treating an LMM's temporal grounding ability as a native tool (crop_video()) is quite innovative. It integrates visual processing seamlessly into the agentic reasoning loop, rather than relying on external, separate modules. This "native" integration is likely a key factor in its success.
- Importance of Data Curation: The meticulous design and curation of the VideoSIAH dataset, especially its focus on segment-in-a-haystack scenarios and its rigorous human-in-the-loop validation and data contamination study, highlight the critical role of high-quality, task-specific data in pushing the boundaries of LMM capabilities. This emphasizes that model architecture and training strategies are only as good as the data they learn from.
- Three-Stage Training Strategy: The cold-start SFT, agentic RL, and agentic RFT pipeline is a robust and well-justified approach. The ablation studies clearly demonstrate the indispensable role of each stage, particularly cold-start SFT for building initial competence and RFT for stabilizing agentic behaviors. This comprehensive training recipe provides valuable insights for developing complex agentic LMMs.
- Efficiency of Active Reasoning: The finding that LongVT's multi-turn agentic reasoning can be as efficient as, or more efficient than, single-turn baselines is counter-intuitive and significant. It suggests that actively seeking and grounding evidence leads to more concise and less hallucinated outputs, ultimately saving computational resources by avoiding verbose, uncertain generation.
Potential issues, unverified assumptions, or areas for improvement:
- Dependence on LLM-as-a-Judge: While LLM-as-a-Judge is a powerful evaluation tool for open-ended tasks, its reliability is still subject to the capabilities and potential biases of the underlying LLM used as the judge. Its "ground truth" is a derived consistency score, not absolute human judgment, and the consistency scores could vary with different judge models or prompting strategies.
- Scalability of the crop_video Tool: While the paper addresses context-window limits with multi-agent future work, the crop_video tool itself might become a bottleneck for extremely fine-grained temporal grounding or very rapid event changes. The resampling frequency and the efficiency of the underlying vision encoder on dense frames could become limiting factors for real-time applications or extremely long videos.
- Complexity of iMCoTT Debugging: Although iMCoTT enhances transparency by making reasoning steps explicit, debugging failures in a multi-turn, tool-augmented RL system can still be complex. Pinpointing whether a failure stems from poor temporal grounding, incorrect tool invocation, or faulty multimodal reasoning within an RL loop can be challenging.
- Prompt Sensitivity: As with most LLMs, the performance of the agentic reasoning might be sensitive to the exact prompt templates used for the initial query and for tool invocation and response generation. The paper provides templates, but real-world deployment might require careful prompt engineering.
Transferability to other domains:
The core principles of LongVT—global-to-local reasoning, native tool calling for dynamic information retrieval, and iterative self-correction—are highly transferable.
- Image Understanding: Applying this to extremely high-resolution or panoramic images, where LMMs could use a "zoom" tool to inspect specific regions for details.
- Audio/Speech Processing: An LMM could use tools to "focus" on specific temporal segments of an audio track, perhaps to isolate a speaker or an event, and then analyze the finer-grained audio features.
- Document Analysis: For long documents or scientific papers, an LMM could "skim" abstracts and headings, then use a "section-read" or "figure-inspect" tool to zoom in on relevant paragraphs or figures for detailed understanding.
- Robotics/Embodied AI: An agentic model controlling a robot could use "sensor-focus" tools (e.g., zoom camera, listen closely) to gather more precise information from its environment before making a decision, mirroring the hypothesis-verification loop.

Overall, LongVT represents a significant step towards more intelligent and reliable multimodal agents that can actively engage with complex, long-form data.