Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
TL;DR Summary
Ego-R1 presents an RL-trained agent utilizing a Chain-of-Tool-Thought (CoTT) process to address ultra-long egocentric video reasoning challenges. By decomposing complex tasks and dynamically invoking tools, it effectively extends video understanding capabilities from hours to weeks.
Abstract
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from few hours to a week.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
- Authors: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu.
- Affiliations: The authors are affiliated with S-Lab at Nanyang Technological University, A*STAR in Singapore, Simon Fraser University, and Shanghai AI Lab. This represents a collaboration between academic and research institutions.
- Journal/Conference: The paper is currently a preprint available on arXiv, an open-access archive for scholarly articles. This means it has not yet undergone formal peer review for publication in a journal or conference.
- Publication Year: 2025. The arXiv identifier (2506.13654v1) corresponds to a June 2025 submission.
- Abstract: The paper introduces Ego-R1, a framework designed for reasoning over ultra-long egocentric videos that span days or weeks. The core of Ego-R1 is a "Chain-of-Tool-Thought" (CoTT) process, orchestrated by an agent trained with reinforcement learning (RL). This process breaks down complex reasoning tasks into smaller, manageable steps, where the agent calls specific tools to answer sub-questions related to temporal retrieval and multimodal understanding. To train this agent, the authors developed a two-stage paradigm: Supervised Finetuning (SFT) on a custom CoTT dataset, followed by RL. They created the `Ego-R1 Data` dataset (comprising `Ego-CoTT-25K` for SFT and `Ego-QA-4.4K` for RL) and a new evaluation benchmark, `Ego-R1 Bench`, featuring week-long videos. The results show that Ego-R1 significantly improves the ability to understand and reason over videos extending up to a week long.
- Original Source Link: https://arxiv.org/abs/2506.13654v1 (Preprint)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Standard video understanding models struggle with ultra-long egocentric videos (spanning days or weeks). These videos are computationally expensive to process, and existing methods either lose crucial information through compression/sampling or are limited by rigid, predefined reasoning steps.
- Importance: Egocentric videos offer a rich, first-person perspective on daily life, crucial for applications like personal memory assistance, activity tracking, and life-logging. Answering questions like "When was the last time I saw my keys?" requires reasoning over vast, sparse, and temporally distant events.
- Innovation: The paper proposes a shift from monolithic models that process entire videos to a more flexible, agentic framework. This "Ego-R1 Agent" mimics human problem-solving by dynamically choosing and using a set of specialized tools in a step-by-step manner, a process the authors term Chain-of-Tool-Thought (CoTT).
- Main Contributions / Findings (What):
- Ego-R1 Framework: A novel agent-based system that uses a Large Language Model (LLM) trained via Reinforcement Learning to reason over ultra-long videos. It dynamically calls specialized tools to retrieve and analyze information.
- Chain-of-Tool-Thought (CoTT): A structured reasoning paradigm where complex questions are decomposed into a chain of "thought -> tool call -> observation" steps, making the reasoning process interpretable and modular.
- Specialized Toolkit: A set of three complementary tools designed for this task: a hierarchical text-based retrieval system (`h-rag`), a short-term video analysis model (`video-llm`), and a fine-grained image analysis model (`vlm`).
- New Datasets:
  - Ego-R1 Data: A large-scale dataset for training the agent, consisting of `Ego-CoTT-25K` (25,000 reasoning traces) for SFT and `Ego-QA-4.4K` (4,400 question-answer pairs) for RL.
  - Ego-R1 Bench: A new benchmark for evaluating long-horizon reasoning, containing human-verified QA pairs from week-long egocentric videos.
- Superior Performance: The Ego-R1 Agent significantly outperforms existing state-of-the-art models on the challenging `Ego-R1 Bench`, extending effective video understanding from hours to a full week.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Egocentric Video: A video recorded from a first-person perspective, typically using a wearable camera. These videos are characterized by their long duration, continuous nature, and rich contextual information about the wearer's life, habits, and interactions.
- Large Language Models (LLMs) & Multimodal LLMs (MLLMs): LLMs (e.g., GPT-4) are AI models trained on vast amounts of text data, capable of complex reasoning and text generation. MLLMs (e.g., LLaVA) extend this capability to understand and reason about both text and visual inputs like images and videos.
- Chain-of-Thought (CoT) Reasoning: A prompting technique where an LLM is asked to "think step by step" before giving a final answer. This breaks down a problem and often leads to more accurate and logical reasoning. Ego-R1 extends this to Chain-of-Tool-Thought (CoTT), where each step involves not just thinking but also using an external tool.
- Agentic Tool-Use: A paradigm where an LLM acts as a central "agent" or "orchestrator" that can call external functions, APIs, or other models (the "tools") to gather information or perform actions that it cannot do on its own (e.g., searching the web, analyzing a video clip).
- Reinforcement Learning (RL): A machine learning training method where an agent learns to make a sequence of decisions in an environment to maximize a cumulative reward. In this paper, the agent learns the optimal sequence of tool calls to correctly answer a question.
- Retrieval-Augmented Generation (RAG): A technique that enhances an LLM's knowledge by first retrieving relevant documents or information from a large database (a knowledge base) and then using that information as context to generate a more accurate and grounded response.
- Previous Works & Differentiation:
- Monolithic MLLMs (`LongVA`, `LLaVA-Video`): These models try to process video inputs directly. For very long videos, this becomes computationally prohibitive due to the massive number of tokens required to represent all the frames.
- Sampling/Compression Methods: To handle long videos, some methods uniformly sample frames or compress video information. The major drawback is the high risk of missing sparse but critical events relevant to a query.
- Early Video Agents (`VideoAgent`, `T*`): These systems also use LLMs to call vision tools. However, they typically rely on fixed, predefined reasoning pipelines or simple tool sequences. They lack the ability to dynamically and iteratively decide which tool to use next based on the ongoing reasoning context.
- Ego-R1's Differentiation: Ego-R1 stands out due to its dynamic, multi-step tool-calling mechanism, orchestrated by an RL-trained agent. Instead of a rigid plan, the agent adapts its strategy on the fly, making it suitable for the unpredictable and sparse nature of events in week-long videos. The hierarchical RAG system is another key innovation that allows for efficient searching across different time scales (days, hours, minutes).
4. Methodology (Core Technology & Implementation)
The core of Ego-R1 is an agent that intelligently uses a set of tools to answer questions about ultra-long videos. This process is structured as a Chain-of-Tool-Thought (CoTT).
4.1. The Chain-of-Tool-Thought (CoTT) Framework
CoTT formalizes the reasoning process as a trajectory of steps. A complete CoTT trajectory is a sequence of steps $\mathcal{T} = \{s_1, s_2, \dots, s_n\}$, where each step $s_i = (t_i, c_i, o_i)$ consists of three parts:
- $t_i$: Thought. The agent's internal reasoning in natural language (e.g., "I need to find out when the user was at the supermarket. I should search the daily logs.").
- $c_i$: Tool Call. The specific tool the agent decides to invoke, along with the necessary arguments (e.g., `h_rag(level='day', keywords=['supermarket'])`).
- $o_i$: Observation. The output returned by the executed tool (e.g., "Day 3 log: visited supermarket at 14:30.").
This loop repeats, with each new observation feeding into the agent's next thought, until it has gathered enough information to provide a final answer. Image 1 provides a clear visual example of this process.
This figure is a schematic showing how the Ego-R1 model uses the Chain-of-Tool-Thought process to reason over an ultra-long (multi-day) egocentric video. Day-by-day video segments and a toolbox are depicted; the query "Who was in front of me at the supermarket escalator?" is processed step by step and at multiple levels through different tool modules (Hierarchical_RAG, Video_LLM, VLM), interleaving machine thinking (think) with tool calls (tool), ultimately arriving at the answer "Tasha" and illustrating how the model decomposes a complex reasoning task.
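To make the loop concrete, the following is a minimal sketch of the thought → tool call → observation cycle described above. It is illustrative only: `agent_step` and `execute_tool` are hypothetical callables standing in for the trained LLM policy and the tool backends, not the authors' implementation.

```python
# Minimal sketch of a Chain-of-Tool-Thought (CoTT) loop; `agent_step` and
# `execute_tool` are hypothetical stand-ins for the trained policy and tools.
from dataclasses import dataclass

@dataclass
class CoTTStep:
    thought: str            # natural-language reasoning
    tool_call: dict | None  # e.g. {"name": "h_rag", "args": {...}}; None => final answer
    observation: str = ""

def run_cott(question, agent_step, execute_tool, max_steps=10):
    """Iterate thought -> tool call -> observation until the agent answers."""
    history = []
    for _ in range(max_steps):
        thought, tool_call, answer = agent_step(question, history)
        if tool_call is None:        # the agent decided it has enough evidence
            return answer
        observation = execute_tool(tool_call["name"], **tool_call["args"])
        history.append(CoTTStep(thought, tool_call, observation))
    return "Unable to answer within the step budget."
```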
4.2. The Specialized Toolkit
Ego-R1 is equipped with three complementary tools to handle different aspects of video understanding:
- Hierarchical RAG (`h-rag`): A text-based tool for efficient temporal retrieval.
  - Principle: Instead of searching the raw video, it searches through textual summaries organized in a time-based hierarchy.
  - Structure: As shown in Image 4, raw videos are first broken into 30-second clips, which are summarized. The summaries are then aggregated into 10-minute summaries, then hourly summaries, and finally daily summaries. This creates a multi-granularity "memory bank."
  - Procedure: The agent performs a top-down search. For a query, it might first search the daily summaries to identify the relevant day, then drill down to the relevant hour, and so on. This is far more efficient than scanning the entire video.

  This figure is a schematic showing how the Ego-R1 framework performs multi-granularity temporal retrieval and memory construction over ultra-long videos (up to 7 days, 44.3 hours). From bottom to top it covers four time scales: day (DAY), hour, 10-minute, and 30-second segments, with keyword memory banks generated at each level and used for clip retrieval and understanding, illustrating the progressive aggregation and lookup of information across time scales.

- Video Language Model (`video-llm`): A visual tool for short-horizon video understanding.
  - Function: Once `h-rag` has localized a potentially relevant time window (e.g., "Day 3, 14:30-14:35"), the `video-llm` is called to analyze the actual video clip from that segment.
  - Capability: It can understand dynamic actions, interactions, and sequential events within a clip of up to ten minutes.

- Vision Language Model (`vlm`): A visual tool for fine-grained frame analysis.
  - Function: If the agent needs to identify a very specific detail within a single moment (e.g., "What is written on the cereal box?"), it can call the `vlm` on a specific frame identified by a timestamp.
  - Capability: It provides high-resolution visual details that might be missed by the `video-llm`.
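The three tools can be thought of as a small, typed interface the agent calls. The sketch below shows one plausible set of signatures; the names follow the paper's toolkit, but the argument names and return types are assumptions for illustration, not the paper's exact API.

```python
# Hypothetical tool interface for the Ego-R1 toolkit; tool names follow the
# paper, argument names and return types are illustrative assumptions.
from typing import Literal

def h_rag(level: Literal["day", "hour", "10min", "30s"],
          keywords: list[str]) -> str:
    """Search the text memory bank at the given temporal granularity and
    return matching log entries (e.g. 'DAY3 14:30 visited supermarket')."""
    ...

def video_llm(question: str, start_time: str, end_time: str) -> str:
    """Answer a question about the video clip between two timestamps
    (intended for windows of up to ~10 minutes)."""
    ...

def vlm(question: str, timestamp: str) -> str:
    """Answer a fine-grained question about the single frame at `timestamp`."""
    ...

# Typical top-down usage: localize with h_rag, then inspect visually.
# day_hit  = h_rag(level="day", keywords=["supermarket"])
# clip_ans = video_llm("Who is in front of the escalator?", "DAY3_14:30", "DAY3_14:35")
```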
4.3. Data Generation and Agent Training
Training the Ego-R1 agent requires specialized data that demonstrates how to use these tools effectively.
- Ego-R1 Data: The authors created a comprehensive dataset for this purpose, as illustrated in Image 2.
  - `Ego-QA-4.4K`: A set of 4,400 question-answer pairs derived from over 500 hours of egocentric video. This data combines AI-generated questions (verified by humans) and human-annotated questions.
  - `Ego-CoTT-25K`: For 2,900 high-quality QA pairs, the authors used a powerful proprietary LLM to automatically generate 25,000 CoTT reasoning traces. Each trace is a complete, step-by-step example of how to solve the question using the toolkit.

  This figure is a schematic of the raw QA data collection and Chain-of-Tool-Thought (CoTT) generation pipeline in Ego-R1. The left side depicts multiple-choice QA generation and annotation over EgoLife video logs using the Gemini model with human verification; the right side shows the iterative CoTT generation process, a multi-step "think - call tool - observe" loop whose output is verified before the final answer is emitted. Example questions, reasoning steps, and concrete tool calls are included, illustrating how the system handles compound reasoning over ultra-long videos.
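For intuition, a single CoTT training trace might be serialized roughly as follows. This schema is an illustrative assumption, not the released data format.

```python
# Illustrative (assumed) structure of one Ego-CoTT training trace.
example_trace = {
    "question": "Who was in front of me at the supermarket escalator?",
    "options": ["A. Tasha", "B. Alice", "C. Jake", "D. Nobody"],
    "steps": [
        {
            "thought": "Find when the user visited the supermarket; search the daily logs.",
            "tool": {"name": "h_rag", "args": {"level": "day", "keywords": ["supermarket"]}},
            "observation": "DAY3: visited supermarket around 14:30.",
        },
        {
            "thought": "Inspect the clip around 14:30 to see who is at the escalator.",
            "tool": {"name": "video_llm",
                     "args": {"question": "Who is in front of the camera wearer at the escalator?",
                              "start_time": "DAY3_14:30", "end_time": "DAY3_14:35"}},
            "observation": "A woman identified as Tasha stands in front of the camera wearer.",
        },
    ],
    "answer": "A",
}
```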
- Two-Stage Training Strategy: The agent is trained in two phases, shown in Image 3.
  - Stage 1: Supervised Fine-Tuning (SFT): A pretrained LLM (Qwen-2.5-3B-Instruct) is fine-tuned on the `Ego-CoTT-25K` dataset. This stage teaches the model the basic syntax of tool calls and the structure of CoTT reasoning. The resulting model is named `Ego-R1-SFT`.
  - Stage 2: Reinforcement Learning (RL): The `Ego-R1-SFT` model is further trained using Group Relative Policy Optimization (GRPO). In this stage, the model (the "policy") generates tool-calling trajectories ("rollouts") and receives a positive reward if its final answer is correct and zero otherwise. RL helps the model learn a more robust and effective strategy for choosing tools to maximize the final reward.
This figure is a flow diagram of the two-stage training pipeline of the Ego-R1 framework. Stage one is CoTT-based supervised fine-tuning (SFT), turning the pretrained model into Ego-R1-SFT, which processes input questions and calls external tools for multi-step reasoning. Stage two uses GRPO reinforcement learning to optimize the policy with reward feedback, yielding the final Ego-R1 agent capable of dynamic multi-step tool calling to answer complex questions. Arrows mark the data flow and the relationship between the training stages.
The GRPO objective function is defined as:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\!\left(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right) - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\right)\right)\right],
\qquad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}
$$

- Explanation: This formula is a variant of policy optimization. In essence, it updates the model's parameters $\theta$ to maximize the expected reward over a group of $G$ sampled rollouts $\{o_i\}$ for each question $q$.
  - $\pi_\theta$ is the agent's policy (the model being trained), and $r_{i,t}(\theta)$ is its token-level probability ratio against the old policy $\pi_{\theta_{\mathrm{old}}}$ that generated the rollouts.
  - $\hat{A}_{i,t}$ is the "advantage," which measures how much better a particular action (generating a token) is compared to the average action at that step; in GRPO it is obtained by normalizing each rollout's reward against the group's mean and standard deviation.
  - The $\min$ and $\mathrm{clip}$ functions are part of the PPO (Proximal Policy Optimization) family; they prevent the policy from changing too drastically in one update, ensuring stable training.
  - The term $\beta\,\mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ is a regularizer that penalizes the policy for diverging too much from the reference (SFT) model $\pi_{\mathrm{ref}}$, which helps maintain the reasoning capabilities learned during SFT.
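The following is a minimal sketch of the group-relative advantage and clipped loss at the heart of GRPO, under the simplifying assumptions of one scalar reward per rollout and precomputed token log-probabilities; it is illustrative, not the authors' training code.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """Simplified GRPO loss for one question.

    logp_new / logp_old / logp_ref: (G, T) per-token log-probs under the
    current, rollout-time, and frozen reference (SFT) policies.
    rewards: (G,) scalar reward per rollout (e.g. 1.0 if the answer is correct).
    """
    # Group-relative advantage: normalize rewards within the group of G rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # (G,)
    adv = adv.unsqueeze(1).expand_as(logp_new)                     # broadcast to tokens

    ratio = (logp_new - logp_old).exp()                            # importance ratio
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_term = torch.minimum(ratio * adv, clipped * adv)        # PPO-style clipping

    # Per-token KL estimate against the reference policy (k3 estimator).
    kl = (logp_ref - logp_new).exp() - (logp_ref - logp_new) - 1.0

    # Maximize the objective => minimize its negative.
    return -(policy_term - beta * kl).mean()
```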
5. Experimental Setup
- Datasets:
  - Egocentric:
    - `Ego-R1 Bench` (Proposed): The primary benchmark, with 300 QA pairs on videos averaging 44.3 hours. It specifically tests long-horizon reasoning.
    - `EgoSchema`: A benchmark with 3-minute video clips testing reasoning about human intent.
    - `EgoLifeQA`: A benchmark with videos averaging 44.3 hours. The authors used a cleaned subset to avoid overlap with their training data.
  - Exocentric (Third-Person View):
    - `VideoMME (long)`: A benchmark for general long-video understanding, with videos averaging 41 minutes. Used to test the model's generalization ability.
- Evaluation Metrics:
  - Accuracy (%):
    - Conceptual Definition: This metric measures the percentage of questions the model answers correctly. It is the primary metric for evaluating performance on the question-answering tasks.
    - Mathematical Formula:
      $$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \times 100\%$$
    - Symbol Explanation:
      - Number of Correct Predictions: The count of questions for which the model's final answer matches the ground-truth answer.
      - Total Number of Predictions: The total number of questions in the evaluation set.
  - Format Accuracy (%):
    - Conceptual Definition: Used in the ablation study, this metric measures the percentage of tool calls generated by the agent that are syntactically correct and can be successfully executed by the system. It evaluates the model's ability to adhere to the required tool-use format.
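As a quick sketch, both metrics reduce to simple ratios over the evaluation set; the `records` structure below is hypothetical, with one entry per question holding the prediction, the ground truth, and a flag for whether every tool call in the trace was valid.

```python
# Minimal sketch of the two metrics over hypothetical prediction records.
def accuracy(records):
    return 100.0 * sum(r["pred"] == r["gold"] for r in records) / len(records)

def format_accuracy(records):
    return 100.0 * sum(r["all_tool_calls_valid"] for r in records) / len(records)
```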
- Baselines: Ego-R1 was compared against a wide range of state-of-the-art models, grouped into four categories:
  - MLLMs: `LongVA`, `LLaVA-Video`, `LLaVA-OneVision`, `InternVideo2.5`, and the proprietary `Gemini-1.5-Pro`. These are end-to-end models.
  - RAG Methods: `LLaVA-Video + Video-RAG` and `LongVA + Video-RAG`. These combine an MLLM with a retrieval component.
  - Reasoning Models: `Video-R1`, a model specifically designed for video reasoning.
  - Video Agents: `VideoAgent` and `LLaVA-OneVision + T*`, which are agent-based systems but with less dynamic reasoning pipelines.
6. Results & Analysis
6.1. Core Results
The main results are presented in Table 2, which has been transcribed below.
This is a transcription of Table 2 from the paper.
| Method | Size | Frames | VideoMME (long), 41 min | EgoSchema, 3 min | EgoLifeQA, 44.3 h | Ego-R1 Bench, 44.3 h |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **MLLMs** | | | | | | |
| LongVA [81] | 7B | 64 | 45.0 | 44.1 | 33.0 | 23.0 |
| LLaVA-Video [82] | 7B | 64 | 61.5 | 57.3 | 36.4 | 29.0 |
| LLaVA-OneVision [28] | 7B | 1 FPS | 60.0 | 60.1 | 30.8 | 31.6 |
| InternVideo2.5 [64] | 8B | 512 | 53.4 | 63.9 | 33.0 | 34.0 |
| Gemini-1.5-Pro [58] | - | - | 67.4 | 72.2 | 36.9 | 38.3 |
| **RAG Methods** | | | | | | |
| LLaVA-Video + Video-RAG [37] | 7B | 64 | 46.0 | 66.7 | 30.0 | 29.3 |
| LongVA + Video-RAG [37] | 7B | 64 | 55.7 | 41.0 | 26.0 | 31.0 |
| **Reasoning Models** | | | | | | |
| Video-R1 [16] | 7B | 64 | 50.8 | - | 34.0 | 20.0 |
| **Video Agents** | | | | | | |
| VideoAgent [63] | - | 8 | 50.8 | 54.1 | 29.2 | 32.6 |
| LLaVA-OneVision + T* [79] | 7B | 8 | 46.3 | 66.6 | 35.4 | 35.6 |
| **Ours** | | | | | | |
| Ego-R1 | 3B | - | 64.9 | 68.2 | 36.0* | 46.0 |

VideoMME (long) is an exocentric benchmark; EgoSchema, EgoLifeQA, and Ego-R1 Bench are egocentric.
- Performance on `Ego-R1 Bench`: This is the most significant result. Ego-R1 achieves 46.0% accuracy, dramatically outperforming all other methods, including the powerful proprietary model `Gemini-1.5-Pro` (38.3%). This demonstrates the effectiveness of the dynamic CoTT approach for the novel challenge of ultra-long video reasoning.
- Generalization to Exocentric Video: On `VideoMME (long)`, Ego-R1 achieves 64.9%, the second-highest score overall and the best among open-weight models. This shows that the reasoning framework, though trained on egocentric data, generalizes well to other video types.
- Performance on Other Egocentric Benchmarks: On `EgoSchema`, Ego-R1 is highly competitive (68.2%), second only to Gemini. On `EgoLifeQA`, its performance (36.0%) is on par with the top baselines.
- Model Size Efficiency: Notably, Ego-R1 uses a 3B-parameter model, which is much smaller than the 7B/8B models used by most competitors, highlighting the efficiency of its agentic architecture.
6.2. Ablation Studies
Ablation studies were conducted to dissect the contribution of each component of the Ego-R1 framework.
- Impact of Training Regimes (Table 3):

  This is a transcription of Table 3 from the paper.

| Base Model | SFT | RL | Acc. % | Format Acc. % |
| :--- | :---: | :---: | :--- | :--- |
| Qwen-2.5-3B-Instruct | | | 1.4 | 4.3 |
| | | ✓ | 0.0 (↓1.4) | 13.3 (↑9.0) |
| | ✓ | | 34.3 (↑32.9) | 100.0 (↑95.7) |
| | ✓ | ✓ | 46.0 (↑44.6) | 100.0 (↑95.7) |

  - Key Insight: SFT is crucial. Without it, the model cannot perform the task (1.4% accuracy). RL alone fails completely (0.0% task accuracy) because it does not learn the reasoning structure. SFT provides the foundational knowledge of how to reason and use tools, while the subsequent RL stage refines this strategy to achieve the best performance.
- Impact of Tool Configuration (Table 4):

  This is a transcription of the two sub-tables in Table 4 from the paper.

| Method | Video_LLM | Ego-R1 Bench |
| :--- | :--- | :--- |
| Ego-R1 | LLaVA-Video [82] | 43.7 |
| Ego-R1 | Gemini-1.5-Pro [58] | 46.0 |

| Method | Tool-used | Ego-R1 Bench |
| :--- | :--- | :--- |
| Ego-R1 | RAG only | 39.7 |
| Ego-R1 | Full | 46.0 |

  - Key Insights:
    - Modularity: Using a stronger visual tool (`Gemini-1.5-Pro` as the `video_llm`) improves overall performance. This validates the modular design of Ego-R1, which allows individual components to be upgraded easily.
    - Necessity of Full Toolkit: Relying solely on the `RAG` tool results in a significant performance drop. This confirms that both high-level temporal retrieval (from RAG) and detailed visual analysis (from `video-llm` and `vlm`) are essential for solving complex reasoning tasks.
6.3. Qualitative Analysis
Image 5 shows a side-by-side comparison of reasoning traces from Video-R1 and Ego-R1.
This figure is a side-by-side comparison of the question-answering process and results of Video-R1 and Ego-R1 across four cases. Each case contains the question, the options, and both methods' reasoning steps, tools used, and final answers. Red crosses mark Video-R1's incorrect answers and green check marks mark Ego-R1's correct answers. By decomposing the problem into multiple steps and invoking multimodal tools, Ego-R1 achieves more accurate understanding and reasoning over long spans of video.
- In successful cases (1-3), `Ego-R1` demonstrates a more detailed and logical reasoning process. It breaks the problem down, uses the `h-rag` tool to narrow down the time frame, and then uses visual tools to confirm details, leading to the correct answer. The step-by-step output makes its decision-making process transparent and interpretable.
- Case 4 highlights a failure mode. The agent correctly identifies a relevant time range in Step 1 but fails to explore it further in subsequent steps, leading to an incorrect final answer. This shows that while powerful, the agent's policy is not perfect and can sometimes fail to follow through on promising leads.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces Ego-R1, a powerful and innovative framework for reasoning over ultra-long egocentric videos. By combining a dynamic, agent-based approach with a Chain-of-Tool-Thought process and a specialized toolkit, Ego-R1 pushes the boundary of video understanding from hours to weeks. The work not only demonstrates superior performance but also provides a scalable, modular, and interpretable solution to a very challenging problem.
- Limitations & Future Work:
- Author-Acknowledged: The authors propose future work leveraging the multi-perspective nature of their data collection for tasks like social behavior analysis (modeling group activities) and building a personal habits tracker (identifying individual routines and patterns over long periods).
  - Implicit Limitations: The framework's performance is inherently capped by the capabilities of its individual tools. An error in an `h-rag` summary or a hallucination from the `video-llm` can derail the entire reasoning chain. The generation of CoTT data also relies on a powerful proprietary model, which might introduce specific biases into the training data.
- Personal Insights & Critique:
- Significance: Ego-R1 represents a significant conceptual leap in long-form video understanding. The move away from end-to-end models towards modular, agentic systems that mimic human-like decompositional reasoning is a promising direction for tackling complex, real-world AI problems.
- Practical Implications: The framework is a strong step toward creating genuinely useful personal AI assistants. An agent that can accurately recall and reason about one's life events over weeks has immense potential for memory augmentation, health monitoring, and personalized assistance.
- Open Questions: How robust is the agent to noisy or ambiguous real-world data? Can the reasoning policy generalize to completely unseen types of questions or tools? The dependency on proprietary models for data generation and as tools raises questions about reproducibility and accessibility for the wider research community. Nevertheless, Ego-R1 sets a new standard and a clear path forward for the field.