DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
TL;DR Summary
This paper presents DoraemonGPT, an LLM-driven system for understanding dynamic scenes, overcoming the limitations of current visual agents focused on static images. By converting videos into symbolic memory and utilizing sub-task tools, it enables effective spatio-temporal reasoning.
Abstract
Recent LLM-driven visual agents mainly focus on solving image-based tasks, which limits their ability to understand dynamic scenes, making it far from real-life applications like guiding students in laboratory experiments and identifying their mistakes. Hence, this paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Considering the video modality better reflects the ever-changing nature of real-world scenarios, we exemplify DoraemonGPT as a video agent. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. This structured representation allows for spatial-temporal querying and reasoning by well-designed sub-task tools, resulting in concise intermediate results. Recognizing that LLMs have limited internal knowledge when it comes to specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to assess external knowledge and address tasks across different domains. Moreover, a novel LLM-driven planner based on Monte Carlo Tree Search is introduced to explore the large planning space for scheduling various tools. The planner iteratively finds feasible solutions by backpropagating the result's reward, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios. The code will be released at https://github.com/z-x-yang/DoraemonGPT.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
1.2. Authors
Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang
1.3. Journal/Conference
The paper was published on arXiv as a preprint and later mentioned as "ICML24" in Table 2a and 2b, indicating acceptance at the International Conference on Machine Learning (ICML) in 2024. ICML is a top-tier international academic conference in the field of machine learning, highly reputable and influential.
1.4. Publication Year
2024 (Published at UTC: 2024-01-16T14:33:09.000Z)
1.5. Abstract
The paper introduces DoraemonGPT, a comprehensive system driven by Large Language Models (LLMs) designed to understand dynamic scenes, contrasting with most existing LLM-driven visual agents that primarily focus on image-based tasks. Exemplified as a video agent, DoraemonGPT processes an input video and a question/task by first converting the video into a symbolic memory that stores task-related attributes. This structured representation enables spatial-temporal querying and reasoning through well-designed sub-task tools, yielding concise intermediate results. To overcome LLMs' limitations in specialized domain knowledge, plug-and-play tools are incorporated to access external knowledge. A novel LLM-driven planner based on Monte Carlo Tree Search (MCTS) explores the large planning space for scheduling these tools. This planner iteratively finds feasible solutions by backpropagating result rewards, and multiple solutions can be summarized into an improved final answer. The system's effectiveness is evaluated on three benchmarks and several real-world scenarios.
1.6. Original Source Link
https://arxiv.org/abs/2401.08392 (Preprint) PDF Link: https://arxiv.org/pdf/2401.08392v4.pdf
2. Executive Summary
2.1. Background & Motivation
Core Problem
The core problem DoraemonGPT aims to solve is the limited ability of current Large Language Model (LLM)-driven visual agents to understand and reason about dynamic scenes, such as those found in videos. Existing visual agents predominantly focus on static image-based tasks, which restricts their applicability in real-life scenarios requiring continuous temporal understanding and interaction.
Importance of the Problem
Understanding dynamic scenes is crucial for many real-world applications. The paper explicitly mentions examples like guiding students in laboratory experiments and identifying their mistakes, which require not just recognizing objects, but understanding sequences of actions, their timing, and causal relationships. The real world is inherently dynamic and ever-changing. Therefore, enabling LLMs to process and reason over video data is a significant step towards more advanced, generally intelligent AI systems. Challenges include:
- Spatial-temporal Reasoning: The ability to infer relationships between instances across both space and time (e.g., object trajectories, interactions, scene changes).
- Larger Planning Space: Videos introduce the complexity of actions, intentions, and temporal semantics, significantly expanding the search space for decomposing and solving tasks compared to static images.
- Limited Internal Knowledge: LLMs, despite their vast training data, cannot encode all specialized knowledge required for every possible video understanding task (e.g., scientific principles).
Paper's Entry Point or Innovative Idea
The paper's entry point is to design a comprehensive and conceptually elegant system that empowers LLMs to understand dynamic video scenes by addressing the aforementioned challenges. Its innovative idea centers around three pillars:
- Structured Information Collection: Converting raw video into a task-related symbolic memory (TSM) that stores relevant spatial-temporal attributes, enabling efficient querying and reasoning. This avoids overwhelming LLMs with excessive context, which can hinder performance.
- Enhanced Solution Exploration: Introducing a novel LLM-driven planner based on Monte Carlo Tree Search (MCTS) to explore the large planning space. This allows the system to consider multiple potential solutions and refine its answers, moving beyond greedy, single-path reasoning.
- Extensible Knowledge: Incorporating plug-and-play tools to access external, specialized knowledge sources, effectively expanding the LLM's expertise beyond its internal training data.
2.2. Main Contributions / Findings
Primary Contributions
- DoraemonGPT System Design: Proposes a comprehensive and conceptually elegant LLM-driven system for dynamic scene understanding, exemplified as a video agent. It is intuitive, versatile, and compatible with various foundation models.
- Task-related Symbolic Memory (TSM): Introduces a novel approach to create a compact and queryable TSM by decoupling spatial-temporal attributes into space-dominant and time-dominant memories. LLMs dynamically select and extract only task-relevant information into an SQL table for efficient access.
- Sub-task and Knowledge Tools: Designs sub-task tools for efficient spatial-temporal querying of the TSM (e.g., "When," "Why," "What," "How," "Count") and plug-and-play knowledge tools to incorporate external, domain-specific knowledge sources (symbolic, textual, web).
- Monte Carlo Tree Search (MCTS) Planner: Develops a novel LLM-driven MCTS planner to effectively explore the large planning space of complex video tasks. This planner iteratively finds feasible solutions by backpropagating rewards, allowing for the generation and summarization of multiple solutions.
- Extensive Evaluation: Conducts extensive experiments on three benchmarks (NExT-QA, TVQA+, Ref-YouTube-VOS) and several in-the-wild scenarios, demonstrating the system's effectiveness and versatility.
Key Conclusions or Findings
- Superior Performance on Dynamic Tasks: DoraemonGPT significantly outperforms recent LLM-driven competitors (e.g., ViperGPT, VideoChat) on causal, temporal, and descriptive reasoning tasks in video question answering, and substantially improves referring video object segmentation. This highlights the efficacy of its MCTS planner and TSM.
- Necessity of Structured Memory: The Task-related Symbolic Memory is crucial, especially for tasks requiring fine-grained understanding like referring object segmentation, where DoraemonGPT remarkably surpasses supervised models without learning on the specific dataset.
- Effectiveness of MCTS Planner: The MCTS planner effectively explores the large planning space, yielding better performance compared to greedy or naive search methods, particularly when generating multiple answer candidates. The ability to backpropagate rewards guides more efficient exploration.
- Benefits of Knowledge Extension: The plug-and-play knowledge tools enable DoraemonGPT to tackle complex, domain-specific problems that LLMs alone cannot handle, showcasing its extensibility.
- Foundation Model Agnostic: The system can benefit from advancements in underlying foundation models, as demonstrated by the improved performance when using InstructBLIP for captioning.
- Real-world Applicability: DoraemonGPT can handle complex in-the-wild tasks, such as checking experimental operations or video editing, which were previously challenging for existing approaches.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DoraemonGPT, a reader needs to be familiar with several core concepts in artificial intelligence, particularly in the fields of natural language processing (NLP) and computer vision (CV).
- Large Language Models (LLMs): These are advanced artificial intelligence models (like GPT-3.5-turbo, PaLM 2, Llama) trained on vast amounts of text data to understand, generate, and process human language. They can perform tasks like question answering, text summarization, translation, and even code generation. In DoraemonGPT, LLMs act as the central reasoning engine, coordinating tools and making decisions. They are crucial for task decomposition, understanding context, and synthesizing final answers.
- Visual Agents: These are AI systems that combine visual perception with decision-making and action capabilities, often driven by LLMs. They can interpret visual input (images or videos) and perform tasks based on that understanding. The paper notes that most current visual agents focus on static images.
- Multi-modal Understanding: This refers to the ability of AI systems to process and integrate information from multiple modalities, such as text, images, audio, and video. DoraemonGPT aims for multi-modal understanding by integrating video (visual and auditory) with textual queries.
- Foundation Models: These are large AI models pre-trained on broad data at scale, designed to be adapted to a wide range of downstream tasks. Examples include BLIP (for image captioning), YOLOv8 (for object detection), Whisper (for speech recognition), etc. DoraemonGPT leverages several such models as tools to extract information from videos.
- Monte Carlo Tree Search (MCTS): An algorithmic search strategy commonly used in artificial intelligence, particularly in game playing (e.g., Go, chess). It builds a search tree by repeatedly performing four steps:
- Selection: Choose the best child node to explore further.
- Expansion: Add a new child node to the selected node.
- Simulation (Rollout): Run a simulated playout from the new node to a terminal state.
- Backpropagation: Update the statistics (e.g., win/loss count, reward) of the nodes along the path from the new node to the root, based on the simulation result.
In DoraemonGPT, MCTS is adapted to guide the LLM's planning process, helping it explore different sequences of tool calls to find optimal solutions in a large planning space.
- ReAct (Reasoning and Acting): A prompting strategy for LLMs that interweaves reasoning (Thought) and acting (Action, Action Input, Observation) steps. The Thought step allows the LLM to explicitly verbalize its reasoning process, and the Action step allows it to interact with external tools or environments. The Observation step provides feedback from the environment. DoraemonGPT uses a ReAct-style step for its non-root nodes in the MCTS planner.
- In-context Learning (ICL): A paradigm where LLMs learn to perform new tasks by being given a few examples (demonstrations) within their input prompt, without requiring parameter updates or fine-tuning. DoraemonGPT uses ICL for selecting Task-related Symbolic Memory (TSM) types and for guiding sub-task tools.
- Symbolic Memory (e.g., SQL): Instead of raw, unstructured data, symbolic memory stores information in a structured, semantic format that can be easily queried and reasoned about. SQL (Structured Query Language) is a standard language for managing and querying relational databases. DoraemonGPT converts video information into SQL tables for efficient access.
3.2. Previous Works
The paper positions DoraemonGPT within the context of recent advancements in LLM-driven visual agents and multi-modal understanding.
- LLM-driven Agents for Image-based Tasks: The introduction highlights that recent LLM-driven visual agents mainly focus on solving image-based tasks. This refers to works like ViperGPT (Surís et al., 2023), HuggingGPT (Shen et al., 2023), and Visual Programming (Gupta & Kembhavi, 2023). These systems demonstrate promise in decomposing complex image tasks into subtasks and solving them using various vision-and-language models (VLMs) or APIs.
  - ViperGPT (Surís et al., 2023): Leverages code generation models to create subroutines from VLMs through a provided API. It solves tasks by generating Python code that is then executed. DoraemonGPT compares directly against ViperGPT, noting its limitations in dynamic video understanding.
  - VideoChat (Li et al., 2023b): An end-to-end chat-centric video understanding system that integrates several foundation models and LLMs to build a chatbot. It is mentioned as a competitor but often treats video as a sequence of images or relies on pre-extracted information, which DoraemonGPT aims to improve upon.
- Multi-modal Understanding Systems: Earlier efforts focused on specific tasks (Lu et al., 2019; Marino et al., 2019; 2021; Bain et al., 2021). More general systems emerged with Frozen (Tsimpoukelli et al., 2021) showing how to empower LLMs with visual input. This led to large multimodal models (OpenAI, 2023; Driess et al., 2023; Zhu et al., 2023a) and zero-shot systems (Li et al., 2023a; Yu et al., 2023). DoraemonGPT builds on this by focusing on the dynamic modality.
- LLM-driven Modular Systems (Planning):
  - Fixed Paths: Many systems (Gupta & Kembhavi, 2023; Wu et al., 2023a; Surís et al., 2023; Shen et al., 2023) decompose tasks into an ordered sequence of subtasks, each addressed by a specific module. ViperGPT and HuggingGPT fall into this category.
  - Dynamic Paths: Other works (Nakano et al., 2021; Yao et al., 2022; Yang et al., 2023) perform planning and execution concurrently, allowing for interactive and error-tolerant approaches. ReAct (Yao et al., 2022) is a key example here. DoraemonGPT extends this by using MCTS for more robust exploration of the planning space.
- LLMs with External Memory: This area explores how to augment LLMs with external knowledge to overcome their internal knowledge limitations and context window constraints.
  - Textual Memory: Storing long contexts as embeddings and retrieving them by similarity (Zhu et al., 2023b; Park et al., 2023). Example: document question answering.
  - Symbolic Memory: Modeling memory as structured representations with symbolic languages (e.g., SQL for databases; Cheng et al., 2022; Sun et al., 2023; Hu et al., 2023). DoraemonGPT falls into this category by creating SQL-based symbolic memory from videos.
3.3. Technological Evolution
The evolution of AI in multi-modal understanding and LLM-driven agents has progressed from:
- Task-specific Models: Early models were often trained for specific vision-language tasks (e.g., image captioning, visual question answering) and lacked generalizability.
- General Multi-modal Models: The rise of foundation models like CLIP (Radford et al., 2021), BLIP (Li et al., 2022), and GPT-4 (OpenAI, 2023) demonstrated impressive zero-shot and few-shot capabilities across a broader range of visual and textual tasks. These models could encode richer representations.
- LLM-driven Agents for Static Images: The power of LLMs was then harnessed to act as planners or orchestrators for visual tasks. Systems like ViperGPT and Visual ChatGPT showed how LLMs could decompose tasks and call specialized vision-language models (VLMs) to process static images, often by generating code or structured instructions.
- Towards Dynamic Scenes (DoraemonGPT's context): The natural next step is to extend LLM-driven agents to dynamic scenes (videos). This is where DoraemonGPT enters, recognizing that simply treating videos as sequences of static images or relying on pre-extracted information (as some concurrent works do) is insufficient for deep spatial-temporal reasoning. The field is moving towards more intelligent agents that can reason over continuous, evolving information streams and adaptively use knowledge.
3.4. Differentiation Analysis
DoraemonGPT differentiates itself from previous LLM-driven visual agents through several key innovations, particularly in its approach to handling dynamic scenes:
- Focus on Dynamic Modalities: Unlike most existing agents that mainly focus on solving image-based tasks (ViperGPT, Visual ChatGPT), DoraemonGPT is specifically designed for videos, aiming to understand their dynamic and ever-changing nature. This is a fundamental shift from static to temporal reasoning.
- Task-related Symbolic Memory (TSM): Previous LLM-driven agents for video often treat video as a sequence of images (ViperGPT) or build chatbots based on pre-extracted information (VideoChat). This can lead to redundant context or missing crucial dynamic cues. DoraemonGPT's TSM is novel because it dynamically selects and extracts only task-relevant spatial and temporal attributes (decoupled into space-dominant and time-dominant memories). This structured, SQL-based representation makes information querying efficient and avoids overwhelming the LLM.
- MCTS Planner for Large Planning Space: Many prior LLM-driven planners use greedy search methods (ViperGPT, HuggingGPT), generating a single chain of actions. While ReAct-style planning allows for dynamic paths, it doesn't necessarily explore the solution space broadly. DoraemonGPT introduces a novel LLM-driven MCTS planner. By adapting MCTS, it can efficiently explore the large planning space inherent in dynamic video tasks, find multiple feasible solutions, and summarize them into an improved final answer. This is a significant improvement over single-path, greedy approaches, especially for open-ended questions.
- Comprehensive Tool Integration and Knowledge Extension: While other agents use tools (ViperGPT generates Python code for APIs, HuggingGPT connects foundation models), DoraemonGPT explicitly designs sub-task tools tailored for spatial-temporal reasoning over its TSM. Furthermore, its plug-and-play knowledge tools (symbolic, textual, web) provide a structured way to access external, domain-specific knowledge, directly addressing the limited internal knowledge of LLMs. This makes DoraemonGPT more robust for specialized tasks like scientific experiment analysis.
- Explicit Decoupling of Spatial-Temporal Attributes: The explicit decoupling of spatial-temporal attributes into two distinct memory types (space-dominant and time-dominant) is a specific design choice that enhances the system's ability to handle diverse video questions, whether they concern objects' movements or overall video events.

In essence, DoraemonGPT moves beyond simply chaining VLMs or treating videos superficially. It systematically tackles the challenges of video understanding by building a compact, queryable memory and employing an intelligent search strategy (MCTS) to reason over this memory, augmented by external knowledge.
4. Methodology
DoraemonGPT is an LLM-driven agent designed to understand dynamic video scenes by effectively utilizing various tools to decompose complex video tasks into sub-tasks and solve them. The overall architecture, as shown in Figure 2 of the original paper, consists of three main components: Task-related Symbolic Memory (TSM), Sub-task and Knowledge Tools, and a Monte Carlo Tree Search (MCTS) Planner.
The general workflow is as follows:
- Input: Given a video and a textual task/question.
- Memory Extraction: DoraemonGPT first analyzes the question to determine relevant information and then extracts a Task-related Symbolic Memory (TSM) from the video.
- Planning and Execution: Using a Monte Carlo Tree Search (MCTS) Planner, DoraemonGPT automatically schedules a set of tools (sub-task tools for querying the TSM, knowledge tools for external knowledge, and other utility tools) to solve the task.
- Solution Refinement: The MCTS planner explores the planning space, generates multiple possible answers, and then summarizes them into an improved final answer.

The following are the results from Figure 2 of the original paper:
The figure is a schematic showing how the DoraemonGPT system processes a video input and task. Through task-related symbolic memory construction, the system can perform space-dominant and time-dominant queries, guiding the user to descriptions of the experimental steps and the underlying scientific principles. Using the Monte Carlo Tree Search planner, the system generates feasible solutions and summarizes the results.
4.1. Task-related Symbolic Memory (TSM)
Videos are complex dynamic data containing rich spatial-temporal relations. For a given question about a video, only a subset of attributes is critical for the solution, while a large amount of information might be irrelevant. To address this, DoraemonGPT extracts and stores potentially relevant video information into a TSM before attempting to solve the question.
4.1.1. TSM Construction
The construction of TSM involves two main steps:
- Task Type Selection: An LLM-driven planner uses an in-context learning (ICL) method to determine the type of TSM needed based on the question. This is done by prompting the LLM with task descriptions for each TSM type. The LLM predicts a suitable TSM in the format "Action: <TSM_type> construction ...".
- Attribute Extraction and Storage: Once the TSM type is identified, the corresponding API is called to extract task-related attributes. These attributes are then stored in an SQL table, making them accessible via symbolic languages (e.g., SQL).

DoraemonGPT designs two main types of memory based on spatial-temporal decoupling, a concept widely applied in video representation learning (Bertasius et al., 2021; Arnab et al., 2021):

- Space-dominant Memory (SDM): This memory type is primarily used for questions related to specific targets (e.g., persons, animals) or their spatial relations.
  - Extraction Process: Multi-object tracking methods (Maggiolino et al., 2023) are used to detect and track instances.
  - Attributes: Each instance stores attributes including:
    - Unique ID: To identify individual objects.
    - Semantic Category: The type of object (e.g., "person").
    - Trajectory & Segmentation: For localization, capturing the object's movement (bounding box) and shape (mask) in each frame.
    - Appearance Description: Textual descriptions of the instance's visual characteristics, extracted by models like BLIP (Li et al., 2022) / BLIP-2 (Li et al., 2023a), used for text-based grounding.
    - Action Classification: The action performed by the instance.
- Time-dominant Memory (TDM): This memory type focuses on constructing temporal-related information of the video, requiring comprehension of content throughout the video.
  - Attributes: Stored attributes include:
    - Timestamp: The time marker of a frame or clip.
    - Audio Content: Speech recognition results obtained via ASR (e.g., Whisper by Radford et al., 2023).
    - Optical Content: Optical Character Recognition (OCR) results (e.g., PaddlePaddle, 2023) for text appearing in the video.
    - Captioning: Frame-level captions generated by BLIPs (Li et al., 2022; 2023a; Dai et al., 2023) and clip-level captions derived by deduplicating similar and continuous frame-level results.

The following are the results from Table 1 of the original paper:

| Attribute | Used Model | Explanation |
|---|---|---|
| Space-dominant Memory | | |
| ID number | – | A unique ID assigned to an instance |
| Category | YOLOv8 (Jocher et al., 2023) / Grounding DINO (Liu et al., 2023c) | The category of an instance, e.g., person |
| Trajectory | Deep OC-Sort (Maggiolino et al., 2023) / DeAOT (Yang & Yang, 2022) | An instance's bounding box in each frame |
| Segmentation | YOLOv8-Seg (Jocher et al., 2023) / DeAOT (Yang & Yang, 2022) | An instance's segmentation mask in each frame |
| Appearance | BLIP (Li et al., 2022) / BLIP-2 (Li et al., 2023a) | A description of an instance's appearance |
| Action | InternVideo (Wang et al., 2022) | The action of an instance |
| Time-dominant Memory | | |
| Timestamp | – | The timestamp of a frame/clip |
| Audio content | Whisper (Radford et al., 2023) | Speech recognition results of the video |
| Optical content | OCR (PaddlePaddle, 2023) | Optical character recognition results of the video |
| Captioning | BLIP (Li et al., 2022) / BLIP-2 (Li et al., 2023a) / InstructBLIP (Dai et al., 2023) | Frame-level/clip-level captioning results |
Table 1: Attributes and Models used for TSM Construction.
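To make the memory layout concrete, the minimal sketch below mirrors the attributes of Table 1 as two sqlite tables. The table and column names (space_dominant_memory, time_dominant_memory, etc.) and the example rows are illustrative assumptions, not the paper's actual schema.

```python
import sqlite3

# Illustrative TSM schema mirroring Table 1; names/types are assumptions, not the paper's schema.
conn = sqlite3.connect("video_memory.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS space_dominant_memory (
    instance_id     INTEGER,  -- unique ID assigned to a tracked instance
    frame_idx       INTEGER,  -- frame index
    category        TEXT,     -- detector output, e.g., 'person'
    bbox            TEXT,     -- trajectory: bounding box per frame, e.g., 'x1,y1,x2,y2'
    mask_rle        TEXT,     -- segmentation mask (run-length encoded)
    appearance      TEXT,     -- textual appearance description from a captioner
    action          TEXT      -- action label from an action recognizer
);
CREATE TABLE IF NOT EXISTS time_dominant_memory (
    timestamp       REAL,     -- time of a frame/clip in seconds
    audio_content   TEXT,     -- ASR transcript for the segment
    optical_content TEXT,     -- OCR result for the segment
    caption         TEXT      -- frame-/clip-level caption
);
""")

# Example rows a perception pipeline might write (values are made up).
conn.execute("INSERT INTO space_dominant_memory VALUES (1, 0, 'person', '10,20,110,220', '...', "
             "'a man in a red shirt', 'walking')")
conn.execute("INSERT INTO time_dominant_memory VALUES (0.0, '', '', 'a man walks past a sofa')")
conn.commit()
```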
4.1.2. Sub-task Tools
While LLM-driven agents can access external information by learning from the entire memory or generating symbolic sentences (e.g., SQL), these methods can increase context length, potentially leading to information omission or distraction. To improve efficiency and effectiveness, DoraemonGPT designs a series of sub-task tools, each responsible for querying information from the TSMs by answering specific sub-task questions.
- Tool Functionality: Each sub-task tool is an individual LLM-driven sub-agent with task-specific prompts and examples. It generates SQL queries to access the TSMs and answer the given sub-task question (a minimal sketch of such a tool follows this list).
- Tool Description: The LLM-driven planner learns about each tool through its in-context description, which includes the sub-task description, tool name, and tool inputs.
- Tool Calling: To call a tool, DoraemonGPT parses LLM-generated commands like "Action: [tool_name] Input: video_name#(sub_question...)".
- Types of Sub-task Tools:
  - When: For temporal understanding (e.g., "When did the dog walk past the sofa?").
  - Why: For causal reasoning (e.g., "Why did the lady shake the toy?").
  - What: For describing required information (e.g., "What's the name of the experiment?").
  - How: For manner, means, or quality (e.g., "How does the baby keep himself safe?").
  - Count: For counting instances (e.g., "How many people are in the room?").
  - Other: For questions not covered by the above (e.g., "Who slides farther at the end?").
- Flexibility: A sub-question might be suitable for multiple sub-tools, and the MCTS planner (§4.3) is designed to explore these different selections.
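As a rough illustration of how such a sub-task tool could be wired up (assuming the hypothetical sqlite schema sketched above), the snippet below shows a "When"-style tool that asks an LLM to emit a single SQL query, executes it against the TSM, and returns a concise observation. The `llm` argument stands for any chat-completion callable and is not part of the paper's released code.

```python
import sqlite3

WHEN_TOOL_PROMPT = (
    "You answer temporal sub-questions about a video by writing ONE SQLite query over\n"
    "time_dominant_memory(timestamp REAL, audio_content TEXT, optical_content TEXT, caption TEXT).\n"
    "Return only the SQL.\nSub-question: {sub_question}\nSQL:"
)

def when_tool(sub_question: str, db_path: str, llm) -> str:
    """Hypothetical 'When' sub-task tool: the LLM writes SQL, we execute it on the TSM."""
    # e.g. the LLM might return: SELECT timestamp FROM time_dominant_memory WHERE caption LIKE '%dog%sofa%'
    sql = llm(WHEN_TOOL_PROMPT.format(sub_question=sub_question))
    rows = sqlite3.connect(db_path).execute(sql).fetchall()
    # The concise intermediate result is handed back to the planner as the Observation.
    return f"Relevant timestamps: {rows}" if rows else "No matching moment found."
```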
4.2. Knowledge Tools and Others
DoraemonGPT acknowledges that LLM-driven agents might lack specialized domain knowledge. Therefore, it supports plug-and-play integration of external knowledge sources to assist the LLM in comprehending specialized content.
4.2.1. Knowledge Tools
Each knowledge tool consists of:
- In-context knowledge description: Describes the external knowledge source.
- API function: Queries information from the source via question answering.
Three types of API functions are considered for different knowledge forms:
- Symbolic Knowledge: For structured formats like Excel or SQL tables. The API function is a symbolic question-answering sub-agent, similar to sub-task tools.
- Textual Knowledge: For natural language text like research publications or textbooks. The API function is built on text embedding and searching (OpenAI, 2022); a minimal sketch follows this list.
- Web Knowledge: For information from the internet. The API function uses search engine APIs (e.g., Google, Bing).
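For the textual case, here is a minimal sketch of an embedding-retrieval knowledge tool. The class name, chunking strategy, and embedding function are assumptions for illustration; any text-embedding model could be plugged in.

```python
import numpy as np

class TextualKnowledgeTool:
    """Hypothetical plug-and-play tool: retrieve passages from a domain text by embedding similarity."""

    def __init__(self, passages, embed_fn):
        self.passages = passages                                  # e.g., chunks of a chemistry textbook
        self.embed_fn = embed_fn                                  # any function mapping text -> 1-D vector
        self.vectors = np.stack([embed_fn(p) for p in passages])  # pre-compute passage embeddings

    def query(self, question: str, top_k: int = 3) -> str:
        q = self.embed_fn(question)
        # Cosine similarity between the question and every stored passage.
        sims = self.vectors @ q / (np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-8)
        best = np.argsort(-sims)[:top_k]
        # The retrieved passages are returned as the tool's Observation for the planner.
        return "\n".join(self.passages[i] for i in best)
```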
4.2.2. General Utility Tools
Beyond knowledge tools, DoraemonGPT also integrates general utility tools commonly found in LLM-driven agents (Xi et al., 2023) to complete specialized vision tasks (e.g., video editing and inpainting).
4.3. Monte Carlo Tree Search (MCTS) Planner
Previous LLM-driven planners often follow a greedy search method, generating a single action sequence. DoraemonGPT proposes a novel tree-search-like planner equipped with MCTS (Coulom, 2006; Kocsis & Szepesvári, 2006; Browne et al., 2012) to efficiently explore the large planning space and find better solutions.
The planning space is viewed as a tree:
- Root Node: Represents the initial question input.
- Non-root Node: Represents an action or tool call, structured as a ReAct-style step: (thought, action, action input, observation).
- Leaf Node: Contains a final answer (or indicates a failure).
- Action Sequence: A path from the root node to a leaf node.

The MCTS planner iteratively executes four phases N times, producing N solutions:

The following are the results from Figure 3 of the original paper:

The figure is a schematic of the MCTS-based planning process, showing the four steps of node selection, branch expansion, chain execution, and reward back-propagation. Each step is drawn as a tree structure annotated with the reward values of the different actions.
4.3.1. Node Selection
- Purpose: Select an expandable node from which to plan a new solution.
- First Iteration: Only the root node is selectable.
- Subsequent Iterations: A non-leaf node $n_i$ is randomly selected based on its sampling probability, formulated as a softmax over node rewards:
$$ P(n_i) = \frac{\exp(R(n_i))}{\sum_j \exp(R(n_j))} $$
where $R(n_i)$ is the reward value of node $n_i$, initialized to 0 and updated during Reward Back-propagation. Nodes with higher rewards have a greater probability of being selected.
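A minimal sketch of this selection step is shown below, assuming node objects with a `reward` attribute. The softmax form follows the description above (higher reward, higher sampling probability) rather than code taken from the paper.

```python
import math
import random

def select_expandable_node(nodes):
    """Sample an expandable (non-leaf) node with probability proportional to softmax(reward).

    `nodes` is a list of objects exposing a `.reward` attribute (0 at initialization).
    """
    weights = [math.exp(n.reward) for n in nodes]
    total = sum(weights)
    return random.choices(nodes, weights=[w / total for w in weights], k=1)[0]
```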
4.3.2. Branch Expansion
- Purpose: Add a new child to the selected expandable node, creating a new branch (a new tool call).
- Process: To encourage the LLM to generate a tool call different from previous child nodes, historical tool actions are added to the LLM's prompt, instructing it to make a different choice. This in-context prompt is then removed for subsequent steps in the current chain execution.
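A small sketch of how such an expansion hint might be built; the wording and helper name are illustrative, not taken from the paper.

```python
def expansion_prompt(previous_child_actions):
    """Illustrative expansion hint: list sibling tool calls already tried and ask for a new one."""
    if not previous_child_actions:
        return ""
    tried = "; ".join(previous_child_actions)  # e.g. "When_tool(video#...); What_tool(video#...)"
    return (f"You have previously tried the following actions from this state: {tried}. "
            f"Now choose a different tool or a different tool input.")
```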
4.3.3. Chain Execution
- Purpose: Generate a new solution (an action sequence) starting from the newly expanded branch.
- Process: A step-wise LLM-driven planner (Yao et al., 2022) generates a sequence of tool calls (nodes).
- Termination: The execution terminates upon obtaining a final answer or encountering an execution error.
4.3.4. Reward Back-propagation
- Purpose: Update the reward values of ancestor nodes based on the outcome of a newly found leaf node.
- Process: After obtaining a leaf/outcome node $n_l$, its reward is gradually propagated to its ancestor nodes up to the root.
- Reward Types:
  - Failure: If the planner produces an unexpected result (e.g., failed tool call, incorrect format), the leaf reward $r_{n_l}$ is set to a negative value (e.g., -1).
  - Non-failure: If the planner successfully produces a result (even if its correctness against ground truth is unknown), $r_{n_l}$ is set to a positive value (e.g., 1).
- Back-propagation Function: The paper uses a decay mechanism, arguing that outcomes are more related to nearby nodes. The reward update for an ancestor node $n_i$ is formulated as a distance-decayed update:
$$ R(n_i) \leftarrow R(n_i) + r_{n_l} \cdot e^{-\beta \, D(n_i, n_l)} $$
Where:
  - $R(n_i)$ is the reward of the ancestor node $n_i$.
  - $r_{n_l}$ is the reward of the leaf node $n_l$ (either positive $+\alpha$ or negative $-\alpha$); here, $\alpha$ is a positive base reward.
  - $D(n_i, n_l)$ denotes the node distance (number of steps) between node $n_i$ and the leaf node $n_l$.
  - $\beta$ is a hyperparameter controlling the decay rate of the reward.
  - The decay term shrinks with node distance: the further the node distance, the greater the reward decay. A higher $\beta$ increases the probability of expanding nodes closer to non-failure leaf nodes.

After N iterations, the planner collects at most N non-failure answers. For open-ended questions, these answers can be summarized by an LLM to generate an informative final answer. For single-/multiple-choice questions, a voting process can determine the final answer.
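A compact sketch of the back-propagation step, assuming a simple node class with `parent` and `reward` fields. The exponential decay `exp(-beta * distance)` is our reading of the described behaviour (reward shrinking with node distance, faster for larger beta), not code from the paper.

```python
import math

def backpropagate(leaf, alpha, beta, success):
    """Propagate a leaf outcome to its ancestors with distance-based decay.

    Assumes each node has `.parent` (None at the root) and a numeric `.reward`.
    `success` distinguishes non-failure (+alpha) from failure (-alpha) leaves.
    """
    leaf_reward = alpha if success else -alpha
    node, distance = leaf.parent, 1
    while node is not None:                                   # walk up to the root
        node.reward += leaf_reward * math.exp(-beta * distance)
        node, distance = node.parent, distance + 1
```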
The prompt structure for LLMs in the MCTS planner uses the ReAct format:

```
Regarding a given video from {video_filename}, answer the following questions as best you can. You have access to the following tools:
{tool_descriptions}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input_question}
{ancestor_history}
Thought: {expansion_prompt} {agent_scratchpad}
```
- {video_filename}: The file path of the input video.
- {input_question}: The given question/task regarding the video.
- {tool_descriptions}: Descriptions of available tools.
- {tool_names}: Names of available tools.
- {ancestor_history}: The ReAct history (thought, action, action input, observation) of all ancestor nodes of the current non-root node.
- {expansion_prompt}: Used to guide the LLM to make a different choice during branch expansion.
- {agent_scratchpad}: Placeholder for the ReAct output of the LLM.
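For concreteness, the template can be filled with ordinary string formatting; the values below are placeholders for illustration, and the template string is abridged relative to the full prompt shown above.

```python
# Abridged version of the ReAct template above; see the full text for the format instructions.
REACT_TEMPLATE = (
    "Regarding a given video from {video_filename}, answer the following questions as best you can. "
    "You have access to the following tools:\n{tool_descriptions}\n"
    "Action: the action to take, should be one of [{tool_names}]\n"
    "Begin!\nQuestion: {input_question}\n{ancestor_history}\n"
    "Thought: {expansion_prompt} {agent_scratchpad}"
)

prompt = REACT_TEMPLATE.format(
    video_filename="demo.mp4",
    tool_descriptions="When_tool: answers temporal sub-questions\nWhat_tool: describes requested content",
    tool_names="When_tool, What_tool",
    input_question="Why did the lady shake the toy?",
    ancestor_history="",   # ReAct steps of ancestor nodes; empty at the root
    expansion_prompt="",   # only set during branch expansion
    agent_scratchpad="",
)
print(prompt)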
5. Experimental Setup
5.1. Datasets
The authors conduct experiments on three diverse datasets to comprehensively validate DoraemonGPT's utility, covering video question answering (VQA) and referring video object segmentation tasks in dynamic scenes.
- NExT-QA (Xiao et al., 2021):
  - Description: A video question answering dataset containing 34,132 training and 4,996 validation video-question pairs.
  - Characteristics: Each question is annotated with a question type (causal, temporal, descriptive) and 5 answer candidates.
  - Usage: For ablation studies, 30 samples per type (90 questions total) are randomly sampled from the training set. The validation set is used for method comparison.
  - Example (Conceptual): A video shows someone cooking. A "causal" question might be "Why did the water boil?"; a "temporal" question might be "When did the chef add salt?"; a "descriptive" question might be "What is the person holding?".
- TVQA+ (Lei et al., 2020):
  - Description: An enhanced version of the TVQA dataset (Lei et al., 2018), augmented with 310.8K bounding boxes.
  - Characteristics: These bounding boxes link visual concepts in questions and answers to depicted objects in videos, enabling spatial-temporal grounding.
  - Usage: For evaluation, 900 samples are randomly sampled from the validation set, consistent with previous work (Gupta & Kembhavi, 2023).
  - Example (Conceptual): A video from a TV show might ask "Who passed the remote control to whom?", and the bounding boxes would help ground "remote control" and the two "persons" in the video.
- Ref-YouTube-VOS (Seo et al., 2020):
  - Description: A large-scale referring video object segmentation dataset with approximately 15,000 referential expressions associated with over 3,900 videos.
  - Characteristics: Covers diverse scenarios and aims to evaluate pixel-wise spatial-temporal segmentation.
  - Usage: The validation set (202 videos, 834 objects with expressions) is used to validate DoraemonGPT's effectiveness in segmenting objects based on textual descriptions.
  - Example (Conceptual): Given a video of a busy street and the query "the red car turning left," the system needs to identify and segment the specific red car across multiple frames.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, its conceptual definition, mathematical formula, and symbol explanation are provided below.
5.2.1. Question Answering Metrics (NExT-QA, TVQA+)
The standard metric used for question answering is top-1 accuracy.
- Conceptual Definition:
Top-1 Accuracymeasures the proportion of questions for which the model's highest-confidence answer matches the correct answer (ground truth). It indicates how often the model gets the exact answer right among a set of choices. - Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $
- Symbol Explanation:
  - Number of correct predictions: The count of instances where the model's predicted answer for a question is identical to the ground truth answer.
  - Total number of predictions: The total number of questions for which the model made a prediction.

On NExT-QA, additional specialized accuracy metrics are reported:
- AccC: Accuracy for causal questions.
- AccT: Accuracy for temporal questions.
- AccD: Accuracy for descriptive questions.
- AccA: The overall accuracy of all questions (equivalent to the general top-1 accuracy).
- Avg: The average of AccC, AccT, and AccD.
5.2.2. Referring Object Segmentation Metrics (Ref-YouTube-VOS)
For referring object segmentation, the metrics are evaluated on the official challenge server of Ref-YouTube-VOS. The primary reported metric is $\mathcal{J}\&\mathcal{F}$, which is the average of region similarity ($\mathcal{J}$) and contour accuracy ($\mathcal{F}$).
- Region Similarity ($\mathcal{J}$, Jaccard Index):
  - Conceptual Definition: Also known as Intersection over Union (IoU), this metric quantifies the overlap between the predicted segmentation mask and the ground truth mask. It measures how similar the shapes and positions of the segmented regions are. A higher value indicates better overlap.
  - Mathematical Formula: $ \mathcal{J}(S_p, S_{gt}) = \frac{|S_p \cap S_{gt}|}{|S_p \cup S_{gt}|} $
  - Symbol Explanation:
    - $S_p$: The set of pixels belonging to the predicted segmentation mask.
    - $S_{gt}$: The set of pixels belonging to the ground truth mask.
    - $|A|$: The cardinality (number of pixels) of set $A$.
    - $S_p \cap S_{gt}$: The intersection of the predicted and ground truth masks (pixels common to both).
    - $S_p \cup S_{gt}$: The union of the predicted and ground truth masks (all pixels in either mask).
- Contour Accuracy ($\mathcal{F}$, F-measure):
  - Conceptual Definition: This metric evaluates the accuracy of the boundaries (contours) of the segmented objects. It is particularly sensitive to the precision of the object edges. A higher value indicates more precise boundary alignment with the ground truth. It is often calculated as the harmonic mean of precision and recall for boundary pixels.
  - Mathematical Formula: Let $B_p$ be the set of boundary pixels of the predicted mask and $B_{gt}$ be the set of boundary pixels of the ground truth mask. Then precision $ P_c = \frac{|B_p \cap B_{gt}|}{|B_p|} $, recall $ R_c = \frac{|B_p \cap B_{gt}|}{|B_{gt}|} $, and $ \mathcal{F} = \frac{2 \cdot P_c \cdot R_c}{P_c + R_c} $. (Note: the exact formulation of contour accuracy can vary; this is a standard interpretation of the F-measure for boundaries.)
  - Symbol Explanation:
    - $P_c$: Precision, measuring how many of the predicted boundary pixels are actually correct.
    - $R_c$: Recall, measuring how many of the true boundary pixels were successfully detected.
- Overall Metric ($\mathcal{J}\&\mathcal{F}$):
  - Conceptual Definition: The average of Region Similarity ($\mathcal{J}$) and Contour Accuracy ($\mathcal{F}$). This combined metric provides a holistic evaluation, considering both the overall overlap of the segmented region and the precision of its boundaries.
  - Mathematical Formula: $ \mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2} $
  - Symbol Explanation:
    - $\mathcal{J}$: Region similarity (Jaccard Index).
    - $\mathcal{F}$: Contour accuracy (F-measure).
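As a concrete reference, here is a minimal sketch of these metrics for binary masks. The official Ref-YouTube-VOS evaluation runs on the challenge server and uses a boundary-matching tolerance, which this simplified version omits.

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index J between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary_f_measure(pred_boundary: np.ndarray, gt_boundary: np.ndarray) -> float:
    """F-measure over boundary pixels (exact-match boundaries; real evaluators allow a tolerance)."""
    tp = np.logical_and(pred_boundary, gt_boundary).sum()
    precision = tp / pred_boundary.sum() if pred_boundary.sum() else 0.0
    recall = tp / gt_boundary.sum() if gt_boundary.sum() else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def j_and_f(j: float, f: float) -> float:
    """Overall metric: average of region similarity and contour accuracy."""
    return (j + f) / 2
```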
5.3. Baselines
DoraemonGPT is compared against several open-sourced LLM-driven agents and state-of-the-art supervised models to demonstrate its performance.
-
LLM-driven Agents:
- ViperGPT (Surís et al., 2023): A method that
leverages code generation models to create subroutines from vision-and-language models through a provided API. It solves tasks bygenerating Python codethat is subsequently executed. The authors reimplemented it using officially released code and equipped it withDeAOT(Yang & Yang, 2022; Cheng et al., 2023) for fair comparison on object tracking and segmentation, asDoraemonGPTalso uses these. - VideoChat (Li et al., 2023b): An
end-to-end chat-centric video understanding systemthat integratesfoundation modelsandLLMsto build a chatbot.
- ViperGPT (Surís et al., 2023): A method that
-
Supervised VQA Models (on NExT-QA):
- HME (Fan et al., 2019)
- VQA-T (Yang et al., 2021a)
- ATP (Buch et al., 2022)
- VGT (Xiao et al., 2022)
- MIST (Gao et al., 2023b): reported as previous SOTA on NExT-QA.
-
Supervised Referring Video Object Segmentation Models (on Ref-YouTube-VOS):
-
CMSA(Ye et al., 2019) -
URVOS(Seo et al., 2020) -
VLT(Ding et al., 2021) -
ReferFormer(Wu et al., 2022a) -
SgMg(Miao et al., 2023) -
OnlineRefer (Wu et al., 2023b): reported as previous SOTA on Ref-YouTube-VOS.

The paper notes that other competitors were not included due to the lack of available code for video tasks.
-
5.4. Implementation Details
- Large Language Model (LLM):
GPT-3.5-turbo APIprovided by OpenAI is used as the coreLLMfor reasoning and planning. - Foundation Models for TSM Extraction:
- Captioning:
BLIPseries (BLIP(Li et al., 2022),BLIP-2(Li et al., 2023a),InstructBlip(Dai et al., 2023)). - Object Detection:
YOLOv8(Jocher et al., 2023). - Object Tracking:
Deep OC-Sort(Maggiolino et al., 2023). - Optical Character Recognition (OCR):
PaddleOCR(PaddlePaddle, 2023). - Action Recognition:
InternVideo(Wang et al., 2022). - Speech Recognition (ASR):
Whisper(Radford et al., 2023). - Referring Object Detection:
Grounding DINO(Liu et al., 2023c). - Tracking and Segmentation:
DeAOT(Yang & Yang, 2022; Cheng et al., 2023).
- Captioning:
- Learning Setting: All experiments are conducted under the
in-context learning (ICL)setting, meaning no fine-tuning of theLLMorfoundation modelson task-specific data. - External Knowledge: For fairness in quantitative and qualitative comparisons, external knowledge tools are not used in the main evaluation, but their capability is demonstrated in
in-the-wild examples. - Hyperparameters for MCTS Planner:
- Base reward (): Set to
1. - Decay rate (): Set to
0.5. - Number of solutions to explore (): For VQA experiments, is chosen for a good
accuracy-cost trade-off.
- Base reward (): Set to
6. Results & Analysis
6.1. Zero-shot Video Question Answering
6.1.1. NExT-QA Results
The authors compare DoraemonGPT against several top-leading supervised VQA models and LLM-driven systems on the NExT-QA dataset.
The following are the results from Table 2a of the original paper:
| Method | Pub. | AccC | AccT | AccD | Avg | AccA |
|---|---|---|---|---|---|---|
| HME (Fan et al., 2019) | CVPR19 | 46.2 | 48.2 | 58.3 | 50.9 | 48.7 |
| VQA-T (Yang et al., 2021a) | ICCV21 | 41.7 | 44.1 | 60.0 | 48.6 | 45.3 |
| ATP (Buch et al., 2022) | CVPR22 | 53.1 | 50.2 | 66.8 | 56.7 | 54.3 |
| VGT (Xiao et al., 2022) | ECCV22 | 52.3 | 55.1 | 64.1 | 57.2 | 55.0 |
| MIST (Gao et al., 2023b) | ICCV23 | 54.6 | 56.6 | 66.9 | 59.3 | 57.2 |
| †ViperGPT (Surís et al., 2023) | arXiv23 | 43.2 | – | 49.4 | – | 45.5 |
| VideoChat (Li et al., 2023b) | – | 50.2 | 47.0 | 65.7 | 52.5 | 51.8 |
| DoraemonGPT (Ours) | ICML24 | 54.7 | 50.4 | 70.3 | 58.5 | 55.7 |
Table 2a: NExT-QA (Xiao et al., 2021) results. †: reimplemented using officially released codes. ‡: ViperGPT equipped with DeAOT (Yang & Yang, 2022; Cheng et al., 2023).
Analysis:
- Overall Performance: DoraemonGPT achieves a total accuracy (AccA) of 55.7%, which is competitive with recently proposed supervised models. For example, it nearly matches VGT (55.0%) and is only slightly behind MIST (57.2%), which is the previous SOTA. This is remarkable as DoraemonGPT operates in a zero-shot setting, without specific training on NExT-QA.
- Descriptive Questions (AccD): DoraemonGPT shows a significant improvement in descriptive questions, achieving 70.3%. This outperforms the previous SOTA, MIST (66.9%). The authors attribute this to the Task-related Symbolic Memory (TSM) providing sufficient information for reasoning.
- Temporal Questions (AccT): Supervised models generally perform slightly better on temporal questions. For instance, MIST achieves 56.6% compared to DoraemonGPT's 50.4%. This suggests that while DoraemonGPT has time-dominant memory, supervised models might have learned more intricate underlying temporal patterns through explicit training.
- Comparison with LLM-driven Competitors: DoraemonGPT significantly outperforms ViperGPT across all comparable metrics: AccC (54.7% vs 43.2%), AccD (70.3% vs 49.4%), and AccA (55.7% vs 45.5%). This demonstrates the advantage of DoraemonGPT's TSM and MCTS planner over ViperGPT's code generation approach, especially for dynamic scenes. DoraemonGPT also surpasses VideoChat (AccC 54.7% vs 50.2%, AccT 50.4% vs 47.0%, AccD 70.3% vs 65.7%, AccA 55.7% vs 51.8%). The improvements are consistent across all question types (4.5%, 3.4%, 4.6%, and 3.9% respectively). This indicates the efficacy of DoraemonGPT's overall framework, particularly its MCTS planner guided by TSM.
6.1.2. TVQA+ Results
The evaluation on TVQA+ further confirms the method's superiority.
The following are the results from Figure 5 of the original paper:
Figure 5. Comparison on TVQA+ (Lei et al., 2020) (§3.2).
Analysis:
DoraemonGPT achieves the highest Top-1 accuracy of 40.3% on TVQA+.
- It outperforms ViperGPT by a substantial 10.2% (40.3% vs 30.1%).
- It also surpasses VideoChat by 5.9% (40.3% vs 34.4%). The paper states that ViperGPT's lower performance on TVQA+ (and NExT-QA) is because it is not specifically designed for dynamic videos, consistent with the findings on NExT-QA. This reinforces the conclusion that a dedicated approach to dynamic scene understanding is critical for video-based tasks.
6.2. Zero-shot Referring Object Segmentation
Ref-YouTube-VOS Results
DoraemonGPT's capability in referring object segmentation is evaluated on Ref-YouTube-VOS against state-of-the-art supervised models and LLM-driven agents.
The following are the results from Table 2b of the original paper:
| | Method | Pub. | J | F | J&F |
|---|---|---|---|---|---|
| Supervised | CMSA (Ye et al., 2019) | CVPR19 | 36.9 | 43.5 | 40.2 |
| | URVOS (Seo et al., 2020) | ECCV20 | 47.3 | 56.0 | 51.5 |
| | VLT (Ding et al., 2021) | ICCV21 | 58.9 | 64.3 | 61.6 |
| | ReferFormer (Wu et al., 2022a) | CVPR22 | 58.1 | 64.1 | 61.1 |
| | SgMg (Miao et al., 2023) | ICCV23 | 60.6 | 66.0 | 63.3 |
| | OnlineRefer (Wu et al., 2023b) | ICCV23 | 61.6 | 67.7 | 64.8 |
| LLM-driven agent | ‡ViperGPT (Surís et al., 2023) | ICCV23 | 24.7 | 28.5 | 26.6 |
| | DoraemonGPT (Ours) | ICML24 | 63.9 | 67.9 | 65.9 |
Table 2b: Ref-YouTube-VOS (Seo et al., 2020) results. ‡: ViperGPT equipped with DeAOT (Yang & Yang, 2022; Cheng et al., 2023).
Analysis:
- Superiority over Supervised Models: DoraemonGPT, operating in a zero-shot manner (without training on Ref-YouTube-VOS), achieves an impressive J&F score of 65.9%. This remarkably surpasses recent supervised models, including the previous SOTA, OnlineRefer (64.8%). This strong performance is attributed to DoraemonGPT's task-related symbolic memory, which effectively grounds video instances with textual descriptions.
- Comparison with LLM-driven Competitor: ViperGPT performs very poorly, achieving only a 26.6% J&F score. This stark difference highlights ViperGPT's lack of a well-designed video information memory, leading to failures in grounding the referred object or accurately tracking it in the video.
- Visual Evidence: Figure 4 (not provided in text, but described in context) visually demonstrates DoraemonGPT's higher accuracy in identifying, tracking, and segmenting referred objects, contrasting with ViperGPT's failure cases where recognized objects do not match semantic and descriptive aspects. The strong results on Ref-YouTube-VOS underscore the critical importance of building a symbolic video memory for accurate referring video object segmentation.
6.3. In-the-wild Example
The paper states that DoraemonGPT demonstrates a versatile skill set, including checking experimental operations, video understanding, and video editing. It adeptly handles complex questions by exploring multiple reasoning paths and leveraging external sources for comprehensive answers. Appendix A.2 provides more details and examples.
Figure 7 from the appendix visually illustrates some of these capabilities:
Figure 7. In-the-wild examples of DoraemonGPT.
Analysis:
The figure shows DoraemonGPT's ability to:
- Video Understanding and Editing: The top example implies a task where DoraemonGPT can understand the content of a video and perform editing actions, such as removing unwanted segments (e.g., "remove the portion between 'start' and 'end' from the video"). This showcases its capability to act as a video agent for practical tasks.
- Complex Question Answering with External Knowledge: The bottom example demonstrates DoraemonGPT's capacity to answer questions about experimental procedures, even those requiring scientific knowledge. The prompt "I see a person doing some experiments in the video. Please help me figure out the person's operations and the scientific principles behind this experiment" suggests DoraemonGPT would need to analyze the video's actions and then consult external knowledge tools to explain the scientific context. This highlights the integration of knowledge tools and the MCTS planner's ability to explore reasoning paths involving external data. The example shows it generates an SQL query to extract information about the experiment from the TSM (e.g., SELECT Category, Trajectory FROM Space_Memory WHERE Category = 'person').

These in-the-wild examples showcase DoraemonGPT's practical applicability beyond benchmark datasets, particularly its capacity for interactive planning and multi-source knowledge integration.
6.4. Diagnostic Experiment
To gain deeper insights, ablative experiments are conducted on NExT-QA.
6.4.1. Task-related Symbolic Memory (TSM)
The authors investigate the essential components of DoraemonGPT: space-dominant memory (SDM) and time-dominant memory (TDM).
The following are the results from Table 3a of the original paper:
| TDM | SDM | AccC | AccT | AccD | AccA |
|---|---|---|---|---|---|
| | ✓ | 63.3 | 26.7 | 53.3 | 47.8 |
| ✓ | | 53.3 | 23.3 | 46.7 | 41.1 |
| ✓ | ✓ | 96.7 | 46.7 | 53.3 | 65.7 |
Table 3a: Essential components of TSM on NExT-QA.
Analysis:
- The table shows that using SDM alone yields an AccA of 47.8%, while TDM alone gives 41.1%.
- When both TDM and SDM are combined, the AccA significantly jumps to 65.7%. This confirms the necessity of dynamically querying two types of symbolic memory.
- Specific Question Types:
  - TDM (Time-dominant Memory) is preferred for temporal questions (implied by better performance on AccT when combined with SDM, and its conceptual focus). While the individual AccT for TDM alone is lower (23.3%) than SDM alone (26.7%), the dramatic increase in AccT when both are combined (46.7%) suggests TDM plays a crucial role in improving temporal understanding when integrated.
  - SDM (Space-dominant Memory) provides relevant information for descriptive questions. Although SDM alone achieves 53.3% on AccD compared to TDM's 46.7%, the combined system also achieves 53.3% for AccD, indicating SDM is a strong driver for descriptive tasks.
- The results clearly support the design choice to decouple and integrate both types of memory for comprehensive video understanding.
6.4.2. Multiple Solutions by MCTS Planner
The influence of the number of answer candidates (N) explored by the MCTS planner is studied. N = 1 represents a greedy search.
The following are the results from Table 3c of the original paper:
| N | AccC | AccT | AccD | AccA |
|---|---|---|---|---|
| 1 | 63.3 | 20.0 | 46.7 | 43.3 |
| 2 | 80.0 | 43.3 | 46.7 | 56.7 |
| 3 | 86.7 | 43.3 | 53.3 | 61.1 |
| 4 | 96.7 | 46.7 | 53.3 | 65.7 |
Table 3c: Number of answer candidates (N) for MCTS planner on NExT-QA.
Analysis:
- Improved Performance with More Solutions: Increasing N from 1 to 4 leads to a consistent and significant improvement in AccA (from 43.3% to 65.7%). This strongly supports the hypothesis that a single answer is far from enough to handle the larger planning space for dynamic modalities.
- Efficacy of MCTS: This experiment proves the efficacy of the MCTS planner in exploring diverse solutions and converging towards better answers. The ability to explore multiple paths and then summarize/vote on them is crucial.
- Trade-off: The authors note that for single-choice questions (like in NExT-QA), exploring even more solutions might not yield positive returns and increases API call costs. This highlights a practical accuracy-cost trade-off.
6.4.3. Back-propagation in MCTS Planner
The effect of the base reward ($\alpha$) and decay rate ($\beta$) in the reward back-propagation mechanism is ablated.
The following are the results from Table 3d of the original paper:
| $\alpha$ | $\beta$ | AccC | AccT | AccD | AccA |
|---|---|---|---|---|---|
| 0.5 | 1.0 | 86.7 | 23.3 | 50.0 | 53.3 |
| 1.0 | 0.5 | 96.7 | 46.7 | 53.3 | 65.7 |
| 0.5 | 2.0 | 86.7 | 26.7 | 50.0 | 54.4 |
| 2.0 | 0.5 | 83.3 | 46.7 | 50.0 | 60.0 |
| 2.0 | 2.0 | 80.0 | 46.7 | 50.0 | 58.9 |
Table 3d: Base reward ($\alpha$) and decay rate ($\beta$) for the MCTS planner on NExT-QA.
Analysis:
- The AccA values range from 53.3% to 65.7% across different combinations of $\alpha$ and $\beta$.
- The highest AccA (65.7%) is achieved with $\alpha$ = 1.0 and $\beta$ = 0.5, which is chosen as the default setting.
- The results show that the performance is relatively stable regardless of the combination of $\alpha$ and $\beta$ used, implying robustness of the MCTS framework. Even sub-optimal parameter choices still yield reasonable performance, far superior to the greedy baseline (N = 1, 43.3%).
- The explanation notes that some special combinations of $\alpha$ and $\beta$ can transform MCTS into depth-first search (DFS) (e.g., by choosing particular values for both the failure and non-failure cases). This connection helps understand the behavior of the planner under extreme parameter settings.
6.4.4. Exploring Strategies used by Planner
The advantage of the MCTS planner is verified by comparing it with several standard exploring strategies under the same number of explored solutions N.
The following are the results from Table 3e of the original paper:
| Strategy | AccC | AccT | AccD | AccA |
|---|---|---|---|---|
| DFS | 66.7 | 36.7 | 50.0 | 51.1 |
| Root | 73.3 | 16.7 | 46.7 | 45.6 |
| Uniform | 67.7 | 26.7 | 50.0 | 47.8 |
| MCTS | 96.7 | 46.7 | 53.3 | 65.7 |
Table 3e: Exploring strategies for the MCTS planner on NExT-QA.
Analysis:
MCTS achieves the highest AccA of 65.7%, significantly outperforming all other naive strategies.
- Suboptimal Baselines: DFS (51.1%), Root (45.6%), and Uniform (47.8%) all show suboptimal performance. This is because they are unable to leverage the value/reward of the outcome leaf nodes and adjust their search strategy accordingly.
- MCTS Advantage: MCTS adaptively samples a node with the guidance of reward back-propagation, which is more effective in a large solution space. This result strongly validates the superiority of the proposed MCTS planner.
- DFS vs. Uniform on Temporal Questions: DFS (36.7%) notably outperforms Uniform (26.7%) on temporal questions, while performing comparably or worse on descriptive and causal questions. The authors hypothesize that temporal questions often contain cues (e.g., "at the beginning") that DFS can exploit to find specific periods in the video, whereas Uniform sampling lacks this targeted exploration.
6.4.5. Impact of Captioning Models
Experiments are conducted to assess the impact of different captioning models, which are crucial for DoraemonGPT's visual input perception.
The following are the results from Table 3b of the original paper:
| Models | AccC | AccT | AccD | AccA |
|---|---|---|---|---|
| BLIP-2 | 51.4 | 45.5 | 63.3 | 51.2 |
| InstructBLIP | 54.7 | 50.4 | 70.3 | 55.7 |
Table 3b: Captioning models for TSM on NExT-QA.
Analysis:
- Using InstructBLIP results in an AccA of 55.7%, which is superior to BLIP-2's 51.2%.
- This suggests that DoraemonGPT can benefit from the development of stronger foundation models, especially those with instruction tuning like InstructBLIP. Better captioning directly translates to richer and more accurate information stored in the TSM, which in turn improves the LLM's reasoning capabilities.
6.5. Evaluation on the Inference Time and Token Usage Efficiency
The paper analyzes the efficiency of DoraemonGPT compared to baselines.
The following are the results from Table 4 of the original paper:
| Method | Prompt tokens | Node tokens | Steps per Solution | Tokens per Answer | NExT-QA AccA |
|---|---|---|---|---|---|
| ViperGPT (Surís et al., 2023) | 4127 | - | - | 4127 | 45.5 |
| VideoChat (Li et al., 2023b) | 722 | - | - | 722 | 51.8 |
| DoraemonGPT | 617 | 34.6 | 2.3 | 1498 | 55.7 |
Table 4: Token Efficiency (Averaged on the NExT-QA (Xiao et al., 2021) s_val).
Analysis:
- Prompt Efficiency: DoraemonGPT's prompt design is more efficient, using 617 prompt tokens compared to VideoChat (722 tokens) and ViperGPT (4127 tokens). This is crucial because shorter prompts reduce API costs and improve LLM effectiveness by avoiding distraction from irrelevant context.
- Total Token Usage: DoraemonGPT uses 1498 tokens per answer (including node tokens and multiple reasoning steps). While this is higher than VideoChat (722 tokens), it is significantly lower than ViperGPT (4127 tokens). The higher count relative to VideoChat is likely due to the MCTS planner exploring multiple steps/solutions (a rough accounting is sketched after this list).
- Accuracy vs. Efficiency Trade-off: Despite a higher Tokens per Answer than VideoChat, DoraemonGPT significantly outperforms VideoChat in AccA (55.7% vs. 51.8%). This suggests that the increased token usage is a worthwhile trade-off for the substantial accuracy gains achieved through its more sophisticated planning and memory mechanisms.
- Inference Time: Video processing (memory building) scales linearly with video length; at 1 frame per second (fps), it takes about 1 second. VideoChat creates timestamp memory and takes around 1 minute to process a 1-minute video. ViperGPT's time is harder to compare fairly due to execution failures (a 6.7% failure rate on NExT-QA).

These results indicate that DoraemonGPT achieves superior performance with reasonable token efficiency and scalable video processing, validating its design for complex dynamic tasks.
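A rough back-of-envelope check (my own reading of Table 4, not an accounting given by the authors): if each of the roughly 2.3 reasoning steps re-sends the 617-token prompt plus one ~34.6-token node, the totals line up with the reported figure.

```python
# Hypothetical accounting consistent with Table 4 (not stated by the paper).
steps_per_solution = 2.3   # average reasoning steps per solution
prompt_tokens = 617        # DoraemonGPT's prompt size
node_tokens = 34.6         # average tokens per tool-call node

tokens_per_answer = steps_per_solution * (prompt_tokens + node_tokens)
print(round(tokens_per_answer))  # ~1499, close to the reported 1498 tokens per answer
```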
6.6. Typical Failure Cases
The authors provide insights into DoraemonGPT's typical failure cases on NExT-QA.
The following are the results from Figure 10 of the original paper:
Figure 10. Typical failure cases on NExT-QA (Xiao et al., 2021) (§A.6).
Analysis: The figure shows two examples of common failure patterns:
- Misinterpretation of Temporal Order: In the top example, for the question "What did the child do before interacting with the dinosaur toy?", DoraemonGPT incorrectly identifies the action "standing" as happening before playing with the toy, when the ground truth is "watching". This suggests challenges in precise temporal ordering and causal inference, where subtle distinctions between actions (standing vs. watching) can be missed or confused.
- Ambiguity in Referring Expressions: In the bottom example, for the question "What is the dog doing after passing by the sofa?", DoraemonGPT fails to accurately describe the dog's action, predicting "running away" when the actual action is likely more nuanced or context-dependent (the ground truth is not explicitly given but is implied to be different). This could stem from ambiguity in the referring expression or from limitations in action recognition when the scene is complex or the actions are subtle.

These failures highlight areas for future improvement, particularly in handling fine-grained temporal reasoning and resolving ambiguities in the action descriptions stored in the TSM.
7. Conclusion & Reflections
7.1. Conclusion Summary
DoraemonGPT is presented as a novel LLM-driven agent specifically designed for understanding dynamic video tasks, addressing a critical gap left by most existing LLM-driven visual agents that focus primarily on static images. Its core innovations include:
- Conceptually Elegant System: A modular design that integrates memory, tools, and a sophisticated planner for dynamic scenes.
- Compact Task-related Symbolic Memory (TSM): Decouples, extracts, and stores spatial-temporal attributes from videos into structured space-dominant and time-dominant memories, making information access efficient.
- Effective and Decomposed Memory Querying: Utilizes symbolic sub-task tools (e.g., "When," "Why," "What") for precise spatial-temporal reasoning (see the sketch after this list).
- Plug-and-Play Knowledge Tools: Allows integration of external knowledge sources (symbolic, textual, web) to augment the LLM's domain-specific expertise.
- Automated MCTS Planner: Employs a novel Monte Carlo Tree Search planner to explore large planning spaces, generate multiple potential solutions, and summarize them into an informative final answer by back-propagating rewards.
- Answer Diversity: The MCTS planner inherently supports finding diverse answers, which is crucial for open-ended questions.

Extensive experiments on three benchmarks (including NExT-QA and Ref-YouTube-VOS), along with in-the-wild scenarios, confirm DoraemonGPT's versatility and effectiveness, demonstrating superior performance over competitors, especially in descriptive video question answering and referring video object segmentation.
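To ground the "decomposed memory querying" point referenced above, here is a minimal sketch of a "When"-style sub-task tool querying a time-dominant memory table via SQL. The schema, table name, and tool interface are my assumptions for illustration, not the authors' actual design.

```python
import sqlite3

# Illustrative time-dominant memory: one row per sampled video segment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE time_dominant (segment_id INTEGER, start_sec REAL, end_sec REAL, action TEXT)"
)
conn.executemany(
    "INSERT INTO time_dominant VALUES (?, ?, ?, ?)",
    [
        (0, 0.0, 4.0, "child watches the dinosaur toy"),
        (1, 4.0, 9.0, "child plays with the dinosaur toy"),
    ],
)

def when_tool(action_keyword: str):
    """A 'When'-style sub-task tool: return the time spans whose action matches the keyword."""
    return conn.execute(
        "SELECT start_sec, end_sec FROM time_dominant WHERE action LIKE ?",
        (f"%{action_keyword}%",),
    ).fetchall()

print(when_tool("plays"))  # [(4.0, 9.0)] -> a concise intermediate result for the planner
```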
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Memory Refinement: While the TSM is effective, there is no generalized division of task types for memory construction. Future work could introduce more nuanced categories or automatically learn optimal memory structures.
- Foundation Model Dependence: DoraemonGPT's performance is inherently tied to the capabilities of the underlying foundation models used for extraction (e.g., YOLOv8, BLIP, Whisper). Limitations in these models (e.g., category limitations in object detection, segmentation accuracy, speech recognition in complex scenarios, handling occluded text) directly impact DoraemonGPT's performance. Future improvements would come from advancements in these foundation models.
- Computational Cost: Reliance on available online LLM services (e.g., OpenAI) limits its use in real-time, resource-constrained scenarios due to inference time and token costs. Optimizing these aspects is a clear future direction.
- MCTS Planner Scope: While effective, the MCTS planner currently guides tool usage; extending its capabilities to subdivide task types more adaptively for memory construction could enhance its overall planning ability.
7.3. Personal Insights & Critique
DoraemonGPT presents a significant step forward in enabling LLM-driven agents to tackle the complexities of dynamic scene understanding.
Inspirations and Applications:
- Modular and Extensible Design: The modular architecture (Memory, Tools, Planner) is highly inspiring. It allows for plug-and-play upgrades of foundation models or external knowledge sources without overhauling the entire system. This makes DoraemonGPT a flexible framework that can evolve with AI advancements.
- Bridging the Static-Dynamic Gap: The explicit focus on spatial-temporal reasoning via the TSM and the MCTS planner is a crucial advancement. This approach could be transferred to other domains requiring dynamic environment interaction, such as robotics, autonomous driving, or interactive simulations, where understanding sequences of events and their causal relationships is paramount.
- Enhanced Planning with MCTS: The adaptation of MCTS for LLM planning is a powerful idea. It moves beyond simple greedy execution and allows for deliberate exploration of solutions, addressing the fragility of single-path LLM reasoning. This concept could be applied to any complex problem where LLMs need to explore a vast solution space, such as code generation with debugging, scientific discovery, or strategic game playing.
- Transparency and Debuggability: The ReAct-style nodes and explicit SQL-based memory offer a degree of transparency in the LLM's reasoning process. One can inspect the intermediate thoughts, actions, and observations, which is vital for debugging and building trust in AI systems (a minimal illustration follows this list).
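As a concrete (and purely illustrative) picture of what such transparency looks like, the snippet below renders a ReAct-style thought/action/observation trace for one planning node; the field names and tool-call strings are hypothetical, not DoraemonGPT's actual data structures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReActStep:
    thought: str      # the LLM's intermediate reasoning
    action: str       # the tool call it decided to make
    observation: str  # what the tool returned

def render_trace(steps: List[ReActStep]) -> str:
    """Render a node's trace so intermediate reasoning can be inspected and debugged."""
    lines = []
    for i, s in enumerate(steps, 1):
        lines.append(
            f"Step {i}\n  Thought:     {s.thought}\n  Action:      {s.action}\n  Observation: {s.observation}"
        )
    return "\n".join(lines)

trace = [
    ReActStep(
        thought="The question asks what happens after the dog passes the sofa; query the time-dominant memory.",
        action='When(query="dog passes the sofa")',
        observation="segment 12.0s-15.5s",
    ),
    ReActStep(
        thought="Now describe the dog's action in the following segment.",
        action='What(target="dog", start=15.5)',
        observation="the dog runs toward the door",
    ),
]
print(render_trace(trace))
```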
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Generality of the TSM Schema: While powerful, the current TSM schema (space-dominant, time-dominant) and its associated attributes are manually designed. As the complexity of video tasks grows, dynamically generating or adapting the TSM schema itself (perhaps with another meta-LLM) could be an area for improvement.
- Robustness of Foundation Models: The paper correctly identifies the dependency on foundation models. If these models produce erroneous extractions (e.g., mis-classified objects, inaccurate captions, wrong actions), DoraemonGPT's TSM will be flawed, leading to incorrect reasoning. The system's reliance on their zero-shot performance for extraction means it inherits their biases and limitations.
- MCTS Reward Signal: The current reward back-propagation relies on a binary failure/non-failure signal. For more nuanced open-ended tasks, a richer, possibly learned reward function (e.g., via reinforcement learning from human feedback (RLHF) or learned critics) could guide the MCTS more effectively toward truly optimal solutions rather than merely feasible ones.
- Scalability for Very Long Videos: While memory extraction scales linearly with video length, the computational cost of processing, storing, and querying the TSM for extremely long (hours-long) videos might still be substantial. Techniques for summarizing or hierarchically structuring the TSM for long-duration content could be explored.
- Interactive Refinement: The MCTS planner generates solutions and then summarizes them. A more interactive, human-in-the-loop process, in which users guide the exploration or give feedback on intermediate steps, could help with very ambiguous or complex tasks.

Overall, DoraemonGPT provides a robust and well-thought-out framework for video agents, pushing the boundaries of what LLM-driven systems can achieve in dynamic environments. Its methodical approach to memory construction and planning offers valuable lessons for future multi-modal AI research.