Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
TL;DR Summary
This work introduces a novel framework that consolidates scene graphs from short video captions to generate fine-grained long video captions efficiently without fine-tuning, improving zero-shot performance and reducing computational cost.
Abstract
Recent advances in vision-language models have led to impressive progress in caption generation for images and short video clips. However, these models remain constrained by their limited temporal receptive fields, making it difficult to produce coherent and comprehensive captions for long videos. While several methods have been proposed to aggregate information across video segments, they often rely on supervised fine-tuning or incur significant computational overhead. To address these challenges, we introduce a novel framework for long video captioning based on graph consolidation. Our approach first generates segment-level captions, corresponding to individual frames or short video intervals, using off-the-shelf visual captioning models. These captions are then parsed into individual scene graphs, which are subsequently consolidated into a unified graph representation that preserves both holistic context and fine-grained details throughout the video. A lightweight graph-to-text decoder then produces the final video-level caption. This framework effectively extends the temporal understanding capabilities of existing models without requiring any additional fine-tuning on long video datasets. Experimental results show that our method significantly outperforms existing LLM-based consolidation approaches, achieving strong zero-shot performance while substantially reducing computational costs.
In-depth Reading
1. Bibliographic Information
- Title: Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
- Authors: Sanghyeok Chu, Seonguk Seo, Bohyung Han. The authors are affiliated with Seoul National University, a leading institution for AI research in South Korea.
- Journal/Conference: The paper is available as a preprint on arXiv. Preprints are research articles shared publicly before or during the peer-review process; no peer-reviewed venue is indicated. The arXiv identifier (2502.16427) corresponds to a February 2025 submission.
- Publication Year: 2025 (as indicated by the arXiv identifier).
- Abstract: The paper addresses the challenge of generating coherent and detailed captions for long videos, a task where existing vision-language models (VLMs) fall short due to their limited temporal receptive fields. The authors propose a novel, computationally efficient framework that does not require supervised fine-tuning. Their method first uses an off-the-shelf VLM to generate captions for short video segments. These captions are then converted into structured scene graphs. The core innovation is a graph consolidation algorithm that merges these individual graphs into a single, unified graph representing the entire video. Finally, a lightweight graph-to-text decoder generates the final, comprehensive caption. The authors report that their method significantly outperforms existing Large Language Model (LLM)-based consolidation techniques in zero-shot settings while being much more computationally efficient.
- Original Source Link:
- arXiv Page: https://arxiv.org/abs/2502.16427
- PDF Link: https://arxiv.org/pdf/2502.16427v2.pdf
- Status: Preprint.
2. Executive Summary
- Background & Motivation (Why): While modern AI can generate impressive descriptions for images and short videos, its ability to understand and summarize long videos (e.g., several minutes) is still limited. The primary reason is that most models can only process a small "window" of video at a time. Existing solutions to this problem are not ideal:
- Supervised Fine-tuning: Training models on large, annotated long-video datasets is expensive, and such datasets are scarce. This approach also limits the model's ability to generalize to new types of videos.
- LLM-based Summarization: Using powerful LLMs to summarize captions from video segments is a flexible zero-shot approach, but it is computationally very expensive and can sometimes "hallucinate" or miss fine-grained details.
- This paper aims to bridge this gap by creating a method that is both zero-shot (no fine-tuning on target datasets) and computationally efficient.
- Main Contributions / Findings (What):
- A Novel Framework for Long Video Captioning: The paper introduces a new pipeline, named SGVC (Scene Graph Video Captioning), which uses scene graphs as a structured intermediate representation to aggregate information over time.
- Graph Consolidation Algorithm: The core technical innovation is an algorithm that intelligently merges multiple scene graphs from different video segments into one cohesive graph, preserving key objects, their attributes, and relationships across the entire video.
- Strong Zero-Shot Performance with High Efficiency: The proposed method is shown to outperform state-of-the-art LLM-based approaches on standard video captioning benchmarks, all while requiring significantly fewer computational resources (GPU memory and inference time).
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Vision-Language Models (VLMs): These are AI models trained to understand connections between visual data (images, videos) and text. They can perform tasks like generating a caption for an image (image captioning) or answering questions about a video (visual question answering). Examples include BLIP and Flamingo.
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data, making them exceptionally good at understanding, generating, and reasoning with human language. Examples include GPT-4 and Mistral.
- Scene Graph: A scene graph is a structured data format that represents the contents of an image or scene. It consists of nodes (representing objects, e.g., "man," "ball") and edges (representing the relationships between them, e.g., "kicking"). Nodes can also have attributes (e.g., "man" has attribute "wearing red shirt"). This provides a more detailed and organized summary than a simple text sentence.
- Zero-Shot Learning: A challenging scenario in machine learning where a model is expected to perform a task without having seen any examples of that specific task during its training. For example, captioning a long video without ever being trained on a dataset of long videos and their corresponding captions.
- Previous Works:
- Supervised Video Captioning: Early and powerful methods were trained on large datasets of video-caption pairs. While effective, they are data-hungry and struggle to generalize to video types not seen during training.
- Zero-Shot Video Captioning: To overcome the need for paired data, some methods use pre-trained models like CLIP to guide a language model to generate relevant text at inference time (ZeroCap, MAGIC) or train decoders on text-only data (DeCap). However, these methods often produce generic captions and struggle with complex, long videos.
- Zero-Shot Long Video Captioning: This is the most relevant area.
- Recursive/Memory-based Methods: These approaches process a video segment by segment, maintaining a "memory" of what has been seen. However, they typically require supervised fine-tuning on the target dataset, which violates the zero-shot principle.
- LLM-based Consolidation: Methods like VidIL and Video ChatCaptioner feed text descriptions from multiple video segments into an LLM and ask it to write a summary. VidIL creates complex prompts with objects, events, and frame captions. Video ChatCaptioner uses an interactive chat format where an LLM queries a VLM about different frames. While powerful, these methods are slow and computationally demanding.
- Differentiation: The proposed SGVC method distinguishes itself by:
- Using a Structured Intermediate Representation: Instead of feeding a long string of unstructured text captions to an LLM, SGVC converts captions into scene graphs. This structured format helps to explicitly track objects and their relationships over time, reducing information loss and hallucination.
- Employing a Lightweight Decoder: Rather than relying on a massive, general-purpose LLM (with billions of parameters) for the final text generation, SGVC uses a much smaller, specialized graph-to-text model. This is the key to its computational efficiency.
4. Methodology (Core Technology & Implementation)
The SGVC framework is a four-stage pipeline, elegantly illustrated in Figure 1 of the paper.

1. Generating Segment-Level Captions:
- The long input video is first divided into multiple temporal segments (e.g., by uniformly sampling frames or short clips).
- An off-the-shelf VLM (the paper experiments with BLIP, BLIP2, and InternVL2.5) is used to generate a text caption for each segment. This step leverages the power of existing pre-trained models without needing to retrain them.
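For concreteness, here is a minimal sketch of this segment-captioning step using the Hugging Face transformers BLIP captioning model. The checkpoint name, frame-sampling scheme, and generation settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed setup): uniformly sample one frame per segment from a
# long video and caption each frame with an off-the-shelf BLIP model.
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

CKPT = "Salesforce/blip-image-captioning-base"  # illustrative checkpoint choice
processor = BlipProcessor.from_pretrained(CKPT)
model = BlipForConditionalGeneration.from_pretrained(CKPT)

def sample_frames(video_path: str, num_segments: int = 8):
    """Uniformly sample one representative frame per temporal segment."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_segments):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int((i + 0.5) * total / num_segments))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

def caption_segments(video_path: str):
    """Return one caption string per sampled segment."""
    captions = []
    for frame in sample_frames(video_path):
        inputs = processor(images=frame, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(out[0], skip_special_tokens=True))
    return captions
```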
2. Parsing Captions into Scene Graphs:
- Each text caption is then processed by a textual scene graph parser. The paper uses the FACTUAL-MR parser.
- This parser converts a sentence like "An elderly woman is cooking in a kitchen" into a structured graph with nodes for "woman" (attribute: "elderly") and "kitchen," connected by an edge labeled "is cooking in."
- Formally, a scene graph is $G = (O, R)$, where $O$ is the set of objects and $R$ is the set of relationships (edges). An object $o_i \in O$ has a class $c_i$ and attributes $A_i$.
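As a rough illustration, the parser's output can be held in a small in-memory structure like the one below; the field names are a hypothetical representation for this write-up, not the FACTUAL-MR parser's actual output format.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str                                      # object class, e.g. "woman"
    attributes: set = field(default_factory=set)   # e.g. {"elderly"}
    merge_count: int = 0                           # times this node has been merged (used later)

@dataclass
class SceneGraph:
    objects: list    # list of SceneObject
    relations: list  # (subject_index, predicate, object_index) triples

# "An elderly woman is cooking in a kitchen" could be represented as:
graph = SceneGraph(
    objects=[SceneObject("woman", {"elderly"}), SceneObject("kitchen")],
    relations=[(0, "is cooking in", 1)],
)
```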
3. Scene Graph Consolidation: This is the core contribution of the paper. The goal is to merge all the individual segment-level graphs into a single, comprehensive video-level graph. The process is detailed in Algorithm 1.
- Merging Two Scene Graphs:
- Given two graphs, $G_s$ and $G_t$, the first step is to find which objects in $G_s$ correspond to which objects in $G_t$.
- This is framed as an optimal matching problem solved using the Hungarian algorithm. The matching score between objects is their cosine similarity in an embedding space produced by a graph encoder $E$.
- The formula for finding the optimal permutation that maximizes the sum of similarities is:
  $$\pi^* = \arg\max_{\pi \in \Pi} \sum_{i} \cos\!\big(h_i(E(G_s)),\, h_i(E(\pi(G_t)))\big)$$
- Symbol Explanation:
  - $\pi^*$: The optimal matching (permutation) of objects.
  - $\Pi$: The set of all possible permutations.
  - $E(\cdot)$: A graph encoder that converts a graph into embeddings.
  - $h_i(\cdot)$: A function that extracts the embedding of the $i$-th object from the encoded graph.
  - $G_s$, $G_t$: The source and target scene graphs being merged.
  - $\pi(G_t)$: The target graph with its objects reordered according to permutation $\pi$.
- The formula essentially finds the object mapping that results in the highest total cosine similarity between corresponding object embeddings.
- A match is considered valid only if its similarity score is above a threshold $\tau$.
- For each valid pair of matched objects, they are merged into a single new object. The attributes of the merged object are the union of the attributes from the original two objects (e.g., if "woman" in frame 1 and "woman with glasses" in frame 2 are matched, the merged object is "woman with glasses").
- Iterative Consolidation: The merging process is repeated. In each step, the two most similar graphs from the current set of graphs are merged. This continues until only one unified graph remains (a minimal code sketch of the pairwise merge step appears after this list).
- Prioritized Subgraph Extraction: For generating more concise captions, the authors propose an optional step. They track how many times each object node has been part of a merge. Nodes that appear and are merged frequently are considered more important. By selecting the top-k most frequently merged nodes and their connected subgraphs, they can create a more focused scene graph that highlights the key entities in the video.
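The pairwise merge at the heart of the consolidation step can be sketched as follows. The Hungarian matching is done with scipy over cosine similarities of object embeddings; the `embed` function stands in for the paper's graph encoder, the threshold `tau` for τ, and the data structures reuse the hypothetical SceneGraph sketch shown earlier. This is an assumed reading of Algorithm 1, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_graphs(g_src, g_tgt, embed, tau=0.9):
    """Merge g_tgt into g_src (sketch).

    `embed(graph)` is assumed to return one embedding vector per object
    (a stand-in for the paper's graph encoder); `tau` is the similarity
    threshold for accepting a match. Uses the SceneObject / SceneGraph
    dataclasses from the earlier sketch.
    """
    E_s, E_t = embed(g_src), embed(g_tgt)              # shapes (n_s, d), (n_t, d)
    E_s = E_s / np.linalg.norm(E_s, axis=1, keepdims=True)
    E_t = E_t / np.linalg.norm(E_t, axis=1, keepdims=True)
    sim = E_s @ E_t.T                                  # pairwise cosine similarities

    rows, cols = linear_sum_assignment(-sim)           # Hungarian algorithm (maximize similarity)
    tgt_to_src = {}
    for i, j in zip(rows, cols):
        if sim[i, j] >= tau:                           # accept only confident matches
            g_src.objects[i].attributes |= g_tgt.objects[j].attributes  # union of attributes
            g_src.objects[i].merge_count += 1          # frequency used for top-k subgraph extraction
            tgt_to_src[j] = i

    # Unmatched target objects are carried over as new nodes.
    for j, obj in enumerate(g_tgt.objects):
        if j not in tgt_to_src:
            tgt_to_src[j] = len(g_src.objects)
            g_src.objects.append(obj)

    # Re-index the target graph's relations into the merged graph.
    for s, pred, o in g_tgt.relations:
        g_src.relations.append((tgt_to_src[s], pred, tgt_to_src[o]))
    return g_src
```

Repeating this pairwise merge on the two most similar graphs in the current set, until a single graph remains, gives the iterative consolidation described above; the `merge_count` field supports the optional top-k subgraph extraction.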
4. Video Caption Generation:
- Graph-to-Text Model:
- The final consolidated scene graph is fed into a graph-to-text model to generate the final human-readable caption.
- This model uses a transformer architecture, with a graph encoder and a text decoder. The authors use BERT-base for the encoder and the decoder from T5-base, resulting in a relatively lightweight model (235M parameters).
- A key detail is that the encoder's attention mechanism is masked to respect the graph's structure. This means a node can only directly "attend" to its neighbors in the graph, enforcing the relational structure (a minimal sketch of such a mask appears at the end of this section).
- Training:
- The model is trained on a large dataset of ~2.5 million graph-text pairs, created by parsing captions from existing public datasets (like MS-COCO, Visual Genome) and model-generated captions for videos from Kinetics-400.
- The training objective is a standard language modeling loss, where the model learns to predict the next word of a caption given the input graph and the previously generated words:
  $$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\big(y_t \mid y_{<t}, G\big)$$
- Symbol Explanation:
  - $\mathcal{L}(\theta)$: The training loss for model parameters $\theta$.
  - $y_t$: The $t$-th token (word) in the ground-truth caption; $y_{<t}$ denotes the preceding tokens.
  - $T$: The total number of tokens in the caption.
  - $G$: The input scene graph.
- Crucially, this training uses no data from the target test datasets, making the final evaluation a true zero-shot test.
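The graph-structured attention constraint mentioned for the encoder can be expressed as an additive mask over the attention logits. The sketch below assumes the simplest variant, in which each node attends only to itself and its direct graph neighbors; the paper's exact masking scheme (e.g., how attribute and relation tokens are handled) may differ.

```python
import torch

def build_graph_attention_mask(num_nodes: int, edges: list) -> torch.Tensor:
    """Additive attention mask: 0 where attention is allowed, -inf elsewhere.

    `edges` is a list of (i, j) node-index pairs; each node may attend to
    itself and to its graph neighbors only.
    """
    allowed = torch.eye(num_nodes, dtype=torch.bool)
    for i, j in edges:
        allowed[i, j] = True
        allowed[j, i] = True
    mask = torch.zeros(num_nodes, num_nodes)
    mask[~allowed] = float("-inf")
    return mask  # added to the attention logits before the softmax

# Example: 3 nodes with a single edge (0, 1); node 2 attends only to itself.
mask = build_graph_attention_mask(3, [(0, 1)])
```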
5. Experimental Setup
- Datasets:
- Video Captioning:
- MSR-VTT: Contains ~10k YouTube video clips, each about 10-30 seconds long, with 20 human-written captions per video.
- MSVD: Contains ~2k video clips of similar length, also with multiple captions each.
- Video Paragraph Captioning:
- ActivityNet Captions: A more challenging dataset with longer videos (averaging ~2 minutes). The task is to generate a detailed paragraph describing the sequence of events. The evaluation is on the ae-val set.
- Evaluation Metrics:
- BLEU-4 (B@4):
- Conceptual Definition: Measures the precision of n-grams (sequences of n words) in the generated caption compared to a set of reference captions. A higher score means more overlapping phrases. BLEU-4 specifically looks at 1, 2, 3, and 4-word phrases.
- Formula:
  $$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
- Symbol Explanation: $BP$ is a brevity penalty to penalize captions that are too short. $p_n$ is the modified n-gram precision. $w_n$ are weights, typically uniform ($w_n = 1/N$). $N$ is usually 4.
- METEOR:
- Conceptual Definition: An improvement over BLEU that considers synonyms and stemming (matching "run" with "running"). It computes a score based on an alignment between the generated and reference captions. It balances precision and recall.
- Formula:
  $$\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty}), \qquad F_{\text{mean}} = \frac{10\,P\,R}{R + 9P}$$
- Symbol Explanation: $P$ is precision and $R$ is recall of unigram matches. The Penalty term is based on the "chunkiness" of the matches, penalizing fragmented matches.
- CIDEr:
- Conceptual Definition: Measures how similar a generated caption is to the consensus of a set of human-written captions. It gives more weight to n-grams that are important and descriptive (using tf-idf weighting). It is considered to correlate well with human judgment.
- Formula:
  $$\text{CIDEr}_n(c, S) = \frac{1}{M} \sum_{j=1}^{M} \frac{g^n(c) \cdot g^n(s_j)}{\lVert g^n(c) \rVert\, \lVert g^n(s_j) \rVert}$$
- Symbol Explanation: For a candidate caption $c$ and reference set $S = \{s_1, \dots, s_M\}$ with $M$ captions, it computes the average cosine similarity between their tf-idf weighted n-gram vectors $g^n(\cdot)$.
- BERTScore:
- Conceptual Definition: Measures semantic similarity instead of exact word matches. It uses contextual embeddings from a BERT model to compute the cosine similarity between tokens in the generated and reference captions. It provides scores for Precision ($P_{\text{BERT}}$), Recall ($R_{\text{BERT}}$), and F1-score ($F_{\text{BERT}}$).
- Formulas:
  $$P_{\text{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j, \qquad R_{\text{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j, \qquad F_{\text{BERT}} = \frac{2\,P_{\text{BERT}}\,R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$
- Symbol Explanation: $x$ and $\hat{x}$ are the sets of token embeddings from the reference and generated captions, respectively. $P_{\text{BERT}}$ finds, for each token in the generated text, its best match in the reference text, while $R_{\text{BERT}}$ does the reverse (a small sketch of this greedy matching appears after this metrics list).
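As a small numerical illustration of the BERTScore formulas above, the greedy token matching can be computed directly from L2-normalized token embeddings; note that the official BERTScore additionally applies idf weighting and baseline rescaling, which this sketch omits.

```python
import numpy as np

def bert_score(ref_emb: np.ndarray, gen_emb: np.ndarray):
    """Compute (P, R, F1) from L2-normalized token embeddings.

    ref_emb: (n_ref, d) reference-caption token embeddings
    gen_emb: (n_gen, d) generated-caption token embeddings
    """
    sim = gen_emb @ ref_emb.T            # pairwise cosine similarities
    p = sim.max(axis=1).mean()           # each generated token -> best reference match
    r = sim.max(axis=0).mean()           # each reference token -> best generated match
    f = 2 * p * r / (p + r)
    return p, r, f
```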
- Baselines:
- LLM Summarization: The most direct comparison. The same segment-level captions used by SGVC are fed directly to an LLM (Mistral-7B or GPT-4o mini) with a prompt asking it to summarize them into a single video caption.
- LLM-based Video Understanding Methods: VidIL and Video ChatCaptioner, which represent the state of the art in zero-shot long video understanding using LLMs.
6. Results & Analysis
- Core Results: The results consistently show that SGVC is superior to the baselines.
Zero-Shot Video Captioning (MSR-VTT & MSVD)
- Table 1: SGVC vs. Other LLM Methods: This table is a transcription of the data from the paper.

| Dataset | Method | Backbone VLM | B@4 | METEOR | CIDEr | P_BERT | R_BERT | F_BERT |
|---|---|---|---|---|---|---|---|---|
| MSR-VTT | VidIL (Wang et al., 2022b) | BLIP+CLIP | 3.2 | 14.8 | 3.1 | 0.134 | 0.354 | 0.225 |
| MSR-VTT | VidIL† (Wang et al., 2022b) | BLIP+CLIP | 13.6 | 20.0 | 20.2 | 0.461 | 0.552 | 0.490 |
| MSR-VTT | Video ChatCaptioner (Chen et al., 2023) | BLIP2 | 13.2 | 22.0 | 16.5 | 0.396 | 0.510 | 0.436 |
| MSR-VTT | SGVC (Ours) | BLIP / BLIP2 | 17.7 / 18.4 | 22.5 / 23.1 | 24.0 / 26.1 | 0.476 / 0.467 | 0.539 / 0.542 | 0.490 / 0.487 |
| MSVD | VidIL (Wang et al., 2022b) | BLIP+CLIP | 2.5 | 16.5 | 2.3 | 0.124 | 0.404 | 0.238 |
| MSVD | VidIL† (Wang et al., 2022b) | BLIP+CLIP | 30.7 | 32.0 | 60.3 | 0.656 | 0.726 | 0.674 |
| MSVD | Video ChatCaptioner (Chen et al., 2023) | BLIP2 | 22.7 | 31.8 | 35.8 | 0.496 | 0.651 | 0.550 |
| MSVD | SGVC (Ours) | BLIP / BLIP2 | 22.6 / 25.3 | 30.2 / 32.0 | 50.2 / 53.3 | 0.575 / 0.571 | 0.646 / 0.669 | 0.589 / 0.597 |

(Note: The † indicates a few-shot setting for VidIL, which uses examples from the training set, making it not truly zero-shot. SGVC outperforms even this stronger baseline on MSR-VTT.)
Analysis: SGVC consistently outperforms VidIL (in zero-shot) and Video ChatCaptioner across almost all metrics, especially the CIDEr score, which correlates well with human judgment. This indicates SGVC's captions are more semantically relevant and human-like.
- Table 2: SGVC vs. LLM Summarization: This table is a transcription of the data from the paper.

| Dataset | Method | Backbone VLM | B@4 | METEOR | CIDEr | P_BERT | R_BERT | F_BERT |
|---|---|---|---|---|---|---|---|---|
| MSR-VTT | Summarization w/ Mistral-7B | BLIP / BLIP2 | 9.6 / 11.5 | 21.6 / 23.1 | 10.8 / 15.4 | 0.313 / 0.308 | 0.516 / 0.528 | 0.395 / 0.397 |
| MSR-VTT | SGVC (Ours) | BLIP / BLIP2 | 17.7 / 18.4 | 22.5 / 23.1 | 24.0 / 26.1 | 0.476 / 0.467 | 0.539 / 0.542 | 0.490 / 0.487 |
| MSVD | Summarization w/ Mistral-7B | BLIP / BLIP2 | 15.2 / 22.5 | 28.3 / 31.9 | 30.3 / 41.6 | 0.477 / 0.500 | 0.623 / 0.664 | 0.527 / 0.558 |
| MSVD | SGVC (Ours) | BLIP / BLIP2 | 22.6 / 25.3 | 30.2 / 32.0 | 50.2 / 53.3 | 0.575 / 0.571 | 0.646 / 0.669 | 0.589 / 0.597 |

Analysis: This is a crucial comparison because both methods start with the exact same segment-level captions. SGVC's large lead (e.g., CIDEr of 26.1 vs. 15.4 on MSR-VTT) demonstrates that consolidating information via structured scene graphs is far more effective than simply asking an LLM to summarize unstructured text.
Zero-Shot Video Paragraph Captioning (ActivityNet)
- Tables 3 and 4 show similar trends on the more challenging ActivityNet Captions dataset. SGVC again outperforms all baselines, including summarization with the powerful GPT-4o mini. The use of a video-centric backbone VLM (InternVL2.5) further boosts SGVC's performance, demonstrating its modularity.
- Efficiency Analysis (Table 5): This table is a transcription of the data from the paper.

| Method | VLM Backbone | Params. (B) | GPU (GB) | Time (s) | CIDEr | Using reference | Using GPT API |
|---|---|---|---|---|---|---|---|
| VidIL† | BLIP+CLIP | 0.67 | 3.57 | 1.32 | 20.2 | ✓ | ✓ |
| Video ChatCaptioner | BLIP2 | 3.75 | 14.53 | 3.65 | 16.5 | - | ✓ |
| Summarization w/ Mistral-7B | BLIP | 7.50 | 14.50 | 1.27 | 10.8 | - | - |
| Summarization w/ Mistral-7B | BLIP2 | 11.00 | 28.20 | 1.51 | 15.4 | - | - |
| SGVC (Ours) | BLIP | 0.74 | 5.07 | 1.14 | 24.0 | - | - |
| SGVC (Ours) | BLIP2 | 4.24 | 18.40 | 1.37 | 26.1 | - | - |

Analysis: SGVC (Ours) with a BLIP backbone is remarkably efficient. It uses only 0.74B parameters and 5.07 GB of GPU memory, making it far lighter than LLM summarization, yet it achieves the second-highest CIDEr score. Even when paired with the larger BLIP2 backbone for the best performance, it is still more efficient than the Mistral-7B summarization approach with the same backbone. This confirms the paper's claim of achieving high performance at a substantially lower computational cost.
- Ablations / Parameter Sensitivity:
- Impact of k (Table 6): Analyzing the prioritized subgraph extraction parameter k shows a trade-off. A smaller k (focusing on the most central object) yields higher precision (CIDEr, P_BERT), while a larger k (including more context) yields higher recall (METEOR, R_BERT). This is an intuitive and useful finding for tuning the output's level of detail.
- Impact of τ (Table 7): The model's performance is stable for the object matching similarity threshold τ in the range [0.80, 0.95], indicating that the method is not overly sensitive to this hyperparameter.
-
Qualitative Results: Figures 2 and 3 in the paper show visual examples.
(Figure: a sequence of video-frame screenshots showing a handheld toy model from different viewpoints and actions, illustrating segment-level scene changes.) In the example of the man opening a toy egg (top left), the ground truth is "A man opening a toy egg set."
-
- LLM summ. mentions a "toy box" and "toy airplane."
- Video ChatCaptioner hallucinates a "toy car" and "cup of water."
- Ours (SGVC) correctly identifies the "toy airplane" and "box," producing a coherent and accurate description: "A hand holding a toy airplane in front of a box with a surprised expression."
In another example of runners on a track (top right), SGVC's caption "A group of runners crouching down a line on a track competing in a race" is more detailed and accurate than the competitors' outputs. These examples visually confirm that the structured consolidation of SGVC helps it stay grounded in the video's content and avoid the factual errors that plague other methods.
-
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces SGVC, a novel and effective framework for generating fine-grained captions for long videos. By converting segment-level captions into scene graphs and consolidating them into a unified representation, the method aggregates temporal information in a structured and robust way. This approach is not only more accurate than existing LLM-based summarization techniques but also significantly more computationally efficient, all while operating in a zero-shot setting.
- Limitations & Future Work:
- Error Propagation: The pipeline is sequential. An error made by the initial VLM captioner or the scene graph parser will be passed down and could negatively impact the final output. The quality of the entire system is capped by the quality of its weakest component.
- Parser Dependency: The method relies heavily on a textual scene graph parser. The expressiveness and accuracy of the consolidated graph are limited by what the parser can extract from text.
- Future Work (from paper): The authors suggest that the CPU-based graph merging algorithm could be accelerated with a GPU implementation for even faster inference.
- Personal Insights & Critique:
- Novelty and Significance: The core idea of using scene graphs as a "structured memory" for long-term video understanding is highly innovative. It provides a compelling alternative to simply concatenating text or using black-box memory mechanisms. This work highlights a promising direction for building more reliable and interpretable long-form video models.
- Strengths: The framework's modularity is a major advantage. One can easily swap in a better VLM captioner or a more advanced scene graph parser as they become available, continuously improving the system's performance. Its efficiency and zero-shot nature make it highly practical for real-world applications where fine-tuning for every new domain is infeasible.
- Potential Improvements: The current object matching is based on text embeddings. Integrating visual features directly into the graph consolidation process could make the object matching more robust, especially in cases where text descriptions are ambiguous. For example, if the caption says "a person," visual features could help distinguish between different people across frames.
- Broader Impact: This research could significantly advance applications in video search, summarization, and accessibility. By generating detailed, accurate descriptions of long videos, it can help users quickly understand video content without watching it in its entirety and enable new ways for visually impaired users to access video information.