Self-Chained Image-Language Model for Video Localization and Question Answering
TL;DR Summary
The SeViLA framework introduces a solution for video question answering that addresses the limitations of uniform frame sampling. Built on the BLIP-2 model, it combines temporal keyframe localization with question answering, significantly improving performance while reducing the need for expensive video moment annotations.
Abstract
Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves the state-of-the-art in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of Localizer, comparisons of Localizer with other temporal localization models, pre-training/self-refinement of Localizer, and varying the number of keyframes.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Self-Chained Image-Language Model for Video Localization and Question Answering
1.2. Authors
Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal. All authors are affiliated with UNC Chapel Hill.
1.3. Journal/Conference
This paper was released on arXiv, a preprint server, on 2023-05-11. While arXiv itself is not a peer-reviewed journal or conference, it is a widely used platform for disseminating cutting-edge research in computer science, physics, mathematics, and other fields, often preceding formal publication in top-tier conferences or journals. The references cite venues such as "Advances in Neural Information Processing Systems" and the "International Conference on Machine Learning", suggesting the authors target such venues.
1.4. Publication Year
2023
1.5. Abstract
This paper introduces Self-Chained Video Localization-Answering (SeViLA), a novel framework designed to address the limitations of existing video question answering (QA) models, which often rely on uniformly sampled video frames without explicit language-aware temporal modeling. Such uniform sampling can lead to missing crucial visual information, especially when only a specific moment in a video is relevant to a query. Training query-aware video moment localizers typically requires expensive annotations and high computational costs.
SeViLA leverages a single pre-trained image-language model (BLIP-2) to perform both temporal keyframe localization and video QA. It comprises two modules, Localizer and Answerer, both fine-tuned from BLIP-2. The framework utilizes two chaining mechanisms:
- Forward Chain: The Localizer identifies multiple language-aware keyframes in a video, which are then used by the Answerer to predict the final answer.
- Reverse Chain: The Answerer generates keyframe pseudo-labels to refine the Localizer. This process alleviates the need for costly video moment localization annotations.

The SeViLA framework demonstrates superior performance against strong baselines on five challenging video QA and event prediction benchmarks. It achieves state-of-the-art (SOTA) results in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. The paper also includes a comprehensive analysis of the Localizer's impact, comparisons with other temporal localization models, the effects of pre-training and self-refinement, and the influence of varying the number of keyframes.
1.6. Original Source Link
https://arxiv.org/abs/2305.06988 PDF Link: https://arxiv.org/pdf/2305.06988v2.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The field of multimodal AI has seen significant advancements, particularly with large pre-trained image-language models (image-LMs). These models are efficient at representation learning and have been adapted for video-language models (video-LMs). However, a fundamental challenge arises because video data inherently has a temporal dimension, which is often not explicitly addressed when adapting image-LMs.
The core problem the paper aims to solve is the inefficiency and potential for information loss in video question answering (VQA) when using image-LMs or traditional video-LMs that rely on uniformly sampled video frames.
- Problem 1: Uniform Sampling Limitations: Many existing approaches simply concatenate uniformly or randomly sampled video frames as visual input. This method fails to incorporate language-aware, temporal modeling. When a query pertains to a specific, brief moment in a longer video, uniform sampling can miss the critical visual cues, leading to incorrect answers. It also burdens the model with irrelevant information.
- Problem 2: Cost of Temporal Grounding Annotations: Humans naturally focus on relevant video segments and "rewind" to find answers. Mimicking this in AI requires a query-aware video moment localizer. However, training such a localizer demands expensive, frame-level temporal grounding annotations, which are resource-intensive to create.
- Problem 3: Scaling Challenges for Video-LMs: Video-LMs are harder to scale than image-LMs due to higher computational costs and the difficulty of obtaining large-scale video-language paired datasets.

The paper's entry point and innovative idea revolve around tackling these issues by leveraging a single, already large pre-trained image-language model (BLIP-2) to perform both intelligent temporal keyframe localization and question answering. This approach aims to make video-LMs more efficient and effective without incurring the high costs associated with new, specialized video-LM pre-training or extensive temporal localization annotations.
2.2. Main Contributions / Findings
The primary contributions of the SeViLA paper are:
- A Novel Self-Chained Video Localization-Answering (SeViLA) Framework: The paper introduces a new framework that repurposes a single image-language model (BLIP-2) into two specialized modules: a Localizer for language-aware temporal keyframe localization and an Answerer for question answering on videos. Both modules are parameter-efficiently fine-tuned from the BLIP-2 backbone.
- Two Chaining Mechanisms for Enhanced Performance and Efficiency:
  - Forward Chain: The Localizer first identifies language-aware keyframes in a video, which then serve as targeted visual input for the Answerer to predict responses. This mimics human selective attention.
  - Reverse Chain (Self-Refinement): A novel pseudo-labeling method is proposed in which the Answerer generates keyframe pseudo-labels. These labels are then used to refine the Localizer, significantly reducing the dependency on expensive video moment localization annotations.
- State-of-the-Art (SOTA) Empirical Performance: SeViLA demonstrates strong empirical performance, outperforming several robust baselines and achieving SOTA results on five challenging video QA and event prediction benchmarks. This includes:
  - Fine-tuning settings: SOTA on NExT-QA and STAR.
  - Zero-shot settings: SOTA on NExT-QA, STAR, How2QA, and VLEP.
- Comprehensive Analysis: The paper provides in-depth ablation studies and analyses demonstrating the effectiveness of each component, including the Localizer's impact, comparisons with other temporal localization models, the effects of pre-training and self-refinement, and the influence of varying the number of keyframes. It also shows that the Localizer can perform strongly as a standalone moment retrieval model.

These findings collectively highlight the effectiveness of integrating language-aware temporal localization with question answering using a single image-language model and a clever self-refinement mechanism, making video-language tasks more accurate and resource-efficient.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the SeViLA framework, it's essential to understand several foundational concepts in AI, natural language processing (NLP), and computer vision (CV).
- Image-Language Models (Image-LMs): These are deep learning models designed to understand and process both images and natural language text simultaneously. They learn to align visual and linguistic information, enabling tasks like image captioning, visual question answering, and image-text retrieval. Image-LMs are often pre-trained on massive datasets of image-text pairs, learning rich, multimodal representations.
- Video-Language Models (Video-LMs): An extension of image-LMs to handle video data. Videos introduce the additional complexity of a temporal dimension, requiring models to understand sequences of visual information and their evolution over time, alongside natural language. This includes tasks like video captioning, video question answering, and video moment retrieval.
- BLIP-2 (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models): A state-of-the-art image-language model that SeViLA builds upon. BLIP-2's architecture comprises three main components:
  - Frozen Image Encoder: Typically a pre-trained Vision Transformer (ViT), specifically ViT-G in BLIP-2. This component processes an image and extracts its visual features. "Frozen" means its parameters are not updated during BLIP-2's training or SeViLA's fine-tuning, preserving the powerful visual representations it learned during its own pre-training.
  - Frozen Large Language Model (LLM): A powerful generative language model (e.g., Flan-T5 in BLIP-2). This component is responsible for generating human-like text, understanding complex linguistic instructions, and performing various NLP tasks. Like the image encoder, it is kept frozen to retain its extensive linguistic knowledge.
  - Q-Former (Querying Transformer): The crucial, trainable component that acts as a bridge between the frozen image encoder and the frozen LLM. It takes visual features from the image encoder together with a set of learnable query embeddings. Through a transformer architecture, it learns to extract the most salient (important) visual information relevant to a given text prompt, effectively compressing the visual input into a form that the LLM can understand. It undergoes a two-stage pre-training:
    - Image-to-Text Pre-training: Connects the Q-Former to the image encoder to learn to extract informative visual features for text generation.
    - Q-Former to LLM Connection: Connects the Q-Former to the LLM to leverage its generative capabilities, projecting the query embeddings into the LLM's input dimension so that they serve as soft visual prompts.
- Parameter-Efficient Fine-tuning: A set of techniques used to adapt large pre-trained models to new downstream tasks with minimal changes to their original parameters. Instead of fine-tuning the entire model (which can be computationally expensive and prone to catastrophic forgetting), only a small fraction of parameters (e.g., adapter layers, or in BLIP-2's case, the Q-Former) are trained. This significantly reduces computational costs and memory requirements.
- Large Language Models (LLMs): Very large deep learning models that are pre-trained on vast amounts of text data to understand, generate, and process human language. They are typically based on the transformer architecture and can perform a wide range of NLP tasks. Examples include Flan-T5 and GPT-3.
- Transformer: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which revolutionized sequence modeling (e.g., NLP). Its core innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element.
  - Self-Attention Mechanism: For an input sequence, self-attention calculates attention scores between every pair of elements (e.g., words in a sentence or patches in an image). These scores determine how much focus each element should place on other elements when computing its own representation. The mechanism typically involves three learned matrices: Query (Q), Key (K), and Value (V). The formula for scaled dot-product attention (the basis for self-attention) is:
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    - $Q$: Query matrix, derived from the input embeddings. It represents what we are looking for.
    - $K$: Key matrix, derived from the input embeddings. It represents what we are looking up.
    - $V$: Value matrix, derived from the input embeddings. It contains the information to be extracted.
    - $K^T$: Transpose of the Key matrix.
    - $\sqrt{d_k}$: Scaling factor, where $d_k$ is the dimension of the key vectors. This helps stabilize the softmax function during training, preventing very large values from dominating.
    - $\mathrm{softmax}$: A function that converts a vector of numbers into a probability distribution, ensuring the weights sum to 1.
    The result is a weighted sum of the Value vectors, where the weights are determined by the attention scores between Queries and Keys. (A minimal code sketch of this computation appears after this list.)
- Pseudo-labeling: A semi-supervised learning technique in which a model is first trained on a small amount of labeled data and then used to make predictions on a larger set of unlabeled data. The most confident predictions on the unlabeled data are treated as "pseudo-labels" and added to the training set for subsequent model refinement. This helps leverage large amounts of unlabeled data, reducing the need for expensive manual annotations.
- Moment Retrieval / Grounding: A task in video-language understanding where the goal is to identify a specific temporal segment (a "moment") within a video that corresponds to a given natural language query or description. For example, given a video and the query "the person petting a dog," the model should output the start and end timestamps of that action.
- Fine-tuning vs. Zero-shot Learning:
  - Fine-tuning: Involves taking a pre-trained model and further training it on a specific downstream task with labeled data. The pre-trained weights are updated to adapt the model to the nuances of the new task.
  - Zero-shot Learning: Refers to the ability of a model to perform a task it has not been explicitly trained on, without any task-specific examples. This typically relies on the model's ability to generalize from its pre-training knowledge and leverage a semantic understanding of the input and desired output format.
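To make the scaled dot-product attention formula above concrete, here is a minimal NumPy sketch (illustrative only; real transformer implementations add multi-head projections, masking, and dropout):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head.

    Q, K: arrays of shape (sequence_length, d_k); V: (sequence_length, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values

# Toy usage: 4 tokens with 8-dimensional queries, keys, and values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)                    # shape (4, 8)
```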
3.2. Previous Works
The paper contextualizes SeViLA within the landscape of image-language and video-language models, highlighting the evolution and challenges in adapting these models for video understanding.
- Image-Language Pre-trained Models: The authors acknowledge the rapid advancements in image-LMs such as BLIP [34], CLIP [55], Florence [89], and BLIP-2 [35]. These models have benefited from larger model sizes and vast pre-training datasets (e.g., LAION-400M, LAION-5B). The paper notes that image-LMs have scaled more rapidly than video-LMs due to the easier accessibility of image data and simpler data structures. SeViLA leverages this by building upon BLIP-2.
- Image-to-Video Transfer Learning: This area focuses on adapting image-LMs for video tasks, often using a limited number of video frames to enhance learning efficiency. Previous methods include:
  - CLIP4Clip [44]: Adapts CLIP for video clip retrieval.
  - FrozenBiLM [85]: Extends frozen bidirectional language models (like BERT) to incorporate multiple images and adds video-level pre-training.
  - VidIL [72]: Converts multiple images into hierarchical captions with temporal order to aid LMs in comprehending video events.
  The paper points out a common limitation of these works: they often employ a uniform sampling strategy, which is not language-aware. This can lead to the loss of key visual cues and burden models with irrelevant information. SeViLA directly addresses this by introducing a Localizer for language-aware visual information.
- Language-aware Keyframe Localization: Several methods have attempted to select relevant frames based on language queries:
  - Buch et al. [3] (Temp[ATP]): Optimized an end-to-end pipeline to select a single keyframe using answer labels.
  - Lu et al. [42] (LGDN): Selects frames using separate image and language models, then answers questions with a QA model trained with multiple objectives.
  - Qian et al. [54]: Designed a video clip proposal model with predefined ranges, iteratively trained with a QA model.
  - Kim et al. [24] (SeViT FiD): Used a semi-parametric retriever to obtain keyframes based on frame and language feature similarity.
  SeViLA differentiates itself by adopting a large image-LM (BLIP-2) as its Localizer, which is then chained with an Answerer. Crucially, SeViLA allows the Answerer to refine the Localizer through pseudo-labels in a reverse chain, avoiding expensive temporal grounding annotations. This is a significant distinction from methods requiring explicit frame-level labels for localization.
3.3. Technological Evolution
The evolution in multimodal AI has progressed from processing static images to dynamic videos, driven by advancements in deep learning architectures (especially transformers) and the availability of larger datasets.
- Early Multimodal Models: Initially, vision and language tasks were often handled by separate models or simpler fusion techniques.
- Rise of Image-Language Models: The development of models like CLIP and BLIP marked a turning point, demonstrating how joint pre-training on massive image-text datasets could yield powerful, generalizable representations. These models excelled at understanding semantics across modalities.
- Challenges with Video: Extending these image-LMs to videos proved challenging. Videos add the temporal dimension, meaning models must understand not only what is happening but also when and how events unfold. This led to two main approaches:
  - Direct Video-LM Pre-training: Developing dedicated video-LMs (e.g., InternVideo) with specialized architectures, pre-trained on video-text pairs. However, these are often limited by data availability and computational cost compared to image-LMs.
  - Image-to-Video Transfer: Adapting powerful image-LMs to video tasks, often by treating videos as sequences of images. The main drawback here was the uniform sampling problem, where temporal relevance was overlooked.
- Focus on Language-Aware Temporal Modeling: Recent work, including SeViLA, has recognized the need for language-aware temporal modeling. The model should intelligently select or focus on the video segments most relevant to a given linguistic query, rather than processing all frames equally.
- SeViLA's Place: SeViLA fits within this evolution by offering an innovative solution for image-to-video transfer that explicitly addresses language-aware temporal modeling. It leverages the strength of a SOTA image-LM (BLIP-2) and introduces a novel self-chaining mechanism to effectively perform both localization and QA, while also mitigating the expensive annotation cost of temporal localization through pseudo-labeling. This represents a step toward more intelligent and resource-efficient video-language understanding.
3.4. Differentiation Analysis
SeViLA introduces several key innovations compared to previous approaches, particularly concerning language-aware temporal modeling and annotation efficiency:
- Single Image-LM for Dual Tasks: Unlike many previous video-LMs or image-to-video transfer methods that either use separate models for localization and QA or adapt image-LMs primarily for QA after some form of frame selection, SeViLA repurposes a single BLIP-2 model to create both its Localizer and Answerer. This promotes synergy and consistency across modules, as they share the same underlying architecture and much of the pre-trained knowledge.
- Explicit Language-Aware Temporal Keyframe Localization: Many prior image-to-video transfer methods, such as CLIP4Clip [44] or FrozenBiLM [85], rely on uniform or random sampling of frames. They process all frames equally, potentially missing crucial information or being burdened by irrelevant data. SeViLA's Localizer is explicitly designed to identify language-aware keyframes using a specific localization prompt, making the visual input to the Answerer highly relevant to the query. This is a direct response to the identified limitation of uniform sampling.
- Novel Self-Refinement (Reverse Chain) with Pseudo-labeling: This is a major differentiator. While some methods (Buch et al. [3], Qian et al. [54]) try to optimize frame selection, they often still depend on answer labels or predefined ranges. SeViLA's reverse chain employs pseudo-labeling, where the Answerer (which is already good at QA) provides feedback to refine the Localizer. This elegantly addresses the high cost of manual temporal grounding annotations, a significant practical barrier to developing robust localization models. Other keyframe localization methods generally require explicit moment retrieval labels or answer labels for training their localization components.
- Chained Inference and Training: The forward chain (the Localizer's output feeds the Answerer's input) and reverse chain (the Answerer's feedback refines the Localizer) create a symbiotic relationship between the two modules. This iterative improvement mechanism is distinct from sequential, independent training or one-off frame selection.
- Performance with Sparse Frames: The Localizer selects a sparse set of keyframes. This contrasts with approaches that process dense frames (HERO [36], [40]) or rely on voting (BLIP-2 voting). SeViLA shows that by intelligently selecting a few relevant frames, it can achieve superior performance, highlighting quality over quantity of visual input.
- Outperforming Video-LMs without Video Pre-training: Surprisingly, SeViLA (and even the BLIP-2 baselines) can outperform dedicated video-LMs like InternVideo [71] in zero-shot settings. This suggests that the scale and strong representation learning of image-LMs, when combined with smart temporal modeling via SeViLA's localization, can surpass models specifically pre-trained on video data, underscoring the efficiency of image-to-video transfer when done correctly.

In essence, SeViLA offers a more holistic and resource-efficient solution for video-language understanding by making an image-LM perform intelligent, language-aware temporal localization and QA in a mutually reinforcing loop, all while sidestepping the prohibitive costs of extensive manual annotation for temporal grounding.
4. Methodology
4.1. Principles
The core idea behind Self-Chained Video Localization-Answering (SeViLA) is to adapt a powerful, pre-trained image-language model (image-LM), specifically BLIP-2, to effectively handle video-language tasks. The foundational principle is that a well-established image-LM already possesses strong capabilities in understanding the content of individual frames and their relation to language. The main challenge for video is the temporal dimension – identifying which frames are relevant to a given query.
SeViLA addresses this by leveraging BLIP-2's architecture to create two specialized modules:
- Localizer: A module tasked with intelligently selecting language-aware keyframes from a video based on a query. This mimics human selective attention, focusing on relevant visual information.
- Answerer: A module responsible for performing question answering (QA) by synthesizing information from these selected keyframes and the query.

The "self-chained" aspect refers to the bidirectional interaction between these two modules (see the sketch after this list):

- Forward Chain: The Localizer's output (selected keyframes) directly feeds into the Answerer as its visual input. This ensures the Answerer operates on highly relevant, language-conditioned visual data.
- Reverse Chain: The Answerer provides feedback to the Localizer. Specifically, the Answerer's ability to correctly answer a question using individual frames is used to generate pseudo-labels for keyframes, which then refine the Localizer. This self-refinement mechanism is crucial for mitigating the need for expensive manual temporal localization annotations.

The theoretical basis and intuition are that a large image-LM possesses sufficient visual and linguistic understanding to generalize to temporal reasoning, provided it is guided to focus on the right moments. By decoupling the localization and answering tasks, yet chaining them, SeViLA aims for more accurate and efficient video understanding, optimizing the use of existing powerful image-LMs for the complexities of video.
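The two chains can be summarized with the following high-level sketch. This is a simplified illustration, not the authors' code; `localizer`, `answerer`, and their method names are hypothetical stand-ins for the two BLIP-2-based modules:

```python
def forward_chain(localizer, answerer, frames, question, options, k=4):
    """Inference: the Localizer picks k language-aware keyframes,
    then the Answerer answers from only those keyframes."""
    keyframes = localizer.select_keyframes(frames, question, options, k=k)
    return answerer.predict(keyframes, question, options)

def reverse_chain_pseudo_labels(answerer, frames, question, options, gold_answer):
    """Self-refinement: a frame becomes a positive keyframe pseudo-label if the
    frozen Answerer answers correctly from that single frame alone."""
    return [
        1 if answerer.predict([frame], question, options) == gold_answer else 0
        for frame in frames
    ]
```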
4.2. Core Methodology In-depth (Layer by Layer)
The SeViLA framework is built upon BLIP-2 and consists of two main modules, Localizer and Answerer, which are chained in both forward and reverse directions.
4.2.1. Preliminaries: BLIP-2
SeViLA adopts BLIP-2 as its backbone. BLIP-2 is a state-of-the-art pre-trained image-language model with a specific architecture:
- Frozen Image Encoder: This component (e.g., ViT [11, 16], specifically ViT-G in SeViLA) processes raw images to extract high-level visual features. "Frozen" means its parameters are fixed and not updated during SeViLA's training.
- Frozen Large Language Model (LLM): This component (e.g., Flan-T5 [7]) handles linguistic understanding and generation. It is also kept frozen to preserve its extensive language knowledge.
- Q-Former: The only trainable component of BLIP-2 within the SeViLA framework. It is a transformer module [66] that acts as an adapter, connecting the image encoder and the LLM.
  - Input: Visual features from the image encoder and learnable query embeddings.
  - Output: Fixed-length visual query features. The Q-Former is designed to extract the visual information most relevant to the text, effectively compressing the visual input.
  - Pre-training: The Q-Former undergoes a two-stage pre-training:
    - Image-to-Text Pre-training: It is connected to the image encoder to learn to extract the visual information necessary for generating text. This stage helps it filter out irrelevant visual details.
    - Q-Former to LLM Connection: It is connected to the LLM to leverage its generative language capabilities. A fully-connected layer projects the query embeddings into the LLM's input dimension, and these projected features then serve as soft visual prompts [22] for the LLM.

In SeViLA, both the visual encoder and the LLM from BLIP-2 are kept frozen. Only the Q-Formers (one for the Localizer and one for the Answerer) and a single linear layer after each Q-Former are updated during training, making SeViLA parameter-efficient. A rough structural sketch of this data flow is given below.
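As a rough structural sketch of the BLIP-2-style data flow described above (PyTorch-style pseudocode with hypothetical module interfaces; the official BLIP-2 implementation differs, and a single unbatched example is assumed for brevity):

```python
import torch
import torch.nn as nn

class Blip2StyleBridge(nn.Module):
    """Illustrative sketch only: frozen ViT -> trainable Q-Former -> linear
    projection -> soft visual prompt prepended to the frozen LLM's input."""

    def __init__(self, vit, qformer, llm, num_queries=32, qformer_dim=768, llm_dim=2048):
        super().__init__()
        self.vit, self.qformer, self.llm = vit, qformer, llm          # assumed pretrained modules
        for p in self.vit.parameters():
            p.requires_grad = False                                    # frozen image encoder
        for p in self.llm.parameters():
            p.requires_grad = False                                    # frozen LLM
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))  # learnable queries
        self.proj = nn.Linear(qformer_dim, llm_dim)                    # trainable projection

    def forward(self, image, text_embeds):
        with torch.no_grad():
            patch_feats = self.vit(image)                              # frozen visual features
        q = self.qformer(self.queries, patch_feats)                    # queries cross-attend to the image (assumed signature)
        soft_prompt = self.proj(q)                                     # soft visual prompt for the LLM
        return self.llm(torch.cat([soft_prompt, text_embeds], dim=0))  # LLM consumes prompt + text embeddings (assumed)
```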
4.2.2. Self-Chained Video Localization-Answering (SEVILA)
The SeViLA framework adapts BLIP-2 into two distinct roles: a Localizer for temporal localization and an Answerer for question answering.
The following figure (Figure 2 from the original paper) shows the system architecture of SEVILA:

The figure is a schematic showing the structure and function of the Localizer (top) and Answerer (bottom) modules in the SeViLA framework. The Localizer selects multiple language-relevant keyframes, guiding the Answerer to focus on the important visual moments and predict the answer. Both are initialized from a single pre-trained BLIP-2 model, with only the Q-Former and a linear layer (2.5% of the total parameters) fine-tuned.
Figure 2: In the SeViLA framework, the Localizer (top) selects the top-K video frames, which guide the Answerer (bottom) to focus on important language-aware video moments and predict answers. Both the Localizer and Answerer are initialized from a single pre-trained BLIP-2 model, where only the Q-Formers and a linear layer (2.5% of total parameters) are tuned for each module. We omit the linear layer after the Q-Former for simplicity.
4.2.2.1. Localizer
The Localizer's primary objective is to select language-aware keyframe features from a video.
- Frame Feature Extraction: Given a video, n frames are uniformly sampled. A frozen image encoder (from BLIP-2) extracts features for each frame, and the entire video is represented as a set of frame features V. These features are extracted once and reused.
- Visual Query Feature Generation: Each frame feature independently passes through a Q-Former specific to the Localizer to produce visual query features.
- Language Context Creation: A language context L is formed by concatenating the question, the options, and a specific localization prompt. The prompt used is: "Does the information within the frame provide the necessary details to accurately answer the given question?"
- Scoring with LLM: The visual query feature for each frame and the language context are concatenated and fed into the LLM (Flan-T5). The LLM then outputs a score for each frame, defined as the probability of generating the word 'yes'.
- Keyframe Localization: Based on these scores, the Localizer selects the top-k frames as language-aware keyframes, where k is typically much smaller than n. Let the selected keyframe visual features be K. The Localizer can be formulated as:
  $ K = \operatorname{Localizer}(V, L), \quad |K| = k \ll n $
  - $K$: The set of selected language-aware keyframe visual features.
  - $\operatorname{Localizer}$: The function representing the Localizer module.
  - $V$: The set of all uniformly sampled frame features from the video.
  - $L$: The language context, including the question, options, and localization prompt.
  - $k \ll n$: The number of selected keyframes (k) is much smaller than the total number of sampled frames (n).

A minimal sketch of this scoring and selection step follows.
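The sketch below illustrates the Localizer's frame scoring and top-k selection. The helper `llm_yes_prob` is a hypothetical wrapper that is assumed to return the frozen LLM's probability of generating "yes" for a frame feature and text context:

```python
import torch

LOC_PROMPT = ("Does the information within the frame provide the necessary "
              "details to accurately answer the given question?")

def localize_keyframes(llm_yes_prob, frame_query_feats, question, options, k=4):
    """frame_query_feats: list of n per-frame Q-Former outputs.
    llm_yes_prob(feat, text) -> float: assumed wrapper returning p('yes').
    Returns indices of the top-k language-aware keyframes, in temporal order."""
    context = f"{question} {options} {LOC_PROMPT}"
    scores = torch.tensor([llm_yes_prob(feat, context) for feat in frame_query_feats])
    topk = torch.topk(scores, k=k).indices        # highest-scoring frames
    return sorted(topk.tolist())                  # keep temporal order for the Answerer
```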
4.2.2.2. Answerer
The Answerer module takes the keyframes identified by the Localizer and generates the video-level answer.
- Keyframe Visual Query Feature Generation: The keyframes K obtained from the Localizer are processed through a separate Q-Former specific to the Answerer. This step follows the same procedure as in the Localizer to obtain query features.
- Answer Generation with LLM: The LLM is fed with all of these keyframe visual query features and the language context, concatenated together. The LLM then predicts the video-level answer a. The Answerer can be formulated as:
  $ a = \operatorname{Answerer}(K, L) $
  - $a$: The predicted video-level answer.
  - $\operatorname{Answerer}$: The function representing the Answerer module.
  - $K$: The set of language-aware keyframe visual features provided by the Localizer.
  - $L$: The language context for answering (question and options).
  The keyframe query features are concatenated in sequence with the answering context, which enables modeling with multiple frame inputs. A corresponding sketch of this step is shown below.
4.2.3. Training Answerer and Localizer via Self-Chaining
The following figure (Figure 3 from the original paper) shows the forward and reverse chain of SEVILA:

The figure is a schematic of the bidirectional process in the Self-Chained Video Localization-Answering (SeViLA) framework. The top shows the forward chain, where the Localizer finds multiple language-aware keyframes and the Answerer predicts answers from them; the bottom shows the reverse chain, where pseudo-labels generated by the Answerer refine the Localizer.
Figure 3: Top: In the forward chain, the Localizer finds multiple language-aware keyframes, then the Answerer utilizes these keyframes to predict answers. We use the forward chain for both inference and Answerer fine-tuning. Bottom: In the reverse chain, we generate keyframe pseudo-labels by using the Answerer to refine the Localizer.
4.2.3.1. Fine-tuning Answerer in Forward Chain
The Answerer module is fine-tuned on downstream QA tasks.
- During this phase, the Answerer receives keyframes that have been generated by the (potentially pre-trained) Localizer.
- This process uses the forward chain for both inference and fine-tuning, allowing the Answerer to learn to predict answers based on the selected, relevant keyframes. This helps the Answerer focus on important language-aware video moments.
4.2.3.2. Refining Localizer in Reverse Chain
To overcome the need for costly frame-level localization annotations, SeViLA employs a pseudo-labeling [26] strategy in a reverse chain to refine the Localizer.
- Pseudo-label Generation: The frozen Answerer is prompted with the QA task using individual frames. A frame is labeled as a keyframe pseudo-label if the Answerer can correctly output the ground-truth answer using only that specific frame as visual input.
- Localizer Training: The Localizer is then trained to identify these generated language-aware pseudo-label keyframes. This improves the Localizer's accuracy in identifying relevant frames without requiring manual annotations. A sketch of the resulting per-frame training objective is given below.
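The sketch below shows how the pseudo-labels could be turned into a per-frame training signal for the Localizer. It assumes the Localizer emits a 'no'/'yes' logit pair per frame and is trained with standard cross-entropy (the paper states that standard cross-entropy loss is used for all training); variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def localizer_refinement_loss(frame_logits, answerer_is_correct):
    """frame_logits: (n, 2) Localizer logits over {'no', 'yes'} for n frames.
    answerer_is_correct: (n,) 0/1 pseudo-labels, 1 if the frozen Answerer answered
    the question correctly from that single frame. Returns cross-entropy loss."""
    targets = answerer_is_correct.long()
    return F.cross_entropy(frame_logits, targets)

# Toy usage with 8 frames
logits = torch.randn(8, 2, requires_grad=True)
pseudo = torch.tensor([0, 0, 1, 1, 0, 0, 0, 1])
loss = localizer_refinement_loss(logits, pseudo)
loss.backward()
```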
4.2.3.3. Pre-training Localizer with Moment Retrieval Labels
To further enhance the Localizer's capability, a transfer learning step is performed via pre-training on a video moment retrieval/grounding task.
- Dataset: The QVHighlights dataset [30] is used, which provides videos, queries, and video-level temporal span labels.
- Label Conversion: The temporal span annotations are converted into binary localization labels for each frame. A frame receives a positive label if its timestamp falls within a provided temporal span (see the sketch after this list).
- Objective: This pre-training helps the Localizer learn to associate language queries with relevant temporal segments in videos, providing a strong initial foundation before self-refinement.
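A sketch of the span-to-label conversion described above. It assumes frame timestamps and span boundaries are given in seconds; the exact QVHighlights annotation format may differ:

```python
def spans_to_frame_labels(frame_timestamps, spans):
    """frame_timestamps: list of frame times in seconds.
    spans: list of (start, end) ground-truth moment spans in seconds.
    A frame is a positive keyframe if its timestamp falls inside any span."""
    return [
        1 if any(start <= t <= end for start, end in spans) else 0
        for t in frame_timestamps
    ]

# Example: 8 frames over a 16-second clip, one relevant moment from 4s to 9s
labels = spans_to_frame_labels([0, 2, 4, 6, 8, 10, 12, 14], [(4.0, 9.0)])
# -> [0, 0, 1, 1, 1, 0, 0, 0]
```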
5. Experimental Setup
5.1. Datasets
SeViLA evaluates its framework on a diverse set of video-language benchmarks encompassing Video Question Answering (Video QA), Video Event Prediction (Video EP), and Video Moment Retrieval.
- NExT-QA [77]:
  - Domain: Causal and temporal reasoning in videos.
  - Scale: 5,440 videos, averaging 44 seconds in length, with approximately 52,000 questions.
  - Characteristics: Questions are categorized into three types:
    - Temporal (Tem.): Involves understanding sequences and timing of events.
    - Causal (Cau.): Requires reasoning about cause-and-effect relationships.
    - Descriptive (Des.): Focuses on describing visual content.
  - Purpose: To evaluate a model's ability to reason about complex temporal and causal relationships in videos.
- STAR [75]:
  - Domain: Situated reasoning in real-world videos.
  - Scale: 22,000 video clips, averaging 12 seconds in length, with 60,000 questions.
  - Characteristics: Questions are designed to test reasoning in context, categorized into four types:
    - Interaction (Int.): About interactions between entities.
    - Sequence (Seq.): About the order of events.
    - Prediction (Pre.): About predicting future events.
    - Feasibility (Fea.): About the possibility of events.
  - Purpose: To assess a model's understanding of the implicit and explicit reasoning required in dynamic, situated contexts.
- How2QA [36]:
  - Domain: Open-domain QA on instructional videos.
  - Scale: 44,000 questions paired with 22,000 60-second clips, selected from 9,035 videos.
  - Purpose: To evaluate QA capabilities on practical, user-generated content, often requiring common-sense reasoning.
- TVQA [27]:
  - Domain: QA on popular TV shows.
  - Scale: 152,000 questions coupled with 21,000 video clips, averaging 76 seconds.
  - Purpose: To test comprehensive video understanding, often requiring rich context from dialogue, actions, and character relationships.
- VLEP [28] (Video Event Prediction):
  - Domain: Predicting future events in videos.
  - Scale: 28,726 future event prediction cases from 10,234 diverse TV show and YouTube Lifestyle Vlog video clips.
  - Characteristics: Formulated as a multi-choice QA task in which the model chooses between two candidate future events.
  - Purpose: To assess a model's ability to anticipate and reason about future occurrences based on observed video content.
- QVHighlights [30] (Video Moment Retrieval):
  - Domain: Identifying specific temporal spans in videos corresponding to natural language queries.
  - Scale: 10,148 videos (average duration 150s), 18,367 moments, and 10,310 queries.
  - Purpose: To evaluate the Localizer's capability as a standalone moment retrieval model, assessing its precision in temporal grounding.

These datasets collectively provide a comprehensive evaluation across different facets of video-language understanding, including temporal, causal, descriptive, and situated reasoning, as well as future event prediction and precise moment localization. Their diversity in video length, content, and question types makes them suitable for rigorously validating the SeViLA framework's effectiveness and generalizability.
5.2. Evaluation Metrics
For each task, specific evaluation metrics are used to quantify the model's performance.
- For Video Question Answering (NExT-QA, STAR, How2QA, TVQA) and Video Event Prediction (VLEP):
  - Metric: Answer Accuracy
  - Conceptual Definition: Accuracy measures the proportion of predictions that the model made correctly. In the context of multi-choice QA, it is the percentage of questions for which the model selected the correct option from the given set of choices. It focuses on the overall correctness of the final answer.
  - Mathematical Formula:
    $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: The count of instances where the model's predicted answer matches the ground-truth answer.
    - Total Number of Predictions: The total number of questions or instances for which the model made a prediction.
- For Video Moment Retrieval (QVHighlights): The paper follows Lei et al. [30] for these metrics.
  - Metric 1: Mean Average Precision (mAP) over multiple Intersection over Union (IoU) thresholds.
  - Conceptual Definition: mAP is a common metric in object detection and moment retrieval. It evaluates precision across recall levels for each query, then averages these Average Precision (AP) scores over multiple IoU thresholds and over all queries. It provides a comprehensive measure of both the correctness of the predicted moment and its overlap with the ground-truth moment.
    - Intersection over Union (IoU): A measure of the overlap between two temporal segments, calculated as the duration of the intersection divided by the duration of the union of the predicted segment and the ground-truth segment. An IoU threshold (e.g., 0.5, 0.7) determines whether a prediction is considered correct.
    - Precision: The proportion of correctly predicted positive instances among all instances predicted as positive.
    - Recall: The proportion of correctly predicted positive instances among all actual positive instances.
  - Mathematical Formula: (The paper does not provide the explicit formula for mAP but refers to standard practice. A common formulation in moment retrieval computes Average Precision (AP) for each query, typically by integrating the precision-recall curve, and then averages AP over all queries and IoU thresholds.)
    $ \mathrm{AP}_q = \sum_k \left(\mathrm{Recall}_k - \mathrm{Recall}_{k-1}\right) \cdot \mathrm{Precision}_k $
    $ \mathrm{mAP} = \frac{1}{|\mathcal{Q}| \cdot |\mathcal{T}|} \sum_{q \in \mathcal{Q}} \sum_{\tau \in \mathcal{T}} \mathrm{AP}_q(\tau) $
  - Symbol Explanation:
    - $\mathrm{AP}_q$: Average Precision for a specific query $q$.
    - $\mathrm{Recall}_k$, $\mathrm{Precision}_k$: Recall and precision at the $k$-th point of the precision-recall curve.
    - $|\mathcal{Q}|$: The total number of queries.
    - $|\mathcal{T}|$: The number of IoU thresholds used (e.g., [0.5, 0.55, ..., 0.95]).
    - $\mathrm{AP}_q(\tau)$: Average Precision for query $q$ at a specific IoU threshold $\tau$.
  - Metric 2: Recall@1 (R1), where a prediction is considered positive if it has a high IoU with one of the ground-truth moments.
  - Conceptual Definition: Recall@1 measures the proportion of queries for which the top-ranked predicted moment (the one with the highest confidence score) correctly localizes a ground-truth moment. A prediction is deemed correct if its IoU with a ground-truth moment exceeds a specified threshold (e.g., 0.5 or 0.7). It focuses on the quality of the single best prediction.
  - Mathematical Formula: (The paper does not provide an explicit formula for Recall@1 but relies on the standard definition in moment retrieval.)
    $ \text{Recall@1}(\tau) = \frac{\text{Number of Queries with Correct Top-1 Prediction at IoU} \geq \tau}{\text{Total Number of Queries}} $
  - Symbol Explanation:
    - Number of Queries with Correct Top-1 Prediction at IoU $\geq \tau$: The count of queries where the model's highest-ranked moment prediction achieves an IoU of at least $\tau$ with any ground-truth moment.
    - Total Number of Queries: The total number of queries in the dataset.
    - $\tau$: The IoU threshold (e.g., 0.5 or 0.7).

The paper reports performance on the validation sets of NExT-QA, STAR, How2QA, TVQA, and VLEP, and on the hidden test set of QVHighlights. A minimal sketch of the temporal IoU and Recall@1 computations follows.
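For concreteness, a minimal sketch of temporal IoU and Recall@1 at a threshold (illustrative only; the official QVHighlights evaluation script should be used for reported numbers):

```python
def temporal_iou(pred, gt):
    """pred, gt: (start, end) spans in seconds; returns intersection over union."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gt_moments, tau=0.5):
    """top1_preds[i]: best-scoring predicted span for query i.
    gt_moments[i]: list of ground-truth spans for query i."""
    hits = sum(
        any(temporal_iou(pred, gt) >= tau for gt in gts)
        for pred, gts in zip(top1_preds, gt_moments)
    )
    return hits / len(top1_preds)
```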
5.3. Baselines
SeViLA is compared against several strong baselines and previous state-of-the-art models to demonstrate its superiority.
- State-of-the-Art Video-Language Pre-trained Models:
  - InternVideo [71]: A recent SOTA video-language pre-trained model. SeViLA specifically compares against its largest MM-L-14 variant (1B parameters), initialized from CLIP-L/14 [55], using its default 8-frame setting. The authors fine-tuned this model themselves for comparison.
  - Flamingo-80B [1]: A very large visual language model (80 billion parameters) designed for few-shot learning, included for its zero-shot and few-shot performance, especially on STAR.
  - ViperGPT [63]: A recent model that performs visual inference via Python execution for reasoning, evaluated in zero-shot settings.
- BLIP-2 Based Baselines (adapted by the authors): These serve as direct comparisons to show the impact of SeViLA's design choices when starting from the same BLIP-2 backbone.
  - BLIP-2voting: Processes each uniformly sampled frame (e.g., 4 frames) independently using BLIP-2, then obtains the final answer by majority voting over the per-frame answers. This model lacks explicit inter-frame temporal modeling.
  - BLIP-2concat (ANSWERER): BLIP-2's Q-Former processes each uniformly sampled frame; the resulting visual features are concatenated and fed as a prefix to Flan-T5 (the LLM backbone), which then predicts the answer. This baseline performs temporal modeling by concatenating features, similar to the Answerer component, but uses uniformly sampled frames instead of localized keyframes.
- Other Keyframe Selection Methods (for comparative analysis of the Localizer): These assess the effectiveness of SeViLA's Localizer against alternative ways of selecting frames.
  - CLIP [55]: A widely used image-language model. For frame selection, it calculates the image-language similarity between each frame's visual feature (from CLIP-ViT-B/32) and the combined question and option features. The top-4 frames with the highest similarity are selected.
  - Moment-DETR [30]: A model pre-trained for moment retrieval. It detects a temporal span corresponding to the question and option sentence, from which 4 frames are then uniformly sampled.
  - ATP [3] (Atemporal Probe): A method that optimizes an end-to-end pipeline to select a single keyframe (or specific frames) based on answer labels. SeViLA compares against its fine-tuned version.
  - Differentiable Top-K [8]: A technique that allows differentiable selection of the top-K elements. It is used here as a plugin after the Q-Former to learn salient frame feature selection end-to-end, compared in a fine-tuned setting.
- Other Previous Works (from Table 1): The paper also compares against various other models, such as HERO [36], JustAsk [84], VidIL [72], T+T [40], All-in-One [67], VGT [78], MIST [18], VFC [50], CoVGT [79], and HiTeA [87]. These represent a broad spectrum of approaches to video-language understanding, including models that use speech input or dense frames.
5.4. SEVILA Implementation Details
The SeViLA framework is carefully implemented and trained to leverage the power of BLIP-2 efficiently.
- SEVILA Architecture:
  - Backbone: BLIP-2 [35], which has a total of 4.1 billion parameters.
  - Frozen Components:
    - Visual Encoder: ViT-G [16] (1 billion parameters).
    - Large Language Model (LLM): Flan-T5 XL [7] (3 billion parameters).
  - Trainable Components: Only the Q-Formers (one for the Localizer, one for the Answerer) and a single fully-connected layer after each Q-Former are fine-tuned.
  - Parameter Efficiency: The total number of trained parameters is 106 million, which constitutes 2.5% of the total BLIP-2 parameters, highlighting the parameter-efficient fine-tuning approach. A sketch of this bookkeeping is given below.
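A hedged sketch of the parameter-efficiency bookkeeping in generic PyTorch. Module names are placeholders, and the 2.5% figure comes from the paper rather than from this snippet:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Mark all parameters of a module as non-trainable."""
    for p in module.parameters():
        p.requires_grad = False

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that will receive gradients."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# In a SeViLA-style setup one would freeze the ViT-G visual encoder and the
# Flan-T5 XL LLM, leaving only the two Q-Formers (plus one linear layer each)
# trainable; the paper reports ~106M trainable out of ~4.1B total, i.e. ~2.5%.
```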
- SEVILA Framework Training:
  - Hardware: Experiments are conducted on 4 NVIDIA A6000 GPUs (48 GB VRAM each).
  - Loss Function: Standard cross-entropy loss is used between the model's outputs and the target values.

The following table (Table 11 from the original paper) lists the SEVILA training hyperparameters for Localizer pre-training, Answerer fine-tuning, and Localizer self-refinement:

| Stage | Dataset | Batch Size per GPU | Learning Rate | Warmup Step | Epoch | Gradient Accumulation Step |
| --- | --- | --- | --- | --- | --- | --- |
| Localizer Pre-Training | QVHighlights | 64 | 3e-5 | 1000 | 80 | 1 |
| Answerer Fine-tuning in Forward Chain | NExT-QA | 8 | 3e-5 | 1000 | 10 | 2 |
| | STAR | 8 | 3e-5 | 1000 | 10 | 2 |
| | How2QA | 4 | 3e-5 | 3000 | 10 | 4 |
| | TVQA | 4 | 3e-5 | 8000 | 10 | 4 |
| | VLEP | 4 | 1e-5 | 1200 | 10 | 4 |
| Localizer Self-Refinement in Reverse Chain | NExT-QA | 64 | 3e-5 | 500 | 10 | 1 |
| | STAR | 64 | 3e-5 | 500 | 10 | 1 |
| | How2QA | 64 | 3e-5 | 500 | 10 | 1 |
| | TVQA | 64 | 3e-5 | 2000 | 10 | 1 |
| | VLEP | 64 | 3e-5 | 500 | 10 | 1 |

Table 11: SEVILA framework training hyperparameters.

  - Localizer Pre-training:
    - Dataset: QVHighlights.
    - Duration: 80 epochs, approximately 12 hours on 4 GPUs with 29 GB VRAM each.
  - Localizer Self-Refinement (Reverse Chain):
    - Dataset: Each downstream dataset (NExT-QA, STAR, etc.).
    - Duration: An additional 10 epochs of training on pseudo-labels, taking 3-17 hours depending on the dataset.
  - Answerer Fine-tuning (Forward Chain):
    - Dataset: Each downstream dataset, using a frozen pre-trained Localizer.
    - Duration: 10 epochs with answer labels; this phase takes 8-48 hours depending on the dataset.
- Prompt Engineering: Multiple QA and localization prompts are tested, and the one yielding the best zero-shot performance on the downstream task is selected.
- Localizer Pre-training Details: The following figure (Figure 6 from the original paper) illustrates Localizer pre-training and aggregation for video moment retrieval. The figure is a schematic of the Localizer's pre-training for the moment retrieval task: the left side shows keyframe localization for the query sentence "a shark is swimming underwater", and the right side shows how frame-level predictions are aggregated into temporal spans (with the span threshold set to 6).

  Figure 6: Left: For Localizer pre-training, we utilize the video moment retrieval labels for the keyframe localization task. Right: we aggregate the Localizer's frame-level predictions into video-level span predictions.

  - QVHighlights [30] is used. Temporal span annotations from QVHighlights are converted into binary keyframe localization labels by comparing frame timestamps with the spans. A frame is a keyframe if its timestamp falls within a relevant span.
  - A prompt template is designed and filled with query sentences from QVHighlights, ensuring a similar input format for pre-training and the downstream tasks.
- Details of Aggregation for Video Moment Retrieval:
  - When evaluating the Localizer on QVHighlights for moment retrieval, frame-level predictions (binary 'yes'/'no' per frame) need to be aggregated into video-level temporal spans.
  - A hyperparameter called the span threshold is used: the maximum number of consecutive 'no' predictions (frames not localized as keyframes) allowed within a single span. If more consecutive 'no's occur, the segment is split into separate spans.
  - The span threshold is set to 6, determined by analyzing the average interval among grounding spans in the QVHighlights training data. A sketch of this aggregation is given below.
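A sketch of the aggregation just described. It assumes binary per-frame predictions at a fixed sampling rate; the span threshold bounds the run of consecutive negative frames tolerated inside one span:

```python
def frames_to_spans(frame_preds, frame_times, span_threshold=6):
    """frame_preds: list of 0/1 keyframe predictions per frame (temporal order).
    frame_times: list of timestamps (seconds) for those frames.
    Merges positive frames into spans, allowing gaps of up to `span_threshold`
    consecutive negative frames inside a single span."""
    spans, start, last_pos, gap = [], None, None, 0
    for pred, t in zip(frame_preds, frame_times):
        if pred == 1:
            if start is None:
                start = t
            last_pos, gap = t, 0
        elif start is not None:
            gap += 1
            if gap > span_threshold:              # too many consecutive 'no's: close the span
                spans.append((start, last_pos))
                start, last_pos, gap = None, None, 0
    if start is not None:
        spans.append((start, last_pos))
    return spans

# Example: predictions for 12 frames sampled every 2 seconds
preds = [0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
times = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
# With span_threshold=1 this yields [(2, 4), (10, 10)]; with the default 6, [(2, 10)].
```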
6. Results & Analysis
6.1. Core Results Analysis
SeViLA demonstrates strong performance across various video QA and event prediction benchmarks in both fine-tuning and zero-shot settings.
6.1.1. Fine-tuning Comparison to SOTA on Video QA and Event Prediction
The following table (Table 1 from the original paper) shows fine-tuning results on video question answering (NExT-QA, STAR, How2QA, TVQA) and video event prediction (VLEP).
| Model (# Frames) | Tem. | Cau. | Des. | NExT-QA Avg. | Int. | Seq. | Pre. | Fea. | STAR Avg. | How2QA | TVQA | VLEP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (w/ speech input or use dense frames) | | | | | | | | | | | | |
| HERO (dense/1fps) [36] | - | - | - | - | - | - | - | - | - | 73.8 | 73.6 | - |
| JustAsk (20) [84] | 51.4 | 49.6 | 63.1 | 52.3 | - | - | - | - | - | 84.4 | - | - |
| FrozenBiLM (10) [85] | - | - | - | - | - | - | - | - | - | 86.7 | 82.0 | - |
| VidIL 4-shot (12) [72] | - | - | - | - | - | - | - | - | - | - | - | 72.0 |
| T+T (dense/1fps) [40] | - | - | - | - | - | - | - | - | - | 92.4 | - | - |
| T+T (+ASR, dense/1fps) [40] | - | - | - | - | - | - | - | - | - | 93.2 | - | - |
| Flamingo-80B 32-shot (30) [1] | - | - | - | - | - | - | - | - | 42.2 | - | - | - |
| FrozenBiLM (10) [85] | - | - | - | - | - | - | - | - | - | 81.5 | 57.5 | - |
| All-in-One (32) [67] | 48.6 | 48.0 | 63.2 | 50.6 | 47.5 | 50.8 | 47.7 | 44.0 | 47.5 | - | - | - |
| Temp[ATP] (32) [3] | 49.3 | 48.6 | 65.0 | 51.5 | 50.6 | 52.8 | 49.3 | 40.6 | 48.3 | - | - | - |
| VGT (32) [78] | 55.0 | 52.2 | 64.0 | 55.0 | - | - | - | - | 44.2 | - | - | - |
| MIST (32) [18] | 56.6 | 54.6 | 66.9 | 57.1 | 55.5 | 54.2 | 54.2 | 44.4 | 51.1 | - | - | - |
| VFC (32) [50] | 53.3 | 57.6 | 72.8 | 58.6 | - | - | - | - | - | - | - | - |
| CoVGT (32) [79] | 57.4 | 58.8 | 69.3 | 60.0 | - | - | - | - | 45.9 | - | - | - |
| SeViT FiD (10) [24] | - | - | - | 60.6 | - | - | - | - | - | - | - | - |
| HiTeA (16) [87] | 58.3 | 62.4 | 75.6 | 63.1 | - | - | - | - | - | - | - | - |
| InternVideo* (8) [71] | 58.5 | 62.5 | 75.8 | 63.2 | 62.7 | 65.6 | 54.9 | 51.9 | 58.7 | 79.0 | 57.2 | 63.9 |
| BLIP-2voting (4) | 65.2 | 70.1 | 80.1 | 70.1 | 52.3 | 54.8 | 49.0 | 51.2 | 51.8 | 79.6 | 54.5 | 67.0 |
| BLIP-2concat (ANSWERER) (4) | 68.1 | 72.9 | 81.2 | 72.6 | 65.4 | 69.0 | 59.7 | 54.2 | 62.0 | 82.2 | 59.8 | 68.6 |
| SEVILA† (32 → 4) | 68.8 | 73.4 | 83.5 | 73.4 | 63.2 | 66.6 | 61.3 | 60.0 | 62.7 | 83.7 | 59.7 | 69.0 |
| SEVILA (32 → 4) | 69.4 | 74.2 | 81.3 | 73.8 | 63.7 | 70.4 | 63.1 | 62.4 | 64.9 | 83.6 | 61.6 | 68.9 |
Table 1: Fine-tuning results on video question answering (NExT-QA, STAR, How2QA, TVQA) and video event prediction (VLEP). We gray out methods that take extra speech input or use dense frames. We bold the best numbers and underline the second-best numbers. dense/1fps: the model takes dense (1 fps) video frames instead of a fixed number of frames. 32 → 4: our Localizer selects 4 keyframes from 32 frames. * represents results tested by ourselves. † uses the zero-shot Localizer without refining on pseudo-labels via the reverse chain.
Key findings from fine-tuning results:
- Temporal Modeling Matters: BLIP-2voting (which processes frames independently) performs significantly worse than BLIP-2concat (ANSWERER) and other video-LMs, especially on STAR-Sequence, a task requiring strong temporal understanding, where BLIP-2concat outperforms BLIP-2voting by a large margin (69.0% vs. 54.8%). This confirms the importance of incorporating temporal modeling in video-language tasks.
- Keyframe Selection Helps (SEVILA†): SEVILA† (using a zero-shot Localizer without self-refinement) consistently outperforms BLIP-2concat (ANSWERER), which uses uniform sampling, across tasks: NExT-QA (73.4% vs. 72.6%), STAR (62.7% vs. 62.0%), How2QA (83.7% vs. 82.2%), and VLEP (69.0% vs. 68.6%). It also surpasses the top video-LM, InternVideo, on every benchmark. This highlights the significant benefit of language-aware keyframe selection.
- Self-Refinement Improves Temporal Localization (SEVILA): When the Localizer is refined using pseudo-labels via the reverse chain (SEVILA vs. SEVILA†), performance further increases on NExT-QA (73.8% vs. 73.4%), STAR (64.9% vs. 62.7%), and TVQA (61.6% vs. 59.7%). This demonstrates the efficacy of the self-refinement mechanism and its contribution to state-of-the-art fine-tuning performance on NExT-QA, STAR, TVQA, and VLEP.
6.1.2. Zero-shot Comparison to SOTA on Video QA and Event Prediction
The following table (Table 2 from the original paper) shows zero-shot results on video question answering and video event prediction.
| Model (# Frames) | Tem. | Cau. | Des. | NExT-QA Avg. | Int. | Seq. | Pre. | Fea. | STAR Avg. | How2QA | TVQA | VLEP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (w/ speech input or use dense frames) | | | | | | | | | | | | |
| JustAsk (20) [84] | - | - | - | - | - | - | - | - | - | 51.1 | - | - |
| FrozenBiLM (10) [85] | - | - | - | - | - | - | - | - | - | 58.4 | 59.2 | - |
| ViperGPT (dense/1fps) [63] | - | - | - | 60.0 | - | - | - | - | - | - | - | - |
| Flamingo-80B (30) [1] | - | - | - | - | - | - | - | - | 39.7 | - | - | - |
| FrozenBiLM (10) [85] | - | - | - | - | - | - | - | - | - | 41.9 | 29.7 | - |
| VFC (32) [50] | 45.4 | 51.6 | 64.1 | 51.5 | - | - | - | - | - | - | - | - |
| InternVideo* (8) [71] | 43.4 | 48.0 | 65.1 | 49.1 | 43.8 | 43.2 | 42.3 | 37.4 | 41.6 | 62.2 | 35.9 | 58.7 |
| BLIP-2voting (4) | 59.1 | 61.3 | 74.9 | 62.7 | 41.8 | 39.7 | 40.2 | 39.5 | 40.3 | 69.8 | 35.7 | 63.8 |
| BLIP-2concat (ANSWERER) (4) | 59.7 | 60.8 | 73.8 | 62.4 | 45.5 | 41.8 | 41.8 | 40.0 | 42.2 | 70.8 | 36.6 | 64.0 |
| SEVILA† (32 → 4) | 61.3 | 61.5 | 75.6 | 63.6 | 48.3 | 45.0 | 44.4 | 40.8 | 44.6 | 72.3 | 38.2 | 64.4 |
Table 2: Zero-shot results on video question answering and video event prediction.
Key findings from zero-shot results:
- Image-LM Outperforms Video-LM without Video Pre-training: Surprisingly, BLIP-2voting, despite lacking inter-frame temporal modeling, outperforms InternVideo (a SOTA video-LM) on NExT-QA (62.7% vs. 49.1%), How2QA (69.8% vs. 62.2%), and VLEP (63.8% vs. 58.7%). This indicates the immense potential of large image-LMs due to their scale and extensive pre-training, even without dedicated video pre-training.
- Keyframe Selection is More Effective than Uniform Sampling: SEVILA† (combining the zero-shot Localizer and zero-shot Answerer) outperforms BLIP-2concat (ANSWERER), which uses uniformly sampled frames, across all tasks: NExT-QA (63.6% vs. 62.4%), STAR (44.6% vs. 42.2%), How2QA (72.3% vs. 70.8%), TVQA (38.2% vs. 36.6%), and VLEP (64.4% vs. 64.0%). It achieves new state-of-the-art zero-shot performance on NExT-QA, STAR, How2QA, and VLEP, and a new state of the art on TVQA among methods using only visual and language modalities. This emphasizes the effectiveness of language-aware keyframe selection. SEVILA† even outperforms zero-shot Flamingo-80B [1] on STAR.
6.2. Ablation Studies on SEVILA Framework
The following table (Table 3 from the original paper) shows ablation studies on SEVILA framework.
| Row | # Frames (ANSWERER) | Fine-tuned? | Keyframe | Tem. | Cau. | Des. | NExT-QA Avg. | Int. | Seq. | Pre. | Fea. | STAR Avg. | How2QA | TVQA | VLEP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A. | 32 | ✗ | uniform | 54.7 | 56.7 | 67.8 | 57.7 | 46.2 | 43.6 | 40.7 | 41.0 | 42.8 | 67.0 | 33.2 | 54.0 |
| B. | 4 | ✗ | uniform | 59.7 | 60.8 | 73.8 | 62.4 | 45.5 | 41.8 | 41.8 | 40.0 | 42.2 | 70.8 | 36.6 | 64.0 |
| C. | 4 | ✗ | LOCALIZER† | 61.3 | 61.5 | 75.6 | 63.6 | 48.3 | 45.0 | 44.4 | 40.8 | 44.6 | 72.3 | 38.2 | 64.4 |
| D. | 4 | ✗ | LOCALIZER | 62.3 | 63.1 | 74.9 | 64.6 | 49.0 | 46.4 | 45.2 | 41.6 | 45.5 | 72.9 | 39.1 | 64.6 |
| E. | 4 | ✓ | uniform | 68.1 | 72.9 | 81.2 | 72.6 | 65.4 | 69.0 | 59.7 | 54.2 | 62.0 | 82.2 | 59.8 | 68.6 |
| F. | 4 | ✓ | LOCALIZER† | 68.8 | 73.4 | 83.5 | 73.4 | 63.2 | 66.6 | 61.3 | 60.0 | 62.7 | 83.7 | 59.7 | 69.0 |
| G. | 4 | ✓ | LOCALIZER | 69.4 | 74.2 | 81.3 | 73.8 | 63.7 | 70.4 | 63.1 | 62.4 | 64.9 | 83.6 | 61.6 | 68.9 |

Table 3: Ablation studies on the SEVILA framework. 'uniform' refers to uniform sampling of video frames. LOCALIZER† refers to the zero-shot LOCALIZER without refinement on pseudo-labels.
- Sparse Frames Outperform Dense Frames (A vs. B): Reducing the number of uniformly sampled input frames from 32 to 4 for the zero-shot Answerer improves average performance on NExT-QA (57.7% to 62.4%), How2QA (67.0% to 70.8%), TVQA (33.2% to 36.6%), and VLEP (54.0% to 64.0%), while STAR stays comparable (42.8% vs. 42.2%). This suggests that, for an image-LM backbone with limited temporal modeling ability, too many dense frames can be distracting.
- Keyframes Outperform Uniformly Sampled Frames (B vs. C, E vs. F):
  - In the zero-shot Answerer setting, using keyframes from the zero-shot Localizer† (Row C) improves performance over uniformly sampled frames (Row B) across all tasks (e.g., NExT-QA average: 63.6% vs. 62.4%; STAR average: 44.6% vs. 42.2%).
  - Similar gains are observed in the fine-tuned Answerer setting when using keyframes from Localizer† (Row F) compared to uniform sampling (Row E).
- Pseudo-label Refinement is Effective (C vs. D, F vs. G):
  - Refining the Localizer with pseudo-labels (LOCALIZER vs. LOCALIZER†) further boosts performance across all tasks in the zero-shot Answerer setting (Row D vs. C).
  - In the fine-tuned Answerer setting, pseudo-label refinement also provides an average boost across tasks (Row G vs. F).
6.3. Comparison to State-of-the-Art on Video Moment Retrieval
The following table (Table 4 from the original paper) shows a comparison on QVHighlights test split.
| Model | R1@0.5 | R1@0.7 | mAP |
| --- | --- | --- | --- |
| CAL [13] | 25.4 | 11.5 | 9.8 |
| XML [29] | 41.8 | 30.3 | 32.1 |
| Moment-DETR [30] | 52.8 | 33.0 | 30.7 |
| QD-DETR [51] | 62.4 | 44.9 | 39.8 |
| LOCALIZER (Ours) | 54.5 | 36.5 | 32.3 |

Table 4: Comparison on the QVHighlights test split. We aggregate frame-level results of our Localizer for video-level evaluation (see Appendix).
The Localizer (pre-trained on QVHighlights) performs strongly as a standalone moment retrieval model. It achieves competitive or superior performance compared to previous methods with complex temporal modeling (CAL, XML, Moment-DETR), even though SeViLA's Localizer operates on a frame-level without explicit temporal modeling. For instance, Localizer significantly outperforms Moment-DETR in mAP (32.3 vs. 30.7) and R1@0.7 (36.5 vs. 33.0). However, QD-DETR [51] still achieves the highest R1@0.5, R1@0.7, and mAP.
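The paper's appendix describes how frame-level Localizer outputs are aggregated into video-level moment predictions for this evaluation; the exact procedure is not reproduced here. A plausible version, shown below purely as an assumed illustration, is to threshold per-frame relevance scores and merge consecutive above-threshold frames into spans ranked by their mean score. The `threshold` value and the ranking rule are illustrative choices, not the paper's settings.

```python
from typing import List, Tuple

def frames_to_moments(
    scores: List[float],      # per-frame relevance scores from a frame-level localizer
    fps: float = 1.0,         # frames scored per second of video
    threshold: float = 0.5,   # illustrative cut-off, not the paper's setting
) -> List[Tuple[float, float, float]]:
    """Merge consecutive above-threshold frames into (start_sec, end_sec, mean_score) spans."""
    moments, start = [], None
    for i, s in enumerate(scores + [float("-inf")]):   # sentinel closes a trailing span
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            span = scores[start:i]
            moments.append((start / fps, i / fps, sum(span) / len(span)))
            start = None
    return sorted(moments, key=lambda m: m[2], reverse=True)  # rank spans by mean score

# Example: two high-scoring regions become two ranked moments.
print(frames_to_moments([0.1, 0.8, 0.9, 0.2, 0.7, 0.6], fps=0.5))
```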
6.4. Detailed Analysis on the Localizer
6.4.1. Ablation on Localizer Pre-training and Self-refinement
The following table (Table 5 from the original paper) shows the impact of QVHighlights pre-training (PT) and self-refinement (SR) for our Localizer.
| PT | SR | Tem. | Cau. | Des. | Avg. | How2QA |
|---|---|---|---|---|---|---|
| – | – | 60.4 | 61.0 | 74.6 | 62.9 | 70.7 |
| ✓ | – | 61.3 | 61.5 | 75.6 | 63.6 | 72.3 |
| – | ✓ | 62.1 | 62.6 | 75.1 | 64.3 | 72.8 |
| ✓ | ✓ | 62.3 | 63.1 | 74.9 | 64.6 | 72.9 |

Table 5: The impact of QVHighlights pre-training (PT) and self-refinement (SR) for our Localizer (Sec. 3.3). Tem./Cau./Des./Avg. are NExT-QA question types.
This ablation uses the zero-shot 4-frame Answerer.
- An untrained BLIP-2 Localizer provides only a minor improvement.
- Both QVHighlights pre-training (PT) and self-refinement (SR, the reverse chain) independently provide significant performance boosts.
- The optimal results are achieved when both pre-training and self-refinement are applied, demonstrating the method's label-efficiency for keyframe temporal localization. A sketch of how self-refinement pseudo-labels can be produced is given after this list.
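As a rough sketch of how the self-refinement (SR) signal can be produced in the reverse chain: each candidate frame is labelled positive if the Answerer, given only that frame, produces the ground-truth answer, and negative otherwise; the Localizer is then fine-tuned on these binary labels. The helper `answerer_predict` is a hypothetical stand-in for the BLIP-2 Answerer, and the labelling rule here is a simplification of the paper's procedure.

```python
from typing import Callable, Dict, List, Sequence

def make_keyframe_pseudo_labels(
    frames: Sequence,
    question: str,
    options: List[str],
    gold_answer: str,
    answerer_predict: Callable[[object, str, List[str]], str],  # single-frame QA stand-in
) -> Dict[int, int]:
    """Reverse chain: label a frame 1 if the Answerer gets the question right from it alone."""
    labels = {}
    for i, frame in enumerate(frames):
        prediction = answerer_predict(frame, question, options)
        labels[i] = int(prediction == gold_answer)
    return labels   # used as binary supervision to refine the Localizer

# Toy usage: pretend only frame 2 contains the answer-revealing content.
dummy_answerer = lambda f, q, opts: "B" if f == 2 else "A"
print(make_keyframe_pseudo_labels(range(4), "Q?", ["A", "B"], "B", dummy_answerer))
```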
6.4.2. Comparison with other Keyframe Selection Methods
The following table (Table 6 from the original paper) shows a comparison of our Localizer with other keyframe localization methods.
| Method | Tem. | Cau. | Des. | Avg. |
|---|---|---|---|---|
| Answerer (zero-shot) | 59.7 | 60.8 | 73.7 | 62.4 |
| *Zero-shot keyframe selection* | | | | |
| + CLIP [55] | 59.2 | 60.0 | 72.5 | 61.8 |
| + Moment-DETR [30] | 59.5 | 60.6 | 72.1 | 62.0 |
| + Localizer† | 61.3 | 61.5 | 75.6 | 63.6 |
| *Fine-tuned keyframe selection* | | | | |
| + ATP [3] | 60.4 | 61.3 | 73.4 | 62.8 |
| + Differentiable Top-K [8] | 59.5 | 59.7 | 72.7 | 61.6 |
| + Localizer | 62.3 | 63.1 | 74.9 | 64.6 |

Table 6: Comparison of our Localizer with other keyframe localization methods on NExT-QA.
- CLIP and Moment-DETR used as zero-shot keyframe selectors do not help the Answerer and can even degrade performance (e.g., Moment-DETR yields a 62.0 NExT-QA average vs. 62.4 for the Answerer alone). This might be because their pre-training on images or short declarative sentences fails to produce question-aware visual features, potentially distracting the Answerer.
- Our zero-shot Localizer† improves the NExT-QA average from 62.4 to 63.6.
- Our Localizer (refined with pseudo-labels) reaches a 64.6 NExT-QA average, outperforming fine-tuned ATP (62.8) and Differentiable Top-K (61.6) across all question types. This indicates the superior effectiveness of SeViLA's Localizer.
6.4.3. Impact of Keyframe Selection Ranges and Quantities
The following table (Table 7 from the original paper) shows an ablation of different numbers of input frames and output keyframes.
| Settings | Tem. | Cau. | Des. | Avg. | How2QA |
|---|---|---|---|---|---|
| BLIP-2voting (8) | 59.9 | 60.2 | 72.4 | 62.0 | 69.8 |
| 8→1 | 59.8 | 61.1 | 76.0 | 62.9 | 72.4 |
| 16→1 | 59.2 | 62.6 | 74.9 | 63.4 | 73.2 |
| 16→4 | 60.7 | 61.5 | 75.8 | 63.4 | 72.4 |
| 32→4 | 61.3 | 61.5 | 75.6 | 63.6 | 72.3 |
| 32→8 | 59.4 | 60.9 | 74.7 | 62.5 | 71.3 |
| 64→8 | 58.9 | 60.9 | 74.0 | 62.2 | 71.8 |
Table 7: Ablation of different numbers of input frames and output keyframes.
- Even selecting just one keyframe (8→1 and 16→1) with the Localizer yields clear improvements over BLIP-2voting (8) on NExT-QA-Causal (60.2 → 61.1 / 62.6), NExT-QA-Description (72.4 → 76.0 / 74.9), and How2QA (69.8 → 72.4 / 73.2). This highlights the Localizer's effectiveness in finding salient frames.
- Multiple keyframes generally benefit NExT-QA-Temporal questions.
- Denser input frames (e.g., 32→8 vs. 16→4, or 64→8 vs. 32→4) tend to result in worse performance, reinforcing the finding that sparse, relevant frames are better for image-LMs.
6.4.4. Impact of Different Frame Sampling During Answerer Fine-tuning
The following table (Table 8 from the original paper) compares different frame sampling during Answerer fine-tuning.
| Training | Inference | Tem. | Cau. | Des. | Avg. |
|---|---|---|---|---|---|
| Random | Uniform | 68.1 | 72.9 | 81.2 | 72.6 |
| Random | Localizer† | 67.6 | 73.4 | 84.0 | 73.1 |
| Localizer† | Uniform | 68.2 | 72.7 | 80.0 | 72.3 |
| Localizer† | Localizer† | 68.8 | 73.4 | 83.5 | 73.4 |

Table 8: Comparing different frame sampling during Answerer fine-tuning (NExT-QA). The Localizer† is frozen during fine-tuning. We use 4 frames for Answerer training, while the Localizer uses the default 32 → 4 setting.
- The SeViLA framework performs optimally when the Localizer is used consistently in both Answerer training and evaluation (Localizer† for both training and inference yields a 73.4 NExT-QA average). This is attributed to providing more informative keyframes and minimizing domain shift between training and evaluation.
6.4.5. Upper-bound Performance Analysis on Oracle Keyframes
The following table (Table 9 from the original paper) shows BLIP-2voting and oracle (in brackets) performance analysis across datasets.
| Datasets | Zero-Shot | Fine-tuned |
|---|---|---|
| NExT-QA (Avg.) | 62.7 (70.1) | 70.1 (79.7) |
| STAR (Avg.) | 40.3 (52.9) | 51.8 (72.2) |
| How2QA | 69.8 (77.8) | 79.6 (86.4) |
| TVQA | 35.7 (45.4) | 54.5 (69.0) |
| VLEP | 63.8 (70.5) | 67.0 (79.1) |

Table 9: BLIP-2voting and oracle (in brackets) performance analysis across datasets. We use 4 frames for each video question. Oracle: at least 1 of the 4 frames can give the right answer.
- This analysis assumes a "perfect" Localizer (an oracle) that always provides the right keyframes. It uniformly samples four frames, obtains four frame-level answers, and counts a question as answered correctly if at least one frame yields the right answer; the sketch below makes this voting vs. oracle computation explicit.
- Significant gaps exist between BLIP-2 majority voting and oracle accuracy (e.g., NExT-QA fine-tuned: 70.1 vs. 79.7; How2QA fine-tuned: 79.6 vs. 86.4). These gaps highlight substantial room for improvement in temporal localization to fully leverage image-LMs for video-language tasks.
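The two metrics can be computed directly from per-frame correctness: voting counts a question as right when the majority of the frame-level answers match the gold answer, while the oracle counts it as right when at least one frame does. The sketch below mirrors the evaluation described above rather than the authors' exact script.

```python
from collections import Counter
from typing import List

def voting_and_oracle_accuracy(per_frame_preds: List[List[str]], gold: List[str]):
    """per_frame_preds[i] holds the answers predicted from each of the 4 frames of question i."""
    voting_hits = oracle_hits = 0
    for preds, ans in zip(per_frame_preds, gold):
        majority = Counter(preds).most_common(1)[0][0]    # frame-level majority vote
        voting_hits += majority == ans
        oracle_hits += ans in preds                        # oracle: any correct frame suffices
    n = len(gold)
    return voting_hits / n, oracle_hits / n

# Toy example: question 3 is answerable from one frame only, so only the oracle catches it.
preds = [["A", "A", "B", "A"], ["C", "B", "B", "B"], ["A", "D", "A", "A"]]
gold = ["A", "B", "D"]
print(voting_and_oracle_accuracy(preds, gold))   # (0.666..., 1.0)
```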
6.4.6. Qualitative Analysis on the Localizer
The following figure (Figure 4 from the original paper) shows a visualization of our Localizer.

This figure is an illustration of answering with different frame sampling (uniform sampling vs. our Localizer) in video QA. Red options indicate questions answered incorrectly with uniform sampling; green options indicate questions answered correctly with our Localizer. Best viewed in color.
Figure 4: Visualization of our Localizer. We use the zero-shot Answerer with different frame sampling (uniform vs. Localizer) to answer the question. Red options are answered wrongly with uniformly sampled frames. Green options are answered correctly with our Localizer. Best viewed in color.
- Visualizations (Figure 4 and Figure 7 in the Appendix) show that the Localizer identifies task-related keyframes more accurately than uniform selection, closely matching human annotations.
- This accurate localization enables the Answerer to answer questions correctly, whereas uniform selection often leads to incorrect responses. This confirms the Localizer's ability to effectively find relevant video moments, benefiting downstream tasks.
6.4.7. Single-frame vs. Multi-frame Localizer
The following table (Table 12 from the original paper) shows a comparison between the single-frame and multi-frame Localizer.
| Answerer | # frames of Localizer | NExT-QA (Average) |
|---|---|---|
| zero-shot | 1 | 64.6 |
| zero-shot | 4 | 63.6 |
| fine-tuned | 1 | 73.4 |
| fine-tuned | 4 | 71.3 |

Table 12: Comparison between the single-frame and multi-frame Localizer.
- Expanding the Localizer to a multi-frame mode (concatenating frames into a long image for the Q-Former) surprisingly performs worse than the single-frame Localizer in both zero-shot and fine-tuning settings; a sketch of the frame-concatenation step follows below.
- This is attributed to the BLIP-2 backbone not being pre-trained on video data. The authors suggest that a multi-frame Localizer could be more powerful given sufficient temporal grounding annotations or large-scale video pre-training.
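For reference, the multi-frame variant can be approximated by tiling the sampled frames side by side into one wide image before it is passed to the Q-Former. The sketch below shows only that concatenation step, using NumPy, and the horizontal layout is an assumption for illustration rather than the authors' exact preprocessing.

```python
import numpy as np

def concat_frames_horizontally(frames: list) -> np.ndarray:
    """Tile same-sized H x W x 3 frames into one long H x (n*W) x 3 image."""
    assert len({f.shape for f in frames}) == 1, "frames must share the same resolution"
    return np.concatenate(frames, axis=1)

# Toy usage: four 224x224 RGB frames become one 224x896 image.
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(4)]
print(concat_frames_horizontally(frames).shape)   # (224, 896, 3)
```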
6.4.8. Iterative Self-refinement on Localizer and Answerer
The following table (Table 14 from the original paper) shows iterative self-refinement results of the SeViLA framework.
| Iteration | NExT-QA (Average) |
|---|---|
| 1 | 73.8 |
| 2 | 74.2 |
| 3 | 73.7 |

Table 14: Iterative self-refinement results of the SeViLA framework.
- Iterative self-refinement (where the Answerer gives pseudo-labels to train the Localizer, which in turn provides frames to fine-tune the Answerer) shows a marginal improvement from iteration 1 to 2 (73.8 → 74.2), but performance saturates by iteration 3 (73.7). Further analysis is left for future work.
6.4.9. Different Pre-training Settings of Localizer
The following table (Table 13 from the original paper) shows a comparison among different pre-training settings of the Localizer.
| Localizer | NExT-QA (Average) |
|---|---|
| w/o Localizer | 62.4 |
| + Moment-DETR | 62.0 |
| + Our Localizer (without pre-training) | 62.9 |
| + Our Localizer (weakly pre-trained with QVH ASR) | 63.2 |
| + Our Localizer (pre-trained with QVH) | 63.6 |

Table 13: Comparison among different pre-training settings of the Localizer.
- Using a Localizer even without pre-training improves over no Localizer (62.9 vs. 62.4).
- Weakly supervised pre-training using ASR (Automatic Speech Recognition) further improves performance (63.2).
- Pre-training with manual QVHighlights (QVH) annotations yields the best result (63.6), affirming the benefit of targeted pre-training for the Localizer.
6.4.10. SeViLA Framework with Another Image-LM (MiniGPT4)
- The self-chaining scheme is also effective with MiniGPT4 [92], another recent image-language model.
- On NExT-QA, the zero-shot MiniGPT4 Answerer improves when paired with the zero-shot MiniGPT4 Localizer, indicating SeViLA's generalizability across different image-LM backbones.
6.4.11. Computational Cost of SeViLA Framework
The following table (Table 15 from the original paper) shows the computational cost of the SeViLA framework.
| Model | Memory (GB) | Running Time (sec./sample) | Parameters (B) |
|---|---|---|---|
| Answerer (4) | 7.56 | 1.79 | 4.1 |
| SeViLA (32 → 4) | 7.98 | 3.28 | 4.2 |

Table 15: Computational cost of the SeViLA framework.
- Adding the Localizer to the Answerer (SeViLA (32 → 4)) results in a very small additional memory footprint (7.98 GB vs. 7.56 GB) and a modest increase in running time (3.28 vs. 1.79 sec/sample). This is because the Localizer and Answerer share most parameters, demonstrating SeViLA's efficiency.
6.4.12. Impact of Prompt Design
The following table (Table 16 from the original paper) shows the impact of different localization prompts on the zero-shot Video QA performance.
| Localization Prompt | Temporal | Causal | Descriptive | Average |
|---|---|---|---|---|
| Does the frame have the information needed to answer the question correctly? | 59.9 | 61.1 | 74.2 | 62.7 |
| Does the provided frame contain the necessary information to accurately answer the given question? | 59.9 | 60.8 | 75.0 | 62.7 |
| Does the information within the frame provide the necessary details to accurately answer the given question? | 60.4 | 61.0 | 74.6 | 62.9 |

Table 16: Impact of different localization prompts on zero-shot video QA performance (NExT-QA).
- The model is relatively insensitive to slight variations in the localization prompt: performance changes are minor across the tested prompts, indicating robustness in prompt design. A sketch of how such a prompt can be turned into a per-frame relevance score follows below.
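The prompts in Table 16 frame relevance as a yes/no question for the language model. A common way to obtain a continuous score from such a prompt is to compare the model's likelihoods of answering "yes" versus "no". The sketch below shows that scoring scheme with a hypothetical `token_logprob` function standing in for the BLIP-2 language model; the two-way softmax is an illustrative assumption, not necessarily the authors' exact scoring rule.

```python
import math
from typing import Callable

LOCALIZATION_PROMPT = (
    "Does the information within the frame provide the necessary details "
    "to accurately answer the given question?"
)

def frame_relevance_score(
    frame,
    question: str,
    token_logprob: Callable[[object, str, str], float],  # hypothetical: log p(token | frame, prompt)
) -> float:
    """Score a frame as p('yes') normalized against p('no') for the localization prompt."""
    prompt = f"Question: {question} {LOCALIZATION_PROMPT}"
    yes_lp = token_logprob(frame, prompt, "yes")
    no_lp = token_logprob(frame, prompt, "no")
    # Two-way softmax keeps the score in (0, 1) regardless of absolute log-prob scale.
    return math.exp(yes_lp) / (math.exp(yes_lp) + math.exp(no_lp))

# Toy usage with a dummy scorer that prefers "yes" only for the relevant frame.
dummy_lp = lambda frame, prompt, tok: -0.1 if (frame == "relevant") == (tok == "yes") else -2.0
print(frame_relevance_score("relevant", "What is the man doing?", dummy_lp))
```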
6.5. Visualization
The following figure (Figure 7 from the original paper) shows more visualization examples from different datasets, with varying numbers of selected keyframes.

This figure is an illustration of how different keyframe selections affect video QA. Two selection strategies are shown: uniform sampling and our proposed Localizer. Red options indicate wrong answers and green options indicate correct answers; the human temporal localization annotations (time spans) are also shown.
Figure 7: Visualization of our Localizer. We show various keyframe amounts in these examples. We use the zero-shot Answerer with different frame sampling (uniform vs. Localizer) to answer the question. Red options are answered wrongly with uniformly sampled frames. Green options are answered correctly with our Localizer. Best viewed in color.
- The visualizations demonstrate that SeViLA's Localizer consistently identifies relevant frames that align well with human annotations, regardless of the number of keyframes selected.
- This accurate keyframe localization directly leads to correct answers from the Answerer, while uniform sampling often results in incorrect responses. This provides intuitive qualitative evidence for the Localizer's effectiveness.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces SeViLA (Self-Chained Video Localization-Answering), a novel framework designed to enhance video-language understanding by addressing the limitations of uniform frame sampling and the high cost of temporal annotations. SeViLA effectively adapts a single image-language model (BLIP-2) into two specialized modules: a Localizer for language-aware temporal keyframe localization and an Answerer for question answering on videos.
The framework's core innovation lies in its self-chaining mechanism:
- Forward Chain: The Localizer selects language-aware keyframes, which are then fed to the Answerer to predict answers, enabling focused understanding.
- Reverse Chain: The Answerer generates keyframe pseudo-labels to refine the Localizer, significantly reducing the dependency on expensive temporal grounding annotations.

SeViLA achieves state-of-the-art performance across challenging video QA and event prediction benchmarks in both fine-tuning and zero-shot settings. Extensive ablation studies confirm the effectiveness of its Localizer, the benefits of pseudo-labeling and pre-training, and the generalizability of the framework. The work highlights that language-aware temporal localization is crucial for video-language tasks and can be achieved efficiently by repurposing powerful image-LMs.
7.2. Limitations & Future Work
The authors acknowledge the following limitations of the SeViLA framework:
- Frame-level Localization for Fine-grained Events: While effective for many tasks, the Localizer performs frame-level keyframe localization. This might not be sufficient for very complex or fine-grained temporal events (e.g., distinguishing "opening a door" from "closing a door") where subtle temporal nuances are critical.
- Future Work Direction: To address this limitation, the authors suggest exploring structured prediction for temporal localization that goes beyond simple frame-level identification. This could involve predicting temporal spans or event sequences with higher precision.

The paper also discusses broader impacts related to large image-language models:

- Societal Biases: Since SeViLA leverages a large image-language model (BLIP-2) pre-trained on massive internet-scale data, it may occasionally produce unexpected or inappropriate responses, including reflections of societal biases related to gender, race, or sexuality, similar to other large models.
- Mitigation: The authors emphasize the need for further studies to evaluate and mitigate such negative biases and toxic output in large image-language models.
7.3. Personal Insights & Critique
SeViLA presents a very insightful and practical approach to bridging the gap between image-language models and video-language understanding. The core idea of repurposing a powerful image-LM like BLIP-2 for both localization and question answering is highly efficient, capitalizing on existing strong representations rather than building computationally expensive video-LMs from scratch.
Key Strengths:
- Annotation Efficiency: The reverse chain with pseudo-labeling is a significant contribution. It elegantly tackles the prohibitive cost of temporal grounding annotations, which is a major bottleneck in video-language research. This label-efficient strategy makes SeViLA highly scalable and practical.
- Intelligent Temporal Modeling: Moving beyond uniform frame sampling to language-aware keyframe localization is a crucial step for effective video understanding. The Localizer acts as a selective attention mechanism, guiding the Answerer to focus on relevant information, which aligns well with human cognitive processes.
- Strong Empirical Performance: Achieving SOTA results in both fine-tuning and zero-shot settings across multiple benchmarks is compelling evidence of the framework's effectiveness. The zero-shot performance, in particular, highlights the strong generalization capabilities derived from the BLIP-2 backbone and the self-chaining design.
- Generalizability: The successful extension to MiniGPT4 suggests that the self-chaining scheme is generalizable across different image-LM backbones, making it a robust paradigm.
Potential Issues/Areas for Improvement:
- Limitations of Frame-level Localization: As the authors note, frame-level localization might struggle with nuanced, fast-changing temporal events. While they propose structured prediction as future work, the current implementation might miss the temporal relationships between the selected frames if they are too sparse. For instance, understanding a sequence of actions like "pick up, then place down" might be harder if only the 'pick up' and 'place down' frames are selected, but not the transitional frames.
- Multi-frame Localizer Performance: The observation that a multi-frame Localizer performed worse than a single-frame Localizer (Table 12) is interesting. While attributed to BLIP-2's lack of video pre-training, it suggests that merely concatenating frames for a Q-Former doesn't automatically induce strong temporal reasoning. Future work could explore more sophisticated temporal aggregation mechanisms within the Q-Former itself, or specialized, parameter-efficient temporal attention layers.
- Dependence on LLM Prompt Sensitivity: Although the paper reports insensitivity to prompt changes (Table 16), LLMs can sometimes be highly sensitive to subtle prompt engineering. While the tested prompts show robustness, the reliance on an LLM for localization scoring could still introduce fragility in other contexts or with more extreme prompt variations.
- Interpretability of the Q-Former: The Q-Former acts as a black box that transforms visual features into LLM-compatible soft prompts. Understanding what information it prioritizes and how it compresses visual data could lead to more targeted improvements.
Transferability and Future Value:
The SeViLA framework has high transferability. Its core idea of self-chained modularity and pseudo-labeling could be applied to other multimodal tasks beyond VQA, such as video summarization, event detection, or even multimodal content generation, where temporal grounding is essential but annotations are scarce. This approach could inspire future research in more efficient adaptation of large foundation models for domain-specific or data-scarce multimodal tasks. The concept of having a specialized Localizer to curate inputs for a powerful Answerer is a generalizable paradigm that could extend to other modalities where input relevance is a challenge (e.g., long audio, complex sensor data).