What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
TL;DR Summary
This paper introduces `VISTA`, a dataset for video-to-text summarization of scientific presentations, featuring 18,599 AI conference videos and corresponding abstracts. It benchmarks state-of-the-art models and applies a plan-based framework to enhance summary quality. A notable performance gap nonetheless remains between models and humans, underscoring the dataset's difficulty.
Abstract
Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
1.2. Authors
The paper is authored by Dongqi Liu, Chenxi Whitehouse, Xi Yu, Louis Mahon, Rohit Saxena, Zheng Zhao, Yifu Qiu, Mirella Lapata, and Vera Demberg.
Their affiliations include:
- Saarland University
- Max Planck Institute for Informatics (MPII)
- University of Cambridge
- University of Edinburgh
1.3. Journal/Conference
This paper is an arXiv preprint, published on 2025-02-12. The VISTA dataset itself is collected from leading conferences in computational linguistics (ACL Anthology, including ACL, EMNLP, NAACL, EACL, Findings of ACL) and machine learning (ICML and NeurIPS).
1.4. Publication Year
2025
1.5. Abstract
The paper introduces VISTA, a novel dataset designed for video-to-text summarization in scientific domains. It comprises 18,599 recorded AI conference presentations, each paired with its corresponding paper abstract. The authors benchmark state-of-the-art (SOTA) large multimodal models (LMMs) on VISTA and propose a plan-based framework to leverage the structured nature of scientific abstracts. Both human and automated evaluations confirm that this explicit planning significantly enhances summary quality and factual consistency. Despite these advancements, a substantial performance gap persists between models and human capabilities, underscoring the inherent challenges of the dataset. The study aims to stimulate future research in scientific video-to-text summarization.
1.6. Original Source Link
- Abstract/Landing Page: https://arxiv.org/abs/2502.08279
- PDF Link: https://arxiv.org/pdf/2502.08279v4.pdf

The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the challenge of transforming recorded videos, particularly scientific presentations, into concise and accurate textual summaries. While Large Multimodal Models (LMMs) have made significant progress in video-to-text summarization for general content (like YouTube, movies, news), they often exhibit reduced performance in specialized scientific contexts. This reduction is attributed to their struggles with technical terminology, scientific visual elements (figures, tables), and the absence of specialized datasets for multimodal scientific content. The growing need for efficient information extraction from an increasing volume of scientific video content (e.g., conference talks) makes this problem particularly important.
The paper's entry point is to fill this dataset gap by introducing VISTA, a large-scale multimodal dataset specifically tailored for scientific video summarization. It also explores a structured approach (plan-based framework) to overcome the limitations of end-to-end models in capturing the well-defined structure of scientific abstracts.
2.2. Main Contributions / Findings
The primary contributions and findings of this paper are:
- VISTA Dataset: The introduction of VISTA, a novel, large-scale multimodal dataset containing 18,599 video-summary pairs specifically for summarizing scientific presentations from video recordings. The summaries are the corresponding paper abstracts.
- Comprehensive Benchmarking: Establishment of benchmark performance on VISTA through extensive evaluation of leading large language models (LLMs), audio-based models, and multimodal models in zero-shot, QLoRA fine-tuning, and full fine-tuning settings.
- Plan-Based Approach: The application and validation of a plan-based approach that consistently improves summary quality and factual accuracy over SOTA models. This method leverages the structured nature of scientific abstracts by generating intermediate plans (sequences of questions) to guide summary generation.
- Error Analysis and Human Evaluation: Detailed error analysis, case studies, and human evaluations that confirm the efficacy of the plan-based method and identify critical issues in model-generated summaries.
- Performance Gap: Despite these advancements, a considerable gap remains between model performance (even with the plan-based approach) and human performance, indicating the challenging nature of the VISTA dataset and the task itself.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following foundational concepts:
- Multimodal Learning: An area of machine learning that deals with data from multiple modalities (e.g., text, image, audio, video). The goal is to build models that can process and relate information from these diverse sources to achieve a richer understanding or perform complex tasks.
- Large Multimodal Models (LMMs): Advanced AI models that integrate and process information from various modalities. They typically combine components such as large language models (LLMs) with visual encoders (for images/video) and audio encoders (for sound). These models are trained to understand complex relationships across modalities, enabling tasks such as multimodal summarization, visual question answering, and multimodal generation.
- Video-to-Text Summarization: The task of taking a video as input and generating a concise, coherent, and accurate textual summary that captures the main content or events of the video. This often involves processing visual frames, audio tracks, and sometimes speech transcripts.
- Scientific Text Summarization: A specialized form of text summarization focused on scholarly documents (e.g., research papers, articles). This task is challenging due to the technical jargon, complex sentence structures, and domain-specific knowledge required to accurately condense scientific information while preserving factual correctness and key findings.
- Plan-based Summarization: A strategy in natural language generation where an explicit intermediate "plan" or "content structure" is first generated, and then this plan guides the actual text generation process. This contrasts with end-to-end generation, where a model directly produces text from input without explicit intermediate guidance. Plans can be a sequence of topics, questions, or keywords, aiming to improve coherence, factual consistency, and control over the generated summary's structure.
- Zero-shot Learning: A machine learning paradigm where a model is trained on a set of tasks or classes and then evaluated on entirely new tasks or classes it has not seen during training, without any further fine-tuning. It relies on the model's ability to generalize from its prior knowledge.
- QLoRA Fine-tuning: A method for efficiently fine-tuning large language models. QLoRA (Quantized Low-Rank Adapters) quantizes a pre-trained LLM to 4-bit precision and then fine-tunes a small set of LoRA (Low-Rank Adaptation) adapters on top of the quantized model. This significantly reduces memory requirements and computational costs during fine-tuning while retaining high performance, making it feasible to fine-tune very large models on consumer-grade hardware (a hedged configuration sketch follows this list).
- Full-Parameter Fine-tuning: The traditional approach to fine-tuning a pre-trained model, where all parameters of the model are updated during training on a new, task-specific dataset. This typically requires substantial computational resources (GPU memory, processing power) but can yield the highest performance on the target task.
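To make the QLoRA concept concrete, here is a minimal, hedged sketch using the Hugging Face transformers/peft/bitsandbytes stack. The model name, rank, and other hyperparameters are illustrative placeholders, not the configuration used in the paper.

```python
# Illustrative QLoRA setup: 4-bit quantized base model + small trainable LoRA adapters.
# Model name and LoRA hyperparameters are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bf16
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.1,  # low-rank adapter settings (illustrative)
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()          # only the small LoRA adapters are trainable
```

The key design point is that the frozen, quantized backbone keeps memory low while the adapters carry all task-specific updates.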
3.2. Previous Works
The paper contextualizes its work by discussing existing research in three main areas:
- Video-to-Text Summarization: This field focuses on generating summaries from videos by integrating multimodal information. Datasets such as MSS, VideoXum, MMSum, Hierarchical3D, and LfVS-T support tasks ranging from instructional videos to general web content. Technical advancements include hierarchical attention models, extractive methods using multimodal features, hybrid extractive-abstractive frameworks, and Transformer-based systems. However, the paper notes that academic video summarization remains underexplored.
  - Example: Transformers are a neural network architecture that has revolutionized natural language processing (NLP) and other fields. They rely on a mechanism called self-attention, which lets the model weigh the importance of different parts of the input sequence when processing each element (a minimal NumPy sketch follows this list): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
    - $QK^T$ computes the dot-product similarity between queries and keys.
    - $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the key vectors, used to prevent the dot products from becoming too large.
    - The softmax function normalizes the scores into attention weights.
    - These weights are then multiplied by the $V$ matrix to produce the output.
- Scientific Text Summarization: This area aims to condense complex scholarly content into concise formats. Relevant datasets include TalkSumm (for academic video transcripts), SumSurvey (for survey papers), ACLSum (for ACL discourse), and SciNews (for simplifying research for broader audiences). Methods such as RSTLoRA and RSTformer improve discourse-aware and structural summarization, while CiteSum and SSR focus on scalability and audience-specific customization. The paper acknowledges that the complexity and diversity of scholarly texts make this a challenging domain.
- Plan-based Summarization: This technique uses structured representations to improve summary quality and reduce hallucinations. Research primarily focuses on text-based planning using elements such as entities, keyword prompts, and question-answer pairs. Examples include PlanVerb (converting task plans to natural language) and blueprint-based frameworks for visual storytelling. The paper highlights that plan-based strategies for multimodal tasks, especially video-to-text summarization, have received limited attention.
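To ground the attention formula above, here is a minimal NumPy sketch of scaled dot-product attention; the toy inputs are arbitrary and the code is for intuition only, not a full Transformer layer.

```python
# Minimal sketch of scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the key dimension
    return weights @ V                                # weighted sum of value vectors

# Toy example: 3 query/key/value vectors of dimension 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (3, 4)
```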
3.3. Technological Evolution
The field of video-to-text summarization has evolved from early methods relying on basic multimodal feature extraction and sequence-to-sequence models to increasingly sophisticated LMMs. Initially, models might have processed each modality (video frames, audio, transcripts) somewhat independently before attempting to fuse them. The rise of Transformer architectures and self-attention mechanisms enabled more effective cross-modal alignment training, allowing LMMs to integrate information from different modalities more seamlessly.
However, a key challenge in this evolution has been the domain specificity of content. While general-purpose LMMs perform well on everyday videos, they often fall short in specialized domains like science due to a lack of relevant training data and the unique characteristics of scientific content (e.g., dense information, technical jargon, specific visual cues). This paper's work on VISTA represents a crucial step in this evolution by providing a specialized dataset to bridge this gap, allowing LMMs to be specifically trained and adapted for scientific video-to-text summarization. The introduction of a plan-based framework further refines the generation process, moving beyond purely end-to-end approaches to incorporate explicit structural guidance, which is particularly beneficial for domains like science where content often follows conventional structures (e.g., IMRaD structure for scientific papers).
3.4. Differentiation Analysis
The paper differentiates its approach from previous work in several key ways:
- Dataset Specialization: Unlike many existing video-to-text summarization datasets that focus on open-domain, news, or activity videos (MSS, VideoXum, MMSum), VISTA is specifically tailored to summarizing scientific presentations. This addresses a critical gap in datasets for multimodal scientific content, with which LMMs currently struggle.
- Dataset Scale and Characteristics: VISTA is a large-scale dataset (18,599 samples) featuring longer inputs (average 6.8 minutes) and longer summaries (average 192.6 tokens) than many general-purpose datasets, making it more representative of real-world scientific presentations.
- Structured Summarization for Scientific Abstracts: The paper introduces a plan-based framework adapted for video-to-text summarization of scientific content. While plan-based summarization exists for text, its application to multimodal video inputs, especially in scientific domains, is under-explored. The approach explicitly models the latent structure of scientific abstracts, which often follow a well-defined format, by using intermediate plans (sequences of questions) to guide generation. This is a core innovation compared to direct end-to-end generation, which often struggles with structural coherence and factual grounding in complex domains.
- Multimodal Integration and Benchmarking: The study rigorously benchmarks various SOTA large models (closed-source LMMs, open-source video-specific LMMs, text-based LLMs, audio-based models) on the new VISTA dataset. This comprehensive evaluation provides crucial insights into the strengths and weaknesses of different model architectures and modalities in a scientific context.
4. Methodology
4.1. Principles
The core idea of the method used in this paper is to leverage the inherent structure of scientific abstracts to improve the quality and factual consistency of video-to-text summaries. The intuition is that directly mapping a video to a summary (end-to-end approach) can lead to incoherent or unfaithful outputs, especially for complex scientific content. By introducing an intermediate "plan," the generation process can be guided, ensuring the summary follows a logical flow and addresses key aspects of the scientific presentation. This plan-based framework is inspired by the Question Under Discussion (QUD) theory, which posits that discourse is often structured around questions that guide conversation.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology revolves around the VISTA dataset and a plan-based summarization approach.
4.2.1. VISTA Dataset Construction
The VISTA dataset is central to this work. It comprises 18,599 aligned pairs of conference presentation recordings and their corresponding paper abstracts.
- Data Acquisition:
- Sources: Data is collected from leading conferences in computational linguistics (ACL Anthology, including ACL, EMNLP, NAACL, EACL, Findings of ACL) and machine learning (ICML and NeurIPS).
- Timeframe: Content from 2020 to 2024.
- Content: Paper abstracts and video recordings are contributed by the respective paper authors, ensuring narrative consistency.
- Metadata: Paper titles, author lists, paper abstracts, links to papers, and presentation videos are collected from XML/JSON files on conference websites, avoiding the need for PDF extraction.
- Permissions: All materials are publicly accessible, and the authors obtained written confirmation for use from ICML and NeurIPS, adhering to copyright regulations.
- Quality Control:
- Exclusions: Samples covering multiple papers (e.g., tutorials, invited talks) and videos shorter than one minute or longer than 30 minutes are excluded to maintain one-to-one video-to-text alignments.
- Manual Checks: Two Ph.D. candidates manually reviewed 500 randomly selected video-summary pairs for accuracy and conciseness, performing binary judgments (Valid/Invalid). No samples were rejected.
- Automated Checks: GPT-o1 was used for automated assessment across all data samples. It flagged 39 potentially invalid pairs, which, upon further manual review, were confirmed as valid and retained.
- Data Splits: The dataset is split into training (80%), validation (10%), and test (10%) sets, with proportional sampling to ensure balanced domain coverage in each subset.
- Statistics:
-
Average Video Length: 6.76 minutes
-
Average Shots per Video: 16.36 (calculated using
PySceneDetectwithContentDetector) -
Average Summary Sentences: 7.19
-
Average Summary Tokens: 192.62
-
Average Depth of Dependency Tree: 6.02 (indicating syntactic complexity, calculated using
spaCy) -
Type-Token Ratio (TTR): 0.62 (reflecting lexical diversity)
-
Diversity Metrics (unique n-grams): Distinct-1 (0.62), Distinct-2 (0.93), Distinct-3 (0.97)
The following figure (Figure 1 from the original paper) illustrates an example from the VISTA dataset:
The image is an illustration showing slides from several segments of a presentation video, highlighting factual knowledge, PopQA, scale, and the complementarity of parametric and non-parametric memory; the content discusses the capabilities and challenges of large language models when processing information.
As seen above, the dataset pairs a conference presentation video (top) with the abstract of the corresponding paper (bottom). This example (Mallen et al., 2023) was presented at ACL 2023.
The following figure (Figure 2 from the original paper) shows the venue distribution of the VISTA dataset:
The image is a pie chart showing the distribution of venues in the VISTA dataset: NeurIPS accounts for 51%, followed by ICML (13%) and ACL (11%), with the remaining venues making up comparatively small shares.
The pie chart above shows that NeurIPS contributes the largest portion of the dataset (51%), followed by ICML (13%) and ACL (11%).
The following figure (Figure 3 from the original paper) visualizes key attributes of the VISTA dataset:
The image is a chart with four histograms showing the distributions of summary sentence counts, summary token counts, video durations (in minutes), and video shot counts in the VISTA dataset, illustrating the frequency characteristics of each attribute.
The histograms above display the distributions of summary sentences, summary tokens, video durations, and video shots, indicating that most summaries are under 250 tokens and 10 sentences, and most videos are under 10 minutes with fewer than 30 shots.
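The statistics reported above (shots per video, dependency-tree depth, type-token ratio, distinct-n) can be approximated with standard tooling; below is a minimal sketch, assuming PySceneDetect and spaCy are installed. The file path and model name are placeholders, and the exact thresholds may differ from the authors' pipeline.

```python
# Hedged sketch of how the reported dataset statistics could be computed.
# "talk.mp4" and the spaCy model name are placeholders.
import spacy
from scenedetect import detect, ContentDetector

def shot_count(video_path: str) -> int:
    # Number of shots detected by PySceneDetect's ContentDetector.
    return len(detect(video_path, ContentDetector()))

def dependency_depth(token) -> int:
    # Depth of the dependency subtree rooted at `token`.
    return 1 + max((dependency_depth(c) for c in token.children), default=0)

def summary_stats(text: str, nlp) -> dict:
    doc = nlp(text)
    tokens = [t.text.lower() for t in doc if not t.is_space]
    depths = [dependency_depth(sent.root) for sent in doc.sents]
    def distinct_n(n: int) -> float:
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        return len(set(ngrams)) / max(len(ngrams), 1)
    return {
        "sentences": sum(1 for _ in doc.sents),
        "tokens": len(tokens),
        "avg_dep_depth": sum(depths) / max(len(depths), 1),
        "ttr": len(set(tokens)) / max(len(tokens), 1),
        "distinct_1": distinct_n(1),
        "distinct_2": distinct_n(2),
    }

nlp = spacy.load("en_core_web_sm")
print(summary_stats("We introduce VISTA, a dataset for video-to-text summarization.", nlp))
# print(shot_count("talk.mp4"))  # requires an actual video file
```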
4.2.2. Plan-Based Summarization Framework
The task is formalized as learning a conditional probability distribution $P(s \mid v)$, where $v$ is a video (or its transcript/audio) and $s$ is its summary. The plan-based framework extends this by introducing an intermediate representation $p$ (the plan).
- Task Formalization: Given a dataset $D = \{(v_i, s_i)\}_{i=1}^{N}$, the objective is to train a model to learn the conditional probability distribution: $ P(s \mid v) $ Where:
  - $v$: the input video (or its derived modalities such as transcript/audio).
  - $s$: the corresponding textual summary (the paper abstract).
  - $N$: the total number of video-summary pairs in the dataset.
- Introduction of a Plan: To address the challenge of structuring summaries coherently and faithfully, an intermediate representation, a plan $p$, is introduced. The model's objective becomes learning the extended conditional distribution: $ P(s \mid v, p) $ That is, the summary $s$ is generated conditioned on both the input video $v$ and the plan $p$.
- Plan Structure: The plan $p$ consists of a sequence of automatically generated questions $(q_1, \ldots, q_n)$. Each question $q_i$ corresponds to a sentence to be verbalized in the summary: the plan explicitly controls the overall summary structure, and the content of each sentence is designed to answer its corresponding question. This concept is inspired by the Question Under Discussion (QUD) theory.
- Plan Generation (PG) Module:
  - Generation Process: GPT-o1 is leveraged to generate silver-standard plans. These plans are generated from the reference summary sentences and their preceding context: question $q_i$ is produced considering the target sentence $s_i$ and the preceding summary sentences $s_1, \ldots, s_{i-1}$. This ensures that the question sequence preserves the order of sentences in the reference summaries, maintaining a natural and coherent flow.
  - Prompt: The prompt used for GPT-o1 to generate plan questions specifies the context (Previous-Context) and the target sentence for question formulation, in line with QUD requirements.

The following figure (Figure 4 from the original paper) illustrates how GPT-o1 generates plans based on reference summaries:

The image is a diagram showing how GPT-o1 generates a plan from a reference summary; it lists five questions related to language models (LMs) and shows the logical mapping from the generated plan to the corresponding content.

As shown above, GPT-o1 takes a target summary sentence $s_i$ and its preceding sentences $(s_1, \ldots, s_{i-1})$ as input to generate the corresponding plan question $q_i$.
- Summarization Model (SG) Module:
  - Architecture: Both the Plan Generation (PG) and Summary Generation (SG) modules share the same backbone architecture but are trained independently.
  - Training PG: The PG module is trained on (v, p) pairs, learning to generate a plan from the video input.
  - Training SG: The SG module is trained on tuples ([v; p], s), where [v; p] is the concatenation of the input video $v$ and its plan $p$. This module learns to generate a summary given the video and the guiding plan.
  - Inference: During inference, the trained PG module first predicts a plan $\hat{p}$ for a given input video $v$. Then, the tuple $[v; \hat{p}]$ is fed into the SG module to generate the final summary. This two-stage approach explicitly guides summary generation, aiming to improve structural coherence and factual grounding compared to purely end-to-end models (a schematic sketch of this two-stage flow follows below).
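Conceptually, the pipeline can be read as factoring $P(s \mid v)$ through the plan: the PG module models $P(p \mid v)$ and the SG module models $P(s \mid v, p)$. Below is a minimal sketch of that two-stage inference flow. The `PlanGenerator` and `SummaryGenerator` classes are hypothetical placeholders standing in for the fine-tuned multimodal backbone, not the authors' implementation.

```python
# Schematic two-stage plan-then-summarize inference, as described above.
# Both classes are placeholders; in practice they wrap the same fine-tuned backbone.
from typing import List


class PlanGenerator:
    """PG module: maps video features to a plan, i.e., a sequence of guiding questions."""

    def generate_plan(self, video_features) -> List[str]:
        # A real system would decode questions with the fine-tuned model; this is a stub.
        return [
            "What problem is being addressed?",
            "What method is proposed?",
            "What are the main results?",
        ]


class SummaryGenerator:
    """SG module: maps [video; plan] to the final summary."""

    def generate_summary(self, video_features, plan: List[str]) -> str:
        # Each generated sentence is intended to answer its corresponding question.
        return " ".join(f"<sentence answering: {q}>" for q in plan)


def plan_based_summarize(video_features, pg: PlanGenerator, sg: SummaryGenerator) -> str:
    plan = pg.generate_plan(video_features)           # stage 1: approximate P(p | v)
    return sg.generate_summary(video_features, plan)  # stage 2: approximate P(s | v, p)


print(plan_based_summarize(object(), PlanGenerator(), SummaryGenerator()))
```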
5. Experimental Setup
5.1. Datasets
The primary dataset used for all experiments is VISTA.
- VISTA Dataset:
  - Source: Computational linguistics (ACL Anthology: ACL, EMNLP, NAACL, EACL, Findings of ACL) and machine learning (ICML, NeurIPS) conferences from 2020 to 2024.
  - Scale: 18,599 video-summary pairs.
  - Characteristics: Each pair consists of a recorded AI conference presentation video and its corresponding paper abstract. Videos have an average length of 6.76 minutes, and summaries (abstracts) have an average length of 192.6 tokens. The abstracts serve as the ground-truth summaries.
  - Domain: Scientific/academic content, specifically AI research.
  - Purpose: Chosen because LMMs show reduced performance in scientific contexts and specialized datasets for multimodal scientific content were lacking. VISTA provides a challenging benchmark that reflects real-world academic content.

The dataset is split into:
- Training set: 14,881 samples (80%)
- Validation set: 1,859 samples (10%)
- Test set: 1,859 samples (10%)

These splits are proportionally sampled to ensure balanced domain coverage across all subsets.
5.2. Evaluation Metrics
The paper employs a suite of automatic evaluation metrics to measure informativeness, alignment, and factual consistency, along with human evaluations.
- Informativeness Metrics:
  - ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004): Measures the overlap of n-grams (sequences of n words) between a machine-generated summary and one or more human-written reference summaries, quantifying how much information from the references is captured by the generated summary. The paper reports F1 scores for Rouge-1, Rouge-2, and Rouge-LSum. (A toy ROUGE-1 implementation appears after this metrics list.)
    - Conceptual Definition: ROUGE is a set of metrics for evaluating automatic summarization and machine translation. It compares an automatically produced summary or translation against a set of human-written reference summaries and counts overlapping units such as n-grams, word sequences, or word pairs.
    - Mathematical Formula (ROUGE-N precision, recall, F1): Let $S$ be the candidate summary and $R_1, \ldots, R_m$ the reference summaries. $ \mathrm{ROUGE\text{-}N}_{\mathrm{recall}} = \frac{\sum_{i=1}^{m} \sum_{\text{n-gram} \in R_i} \mathrm{Count}_{\mathrm{match}}(\text{n-gram})}{\sum_{i=1}^{m} \sum_{\text{n-gram} \in R_i} \mathrm{Count}(\text{n-gram})} $ $ \mathrm{ROUGE\text{-}N}_{\mathrm{precision}} = \frac{\sum_{\text{n-gram} \in S} \mathrm{Count}_{\mathrm{match}}(\text{n-gram})}{\sum_{\text{n-gram} \in S} \mathrm{Count}(\text{n-gram})} $ $ \mathrm{ROUGE\text{-}N}_{\mathrm{F1}} = \frac{2 \cdot \mathrm{ROUGE\text{-}N}_{\mathrm{precision}} \cdot \mathrm{ROUGE\text{-}N}_{\mathrm{recall}}}{\mathrm{ROUGE\text{-}N}_{\mathrm{precision}} + \mathrm{ROUGE\text{-}N}_{\mathrm{recall}}} $
    - Symbol Explanation:
      - $S$: the automatically generated (candidate) summary.
      - $R_i$: the $i$-th human-written reference summary.
      - $m$: the total number of reference summaries.
      - $\mathrm{Count}(\text{n-gram})$: the number of times a specific n-gram appears in a summary.
      - $\mathrm{Count}_{\mathrm{match}}(\text{n-gram})$: the number of times an n-gram occurs in both the candidate summary and a reference summary (clipped to its maximum count in any single reference).
      - ROUGE-1 uses unigrams (single words) and ROUGE-2 uses bigrams (two-word sequences). ROUGE-LSum (L for Longest Common Subsequence) measures the longest common subsequence match, computed at the sentence level and aggregated over the summary.
  - SacreBLEU (Post, 2018): Assesses linguistic consistency and fluency between generated and reference texts. It is a standardized version of the BLEU (Bilingual Evaluation Understudy) metric, designed to ensure reproducible scores.
    - Conceptual Definition: SacreBLEU is an automatic metric for evaluating machine-generated text, particularly in machine translation and summarization. It compares the generated text to one or more high-quality reference texts and computes a score based on n-gram overlap, with a penalty for overly short outputs.
    - Mathematical Formula (BLEU, the basis of SacreBLEU): $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ Where:
      - $p_n$: the n-gram precision for n-grams of length $n$: $ p_n = \frac{\sum_{\text{sentence} \in \text{candidate}} \sum_{\text{n-gram} \in \text{sentence}} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{\text{sentence} \in \text{candidate}} \sum_{\text{n-gram} \in \text{sentence}} \mathrm{Count}(\text{n-gram})} $
      - $w_n$: positive weights for each n-gram precision (typically $w_n = 1/N$).
      - $N$: the maximum n-gram length considered (commonly 4).
      - $\mathrm{BP}$ (Brevity Penalty): a factor that penalizes candidate texts that are too short relative to the references: $ \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} $
    - Symbol Explanation:
      - $c$: length of the candidate summary.
      - $r$: effective reference corpus length (the reference length closest to the candidate length).
      - $\mathrm{Count}(\text{n-gram})$: count of n-gram occurrences in the candidate summary.
      - $\mathrm{Count}_{\mathrm{clip}}(\text{n-gram})$: count of n-gram occurrences in the candidate, clipped to the maximum count in any single reference.
  - METEOR (Metric for Evaluation of Translation with Explicit Ordering) (Banerjee and Lavie, 2005): Calculates the harmonic mean of unigram precision and recall, placing greater emphasis on recall for a balanced evaluation. It also matches synonyms and stemmed words via external resources such as WordNet.
    - Conceptual Definition: METEOR is an automatic evaluation metric for machine translation and summarization that addresses some limitations of BLEU. It computes explicit word-to-word matches between the candidate and reference texts, using stemming, synonymy, and paraphrasing; it then calculates a generalized F-mean with recall weighted more heavily than precision and applies a penalty for fragmentation.
    - Mathematical Formula (simplified): $ \mathrm{METEOR} = (1 - \mathrm{Penalty}) \cdot F_{\mathrm{mean}} $ $ F_{\mathrm{mean}} = \frac{10 \cdot P \cdot R}{R + 9 \cdot P} $ $ P = \frac{\text{matched words}}{\text{words in candidate}} $ $ R = \frac{\text{matched words}}{\text{words in reference}} $
    - Symbol Explanation:
      - $P$: precision, the ratio of matched words to words in the candidate summary.
      - $R$: recall, the ratio of matched words to words in the reference summary.
      - $\mathrm{Penalty}$: a fragmentation penalty based on how many contiguous matched chunks the alignment splits into (fewer, longer chunks incur a smaller penalty).
  - BERTScore (Zhang et al., 2020): Uses contextual embeddings from BERT (Bidirectional Encoder Representations from Transformers) to evaluate semantic similarity between texts. It computes similarity scores over token embeddings, allowing more nuanced comparisons than n-gram overlap. (A schematic greedy-matching sketch appears after this metrics list.)
    - Conceptual Definition: BERTScore leverages contextual embeddings from pre-trained BERT models to compute a similarity score between a candidate and a reference text. Instead of exact string matching, it measures semantic similarity via cosine similarity between the BERT embeddings of tokens in both texts, then aggregates these similarities (greedy matching) into precision, recall, and F1 scores.
    - Mathematical Formula (F1 score): Let $x_1, \ldots, x_k$ be the tokens of the candidate text and $y_1, \ldots, y_l$ the tokens of the reference text, with contextual BERT embeddings $\mathbf{e}_{x_i}$ and $\mathbf{e}_{y_j}$. $ P = \frac{1}{k} \sum_{i=1}^{k} \max_{j=1}^{l} \mathrm{cosine}(\mathbf{e}_{x_i}, \mathbf{e}_{y_j}) $ $ R = \frac{1}{l} \sum_{j=1}^{l} \max_{i=1}^{k} \mathrm{cosine}(\mathbf{e}_{x_i}, \mathbf{e}_{y_j}) $ $ F_1 = \frac{2 \cdot P \cdot R}{P + R} $
    - Symbol Explanation:
      - $k$: number of tokens in the candidate text; $l$: number of tokens in the reference text.
      - $\mathbf{e}_{x_i}$: contextual BERT embedding of the $i$-th candidate token; $\mathbf{e}_{y_j}$: contextual BERT embedding of the $j$-th reference token.
      - $\mathrm{cosine}$: cosine similarity between two vectors.
      - $P$: precision, averaging each candidate token's maximum similarity to any reference token; $R$: recall, averaging each reference token's maximum similarity to any candidate token.
  - CIDEr-D (Consensus-based Image Description Evaluation - Discriminative) (Vedantam et al., 2015): Evaluates the consensus between generated summaries and references using TF-IDF (Term Frequency-Inverse Document Frequency) weighted n-grams, combined with a length-based penalty and count clipping to reduce the impact of repeated terms. Originally proposed for image captioning.
    - Conceptual Definition: CIDEr-D measures the quality of generated captions (or summaries) by computing the cosine similarity between TF-IDF-weighted n-gram vectors of the candidate text and a set of reference texts. It assigns higher scores to outputs consistent with human consensus and penalizes common, uninformative phrases; the 'D' (Discriminative) variant adds a Gaussian length penalty and clipping to make the metric harder to game.
    - Mathematical Formula: $ \mathrm{CIDEr}_n(c_i, \mathbf{S}_i) = \exp\left(-\frac{(|c_i| - l_i)^2}{2\sigma^2}\right) \cdot \frac{1}{|\mathbf{S}_i|} \sum_{s_{ij} \in \mathbf{S}_i} \frac{\mathbf{g}^n(c_i) \cdot \mathbf{g}^n(s_{ij})}{\lVert \mathbf{g}^n(c_i) \rVert \, \lVert \mathbf{g}^n(s_{ij}) \rVert} $ The overall CIDEr-D score averages over different n-gram lengths: $ \mathrm{CIDEr}(c_i, \mathbf{S}_i) = \sum_{n=1}^{N} w_n \, \mathrm{CIDEr}_n(c_i, \mathbf{S}_i) $
    - Symbol Explanation:
      - $c_i$: the candidate summary for sample $i$; $\mathbf{S}_i = \{s_{ij}\}$: the set of reference summaries for sample $i$.
      - $\mathbf{g}^n(\cdot)$: the TF-IDF-weighted vector of n-grams of a text.
      - $|c_i|$: length of the candidate summary; $l_i$: average length of the reference summaries; $\sigma^2$: the variance parameter of the Gaussian length penalty.
      - $\lVert \cdot \rVert$: vector norm; $w_n$: weights over n-gram lengths (typically uniform).
- Alignment and Factual Consistency Metrics:
  - VideoScore (He et al., 2024): Focuses on text-to-video alignment, evaluating how accurately video content matches given text using fine-grained, multi-aspect scoring.
    - Conceptual Definition: VideoScore is an automatic metric for evaluating the alignment between text and video content. It assesses how well a textual description (or generated summary) corresponds to the visual and auditory information in a video, scoring multiple aspects of alignment to provide fine-grained feedback on text-video correspondence.
    - Mathematical Formula: The paper does not provide an explicit formula for VideoScore, referring instead to He et al. (2024); metrics of this kind typically combine scores from feature extractors and cross-modal similarity modules into a weighted aggregate.
    - Symbol Explanation: Not provided in the paper for this metric.
  - FactVC (Factual Consistency with Video Content) (Liu and Wan, 2023): Measures the factual consistency of text with video content by combining coarse-grained video-text similarity with precision-based fine-grained matching. Values are scaled to percentages (0-100).
    - Conceptual Definition: FactVC evaluates the factual correctness of a generated summary with respect to its source video, aiming to detect hallucinations, i.e., claims not supported by the video content. It combines a coarse-grained similarity check (e.g., overall topic alignment) with fine-grained, precision-focused matching of specific details.
    - Mathematical Formula: The paper refers to Liu and Wan (2023) for the exact formulation; factual-consistency metrics of this kind typically compare claims in the summary against evidence (entities and events) extracted from the video, for example via natural language inference or knowledge-graph matching.
    - Symbol Explanation: Not provided in the paper for this metric.
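For intuition about the overlap- and embedding-based metrics above, here are two toy sketches. They are simplified illustrations, not the official implementations; real evaluations should use established packages such as rouge-score and bert-score.

```python
# Toy ROUGE-1 computation against a single reference (clipped unigram overlap).
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)   # clipped matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the model summarizes the talk", "the model summarizes the presentation"))
```

The second sketch mimics BERTScore's greedy matching; random vectors stand in for contextual BERT embeddings, which is the only part that differs conceptually from the real metric.

```python
# Schematic BERTScore-style greedy matching over token embeddings.
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                      # cosine similarity matrix (k x l)
    precision = sim.max(axis=1).mean()      # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()         # each reference token -> best candidate token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
print(bertscore_f1(rng.normal(size=(5, 8)), rng.normal(size=(6, 8))))
```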
5.3. Baselines
The paper benchmarks its method against a variety of state-of-the-art (SOTA) models, categorized by their modality and training approach. These baselines are chosen to represent the current capabilities of different LMM and LLM paradigms.
-
Zero-shot Learning Baselines: These models are tested without any fine-tuning on the VISTA dataset.
  - Closed-source Multimodal Models: GPT-o1 (Achiam et al., 2023), Gemini 2.0 (Team et al., 2023), Claude 3.5 Sonnet (Anthropic, 2024)
  - Open-source Video LMMs: These models process videos by extracting multimodal features (visual and/or audio) and using cross-modal attention to align and integrate information: Video-LLaMA (Zhang et al., 2023), Video-ChatGPT (Maaz et al., 2024), Video-LLaVA (Lin et al., 2024a), LLaMA-VID (Li et al., 2024c), LLaVA-NeXT-Interleave (Li et al., 2025), mPLUG-Owl3 (Ye et al., 2025)
  - Text- and Audio-based Models: Included to assess performance without direct video information.
    - LLaMA-3.1 (Touvron et al., 2023): LLaMA-3.1 transcript takes as input the audio transcribed from the video with OpenAI's Whisper-1; LLaMA-3.1 OCR takes as input the on-screen text extracted from video frames with EasyOCR.
    - Qwen2-Audio (Chu et al., 2024): Input is the audio track extracted from the video using moviepy.
-
Fine-tuning Baselines (QLoRA and Full Fine-tuning): For the open-source models, performance is also evaluated after QLoRA fine-tuning and full-parameter fine-tuning on the VISTA training set.
Plan-based Models:
  - Plan-mPLUG-Owl3: The plan-based approach built on the mPLUG-Owl3 model, which was identified as the best-performing open-source model.
  - Plan-mPLUG-Owl3*: A variant for zero-shot inference where only the Plan Generation (PG) module is fine-tuned and the generated plans are fed to the SG module.
5.4. Experimental Setup Details
- Hyperparameters:
- Optimizer:
AdamW (Loshchilov and Hutter, 2019)
- β₁: 0.9
- β₂: 0.999
- ε: 1e-9
- Warm-up Ratio: 0.15
- Initial Learning Rate: 5e-5
- Learning Rate Scheduling: Cosine
- DeepSpeed Configuration:
ZeRO-3 Offload - Random Seed: 2025
- Dropout Rate: 0.1
- QLoRA Specifics: rank , scaling factor , dropout rate for low-rank matrices 0.1.
- Epochs: 16, with early stopping.
- Batch Size: 16
- Checkpoint Saving: Model with the highest
Rouge-2 F1score on the validation set.
- Inference Settings:
- Beam Search: Beam size 4
- Length Penalty: 3.0
- No-repeat n-gram size: 3
- Maximum New Tokens: 256
- Video-based LMMs: Sampling rate of 0.1 fps (frames per second), with 32 extracted frames.
- Closed-source Models (API):
- Experimental Period: 01/09/2024 to 10/02/2025
- Temperature: 1
- Top_p: 1
- Frequency Penalty: 0.2
- Presence Penalty: 0.2
- Other parameters default for their respective platforms.
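The inference settings listed above map naturally onto Hugging Face generation parameters. Below is a hedged sketch of that mapping; the `model`/`inputs` objects are placeholders and the parameter names follow the transformers API rather than the authors' code.

```python
# Illustrative mapping of the reported inference settings to transformers generation arguments.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    num_beams=4,              # beam search with beam size 4
    length_penalty=3.0,
    no_repeat_ngram_size=3,
    max_new_tokens=256,
)
# outputs = model.generate(**inputs, generation_config=generation_config)  # model/inputs are placeholders
```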
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate several key findings regarding video-to-text summarization on the VISTA dataset.
The following are the results from Table 3 of the original paper:
| Method | Model | Open-source | R1 | R2 | RLsum | SacreBLEU | Meteor | BERTscore | CIDEr-D | VideoScore | FactVC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | LLaMA-3.1transcript | ✓ | 23.68 | 4.22 | 21.39 | 2.70 | 14.62 | 80.93 | 1.17 | 1.53 | 34.32 |
| LLaMA-3.1oCR | ✓ | 24.02 | 4.37 | 21.42 | 2.63 | 14.59 | 80.33 | 1.19 | 1.50 | 34.06 | |
| Qwen2-Audio | ✓ | 23.52 | 4.29 | 21.53 | 2.49 | 14.77 | 80.62 | 1.15 | 1.59 | 34.31 | |
| Claude 3.5 Sonnet | × | 27.71 | 5.59 | 24.14 | 3.14 | 17.53 | 82.57 | 1.32 | 1.91 | 50.11 | |
| Gemini 2.0 | X | 27.82 | 5.66 | 24.29 | 4.22 | 17.83 | 82.64 | 1.47 | 2.02 | 52.02 | |
| GPT-o1 | X | 27.90 | 5.69 | 24.37 | 4.38 | 17.90 | 82.63 | 1.61 | 2.17 | 51.36 | |
| Video-LLaMA | ✓ | 20.18 | 3.19 | 21.24 | 1.76 | 13.73 | 81.31 | 1.08 | 1.63 | 32.25 | |
| Video-ChatGPT | ✓ | 20.36 | 3.52 | 21.43 | 1.79 | 14.01 | 81.35 | 1.11 | 1.63 | 33.21 | |
| Video-LLaVA | ✓ | 25.29 | 4.50 | 22.52 | 2.82 | 15.13 | 81.39 | 1.17 | 1.65 | 36.45 | |
| LLaMA-VID | ✓ | 25.31 | 4.77 | 22.53 | 2.88 | 15.27 | 81.32 | 1.14 | 1.64 | 36.39 | |
| LLaVA-NeXT-Interleave | ✓ | 25.41 | 4.82 | 22.68 | 2.92 | 15.25 | 81.40 | 1.18 | 1.73 | 40.12 | |
| mPLUG-Ow13 | ✓ | 25.57 | 4.82 | 22.84 | 2.99 | 15.33 | 81.39 | 1.21 | 1.77 | 42.07 | |
| Zero-shot (Plan-based) | Plan-mPlug-Ow13* | ✓ | 25.62† | 4.95‡ | 22.97‡ | 3.14‡ | 15.39†‡ | 81.45‡ | 1.27‡ | 1.86‡ | 47.37‡ |
| QLoRA Fine-tuning | LLaMA-3.1transcript | ✓ | 32.24 | 11.38 | 30.39 | 8.03 | 21.57 | 82.39 | 3.86 | 2.81 | 53.22 |
| LLaMA-3.1oCR | ✓ | 33.01 | 12.11 | 30.52 | 8.04 | 21.55 | 82.41 | 3.92 | 2.77 | 53.19 | |
| Qwen2-Audio | ✓ | 32.17 | 12.05 | 30.77 | 7.87 | 21.86 | 82.36 | 4.11 | 2.80 | 54.27 | |
| Video-LLaMA | ✓ | 30.74 | 9.44 | 28.33 | 6.45 | 22.49 | 82.10 | 3.99 | 2.77 | 52.05 | |
| Video-ChatGPT | ✓ | 31.68 | 10.50 | 30.40 | 7.63 | 23.67 | 82.62 | 4.02 | 2.78 | 55.02 | |
| Video-LLaVA | ✓ | 33.16 | 12.64 | 30.37 | 8.17 | 23.92 | 82.81 | 4.26 | 2.83 | 59.13 | |
| LLaMA-VID | ✓ | 33.31 | 12.73 | 30.49 | 8.22 | 23.90 | 83.01 | 4.31 | 2.88 | 62.20 | |
| LLaVA-NeXT-Interleave | ✓ | 33.37 | 12.77 | 30.56 | 8.30 | 23.95 | 83.47 | 4.47 | 2.93 | 66.14 | |
| mPLUG-Ow13 | ✓ | 33.40 | 12.82 | 30.66 | 8.29 | 23.97 | 83.49 | 4.47 | 2.92 | 70.08 | |
| QLoRA Fine-tuning (Plan-based) | Plan-mPlug-Ow13 | ✓ | 33.52†‡ | 13.01†‡ | 31.10‡ | 8.33 | 24.11†‡ | 83.53† | 4.52 | 3.11†‡ | 73.11†‡ |
| Full Fine-tuning | LLaMA-3.1transcript | ✓ | 33.37 | 11.93 | 30.86 | 8.27 | 25.12 | 83.71 | 4.87 | 3.21 | 63.38 |
| LLaMA-3.1oCR | ✓ | 34.02 | 12.42 | 31.72 | 8.51 | 25.11 | 84.09 | 4.89 | 3.32 | 65.84 | |
| Qwen2-Audio | ✓ | 33.82 | 12.37 | 31.63 | 8.33 | 25.09 | 83.62 | 4.83 | 3.22 | 66.62 | |
| Video-LLaMA | ✓ | 32.19 | 11.86 | 31.68 | 8.41 | 24.99 | 83.83 | 4.77 | 3.04 | 64.21 | |
| Video-ChatGPT | ✓ | 32.47 | 12.11 | 32.21 | 8.72 | 25.09 | 83.91 | 4.82 | 3.11 | 66.09 | |
| Video-LLaVA | ✓ | 33.28 | 13.39 | 32.78 | 9.10 | 25.42 | 83.97 | 4.87 | 3.13 | 66.12 | |
| LLaMA-VID | ✓ | 33.47 | 13.53 | 32.80 | 9.21 | 25.41 | 84.03 | 4.91 | 3.17 | 68.30 | |
| LLaVA-NeXT-Interleave | ✓ | 33.75 | 13.61 | 32.88 | 9.26 | 25.63 | 84.11 | 5.01 | 3.23 | 73.42 | |
| mPLUG-Ow13 | ✓ | 34.22 | 13.62 | 32.91 | 9.32 | 25.72 | 84.22 | 5.03 | 3.28 | 71.94 | |
| Full Fine-tuning (Plan-based) | Plan-mPlug-Ow13 | ✓ | 34.53†‡ | 13.74‡ | 33.25†‡ | 9.56†‡ | 25.88†‡ | 84.37†‡ | 5.15†‡ | 3.33†‡ | 75.41‡ |
Key observations from the results:
- Impact of Fine-tuning: Fine-tuning on in-domain data (VISTA) substantially improves performance across all evaluation metrics. Full fine-tuning consistently outperforms QLoRA fine-tuning, indicating that updating all parameters leads to better adaptation to the specialized scientific domain.
- Modality Importance: Video-based LMMs consistently outperform text-based (LLaMA-3.1 transcript, LLaMA-3.1 OCR) and audio-based (Qwen2-Audio) models, highlighting the crucial role of visual information in scientific presentations. For example, under full fine-tuning, mPLUG-Owl3 achieves a FactVC of 71.94, significantly higher than LLaMA-3.1 transcript's 63.38.
- Performance of Closed-source Models (Zero-shot): In zero-shot settings, closed-source models such as GPT-o1, Gemini 2.0, and Claude 3.5 Sonnet generally lead, demonstrating strong generalization. However, open-source models can surpass them once fine-tuned.
- Effectiveness of the Plan-based Approach: Plan-mPLUG-Owl3 (the plan-based approach built on mPLUG-Owl3) achieves SOTA results among open-source models in both zero-shot and fine-tuned settings.
  - In zero-shot, Plan-mPLUG-Owl3* (where only the PG module is fine-tuned) improves FactVC (47.37) and RLsum (22.97) over mPLUG-Owl3 (42.07 FactVC, 22.84 RLsum).
  - With full fine-tuning, Plan-mPLUG-Owl3 achieves the highest overall scores, with a FactVC of 75.41 (+3.47 over mPLUG-Owl3's 71.94) and an RLsum of 33.25 (+0.34). The dagger (†) and double dagger (‡) symbols mark statistically significant improvements over the third-best (LLaVA-NeXT-Interleave) and second-best (mPLUG-Owl3) models, respectively, according to a paired t-test.
- Remaining Challenges: Despite these improvements, all models (including the plan-based method) still exhibit issues with hallucinations (FactVC) and alignment (VideoScore). The gap to human performance remains significant (human reference summaries score 88.54 on FactVC and 4.62 on VideoScore), highlighting the difficulty of the dataset.
6.2. Modality Interplay
To further investigate the impact of different modalities, an experiment was conducted using Video-LLaMA with various modality combinations.
The following are the results from Table 4 of the original paper:
| Modality | R2 (Zero-shot) | RLsum (Zero-shot) | VideoScore (Zero-shot) | FactVC (Zero-shot) | R2 (QLoRA) | RLsum (QLoRA) | VideoScore (QLoRA) | FactVC (QLoRA) | R2 (Full FT) | RLsum (Full FT) | VideoScore (Full FT) | FactVC (Full FT) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Video only | 2.68 | 20.34 | 1.55 | 28.93 | 8.83 | 27.51 | 2.65 | 50.66 | 10.78 | 30.02 | 2.91 | 60.87 |
| Audio only | 2.14 | 19.72 | 1.41 | 26.84 | 7.52 | 26.34 | 2.48 | 45.79 | 9.23 | 27.93 | 2.73 | 58.02 |
| Transcript only | 2.02 | 18.01 | 1.34 | 25.53 | 6.91 | 24.33 | 2.39 | 44.87 | 8.44 | 25.81 | 2.35 | 54.11 |
| Video + Audio | 3.19 | 21.24 | 1.63 | 32.25 | 9.44 | 28.33 | 2.77 | 52.05 | 11.86 | 31.68 | 3.04 | 64.21 |
| Video + Transcript | 1.87 | 18.94 | 1.39 | 27.76 | 7.35 | 24.82 | 2.51 | 48.63 | 9.01 | 27.19 | 2.65 | 58.91 |
| Audio + Transcript | 1.64 | 18.55 | 1.35 | 27.48 | 7.23 | 24.73 | 2.38 | 47.15 | 8.57 | 25.82 | 2.54 | 55.39 |
| Video + Audio + Transcript | 1.92 | 19.13 | 1.47 | 28.60 | 7.37 | 25.29 | 2.52 | 50.72 | 9.22 | 27.21 | 2.61 | 59.30 |
Analysis of modality interplay:
- Video Dominance: Video is consistently the strongest standalone modality across all learning settings and metrics (e.g., a video-only FactVC of 60.87 under full fine-tuning), attributable to the rich spatial-temporal information in scientific presentations.
- Audio Contribution: Audio provides complementary prosodic and timing cues and can outperform transcript-only input (e.g., an audio-only FactVC of 58.02 vs. 54.11 for transcript only under full fine-tuning).
- Transcript Challenges: Although semantically rich, transcript-only input performs worst as a standalone modality. ASR systems often produce long, noisy, unstructured text that can overwhelm the model's attention and interfere with alignment.
- Combined Modalities: Video + Audio generally outperforms single modalities (FactVC of 64.21 under full fine-tuning), highlighting the benefit of combining the two.
  - Surprisingly, adding the transcript to video or audio (or using all three modalities) often decreases performance compared to video only or video + audio. For instance, Video + Transcript (58.91 FactVC) is worse than Video only (60.87 FactVC) under full fine-tuning, suggesting that current video-based LMMs struggle to align and fuse token-heavy, noisy textual inputs with the corresponding visual or audio information.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Impact of Plan Generation Ablations
Ablations on plan generation strategies compare the proposed QUD-based approach with simpler baselines for generating questions.
The following are the results from Table 5 of the original paper:
| Model | R2 | RLsum | VideoScore | FactVC |
|---|---|---|---|---|
| Plan-mPLUG-Owl3 | 13.74 | 33.25 | 3.33 | 75.41 |
| NoQUD | 13.66 | 33.02 | 3.28 | 73.32 |
| Lead-3Q | 12.87 | 30.64 | 2.95 | 71.26 |
| Tail-3Q | 11.62 | 30.51 | 2.88 | 63.82 |
| Random-3Q | 11.57 | 30.48 | 2.87 | 64.28 |
Analysis:
- The original Plan-mPLUG-Owl3 (using QUD-based, Previous-Context question generation) achieves the best performance across all metrics.
- NoQUD (generating all plan questions at once from the full reference summary) performs slightly worse than the QUD-based approach, indicating the benefit of sequential, context-aware question generation.
- Lead-3Q (questions from the first three summary sentences) performs better than Tail-3Q (last three) and Random-3Q (three random sentences), suggesting that the opening sentences of an abstract, which typically outline the problem and method, provide stronger contextual continuity for formulating guiding questions.
- Tail-3Q and Random-3Q perform poorly, especially on FactVC, highlighting the importance of a structured beginning for summary coherence.
6.3.2. Impact of Plan Quality
The quality of the generated plan questions is also critical. This was evaluated by comparing GPT-o1 generated questions with those from Llama-3.1 and RAST, and by introducing noise (Random Replacement (RR) and Full Random Replacement (FRR)).
The following figure (Figure 5 from the original paper) shows the impact of noise in plan generation on summarization performance:
The image is a bar chart comparing R2 scores of different plan-generation methods (GPT-o1, LLaMA-3.1, RAST, RR, and FRR); GPT-o1 scores highest at 13.74 and FRR at 12.71, with red and blue dashed lines marking LLaVA-NeXT-Interleave and mPLUG-Owl3 as references.
The bar chart above shows R2 scores for different plan generation methods. GPT-o1 achieves the highest R2 score of 13.74. FRR (Full Random Replacement) performs the worst.
Analysis:
- Question Quality Matters: Questions generated by GPT-o1 outperform those from Llama-3.1 and RAST in the zero-shot setting, reinforcing GPT-o1's capability for question generation.
- Robustness to Noise: The plan-based method shows a degree of robustness; RR (randomly replacing some questions) performs better than FRR (replacing all questions), implying that the model can still function reasonably well with partially noisy plans.
- Importance of Relevance: FRR performs worst, confirming that irrelevant questions severely degrade performance by breaking the alignment between the plan and the desired summary content.
6.3.3. Planning Beyond Vision
The plan-based method was also applied to unimodal, non-visual models to assess its generalizability.
The following are the results from Table 6 of the original paper:
| Model | Setting | R2 | RLsum | VideoScore | FactVC |
|---|---|---|---|---|---|
| LLaMA-3.1transcript | Zero-shot Learning | 4.22 → 4.56 | 21.39 → 22.01 | 1.53 → 1.75 | 34.32 → 40.78 |
| QLoRA Fine-tuning | 11.38 → 11.62 | 30.39 → 30.55 | 2.81 → 3.02 | 53.22 → 60.47 | |
| Full Fine-tuning | 11.93 → 12.24 | 30.86 → 31.38 | 3.21 → 3.25 | 63.38 → 65.21 | |
| LLaMA-3.10CR | Zero-shot Learning | 4.37 → 4.59 | 21.42 → 21.89 | 1.50 → 1.72 | 34.06 → 40.24 |
| QLoRA Fine-tuning | 12.11 → 12.33 | 30.52 → 30.78 | 2.77 → 2.98 | 53.19 → 60.38 | |
| Full Fine-tuning | 12.42 → 12.75 | 31.72 → 32.19 | 3.32 → 3.38 | 65.84 → 67.53 | |
| Qwen2-Audio | Zero-shot Learning | 4.29 → 4.51 | 21.53 → 22.18 | 1.59 → 1.77 | 34.31 → 40.52 |
| QLoRA Fine-tuning | 12.05 → 12.19 | 30.77 → 31.04 | 2.80 → 3.01 | 54.27 → 61.44 | |
| Full Fine-tuning | 12.37 → 12.68 | 31.63 → 32.12 | 3.22 → 3.25 | 66.62 → 68.25 | |
Analysis:
- The plan-based method consistently improves performance across all unimodal (text-based and audio-based) settings and evaluation metrics; a paired t-test confirms the improvements are statistically significant.
- This demonstrates that planning serves as a generalizable scaffold for better discourse structure even without visual input. For text- and audio-based models, planning can mitigate the lack of spatial-temporal signals by providing discourse-level anchors (e.g., "What problem is being addressed?") that guide the summarization trajectory.
- Despite these gains, video-based planning models (Plan-mPLUG-Owl3) still outperform their non-visual counterparts by a notable margin, confirming the value of the video modality.
6.3.4. Impact of Video Context on Summary Generation
Experiments were performed to understand how different portions of the video input affect summary generation, comparing mPLUG-Owl3 with Plan-mPLUG-Owl3.
The following are the results from Table 8 of the original paper:
| Context | Model | R2 | RLsum | VideoScore | FactVC |
|---|---|---|---|---|---|
| All | mPLUG-Ow13 | 13.62 | 32.91 | 3.28 | 71.94 |
| Plan-mPlug-Ow13 | 13.74 | 33.25 | 3.33 | 75.41 | |
| First 10% | mPLUG-Ow13 | 6.31 | 25.44 | 2.37 | 51.02 |
| Plan-mPlug-Ow13 | 7.37 | 27.38 | 2.52 | 52.39 | |
| First 30% | mPLUG-Ow13 | 9.42 | 28.88 | 2.78 | 54.10 |
| Plan-mPlug-Ow13 | 10.59 | 30.13 | 2.78 | 55.37 | |
| Last 10% | mPLUG-Ow13 | 6.53 | 27.34 | 2.51 | 53.64 |
| Plan-mPlug-Ow13 | 7.62 | 29.73 | 2.77 | 55.93 | |
| Last 30% | mPLUG-Ow13 | 7.32 | 29.17 | 2.82 | 57.36 |
| Plan-mPlug-Ow13 | 10.72 | 31.29 | 2.98 | 62.05 |
Analysis:
- Full Video Best: Using the full video as input yields the best performance, as expected.
- Partial Context Limitations: Partial video contexts consistently underperform the full video.
- End-of-Video Importance: The last part of the video generally produces better results than the first part, since concluding sections of presentations often summarize key findings while openings mostly introduce background information.
- Quantity Matters: Using 30% of the video outperforms using 10%, indicating that more context is generally beneficial.
- Plan-based Superiority: Plan-mPLUG-Owl3 consistently outperforms mPLUG-Owl3 across all video-context configurations, reinforcing its effectiveness regardless of input video length.
6.3.5. Impact of Text Context on Plan Generation
This ablation investigates how the text context provided to GPT-o1 for generating plan questions affects performance.
The following figure (Figure 8 from the original paper) shows the impact of text context for plan generation:
The image is a bar chart showing the impact of text context on plan generation, comparing R2 under three conditions: No-Context (13.69), Previous-Context (13.74), and All-Context (13.72), with reference lines for LLaVA-NeXT-Interleave and mPLUG-Owl3.
The bar chart above displays R2 values for different text context configurations in plan generation: No-Context (13.69), Previous-Context (13.74), and All-Context (13.72). Previous-Context yields the highest R2 score.
Analysis:
- Marginal Differences: Performance differences between No-Context, Previous-Context, and All-Context are relatively small, but all are superior to models without planning.
- No-Context (target sentence only): Shows the lowest performance among the planning variants but is the most cost-effective.
- All-Context (entire summary): Achieves slightly better results than No-Context but incurs the highest computational cost due to the longer input fed to GPT-o1.
- Previous-Context (target sentence + preceding summary): This approach, aligned with QUD theory, strikes the best balance, achieving the highest performance (R2 of 13.74) at moderate computational cost.
6.3.6. Controllable Generation
The paper explores the ability of plan-based models to control output summaries by modifying plans, comparing it with direct instruction-based control.
Summary Readability Control (Table 9):
| Condition | R2 (Plan-mPLUG-Owl3) | FRE (Plan-mPLUG-Owl3) | R2 (GPT-o1) | FRE (GPT-o1) |
|---|---|---|---|---|
| No change | 13.74 | 30.62 | 5.69 | 26.37 |
| Lay questions | 13.38 | 35.17 | 4.26 | 28.94 |
| Expert questions | 13.24 | 23.54 | 4.13 | 24.33 |
Summary Length Control (Table 10):
| Condition | R2 (Plan-mPLUG-Owl3) | Avg. #Tokens (Plan-mPLUG-Owl3) | R2 (GPT-o1) | Avg. #Tokens (GPT-o1) |
|---|---|---|---|---|
| No deletion | 13.74 | 202.39 | 5.69 | 267.32 |
| Delete 10% | 11.05 | 178.47 | 4.32 | 220.49 |
| Delete 30% | 10.41 | 137.72 | 3.17 | 192.42 |
| Delete 60% | 8.01 | 100.32 | 2.98 | 185.28 |
Analysis:
- Robustness of the Plan-based Method: While R2 scores generally decline for both models when control (readability or length) is applied, the plan-based method (Plan-mPLUG-Owl3) proves more robust and controllable, with smaller performance drops than GPT-o1 (which relies on direct prompt-based instructions).
- Readability Control: Plan-mPLUG-Owl3 controls readability effectively, achieving a higher Flesch Reading Ease (FRE) for lay questions (35.17 vs. 28.94 for GPT-o1) and a lower FRE for expert questions (23.54 vs. 24.33 for GPT-o1), demonstrating precise control over summary style (a small FRE illustration follows this list).
- Length Control: Plan-mPLUG-Owl3 aligns more closely with target compression ratios. With 60% deletion, it produces summaries averaging 100.32 tokens, whereas GPT-o1 still generates much longer summaries (185.28 tokens) despite the instruction, showing that plan-based control is more effective at enforcing content retention and compression.
- Hallucination Implications: Case studies (Appendix K) reveal that hallucination issues are amplified in GPT-o1 under these constraints, especially for readability control (producing more complex outputs) and length control (compensating for omitted content). The explicit planning mechanism of Plan-mPLUG-Owl3 helps maintain factual alignment and avoid unsupported claims.
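Flesch Reading Ease (FRE), used for the readability-control results above, can be computed with the textstat package; a small illustrative check follows, where the two example sentences are invented.

```python
# Illustrative Flesch Reading Ease check; higher scores indicate easier-to-read text.
import textstat

lay = "We teach a computer to write a short summary of a research talk."
expert = "We optimize a multimodal encoder-decoder with discourse-aware question plans."
print(textstat.flesch_reading_ease(lay))     # higher (easier to read)
print(textstat.flesch_reading_ease(expert))  # lower (harder to read)
```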
6.4. Human Evaluation
A human evaluation was conducted on 50 randomly selected instances from the VISTA test set.
The following figure (Figure 6 from the original paper) presents the performance of each model based on human evaluation:
The image is a radar chart comparing models on summary evaluation; human-written summaries outperform all neural models across six evaluation dimensions, including faithfulness, coherence, and relevance.
The radar chart above clearly shows that human-written summaries consistently outperform all neural models across all metrics (Faithfulness, Relevance, Informativeness, Conciseness, and Coherence). Plan-mPLUG-Owl3 performs best among the neural models.
Analysis:
- Human Superiority: Human-written summaries significantly outperform all neural summarization models across all metrics (Faithfulness, Relevance, Informativeness, Conciseness, Coherence). Humans are 81.7% more likely to be rated as "best."
- Inter-annotator Agreement: High Fleiss' Kappa scores indicate substantial agreement among annotators, supporting the reliability of the human evaluation (a minimal Fleiss' Kappa computation follows this list).
- Neural Model Ranking: GPT-o1 performs worst among the neural models, rated "worst" 63.2% of the time; LLaVA-NeXT-Interleave follows with a 17.8% chance of being rated "worst." Plan-mPLUG-Owl3 outperforms mPLUG-Owl3 and shows superior performance across all metrics among the neural models, with a higher likelihood of producing high-quality summaries.
- Statistical Significance: Paired t-tests confirm that human summaries are significantly better than all neural models. The plan-based method is significantly better than the other neural models in faithfulness, coherence, and informativeness, although it still falls short of human performance.
- Gap Remaining: The human evaluation reinforces the significant performance gap between automated systems and human capabilities on the challenging VISTA dataset.
6.5. LMM-as-Judge Evaluation
An LMM-as-Judge evaluation (using GPT-o1 as the evaluator) was conducted on all samples in the test set to facilitate large-scale comparisons, validating the approach against human evaluations.
The following figure (Figure 9 from the original paper) shows the LMM-as-Judge evaluation results:
The image is a radar chart of summary-quality ratings; human-written summaries score highest on conciseness, coherence, informativeness, and relevance, while GPT-o1 scores notably low.
The radar chart above shows LMM-as-Judge evaluation results. Similar to human evaluation, human-written summaries consistently receive the highest scores across all metrics. Among neural models, the plan-based model again performs best.
Analysis:
- Consistency with Human Evaluation: The LMM-as-Judge results are broadly consistent with the human evaluations. `GPT-o1`, acting as the judge, assigns the lowest scores to its own outputs and consistently rates human-written summaries as the best.
- High Agreement: Fleiss' Kappa scores between `GPT-o1` and the mean human ratings on a subset of 50 samples show substantial agreement (e.g., for Faithfulness and Relevance).
- Plan-based Superiority Confirmed: The LMM-as-Judge also finds that the plan-based model (`Plan-mPLUG-Owl3`) outperforms the other neural models across all metrics, with statistically significant improvements on every metric except conciseness (a hedged judging-loop sketch follows this list).
- Persistent Gap: The LMM-as-Judge evaluation further highlights the persistent gap between machine-generated and human summaries, reinforcing the challenging nature of the `VISTA` dataset.
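To make the LMM-as-Judge setup concrete, the following is a minimal sketch of a judging loop over the five dimensions used above. It is not the authors' prompt or code: the model identifier, prompt wording, and JSON output schema are assumptions, and a production version would validate or repair the returned JSON.

```python
# Hedged sketch: scoring one candidate summary with an API-hosted judge model.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DIMENSIONS = ["faithfulness", "relevance", "informativeness", "conciseness", "coherence"]

JUDGE_PROMPT = """You are evaluating a summary of a scientific presentation.
Transcript (may be truncated):
{transcript}

Candidate summary:
{summary}

Rate the summary from 1 (poor) to 5 (excellent) on each dimension:
faithfulness, relevance, informativeness, conciseness, coherence.
Answer only with a JSON object mapping each dimension to an integer score."""


def judge_summary(transcript: str, summary: str, model: str = "o1") -> dict:
    # "o1" is a placeholder standing in for whichever judge model is used.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(transcript=transcript, summary=summary),
        }],
    )
    scores = json.loads(response.choices[0].message.content)
    return {dim: int(scores[dim]) for dim in DIMENSIONS}
```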
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces VISTA, a novel and substantial dataset specifically curated for video-to-text summarization of scientific presentations. Through comprehensive benchmarking, the study reveals the inherent complexity of this task and the limitations of current large multimodal models in handling specialized scientific content. A key contribution is the proposal and validation of a plan-based summarization approach that incorporates discourse-aware planning prior to summary generation. Both automated and extensive human evaluations confirm that this explicit planning consistently enhances summary quality, factual coverage, and coherence across various settings. While the plan-based method significantly improves upon existing SOTA models, a noticeable performance gap remains between automated systems and human capabilities, underscoring the challenging nature of the VISTA dataset and the task at hand. The paper concludes by positioning VISTA as a robust foundation for future research in scientific video-to-text summarization.
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- Data Bias: While the `VISTA` dataset is large and diverse, potential inherent biases in the data have not been investigated. The data represents only a fraction of real-world scenarios, so findings may not generalize universally.
- Abstract as Proxy: The paper's core hypothesis that a paper abstract serves as an accurate proxy for a video summary has potential nuances. While quality control ensures strong alignment, minor differences may exist between an abstract and a summary derived solely from the video content.
- Model Scope: The effectiveness of the plan-based method was tested on a selection of video-based, audio-based, and text-based large models, but not exhaustively across all possible model architectures or modalities. The optimal planning approach for the dataset was also not definitively identified.
- Task Scope: The study focused exclusively on video-to-text summarization within scientific domains. Its applicability to other NLP tasks (e.g., multimodal machine translation, question answering, reasoning) or other domains remains unexplored, though likely adaptable.
- Automated Evaluation Limitations: Despite using a suite of metrics and hallucination detection methods, automated metrics have inherent limitations and may not capture all aspects of model quality.
- Human Evaluation Sample Size: The human evaluation covered a relatively small subset (50 video-summary pairs), which may not fully represent the entire dataset. The evaluators, while graduate students, were not necessarily experts in video-to-text summarization and had varying assessment skills.
- LMM-as-Judge Biases: LMM-as-Judge paradigms, while enabling large-scale evaluation, may inherit biases from their pretraining data. Data contamination is a concern if `GPT-o1` (used as the judge) was trained on overlapping data. While validated against human evaluation on a small subset, its reliability across diverse topics or styles warrants caution.

Future research directions suggested include:
- Investigating inherent biases within the `VISTA` dataset.
- Exploring alternative or optimal plan-based methods for video-to-text summarization.
- Applying plan-based methods to other multimodal NLP tasks.
- Developing more robust automated evaluation metrics that better align with human judgment.
- Conducting larger and more diverse human evaluations.
- Addressing LMM-as-Judge biases and improving its reliability.
7.3. Personal Insights & Critique
This paper makes a significant contribution to the field of multimodal learning by introducing VISTA, a much-needed specialized dataset for scientific video-to-text summarization. The meticulous data collection and quality control processes are commendable, ensuring that the dataset is highly relevant and challenging. The finding that LMMs struggle with scientific content despite their general capabilities underscores the importance of domain-specific data and highlights that AI general intelligence is still far from being achieved in specialized, knowledge-intensive fields.
The plan-based framework is a clever and effective approach. Scientific abstracts inherently possess a structured nature (e.g., introduction, methods, results, conclusion), and explicitly guiding the generation process with these structures (via questions) is intuitive and demonstrably beneficial. The improvements in factual consistency and coherence are particularly valuable for scientific summarization, where accuracy is paramount. The ablation studies effectively demonstrate the contribution of each component, especially the superiority of Previous-Context for question generation and the robustness of the plan-based method against noisy inputs. The controllable generation experiments further highlight the practical utility of planning for tailoring summaries to specific needs (e.g., readability, length).
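To illustrate the plan-then-summarize idea in the simplest possible form, here is a hedged sketch that first answers a fixed set of planning questions from a transcript and then fuses the answers into an abstract-style summary. The question list, prompts, helper names, and the generic chat model are illustrative assumptions; they are not the authors' implementation, which derives its plans from reference abstracts rather than a fixed template.

```python
# Hedged sketch: two-stage "plan, then summarize" with a generic chat model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical planning questions mirroring a typical abstract's structure.
PLAN_QUESTIONS = [
    "What problem does the presentation address?",
    "What method or approach is proposed?",
    "What are the main results?",
    "What is the key conclusion or takeaway?",
]


def plan_then_summarize(transcript: str, model: str = "gpt-4o") -> str:
    # Stage 1: answer each planning question from the transcript.
    plan_answers = []
    for q in PLAN_QUESTIONS:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Transcript:\n{transcript}\n\nAnswer briefly: {q}"}],
        )
        plan_answers.append(f"{q} {r.choices[0].message.content.strip()}")

    # Stage 2: fuse the answered plan into an abstract-style summary,
    # instructing the model not to add unsupported content.
    fusion_prompt = ("Write a concise, abstract-style summary of the presentation, "
                     "covering every point in this plan and nothing unsupported:\n"
                     + "\n".join(plan_answers))
    r = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": fusion_prompt}]
    )
    return r.choices[0].message.content.strip()
```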
Critically, the paper transparently acknowledges the persistent gap between human and machine performance. This gap is not a weakness but a testament to the challenge posed by the VISTA dataset and the complexity of understanding and synthesizing highly technical, multimodal information. It provides a clear direction for future research.
One potential area for deeper exploration could be the automatic extraction of structured plans directly from the video content, rather than relying on silver-standard plans generated from reference summaries. While the current method uses GPT-o1 to create plans from abstracts, a fully end-to-end plan-based model that extracts these plans directly from video input (e.g., by identifying key segments or visual cues) could be a powerful advancement. Additionally, exploring the pedagogical implications of such summarization tools for scientific education and knowledge dissemination could be a fascinating application. The paper's robust methodology and insightful analysis provide a strong foundation for these and many other future investigations.