Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
TL;DR Summary
Eventful Transformers reduce Vision Transformers' high computational cost in video processing by exploiting temporal redundancy. The method identifies and re-processes only significantly changed tokens across frames, achieving 2-4x computational savings with minimal accuracy loss.
Abstract
Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
- Authors: Matthew Dutson, Yin Li, and Mohit Gupta. Their affiliation is the University of Wisconsin-Madison.
- Journal/Conference: The paper is available on arXiv, a preprint server. This indicates it is a research article that has not yet undergone formal peer review for a conference or journal, or is awaiting publication after acceptance. arXiv is a highly respected platform for disseminating cutting-edge research in fields like computer science and physics.
- Publication Year: 2023 (first version submitted in August).
- Abstract: The paper addresses the high computational cost of Vision Transformers, particularly in video recognition tasks. The authors propose a method to exploit temporal redundancy between consecutive video frames or clips. Their approach identifies tokens (parts of the input) that have changed significantly over time and re-processes only them. This method, embodied in a model family called "Eventful Transformers," can be applied to existing Transformer models, often without retraining, and allows for adaptive control of the computational budget at runtime. Experiments on large-scale video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100) datasets show significant computational savings (2-4x) with only minor drops in accuracy.
- Original Source Link: https://arxiv.org/abs/2308.13494 (PDF: http://arxiv.org/pdf/2308.13494v1)
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: Vision Transformers (ViTs) have achieved state-of-the-art performance on many computer vision tasks but are notoriously computationally expensive. This problem is exacerbated in video processing, where models are applied repeatedly to a sequence of frames, leading to prohibitive costs for real-world deployment on resource-constrained devices.
  - Gap in Prior Work: While standard practice involves processing each video frame "from scratch," natural videos exhibit significant temporal redundancy; consecutive frames are often very similar. Existing methods for exploiting this redundancy were primarily designed for Convolutional Neural Networks (CNNs) and are not directly applicable to the unique architecture of Transformers, specifically the self-attention mechanism.
  - Innovation: The paper introduces a novel method to reuse computations from previous time steps within the Transformer architecture. The core idea is to treat the processing pipeline as "eventful," meaning computation is triggered only by significant changes in the input tokens over time. This approach not only improves efficiency but also introduces runtime adaptivity, allowing the computational cost to be adjusted dynamically.
- Main Contributions / Findings (What):
  - A Novel Gating Mechanism: The paper proposes a token gating mechanism that compares incoming tokens to a stored reference from a previous time step. Only tokens that exhibit significant change ("events") are selected for updating.
  - Efficient Transformer Block Design: The authors detail how to modify a standard Transformer block to incorporate this gating mechanism. They present sparse update strategies that accelerate not only token-wise operations (like MLPs) but also the computationally intensive query-key and attention-value products within the self-attention module.
  - Adaptive and Retraining-Free Application: The proposed "Eventful Transformers" can be converted from off-the-shelf, pre-trained models, often without any fine-tuning. The method provides a simple parameter (r in the top-r policy) that allows for direct, real-time control over the accuracy-compute trade-off.
  - Strong Empirical Results: The method is validated on large-scale video benchmarks for object detection and action recognition, demonstrating computational savings of 2-4x with only minor accuracy degradation.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Vision Transformer (ViT): A deep learning model architecture that adapts the Transformer, originally designed for natural language processing, to computer vision tasks. A ViT first splits an image into a sequence of fixed-size patches. Each patch is flattened and linearly projected to create a "token" vector. These tokens, along with positional information, are then fed through a series of Transformer blocks.
- Transformer Block: The fundamental building block of a Transformer. It primarily consists of two sub-layers: a Multi-Head Self-Attention (MSA) layer and a position-wise Multilayer Perceptron (MLP). Each sub-layer is followed by a residual connection and layer normalization.
  - Self-Attention: The key mechanism in Transformers. It allows the model to weigh the importance of different tokens in the input sequence when computing the representation for a given token. It calculates attention scores by comparing each token's "query" vector to every other token's "key" vector. These scores are then used to create a weighted sum of all tokens' "value" vectors. This process has a computational complexity that is quadratic in the number of tokens ($O(N^2)$ for $N$ tokens). A minimal sketch of this computation appears below.
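For readers unfamiliar with the mechanism, here is a minimal single-head self-attention sketch in PyTorch (a generic illustration, not code from the paper), showing where the quadratic cost arises: both the query-key product and the attention-value product involve an $N \times N$ matrix.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention over N tokens of dimension D.

    x: (N, D) token matrix; w_q, w_k, w_v: (D, D) projection weights.
    The q @ k.T product produces an (N, N) attention matrix, which is
    where the quadratic cost in the number of tokens comes from.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # token-wise projections, O(N * D^2)
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # (N, N) query-key product, O(N^2 * D)
    attn = F.softmax(scores, dim=-1)          # row-wise attention weights
    return attn @ v                           # attention-value product, O(N^2 * D)
```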
- Previous Works & Differentiation:
- Efficient Transformers: Many prior works aimed to reduce the complexity of self-attention using sparse or low-rank approximations. The proposed method is orthogonal to these, as it focuses on reducing computation by exploiting temporal redundancy between different inputs, not spatial redundancy within a single input.
- Token Selection (Spatial Redundancy): Methods like Adaptive Token Sampling (ATS) prune or merge tokens within a single image to reduce redundancy. This paper targets temporal redundancy (across frames), making it complementary. The authors demonstrate this by combining their method with a simple spatial pooling technique.
- Temporal Redundancy in CNNs: Works like Skip-Convolutions and DeltaCNN exploited temporal redundancy for CNNs by updating only parts of the feature maps. However, their techniques are tailored to the convolutional structure and are not directly transferable to the token-based, globally interactive nature of Transformers.
- Spatiotemporal Gated Transformers (STGT): This is the most closely related prior work. The authors differentiate their method on two key points:
    - STGT only accelerates token-level operations and does not address the expensive self-attention operator. Eventful Transformers accelerate all major components, including self-attention.
    - STGT's gating logic is described as "lossy" and can degrade on long sequences. The proposed reference-based gating mechanism is designed to be more robust.
4. Methodology (Core Technology & Implementation)
The core of the paper is the "Eventful Transformer," a modified Vision Transformer that selectively updates its internal states based on temporal changes.
- Principles: The central idea is to avoid recomputing the entire network for every video frame. Instead, the model maintains a memory of its previous state (token values) and only updates the parts that have significantly changed. This is achieved through a combination of gate and buffer modules.
- Steps & Procedures:
1. Token Gating: Detecting Redundancy. The mechanism for detecting changes is composed of two main modules:
  - Gate Module: This module decides which tokens to update. As shown in Image 4, it performs a four-step process:
    1. Compute Error: It calculates the difference $e = u - c$ between the stored reference tokens $u$ and the current incoming tokens $c$.
    2. Apply Selection Policy: A policy is applied to the error tensor $e$ to generate a binary mask $m$, which identifies the $M$ tokens (out of a total of $N$) that have changed the most.
    3. Extract Selected Tokens: The tokens selected by the mask are "gathered" from the current input to form a smaller tensor $\tilde{c}$ of size $M \times D$, where $D$ is the token dimension. This tensor is the output of the gate.
    4. Update Reference: The reference tensor $u$ is updated at the selected positions with their new values from $\tilde{c}$.
    (Image 4: a schematic of the four-step token gating process: computing the error between current and reference tokens, applying the selection policy to the error to generate a mask, extracting the tokens that need updating as the gate's output, and using the mask to update the reference tokens for the next comparison. A patchified image of a sheep illustrates the token selection and update process.)
  - Buffer Module: This module complements the gate. It takes the processed sparse tensor (of size $M \times D$) and "scatters" the updated tokens back into a full-sized tensor ($N \times D$), filling in the non-updated positions with their previous values stored in the buffer's state.
    Image 5 illustrates how a gate-buffer pair is used to accelerate token-wise operations. The gate reduces the number of active tokens, the operation is performed on this smaller set, and the buffer restores the full tensor shape. A minimal code sketch of this gather-compute-scatter pattern follows.
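Below is a minimal PyTorch sketch of the gate-buffer pattern with the top-r policy, based on the description above. It is an illustration rather than the authors' implementation; the class names (TokenGate, TokenBuffer), the MLP dimensions, and the value of r are assumptions chosen for the example.

```python
import torch

class TokenGate:
    """Selects the r most-changed tokens (top-r policy) and maintains a reference."""

    def __init__(self, r):
        self.r = r          # number of tokens to update per time step
        self.u = None       # reference tokens from the previous time step, shape (N, D)

    def __call__(self, c):
        # First frame: everything is new, so update all tokens.
        if self.u is None:
            self.u = c.clone()
            return c, torch.arange(c.shape[0])
        e = (self.u - c).norm(dim=-1)              # per-token L2 error, shape (N,)
        idx = torch.topk(e, k=self.r).indices      # indices of the r largest errors
        self.u[idx] = c[idx]                       # update the reference at selected positions
        return c[idx], idx                         # gathered tokens (r, D) and their indices

class TokenBuffer:
    """Scatters updated tokens back into a dense (N, D) tensor of stored states."""

    def __init__(self):
        self.state = None

    def __call__(self, y_sparse, idx, n_tokens):
        if self.state is None:
            self.state = torch.zeros(n_tokens, y_sparse.shape[-1])
        self.state[idx] = y_sparse                 # overwrite only the updated rows
        return self.state.clone()

# Wrapping a token-wise operation (e.g., an MLP) so it only runs on changed tokens.
gate, buffer = TokenGate(r=768), TokenBuffer()
mlp = torch.nn.Sequential(torch.nn.Linear(256, 512), torch.nn.GELU(), torch.nn.Linear(512, 256))

def eventful_mlp(tokens):                          # tokens: (N, D) for the current frame
    active, idx = gate(tokens)                     # (M, D) subset of changed tokens
    out = mlp(active)                              # cost scales with M instead of N
    return buffer(out, idx, tokens.shape[0])       # dense (N, D) output
```

Wrapping any token-wise operation this way reduces its cost by roughly N/M, matching the description of token-wise operations in the next subsection.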

2. Building an Eventful Transformer Block
Image 1 shows the architecture of a modified Transformer block. Gate-buffer pairs are strategically placed to accelerate different components.
(Image 1: a flow diagram of the modified Transformer operations in an Eventful Transformer for video processing. Modules of different shapes and colors distinguish existing Transformer operations, gate modules, buffer modules, and delta gate modules; the diagram highlights the sparse computations (row-sparse and column-sparse) and the order of matrix multiplications, reflecting how the method saves compute by selectively processing significantly changed tokens.)
  - Token-wise Operations: Operations like the MLP and the linear projections for query (q), key (k), and value (v) are inherently token-wise. A gate-buffer pair is wrapped around them, reducing their computational cost by a factor of N/M.
  - The Query-Key Product: The computation of the pre-softmax attention matrix $B = q k^{T}$ is a major bottleneck. An element $B_{ij}$ needs updating if either the $i$-th query token or the $j$-th key token has changed. The authors propose a sparse update:
    - Compute updates from changed query tokens: $\tilde{q} k^{T}$. This result updates the corresponding rows of the stored matrix $B$.
    - Compute updates from changed key tokens: $q \tilde{k}^{T}$. This result updates the corresponding columns of $B$. The cost is reduced from $N^2 D$ to roughly $2MND$, providing savings when fewer than half the tokens are updated ($M < N/2$).
  - The Attention-Value Product: The product $Av$ is even trickier, because changes in the value tokens $v$ are sparse, but after multiplication with a dense attention matrix $A$, the output is dense. A simple sparse update is not possible. Instead, a delta-based update is proposed. Let $A_o$ and $v_o$ be the old attention and value matrices, and let $\Delta A = A - A_o$ and $\Delta v = v - v_o$ be the changes. The new product can be updated incrementally using the previous result $A_o v_o$:
    $$A v = A_o v_o + A_o \Delta v + \Delta A \, v_o + \Delta A \, \Delta v$$
    Each of the three new terms ($A_o \Delta v$, $\Delta A \, v_o$, $\Delta A \, \Delta v$) involves a multiplication with a sparse delta matrix. As shown in Image 6, the computation is made efficient by removing the zero rows and columns from the delta matrices before multiplication, reducing the cost from $O(N^2 D)$ to $O(MND)$. A code sketch of both sparse attention updates appears after the figure caption below.
    (Image 6: a schematic of the delta-based strategy for sparsely updating the attention-value product. Zero multiplications are avoided by trimming rows and columns of the matrices, reducing computational cost. Colors mark attention-only updates, value-only updates, joint updates, no updates, and the relevant non-zero rows and columns; the formulas involve matrix multiplication and the delta decomposition.)
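The sketch below illustrates both sparse attention updates in PyTorch, following the description above. It is my reading of the mechanism rather than the authors' code: the function and argument names are assumptions, and the deltas are assumed to be row-sparse, as would be produced by the delta gate modules in Image 1.

```python
import torch

def update_query_key_product(B, q, k, idx):
    """Sparse update of the stored pre-softmax product B = q @ k.T.

    B:    (N, N) matrix carried over from the previous frame (modified in place).
    q, k: (N, D) buffered query and key tokens for the current frame.
    idx:  indices of the M tokens whose query/key vectors were updated.
    Updating M rows and M columns costs about 2*M*N*D versus N*N*D from scratch.
    """
    B[idx, :] = q[idx] @ k.T        # rows affected by the changed queries
    B[:, idx] = q @ k[idx].T        # columns affected by the changed keys
    return B

def update_attention_value_product(prev_Av, A_old, v_old, dA, idx_A, dv, idx_v):
    """Delta-based update of A @ v, given row-sparse deltas.

    prev_Av: (N, D) product A_old @ v_old from the previous frame.
    A_old:   (N, N) previous attention matrix; v_old: (N, D) previous values.
    dA:      (M_A, N) non-zero rows of Delta-A, located at row indices idx_A.
    dv:      (M_v, D) non-zero rows of Delta-v, located at row indices idx_v.
    Zero rows/columns are skipped in every product, so the cost scales with the
    number of updated rows rather than with N^2.
    """
    new_Av = prev_Av.clone()
    new_Av += A_old[:, idx_v] @ dv          # A_o * Delta-v: only columns idx_v of A_o matter
    new_Av[idx_A] += dA @ v_old             # Delta-A * v_o: non-zero only in rows idx_A
    new_Av[idx_A] += dA[:, idx_v] @ dv      # Delta-A * Delta-v: small M_A x M_v block
    return new_Av
```

In the full block, the token gate that wraps the q/k/v projections would supply idx, while the delta gate modules shown in Image 1 would supply the row-sparse Delta-A and Delta-v assumed here.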
  - Token Selection Policies: The effectiveness of the gating mechanism depends on the policy used to select which tokens to update.
    - Top-r policy: This policy selects the $r$ tokens with the largest L2 norm of the error $e$. It is simple, lightweight, and provides direct control over the computational budget by varying the hyperparameter $r$. This is the primary policy used in the experiments.
    - Threshold policy: This policy selects all tokens whose error norm exceeds a fixed threshold. The number of updated tokens becomes input-adaptive, varying with the amount of change in the scene. This can offer a better accuracy-cost trade-off but provides less direct control over the compute cost. A small sketch contrasting the two policies follows.
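A minimal sketch of the two policies as standalone functions (illustrative only; the threshold parameter name h is an assumption):

```python
import torch

def top_r_policy(e, r):
    """Select the r tokens with the largest L2 error norm (fixed compute budget)."""
    return torch.topk(e.norm(dim=-1), k=r).indices

def threshold_policy(e, h):
    """Select every token whose L2 error norm exceeds h (input-adaptive budget)."""
    return torch.nonzero(e.norm(dim=-1) > h).squeeze(-1)

# Example: error tensor for 4096 tokens of dimension 256.
e = torch.randn(4096, 256)
print(top_r_policy(e, r=768).shape)        # always exactly 768 indices
print(threshold_policy(e, h=16.0).shape)   # count varies with how much the scene changed
```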
5. Experimental Setup
- Datasets:
  - ImageNet VID (ILSVRC 2015): A large-scale dataset for video object detection. The validation set contains 555 videos. It is used to evaluate the ViTDet model.
  - EPIC-Kitchens 100: A dataset of egocentric (first-person) videos focused on kitchen activities, used for action recognition. It is highly dynamic. The ViViT model is evaluated on this dataset.
- Evaluation Metrics:
  - mAP50 (mean Average Precision @ IoU=0.5): A standard metric for object detection.
    - Conceptual Definition: It measures precision averaged over recall values for each class, then averaged over all object classes, where a detection is considered correct (a true positive) if its Intersection over Union (IoU) with a ground-truth bounding box is at least 0.5 (50%). Higher mAP50 indicates better detection accuracy.
    - Mathematical Formula: $\text{mAP50} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$
    - Symbol Explanation:
      - $C$: The total number of object classes.
      - $\text{AP}_c$: The Average Precision for class $c$, calculated as the area under that class's precision-recall curve (at the 0.5 IoU threshold).
  - Top-1 Accuracy: A common metric for classification tasks.
    - Conceptual Definition: It measures the percentage of test samples for which the model's prediction with the highest confidence score is the correct label.
    - Mathematical Formula: $\text{Top-1 Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$
    - Symbol Explanation: A prediction is "correct" if the class with the highest probability output by the model matches the ground-truth class. A short illustrative computation of both metrics follows this list.
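As a small, hedged illustration of how these metrics are computed (the per-class AP values are assumed to come from a standard detection-evaluation toolkit and are not derived here):

```python
import torch

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the ground truth."""
    return (logits.argmax(dim=-1) == labels).float().mean().item()

def mean_average_precision(per_class_ap):
    """mAP50 = mean of per-class AP values, each computed at an IoU threshold of 0.5."""
    return sum(per_class_ap) / len(per_class_ap)

# Example usage with dummy values.
logits = torch.randn(8, 10)             # 8 samples, 10 classes
labels = torch.randint(0, 10, (8,))
print(top1_accuracy(logits, labels))
print(mean_average_precision([0.82, 0.75, 0.91]))
```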
- Baselines:
  - Base Model: The original, unmodified ViT models (ViTDet and ViViT).
  - STGT [37]: A re-implementation of the Spatiotemporal Gated Transformers method, simplified to use the same top-r policy for a fair comparison of the core mechanisms.
  - Token-wise only: An ablation of the proposed method where temporal redundancy is only exploited for token-wise operations (MLP, linear projections), but not for the query-key or attention-value products. This baseline helps quantify the benefits of the more complex sparse attention updates.
6. Results & Analysis
- Core Results:
Video Object Detection (ViTDet on ImageNet VID)
Image 7 shows the trade-off between the computational savings ratio and the drop in mAP50. At an input size of 1024x1024, updating just 768 out of 4096 tokens (r=768) results in a 3.8x reduction in GFlops with only a 1.68% absolute drop in mAP50 (from 82.93% to 81.25%). The results demonstrate that significant computational savings are achievable with minimal performance loss.
(Image 7: plots showing, for input sizes 1024 and 672, how the number of updated tokens affects the computation savings ratio (green, top) and the change in mAP50 (red, bottom). The x-axis is the number of updated tokens; the y-axes show the savings ratio and the relative mAP50 change. The data show that updating fewer tokens increases computational efficiency with a slight drop in accuracy.)
Image 8 compares the accuracy-cost trade-off curves. The proposed method (orange line) consistently outperforms both STGT (red line) and the "Token-wise only" ablation (green line), achieving higher accuracy for the same computational budget. This confirms that accelerating the self-attention products is crucial for maximizing efficiency.
![Figure 8. Video object detection comparison and ablation. The accuracy-cost tradeoff for our method, compared with STGT \[37\] and an ablation that only accelerates token-wise operations. See the suppl…](/files/papers/68eb79c5ec98180e67d57ddf/images/8.jpg)
Video Action Recognition (ViViT on EPIC-Kitchens 100)
For this more dynamic task, fine-tuning the temporal sub-model was necessary to adapt to the changed statistics of the spatial model's output. Image 9 shows that even after fine-tuning for a specific budget (r value), the model remains adaptive and performs well across a range of budgets. For instance, the model fine-tuned with r=100 achieves a 2.4x reduction in TFlops with a 1.62% absolute accuracy drop when tested with r=140.
(Image 9: a plot comparing accuracy (%) against compute (TFLOPs) on the EPIC-Kitchens 100 action recognition task for the base model and for models fine-tuned with r = 50, 100, and 200.)
- Ablations / Parameter Sensitivity:
Combining Spatial and Temporal Redundancy
The paper presents a proof-of-concept experiment combining their temporal method with spatial pooling in the key and value tokens of ViTDet. The results, transcribed from Table 1 in the paper, show that the two approaches are complementary.
This is a transcription of Table 1 from the paper.
| Variant | r | mAP50 (%) | GFlops |
| --- | --- | --- | --- |
| Base model | — | 82.93 | 467.4 |
| Spatial | — | 80.15 | 388.1 |
| Spatiotemporal | 2048 | 80.14 | 217.0 |
| Spatiotemporal | 1536 | 80.07 | 169.3 |
| Spatiotemporal | 1024 | 79.50 | 121.0 |
| Spatiotemporal | 768 | 78.69 | 96.3 |
| Spatiotemporal | 512 | 76.96 | 70.9 |
| Spatiotemporal | 256 | 71.35 | 44.5 |

The Spatial model alone reduces GFlops from 467.4 to 388.1. When combined with the temporal method (Spatiotemporal), the GFlops are further reduced to as low as 44.5, demonstrating a synergistic effect.

Runtime Analysis
The authors provide preliminary runtime measurements, transcribed from Table 2, which confirm that the reduction in FLOPs translates to real-world speedups on both CPU and GPU. For example, the Spatiotemporal ViTDet model at size 672x672 is 3.1x faster on CPU (1492 ms to 478 ms) and 1.36x faster on GPU (28.3 ms to 20.8 ms) than the base model. The authors note these speedups could be improved with custom CUDA kernels.

This is a transcription of Table 2 from the paper.

| Model | Size | Variant | r | GPU (ms) | CPU (ms) |
| --- | --- | --- | --- | --- | --- |
| ViTDet | 1024 | Base model | — | 86.6 | 5150 |
| ViTDet | 1024 | Spatial | — | 58.9 | 3116 |
| ViTDet | 1024 | Temporal | 512 | 69.9 | 3570 |
| ViTDet | 1024 | Spatiotemporal | 512 | 38.1 | 1682 |
| ViTDet | 672 | Base model | — | 28.3 | 1492 |
| ViTDet | 672 | Spatial | — | 23.3 | 1055 |
| ViTDet | 672 | Temporal | 256 | 21.6 | 838 |
| ViTDet | 672 | Spatiotemporal | 256 | 20.8 | 478 |
| ViViT | 320 | Base model | — | 950 | 5.45e4 |
| ViViT | 320 | Temporal | 50 | 545 | 2.15e4 |

Visualization of Updates
Image 3 provides a qualitative visualization. The Error map highlights regions of motion (the moving car), and the Updates mask confirms that the model correctly identifies these dynamic areas as needing re-computation, while static background areas are reused.
(Image 3: a qualitative comparison for video object detection showing event-triggered predictions, error heat maps, and update masks. The top row shows predictions on the original video frames, the middle row the corresponding error heat maps, and the bottom row the update masks computed from the errors, where white marks tokens that must be re-processed and black marks tokens that can be reused, illustrating how temporal redundancy is exploited.)
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces "Eventful Transformers," a practical and effective method for reducing the computational cost of Vision Transformers on video tasks. By identifying and updating only temporally significant tokens, the method achieves 2-4x computational savings with minimal accuracy loss. A key advantage is its ability to be applied to existing models, often without retraining, and its provision for adaptive, runtime control over the compute budget.
- Limitations & Future Work:
  - Memory Overhead: The primary limitation is increased memory usage. The gate and buffer modules store reference tensors. While modest for token vectors, storing the full attention matrix can be memory-intensive for high-resolution inputs with global attention.
  - Implementation Overhead: The current implementation relies on standard PyTorch operators, which introduces overhead. The authors suggest that custom fused CUDA kernels could further improve wall-clock speedups.
  - Future Work: The authors propose further exploration of integrating their temporal method with more sophisticated spatial redundancy techniques (e.g., adaptive token merging) and developing learned token selection policies.
- Personal Insights & Critique:
- Novelty and Impact: This work is a significant contribution to making Vision Transformers practical for video analysis. The idea of "event-based" processing is intuitive and powerful, and the detailed methodology for accelerating all parts of the Transformer block, including self-attention, is a key technical achievement.
- Practicality: The ability to convert existing models without retraining is a major practical advantage, lowering the barrier to adoption. The adaptive nature is also crucial for real-world systems where available resources may fluctuate.
- The Memory-Compute Trade-off: The memory overhead is a critical consideration. For edge devices with tight memory constraints, using the full method on high-resolution video might be challenging. The authors' suggestion to selectively disable gating on the attention matrix is a reasonable compromise in such scenarios.
- Generalizability: While demonstrated on video, the core concept could potentially be extended to other domains with sequential data that exhibit redundancy, such as processing time-series data or iterative refinement in generative models. The key challenge would be defining an appropriate "change" metric for the tokens in those domains.