VideoMamba: State Space Model for Efficient Video Understanding
TL;DR Summary
VideoMamba adapts the Mamba State Space Model for efficient video understanding, tackling local redundancy and global dependencies. Leveraging its linear-complexity operator and self-distillation, it achieves superior scalability, fine-grained motion sensitivity, and long-term video understanding, setting a new benchmark for efficient, comprehensive video comprehension.
Abstract
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution long video understanding. Extensive evaluations reveal VideoMamba's four core abilities: (1) Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique; (2) Sensitivity for recognizing short-term actions even with fine-grained motion differences; (3) Superiority in long-term video understanding, showcasing significant advancements over traditional feature-based models; and (4) Compatibility with other modalities, demonstrating robustness in multi-modal contexts. Through these distinct advantages, VideoMamba sets a new benchmark for video understanding, offering a scalable and efficient solution for comprehensive video understanding. All the code and models are available at https://github.com/OpenGVLab/VideoMamba.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: VideoMamba: State Space Model for Efficient Video Understanding
- Authors: Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao.
- Affiliations: The authors are affiliated with OpenGVLab, Shanghai AI Laboratory; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; and the State Key Laboratory for Novel Software Technology, Nanjing University.
- Journal/Conference: The paper is available on arXiv, which is a preprint server. This means it has not yet undergone formal peer review for a conference or journal but is shared to disseminate research quickly.
- Publication Year: 2024 (First version submitted in March 2024).
- Abstract: The paper introduces VideoMamba, a model that adapts the Mamba architecture (a State Space Model) for video understanding. It aims to solve the dual challenges of handling local redundancy and modeling global dependencies in videos. VideoMamba is presented as a more efficient alternative to 3D Convolutional Neural Networks (CNNs) and Video Transformers, thanks to its linear-complexity operator, which is ideal for long, high-resolution videos. The authors highlight four key abilities of VideoMamba: (1) scalability without extensive pretraining, aided by a novel self-distillation technique; (2) high sensitivity for recognizing fine-grained, short-term actions; (3) superior performance on long-term video tasks; and (4) compatibility with other modalities like text. The paper concludes that VideoMamba sets a new benchmark for efficient and comprehensive video understanding.
- Original Source Link:
- arXiv Page: https://arxiv.org/abs/2403.06977
- PDF Link: http://arxiv.org/pdf/2403.06977v2
- Publication Status: Preprint.
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Understanding video content requires modeling both fine-grained local details and long-range global dependencies. Traditional architectures struggle to do both efficiently. 3D CNNs are good at capturing local spatiotemporal features but have a limited receptive field. Video Transformers excel at modeling long-range dependencies using self-attention but suffer from quadratic computational complexity (O(N²) in the sequence length N), making them prohibitively expensive for long, high-resolution videos.
- Importance & Gaps: As video content becomes longer and higher in resolution, the need for an efficient and effective backbone model is critical. Existing models either compromise on performance (e.g., divided attention in TimeSformer) or are too computationally intensive for practical use on long sequences.
- Innovation: This paper proposes to use a State Space Model (SSM), specifically Mamba, as the core building block for a video understanding model. Mamba offers linear complexity (O(N)) in sequence length, combining the global context modeling ability of Transformers with the efficiency of recurrent models. The paper introduces VideoMamba, a pure SSM-based architecture tailored for the video domain.
-
Main Contributions / Findings (What):
- Novel Model (VideoMamba): The paper presents the first purely SSM-based model for general video understanding. It adapts the bidirectional Mamba block for 3D spatiotemporal data and demonstrates its effectiveness.
- Scalability via Self-Distillation: The authors identify an overfitting issue when scaling up Mamba-based models. They propose a simple yet effective self-distillation strategy where a smaller, pre-trained VideoMamba model guides the training of a larger one, enabling successful scaling without requiring massive pre-training datasets.
- State-of-the-Art Performance & Efficiency: VideoMamba is shown to be highly efficient, running up to 6x faster and using 40x less GPU memory than the TimeSformer on long videos. It achieves state-of-the-art or competitive results across a wide range of tasks:
  - Short-term Action Recognition: High sensitivity to fine-grained motion (e.g., on the Something-Something V2 dataset).
  - Long-term Video Understanding: Superior performance on benchmarks like Breakfast, COIN, and LVU through efficient end-to-end training.
  - Multi-modal Understanding: Strong performance in video-text retrieval tasks, demonstrating its compatibility with other modalities.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Video Understanding: The general task of teaching computers to comprehend the content of videos. This includes tasks like action recognition (what is happening?), temporal action localization (when is it happening?), and video-text retrieval (finding a video based on a text description).
- Convolutional Neural Networks (CNNs): Models that use convolutional filters to process grid-like data (e.g., images). 3D CNNs extend this concept to video by using 3D filters to capture motion and appearance information across both space and time. They are efficient at learning local patterns.
- Transformers: An architecture originally designed for natural language processing that relies on the self-attention mechanism. Self-attention allows the model to weigh the importance of all other elements in a sequence when processing a single element, enabling it to capture long-range dependencies. In video, a sequence is formed by flattened patches of video frames. Its main drawback is the computational cost, which grows quadratically with the sequence length.
- State Space Models (SSMs): A class of models from control theory used to describe a system with inputs, outputs, and internal states. In deep learning, they have been adapted to model sequences. They can be seen as a sophisticated type of recurrent neural network (RNN).
- Mamba: A recent and highly influential SSM architecture. Its key innovation is the Selective Scan Mechanism (S6), which allows the model's parameters to be input-dependent. This enables Mamba to selectively focus on or ignore parts of the input sequence, effectively compressing long-range information into its hidden state. It achieves this with linear time complexity, making it a powerful candidate to replace the self-attention mechanism in Transformers.
-
Previous Works:
- 3D CNNs: Models like I3D, SlowFast, and X3D were dominant in video understanding. They built on successful 2D CNNs by extending convolutions into the time dimension.
- Video Transformers: Models like TimeSformer and ViViT adapted the Vision Transformer (ViT) for video. To manage the high computational cost, they often used "divided attention," where spatial and temporal attention are computed separately, which can be less expressive than joint spatiotemporal attention.
- Hybrid Models: Architectures like UniFormer and Video Swin tried to combine the strengths of CNNs (local feature extraction) and Transformers (global dependency modeling) to balance efficiency and performance.
- SSMs in Vision: While SSMs like S4 were explored for vision, the recent success of Mamba spurred its application in 2D image tasks (Vision Mamba, VMamba). VideoMamba extends this trend to the more challenging video domain.
-
Differentiation:
VideoMamba distinguishes itself from prior work in several key ways:
- It is a purely SSM-based model, not a hybrid of CNNs and Transformers.
- It directly tackles the quadratic complexity bottleneck of Transformers, making end-to-end training on long videos computationally feasible.
- It introduces a self-distillation technique to overcome the scaling challenges specific to Mamba-style architectures in vision, a problem noted in previous work like VMamba.
4. Methodology (Core Technology & Implementation)
The core of VideoMamba is the adaptation of the Mamba state space model to handle 3D spatiotemporal video data.
- Principles: The underlying principle is to replace the quadratic-complexity self-attention mechanism of Transformers with the linear-complexity Selective Scan Mechanism (S6) of Mamba. This allows the model to efficiently process very long sequences of video patches while still capturing global context.
- SSM Preliminaries: SSMs are based on a continuous system that maps an input sequence $x(t)$ to an output sequence $y(t)$ through a hidden state $h(t)$:

  $$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t)$$

  - $h(t)$ is the hidden state.
  - $x(t)$ is the input sequence.
  - $y(t)$ is the output sequence.
  - $\mathbf{A}$ is the evolution matrix.
  - $\mathbf{B}$ and $\mathbf{C}$ are projection matrices.

  This continuous system is discretized for use in deep learning via a timescale parameter $\Delta$. Mamba's key innovation is making the parameters $\mathbf{B}$, $\mathbf{C}$, and the timescale parameter $\Delta$ data-dependent. This means they are dynamically generated from the input tokens, allowing the model to selectively propagate or forget information based on the content, which is the core of the Selective Scan Mechanism (S6). A minimal sketch of the discretized recurrence is shown below.
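To make the recurrence concrete, here is a minimal NumPy sketch of a discretized selective scan in the spirit of the equations above. The shapes, the simplified discretization of B (Euler-style rather than exact zero-order hold), and the sequential loop are illustrative assumptions; Mamba's actual implementation uses a hardware-aware parallel scan.

```python
import numpy as np

def selective_ssm_scan(x, A, B, C, delta):
    """Toy sequential scan of a discretized selective SSM (illustration only).

    x:     (L, D)    input tokens
    A:     (D, N)    state evolution parameters (shared across time)
    B:     (L, D, N) input-dependent input projection
    C:     (L, D, N) input-dependent output projection
    delta: (L, D)    input-dependent timescale
    Returns y: (L, D)
    """
    L, D = x.shape
    N = A.shape[-1]
    h = np.zeros((D, N))                     # hidden state
    y = np.empty((L, D))
    for t in range(L):                       # linear in sequence length
        A_bar = np.exp(delta[t][:, None] * A)       # zero-order-hold discretization of A
        B_bar = delta[t][:, None] * B[t]            # simplified (Euler-style) discretization of B
        h = A_bar * h + B_bar * x[t][:, None]       # state update
        y[t] = (C[t] * h).sum(-1)                   # readout
    return y

# Tiny usage example with random data.
L, D, N = 8, 4, 16
rng = np.random.default_rng(0)
y = selective_ssm_scan(rng.normal(size=(L, D)),
                       -np.exp(rng.normal(size=(D, N))),   # negative A values for stability
                       rng.normal(size=(L, D, N)),
                       rng.normal(size=(L, D, N)),
                       np.abs(rng.normal(size=(L, D))))
print(y.shape)  # (8, 4)
```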
- VideoMamba Architecture:
  (Image 1: Schematic of the VideoMamba architecture. Panel (a): the input video is split into 3D patch embeddings, combined with spatiotemporal position information, passed through stacked Bidirectional Mamba blocks, and classified by a prediction head. Panel (b): the bidirectional spatiotemporal scanning strategy, with forward and backward scan paths used to capture spatiotemporal dependencies.)
  As shown in Image 1, the VideoMamba architecture closely follows the design of a vanilla Vision Transformer (ViT):
  - Input Processing: An input video is divided into non-overlapping 3D patches (e.g., of size 1×16×16). These patches are flattened and projected into a sequence of L tokens, where L is the total number of patches.
  - Tokenization and Positional Embeddings: A learnable [CLS] token is prepended to the sequence. Since SSMs are position-sensitive, learnable spatial and temporal position embeddings (p_s for space and p_t for time) are added to the tokens.
  - Core Blocks: The sequence of tokens is processed by a stack of Bidirectional Mamba (B-Mamba) Blocks. As shown in Image 3, a B-Mamba block processes the flattened sequence in both forward and backward directions simultaneously and combines the results. This allows each token to gather contextual information from all other tokens in the sequence.
    (Image 3: Schematic of the Mamba blocks: (a) the Mamba block for 1D sequences and (b) the Bidirectional Mamba block, combining linear projections, sequence transformations (convolution and SSM), activations, gating, and residual connections.)
  - Classification: The output representation of the [CLS] token from the final layer is fed into a linear layer for classification. A minimal forward-pass sketch follows this list.
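The pipeline above can be summarized in a short PyTorch sketch. This is a hedged skeleton, not the official implementation: the class name `VideoMambaSketch`, the pluggable `block_cls` stand-in for the B-Mamba block, and the way the position embeddings are broadcast are my assumptions.

```python
import torch
import torch.nn as nn

class VideoMambaSketch(nn.Module):
    """Illustrative skeleton of the VideoMamba forward path (not the official code)."""

    def __init__(self, block_cls, dim=192, depth=12, num_classes=400,
                 frames=16, img_size=224, patch=16, in_ch=3):
        super().__init__()
        # 3D patch embedding with kernel/stride (1, patch, patch): one token grid per frame.
        self.patch_embed = nn.Conv3d(in_ch, dim, kernel_size=(1, patch, patch),
                                     stride=(1, patch, patch))
        n_space = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_spatial = nn.Parameter(torch.zeros(1, n_space + 1, dim))  # incl. [CLS] slot
        self.pos_temporal = nn.Parameter(torch.zeros(1, frames, dim))
        self.blocks = nn.ModuleList(block_cls(dim) for _ in range(depth))  # B-Mamba stand-ins
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                          # video: (B, C, T, H, W)
        x = self.patch_embed(video)                    # (B, dim, T, h, w)
        B, D, T, h, w = x.shape
        x = x.flatten(3).permute(0, 2, 3, 1)           # (B, T, h*w, dim): spatial-first layout
        x = x + self.pos_spatial[:, 1:].unsqueeze(1)   # spatial position embedding
        x = x + self.pos_temporal[:, :T].unsqueeze(2)  # temporal position embedding
        x = x.reshape(B, T * h * w, D)
        cls_tok = self.cls_token.expand(B, -1, -1) + self.pos_spatial[:, :1]
        x = torch.cat([cls_tok, x], dim=1)             # prepend learnable [CLS] token
        for blk in self.blocks:                        # stacked bidirectional Mamba blocks
            x = blk(x)
        return self.head(self.norm(x[:, 0]))           # classify from the final [CLS] token

# Shape check with a trivial stand-in block (identity), purely for illustration:
model = VideoMambaSketch(block_cls=lambda dim: nn.Identity())
out = model(torch.randn(2, 3, 16, 224, 224))           # -> (2, 400)
```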
- Spatiotemporal Scan:
  (Image 4: Schematic of four scan paths over the video token grid: (a) spatial-first bidirectional scan, (b) temporal-first bidirectional scan, (c) spatiotemporal bidirectional scan v1, and (d) spatiotemporal bidirectional scan v2. Arrows indicate scan order; the [CLS] token is omitted for clarity; the horizontal axis is the spatial dimension S and the vertical axis is the temporal dimension T.)
  A crucial design choice is how to "scan" the 2D+time sequence of patches. The paper explores several strategies (Image 4):
  - (a) Spatial-First: Scan all spatial patches within the first frame, then move to the second frame, and so on.
  - (b) Temporal-First: For the first spatial location, scan through all frames, then move to the second spatial location.
  - (c, d) Spatiotemporal: Hybrids that combine both scanning methods.
  The experiments showed that the simple Spatial-First bidirectional scan was the most effective, likely because it allows the model to better leverage pre-trained weights from 2D image models. A small sketch of the two basic orderings follows this item.
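To make the two basic orderings explicit, here is a small PyTorch sketch that only restates the flattening orders described above; the helper names `spatial_first` and `temporal_first` are hypothetical, not from the paper.

```python
import torch

def spatial_first(tokens):
    """tokens: (B, T, H, W, D) -> (B, T*H*W, D): all patches of frame 1, then frame 2, ..."""
    B, T, H, W, D = tokens.shape
    return tokens.reshape(B, T * H * W, D)

def temporal_first(tokens):
    """tokens: (B, T, H, W, D) -> (B, H*W*T, D): all frames of location 1, then location 2, ..."""
    B, T, H, W, D = tokens.shape
    return tokens.permute(0, 2, 3, 1, 4).reshape(B, H * W * T, D)

# A bidirectional scan processes the chosen ordering and its reverse
# (e.g., seq.flip(1)) and fuses the two outputs.
x = torch.arange(12.).reshape(1, 3, 2, 2, 1).expand(2, -1, -1, -1, -1)
print(spatial_first(x).shape, temporal_first(x).shape)  # torch.Size([2, 12, 1]) for both
```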
- Self-Distillation for Scalability: The authors found that larger VideoMamba models (e.g., -Base, -Middle) tended to overfit and perform worse than smaller ones (e.g., -Small). To solve this, they introduced a self-distillation strategy.
  - Procedure: A smaller, well-trained "teacher" model (e.g., VideoMamba-S) is used to guide the training of a larger "student" model (e.g., VideoMamba-M). The student is trained to minimize a loss function that includes both the standard classification loss and an L2 loss between the final feature maps of the student and the teacher. A hedged sketch of this combined loss follows this item.
  - Result: This simple technique effectively regularized the training of larger models, preventing overfitting and leading to the expected performance gains with increased model size.
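The procedure above corresponds to a simple combined objective. The PyTorch sketch below is an assumption-labeled illustration: the loss weight `alpha`, the detached (frozen) teacher features, and the feature granularity are guesses rather than confirmed details of the paper's recipe.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, student_feat, teacher_feat, labels, alpha=1.0):
    """Classification loss + L2 alignment between student and teacher final features.

    student_logits: (B, num_classes) predictions of the large student model
    student_feat:   (B, D) final features of the student
    teacher_feat:   (B, D) final features of the smaller, pre-trained teacher
    alpha:          weight of the distillation term (illustrative value)
    """
    cls_loss = F.cross_entropy(student_logits, labels)
    # If the feature dimensions differ, project one side first (e.g., with a linear layer).
    distill_loss = F.mse_loss(student_feat, teacher_feat.detach())  # teacher is not updated
    return cls_loss + alpha * distill_loss
```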
- Masked Modeling: To improve the model's sensitivity to fine-grained temporal details, VideoMamba is also adapted for masked modeling, inspired by VideoMAE and UMT.
  (Image 5: Comparison of video masking strategies: (a) input video frames; (b) random masking; (c) tube masking; (d) clip-level row masking; (e) frame-level row masking; (f) attention masking. Each strategy covers different regions along the spatial and temporal dimensions.)
  - Masking Strategy: The paper proposes row masking (Image 5), where entire rows of patches are masked. This is more suitable for Mamba's internal 1D convolution than random masking. Attention masking, which preserves adjacent meaningful content, was found to be the most effective. A hedged sketch of row masking follows this list.
  - Alignment: During pre-training, the model learns to align the features of its unmasked tokens with corresponding features from a pre-trained teacher model like CLIP-ViT. Due to architectural differences, only the final output features are aligned.
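As a rough illustration of frame-level row masking (one of the variants pictured in Image 5), the sketch below drops whole rows of the per-frame patch grid at a target ratio. The sampling scheme is a simplified assumption; the paper's clip-row and attention-based variants differ.

```python
import torch

def frame_row_mask(T, H, W, mask_ratio=0.8, generator=None):
    """Return a boolean mask of shape (T, H, W): True = masked token.

    Whole rows of the patch grid are dropped independently per frame, which keeps
    contiguous runs of visible tokens for Mamba's internal 1D convolution.
    """
    n_masked_rows = int(round(H * mask_ratio))
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    for t in range(T):
        rows = torch.randperm(H, generator=generator)[:n_masked_rows]
        mask[t, rows] = True                      # mask every patch in the chosen rows
    return mask

mask = frame_row_mask(T=8, H=14, W=14, mask_ratio=0.8)
print(mask.float().mean())  # ~0.79 (11 of 14 rows masked per frame)
```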
5. Experimental Setup
-
Datasets:
- Image Classification (for scaling study): ImageNet-1K (1.28M images, 1,000 classes).
- Short-term Video Understanding:
  - Kinetics-400 (K400): A large-scale, scene-related action recognition dataset with ~240k training videos.
  - Something-Something V2 (SthSthV2): A dataset focused on fine-grained temporal actions (e.g., "opening" vs. "closing" something), with ~169k training videos.
- Long-term Video Understanding:
  - Breakfast: Contains videos of 10 cooking activities.
  - COIN: A dataset of procedural tasks with an average video length of over 2 minutes.
  - Long-form Video Understanding (LVU): A benchmark with ~30k movie clips (1-3 minutes long) covering nine different tasks.
- Multi-modality Pre-training & Evaluation:
  - Pre-training: WebVid-2M (video-text pairs) and CC3M (image-text pairs).
  - Evaluation (zero-shot retrieval): MSRVTT, DiDeMo, ActivityNet, LSMDC, and MSVD.
-
Evaluation Metrics:
- Top-k Accuracy:
  - Conceptual Definition: Measures the percentage of samples for which the correct label is among the model's top k predictions. Top-1 accuracy requires the single highest-probability prediction to be correct; Top-5 allows the correct label to be within the top five predictions. It is a standard metric for classification tasks.
  - Mathematical Formula: $$\text{Top-k Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(\text{true label}_i \in \{\text{top-k predictions}_i\}\right)$$
  - Symbol Explanation: $N$ is the total number of samples, and $\mathbb{I}(\cdot)$ is the indicator function, which is 1 if the condition is true and 0 otherwise.
- Mean Squared Error (MSE):
  - Conceptual Definition: Used for regression tasks (like predicting a year in the LVU benchmark). It measures the average of the squared errors, that is, the average squared difference between the estimated values and the actual values. A lower MSE is better.
  - Mathematical Formula: $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
  - Symbol Explanation: $n$ is the number of samples, $y_i$ is the true value for sample $i$, and $\hat{y}_i$ is the predicted value.
- Recall@k (R@k):
  - Conceptual Definition: Used for retrieval tasks. For a given query (e.g., a text description), it measures the percentage of times the correct item (e.g., the corresponding video) is found within the top k retrieved results. The paper uses the notation @k (e.g., @1, @5, @10).
  - Mathematical Formula: $$\text{R@k} = \frac{1}{Q} \sum_{q=1}^{Q} \mathbb{I}\left(\text{correct item for query } q \in \text{top-k retrieved list for } q\right)$$
  - Symbol Explanation: $Q$ is the number of queries, and $\mathbb{I}(\cdot)$ is the indicator function that checks whether the correct item is in the top-k retrieved list for query $q$.
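For concreteness, here is a small self-contained Python sketch showing how these three metrics can be computed; it is a generic illustration with hypothetical helper names, not the paper's evaluation code.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """scores: (N, C) class scores; labels: (N,) ground-truth class indices."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def recall_at_k(sim, k=5):
    """sim: (Q, M) query-to-item similarity; the correct item for query q is item q."""
    ranks = np.argsort(-sim, axis=1)[:, :k]
    return float(np.mean([q in ranks[q] for q in range(sim.shape[0])]))

# Tiny usage example with random data.
rng = np.random.default_rng(0)
print(top_k_accuracy(rng.normal(size=(8, 5)), rng.integers(0, 5, size=8), k=5))  # 1.0 (k = #classes)
print(mse([1.0, 2.0], [1.5, 1.5]))                                               # 0.25
print(recall_at_k(rng.normal(size=(10, 10)), k=10))                              # 1.0 (k = #items)
```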
-
Baselines: The paper compares VideoMamba against a comprehensive set of baselines, including:
  - CNNs: SlowFast, X3D.
  - Transformers: TimeSformer, ViViT, Swin Transformer, VideoMAE.
  - CNN+Transformer Hybrids: UniFormer, MViT.
  - SSM-based: ViS4mer.
  - Multi-modal Models: UMT, CLIP4Clip, InternVideo.
6. Results & Analysis
-
Core Results:
1. Efficiency:
(Image 2: Throughput (img/s, left) and GPU memory (GB, right) of VideoMamba-Ti vs. TimeSformer-Ti for different numbers of input frames, measured at 224×224 resolution with batch size 128 on an NVIDIA A100-80G GPU. VideoMamba is better on both axes, and the gap widens for longer clips, with roughly 2-6x higher throughput and 20-40x lower memory usage; an inset highlights TimeSformer's out-of-memory failures.)
Image 2 graphically demonstrates VideoMamba's superior efficiency. Compared to TimeSformer-Ti, VideoMamba-Ti achieves significantly higher throughput (images/sec) and requires drastically less GPU memory, especially as the number of frames increases. For a 64-frame video, it is 6x faster and uses 40x less memory.

2. Scalability on ImageNet: The following is a transcribed version of Table 2 from the paper.
| Arch. | Model | iso. | Input Size | #Param (M) | FLOPs (G) | IN-1K Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CNN | ConvNeXt-T [53] | × | 224² | 29 | 4.5 | 82.1 |
| CNN | ConvNeXt-S [53] | × | 224² | 50 | 8.7 | 83.1 |
| CNN | ConvNeXt-B [53] | × | 224² | 89 | 15.4 | 83.8 |
| Trans. | Swin-T [51] | × | 224² | 28 | 4.5 | 81.3 |
| Trans. | Swin-S [51] | × | 224² | 50 | 8.7 | 83.0 |
| Trans. | Swin-B [51] | × | 224² | 88 | 15.4 | 83.5 |
| CNN+SSM | VMamba-T [50] | × | 224² | 22 | 5.6 | 82.2 |
| CNN+SSM | VMamba-S [50] | × | 224² | 44 | 11.2 | 83.5 |
| CNN+SSM | VMamba-B [50] | × | 224² | 75 | 18.0 | 83.7 |
| CNN | ConvNeXt-S [53] | ✓ | 224² | 22 | 4.3 | 79.7 |
| CNN | ConvNeXt-B [53] | ✓ | 224² | 87 | 16.9 | 82.0 |
| Trans. | DeiT-Ti [75] | ✓ | 224² | 6 | 1.3 | 72.2 |
| Trans. | DeiT-S [75] | ✓ | 224² | 22 | 4.6 | 79.8 |
| Trans. | DeiT-B [75] | ✓ | 224² | 87 | 17.6 | 81.8 |
| Trans. | DeiT-B [75] | ✓ | 384² | 87 | 55.5 | 83.1 |
| SSM | S4ND-ViT-B [58] | ✓ | 224² | 89 | - | 80.4 |
| SSM | Vim-Ti [91] | ✓ | 224² | 7 | 1.1 | 76.1 |
| SSM | Vim-S [91] | ✓ | 224² | 26 | 4.3 | 80.5 |
| SSM | VideoMamba-Ti | ✓ | 224² | 7 | 1.1 | 76.9 |
| SSM | VideoMamba-Ti | ✓ | 448² | 7 | 4.3 | 79.3 |
| SSM | VideoMamba-Ti | ✓ | 576² | 7 | 7.1 | 79.6 |
| SSM | VideoMamba-S | ✓ | 224² | 26 | 4.3 | 81.2 |
| SSM | VideoMamba-S | ✓ | 448² | 26 | 16.9 | 83.2 |
| SSM | VideoMamba-S | ✓ | 576² | 26 | 28.0 | 83.5 |
| SSM | VideoMamba-M | ✓ | 224² | 74 | 12.7 | 82.8 |
| SSM | VideoMamba-M | ✓ | 448² | 75 | 50.4 | 83.8 |
| SSM | VideoMamba-M | ✓ | 576² | 75 | 83.1 | 84.0 |

The results show that VideoMamba scales effectively. VideoMamba-M, with self-distillation, outperforms other isotropic architectures like DeiT-B and ConvNeXt-B while using fewer parameters. By fine-tuning at a higher resolution (576²), it achieves a very strong 84.0% Top-1 accuracy.

3. Short-term Video Understanding: On the temporal-sensitive SthSthV2 dataset (Table 4, transcribed below), VideoMamba-M achieves 68.4% Top-1 accuracy, outperforming pure attention models like ViViT-L (65.4%) and TimeSformer-HR (62.5%), and is competitive with the SOTA hybrid model UniFormer-B (70.4%). With masked pre-training, it reaches 71.4%, surpassing VideoMAE-B (70.8%). This highlights its strong ability to model fine-grained motion.

This is a transcribed version of Table 4 from the paper.
| Arch. | Model | iso. | Extra Data | Input Size | #Param (M) | FLOPs (G) | SSV2 Top-1 | SSV2 Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised: | | | | | | | | |
| CNN | SlowFast R101 [19] | × | K400 | 32×224² | 53 | 106×3×1 | 63.1 | 87.6 |
| CNN | TDN R50 [79] | × | IN-1K | 16×224² | 26 | 75×1×1 | 65.3 | 91.6 |
| Trans. | Swin-B [52] | × | K400 | 32×224² | 89 | 88×3×1 | 69.6 | 92.7 |
| CNN+Trans. | MViTv2-B [45] | × | K400 | 32×224² | 51 | 225×3×1 | 70.5 | 92.7 |
| CNN+Trans. | UniFormer-B [44] | × | IN-1K+K400 | 16×224² | 50 | 97×3×1 | 70.4 | 92.8 |
| Trans. | TimeSformer-HR [4] | ✓ | IN-21K | 16×224² | 121 | 1703×3×1 | 62.5 | - |
| Trans. | ViViT-L [2] | ✓ | IN-21K+K400 | 16×224² | 311 | 3992×3×4 | 65.4 | 89.8 |
| SSM | VideoMamba-M | ✓ | IN-1K | 16×288² | 74 | 333×3×4 | 68.4 | 91.6 |
| Self-supervised: | | | | | | | | |
| Trans. | VideoMAE-B (2400e) [74] | ✓ | - | 16×224² | 87 | 180×3×2 | 70.8 | 92.4 |
| Trans. | UMT-B (800e) [43] | ✓ | CLIP-400M | 8×224² | 87 | 180×3×2 | 70.8 | 92.6 |
| SSM | VideoMamba-M (800e) | ✓ | CLIP-400M | 16×288² | 74 | 333×3×2 | 71.4 | 92.9 |

4. Long-term Video Understanding: The following is a transcribed version of Table 6 from the paper.
| Method | e2e | Backbone | Neck Type | Pretraining Dataset | BF Top-1 | COIN Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| ViS4mer [35] | × | Swin-B | SSM | IN-21K+K600 | 88.2 | 88.4 |
| Turbo†32 [29] | ✓ | VideoMAE-B | - | K400 | 86.8 | 82.3 |
| Turbo†32 [29] | ✓ | VideoMAE-B | - | K400+HTM-AA | 91.3 | 87.5 |
| VideoMamba†64 | ✓ | VideoMamba-S | - | K400 | 97.4 | 88.7 |
| VideoMamba†64 | ✓ | VideoMamba-M | - | K400 | 95.8 | 89.5 |
| VideoMamba†64 | ✓ | VideoMamba-M† | - | K400 | 96.9 | 90.4 |

VideoMamba excels at long-term tasks. On the Breakfast dataset, it achieves a remarkable 97.4% Top-1 accuracy, significantly outperforming prior methods like ViS4mer (88.2%), which relied on extracting features from a pre-trained model. On COIN, it achieves 90.4%. This success is attributed to its ability to perform efficient end-to-end training on long video sequences, which captures temporal dynamics better than feature-based approaches.

5. Multi-modal Understanding: On zero-shot video-text retrieval, VideoMamba consistently outperforms the ViT-based UMT baseline across five benchmarks when trained on the same data. The performance gap is notably larger on datasets with longer videos and more complex scenes like ActivityNet and DiDeMo, reinforcing the strength of Mamba in modeling complex, long-range dependencies.
-
Ablations / Parameter Sensitivity:
(Image 6: Ablation of self-distillation and early stopping. Left plot (a): self-distillation avoids overfitting, with Top-1 accuracy improving steadily over training epochs. Right plot (b): early stopping brings no clear benefit. Both plots show Top-1 accuracy versus training epochs, with zoomed insets for detail.)
- Self-Distillation (Image 6): The left plot shows that without self-distillation (SD), the larger VideoMamba-B and -M models perform worse than -S. With SD, VideoMamba-M's performance improves significantly, demonstrating the technique's effectiveness in preventing overfitting. The right plot shows that early stopping (ES) did not provide any benefits.
- Scan Type: The Spatial-First bidirectional scan was found to be the most effective strategy.
- Frames & Resolution: Increasing the number of frames consistently helped on K400. For SthSthV2, which has very short videos, performance saturated and did not benefit from very long inputs.
- Masked Pretraining: Attention masking and clip-row masking were the most effective strategies. An 80% mask ratio combined with strong regularization (droppath) yielded the best results.
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully introduces VideoMamba, a pure SSM-based model for video understanding. It demonstrates that Mamba's linear complexity and selective modeling capabilities make it an excellent choice for video, overcoming the limitations of both CNNs and Transformers. The authors establish its effectiveness across four core dimensions: scalability (via self-distillation), sensitivity to short-term actions, superiority in long-term understanding, and compatibility with multi-modal tasks. VideoMamba represents a promising new direction for building efficient and powerful foundation models for long-video comprehension.
-
Limitations & Future Work: The authors acknowledge several limitations due to resource constraints:
- They have not scaled the model to extremely large sizes (e.g., a "Giga" version).
- They have not integrated other modalities like audio.
- They have not explored integration with Large Language Models (LLMs) for understanding hour-long videos. These are all planned as future research directions.
-
Personal Insights & Critique:
- Significance: This paper is highly significant as it provides a compelling alternative to the dominant Transformer architecture for video. The efficiency gains are not just incremental; they are transformative, potentially unlocking new capabilities in long-form video analysis (e.g., movie analysis, extended surveillance footage) that were previously computationally infeasible.
- Simplicity and Elegance: The proposed VideoMamba architecture is clean and closely follows the ViT design, making it easy to understand and adopt. The self-distillation solution to the overfitting problem is also remarkably simple and practical.
- Future Impact: VideoMamba and other Mamba-based architectures are likely to become a cornerstone of future video foundation models. The linear complexity is a critical feature as the demand for processing ever-longer and higher-resolution videos grows. The work paves the way for models that can understand not just short clips, but entire narratives, procedures, and events spanning minutes or even hours.
- Open Questions: While the Spatial-First scan worked best, the optimal way to linearize spatiotemporal data for a 1D SSM remains an interesting area for future research. Exploring more sophisticated scanning patterns or hierarchical SSMs could yield further improvements.