LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
TL;DR Summary
LinVideo is a data-free post-training framework that selectively replaces self-attention with linear attention in video diffusion models, using selective transfer and anytime distribution matching to preserve performance; it achieves a 1.25-2.00x speedup and, with 4-step distillation, a 15.92x latency reduction.
Abstract
Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
- Authors: Yushi Huang, Xingtong Ge, Ruihao Gong, Chengtao Lv, and Jun Zhang. Their affiliations include the Hong Kong University of Science and Technology (HKUST), Beihang University (BUAA), and Nanyang Technological University (NTU).
- Journal/Conference: This paper is a preprint available on arXiv. The listed publication date of October 9, 2025, suggests it is an early submission for a future conference or journal.
- Publication Year: 2025 (as listed on the preprint).
- Abstract: The paper addresses the high computational cost of video diffusion models (DMs), which is primarily due to the quadratic complexity ($O(n^2)$) of self-attention. While linear attention offers a more efficient alternative, fully replacing quadratic attention compromises model performance and typically requires expensive pre-training from scratch. The authors introduce LinVideo, a data-free post-training framework that selectively replaces some self-attention modules with linear attention in a pre-trained model without degrading its quality. The framework features two main innovations: (1) selective transfer, an automatic method that treats layer selection as a classification problem to identify which layers to convert with minimal impact, and (2) anytime distribution matching (ADM), an efficient training objective that aligns the output distributions of the original and modified models at any point during the sampling process. Experiments show that LinVideo achieves a 1.25-2.00x speedup while maintaining quality, and a further distilled version reduces latency by 15.92x.
- Original Source Link: The paper is available as a preprint at https://arxiv.org/abs/2510.08318v1. The PDF can be accessed directly at https://arxiv.org/pdf/2510.08318v1.pdf.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: State-of-the-art video generation models, such as OpenAI's Sora, are built on diffusion models that use the Transformer architecture. A core component, self-attention, has a computational cost that scales quadratically with the input sequence length $n$, i.e., $O(n^2)$. For high-resolution, long-duration videos, the sequence of tokens can be extremely long (e.g., >50,000, giving an attention matrix with roughly 2.5 billion entries per head per layer), making the self-attention mechanism a prohibitive computational bottleneck.
- Gaps in Prior Work: Existing solutions are inadequate. Attention sparsification methods skip some computations but often fail to achieve significant speedups. Linear attention, which reduces complexity to $O(n)$, is a promising alternative but is less expressive than standard quadratic attention. Consequently, simply swapping all attention layers requires costly and time-consuming pre-training from scratch to recover the model's performance, making it impractical for existing large-scale models.
- Paper's Innovation: This paper asks if it's possible to accelerate a pre-trained video model by replacing some of its quadratic attention layers with linear ones through an efficient post-training process, all without needing the original (and often proprietary) training data.
- Main Contributions / Findings (What): The paper introduces LinVideo, a framework designed to achieve this goal, with three primary contributions:
- A Novel Post-Training Framework: LinVideo is the first data-free, post-training framework that intelligently replaces a portion of quadratic attention layers with linear ones in a pre-trained video diffusion model, significantly boosting inference efficiency without compromising output quality.
- Selective Transfer: Instead of manual or heuristic choices, the paper proposes a principled method to automatically determine which layers to replace. It frames the selection as a binary classification problem for each layer and uses a learnable score to progressively and smoothly transition selected layers to linear attention, minimizing performance degradation.
- Anytime Distribution Matching (ADM): To fine-tune the modified model effectively, the authors introduce a new training objective, ADM. Unlike previous methods that only match the final generated output, ADM aligns the data distributions of the student and teacher models at every timestep of the generation process. This approach is not only more effective at preserving quality but also more efficient, as it avoids the need for an auxiliary model.
The key finding is that LinVideo successfully accelerates a high-quality video model by up to 2x with no perceptible drop in visual quality. When combined with few-step distillation, the speedup is a remarkable 15.92x.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Diffusion Models (DMs): These are a class of generative models that create data (like images or videos) by reversing a gradual noising process. The process starts with a clean data sample, adds Gaussian noise in a series of steps (the "forward process"), and trains a neural network to reverse this process. To generate new data, the model starts with pure noise and iteratively "denoises" it (the "reverse process") until a clean sample emerges.
- Video Diffusion Models (Video DMs): An extension of DMs to the temporal domain. In addition to generating spatially coherent pixels within a frame, the model must also learn to maintain temporal consistency across frames to produce a realistic video. Modern Video DMs, like the one used in this paper, are often based on the Diffusion Transformer (DiT) architecture.
- Rectified Flow: A generative modeling framework closely related to diffusion models. Instead of learning to reverse a noising process, rectified flow models learn a "velocity field" that transports samples from a simple noise distribution to a complex data distribution along straight paths. The paper's base model, Wan 1.3B, is a rectified flow model (see the math block after this list).
- Self-Attention: The key mechanism in Transformer models. For each element in a sequence, self-attention calculates a weighted average of all other elements, where the weights signify "attention" or importance. This allows the model to capture long-range dependencies. However, calculating these weights requires comparing every element with every other element, leading to a computational complexity of $O(n^2)$, where $n$ is the sequence length. This becomes a major bottleneck for long sequences like videos.
- Linear Attention: A family of attention mechanisms that approximate standard self-attention but with a computational complexity of $O(n)$. They achieve this by using a kernel function to decompose the attention calculation, which allows reordering matrix multiplications to avoid explicitly computing the attention matrix (see the code sketch after this list). The trade-off is that linear attention is generally less expressive than its quadratic counterpart.
- Post-Training & Data-Free Fine-tuning: Post-training refers to modifying an already trained model, which is far more resource-efficient than training a new model from scratch. Data-free fine-tuning is a specific type of post-training where the original training dataset is not available. Instead, knowledge is transferred by using the original model to generate synthetic data (e.g., input-output pairs) for training the new, modified model.
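For concreteness, the rectified-flow setup mentioned above can be written in a few lines. This is standard, illustrative notation; the paper and Wan may use a different time convention:

```latex
% Rectified flow, illustrative notation: data x_0, noise \epsilon ~ N(0, I).
% The network v_\theta learns the constant velocity of the straight path.
\begin{align}
  x_t &= (1 - t)\,x_0 + t\,\epsilon, \qquad t \in [0, 1] \\
  \frac{\mathrm{d}x_t}{\mathrm{d}t} &= \epsilon - x_0 \;\approx\; v_\theta(x_t, t) \\
  \text{sampling step:}\quad x_{t-\Delta t} &= x_t - \Delta t\, v_\theta(x_t, t)
\end{align}
```

Likewise, the matmul-reordering trick behind linear attention fits in a short NumPy sketch. The feature map below is a generic non-negative placeholder, not the Hedgehog kernel the paper uses; the point is that computing $\phi(K)^\top V$ first costs $O(n \cdot d^2)$ instead of $O(n^2 \cdot d)$:

```python
import numpy as np

def quadratic_attention(Q, K, V):
    """Standard softmax attention: materializes an (n, n) matrix -> O(n^2 * d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized linear attention: reorders the matmuls -> O(n * d^2).
    `phi` is a placeholder non-negative feature map (LinVideo learns a
    Hedgehog kernel instead)."""
    Qp, Kp = phi(Q), phi(K)                                # (n, d) each
    kv = Kp.T @ V                                          # (d, d), computed once
    z = Kp.sum(axis=0)                                     # (d,) normalizer
    return (Qp @ kv) / (Qp @ z)[:, None]                   # (n, d), no (n, n) matrix

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)
```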
- Previous Works: The paper positions itself relative to two main categories of efficient attention methods for video generation, plus one concurrent work:
  - Attention Sparsification: Methods like Sparse VideoGen (SVG) and XAttention aim to reduce computation by skipping calculations for less important token pairs. However, these methods often struggle to achieve high sparsity and may not yield substantial speedups on moderately long sequences.
  - Full Linearization via Pre-training: Models like SANA-Video, LinGen, and Matten replace all quadratic attention layers with linear attention or similar mechanisms (like state-space models). While this achieves maximum efficiency, it requires massive computational resources for pre-training from scratch to compensate for the expressiveness gap between linear and quadratic attention.
  - Concurrent Work: The paper notes SLA, a concurrent work that proposes mixing quadratic and linear attention within a single layer (intra-layer). In contrast, LinVideo focuses on replacing entire layers (inter-layer), making the two approaches potentially complementary.
- Differentiation: LinVideo carves a unique niche by being an efficient post-training framework for partial linearization. It avoids the massive cost of pre-training while intelligently selecting which layers to replace to preserve performance. Its two core technical novelties, selective transfer (for what to replace) and anytime distribution matching (for how to train), are specifically designed to make this post-training process both effective and efficient.
4. Methodology (Core Technology & Implementation Details)
The core of LinVideo is a two-part strategy: first, deciding which attention layers to replace, and second, fine-tuning the modified model to recover performance.

1. Preparation: Data-Free Post-Training
Since the original video dataset is unavailable, LinVideo generates its own training data. It feeds the pre-trained video DM random noise and records the model's internal states, specifically the input latent $x_t$ and the predicted velocity $v_t$, at each step of the generation process. This creates a large dataset of $(x_t, v_t)$ pairs, which serves as the ground truth for fine-tuning the new, linearized model. The paper notes that simply minimizing the Mean Squared Error (MSE) between the predictions of the two models leads to visual artifacts and poor generalization. A minimal sketch of the collection step follows.
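The sketch below shows how such a synthetic dataset could be collected. The `teacher`, `scheduler`, text-encoding call, and latent shape are hypothetical stand-ins for the Wan 1.3B pipeline, not the paper's actual code:

```python
import torch

@torch.no_grad()
def build_synthetic_dataset(teacher, scheduler, prompts, num_samples):
    """Data-free training set: run the frozen teacher's sampler from pure
    noise and record (x_t, t, v_t) at every step. All APIs here are
    assumed placeholders for the real pipeline."""
    dataset = []
    for i in range(num_samples):
        cond = teacher.encode_text(prompts[i % len(prompts)])  # hypothetical API
        x = torch.randn(1, 16, 21, 60, 104)                    # assumed latent shape
        for t in scheduler.timesteps:                          # e.g., 50 steps
            v = teacher(x, t, cond)                            # velocity prediction
            dataset.append((x.cpu(), t, v.cpu()))              # ground-truth pair
            x = scheduler.step(v, t, x)                        # rectified-flow update
    return dataset
```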
2. Selective Transfer for Effective Linearization
The authors observe that not all attention layers are equally important. As shown in Figure 2, replacing shallow layers is generally less harmful than replacing deep ones, but certain critical layers (like the very first one) must be preserved. This motivates a data-driven approach to layer selection.
*Figure 2: Four subplots showing how scores on subject consistency, imaging quality, motion smoothness, and dynamic degree change when different layer ranges are replaced with linear attention; the x-axis denotes the layer range, and a separate marker denotes models fine-tuned for an additional 3,000 steps.*
- Core Idea: The choice between quadratic and linear attention for each layer is framed as a binary classification problem. A learnable score $\alpha$ is introduced for each of the attention layers.
- Mixed-Attention Computation: During training, each layer computes a weighted sum of both attention types (a code sketch follows this list):
$$o_i = \alpha \cdot \frac{\sum_j \exp\!\left(q_i^\top k_j / \sqrt{d}\right) v_j}{\sum_j \exp\!\left(q_i^\top k_j / \sqrt{d}\right)} + (1 - \alpha) \cdot \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)}$$
- $o_i$: the output for the $i$-th token.
- $\alpha$: the learnable score for this layer. An $\alpha$ close to 1 favors quadratic attention, while an $\alpha$ close to 0 favors linear attention.
- The first term is standard quadratic self-attention.
- The second term is linear attention, using the Hedgehog kernel function $\phi(\cdot)$.
- Training and Inference: The scores for all layers are initialized to 1 (fully quadratic model) and optimized during training. For inference, each $\alpha$ is rounded to the nearest integer (0 or 1) to make a hard selection, discarding the unused attention branch.
- Guiding the Selection with Loss Functions: Two additional loss terms guide the optimization of the scores $\alpha$:
- Constraint Loss ($\mathcal{L}_{\mathrm{con}}$): This loss ensures that the final number of replaced layers matches a predefined target. Since the rounding function is not differentiable, a Straight-Through Estimator (STE) is used to approximate its gradient.
- Regularization Loss ($\mathcal{L}_{\mathrm{reg}}$): This loss encourages the scores to move towards the extremes (0 or 1) during training. This reduces the gap between the training-time mixed attention and the inference-time hard selection, preventing performance drops from rounding. Its annealing hyperparameter is decayed from a large value to a small one, allowing more flexibility at the start of training and enforcing a hard decision towards the end. Figure 3 illustrates that without this loss, many $\alpha$ values hover near 0.5, making the final rounding step destructive.
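The sketch below ties these pieces together for one layer: a mixed forward pass with a learnable $\alpha$, hard selection at inference, and paraphrased versions of the two selection losses. The exact loss forms and the annealing schedule are assumptions, not the paper's equations:

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """One layer's mixed attention with a learnable selection score alpha.
    `quad_attn` / `lin_attn` stand in for the layer's quadratic and linear
    (Hedgehog-kernel) attention branches."""
    def __init__(self, quad_attn, lin_attn):
        super().__init__()
        self.quad_attn, self.lin_attn = quad_attn, lin_attn
        self.alpha = nn.Parameter(torch.tensor(1.0))   # init 1.0 = fully quadratic

    def forward(self, x):
        a = self.alpha.clamp(0.0, 1.0)
        if not self.training:                          # hard selection at inference
            return self.quad_attn(x) if a.round() == 1 else self.lin_attn(x)
        return a * self.quad_attn(x) + (1 - a) * self.lin_attn(x)

def selection_losses(alphas, target, tau):
    """alphas: per-layer scores; target: number of layers to linearize;
    tau: annealed from large to small, so the push toward {0, 1} strengthens
    over training. Both loss forms are paraphrased, not copied."""
    # Straight-through estimator: round in the forward pass, identity gradient.
    hard = alphas.round() + (alphas - alphas.detach())
    l_con = ((1 - hard).sum() - target) ** 2           # hit the replacement budget
    l_reg = (alphas * (1 - alphas)).sum() / tau        # penalize scores near 0.5
    return l_con, l_reg
```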

3. Anytime Distribution Matching (ADM)
The second key innovation is an improved training objective.
- Problem with Existing Objectives: Previous distillation methods for DMs typically focus on matching the distribution of the final generated samples ($x_0$). The authors find this is insufficient and leads to quality degradation. Furthermore, these methods often require training a separate, costly auxiliary model to estimate the score function of the generator, making the process inefficient.
- ADM Idea: Instead of just matching the final distributions, ADM seeks to align the distributions of samples from the student model and the teacher model at any timestep $t$ along the entire sampling trajectory. This is done by minimizing the Kullback-Leibler (KL) divergence between them.
- ADM Loss and Gradient: The loss is defined as
$$\mathcal{L}_{\mathrm{ADM}} = \mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\left(p^{\mathrm{stu}}_t \,\big\|\, p^{\mathrm{tea}}_t\right)\right].$$
The gradient of this loss with respect to the student parameters $\theta$ takes the form
$$\nabla_\theta \mathcal{L}_{\mathrm{ADM}} = \mathbb{E}_{t,\,x_t}\!\left[\left(s^{\mathrm{stu}}(x_t, t) - s^{\mathrm{tea}}(x_t, t)\right)\frac{\partial x_t}{\partial \theta}\right],$$
where $s^{\mathrm{tea}}$ and $s^{\mathrm{stu}}$ are the score functions (gradients of the log-probability) of the teacher and student distributions, respectively.
- Efficient Score Estimation: The crucial insight is that since the student model is a multi-step diffusion model itself, it can be used to estimate its own score function $s^{\mathrm{stu}}$. This eliminates the need for an auxiliary model. The paper shows that for rectified flow models, the score difference reduces to the velocity gap up to a timestep-dependent coefficient:
$$s^{\mathrm{stu}}(x_t, t) - s^{\mathrm{tea}}(x_t, t) \;\propto\; v^{\mathrm{stu}}(x_t, t) - v^{\mathrm{tea}}(x_t, t).$$
This elegantly connects the abstract distribution matching objective to a concrete and efficiently computable difference between the velocity predictions of the teacher and student models.
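To make the mechanics concrete, here is a loose sketch of an ADM-style update in the spirit of DMD's stop-gradient trick, adapted to rectified flow. The signatures, the re-noising convention, and the handling of the timestep-dependent coefficient (folded into an overall weight and omitted here) are all assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def adm_step(student, teacher, x_gen, cond, t):
    """One ADM-style loss evaluation, sketched. `x_gen` is a sample produced
    by the student, so gradients flow into the student through it."""
    noise = torch.randn_like(x_gen)
    x_t = (1 - t) * x_gen + t * noise         # re-noise the sample to timestep t
    with torch.no_grad():
        v_tea = teacher(x_t, t, cond)         # teacher velocity
        v_stu = student(x_t, t, cond)         # student estimates its own score
        grad = v_stu - v_tea                  # velocity gap, ∝ score difference
    # Surrogate loss whose gradient w.r.t. x_gen equals `grad` (DMD-style trick).
    return 0.5 * F.mse_loss(x_gen, (x_gen - grad).detach())
```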
4. Training Overview
The final training loss combines the ADM objective with the selective transfer losses:
$$\mathcal{L} = \mathcal{L}_{\mathrm{ADM}} + \lambda \left(\mathcal{L}_{\mathrm{con}} + \mathcal{L}_{\mathrm{reg}}\right),$$
where $\lambda$ is a weighting hyperparameter. For further speedup, the resulting LinVideo model can be distilled into a few-step generator using existing techniques like DMD2.
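Reusing the hypothetical helpers from the earlier sketches, one training step might then look as follows (the grouping and weighting of the terms is an assumption):

```python
# alphas gathered from every MixedAttention layer; tau_schedule and lam are
# assumed hyperparameter helpers, not names from the paper.
alphas = torch.stack([layer.alpha for layer in mixed_layers])
l_con, l_reg = selection_losses(alphas, target=16, tau=tau_schedule(step))
loss = adm_step(student, teacher, x_gen, cond, t) + lam * (l_con + l_reg)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```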
5. Experimental Setup
- Datasets: The experiments are conducted in a data-free setting. The training data consists of 50,000 input-output pairs generated by the original Wan 1.3B model.
- Base Model: The framework is applied to Wan 1.3B, an open-source, rectified flow-based text-to-video model that generates 5-second videos at 16 FPS. The model has 30 attention layers.
- Evaluation Metrics:
- VBench: A standard benchmark for evaluating video generation quality. The authors report on 8 dimensions:
  - Imaging Quality: Perceptual quality and fidelity of individual frames.
  - Aesthetic Quality: How visually pleasing the video is.
  - Motion Smoothness: The absence of jitter or flickering between frames.
  - Dynamic Degree: The amount of motion and activity in the video.
  - Background Consistency: Whether the background remains stable and coherent.
  - Subject Consistency: Whether the main subject maintains its identity and appearance.
  - Scene Consistency: Overall coherence of the scene across the video.
  - Overall Consistency: A composite measure of temporal consistency.
- VBench-2.0: An updated benchmark with more challenging prompts that test for adherence to physical laws, commonsense reasoning, and complex interactions.
- Baselines: LinVideo is compared against several strong baselines:
  - Lossless Baseline: FlashAttention2 (FA2), a highly optimized implementation of standard quadratic attention. This represents the upper bound on quality but is the slowest.
  - Static Sparse Attention: Sparse VideoGen (SVG), Sparse VideoGen 2 (SVG2), and DiT-FastAttn (DFA).
  - Dynamic Sparse Attention: XAttention.
6. Results & Analysis
- Core Results:
The following table, transcribed from Table 1 in the paper, compares LinVideo with baselines on VBench. LinVideo replaces 16 out of 30 attention layers.

| Method | Imaging Quality ↑ | Aesthetic Quality ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Background Consistency ↑ | Subject Consistency ↑ | Scene Consistency ↑ | Overall Consistency ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FlashAttention2 [7] | 66.25 | 59.49 | 98.42 | 59.72 | 96.57 | 95.28 | 39.14 | 26.18 |
| DFA [62] | 65.41 | 58.35 | 98.11 | 58.47 | 95.82 | 94.31 | 38.43 | 26.08 |
| XAttn [55] | 65.32 | 58.51 | 97.42 | 59.02 | 95.43 | 93.65 | 38.14 | 26.22 |
| SVG [52] | 65.78 | 59.16 | 97.32 | 58.87 | 95.79 | 93.94 | 38.54 | 25.87 |
| SVG2 [56] | 66.03 | 59.31 | 98.07 | 59.44 | 96.61 | 94.95 | 39.14 | 26.48 |
| Ours | 66.07 | 59.41 | 98.19 | 59.67 | 96.72 | 95.12 | 39.18 | 26.52 |
| Ours + DMD2 [58] | 65.62 | 57.74 | 97.32 | 61.26 | 95.47 | 93.74 | 38.78 | 25.94 |

LinVideo ("Ours") consistently outperforms all sparse-attention baselines and achieves performance nearly identical to the full-attention FlashAttention2 baseline, even surpassing it on consistency metrics. The 4-step distilled model (Ours + DMD2) maintains high quality with only a marginal drop.
*Figure 6: End-to-end runtime latency (in seconds) of different methods for the Wan 1.3B model on a single H100 80GB GPU, with relative speedups annotated over each bar.*
Figure 6 shows that LinVideo (with 16 of 30 layers replaced) provides a 1.43x speedup over FlashAttention2. As the number of replaced layers increases, the speedup grows. The distilled model (Ours + DMD2) achieves a massive 15.92x latency reduction. Figure 7 demonstrates that the speedup advantage of LinVideo grows with the number of frames, reaching 2.00x for longer videos, which highlights the benefit of its $O(n)$ complexity.
*Figure 7: End-to-end latency of FA2 versus Ours at different frame counts, with speedup ratios annotated; the latency gap widens as the frame count grows, reaching up to 2.00x.*
- Ablation Studies:
- Choice of target: The table below (transcribed from Table 2) shows that performance degrades gracefully as more layers are replaced (target increases), with a significant drop appearing only at the largest setting (target = 20). This confirms the trade-off between speed and quality.

| target | Imaging Quality ↑ | Aesthetic Quality ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Overall Consistency ↑ |
| --- | --- | --- | --- | --- | --- |
| 10 | 66.32 | 59.18 | 98.68 | 60.06 | 26.35 |
| 12 | 66.36 | 59.14 | 98.57 | 59.73 | 26.65 |
| 14 | 66.17 | 58.88 | 98.34 | 59.67 | 26.29 |
| 16 | 66.07 | 59.41 | 98.19 | 59.67 | 26.52 |
| 18 | 65.84 | 58.32 | 97.78 | 58.63 | 26.08 |
| 20 | 64.38 | 57.02 | 95.49 | 57.12 | 23.30 |
- Effect of Selective Transfer: The table below (transcribed from Table 3) demonstrates the effectiveness of the proposed components. LinVideo (with the learnable score $\alpha$) significantly outperforms Manual selection (fixing the layers based on LinVideo's final choice) and a Heuristic baseline. This proves that the progressive, smooth training process is crucial. Furthermore, removing the regularization loss (w/o $\mathcal{L}_{\mathrm{reg}}$) causes a catastrophic performance drop, confirming its role in stabilizing training.

| Method | Imaging Quality ↑ | Aesthetic Quality ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ |
| --- | --- | --- | --- | --- |
| LinVideo | 66.07 | 59.41 | 98.19 | 59.67 |
| Manual | 62.97 | 57.21 | 92.25 | 52.87 |
| Heuristic | 60.74 | 54.13 | 90.36 | 50.61 |
| w/o $\mathcal{L}_{\mathrm{reg}}$ | 18.62 | 17.83 | 12.59 | 7.48 |
- Effect of ADM: The table below (transcribed from Table 4) shows that the ADM objective is superior to both a naive MSE loss and the standard distillation loss ($\mathcal{L}_{\mathrm{DMD}}$). Crucially, as shown in Figure 8, ADM is ~4.4x faster to train than objectives that require an auxiliary score model, demonstrating its high efficiency.

Table 4: Ablation results of ADM (transcribed from the paper)

| Method | Imaging Quality ↑ | Aesthetic Quality ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Overall Consistency ↑ |
| --- | --- | --- | --- | --- | --- |
| LinVideo | 66.07 | 59.41 | 98.19 | 59.67 | 26.52 |
| w/ … | 61.34 | 56.12 | 91.34 | 51.38 | 19.87 |
| w/ … | 63.15 | 57.86 | 93.11 | 55.62 | 22.51 |
| w/ … | 64.38 | 58.11 | 94.73 | 56.34 | 23.06 |
*Figure 8: A table and line chart comparing LinVideo under different training objectives on imaging quality, aesthetic quality, motion smoothness, dynamic degree, and overall consistency, together with training time; the chart annotates a 4.4x training speedup and the training hours for each objective.*
7. Conclusion & Personal Thoughts
- Conclusion Summary: This paper introduces LinVideo, a highly practical and effective framework for accelerating pre-trained video diffusion models. By selectively replacing computationally expensive quadratic attention layers with efficient linear ones, it achieves significant speedups (1.25-2.00x) without sacrificing generation quality. The framework's two core contributions, selective transfer for automatic layer selection and anytime distribution matching (ADM) for efficient, high-fidelity training, are key to its success. The approach is data-free, making it applicable to large, proprietary models, and can be combined with few-step distillation for even greater acceleration.
- Limitations & Future Work:
- The authors note that their implementation does not use custom, highly-optimized kernels for linear attention. Integrating such kernels could unlock further speedups.
- The paper suggests that LinVideo's inter-layer replacement strategy is orthogonal to intra-layer sparse attention methods. Future work could explore combining these approaches for synergistic gains in efficiency and performance.
- Personal Insights & Critique:
- Practicality and Impact: The data-free, post-training nature of LinVideo makes it exceptionally valuable. The ability to optimize massive, existing models without needing their original training data or undertaking costly re-training from scratch is a huge practical advantage for deploying large-scale generative AI.
- Methodological Strength: The selective transfer mechanism is a clever and generalizable idea. Framing the architectural choice as a learnable, regularized classification problem could be applied to other model compression tasks like pruning or quantization, where decisions must be made on a per-component basis.
- Efficiency of ADM: The anytime distribution matching objective is a standout contribution. By showing that the student model can estimate its own score function, the authors eliminate a major computational bottleneck in distillation-style training, making the process faster and simpler.
- A Subtle Point on "Data-Free": While the method doesn't require the original training dataset, it does require a one-time, computationally intensive step of generating a synthetic dataset using the original model. This is a pragmatic trade-off, far cheaper than full pre-training but not entirely "free."
- Open Questions: The current framework finds a single, static set of layers to replace. An interesting avenue for future research would be a dynamic approach where the decision to use linear or quadratic attention could depend on the input prompt, the specific content being generated, or even the current timestep in the diffusion process.