Training-Free Efficient Video Generation via Dynamic Token Carving
TL;DR Summary
The paper presents Jenga, a training-free method for efficient video generation that addresses the computational bottlenecks of Video Diffusion Transformers. Jenga achieves an 8.83× speedup while maintaining generation quality, significantly improving practical deployment efficiency.
Abstract
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83× speedup with 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
In-depth Reading
1. Bibliographic Information
1.1. Title
Training-Free Efficient Video Generation via Dynamic Token Carving
1.2. Authors
Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, and Jiaya Jia. The authors are primarily affiliated with the Chinese University of Hong Kong (CUHK), the Hong Kong University of Science and Technology (HKUST), Kuaishou Technology, and SmartMore. This group includes prominent researchers in computer vision and deep learning, such as Jiaya Jia, known for his work in image processing and computational photography.
1.3. Journal/Conference
This paper was published on arXiv (a pre-print server for rapid dissemination of research) in May 2025. Given its subject matter and the state-of-the-art results on benchmarks like VBench, it is targeted toward major computer vision or machine learning conferences such as CVPR, ICCV, or NeurIPS.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses the high computational cost of Video Diffusion Transformer (DiT) models, which is caused by the quadratic complexity of self-attention and the multi-step nature of diffusion processes. The authors propose Jenga, a training-free inference pipeline that significantly speeds up video generation. Jenga uses two main strategies: (1) Progressive Resolution (ProRes), which generates low-resolution content in early stages, and (2) Block-Wise Attention Carving, which uses 3D space-filling curves to dynamically select and compute only the most important token interactions in later stages. Experimental results show up to an 8.83× speedup on models like HunyuanVideo with almost no loss in quality, enabling high-quality video generation on modern hardware in seconds rather than minutes.
1.6. Original Source Link
https://arxiv.org/abs/2505.16864
2. Executive Summary
2.1. Background & Motivation
The recent success of Diffusion Transformers (DiT) has revolutionized high-resolution video generation. However, generating even a 5-second 720P video can take nearly 30 minutes on a high-end GPU. This inefficiency arises from two sources:
- Self-Attention Bottleneck: The complexity of the self-attention mechanism is $O(n^2)$, meaning that if you double the number of video tokens (data points), the computation increases fourfold.
- Iterative Denoising: Diffusion models work by gradually removing noise over many steps (e.g., 50 steps), multiplying the already high per-step cost.
Existing solutions like distillation (training a faster model) require expensive retraining and often reduce video quality. The motivation behind Jenga is to create a plug-and-play solution that requires no retraining and tackles both the per-step cost and the total number of processed tokens.
2.2. Main Contributions / Findings
- Attention Carving: A novel method that partitions video tokens into 3D blocks and dynamically ignores unimportant blocks during computation.
- Progressive Resolution (ProRes): A strategy that starts generation at a lower resolution (where fewer tokens exist) and upscales to full resolution only for the final refinement steps.
- Text-Attention Amplifier: A mechanism to prevent the "zoomed-in" effect (field-of-view degradation) often seen when generating at low resolutions.
- Exceptional Performance: Achieving speedups of up to 8.83× across multiple state-of-the-art models (HunyuanVideo, Wan2.1, AccVideo) while maintaining competitive scores on the VBench benchmark.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Diffusion Models & Denoising
A Diffusion Model generates data (like images or videos) by learning to reverse a "forward" process where noise is added to data. During generation (inference), the model starts with pure Gaussian noise and iteratively removes it.
- Timestep ($t$): Generation goes from $t = T$ (pure noise) to $t = 0$ (clean video).
- Denoising Step: In each step, a neural network (the Transformer) predicts the noise present in the current frame so it can be subtracted (see the sampling-loop sketch below).
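To make the iterative cost concrete, here is a minimal sampling-loop sketch in PyTorch. It assumes a flow-matching-style `model(x, t)` that predicts a velocity field and uses a plain Euler update; text conditioning and the exact scheduler of real video DiTs are omitted, so this is an illustration rather than any specific model's sampler.

```python
import torch

@torch.no_grad()
def euler_sample(model, shape, num_steps=50, device="cuda"):
    """Minimal Euler sampling loop (flow-matching style). `model(x, t)` is assumed to
    predict a velocity field; conditioning inputs and the real scheduler are omitted."""
    x = torch.randn(shape, device=device)                        # t = T: pure Gaussian noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        v = model(x, ts[i])                                      # one full Transformer forward pass per step
        x = x + (ts[i + 1] - ts[i]) * v                          # Euler update toward the clean sample
    return x                                                     # t = 0: clean latent / video
```

Each of the 50 steps requires a full forward pass, which is why reducing the per-step cost and the number of expensive steps both matter.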
3.1.2. Transformers & Self-Attention
A Transformer processes data as a sequence of tokens. In video, a "patch" of pixels is turned into one token. The Self-Attention mechanism allows every token to "look at" every other token to understand spatial and temporal relationships.
The standard calculation for Self-Attention is:
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\intercal}}{\sqrt{d_k}}\right)V $
- $Q$ (Query): What a token is looking for.
- $K$ (Key): What a token contains.
- $V$ (Value): The information to be extracted.
- $d_k$: The dimension of the keys, used for scaling.
- Complexity: Because every Query must be compared to every Key, $n$ tokens require $O(n^2)$ operations (see the sketch below).
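A textbook dense-attention sketch makes the quadratic cost visible: the score matrix has one entry per (query, key) pair, so compute and memory both grow with the square of the token count.

```python
import math
import torch

def dense_attention(Q, K, V):
    """Textbook self-attention. Q, K, V have shape (n, d_k); the (n, n) score matrix
    is what makes both compute and memory quadratic in the token count n."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (n, n): every query vs. every key
    return torch.softmax(scores, dim=-1) @ V             # weighted sum of values

# Doubling n from 1024 to 2048 grows the score matrix from ~1M to ~4M entries per head.
```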
3.1.3. Latent Diffusion Models (LDM)
Processing high-resolution video directly in pixel space is too slow. Latent Diffusion uses a Variational Autoencoder (VAE) to compress the video into a smaller, "latent" space. The diffusion process happens in this compressed space, and the VAE decoder converts the result back to pixels.
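To see why the latent space matters, here is back-of-the-envelope token arithmetic, assuming a typical video VAE (4× temporal, 8×8 spatial compression) and a 1×2×2 patchify step; the exact factors vary by model, so the count is illustrative only.

```python
# Rough token arithmetic for a 5 s, 24 fps, 720x1280 clip, assuming a typical video VAE
# (4x temporal, 8x8 spatial compression) and a 1x2x2 patchify step. Factors vary by model;
# the point is the order of magnitude, not exact counts.
frames, H, W = 5 * 24 + 1, 720, 1280
lat_t, lat_h, lat_w = (frames - 1) // 4 + 1, H // 8, W // 8     # 31 x 90 x 160 latent grid
tokens = lat_t * (lat_h // 2) * (lat_w // 2)                    # 31 * 45 * 80 = 111,600 tokens
print(f"{tokens} tokens -> ~{tokens**2 / 1e9:.1f}B attention-score entries per head")
```

Even after compression, a single attention layer still handles on the order of a hundred thousand tokens, which is what Jenga's sparsification targets.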
3.2. Previous Works
- Step Distillation: Methods like Consistency Models or AccVideo try to reduce the number of denoising steps by training a student model to predict the result of multiple steps at once.
- Feature Reuse: Methods like TeaCache skip the computation of certain layers if the input hasn't changed much from the previous step.
- Sparse Attention: Previous works like CLEAR or STA use fixed "windows" (e.g., only attending to nearby pixels). Jenga differentiates itself by being dynamic: it chooses which blocks to attend to based on the actual content of the video at each step.
3.3. Technological Evolution
The field moved from Convolutional Neural Networks (CNNs) to Transformers for better scaling. As Transformers grew, the quadratic cost of self-attention became the primary constraint. Jenga represents the next step: content-aware sparsification, where we only compute what is necessary for the human eye to perceive quality.
4. Methodology
4.1. Principles
The core intuition of Jenga rests on two observations:
- Semantic Layout vs. Detail: In the early steps of a diffusion process, the model is just deciding where objects go (the layout). This does not require high-resolution pixels.
- Redundancy: Once the layout is set, many parts of a video are redundant or only require local information. We don't need every pixel to attend to every other pixel to refine a small texture.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. 3D Space-Filling Curve Reordering
In a video, tokens are usually flattened in a simple row-by-row or frame-by-frame order. This loses the 3D "neighborhood" information. Jenga uses Generalized Hilbert Curves to reorder tokens. The following figure (Figure 3 from the original paper) shows how tokens are reordered and partitioned:
The figure shows the 3D reordering and block partitioning of a 4×4×4 latent: the left part illustrates the reordering via a space-filling curve (SFC), and the right part shows how the importance, condition, and adjacency masks are computed and combined into a one-hot block-level attention mask.
The reordering process is defined by: $ z_{\mathrm{blk}} = \mathcal{G}(z_{\mathrm{thw}}), \quad z_{\mathrm{thw}} = \mathcal{G}^{-1}(z_{\mathrm{blk}}) $
- $z_{\mathrm{thw}}$: Tokens in the original temporal-height-width order.
- $z_{\mathrm{blk}}$: Tokens reordered such that 1D proximity in the sequence corresponds to 3D proximity in the video.
- $\mathcal{G}$: The permutation function defined by the space-filling curve. This allows Jenga to divide the sequence into blocks where each block contains tokens that are physically close to each other in the 3D video (see the sketch below).
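A sketch of the reordering idea. Implementing a generalized Hilbert curve is lengthy, so this stand-in uses a Morton (Z-order) curve, which is a simpler space-filling curve with the same goal: tokens adjacent in the 1D ordering are close in the 3D latent. Function and variable names are illustrative, not the paper's implementation.

```python
import torch

def space_filling_order(T, H, W, bits=10):
    """Stand-in space-filling curve: Morton (Z-order) indices over a T x H x W token grid.
    Jenga uses generalized Hilbert curves; Morton is a simpler curve with the same goal,
    namely that tokens adjacent in the 1D ordering are close in the 3D latent."""
    def spread_bits(v):                                   # interleave one coordinate's bits
        out = torch.zeros_like(v)
        for b in range(bits):
            out |= ((v >> b) & 1) << (3 * b)
        return out
    t, h, w = torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij")
    code = (spread_bits(t.flatten())
            | (spread_bits(h.flatten()) << 1)
            | (spread_bits(w.flatten()) << 2))
    return torch.argsort(code)                            # permutation: thw order -> curve order

# Usage sketch: perm = space_filling_order(T, H, W)
# z_blk = z_thw[perm]                 # reorder flattened tokens of shape (T*H*W, d)
# z_thw = z_blk[torch.argsort(perm)]  # inverse permutation restores the original order
```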
4.2.2. Block-Wise Attention Carving (AttenCarve)
Instead of full $O(n^2)$ attention, Jenga computes a block-wise attention mask. For each block of queries, it only attends to a subset of key-value blocks. The mask is the union of three components:
- Importance Mask: Uses coarse pooling to find relevant blocks: $ \mathbf{R} = \mathrm{softmax}\left(\frac{\hat{Q}\hat{K}^{\intercal}}{\sqrt{d_k}}\right) $ where $\hat{Q}$ and $\hat{K}$ are mean-pooled representations of the blocks. Jenga keeps the Top-K key blocks and ensures their cumulative probability exceeds a threshold.
- Condition Mask: Ensures that all vision tokens can always "see" the text prompt (conditioning) tokens.
- Adjacency Mask: Forces tokens to attend to their immediate 3D neighbors to prevent artifacts at block boundaries.
The final block-level mask is the union of these three masks. By skipping the key-value blocks it excludes, the cost per attention layer drops from quadratic in the token count to roughly linear in the average number of selected tokens per query (a sketch of the importance-mask step follows).
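A minimal sketch of the importance-mask step under assumed hyper-parameter names (`block_size`, `keep_prob`, `min_blocks` are illustrative, not the paper's values): pool queries and keys per block, score block pairs, and keep key blocks for each query block until a cumulative-probability budget is met.

```python
import torch

def block_importance_mask(Q, K, block_size=64, keep_prob=0.8, min_blocks=4):
    """Sketch of the importance-mask idea: mean-pool Q and K per block, score block
    pairs, then for each query block keep key blocks until their cumulative attention
    mass reaches `keep_prob`. Hyper-parameter names and values are illustrative."""
    n, d = Q.shape
    nb = n // block_size                                        # remainder tokens dropped for brevity
    Qb = Q[: nb * block_size].view(nb, block_size, d).mean(1)   # (nb, d) pooled query blocks
    Kb = K[: nb * block_size].view(nb, block_size, d).mean(1)   # (nb, d) pooled key blocks
    R = torch.softmax(Qb @ Kb.T / d ** 0.5, dim=-1)             # (nb, nb) block-level relevance
    vals, idx = R.sort(dim=-1, descending=True)
    keep_sorted = (vals.cumsum(dim=-1) - vals) < keep_prob      # keep until cumulative mass is reached
    keep_sorted[:, :min_blocks] = True                          # always keep a few top blocks
    M_imp = torch.zeros(nb, nb)
    M_imp.scatter_(1, idx, keep_sorted.float())                 # un-sort back to block order
    return M_imp.bool()  # union this with the condition and adjacency masks before attention
```

In the full pipeline this boolean mask is combined with the condition and adjacency masks and fed to a block-sparse attention kernel, so masked-out blocks are never computed at all.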
4.2.3. Progressive Resolution (ProRes)
Jenga divides the denoising timesteps into multiple resolution stages.
- Stage 1: Generate at low resolution (e.g., 540P).
- Transition: At a specific timestep, the model predicts a "clean" latent $\hat{x}_0^s$, upscales it using area interpolation, and adds noise back to continue at high resolution (e.g., 720P). The transition formula is: $ x_{t-1} = (1 - \sigma_t) \times \mathcal{U}(\hat{x}_0^s) + \sigma_t \tilde{\epsilon} $ (see the sketch below).
- $\mathcal{U}$: Upsample function.
- $\sigma_t$: Standard deviation of the noise at time $t$.
- $\tilde{\epsilon}$: Random Gaussian noise.
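A sketch of the stage transition following the formula above; the interpolation mode and the 540P-to-720P scale factor are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def resolution_transition(x0_hat, sigma_t, scale=720 / 540):
    """Sketch of the ProRes stage switch: upsample the predicted clean latent and
    re-noise it so denoising can resume at the higher resolution. The interpolation
    mode and the scale factor are illustrative assumptions."""
    B, C, T, H, W = x0_hat.shape                                   # low-resolution latent
    up = F.interpolate(x0_hat, size=(T, round(H * scale), round(W * scale)),
                       mode="trilinear", align_corners=False)      # U(x_0^s)
    noise = torch.randn_like(up)                                   # fresh Gaussian noise
    return (1 - sigma_t) * up + sigma_t * noise                    # (1 - sigma_t) * U(x0_hat) + sigma_t * eps
```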
4.2.4. Text-Attention Amplifier
When generating at low resolutions, the model often focuses too much on local pixels, making the "camera" appear zoomed-in. Jenga fixes this by adding a bias $\beta$ to the attention score between vision queries and text keys: $ \mathrm{Score} = q_v k_c^{\intercal} + \beta $, with $ \beta = -\rho \log\left(\frac{\mathrm{numel}(R_s)}{\mathrm{numel}(R_S)}\right) $
- $\rho$: A balancing factor (set to 0.5).
- $\mathrm{numel}(R_s)$: Number of tokens in the current stage; $\mathrm{numel}(R_S)$ is the corresponding count at the final full-resolution stage. This bias "hypnotizes" the model to pay more attention to the global text description, preserving the intended Field of View (FOV). A small sketch of the bias follows.
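A direct transcription of the bias formula; the token counts in the usage line are illustrative.

```python
import math

def text_attention_bias(tokens_current_stage, tokens_final_stage, rho=0.5):
    """beta = -rho * log(numel(R_s) / numel(R_S)). At low-resolution stages the ratio
    is below 1, so beta is positive and vision-to-text attention scores are boosted."""
    return -rho * math.log(tokens_current_stage / tokens_final_stage)

# Illustrative usage: the bias is added only to the text-key columns of the score
# matrix, i.e. scores_vision_to_text = q_v @ k_c.T + beta, before the softmax.
beta = text_attention_bias(tokens_current_stage=50_000, tokens_final_stage=110_000)
```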
The following figure (Figure 2 from the original paper) provides a high-level overview of these integrated components:
The figure illustrates how Jenga works: the left part explains Attention Carving, where the 3D video latent is partitioned into local blocks and a sparse selection mask is produced via block-level attention; the right part shows the Progressive Resolution strategy, which gradually raises the resolution of the generated video by adjusting the sampling timesteps.
5. Experimental Setup
5.1. Datasets
- VBench: A comprehensive benchmark for video generation that evaluates 16 dimensions, including object consistency, motion smoothness, and text-video alignment.
- VBench-I2V: A specialized version for Image-to-Video generation.
- Inter4K: A high-quality 4K video dataset used to calculate the FVD (structural fidelity).
- Sora Prompts: The authors used the publicly available prompts from OpenAI's Sora to test the model's ability to handle complex, descriptive text.
5.2. Evaluation Metrics
5.2.1. CLIPScore
- Conceptual Definition: Measures how well the generated video frames match the text prompt using a shared embedding space (CLIP).
- Mathematical Formula: $ \mathrm{CLIPScore} = \mathbb{E}[\cos(\mathbf{v}, \mathbf{t})] $
- Symbol Explanation: $\mathbf{v}$ is the visual embedding of a frame, $\mathbf{t}$ is the text embedding of the prompt, and $\cos(\cdot, \cdot)$ is the cosine similarity.
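A hedged sketch of frame-level CLIPScore using Hugging Face `transformers`; the checkpoint name is a placeholder, since the exact CLIP variant used for evaluation is not specified here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

@torch.no_grad()
def clip_score(frames, prompt, model_name="openai/clip-vit-large-patch14"):
    """Mean cosine similarity between each frame embedding and the prompt embedding.
    `frames` is a list of PIL images; `model_name` is an illustrative placeholder."""
    model = CLIPModel.from_pretrained(model_name).eval()
    proc = CLIPProcessor.from_pretrained(model_name)
    inputs = proc(text=[prompt], images=frames, return_tensors="pt",
                  padding=True, truncation=True)
    v = model.get_image_features(pixel_values=inputs["pixel_values"])
    t = model.get_text_features(input_ids=inputs["input_ids"],
                                attention_mask=inputs["attention_mask"])
    v = v / v.norm(dim=-1, keepdim=True)
    t = t / t.norm(dim=-1, keepdim=True)
    return (v @ t.T).mean().item()                       # E[cos(v, t)] over the sampled frames
```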
5.2.2. VBench Score
- Conceptual Definition: A weighted average of multiple quality (visual) and semantic (meaning) sub-metrics.
- Formula: Standardized benchmark average (0-100%).
5.2.3. Fréchet Video Distance (FVD)
- Conceptual Definition: Measures the "distance" between the distribution of real videos and generated videos. Lower is better.
- Mathematical Formula: $ \mathrm{FVD} = |\mu_r - \mu_g|^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}) $
- Symbol Explanation: $\mu$ and $\Sigma$ are the mean and covariance of features extracted by a pre-trained video recognition model for real ($r$) and generated ($g$) data.
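A sketch of the Fréchet distance computation itself; feature extraction with a pre-trained video network (typically I3D for FVD) is omitted, and inputs are assumed to be NumPy arrays of shape (num_videos, feature_dim).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between two feature sets (rows = videos). For FVD the features
    are usually taken from a pre-trained I3D video network; extraction is omitted here."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))
```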
5.3. Baselines
The authors compared Jenga against:
- Full Model: The original HunyuanVideo or Wan2.1.
- Attention Optimizers: MInference (LLM-based sparse attention), CLEAR, and SVG.
- Pipeline Optimizers: TeaCache (skipping steps based on feature similarity).
6. Results & Analysis
6.1. Core Results Analysis
Jenga demonstrates a "Pareto improvement": it is significantly faster with negligible quality loss. In many cases, the Semantic Score in VBench actually increased (e.g., from 72.84% for the full HunyuanVideo to 77.58% for Jenga-Flash in Table 1), suggesting that Attention Carving helps the model focus on the prompt rather than being distracted by redundant spatial noise.
The following are the results from Table 1 of the original paper (quality evaluation follows [35, 61]):
| Methods | Steps (NFE) | PFLOPs↓ | PFLOPs / step↓ | VBench↑ | VBench-Q↑ | VBench-S↑ | CLIP-score↑ | DiT time↓ | Speedup↑ |
|---|---|---|---|---|---|---|---|---|---|
| HunyuanVideo [12] | 50 | 534.44 | 10.68 | 82.74% | 85.21% | 72.84% | 30.67 | 1625s | 1.00× |
| CLEAR (r=32) [21] | 50 | 479.97 | 9.60 | 82.68% | 86.06% | 69.17% | 30.43 | 1848s | 0.89× |
| MInference [36] | 50 | 187.79 | 3.76 | 83.36% | 85.41% | 75.16% | 30.73 | 815s | 1.99× |
| SVG [22] | 50 | 243.36 | 4.86 | 83.11% | 85.87% | 72.07% | 30.63 | 988s | 1.64× |
| AttenCarve (Ours) | 50 | 163.04 | 3.26 | 83.42% | 85.31% | 75.85% | 30.60 | 748s | 2.17× |
| TeaCache-fast [31] | 23 | 245.84 | 10.68 | 82.39% | 85.51% | 69.91% | 30.39 | 703s | 2.31× |
| ProRes-timeskip | 24 | 162.29 | 6.76 | 82.57% | 85.78% | 69.73% | 30.13 | 495s | 3.28× |
| Jenga-Base | 23 | 75.49 | 3.28 | 83.34% | 85.19% | 75.92% | 30.59 | 347s | 4.68× |
| Jenga-Turbo | 24 | 47.77 | 1.99 | 83.07% | 84.47% | 77.48% | 30.78 | 225s | 7.22× |
| Jenga-Flash | 24 | 32.97 | 1.37 | 82.73% | 84.01% | 77.58% | 30.77 | 184s | 8.83× |
6.2. Ablation Studies
- Adjacency Mask: Removing the Adjacency Mask caused visible grid artifacts because tokens at the edges of blocks weren't communicating.
- SFC vs. Linear: Standard linear ordering (row-by-row) led to "shifting" artifacts. Using the Space-Filling Curve (Hilbert) solved this by preserving 3D locality.
- Number of Stages: The 2-stage setting (low-res to high-res) was the sweet spot. A 3-stage version was even faster but showed slight quality drops.
7. Conclusion & Reflections
7.1. Conclusion Summary
Jenga effectively breaks the computational barriers of high-resolution video generation. By combining Block-Wise Attention Carving (which reduces complexity per layer) and Progressive Resolution (which reduces complexity per step), it achieves nearly an order-of-magnitude speedup. Its training-free nature makes it a vital tool for immediate deployment across various DiT architectures.
7.2. Limitations & Future Work
- Latent Alignment: Resizing in the latent space (rather than pixel space) can occasionally cause boundary artifacts or flickering, especially in very long videos.
- Static Curves: The Hilbert curve is pre-calculated. Future versions could use "semantic" curves that adapt to where the action is in the video.
- VAE Bottleneck: While the DiT part is now very fast, the VAE decoder (which converts latents to pixels) remains a constant cost. Future work could involve sparsifying the VAE process itself.
7.3. Personal Insights & Critique
Jenga is a brilliant example of "coarse-to-fine" engineering. The realization that attention density can be traded for resolution over time mirrors how the human visual system works—we focus on general shapes first and then fill in the gaps.
One potential issue is the sensitivity to the text prompt. The authors noted that "enhanced prompts" (more detailed descriptions) help eliminate artifacts in the 3-stage setting. This suggests the model relies more on text guidance when the visual information is sparse. A critical area for improvement would be making the Text-Attention Amplifier bias learnable or adaptive rather than a fixed logarithmic function, which could further stabilize the Field of View across diverse video styles.