AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
TL;DR Summary
AR-Diffusion merges auto-regressive and diffusion models for flexible asynchronous video generation. It mitigates train-inference discrepancies via diffusion and ensures temporal coherence with a non-decreasing timestep constraint and causal attention, enabling variable-length video generation.
Abstract
The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, which achieves competitive and state-of-the-art results across four challenging benchmarks.
English Analysis
1. Bibliographic Information
- Title: AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
- Authors: Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, Siyu Zhou, Qian He, Jing Liu.
- Affiliations: The authors are affiliated with the Institute of Automation, Chinese Academy of Sciences (IA, CAS), University of Chinese Academy of Sciences (UCAS), and Bytedance Inc.
- Journal/Conference: The paper is available on arXiv, an open-access repository for electronic preprints. This indicates it has not yet undergone formal peer review for a conference or journal.
- Publication Year: The paper was submitted to arXiv in March 2025, as indicated by its identifier.
- Abstract: The paper addresses the challenge of generating visually realistic and temporally coherent videos. It identifies key weaknesses in existing methods: asynchronous auto-regressive models suffer from error accumulation due to train-inference discrepancies, while synchronous diffusion models are constrained by fixed sequence lengths. To resolve this, the authors propose AR-Diffusion, a novel model that merges the strengths of both approaches. AR-Diffusion uses a diffusion process to reduce the train-inference gap and incorporates an auto-regressive-inspired non-decreasing timestep constraint, ensuring earlier frames are less noisy than later ones. This, combined with temporal causal attention, allows for flexible-length video generation. The authors also introduce two specialized schedulers: the FoPP scheduler for balanced training and the AD scheduler for flexible inference. Experiments show that AR-Diffusion achieves state-of-the-art results on four video generation benchmarks.
- Original Source Link:
- arXiv Page: https://arxiv.org/abs/2503.07418v1
- PDF Link: http://arxiv.org/pdf/2503.07418v1
- Publication Status: Preprint.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: High-quality video generation is difficult, requiring both per-frame realism and smooth, logical transitions between frames (temporal coherence).
- Gaps in Prior Work: The field is dominated by two paradigms, each with significant drawbacks.
- Synchronous Diffusion Models: These models apply the same amount of noise to every frame in a video clip. While this ensures consistency, it forces the model to be trained and generate videos of a fixed, rigid length, limiting flexibility.
- Asynchronous Auto-Regressive Models: These models generate frames one by one, which allows for variable-length videos. However, they often suffer from a mismatch between how they are trained (predicting a frame from a real previous frame) and how they are used for inference (predicting from a previously generated frame). This discrepancy leads to an accumulation of errors, degrading quality over long sequences.
- Innovation: AR-Diffusion introduces a hybrid approach that aims to capture the "best of both worlds." It uses a diffusion framework to ensure high-quality, consistent generation while incorporating an auto-regressive structure to allow for flexible, variable-length outputs without the typical error accumulation problem.
- Main Contributions / Findings (What):
- AR-Diffusion Model: A novel architecture that combines a diffusion process with an auto-regressive structure for asynchronous video generation.
- Non-Decreasing Timestep Constraint: The core innovation, where the noise level applied to a frame must be greater than or equal to the noise level of the preceding frame ($t_{i-1} \le t_i$). This constraint provides a middle ground between the overly restrictive "equal timesteps" of synchronous models and the unstable "independent timesteps" of fully asynchronous models, significantly improving training stability.
- Specialized Timestep Schedulers:
    - FoPP (Frame-oriented Probability Propagation) Scheduler: A training-time scheduler designed to handle the non-decreasing constraint, ensuring the model is trained on a balanced and diverse set of noise configurations.
    - AD (Adaptive-Difference) Scheduler: An inference-time scheduler that allows fine-grained control over the generation process, enabling a smooth transition between purely synchronous ($s = 0$) and purely auto-regressive (large $s$, e.g., $s \ge T$) generation styles.
- State-of-the-Art (SOTA) Performance: The proposed method achieves SOTA or competitive results across four challenging video generation benchmarks (FaceForensics, Sky-Timelapse, Taichi-HD, and UCF-101), demonstrating its superior quality and temporal consistency. For example, it improves the FVD score on UCF-101 by 60.1% over the previous SOTA asynchronous model.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Video Generation: The task of using a computational model to synthesize a sequence of images (frames) that are both individually realistic and collectively form a coherent, moving picture.
- Diffusion Models: A class of generative models that work in two stages. First, a forward process gradually adds noise to a real data sample (like an image) over a series of timesteps until it becomes pure noise. Second, a reverse process trains a neural network to denoise the sample step-by-step, starting from pure noise, to generate a new data sample. This process has proven highly effective for generating high-fidelity images and videos.
- Auto-Regressive Models: These models generate data sequentially. For example, to generate a sentence, they generate the first word, then the second word based on the first, then the third based on the first two, and so on. In video, this means generating frame $F$ based on frames $1$ to $F-1$ (see the sketch after this list).
- Latent Space: A compressed, lower-dimensional representation of high-dimensional data (like images or videos). Models often work in this latent space to reduce computational cost and focus on the most important features. A Video Auto-Encoder (VAE) is a type of neural network used to learn this mapping from video to latent space and back.
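To make the train-inference mismatch of auto-regressive generation concrete, here is a minimal, hypothetical sketch (the `model` and tensor shapes are placeholders, not the paper's architecture): during training each frame is predicted from real previous frames, while at inference it is predicted from the model's own outputs, so early errors are fed back as context.

```python
import torch

def teacher_forced_loss(model, video):
    """Training: predict frame f from *ground-truth* frames 0..f-1."""
    loss = 0.0
    for f in range(1, video.shape[0]):            # video: (F, C, H, W)
        pred = model(video[:f])                   # condition on real frames
        loss = loss + ((pred - video[f]) ** 2).mean()
    return loss / (video.shape[0] - 1)

@torch.no_grad()
def rollout(model, first_frame, num_frames):
    """Inference: predict each frame from *previously generated* frames,
    so small early mistakes are reused as context and can accumulate."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        frames.append(model(torch.stack(frames)))
    return torch.stack(frames)
```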
- Previous Works & Technological Evolution: The paper positions AR-Diffusion as the successor to two distinct lines of work in video generation.
- Synchronous Video Generation:
    - Methods: VDM, LVDM, Latte.
    - Technique: These models are typically diffusion-based and apply a single, shared noise timestep to all frames in a video clip ($t_1 = t_2 = \dots = t_F$).
    - Limitation: This "lock-step" approach ensures all frames have the same signal-to-noise ratio, promoting coherence. However, it fundamentally limits the model to generating videos of a fixed length, making it inflexible.
- Asynchronous Video Generation:
    - Auto-Regressive Methods: CogVideo, VideoGPT, TATS. These models generate each frame conditioned on "clean" previous frames. This allows for variable-length videos but suffers from error accumulation, where small mistakes in early frames cascade and amplify in later frames.
    - Asynchronous Diffusion Methods: Diffusion Forcing, FVDM. These models allow each frame to have an independent noise timestep ($t_i$ is independent of $t_j$). This offers maximum flexibility.
    - Limitation: The search space of possible timestep combinations becomes enormous ($T^F$ for $F$ frames and $T$ diffusion steps), leading to training instability and inefficient learning, as many combinations are redundant or never seen during inference.
- Differentiation: AR-Diffusion carves a unique path by blending these approaches. Its key innovation, the non-decreasing timestep constraint ($t_1 \le t_2 \le \dots \le t_F$), is a crucial compromise. As shown in Figure 1, this drastically reduces the search space compared to fully asynchronous models, stabilizing training. Yet it retains enough flexibility to allow for variable-length generation, unlike synchronous models. This structured approach to asynchronicity is the paper's main distinguishing feature.
Figure 1: A comparison diagram of four generation paradigms (synchronous diffusion, asynchronous diffusion, auto-regressive, and AR-Diffusion) in terms of their timestep compositions and key properties. The chart compares their timestep-composition templates, composition-space sizes, train-inference consistency, causal temporal dependency, classifier-free guidance, and support for variable-length video generation/continuation. AR-Diffusion (this paper's method) receives green check marks for all of these key properties, combining the strengths of auto-regressive and diffusion models, particularly in train-inference consistency, causal temporal dependency, and variable-length video generation.
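As a rough illustration of the composition-space comparison in Figure 1 (a back-of-the-envelope count, not a formula from the paper): with $F$ frames and $T$ diffusion steps,

$$
\underbrace{T}_{\text{synchronous: } t_1=\cdots=t_F}
\;\le\;
\underbrace{\binom{T+F-1}{F}}_{\text{non-decreasing: } t_1\le\cdots\le t_F}
\;\le\;
\underbrace{T^{F}}_{\text{fully asynchronous: independent } t_i}
$$

so the non-decreasing constraint preserves far more flexibility than a single shared timestep while discarding the vast majority of fully independent combinations.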
4. Methodology (Core Technology & Implementation)
The AR-Diffusion framework consists of two main stages: first, an AR-VAE compresses the video into a manageable latent space; second, the AR-Diffusion model generates new video latents within that space.
- 3.1. AR-VAE (Auto-Regressive Video Auto-Encoder): The AR-VAE is responsible for learning a compact and efficient representation of video frames.
    - Architecture: It is based on TiTok, a Transformer-based image tokenizer, which the paper adapts for video.
    - Encoder: The Time-agnostic Video Encoder processes each frame independently, converting patches of the frame into a set of "visual tokens" (latent features).
    - Decoder: The Temporal Causal Video Decoder reconstructs the video frames. Its key feature is temporal causality: when reconstructing a frame, its patch tokens can attend to the visual tokens of previous frames but not future ones. This design choice enforces temporal consistency and is crucial for the auto-regressive nature of the model (a sketch of such a mask follows the figure below).
    - Latent Representation: The model produces continuous latent features of shape (number of frames) x (tokens per frame) x (token dimension).
    - The overall architecture is depicted in Figure 2(a).
Figure 2: An architectural overview comparing the AR-Diffusion model (panel b) with the AR-VAE (panel a). AR-Diffusion applies temporally causal diffusion to noisy tokens and denoises them step by step to generate video frames, reflecting the flexibility and temporal consistency of asynchronous video generation; the key is combining the strengths of auto-regressive and diffusion models.
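The temporal causality used by both the AR-VAE decoder and the AR-Diffusion backbone can be pictured as a block-triangular attention mask over tokens grouped by frame. Below is a minimal illustrative sketch (not the authors' implementation):

```python
import torch

def temporal_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask: a token of frame f may attend to every token of frames
    1..f (itself included), but to no token of any later frame."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # mask[q, k] is True where attention is allowed (key frame <= query frame)
    return frame_idx[None, :] <= frame_idx[:, None]

# Example: 3 frames with 2 tokens each -> a 6x6 block lower-triangular mask
print(temporal_causal_mask(3, 2).int())
```

A boolean mask of this form can be passed to standard attention implementations (for example, PyTorch's `scaled_dot_product_attention`, where `True` marks positions allowed to take part in attention).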
- 3.2. AR-Diffusion: This is the core generative model that operates on the latent features produced by the AR-VAE.
    - Backbone: It uses a Transformer architecture with temporal causal attention, similar to the AR-VAE decoder. This ensures that the prediction for a frame only depends on itself and previous frames.
    - Diffusion Theory: The model follows the standard denoising diffusion process. A clean latent feature is corrupted with Gaussian noise over $T$ timesteps. The noisy latent of frame $i$ at its timestep $t_i$ is given by:
      $$
      z_i^{t_i} = \sqrt{\bar{\alpha}_{t_i}}\, z_i^0 + \sqrt{1 - \bar{\alpha}_{t_i}}\, \epsilon
      $$
        - $z_i^0$: The original clean latent feature for frame $i$.
        - $z_i^{t_i}$: The noisy latent feature for frame $i$ at its specific timestep $t_i$.
        - $\epsilon$: Gaussian noise sampled from $\mathcal{N}(0, I)$.
        - $\bar{\alpha}_{t_i}$: A parameter derived from the noise schedule that controls the signal-to-noise ratio at timestep $t_i$.
- Training Objective: The model is trained with an $x_0$-prediction loss, meaning it directly predicts the clean latent $z_i^0$ from the noisy input $z_i^{t_i}$. The authors argue in the appendix that this is superior to predicting the noise ($\epsilon$-prediction) because it forces the model to learn and maintain strong temporal correlations across frames, which is vital in an asynchronous setting.
* **Non-decreasing Timestep Constraint:** This is the central idea. During training and inference, the timesteps assigned to the frames must be non-decreasing: $t_1 \le t_2 \le \dots \le t_F$. This ensures that earlier frames are always clearer (less noisy) than or as clear as subsequent frames. This structure naturally guides the model to generate frames in a temporally coherent order and stabilizes training by constraining the vast space of possible timestep combinations.
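Putting the forward process, the $x_0$-prediction loss, and the per-frame timesteps together, a training step might be sketched as follows (the tensor layout, the `alpha_bar` array, and the `model` signature are assumptions for illustration, not the paper's code):

```python
import torch

def corrupt(z0, t, alpha_bar):
    """Forward diffusion with a per-frame timestep.
    z0: clean latents (F, L, D); t: (F,) non-decreasing timesteps;
    alpha_bar: (T + 1,) cumulative noise schedule with alpha_bar[0] = 1."""
    a = alpha_bar[t].view(-1, 1, 1)                 # \bar{alpha}_{t_i}, one per frame
    noise = torch.randn_like(z0)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * noise   # z_i^{t_i}
    return zt, noise

def x0_prediction_loss(model, z0, t, alpha_bar):
    """x0-prediction objective: the network regresses the clean latents directly."""
    zt, _ = corrupt(z0, t, alpha_bar)
    z0_hat = model(zt, t)                           # temporal-causal Transformer (stand-in)
    return ((z0_hat - z0) ** 2).mean()
```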
* **3.3. FoPP (Frame-oriented Probability Propagation) Timestep Scheduler**
This scheduler is used during training to select a valid timestep composition $\langle t_1, \dots, t_F \rangle$ that adheres to the non-decreasing constraint.
* **Problem:** A naive sampling approach (e.g., sampling $t_1$ uniformly, then $t_2$ from $[t_1, T]$, etc.) leads to a biased distribution. For example, the composition $\langle T, T, \dots, T \rangle$ would be sampled far too frequently. This scheduler is designed to ensure a more balanced and uniform sampling.
* **Procedure (Algorithm 1):**
1. First, uniformly sample a "pivot" frame index $f \in [1, F]$ and a corresponding timestep $t_f \in [1, T]$. This ensures every frame-timestep pair has an equal chance of being selected as an anchor.
2. Then, propagate probabilities outward from this pivot. Using dynamic programming, the scheduler pre-computes the number of valid non-decreasing sequences ending at or starting from any given frame-timestep pair.
3. It then samples the timesteps for subsequent frames ($t_{f+1}, \dots, t_F$) and previous frames ($t_{f-1}, \dots, t_1$) based on these pre-computed path counts, ensuring the final sampled composition is chosen more uniformly from the set of all valid compositions.
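A compact sketch of this sampling procedure is given below (a reading of the description above, not the authors' Algorithm 1; helper names are invented for illustration). A dynamic-programming table counts, for each remaining length and starting timestep, how many non-decreasing sequences exist, and those counts serve as sampling weights when propagating away from the pivot:

```python
import random

def fopp_sample(F: int, T: int) -> list[int]:
    """FoPP-style sampling sketch: uniformly pick a pivot frame f and timestep
    t_f, then grow the composition to the right and left, weighting each
    candidate by the number of valid non-decreasing completions."""
    # paths[k][t] = number of non-decreasing sequences of length k starting at t
    paths = [[0] * (T + 1) for _ in range(F + 1)]
    for t in range(1, T + 1):
        paths[1][t] = 1
    for k in range(2, F + 1):
        suffix = 0
        for t in range(T, 0, -1):                    # suffix sum over t' >= t
            suffix += paths[k - 1][t]
            paths[k][t] = suffix

    def grow(prev_t: int, length: int) -> list[int]:
        """Sample `length` further values, each >= its predecessor, uniformly
        over all valid non-decreasing completions."""
        out = []
        for remaining in range(length, 0, -1):
            cand = list(range(prev_t, T + 1))
            weights = [paths[remaining][t] for t in cand]
            prev_t = random.choices(cand, weights=weights)[0]
            out.append(prev_t)
        return out

    f = random.randint(1, F)                         # pivot frame (1-based)
    t_f = random.randint(1, T)                       # pivot timestep
    right = grow(t_f, F - f)                         # t_{f+1} ... t_F
    # Left side: mirror timesteps so "non-increasing toward frame 1" becomes
    # "non-decreasing", grow, then mirror back and restore frame order.
    left = [T + 1 - t for t in grow(T + 1 - t_f, f - 1)][::-1]
    return left + [t_f] + right

# Example: one composition for an 8-frame clip with 1000 diffusion steps
print(fopp_sample(F=8, T=1000))
```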
* **3.4. AD (Adaptive-Difference) Timestep Scheduler**
This scheduler is used during inference to control the generation process.
* **Purpose:** It provides a flexible way to denoise the video, bridging the gap between synchronous and auto-regressive generation.
* **Mechanism:** It introduces a hyperparameter $s$, the **timestep difference** between adjacent frames. The paper presents the following condition, though the first line appears to contain a typo ($t_i = t_i + 1$):
      $$
      t_i =
      \begin{cases}
      t_i + 1, & \text{if } i = 1 \text{ or } t_{i-1} = 0, \\
      \min(t_{i-1} + s, T), & \text{if } t_{i-1} > 0
      \end{cases}
      $$
* **Interpretation:** The core idea is that the noise level of frame $i$, $t_i$, is kept at a difference of $s$ from the noise level of the previous frame, $t_{i-1}$.
* When $s=0$, all frames are denoised together at the same rate ($t_1 = t_2 = \dots$), mimicking **synchronous diffusion**.
* When $s$ is very large (e.g., $s \ge T$), frame $i$ remains fully noisy until frame `i-1` is almost fully denoised, mimicking **auto-regressive generation**.
* Intermediate values of $s$ allow for a hybrid approach, balancing quality and generation speed.
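Because the printed Equation 5 is garbled, the following is only a sketch of the behaviour described above (synchronous at $s=0$, auto-regressive for large $s$), not the paper's exact rule:

```python
def ad_schedule(F: int, T: int, s: int):
    """Yield one timestep composition <t_1, ..., t_F> per denoising iteration.
    This reconstructs the described AD behaviour, not Eq. 5 verbatim."""
    t = [T] * F                                  # every frame starts as pure noise
    while any(ti > 0 for ti in t):
        for i in range(F):
            if i == 0 or t[i - 1] == 0:
                t[i] = max(t[i] - 1, 0)          # leading frame: step down by one
            else:
                t[i] = min(t[i - 1] + s, T)      # trailing frames: lag behind by s
        yield list(t)

# s = 0 denoises all frames in lock-step (synchronous); s >= T finishes each
# frame before the next one starts moving (auto-regressive).
for comp in ad_schedule(F=4, T=5, s=2):
    print(comp)
```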
5. Experimental Setup
* **Datasets:** The model was evaluated on four standard video generation datasets:
* **FaceForensics:** Videos of human faces, often used for forgery detection.
* **Sky-Timelapse:** Time-lapse videos of skies, focusing on dynamic textures and motion.
* **Taichi-HD:** High-definition videos of people performing Tai Chi, featuring complex human motion.
* **UCF-101:** A challenging dataset of 101 human action categories, known for its diversity and complexity.
* **Evaluation Metrics:**
1. **FID-img (Fréchet Inception Distance - Image):**
* **Conceptual Definition:** Measures the quality and realism of individual generated frames. It calculates the distance between the feature distributions of a set of real images and a set of generated images. The features are extracted from a pre-trained InceptionV3 network. A lower score indicates the generated images are more similar to real images.
* **Mathematical Formula:**
          $$
          \text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)
          $$
        * **Symbol Explanation:**
            * $\mu_r$, $\mu_g$: Mean of the feature vectors for real and generated images, respectively.
            * $\Sigma_r$, $\Sigma_g$: Covariance matrices of the feature vectors for real and generated images.
            * $\text{Tr}(\cdot)$: The trace of a matrix (sum of diagonal elements).
2. **FID-vid (Fréchet Inception Distance - Video):**
    * **Conceptual Definition:** Similar to FID-img, but uses features extracted from a 3D (spatio-temporal) network to evaluate video clips. It assesses both image quality and short-term motion dynamics.
3. **FVD (Fréchet Video Distance):**
    * **Conceptual Definition:** A widely used metric for video generation that evaluates both the visual quality of frames and their temporal coherence. Like FID, it computes the distance between feature distributions of real and generated video clips, using features from a pre-trained I3D network (a video classification model). Lower FVD indicates better video quality. The paper reports FVD_16 and FVD_128 for video clips of 16 and 128 frames, respectively.
    * **Formula and Symbols:** The formula is identical in form to FID, but the means $\mu$ and covariances $\Sigma$ are computed from video features (a small code sketch follows this list).
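For concreteness, the shared Fréchet-distance computation behind FID-img, FID-vid, and FVD can be sketched as follows (illustrative code, not the official evaluation scripts; feature extraction with InceptionV3 or I3D is assumed to happen upstream):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between two sets of feature vectors (rows = samples).
    With InceptionV3 frame features this is FID-img; with features from a
    spatio-temporal network such as I3D, the same formula gives FVD."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)   # matrix square root
    covmean = np.asarray(covmean).real                         # drop tiny imaginary parts
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```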
* **Baselines:** The paper provides a comprehensive comparison against models from all major video generation categories:
    * **Generative Adversarial Models:** MoCoGAN, DIGAN, StyleGAN-V, MoStGAN-V.
    * **Auto-regressive Generative Models:** VideoGPT, TATS.
    * **Synchronous Diffusion Models:** Latte, VDM, LVDM, VIDM.
    * **Asynchronous Diffusion Models:** FVDM, Diffusion Forcing.
6. Results & Analysis
- Core Results: The main quantitative results are presented in Table 1. AR-Diffusion consistently outperforms or is highly competitive with prior SOTA methods across all four datasets.
  (This is a transcribed version of Table 1 from the paper; FVD16 / FVD128 are computed on 16- and 128-frame clips, lower is better.)

  | Method | Taichi-HD [28] FVD16 | FVD128 | Sky-Timelapse [40] FVD16 | FVD128 | FaceForensics [25] FVD16 | FVD128 | UCF-101 [31] FVD16 | FVD128 |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | *Generative Adversarial Models* | | | | | | | | |
  | MoCoGAN [35] | - | - | 206.6 | 575.9 | 124.7 | 257.3 | 2886.9 | 3679.0 |
  | + StyleGAN2 backbone | - | - | 85.9 | 272.8 | 55.6 | 309.3 | 1821.4 | 2311.3 |
  | MoCoGAN-HD [34] | - | - | 164.1 | 878.1 | 111.8 | 653.0 | 1729.6 | 2606.5 |
  | DIGAN [45] | 128.1 | 748.0 | 83.11 | 196.7 | 62.5 | 1824.7 | 1630.2 | 2293.7 |
  | StyleGAN-V [30] | 143.5 | 691.1 | 79.5 | 197.0 | 47.4 | 89.3 | 1431.0 | 1773.4 |
  | MoStGAN-V [27] | - | - | 65.3 | 162.4 | 39.7 | 72.6 | - | - |
  | *Auto-regressive Generative Models* | | | | | | | | |
  | VideoGPT [41] | - | - | 222.7 | - | 185.9 | - | 2880.6 | - |
  | TATS [9] | 94.6 | - | 132.6 | - | - | - | 420 | - |
  | *Synchronous Diffusion Generative Models* | | | | | | | | |
  | Latte [19] | 159.6 | - | 59.8 | - | 34.0 | - | 478.0 | - |
  | VDM [46]† | 540.2 | 55.4 | 125.2 | 355.9 | 343.6 | 648.4 | | |
  | LVDM [11]† | 99.0 | 95.2 | - | 372.9 | 1531.9 | | | |
  | VIDM [20]† | 121.9 | 563.6 | 57.4 | 140.9 | 294.7 | | | |
  | *Asynchronous Diffusion Generative Models* | | | | | | | | |
  | FVDM [17]† | 194.6 | 106.1 | 55.0 | 555.2 | 468.2 | | | |
  | Diffusion Forcing [4] | 202.0 | 738.5 | 251.9 | 895.3 | 175.5 | 99.5 | 274.5 | 836.3 |
  | **AR-Diffusion (ours)** | 66.3 | 376.3 | 40.8 | 175.5 | 71.9 | 265.7 | 186.6 | 572.3 |

  † Row was incompletely transcribed; its values are listed in source order and may not line up with the column headers.

  - Key Findings: AR-Diffusion achieves the best FVD scores on Taichi-HD, Sky-Timelapse, and UCF-101. On UCF-101, its FVD_16 of 186.6 is a significant improvement over all baselines, including the prior best asynchronous model, FVDM (468.2), demonstrating its effectiveness on complex action videos.
- Ablations / Parameter Sensitivity: Table 2 explores the impact of the inference timestep difference s. (This is a transcribed version of Table 2 from the paper.)

  | s | FaceForensics [25] FID-img | FID-vid | FVD | Sky-Timelapse [40] FID-img | FID-vid | FVD | Taichi-HD [28] FID-img | FID-vid | FVD | UCF-101 [31] FID-img | FID-vid | FVD | Inference Time (s) |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | *16-frame Video Generation* | | | | | | | | | | | | | |
  | 0 | 14.0 | 6.9 | 71.9 | 10.0 | 9.2 | 40.8 | 13.8 | 9.2 | 80.9 | 30.3 | 17.6 | 194.4 | 2.4 |
  | 5 | 14.0 | 6.2 | 78.1 | 11.1 | 11.6 | 55.2 | 13.0 | 5.9 | 66.3 | 30.0 | 17.7 | 194.0 | 5.2 |
  | 10 | 13.6 | 6.1 | 84.4 | 10.3 | 11.2 | 57.6 | 12.4 | 5.8 | 70.9 | 30.1 | 18.8 | 212.2 | 7.9 |
  | 15 | 14.3 | 6.6 | 83.5 | 9.4 | 11.4 | 55.6 | 12.2 | 5.8 | 69.4 | 30.0 | 16.3 | 186.6 | 10.8 |
  | 20 | 14.8 | 6.3 | 83.3 | 9.2 | 10.9 | 56.3 | 12.7 | 6.0 | 67.0 | 31.0 | 17.4 | 201.1 | 13.6 |
  | 25 | 14.1 | 6.1 | 79.0 | 9.7 | 10.5 | 48.4 | 12.9 | 6.5 | 75.1 | 29.6 | 17.3 | 191.6 | 16.4 |
  | 50 | 14.2 | 6.1 | 82.8 | 10.1 | 10.7 | 50.6 | 13.1 | 5.9 | 71.7 | 29.5 | 17.1 | 192.6 | 30.5 |
  | *128-frame Video Generation* | | | | | | | | | | | | | |
  | 5 | 14.7 | 9.5 | 265.7 | 12.2 | 25.2 | 185.1 | 8.9 | 10.8 | 376.3 | 32.5 | 24.4 | 592.7 | 42.1 |
  | 10 | 15.2 | 8.9 | 278.1 | 12.1 | 23.9 | 182.6 | 8.8 | 12.2 | 401.9 | 31.5 | 24.9 | 605.3 | 78.0 |
  | 25 | 15.4 | 9.3 | 348.6 | 12.2 | 22.8 | 175.5 | 8.8 | 12.3 | 402.5 | 31.8 | 23.3 | 572.3 | 184.8 |
- Analysis: The optimal value of s varies by dataset. For 16-frame generation, s=0 (synchronous) is best for FaceForensics and Sky-Timelapse, while a small non-zero s (e.g., 5-15) is better for the more complex Taichi-HD and UCF-101 datasets. This shows that a degree of asynchronicity is beneficial for capturing complex motion. Inference time increases linearly with s, highlighting the trade-off between performance and efficiency. For longer 128-frame videos, a small s=5 provides a good balance.

  The paper also includes a crucial ablation study in Table 3. (This is a transcribed version of Table 3 from the paper.)
  | Ablation | FID-img | FID-vid | FVD |
  | --- | --- | --- | --- |
  | AR-Diffusion (full) | 12.2 | 13.4 | 62.8 |
  | - FoPP Timestep Scheduler | 11.0 | 16.8 | 101.0 |
  | - Improved VAE | 13.1 | 29.6 | 148.3 |
  | - Temporal Causal Attention | 15.9 | 50.2 | 209.8 |
  | - x0 Prediction Loss | 27.9 | 58.0 | 257.6 |
  | - Non-decreasing Constraint | 32.2 | 87.9 | 272.5 |

  - Analysis: Removing each component significantly degrades performance (higher FVD). The most critical components are the Non-decreasing Constraint, the x0 Prediction Loss, and Temporal Causal Attention, the removal of which causes FVD to skyrocket. This confirms that every proposed component is essential to the model's success.
- Qualitative Comparison: Figure 4 visually compares generated videos. The results show that AR-Diffusion produces videos with more dynamic and significant motion compared to baselines. For instance, on TaiChi-HD, other methods show minimal movement, while AR-Diffusion generates clear figures performing noticeable actions.
Figure 4: A qualitative comparison between AR-Diffusion and existing video generation methods. The figure is divided into three parts, corresponding to the UCF-101, Sky-Timelapse, and TaiChi-HD datasets. Each part shows several rows of video-frame sequences, visually contrasting how well different models generate visually realistic and temporally coherent videos; AR-Diffusion's results appear at the bottom of each part.
- Training Stability: The paper compares the training loss curves of AR-Diffusion and Diffusion Forcing (a fully asynchronous model). AR-Diffusion's loss curve (Figure 6) is much smoother and more stable than that of Diffusion Forcing (Figure 5), which exhibits large spikes. This provides strong evidence that the non-decreasing constraint successfully stabilizes the training of an asynchronous diffusion model.
Figure 5: The loss curve of Diffusion Forcing [4] on the UCF-101 dataset. The blue line shows the training loss over steps, dropping quickly from a high initial value and then flattening, indicating convergence; the shaded band around the curve reflects the fluctuation range (or standard deviation) of the loss and thus the stability of training.
(Note: The paper's text incorrectly labels the above image as Figure 6. It corresponds to AR-Diffusion's loss curve.)
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces AR-Diffusion, a novel video generation model that effectively synthesizes the strengths of auto-regressive and diffusion-based approaches. By introducing a non-decreasing timestep constraint, it enables flexible, asynchronous video generation while maintaining the training stability and high quality characteristic of synchronous diffusion models. The specialized FoPP and AD schedulers further enhance its training and inference capabilities. The model's state-of-the-art performance on four challenging benchmarks validates it as a significant advancement in the field of video generation.
- Limitations & Future Work: The authors identify one primary limitation: the model is trained exclusively on video data. They suggest that future work could explore incorporating the vast amount of available image data into the training process. This could improve the visual quality and diversity of the generated frames, especially in low-data regimes.
- Personal Insights & Critique:
    - Strengths: The non-decreasing timestep constraint is an elegant and powerful concept. It provides a principled way to manage the complexity of asynchronous generation, striking an excellent balance between flexibility and stability. The AD scheduler is also a highly practical contribution, giving users a direct "knob" to tune the trade-off between generation quality and speed.
    - Critique: The paper contains several minor typos, particularly in the formula for the AD scheduler (Equation 5) and in the labeling of tables and figures, which can cause confusion for the reader. For instance, the first case in Equation 5, $t_i = t_i + 1$, is logically incorrect and likely a transcription error.
    - Potential Impact: AR-Diffusion presents a compelling path forward for video generation. Its hybrid nature allows it to overcome the fundamental limitations of prior models. The core idea of structured asynchronicity could be influential and applicable to other sequential data generation tasks beyond video. The model's ability to generate variable-length videos makes it highly suitable for practical applications where content length is not predetermined.