- A Novel Autoregressive Formulation: The paper proposes a new way to structure video generation that avoids vector quantization entirely. Instead of predicting discrete tokens from a codebook, the model predicts continuous-valued vectors (see the sketch below).
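  The core difference from VQ-based autoregression can be illustrated with a small PyTorch sketch: instead of a softmax over a finite codebook, a lightweight head regresses noise added to continuous latent tokens. The module, dimensions, and noising schedule here are illustrative stand-ins, not NOVA's actual implementation.

  ```python
  # Minimal sketch: predict continuous latent vectors directly instead of
  # classifying over a discrete VQ codebook. The per-token denoising loss is a
  # simplistic stand-in for a diffusion-style objective on continuous tokens.
  import torch
  import torch.nn as nn

  class ContinuousTokenHead(nn.Module):
      """Maps autoregressive context features to a prediction over a continuous token."""
      def __init__(self, hidden_dim: int, token_dim: int):
          super().__init__()
          # Small MLP that denoises a noised continuous token, conditioned on the context.
          self.net = nn.Sequential(
              nn.Linear(hidden_dim + token_dim, hidden_dim),
              nn.SiLU(),
              nn.Linear(hidden_dim, token_dim),
          )

      def loss(self, context: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
          # context: (B, N, hidden_dim) autoregressive features
          # target:  (B, N, token_dim) continuous latents from a non-quantized VAE
          noise = torch.randn_like(target)
          t = torch.rand(target.shape[:-1], device=target.device).unsqueeze(-1)
          noised = (1 - t) * target + t * noise            # simple linear noising
          pred_noise = self.net(torch.cat([context, noised], dim=-1))
          return nn.functional.mse_loss(pred_noise, noise)  # regress noise, no codebook

  # Usage: the discrete alternative would instead compute
  # cross_entropy(logits, codebook_indices) over a finite vocabulary.
  head = ContinuousTokenHead(hidden_dim=768, token_dim=16)
  ctx = torch.randn(2, 64, 768)      # features from the causal transformer
  latents = torch.randn(2, 64, 16)   # continuous tokens, no quantization step
  print(head.loss(ctx, latents).item())
  ```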
- Hybrid Prediction Scheme: A key innovation is the dual-mode prediction, sketched below the sub-items:
  - Temporal Frame-by-Frame Prediction: The model generates video frames one after another in a strictly causal sequence, preserving the temporal order needed to model motion.
  - Spatial Set-by-Set Prediction: Within each frame, the model predicts sets of visual tokens in parallel using a bidirectional (non-causal) mechanism, which is far more efficient than traditional pixel-by-pixel or token-by-token raster scanning.
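  A minimal sketch of the two attention patterns this implies, assuming each frame is flattened into S spatial tokens: a causal mask at frame granularity for time, plus random token sets predicted in parallel within a frame. Function names and the set schedule are illustrative, not NOVA's actual layers.

  ```python
  # Hybrid scheme sketch for T frames of S tokens each (illustrative only).
  import torch

  def temporal_causal_mask(T: int, S: int) -> torch.Tensor:
      """Token i may attend to token j only if j's frame is not in the future."""
      frame_id = torch.arange(T).repeat_interleave(S)          # (T*S,)
      return frame_id.unsqueeze(0) <= frame_id.unsqueeze(1)    # (T*S, T*S) bool

  def spatial_set_schedule(S: int, num_sets: int) -> list:
      """Split a frame's S token positions into random sets predicted in parallel,
      with earlier sets serving as bidirectional context for later ones."""
      perm = torch.randperm(S)
      return list(torch.chunk(perm, num_sets))

  # Usage
  T, S = 4, 16
  mask = temporal_causal_mask(T, S)
  print(mask.shape, mask[0, -1].item())   # first-frame query cannot see the last frame
  for step, token_set in enumerate(spatial_set_schedule(S, num_sets=4)):
      # within a frame, all positions in `token_set` are predicted at once,
      # attending bidirectionally to that frame's already-generated tokens
      print(f"step {step}: predict positions {token_set.tolist()}")
  ```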
- NOVA (NOn-quantized Video Autoregression): The authors introduce NOVA, a 0.6B-parameter model built on this formulation. It achieves state-of-the-art results for its size and is notably efficient to train; for example, it reaches top performance on image generation benchmarks with only 127 GPU days of training.
- Superior Performance and Efficiency: NOVA surpasses previous AR video models in quality, speed, and data efficiency, and it competes favorably with much larger, state-of-the-art diffusion models on both text-to-image and text-to-video tasks.
- Unified, Zero-Shot Capabilities: Because generation is autoregressive, a single trained NOVA model can perform various tasks without modification, including text-to-image, text-to-video, image-to-video, and video extrapolation (generating videos longer than those seen during training); see the sketch below.
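  Why one model covers all these tasks can be sketched as a single frame-by-frame generation loop whose only task-dependent input is the prefix of given frames; `generate_next_frame` here is a hypothetical interface, not NOVA's actual API.

  ```python
  # Sketch: every task is just a different prefix to the same generation loop.
  # `model.generate_next_frame` is a hypothetical interface used for illustration.
  from typing import List, Optional
  import torch

  def generate(model, text_emb: torch.Tensor,
               context_frames: Optional[List[torch.Tensor]] = None,
               num_new_frames: int = 1) -> List[torch.Tensor]:
      frames = list(context_frames or [])      # empty for text-to-image / text-to-video
      for _ in range(num_new_frames):
          # each step conditions on the text and on all previously provided or
          # generated frames, so the same loop serves every task
          frames.append(model.generate_next_frame(text_emb, frames))
      return frames

  # text-to-image:       generate(model, text_emb, None, num_new_frames=1)
  # text-to-video:       generate(model, text_emb, None, num_new_frames=16)
  # image-to-video:      generate(model, text_emb, [given_frame], num_new_frames=15)
  # video extrapolation: generate(model, text_emb, given_frames, num_new_frames=K)
  ```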
Image 1: A diagram illustrating the versatile capabilities of the NOVA model. It can accept various combinations of text, images, and video as input to perform multiple generation tasks, such as text-to-image, text-to-video, and video-to-video, all within a single unified framework.