Taming Transformers for High-Resolution Image Synthesis
TL;DR Summary
This study combines the inductive bias of CNNs with the expressivity of Transformers to synthesize high-resolution images. It first learns a context-rich vocabulary of image constituents with a CNN, then models their composition with a Transformer, yielding the first semantically guided synthesis of megapixel images with transformers and state-of-the-art results among autoregressive models on class-conditional ImageNet.
Abstract
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
In-depth Reading
1. Bibliographic Information
1.1. Title
Taming Transformers for High-Resolution Image Synthesis
The title clearly states the paper's primary objective: to make Transformer architectures, which are notoriously computationally expensive, viable for generating high-resolution images. "Taming" suggests controlling or adapting the powerful but unwieldy nature of Transformers for this specific task.
1.2. Authors
- Patrick Esser: At the time of publication, a researcher at the Heidelberg Collaboratory for Image Processing, Heidelberg University. He is one of the key authors of VQGAN and, later, Stable Diffusion.
- Robin Rombach: Also a researcher at the Heidelberg Collaboratory for Image Processing, Heidelberg University. He is a lead author on VQGAN and Stable Diffusion, indicating a focused research trajectory on efficient, high-quality generative models.
- Björn Ommer: Professor and head of the Computer Vision & Learning Group at Ludwig Maximilian University of Munich (at the time of this work, at Heidelberg University). He is a leading figure in computer vision and generative modeling, and his group has produced several influential works, including Stable Diffusion.
The authors are from the same research group, which is known for its significant contributions to generative modeling.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server for academic papers. This means it was shared with the research community before or during a formal peer-review process for a conference or journal. Although arXiv itself is not a peer-reviewed venue, this paper was presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, which is one of the top-tier, most prestigious conferences in the field of computer vision. Publication at CVPR signifies a high level of novelty, technical soundness, and impact.
1.4. Publication Year
The initial version was submitted to arXiv in December 2020. It was officially published in the CVPR 2021 proceedings.
1.5. Abstract
The abstract summarizes the core problem and solution. Transformers are highly expressive for learning long-range interactions but are computationally infeasible for high-resolution images due to their quadratic complexity. The authors propose a two-stage approach to solve this:
- Use a Convolutional Neural Network (CNN) to learn a "context-rich vocabulary" (a discrete codebook) of image parts.
- Use a Transformer to efficiently model the composition of these parts to form a high-resolution image.
This method is applicable to various conditional synthesis tasks (using class labels or segmentation maps) and achieves state-of-the-art results among autoregressive models on class-conditional ImageNet, including the first transformer-based synthesis of megapixel images.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2012.09841
- PDF Link: https://arxiv.org/pdf/2012.09841v3.pdf
- Publication Status: Initially released as an arXiv preprint; officially published at CVPR 2021.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the scalability of Transformers for high-resolution image synthesis.
- Transformers' Strength and Weakness: Transformers, with their self-attention mechanism, are exceptionally good at modeling long-range dependencies. This is crucial for creating globally coherent images. However, self-attention's computational and memory cost is quadratic in the input sequence length ($O(n^2)$). For an image, the sequence length is the number of pixels ($n = H \cdot W$), making Transformers prohibitively expensive for resolutions beyond small thumbnails (e.g., $64 \times 64$).
- CNNs' Strength and Weakness: In contrast, Convolutional Neural Networks (CNNs) have a strong inductive bias towards local interactions. They are designed with the prior knowledge that pixels are strongly correlated with their neighbors, which makes them highly efficient for processing images. However, this focus on locality makes it challenging for CNNs to capture the long-range, global relationships necessary for synthesizing complex, realistic scenes.
- The Gap: Prior work either used Transformers on raw pixels, which is limited to low-resolution images, or used CNN-based autoregressive models, which struggled with global coherence. There was a clear need for a method that combines the global expressivity of Transformers with the local efficiency of CNNs.
The paper's innovative idea is to not force the Transformer to work on the pixel level. Instead, it proposes a two-stage process where a CNN first compresses a high-resolution image into a much shorter sequence of discrete "visual words," and then the Transformer's task is simplified to arranging these powerful, context-rich words into a coherent final image.
2.2. Main Contributions / Findings
The paper makes several key contributions:
- A Hybrid CNN-Transformer Architecture for Image Synthesis: The primary contribution is a novel two-stage framework that effectively combines a CNN and a Transformer.
  - Stage 1 (VQGAN): A CNN-based Vector Quantized Generative Adversarial Network (VQGAN) learns a discrete codebook of perceptual image "constituents" or "parts." This model acts as a powerful image tokenizer, converting a high-resolution image into a much smaller grid of codebook indices.
  - Stage 2 (Transformer): An autoregressive Transformer is then trained on these sequences of indices to learn the high-level compositional structure of images. By operating on this compressed latent space, the Transformer can model global relationships efficiently.
- Enabling Megapixel Image Synthesis with Transformers: By drastically reducing the sequence length the Transformer needs to process, the approach makes it possible to generate images at megapixel resolutions for the first time with a Transformer-based architecture. This is achieved via a sliding-window mechanism during sampling.
- A Unified Framework for Conditional Synthesis: The approach is highly versatile. It handles various conditional image synthesis tasks (e.g., semantic-to-image, pose-to-image, class-conditional) within the same framework without task-specific architectural changes.
- State-of-the-Art Performance: The paper demonstrates state-of-the-art results on several benchmarks. Notably, it outperforms previous autoregressive models such as VQ-VAE-2 on class-conditional ImageNet synthesis, achieving superior FID and Inception Scores.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Transformers: A neural network architecture introduced in "Attention Is All You Need" by Vaswani et al. (2017). Unlike CNNs or RNNs, it relies almost entirely on a mechanism called self-attention, which allows every element in an input sequence to interact with every other element by computing weighted importance scores. This makes Transformers extremely powerful for capturing long-range dependencies, but it also leads to their $O(n^2)$ complexity, where $n$ is the sequence length.
- Convolutional Neural Networks (CNNs): A class of neural networks particularly effective for image processing. Their core building block is the convolution operation, which applies a small filter (kernel) across the image. This gives CNNs two key inductive biases:
  - Locality: Neurons are only connected to a small, local region of the input, assuming that nearby pixels are most relevant to each other.
  - Translation Invariance/Equivariance: The same filter is used across the entire image, so a feature (e.g., an edge) can be detected regardless of its position. These biases make CNNs very efficient for image tasks.
- Generative Models: Algorithms that learn the underlying distribution of a dataset and can generate new, synthetic data samples that resemble the original data. Key types include:
  - Generative Adversarial Networks (GANs): Two competing networks, a Generator that creates fake data and a Discriminator that tries to distinguish fake data from real data. They are known for producing sharp, realistic images but can be unstable to train.
  - Variational Autoencoders (VAEs): Learn a compressed, continuous latent representation of the data. They are stable to train and allow for encoding/decoding, but often produce blurrier images than GANs.
  - Autoregressive Models: Generate data sequentially, one element at a time, conditioning each new element on all previously generated ones. For an image, this means generating it pixel by pixel. They produce high-likelihood results but are very slow to sample from.
- Vector Quantization (VQ): The process of mapping vectors from a large, continuous space to a finite number of "code" vectors in a smaller, discrete space (a codebook). In deep learning, this is used to create discrete latent representations, which can be easier to model than continuous ones.
3.2. Previous Works
- The Transformer (Vaswani et al., 2017): The original paper that introduced the Transformer architecture and the self-attention mechanism. While focused on language, its principles are foundational to this work. The core attention formula is: $ \operatorname{Attn}(Q, K, V) = \operatorname{softmax}\Big( \frac{Q K^{T}}{\sqrt{d_k}} \Big) V $
  - Conceptual Definition: For each element in a sequence, attention computes how much it should "pay attention to" all other elements (including itself). A minimal sketch follows after this list.
  - Symbol Explanation:
    - $Q$ (Query): A matrix representing the current elements looking for information.
    - $K$ (Key): A matrix representing all elements in the sequence that can provide information.
    - $V$ (Value): A matrix representing the actual content of the elements.
    - $d_k$: The dimension of the keys and queries, used for scaling to stabilize gradients.
  The dot product $Q K^{T}$ computes a similarity score between every query and every key. The softmax turns these scores into weights, which are then used to form a weighted sum of the values $V$.
- Vector Quantised Variational Autoencoder (VQ-VAE) (van den Oord et al., 2017): Introduced a method for learning discrete representations with an autoencoder. It consists of an encoder that maps an image to a grid of continuous vectors, a vector quantization step that replaces each vector with its nearest neighbor from a learned codebook, and a decoder that reconstructs the image from the quantized grid. This paper's VQGAN is a direct evolution of the VQ-VAE.
- PixelCNN / PixelSNAIL (van den Oord et al., 2016; Chen et al., 2018): Powerful autoregressive models for images that use CNNs. They generate images pixel by pixel, with each pixel's value predicted from a local context of previously generated pixels. The paper compares its Transformer-based approach to PixelSNAIL to show that Transformers are more expressive even in the latent space.
- ImageGPT (Chen et al., 2020): A key predecessor that applied a Transformer directly to sequences of pixels. To make this feasible, it either used very low-resolution images (e.g., $32 \times 32$ or $64 \times 64$) or a shallow VQ-VAE for slight compression. The current paper argues that using a powerful, context-rich VQ-VAE (their VQGAN) is critical for success at high resolutions, distinguishing it from ImageGPT's approach.
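To make the attention formula above concrete, the following is a minimal NumPy sketch of scaled dot-product attention (an illustration only, not the authors' implementation); the toy sequence length and dimensions are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) similarity between every query and key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of the values

n, d_k = 5, 16                                # toy sequence length and key dimension
Q, K, V = (np.random.randn(n, d_k) for _ in range(3))
out = attention(Q, K, V)                      # (n, d_k); the (n, n) score matrix is the O(n^2) cost
```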
3.3. Technological Evolution
The field of generative image modeling has seen a progression:
- Early VAEs/GANs: Showed promise but struggled with quality and stability.
- Advanced GANs (e.g., StyleGAN): Achieved photorealism but are often task-specific and lack a direct way to encode an existing image.
- Autoregressive CNNs (e.g., PixelCNN): Provided strong likelihood-based models but were limited by the local receptive field of convolutions, struggling with global coherence.
- Autoregressive Transformers (e.g., ImageGPT): Introduced the expressivity of attention to image generation but were crippled by quadratic complexity, limiting them to low resolutions.
- This Paper (VQGAN + Transformer): Represents a pivotal step. It keeps the Transformer's expressivity but "tames" its complexity by making it operate on a highly compressed, perceptually rich latent space created by a powerful CNN. This hybrid approach unlocked high-resolution synthesis for Transformers and heavily influenced subsequent models like DALL-E and Stable Diffusion.
3.4. Differentiation Analysis
The core innovation of this paper compared to previous work is the synergy of the two stages:
- vs. ImageGPT: ImageGPT either worked on pixels (infeasible for high-res) or used a shallow VQ-VAE with a small receptive field to maintain spatial invariance. This paper argues the opposite: the first stage should be a powerful compression model that captures as much context as possible into each code. This is what makes the subsequent Transformer's job manageable.
- vs. VQ-VAE-2: VQ-VAE-2 also used a two-stage approach with a VQ-VAE and a PixelCNN. The key differences are:
  - Stage 1 Model: This paper uses a VQGAN, incorporating an adversarial loss and a perceptual loss. This yields a codebook that preserves high-fidelity perceptual detail even at very high compression rates, something a standard VQ-VAE trained with an L2 loss struggles with.
  - Stage 2 Model: This paper uses a Transformer instead of a PixelCNN to model the latent codes. As the experiments show, the Transformer is more effective at capturing long-range relationships between the codes than the convolutional PixelSNAIL.

The overall philosophy is to delegate tasks based on architectural strengths: let the efficient CNN handle the low-level, local statistics and texture, and let the expressive Transformer handle the high-level, global composition and structure.
4. Methodology
The paper's method is a two-stage process. First, a VQGAN is trained to convert images into a compressed, discrete representation. Second, a Transformer is trained to model the distribution of these representations.
The overall approach is illustrated in Figure 2 from the paper.
Figure 2 (schematic): The VQGAN + Transformer approach to high-resolution image synthesis. A codebook stores context-rich visual parts; a CNN encoder and decoder map between images and these parts, while the Transformer models how the parts compose. The figure also annotates the generator/discriminator workflow used to judge whether a reconstructed image looks real.
4.1. Stage 1: Learning an Effective Codebook of Image Constituents
The goal of this stage is to learn a mapping from a high-resolution image $x \in \mathbb{R}^{H \times W \times 3}$ to a much smaller grid of discrete codes $z_{\mathbf{q}} \in \mathbb{R}^{h \times w \times n_z}$, where $h \ll H$ and $w \ll W$. This is achieved with a model the authors call a VQGAN.
4.1.1. Core VQ-VAE Framework
The VQGAN builds upon the VQ-VAE framework, which consists of three main components:
- Encoder ($E$): A standard CNN that takes an image $x$ and downsamples it to produce a continuous latent representation $\hat{z} = E(x) \in \mathbb{R}^{h \times w \times n_z}$.
- Codebook ($\mathcal{Z}$): A learned set of embedding vectors $\mathcal{Z} = \{z_k\}_{k=1}^{K} \subset \mathbb{R}^{n_z}$, where each $z_k$ is a "visual word."
- Decoder ($G$): A CNN that takes the quantized latent representation $z_{\mathbf{q}}$ and upsamples it to reconstruct the image $\hat{x}$.

The crucial step is the quantization process. For each spatial vector $\hat{z}_{ij}$ in the encoder's output $\hat{z}$, the model finds the closest codebook entry and replaces $\hat{z}_{ij}$ with it. This is done via an element-wise argmin operation: $ z_{\mathbf{q}} = \mathbf{q}(\hat{z}) := \left( \underset{z_k \in \mathcal{Z}}{\arg\min} \lVert \hat{z}_{ij} - z_k \rVert \right)_{ij} \in \mathbb{R}^{h \times w \times n_z} $ The final reconstruction is then $\hat{x} = G(z_{\mathbf{q}})$.
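To illustrate the quantization step, here is a hedged PyTorch sketch of the nearest-neighbour lookup; the codebook size, code dimension, and grid size are placeholder values rather than the paper's exact configuration, and the codebook is random instead of learned.

```python
import torch

K, n_z, h, w = 1024, 256, 16, 16                 # illustrative sizes (assumptions)
codebook = torch.randn(K, n_z)                   # Z = {z_k}; learned jointly with E and G in practice
z_hat = torch.randn(h, w, n_z)                   # encoder output E(x)

flat = z_hat.reshape(-1, n_z)                    # (h*w, n_z)
dists = torch.cdist(flat, codebook)              # pairwise distances ||z_hat_ij - z_k||
indices = dists.argmin(dim=1)                    # element-wise argmin over the codebook
z_q = codebook[indices].reshape(h, w, n_z)       # quantized latents passed to the decoder G
```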
4.1.2. VQGAN Training Objective
A standard VQ-VAE is trained with a reconstruction loss and terms to align the encoder output with the codebook. However, this often leads to blurry reconstructions, especially at high compression rates. To create a "perceptually rich" codebook, the authors introduce two key modifications: a perceptual loss and an adversarial loss.
The base VQ loss function is: $ \mathcal{L}_{\mathrm{VQ}}(E, G, \mathcal{Z}) = \lVert x - \hat{x} \rVert^2 + \lVert \mathbf{sg}[E(x)] - z_{\mathbf{q}} \rVert_2^2 + \lVert \mathbf{sg}[z_{\mathbf{q}}] - E(x) \rVert_2^2 $
- Symbol Explanation:
  - $\lVert x - \hat{x} \rVert^2$: The reconstruction loss, encouraging the decoded image $\hat{x}$ to be close to the original $x$.
  - $\lVert \mathbf{sg}[E(x)] - z_{\mathbf{q}} \rVert_2^2$: The codebook loss. The sg (stop-gradient) operator means gradients do not flow back into the encoder; this term pushes the selected codebook vectors toward the encoder outputs, effectively learning the codebook.
  - $\lVert \mathbf{sg}[z_{\mathbf{q}}] - E(x) \rVert_2^2$: The commitment loss. This term encourages the encoder's output to stay "committed" to the chosen codebook vector, preventing it from fluctuating too much.
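A minimal PyTorch sketch of these three terms, with the stop-gradient realised via `.detach()`; the tensor shapes are toy values, and the commitment weight `beta` (often 0.25 in the VQ-VAE literature) is an assumption, since the formula above weights the terms equally.

```python
import torch
import torch.nn.functional as F

def vq_loss(x, x_rec, z_hat, z_q, beta=1.0):
    """||x - x_hat||^2 + ||sg[E(x)] - z_q||^2 + beta * ||sg[z_q] - E(x)||^2."""
    rec = F.mse_loss(x_rec, x)                   # reconstruction term
    codebook = F.mse_loss(z_q, z_hat.detach())   # pulls the chosen codes toward the encoder output
    commit = F.mse_loss(z_hat, z_q.detach())     # keeps the encoder committed to its chosen code
    return rec + codebook + beta * commit

x = torch.randn(2, 3, 64, 64); x_rec = x + 0.1 * torch.randn_like(x)
z_hat = torch.randn(2, 16, 16, 256); z_q = z_hat + 0.05 * torch.randn_like(z_hat)
print(vq_loss(x, x_rec, z_hat, z_q))
```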
To improve this, the authors make the following changes:
- The simple pixel-wise reconstruction loss is replaced with a perceptual loss, which measures distance in a deep feature space (e.g., VGGNet features), better aligning with human perception.
- A patch-based adversarial loss is added. A discriminator network $D$ is trained to distinguish real images $x$ from reconstructed images $\hat{x}$, while the VQGAN's generator (encoder and decoder) is trained to fool it. The GAN loss for the generator and discriminator is: $ \mathcal{L}_{\mathrm{GAN}}(\{E, G, \mathcal{Z}\}, D) = [\log D(x) + \log(1 - D(\hat{x}))] $
The complete objective for finding the optimal VQGAN model is a minimax game: $ \mathcal{Q}^* = \underset{E, G, \mathcal{Z}}{\arg\min}\, \underset{D}{\arg\max}\, \mathbb{E}_{x \sim p(x)} \left[ \mathcal{L}_{\mathrm{VQ}}(E, G, \mathcal{Z}) + \lambda \, \mathcal{L}_{\mathrm{GAN}}(\{E, G, \mathcal{Z}\}, D) \right] $

- Symbol Explanation:
  - $\lambda$: An adaptive weight that balances the VQ (reconstruction) and GAN losses.

The adaptive weight is calculated dynamically to prevent one loss from overpowering the other: $ \lambda = \frac{\nabla_{G_L}[\mathcal{L}_{\mathrm{rec}}]}{\nabla_{G_L}[\mathcal{L}_{\mathrm{GAN}}] + \delta} $
- Symbol Explanation:
  - $\mathcal{L}_{\mathrm{rec}}$: The perceptual reconstruction loss.
  - $\nabla_{G_L}[\cdot]$: The gradient of the respective loss with respect to the last layer $G_L$ of the decoder.
  - $\delta$: A small constant for numerical stability (e.g., $\delta = 10^{-6}$).
- Intuition: This formula balances the two losses by looking at how strongly each one wants to update the last layer of the decoder. If the GAN loss produces much larger gradients, $\lambda$ decreases to scale it down, and vice versa. A small sketch of this computation follows at the end of this subsection.
This adversarial training forces the decoder to produce sharp, realistic details, which in turn forces the codebook to capture perceptually meaningful information.
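The adaptive weight $\lambda$ above can be sketched with PyTorch autograd as follows; `last_layer` stands in for the decoder's final layer weight $G_L$, and all names are assumptions rather than the released code.

```python
import torch

def adaptive_lambda(rec_loss, gan_loss, last_layer, delta=1e-6):
    # gradient norms of both losses w.r.t. the decoder's last layer G_L
    g_rec = torch.autograd.grad(rec_loss, last_layer, retain_graph=True)[0]
    g_gan = torch.autograd.grad(gan_loss, last_layer, retain_graph=True)[0]
    return (g_rec.norm() / (g_gan.norm() + delta)).detach()  # treated as a constant in the update

# toy usage: a one-layer "decoder" standing in for G, just to exercise the formula
dec = torch.nn.Linear(8, 8)
out = dec(torch.randn(4, 8))
rec_loss = (out - torch.randn(4, 8)).pow(2).mean()           # stand-in reconstruction/perceptual loss
gan_loss = -out.mean()                                       # stand-in generator adversarial loss
print(adaptive_lambda(rec_loss, gan_loss, dec.weight))
```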
4.2. Stage 2: Learning the Composition of Images with Transformers
Once the VQGAN is trained, it can be used to convert any image $x$ into a sequence $s$ of discrete integer indices. This is done by taking the quantized latent grid $z_{\mathbf{q}}$ and replacing each code vector with its index $k$ in the codebook $\mathcal{Z}$: $ s_{ij} = k \text{ such that } (z_{\mathbf{q}})_{ij} = z_k $ This 2D grid of indices is then flattened into a 1D sequence (e.g., in raster-scan order). The Transformer's job is to learn the probability distribution $p(s)$ over these sequences.
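For concreteness, a tiny sketch (with assumed toy sizes) of mapping between the quantized grid, the index sequence $s$, and the code vectors used for decoding:

```python
import torch

K, n_z, h, w = 1024, 256, 16, 16               # illustrative sizes (assumptions)
codebook = torch.randn(K, n_z)                 # Z = {z_k}
indices = torch.randint(0, K, (h, w))          # s_ij = k such that (z_q)_ij = z_k
s = indices.flatten()                          # raster-scan sequence of length h*w = 256
z_q = codebook[s].reshape(h, w, n_z)           # look the indices back up for the decoder G
```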
4.2.1. Autoregressive Modeling
The Transformer models the joint probability of the sequence as a product of conditional probabilities, in an autoregressive fashion: $ p(s) = \prod_i p(s_i \mid s_{<i}) $
- Symbol Explanation:
  - $s_i$: The index at the $i$-th position in the sequence.
  - $s_{<i}$: All indices that came before position $i$.

The model is trained by maximizing the log-likelihood of the data, which is equivalent to minimizing the negative log-likelihood (cross-entropy loss): $ \mathcal{L}_{\mathrm{Transformer}} = \mathbb{E}_{x \sim p(x)} [-\log p(s)] $ During training, the Transformer is given the ground-truth sequence and learns to predict the next token at each position. During inference, it generates a new sequence one index at a time, sampling from the distribution it predicts at each step and feeding the sampled index back as input for the next step.
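The next-token objective and sampling loop can be sketched as below for a generic PyTorch model mapping a token prefix to logits over the codebook; the stand-in model and shapes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def train_step(model, s):                        # s: (batch, L) ground-truth index sequences
    logits = model(s[:, :-1])                    # teacher forcing: predict token i from tokens < i
    return F.cross_entropy(                      # cross-entropy = negative log-likelihood
        logits.reshape(-1, logits.size(-1)), s[:, 1:].reshape(-1))

@torch.no_grad()
def sample(model, start, length):
    s = start                                    # (batch, t0), e.g. a start/conditioning token
    for _ in range(length):
        logits = model(s)[:, -1]                 # distribution over the next code index
        nxt = torch.multinomial(logits.softmax(-1), 1)
        s = torch.cat([s, nxt], dim=1)           # feed the sample back in
    return s

toy = lambda t: torch.randn(t.size(0), t.size(1), 1024)   # stand-in for a trained Transformer
print(train_step(toy, torch.randint(0, 1024, (2, 256))).item())
print(sample(toy, torch.zeros(1, 1, dtype=torch.long), 8).shape)
```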
4.2.2. Conditioned Synthesis
The framework is easily extended to conditional image synthesis, where an input $c$ (e.g., a class label or a segmentation map) guides the generation. The goal is to learn the conditional distribution: $ p(s \mid c) = \prod_i p(s_i \mid s_{<i}, c) $
- Implementation:
- Non-spatial conditioning (e.g., class label): The class label is converted to an embedding and prepended to the sequence as a starting token.
- Spatial conditioning (e.g., segmentation map): The segmentation map is also tokenized with another VQGAN, producing an index sequence $r$. This sequence is simply prepended to the image sequence $s$, and the Transformer learns to predict $s$ given $r$; the loss is computed only on the $s$ part of the sequence.
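A minimal sketch of the spatial-conditioning case, assuming a generic next-token `model` and placeholder sequences `r` (condition tokens) and `s` (image tokens); the loss is masked so that only the image part of the joint sequence is scored.

```python
import torch
import torch.nn.functional as F

def conditional_loss(model, r, s):               # r: (B, Lc) condition tokens, s: (B, Ls) image tokens
    tokens = torch.cat([r, s], dim=1)            # the Transformer sees the concatenation [r, s]
    logits = model(tokens[:, :-1])               # next-token prediction over the joint sequence
    logits_s = logits[:, r.size(1) - 1:]         # keep only the predictions for image positions
    return F.cross_entropy(
        logits_s.reshape(-1, logits_s.size(-1)), s.reshape(-1))
```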
4.2.3. Generating High-Resolution Images
A standard Transformer can only handle a fixed sequence length (e.g., $16 \times 16 = 256$ tokens). To generate larger images, the paper uses a sliding-window approach during inference, as shown in Figure 3.
Figure 3 (schematic): Illustration of the sliding attention window. At each step, the tokens inside the red box enter the computation, and the arrows indicate how the window moves across the latent grid, combining the ideas of CNNs and Transformers.
- Process:
  - The Transformer generates an initial patch of tokens (e.g., a $16 \times 16$ block of codes).
  - To generate the next part of the image, the attention window slides over, using nearby previously generated tokens as context for the next set of tokens.
  - This process is repeated, sliding across the image until the full latent grid is generated.
  - Finally, the complete grid of generated indices is passed to the VQGAN's decoder to produce the final high-resolution image.
This works because the VQGAN codes are "context-rich," meaning each code already contains significant information about a larger patch of the original image, providing sufficient context for coherent generation.
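A rough sketch of sliding-window sampling over a latent grid larger than the Transformer's context; it assumes H, W ≥ 16, a generic `model` over code prefixes, and a stand-in start token, and it glosses over the exact window placement used in the released implementation.

```python
import torch

@torch.no_grad()
def sample_sliding(model, H, W, win=16):
    """Generate an H x W grid of code indices with a model whose context is at most win*win codes."""
    grid = torch.zeros(H, W, dtype=torch.long)
    sos = torch.zeros(1, dtype=torch.long)            # stand-in start token (assumption)
    for i in range(H):
        for j in range(W):
            # place a win x win window containing (i, j), maximising already-generated context
            top = min(max(0, i - win + 1), H - win)
            left = min(max(0, j - win + 1), W - win)
            window = grid[top:top + win, left:left + win].flatten()
            pos = (i - top) * win + (j - left)        # raster position of (i, j) inside the window
            ctx = torch.cat([sos, window[:pos]]).unsqueeze(0)
            logits = model(ctx)[:, -1]                # distribution over the next code index
            grid[i, j] = int(torch.multinomial(logits.softmax(-1), 1))
    return grid                                       # decode to pixels with the VQGAN afterwards
```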
5. Experimental Setup
5.1. Datasets
The authors use a wide range of datasets to demonstrate the versatility of their approach:
- ImageNet (IN): A large-scale dataset with 1.2 million training images across 1,000 object categories. Used for class-conditional synthesis, unconditional synthesis, and other conditional tasks.
- Restricted ImageNet (RIN): A subset of ImageNet containing only animal classes.
- LSUN Churches and Towers (LSUN-CT): A dataset of specific scene categories, used for unconditional synthesis.
- FacesHQ: A combination of two high-quality face datasets, CelebA-HQ and FFHQ, used for unconditional face generation.
- ADE20K & COCO-Stuff: Large-scale scene-parsing datasets with semantic segmentation masks, used for semantically guided synthesis.
- S-FLCKR: A custom dataset of landscape images collected from Flickr, also used with semantic layouts.
- DeepFashion: A dataset of clothing images with corresponding human pose keypoints, used for pose-guided synthesis.
- CIFAR-10: A small-scale dataset of $32 \times 32$ images used for quantitative comparison.
The diversity of these datasets (objects, scenes, faces, people) effectively validates the claim that the proposed model is a general-purpose mechanism for image synthesis.
5.2. Evaluation Metrics
- Negative Log-Likelihood (NLL):
  - Conceptual Definition: For autoregressive models, NLL measures how well the model predicts the ground-truth sequence of tokens. A lower NLL indicates that the model assigns higher probability to the true data, suggesting it has learned the data distribution better.
  - Mathematical Formula: For a sequence $s$ of length $L$, $ \text{NLL}(s) = -\log p(s) = -\sum_{i=1}^{L} \log p(s_i \mid s_{<i}) $
  - Symbol Explanation:
    - $p(s_i \mid s_{<i})$: The probability assigned by the model to the true token at position $i$, given all previous true tokens.
- Fréchet Inception Distance (FID):
  - Conceptual Definition: FID measures the similarity between two sets of images, typically real and generated images. It captures both the quality (fidelity) and the diversity of the generated samples; a lower FID indicates that the distribution of generated images is closer to the distribution of real images. A sketch of the computation follows after this list.
  - Mathematical Formula: $ \text{FID}(x, g) = \lVert \mu_x - \mu_g \rVert_2^2 + \text{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}) $
  - Symbol Explanation:
    - $\mu_x, \mu_g$: The means of the Inception-v3 feature vectors for the real ($x$) and generated ($g$) images, respectively.
    - $\Sigma_x, \Sigma_g$: The covariance matrices of the feature vectors for the real and generated images.
    - $\text{Tr}(\cdot)$: The trace of a matrix (sum of diagonal elements).
- Inception Score (IS):
  - Conceptual Definition: IS evaluates generative models along two axes: the quality of individual images (each should be clearly identifiable as a specific object) and their diversity (the model should generate a wide variety of classes). A higher IS is better.
  - Mathematical Formula: $ \text{IS}(G) = \exp\big(\mathbb{E}_{x \sim p_g} [D_{\text{KL}}(p(y|x) \parallel p(y))]\big) $
  - Symbol Explanation:
    - $x$: An image sampled from the generator.
    - $p(y|x)$: The conditional class distribution predicted by a pre-trained Inception model for image $x$. For a high-quality image, this distribution should have low entropy (be sharp).
    - $p(y)$: The marginal class distribution, averaged over all generated images. For high diversity, this distribution should have high entropy (be close to uniform).
    - $D_{\text{KL}}$: The Kullback-Leibler (KL) divergence, which measures the distance between the two distributions.
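As referenced under FID above, a sketch of computing the score from two sets of pre-extracted Inception features with NumPy/SciPy; the feature extraction itself (Inception-v3 pool features, typically 2048-dimensional) is assumed to have happened elsewhere, and the toy dimensions below are illustrative.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    mu_x, mu_g = feats_real.mean(0), feats_gen.mean(0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)   # matrix square root (Sigma_x Sigma_g)^(1/2)
    covmean = covmean.real                                     # discard tiny imaginary parts
    return float(((mu_x - mu_g) ** 2).sum()
                 + np.trace(sigma_x + sigma_g - 2 * covmean))

feats_real = np.random.randn(500, 64)                          # toy 64-d features for illustration
feats_gen = np.random.randn(500, 64) + 0.1
print(fid(feats_real, feats_gen))
```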
5.3. Baselines
The paper compares its method against a variety of strong baselines from different classes of generative models:
- Autoregressive Models: VQVAE-2, PixelSNAIL, DCTransformer. These represent the state of the art in autoregressive image generation at the time.
- GANs: BigGAN, StyleGAN2, SPADE, Pix2PixHD. These are top-performing GANs known for high-fidelity image synthesis.
- VAEs and Flow-based models: GLOW, NVAE, VDVAE. These are prominent likelihood-based models.
- Diffusion Models: IDDPM, ADM. These represent the emerging class of diffusion models, which were becoming competitive around the same time.

This comprehensive set of baselines allows for a thorough evaluation of the proposed method's position within the broader landscape of generative models.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Attention vs. Convolution in Latent Space (Sec 4.1)
The authors first validate their choice of a Transformer over a convolutional model (PixelSNAIL) for modeling the latent codes.
The following are the results from Table 1 of the original paper:
| Data / # params | Transformer NLL (same steps as PixelSNAIL) | Transformer NLL (same time as PixelSNAIL) | PixelSNAIL NLL (fixed time) |
|---|---|---|---|
| RIN /85M | 4.78 | 4.84 | 4.96 |
| LSUN-CT /310M | 4.63 | 4.69 | 4.89 |
| IN / 310M | 4.78 | 4.83 | 4.96 |
| D-RIN / 180 M | 4.70 | 4.78 | 4.88 |
| S-FLCKR / 310 M | 4.49 | 4.57 | 4.64 |
Analysis:
- Across all datasets and model sizes, the Transformer consistently achieves a lower Negative Log-Likelihood (NLL) than PixelSNAIL.
- This holds true even when comparing models trained for the same amount of time ("Transformer P-SNAIL time" vs. "PixelSNAIL fixed time"), despite PixelSNAIL training about twice as fast. When trained for the same number of steps, the Transformer's advantage is even larger.
- This result is significant because it demonstrates that the expressivity advantage of Transformers carries over to the compressed latent space. The global receptive field of the attention mechanism is better at modeling the compositional relationships between the VQGAN codes than the local receptive field of a CNN.
6.1.2. Versatility and High-Resolution Synthesis (Sec 4.2)
The paper showcases the model's ability to handle numerous tasks and generate megapixel images.
- Qualitative Results: Figures 1, 4, 5, and 6 show compelling results across a wide array of tasks: unconditional generation, depth-to-image, semantic synthesis, pose-guided synthesis, super-resolution, and class-conditional synthesis. The samples are globally coherent and contain fine-grained, realistic textures.
- Megapixel Synthesis: Figures 1 and 5 demonstrate successful synthesis of images at megapixel resolutions, a feat previously out of reach for Transformer-based models. This is a direct result of the two-stage approach and the sliding-window generation.
The following are the results from Table 2 of the original paper:
| Dataset | Ours | SPADE [53] | Pix2PixHD (+aug) [75] | CRN [9] |
|---|---|---|---|---|
| COCO-Stuff | 22.4 | 22.6 / 23.9 (*) | 111.5 (54.2) | 70.4 |
| ADE20K | 35.5 | 33.9 / 35.7 (*) | 81.8 (41.5) | 73.3 |
Analysis:
- For semantic image synthesis, the proposed model achieves FID scores that are competitive with or better than state-of-the-art specialized GANs like SPADE.
- This is impressive because the proposed model is a general, likelihood-based framework, not a task-specific adversarial one. It demonstrates that the approach does not sacrifice quality for generality.
6.1.3. Class-Conditional Synthesis on ImageNet (Sec 4.4)
This is one of the paper's flagship results, comparing their model against the SOTA on the challenging ImageNet benchmark.
The following are the results from Table 4 of the original paper:
| Model | acceptance rate | FID | IS |
|---|---|---|---|
| mixed k, p = 1.0 | 1.0 | 17.04 | 70.6 ± 1.8 |
| k = 250, p = 1.0 | 1.0 | 15.98 | 78.6 ± 1.1 |
| mixed k, p = 1.0 | 0.05 | 5.88 | 304.8 ± 3.6 |
| mixed k, p = 1.0 | 0.005 | 6.59 | 402.7 ± 2.9 |
| --- Baselines --- | | | |
| VQVAE-2 [61] | 1.0 | ∼31 | ∼45 |
| VQVAE-2 | n/a | ∼10 | ∼330 |
| BigGAN [4] | 1.0 | 7.53 | 168.6 ± 2.5 |
| ADM-G, no guid. [15] | 1.0 | 10.94 | 100.98 |
| ADM-G, 1.0 guid. | 1.0 | 4.59 | 186.7 |
Analysis:
- Outperforms Previous Autoregressive SOTA: Without any sampling tricks (acceptance rate 1.0), the model achieves an FID of ~16-17, which is significantly better than VQVAE-2's FID of ~31. This confirms the superiority of the VQGAN + Transformer combination.
- Effect of Rejection Sampling: Following VQVAE-2, the authors apply classifier-based rejection sampling, where only the top-scoring samples (according to a pre-trained classifier) are kept. With a low acceptance rate (e.g., 5%), the FID drops dramatically to 5.88, surpassing the strong BigGAN baseline (FID 7.53). The Inception Score also rises sharply, indicating higher quality and diversity. A sketch of this procedure follows below.
- Comparison to Diffusion Models: The model is competitive with the emerging diffusion models (ADM), especially when comparing the unguided versions. While guided diffusion models like ADM ultimately achieve better scores, this paper's results were SOTA among autoregressive models and highly competitive overall at the time of publication.
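A hedged sketch of the classifier-based rejection sampling described above: keep only the fraction of samples that a pretrained classifier scores highest for the target class. `classifier` and `samples` are placeholders, and the 5% acceptance rate mirrors the table.

```python
import torch

@torch.no_grad()
def reject(samples, labels, classifier, acceptance_rate=0.05):
    probs = classifier(samples).softmax(-1)                    # (N, num_classes)
    scores = probs[torch.arange(len(labels)), labels]          # p(target class | image) per sample
    k = max(1, int(acceptance_rate * len(scores)))
    keep = scores.topk(k).indices                              # indices of the top-scoring samples
    return samples[keep], labels[keep]
```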
6.2. Ablation Studies / Parameter Analysis
6.2.1. Importance of Context-Rich Vocabularies (Sec 4.3)
This is the most critical ablation study in the paper. The authors investigate how the compression factor of the VQGAN affects the final synthesis quality. A larger means the latent grid is smaller, so each code must represent a larger patch of the image, thus being more "context-rich."
The results are shown qualitatively in Figure 7 for the FacesHQ dataset.

Analysis:
- f = 1 (Pixel-level): When the Transformer operates on a representation equivalent to pixels (a k-means clustering of RGB values), it fails completely to capture any coherent global structure; the image is a mess of textures.
- Intermediate compression factors: As the compression factor increases, the model starts to capture the overall structure of a face. However, clear inconsistencies remain, such as a "half-bearded face" or conflicting viewpoints in different parts of the image; the Transformer struggles to stitch these less-contextual parts together coherently.
- f = 16 (High Compression): At the highest compression factor, the model produces high-fidelity, globally consistent faces. This is the key finding: giving the Transformer a shorter sequence of more powerful, context-rich tokens is far more effective than giving it a longer sequence of low-level tokens. This validates the core hypothesis of the paper.
6.2.2. VQGAN Quality Analysis (Sec 4.4)
The authors quantify the quality of their VQGAN's reconstructions, as this sets the upper bound on the quality the Transformer can achieve.
The following are the results from Table 5 of the original paper:
| Model | Latent Grid | Codebook Size | FID/val | FID/train |
|---|---|---|---|---|
| VQVAE-2 | 64×64 & 32×32 | 512 | n/a | ∼ 10 |
| DALL-E [59] | 32×32 | 8192 | 32.01 | 33.88 |
| VQGAN | 16×16 | 1024 | 7.94 | 10.54 |
| VQGAN | 16×16 | 16384 | 4.98 | 7.41 |
| VQGAN | 64×64 & 32×32 | 512 | 1.45 | 2.78 |
Analysis:
- The standard VQGAN (16×16 grid, codebook size 1024) achieves a reconstruction FID of 7.94 on the validation set, significantly better than the DALL-E dVAE's reconstruction FID of 32.01, despite using a much higher compression rate (sequence length 256 vs. 1024).
- This demonstrates the effectiveness of the adversarial and perceptual losses in the VQGAN, which allow it to create high-fidelity reconstructions even when compressing the image heavily. A better reconstruction FID in the first stage provides a stronger foundation for the second-stage Transformer.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully "tames" Transformers for high-resolution image synthesis by proposing a two-stage framework that leverages the complementary strengths of CNNs and Transformers. By using a VQGAN to learn a perceptually rich, compressed codebook of visual constituents, the method transforms the problem of image generation from a complex pixel-level task into a more manageable high-level composition task. The Transformer can then efficiently model the long-range relationships within this compact latent space. This approach enables the generation of megapixel images, provides a unified framework for a multitude of conditional synthesis tasks, and achieves state-of-the-art results, particularly on the challenging class-conditional ImageNet benchmark.
7.2. Limitations & Future Work
While the paper is groundbreaking, it has some inherent limitations based on its architecture:
- Slow Sampling Speed: The second stage is an autoregressive Transformer. Generating an image requires sequential sampling, one token at a time, which is orders of magnitude slower than single-pass models like GANs. This makes the model less practical for real-time applications.
- Two-Stage Training Complexity: The method requires training two separate, complex deep learning models (a VQGAN and a Transformer). This can be cumbersome and error-prone, and the final quality is highly dependent on the quality of the first stage; an end-to-end trained model might be more elegant.
- Error Compounding: Any artifacts or errors introduced by the VQGAN in the first stage are permanent and cannot be corrected by the Transformer. The final output quality is fundamentally capped by the VQGAN's reconstruction fidelity.
The authors suggest that their general mechanism for conditional synthesis opens up opportunities for novel neural rendering approaches, which has since been a very active area of research.
7.3. Personal Insights & Critique
This paper is a landmark in the history of generative models. Its core insight—decoupling perception from synthesis—has been profoundly influential.
- A Paradigm Shift: The idea of using a powerful tokenizer (VQGAN) to create a "visual vocabulary" and then using a separate model (Transformer) to learn the "grammar" of how these words combine has become a dominant paradigm. The same principle underpins many subsequent large-scale models, including OpenAI's DALL-E and, most notably, Stable Diffusion, which evolved directly from this work by replacing the autoregressive Transformer with a more efficient diffusion model in the latent space.
- The Power of Inductive Bias: The work is a powerful demonstration of the value of combining learned, expressive architectures (Transformers) with well-designed inductive biases (from CNNs). Instead of forcing a general-purpose model to re-learn fundamental properties of the natural world from scratch (like local correlations in images), this approach injects that knowledge efficiently through a specialized component, freeing the more powerful model to focus on the higher-level reasoning it excels at.
- Critique: While revolutionary, the autoregressive nature of the second stage was a significant bottleneck. The field quickly moved towards parallel decoding or alternative synthesis models such as diffusion models, which offer a better trade-off between quality and sampling speed. While the specific Transformer-based synthesis stage has been superseded, the foundational concept of a two-stage, latent-space generative model pioneered by this work remains highly relevant and forms the backbone of many current state-of-the-art systems.