
CogView: Mastering Text-to-Image Generation via Transformers

Published: 05/27/2021
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, advances general-domain text-to-image generation: the paper stabilizes pretraining, enables multi-task finetuning, and achieves state-of-the-art FID on blurred MS COCO, surpassing GAN models and DALL-E.

Abstract

Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E.


In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: CogView: Mastering Text-to-Image Generation via Transformers
  • Authors: Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang.
  • Affiliations: The authors are affiliated with Tsinghua University, DAMO Academy (Alibaba Group), and the Beijing Academy of Artificial Intelligence (BAAI).
  • Journal/Conference: This paper was published as a preprint on arXiv. arXiv is a repository for electronic preprints of scientific papers, not a peer-reviewed journal. It allows researchers to share their findings quickly. While not formally peer-reviewed at the time of its initial release, the work has been highly influential.
  • Publication Year: 2021
  • Abstract: The paper tackles the challenging problem of general-domain text-to-image generation. The authors propose CogView, a 4-billion-parameter Transformer model that uses a Vector Quantized Variational Autoencoder (VQ-VAE) as an image tokenizer. The paper details strategies for finetuning the model for various tasks like style learning and super-resolution, as well as crucial techniques for stabilizing the pretraining process (e.g., eliminating NaN losses). CogView is shown to achieve state-of-the-art Fréchet Inception Distance (FID) on the blurred MS COCO dataset, outperforming both previous GAN-based models and the contemporary work DALL-E.
  • Original Source Link: https://arxiv.org/abs/2105.13290

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Generating high-quality, diverse images from arbitrary text descriptions (text-to-image synthesis) in a general domain remains a difficult open problem. It requires a model that not only understands language but also can compose complex visual scenes.
    • Prior Work Gaps: Previous dominant methods, based on Generative Adversarial Networks (GANs), performed well on domain-specific datasets (like birds or flowers) but struggled to produce satisfactory results for complex, general-domain scenes like those in the MS COCO dataset.
    • Fresh Angle: Inspired by the success of large-scale Transformers (like GPT) in natural language processing and the VQ-VAE framework for turning images into discrete tokens, the authors propose combining these two ideas. They unify text and image generation under a single auto-regressive framework, treating it as a sequence-to-sequence prediction problem. A key challenge they address is the numerical instability encountered when training such a large model on heterogeneous data (text and images).
  • Main Contributions / Findings (What):

    1. CogView Model: The development and release of CogView, a 4-billion-parameter Transformer for text-to-image generation, which was the first open-source model of its kind and scale.
    2. State-of-the-Art Performance: CogView achieved a new state-of-the-art FID score on the MS COCO benchmark, surpassing previous GAN-based models and the influential DALL-E model.
    3. Training Stabilization Techniques: The paper introduces novel and simple techniques—Precision Bottleneck Relaxation (PB-Relax) and Sandwich LayerNorm (Sandwich-LN)—to overcome the overflow (NaN loss) issues common in large-scale, mixed-precision Transformer training, making it stable.
    4. Versatile Finetuning Strategies: The authors demonstrate that the pretrained CogView model can be effectively finetuned for a variety of downstream tasks, including style learning, super-resolution, image captioning, and even industrial applications like fashion design.
    5. Self-Reranking and a New Metric: Instead of relying on an external model like CLIP for selecting the best generated images, the paper proposes a self-reranking method based on Caption Loss (CapLoss), a metric derived from a finetuned captioning version of CogView itself.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Transformers: A neural network architecture that relies on a mechanism called self-attention to process sequences of data (like words in a sentence). GPT (Generative Pre-trained Transformer) models are a type of Transformer trained to predict the next item in a sequence, making them excellent for generation tasks.
    • Auto-regressive Models: These models generate complex data (like text or images) one piece at a time, where each new piece is conditioned on all the previously generated pieces. For example, to generate a sentence, the model predicts the first word, then the second word based on the first, the third based on the first two, and so on. (A minimal sampling-loop sketch appears at the end of this section.)
    • Variational Autoencoder (VAE): A generative model consisting of an encoder and a decoder. The encoder compresses an input (like an image) into a low-dimensional representation in a latent space, and the decoder reconstructs the input from this latent representation. It's trained to optimize the Evidence Lower Bound (ELBO), balancing reconstruction quality and the regularity of the latent space.
    • Vector Quantized VAE (VQ-VAE): A special type of VAE where the latent space is not continuous but discrete. The encoder's output is mapped to the closest vector in a finite codebook. This "quantization" turns a continuous input like an image into a sequence of discrete integer tokens, making it suitable for processing by a Transformer, which is designed for discrete sequences.
  • Previous Works:

    • GAN-based Models: Early and dominant approaches to text-to-image generation used Generative Adversarial Networks (GANs). Models like AttnGAN, StackGAN, and DM-GAN improved generation by incorporating attention mechanisms or breaking the generation into stages (e.g., sketch-then-refine). However, they often struggled with complex scenes and producing high-fidelity images in general domains.
    • Pixel-level Auto-regressive Models: Models like PixelCNN and ImageGPT generated images pixel by pixel. While powerful, this approach is computationally very expensive for high-resolution images due to the extremely long sequences of pixels.
    • VQ-VAE + Transformer: The idea of using a VQ-VAE to tokenize images and then a Transformer to model the token sequence was a major breakthrough. It significantly reduced the sequence length compared to pixel-level models, making it feasible to train large Transformers on images.
  • Differentiation:

    • vs. GANs: CogView moves away from the adversarial training of GANs to a more stable, likelihood-based auto-regressive framework. This allows it to model the complex data distribution of general-domain images more effectively.
    • vs. DALL-E: DALL-E, a concurrent work from OpenAI, proposed the same core idea. However, CogView differentiates itself in several ways:
      1. Training Stability: It introduces specific, generalizable techniques (PB-Relax, Sandwich-LN) to solve FP16 training instability, whereas DALL-E used a more complex, custom mixed-precision training framework.
      2. Reranking: CogView uses a self-reranking method (CapLoss), avoiding the need for a separate large model like CLIP, which DALL-E used for post-selection.
      3. Finetuning: CogView systematically explores and demonstrates finetuning for various downstream tasks, showcasing its adaptability.
      4. Open Source: CogView was the first of these large-scale models to be open-sourced, facilitating further research.
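
To make the auto-regressive generation described under Foundational Concepts concrete, here is a minimal sketch of next-token sampling. The `logits_fn` callable is a hypothetical stand-in for any trained Transformer's forward pass; the temperature and end-of-sequence handling are illustrative choices, not CogView's actual decoding settings.

```python
from typing import Callable, List

import torch

def sample_autoregressive(logits_fn: Callable[[List[int]], torch.Tensor],
                          prefix: List[int], max_new_tokens: int,
                          eos_id: int, temperature: float = 1.0) -> List[int]:
    """Generate tokens one at a time; each step conditions on everything so far.
    `logits_fn(tokens)` is assumed to return a 1-D tensor of next-token logits."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        logits = logits_fn(tokens) / temperature
        probs = torch.softmax(logits, dim=-1)
        next_id = int(torch.multinomial(probs, num_samples=1))  # sample one token
        tokens.append(next_id)
        if next_id == eos_id:  # stop once an end-of-sequence token appears
            break
    return tokens
```

In CogView's setting, the prefix would be the text tokens plus the separator tokens, and the loop would produce the 1,024 image tokens that the tokenizer's decoder turns back into pixels.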

4. Methodology (Core Technology & Implementation)

The core of CogView is a two-stage process rooted in the theory of Variational Autoencoders (VAEs).

  • Principles: The paper frames text-to-image generation as maximizing the joint log-likelihood of text-image pairs (T, X). The authors derive their objective from the Evidence Lower Bound (ELBO) of this likelihood. The key idea is to use a VQ-VAE to first learn a discrete representation (tokens) for an image and then use a single powerful Transformer (GPT) to model the joint distribution of text tokens and image tokens.

    The learning objective from Equation (2) is simplified to minimizing three loss terms:

    \sum_{i=1}^{N} \bigg( \underbrace{\mathbb{E}_{z_i \sim q(\mathbf{z}|x_i; \phi)}\big[-\log p(x_i|z_i; \psi)\big]}_{\text{reconstruction loss}} \;+\; \underbrace{-\log p(t_i; \theta)}_{\text{NLL loss for text}} \;+\; \underbrace{-\log p(z_i|t_i; \theta)}_{\text{NLL loss for image tokens}} \bigg)

    • The reconstruction loss is minimized in Stage 1 to train the image tokenizer (VQ-VAE).
    • The two negative log-likelihood (NLL) losses for text and image tokens are minimized in Stage 2 by training a single auto-regressive Transformer.
  • Steps & Procedures:

    Stage 1: Image Tokenization

    1. An image tokenizer, which is a discrete autoencoder similar to a VQ-VAE, is trained.

    2. An Encoder maps a 256×256 pixel image into a 32×32 grid of vectors.

    3. Each vector is quantized by finding its nearest neighbor in a learnable codebook of 8192 vectors. This turns the image into a sequence of 32×32 = 1024 integer tokens.

    4. A Decoder is trained to reconstruct the original image from these quantized vectors; because of the lossy compression, the reconstruction is slightly blurry. The paper compares four methods for training this tokenizer (Figure 2) and finds they perform similarly, suggesting the exact training dynamics of the codebook are not critical if initialized properly.


      Stage 2: Auto-regressive Pretraining

    5. Text is tokenized using SentencePiece into a sequence of text tokens.

    6. The text tokens and the 1024 image tokens are concatenated into a single sequence, separated by special tokens: [ROI1] text [BASE] [BOI1] image [EOI1]. ROI stands for "Reference of Image" and BOI/EOI for "Beginning/End of Image".

    7. A 4-billion-parameter GPT-style Transformer is trained on this combined sequence to perform language modeling: predict the next token given all previous tokens. Crucially, both text and image tokens contribute to the loss, which the authors found essential for learning the cross-modal alignment. (A minimal sketch of the quantization, sequence construction, and loss appears after this list.)

      The overall framework is illustrated in Figure 3.

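To make Stage 1 and Stage 2 concrete, here is a minimal PyTorch sketch of the two operations described above: nearest-neighbor codebook quantization of an encoder output grid, and assembly of the combined text-image sequence for next-token training. The tensor shapes, the special-token IDs, and the helper names (`quantize`, `build_sequence`, `lm_loss`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

# --- Stage 1 (sketch): quantize a 32x32 grid of encoder vectors into codebook indices ---
# Assumed shapes: encoder output (B, 32, 32, D); codebook (8192, D).
def quantize(encoder_out: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    B, H, W, D = encoder_out.shape
    flat = encoder_out.reshape(-1, D)            # (B*H*W, D)
    dists = torch.cdist(flat, codebook)          # L2 distance to every code vector
    tokens = dists.argmin(dim=-1)                # nearest-neighbor index per grid cell
    return tokens.reshape(B, H * W)              # (B, 1024) integer image tokens

# --- Stage 2 (sketch): build the combined sequence and compute the LM loss ---
# Hypothetical special-token IDs; the real vocabulary layout is not specified here.
ROI1, BASE, BOI1, EOI1 = 0, 1, 2, 3

def build_sequence(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """[ROI1] text [BASE] [BOI1] image [EOI1], per sample."""
    B = text_tokens.shape[0]
    special = lambda tok: torch.full((B, 1), tok, dtype=torch.long)
    return torch.cat([special(ROI1), text_tokens, special(BASE),
                      special(BOI1), image_tokens, special(EOI1)], dim=1)

def lm_loss(logits: torch.Tensor, seq: torch.Tensor) -> torch.Tensor:
    # Next-token prediction: both text and image positions contribute to the loss.
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           seq[:, 1:].reshape(-1))
```

In practice the encoder, codebook, decoder, and Transformer are all learned; the helpers above only illustrate the data flow from pixels to tokens to a single language-modeling objective.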

  • Stabilization of Training: Training a 4-billion-parameter model with 16-bit precision (FP16) is highly unstable and prone to producing NaN (Not a Number) losses. CogView introduces two key techniques to solve this:

    1. Precision Bottleneck Relaxation (PB-Relax): This addresses overflow in two specific places:

      • LayerNorm: In deep layers, input values can become very large, causing overflow when calculating variance. The fix is to divide the input by its maximum absolute value before the LayerNorm operation, as LayerNorm is scale-invariant.
      • Attention: The attention scores, calculated as Q^T K / √d, can also explode. The paper proposes a numerically stable calculation: \operatorname{softmax}\Big(\big(\tfrac{Q^T}{\alpha \sqrt{d}}K - \max\big(\tfrac{Q^T}{\alpha \sqrt{d}}K\big)\big) \times \alpha\Big). Here, α is a large constant (e.g., 32). This rescales the intermediate values and subtracts the maximum before the exp() in the softmax, preventing overflow; since softmax is invariant to shifting its inputs, the result is unchanged.
    2. Sandwich LayerNorm (Sandwich-LN): Standard Pre-LN Transformers can still suffer from value explosion where outputs from the residual connection are repeatedly added back, amplifying their magnitude layer by layer. Sandwich-LN adds a second LayerNorm to the end of each residual branch before it is added back to the main path. This normalizes the residual output, keeping the values within a stable range throughout the network. Figure 4 illustrates the difference.


    These techniques, along with a minor one called shrink embedding gradient, proved effective at stabilizing training, even in toy experiments designed to provoke instability. (A minimal sketch of PB-Relax attention and Sandwich-LN appears below.)
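
The following is a minimal PyTorch sketch of the two stabilization ideas, written from the descriptions above rather than from the released code: PB-Relax attention rescales the scores by a constant α and subtracts their maximum before the softmax, and Sandwich-LN wraps each sublayer with a LayerNorm on both sides of the residual branch. The default α = 32 follows the text; everything else (names, shapes) is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pb_relax_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                       alpha: float = 32.0) -> torch.Tensor:
    """Numerically safer attention: scale scores down by alpha, subtract the max,
    then multiply alpha back inside the (shift-invariant) softmax."""
    d = q.size(-1)
    scores = (q / (alpha * d ** 0.5)) @ k.transpose(-2, -1)        # Q K^T / (alpha * sqrt(d))
    scores = (scores - scores.amax(dim=-1, keepdim=True)) * alpha  # safe re-scaling
    return F.softmax(scores, dim=-1) @ v

class SandwichBlock(nn.Module):
    """Residual branch with LayerNorm before *and* after the sublayer (Sandwich-LN)."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.ln_in, self.ln_out = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The second LayerNorm keeps the residual output in a bounded range
        # before it is added back to the main path.
        return x + self.ln_out(self.sublayer(self.ln_in(x)))
```

PB-Relax for LayerNorm inputs (dividing by the maximum absolute value before normalization) is omitted here for brevity; a full attention layer would apply `pb_relax_attention` inside the sublayer passed to `SandwichBlock`.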

5. Finetuning

A major part of the paper is dedicated to showing how the pretrained CogView can be adapted for various downstream tasks.

  • Super-resolution: To overcome the blurriness from the VQ-VAE's compression, they finetune CogView to generate higher-resolution details. First, a model is finetuned to predict a 32×32 token image from a 16×16 token version. Then, to generate a very high-resolution (512×512 pixel, or 64×64 token) image, they use a center-continuous sliding-window strategy, generating the image patch by patch in an order that prioritizes the coherence of the central region. Figure 5 shows this strategy and its effect.


  • Image Captioning and Self-reranking: By simply reversing the input sequence order during finetuning (i.e., image tokens → text tokens), CogView can be turned into a captioning model. This captioner is then used for self-reranking. When generating multiple images for a single text prompt, each candidate image is fed back into the captioning model to compute the negative log-likelihood of the original prompt text given that image, a quantity the authors call Caption Loss (CapLoss). The image with the lowest CapLoss is selected as the best one. Figure 6 shows that images with low CapLoss are generally of high quality and relevance. (A minimal reranking sketch appears after this list.)


  • Style Learning: The model is finetuned on small, style-specific datasets (e.g., 1000 images of oil paintings). By using generic captions during finetuning (e.g., "An image of oil painting style") and specific prompts during generation (e.g., "A cat of oil painting style"), the model learns to apply the learned style to new objects. Figure 7 demonstrates this for Shanghai's Oriental Pearl Tower.


  • Industrial Fashion Design: For domain-specific tasks like fashion, the authors train a specialized model. They use a VQGAN (a variant of VQ-VAE with a GAN loss for more realistic textures) and train a 3B-parameter model on 10 million fashion-caption pairs to generate high-resolution (800×800) clothing designs.

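As a rough illustration of the self-reranking idea referenced above, the sketch below scores each candidate image by the average negative log-likelihood of the prompt tokens under a caption model and keeps the lowest-scoring candidate. The `logprob_fn` interface is a hypothetical stand-in for the finetuned image-to-text CogView, not its actual API.

```python
from typing import Callable, List, Sequence

TokenIds = Sequence[int]
LogProbFn = Callable[[object, TokenIds], List[float]]  # returns log p(t_i | image, t_<i) per token

def caption_loss(logprob_fn: LogProbFn, image: object, prompt_tokens: TokenIds) -> float:
    """CapLoss: mean negative log-likelihood of the prompt tokens given the image."""
    logps = logprob_fn(image, prompt_tokens)
    return -sum(logps) / len(prompt_tokens)

def self_rerank(candidates: List[object], prompt_tokens: TokenIds,
                logprob_fn: LogProbFn) -> object:
    """Generate many candidates elsewhere, then keep the one with the lowest CapLoss."""
    return min(candidates, key=lambda img: caption_loss(logprob_fn, img, prompt_tokens))
```

With a batched captioning model, the per-candidate scores would of course be computed in parallel rather than in a Python loop.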

6. Experimental Setup

  • Datasets:

    • Pretraining: A custom dataset of 30 million Chinese text-image pairs collected from various sources, including professional image websites, news articles, and search engines. English captions from datasets like Conceptual Captions were machine-translated to Chinese.
    • Evaluation: The primary benchmark is MS COCO, a standard dataset for image understanding tasks. Importantly, MS COCO was not part of the training set. To compare with DALL-E, they evaluate on a subset of 30,000 captions and apply Gaussian blur filters of varying radii to both generated and ground-truth images.
  • Evaluation Metrics:

    1. Fréchet Inception Distance (FID):
      • Conceptual Definition: FID measures the similarity between two sets of images (e.g., real vs. generated). It compares the statistics (mean and covariance) of features extracted from an intermediate layer of the InceptionV3 network. A lower FID score indicates that the distribution of generated images is closer to the distribution of real images, implying better quality and diversity. (A minimal computation sketch appears at the end of this section.)
      • Mathematical Formula: \mathrm{FID}(x, g) = \left\| \mu_x - \mu_g \right\|_2^2 + \mathrm{Tr}\left( \Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2} \right)
      • Symbol Explanation:
        • μ_x, μ_g: The mean of the feature vectors for the real (x) and generated (g) images.
        • Σ_x, Σ_g: The covariance matrices of the feature vectors for the real and generated images.
        • Tr(·): The trace of a matrix (sum of diagonal elements).
    2. Inception Score (IS):
      • Conceptual Definition: IS measures both the quality (clarity and recognizability of objects) and diversity of generated images. It uses an Inception network to classify images. A high score is achieved if (1) the model generates images that are confidently classified into a single class (high quality) and (2) the generated images span a wide variety of classes (high diversity).
      • Mathematical Formula: \mathrm{IS}(G) = \exp\left( \mathbb{E}_{x \sim G}\, D_{\mathrm{KL}}(p(y|x) \,\|\, p(y)) \right)
      • Symbol Explanation:
        • x ∼ G: An image x sampled from the generator G.
        • p(y|x): The conditional class distribution (classifier output) for image x.
        • p(y): The marginal class distribution, averaged over all generated images.
        • D_KL(· ‖ ·): The Kullback-Leibler (KL) divergence, which measures how different two probability distributions are.
    3. Caption Loss (CapLoss):
      • Conceptual Definition: Proposed by the authors, this metric measures the text-image correspondence. It is the negative log-likelihood of the ground-truth text caption given the generated image, calculated using the finetuned image-to-text model. A lower CapLoss means the image is a better match for the caption.
      • Mathematical Formula: \mathrm{CapLoss}(x, t) = \frac{1}{|t|} \sum_{i=0}^{|t|} -\log p(t_i \mid x, t_{0:i-1})
      • Symbol Explanation:
        • x: The generated image.
        • t: The sequence of text tokens in the original caption.
        • p(t_i | x, t_{0:i-1}): The probability of the i-th text token, given the image x and all preceding text tokens, as predicted by the captioning model.
  • Baselines:

    • GAN-based models: AttnGAN, DM-GAN, DF-GAN.
    • Transformer-based model: DALL-E.
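
For reference, FID as defined above can be computed from two sets of Inception features in a few lines of NumPy/SciPy. This is a generic sketch of the formula, not the evaluation code used in the paper; `feats_real` and `feats_gen` are assumed to be arrays of pooled InceptionV3 features, one row per image.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet Inception Distance between two feature sets of shape (N, D)."""
    mu_x, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the product of covariances; drop the tiny imaginary
    # parts that arise from numerical error.
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)
    covmean = covmean.real

    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```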

7. Results & Analysis

  • Core Results: The main quantitative results are presented in Table 1. CogView is evaluated against baselines on the MS COCO dataset.

    Manual transcription of Table 1 from the paper.

    | Model   | FID-0 | FID-1 | FID-2 | FID-4 | FID-8 | IS   | CapLoss |
    |---------|-------|-------|-------|-------|-------|------|---------|
    | AttnGAN | 35.2  | 44.0  | 72.0  | 108.0 | 100.0 | 23.3 | 3.01    |
    | DM-GAN  | 26.5  | 39.0  | 73.0  | 119.0 | 112.3 | 32.2 | 2.87    |
    | DF-GAN  | 26.5  | 33.8  | 55.9  | 91.0  | 97.0  | 18.7 | 3.09    |
    | DALL-E  | 27.5  | 28.0  | 45.5  | 83.5  | 85.0  | 17.9 | –       |
    | CogView | 27.1  | 19.4  | 13.9  | 19.4  | 23.6  | 18.2 | 2.43    |

    FID-k denotes FID computed after blurring both generated and ground-truth images with a Gaussian filter of radius k; lower FID and CapLoss are better, higher IS is better.
    • FID Scores: CogView achieves a significantly lower (better) FID score than all baselines, including DALL-E, especially on the blurred images (FID-1, FID-2). A lower FID on blurred images suggests better high-level structural correctness.
    • IS Scores: CogView's Inception Score is competitive but not the highest. The authors argue that IS is not a suitable metric for this complex task, as it favors models that generate simple, single-object images, which is supported by DM-GAN having the highest IS but poor human evaluation scores.
    • CapLoss: CogView achieves the lowest (best) CapLoss, indicating the best text-image alignment among the models evaluated with this metric.
  • Ablations / Parameter Sensitivity:

    • Reranking Comparison: Figure 9 compares self-reranking (using CapLoss) with CLIP-based reranking. Self-reranking consistently achieves a better FID score as the number of candidates increases, while CLIP is more effective at improving the IS. This reinforces the idea that CapLoss is better aligned with the FID metric for this task.


    • Human Evaluation: The human evaluation results in Figure 10 are highly persuasive. CogView is chosen as the "best" image nearly 40% of the time, far outperforming the GAN baselines and approaching the performance of the "recovered ground truth" (the theoretical best result for the VQ-VAE tokenizer). The super-resolution finetuned model even surpasses this upper bound in clarity, demonstrating its ability to add plausible details.


8. Conclusion & Reflections

  • Conclusion Summary: The paper successfully presents CogView, a large-scale Transformer model that significantly advances the state of the art in text-to-image generation. The core contributions are the model itself, the crucial training stabilization techniques (PB-Relax and Sandwich-LN) that enable stable FP16 pretraining, and the demonstration of versatile finetuning capabilities. The proposed CapLoss metric and self-reranking method provide an efficient and effective alternative to using external models for image selection.

  • Limitations & Future Work: The authors acknowledge two primary limitations inherent to the auto-regressive framework:

    1. Slow Generation Speed: Generating images token-by-token is inherently slow compared to one-shot methods like GANs.
    2. Blurriness: The use of VQ-VAE for tokenization introduces a lossy compression step, resulting in generated images that are blurrier than real-world photos. While super-resolution finetuning helps, it doesn't completely solve the problem.
  • Personal Insights & Critique:

    • The most significant and lasting contribution of this paper might be the training stabilization techniques. PB-Relax and Sandwich-LN are simple, elegant, and generalizable solutions to a critical engineering problem in training extremely large Transformer models. These methods have implications far beyond text-to-image generation and can be applied to any very deep Transformer architecture.
    • The concept of self-reranking via CapLoss is clever and resource-efficient. It demonstrates a powerful principle: a sufficiently capable generative model can also be used for evaluation and discrimination, unifying different aspects of a task within a single model. This is a more integrated approach than relying on a separate, large contrastive model like CLIP.
    • The paper is a strong example of how progress in AI often comes from combining existing powerful ideas (Transformers, VQ-VAE) and then rigorously solving the engineering challenges that arise at scale. The authors' focus on identifying and fixing the root causes of training instability was key to their success.
    • While CogView represented a major step forward, the field has since moved towards diffusion models, which have largely overcome the blurriness and slow generation issues of auto-regressive models. Nonetheless, CogView and DALL-E were foundational in demonstrating that a unified, large-scale generative pretraining approach could master the complex task of text-to-image synthesis.
