Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
TL;DR Summary
Imagen combines large transformer LMs with diffusion models to achieve unprecedented photorealism and deep language understanding, outperforming prior methods with state-of-the-art FID on COCO and superior text-image alignment.
Abstract
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.
In-depth Reading
English Analysis
1. Bibliographic Information
-
Title: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
-
Authors: Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi. The authors are affiliated with Google Research, Brain Team.
-
Journal/Conference: The paper was first released as an arXiv preprint and was subsequently published at NeurIPS 2022. It is considered a seminal work in the field of generative AI.
-
Publication Year: 2022
-
Abstract: The authors introduce Imagen, a text-to-image diffusion model that achieves remarkable photorealism and a profound understanding of language. The model's architecture leverages large transformer language models (LMs) for text comprehension and diffusion models for high-quality image synthesis. The central discovery is that large language models pre-trained solely on text (like T5) are highly effective at encoding text for image generation. Scaling the size of the language model was found to improve sample quality and image-text alignment more than scaling the image diffusion model. Imagen set a new state-of-the-art Fréchet Inception Distance (FID) score of 7.27 on the COCO dataset without being trained on it. To facilitate more in-depth model evaluation, the authors introduce DrawBench, a challenging new benchmark. Human evaluators preferred Imagen over other leading models like DALL-E 2 in side-by-side comparisons on DrawBench for both sample quality and alignment.
-
Original Source Link: The paper is available as a preprint at https://arxiv.org/abs/2205.11487.
2. Executive Summary
-
Background & Motivation (Why):
- At the time of publication, text-to-image synthesis was rapidly advancing, but models still struggled to simultaneously achieve high-fidelity photorealism and a deep, compositional understanding of complex text prompts. Existing models often used text encoders trained on image-text datasets (e.g., CLIP), which, while effective, were limited by the scale and richness of available multimodal data.
- The core problem was how to better translate intricate, nuanced natural language descriptions into corresponding high-quality images. Prior work had not fully explored the potential of massive language models that were pre-trained exclusively on text corpora, which are vastly larger and more diverse than image-text datasets.
- The paper introduces a fresh angle by hypothesizing that a deep understanding of language itself is the most critical bottleneck. Instead of using a visually-grounded text encoder, Imagen leverages a frozen, pre-trained, text-only large language model (T5), separating the task of language understanding from image generation.
-
Main Contributions / Findings (What):
-
Large Language Models for Text Encoding: The paper's key discovery is that large, pre-trained, text-only language models are surprisingly effective text encoders for text-to-image generation. Scaling the size of this text encoder provides more significant gains in image quality and alignment than scaling the image diffusion model.
-
Dynamic Thresholding: A novel sampling technique called dynamic thresholding is introduced. It allows the use of high guidance weights, which improve text alignment, without the image quality degradation (oversaturation) seen in previous methods.
-
Efficient U-Net Architecture: The paper proposes Efficient U-Net, a new variant of the U-Net architecture that is simpler, more memory-efficient, and converges faster than prior designs used in diffusion models.
-
State-of-the-Art Performance: Imagen achieved a new state-of-the-art zero-shot FID score of 7.27 on the MS-COCO dataset, outperforming all previous models, including those trained directly on COCO.
-
DrawBench Benchmark: A new, comprehensive benchmark named DrawBench was introduced to evaluate text-to-image models on more challenging aspects such as compositionality, cardinality, spatial relations, and rare words. On this benchmark, human evaluators significantly preferred Imagen over other leading models, including the concurrent DALL-E 2.
-
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Diffusion Models: These are a class of generative models that learn to create data by reversing a gradual noising process. The model is trained to predict the original data (or the noise that was added) from a noisy version. During generation, it starts with pure random noise and iteratively applies this denoising function to produce a clean, realistic sample (a toy sampling loop is sketched after this list).
- Transformer Language Models (LMs): These are neural network architectures, like BERT and T5, that have become the standard for natural language processing. They are pre-trained on enormous amounts of text data (e.g., large portions of the web) to learn deep patterns, syntax, and semantics of language. A key feature is the self-attention mechanism, which allows the model to weigh the importance of different words in the input text when processing it.
- Classifier-Free Guidance: A technique used during the sampling (generation) phase of conditional diffusion models to improve adherence to the conditioning signal (e.g., a text prompt). It works by training a single model that can operate both conditionally and unconditionally. During sampling, the model's output is pushed away from the unconditional prediction and towards the conditional one, with a "guidance weight" controlling the strength of this effect.
- Cascaded Models: An architectural strategy where a pipeline of models is used to generate high-resolution data. In this paper, a base model generates a low-resolution image (64x64), which is then passed to a series of super-resolution models that progressively increase its size (to 256x256, then 1024x1024), adding finer details at each stage.
- CLIP (Contrastive Language-Image Pre-training): A model trained on hundreds of millions of image-text pairs from the internet. It learns a shared embedding space where the embeddings of a matching image and text caption are close together. This makes its text encoder a natural choice for text-to-image tasks, as it is inherently "visual." Imagen compares its T5-based approach against a CLIP-based one.
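To make the denoising intuition concrete, here is a minimal, self-contained Python sketch of a deterministic (DDIM-style) sampling loop for an x-prediction diffusion model. The cosine noise schedule and the `x0_predictor` stand-in are illustrative assumptions, not Imagen's actual schedule or network.

```python
import numpy as np

def ddim_sample(x0_predictor, shape, num_steps=50, rng=None):
    """Toy deterministic sampler: start from pure noise and repeatedly step
    toward the model's estimate of the clean sample."""
    rng = rng or np.random.default_rng(0)
    alpha = lambda t: np.cos(0.5 * np.pi * t)   # assumed cosine noise schedule
    sigma = lambda t: np.sin(0.5 * np.pi * t)
    z = rng.standard_normal(shape)              # pure Gaussian noise at t = 1
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0_hat = x0_predictor(z, t_cur)         # model's guess of the clean sample
        eps_hat = (z - alpha(t_cur) * x0_hat) / max(sigma(t_cur), 1e-8)
        z = alpha(t_next) * x0_hat + sigma(t_next) * eps_hat  # move to a less noisy level
    return z

# usage with a trivial stand-in "model" that always predicts zeros
ddim_sample(lambda z, t: np.zeros_like(z), shape=(4,))
```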
-
Previous Works:
- GANs (Generative Adversarial Networks): Models like AttnGAN and XMC-GAN were early successes in text-to-image generation but were often difficult to train and struggled to generate diverse, high-fidelity images.
- VQ-VAE + Transformers: Methods like DALL-E and Make-A-Scene first tokenize an image into discrete visual words and then use an autoregressive transformer to generate these tokens from a text prompt. While powerful, they could sometimes lack photorealism.
- Other Diffusion Models:
  - GLIDE: A text-guided diffusion model that also used cascaded generation but relied on a smaller, custom-trained text encoder.
  - Latent Diffusion Models (LDM): These perform the diffusion process in a compressed "latent" space to improve computational efficiency.
  - DALL-E 2: A concurrent work that also achieved impressive results. However, its architecture is more complex, using a diffusion "prior" to map a CLIP text embedding to a CLIP image embedding, which then guides a cascaded diffusion process. Imagen is simpler, as it directly conditions the diffusion model on text embeddings from T5.
-
Technological Evolution: Imagen represents a pivotal moment where the advancements in two separate fields—large-scale language modeling and high-fidelity diffusion-based image generation—were combined. It shifted the focus from building specialized, visually-aware text encoders to leveraging the immense power of general-purpose, text-only language models.
-
Differentiation: The most significant innovation of Imagen is its core hypothesis: a better language model leads to a better image model. By using a large, frozen, text-only LM (T5-XXL), Imagen demonstrated that superior language understanding was more critical for high-quality, aligned image generation than using a multimodal encoder like CLIP. This insight, combined with technical improvements like dynamic thresholding, allowed it to surpass even more complex architectures like DALL-E 2.
4. Methodology (Core Technology & Implementation)
Imagen's architecture is a cascade of three text-conditional diffusion models, guided by embeddings from a large, frozen language model.
Figure A.4 (Image 29): A high-level overview of the Imagen pipeline. Text is first processed by a frozen text encoder. The resulting embeddings guide a base diffusion model to generate a 64x64 image. This image is then upscaled twice by two separate text-conditional super-resolution diffusion models to produce the final 1024x1024 output.
-
Principles: The core idea is to decouple deep language understanding from image synthesis. A massive, pre-trained language model provides a rich, contextual representation of the input text. This representation then guides a series of diffusion models, which are experts at generating photorealistic images, to produce an output that matches the text description.
-
Steps & Procedures:
- Text Encoding: An input text prompt is fed into a frozen T5-XXL language model. This produces a sequence of high-dimensional vector embeddings, one for each input token. Being "frozen" means the language model's weights are not updated during Imagen's training.
- Base Image Generation: A 64x64 text-to-image diffusion model takes the text embeddings as conditioning information. Starting from random noise, it iteratively denoises it over many steps to generate a 64x64 image that aligns with the prompt.
- First Super-Resolution: A 64x64 → 256x256 super-resolution diffusion model takes the 64x64 image and the same text embeddings as input. It upsamples the image, adding details and refining the structure to create a 256x256 image.
- Second Super-Resolution: A final 256x256 → 1024x1024 super-resolution diffusion model takes the 256x256 image and the text embeddings to generate the final high-resolution 1024x1024 image.
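The four steps above chain together in a straightforward way. The following is a minimal sketch of that flow; the function and method names (`t5_encode`, `sample`, the model objects) are hypothetical placeholders for the frozen T5 encoder and the three diffusion models, not Imagen's actual API.

```python
def generate(prompt: str,
             t5_encode,        # frozen T5-XXL text encoder (hypothetical callable)
             base_model,       # 64x64 text-to-image diffusion model
             sr_256_model,     # 64 -> 256 super-resolution diffusion model
             sr_1024_model):   # 256 -> 1024 super-resolution diffusion model
    """Cascaded text-to-image generation following the steps described above."""
    # 1. Encode the prompt once; the same embeddings condition every stage.
    text_emb = t5_encode(prompt)

    # 2. Base stage: denoise pure noise into a 64x64 image.
    img_64 = base_model.sample(cond=text_emb, resolution=64)

    # 3. First super-resolution stage: 64x64 -> 256x256, text-conditioned.
    img_256 = sr_256_model.sample(cond=text_emb, low_res=img_64, resolution=256)

    # 4. Second super-resolution stage: 256x256 -> 1024x1024.
    return sr_1024_model.sample(cond=text_emb, low_res=img_256, resolution=1024)
```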
-
Mathematical Formulas & Key Details:
-
Diffusion Model Objective: The model is trained to predict the original clean image $\boldsymbol{x}$ from a noised version $\boldsymbol{z}_t = \alpha_t \boldsymbol{x} + \sigma_t \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is random Gaussian noise and $t$ is the timestep. The objective is to minimize the weighted mean squared error:
$$\mathbb{E}_{\boldsymbol{x}, \boldsymbol{c}, \boldsymbol{\epsilon}, t}\left[ w_t \left\| \hat{\boldsymbol{x}}_\theta(\alpha_t \boldsymbol{x} + \sigma_t \boldsymbol{\epsilon}, \boldsymbol{c}) - \boldsymbol{x} \right\|_2^2 \right]$$
- $\boldsymbol{x}$: The clean, original image.
- $\boldsymbol{c}$: The conditioning information (text embeddings).
- $\boldsymbol{\epsilon}$: A random noise sample from a standard normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
- $t$: A timestep, typically sampled uniformly from $[0, 1]$.
- $\alpha_t, \sigma_t$: Functions of $t$ that control the signal-to-noise ratio at each timestep.
- $w_t$: A weighting function to prioritize certain timesteps.
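A minimal NumPy sketch of this x-prediction loss; `model` is a hypothetical denoiser taking the noised image, the timestep, and the text embeddings, and `alpha`, `sigma`, `w` are schedule functions of `t`.

```python
import numpy as np

def diffusion_loss(model, x, c, t, alpha, sigma, w, rng=None):
    """Weighted MSE between the model's prediction of the clean image x and x itself,
    given the noised input z_t = alpha(t) * x + sigma(t) * eps."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(x.shape)        # eps ~ N(0, I)
    z_t = alpha(t) * x + sigma(t) * eps       # forward (noising) process
    x_hat = model(z_t, t, c)                  # hypothetical denoiser: predicts the clean image
    return w(t) * np.mean((x_hat - x) ** 2)   # weighted MSE objective
```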
-
Classifier-Free Guidance: During sampling, the model's prediction is modified to enhance the effect of the text prompt. Instead of using the conditional noise prediction directly, an adjusted noise prediction is used:
$$\tilde{\boldsymbol{\epsilon}}_\theta(\boldsymbol{z}_t, \boldsymbol{c}) = w\, \boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, \boldsymbol{c}) + (1 - w)\, \boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t)$$
- $\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, \boldsymbol{c})$: The noise prediction from the model conditioned on the text $\boldsymbol{c}$.
- $\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t)$: The noise prediction from the model with no conditioning (unconditional).
- $w$: The guidance weight. $w = 1$ means no guidance; $w > 1$ strengthens the text conditioning, improving alignment.
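The guidance rule above in a few lines; `eps_model` is a hypothetical noise-prediction network that accepts `cond=None` for the unconditional branch.

```python
def guided_eps(eps_model, z_t, t, text_emb, w: float):
    """Classifier-free guidance: push the prediction away from the
    unconditional output and toward the text-conditional one."""
    eps_cond = eps_model(z_t, t, cond=text_emb)   # conditioned on the prompt
    eps_uncond = eps_model(z_t, t, cond=None)     # unconditional prediction
    # Equivalent to w * eps_cond + (1 - w) * eps_uncond; w = 1 recovers plain conditioning.
    return eps_uncond + w * (eps_cond - eps_uncond)
```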
-
Dynamic Thresholding: A key innovation to enable large guidance weights ($w \gg 1$). High values of $w$ can push predicted pixel values outside the standard range $[-1, 1]$, leading to oversaturated, unrealistic images. Dynamic thresholding addresses this:
- At each sampling step, predict the clean image $\hat{\boldsymbol{x}}_0$.
- Calculate a dynamic threshold value $s$, which is a high percentile (e.g., 99.5%) of the absolute pixel values in $\hat{\boldsymbol{x}}_0$.
- If $s > 1$, clip all pixel values in $\hat{\boldsymbol{x}}_0$ to the range $[-s, s]$.
- Rescale the clipped image by dividing all pixel values by $s$. This brings the values back into the range $[-1, 1]$ while preserving relative color information and preventing saturation.
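A minimal NumPy sketch of the thresholding step described above, applied to an x0 prediction whose pixel values are nominally in [-1, 1]; the 99.5th-percentile choice follows the paper.

```python
import numpy as np

def dynamic_threshold(x0_hat: np.ndarray, percentile: float = 99.5) -> np.ndarray:
    """Clip and rescale an x0 prediction so high guidance weights
    do not push pixels far outside [-1, 1] (avoids oversaturation)."""
    s = np.percentile(np.abs(x0_hat), percentile)  # dynamic threshold s
    s = max(s, 1.0)                                # never shrink below the standard range
    return np.clip(x0_hat, -s, s) / s              # clip to [-s, s], rescale to [-1, 1]
```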
-
Noise Conditioning Augmentation: The super-resolution models are made more robust by corrupting their low-resolution input images with varying levels of Gaussian noise during training. The model is conditioned on the aug_level (noise level), learning to denoise a wide range of input qualities. This helps it handle artifacts from the previous stage and allows for more creative variations at higher resolutions.
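A sketch of how the low-resolution conditioning image might be corrupted during super-resolution training. The Gaussian-noise form and the uniformly sampled `aug_level` follow the description above; the exact variance-preserving mixing is an assumption, not the paper's exact recipe.

```python
import numpy as np

def augment_low_res(low_res: np.ndarray, rng=None):
    """Corrupt the low-res conditioning image with a random amount of Gaussian
    noise; the super-resolution model is also conditioned on the noise level."""
    rng = rng or np.random.default_rng(0)
    aug_level = rng.uniform(0.0, 1.0)                      # sampled noise level
    noisy = (np.sqrt(1.0 - aug_level**2) * low_res
             + aug_level * rng.standard_normal(low_res.shape))
    return noisy, aug_level                                # both are fed to the model
```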
-
Neural Network Details: The diffusion models use a U-Net architecture. The text conditioning is applied via cross-attention layers at multiple resolutions within the U-Net, allowing the model to focus on relevant parts of the text embedding sequence when generating different parts of the image. The paper also introduces an Efficient U-Net for the 64x64 → 256x256 model, which optimizes memory and speed by rearranging network blocks and using grouped normalization.
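A compact PyTorch sketch of cross-attention text conditioning inside a U-Net block: each spatial position of the feature map queries the sequence of text-token embeddings. The dimensions and module layout are illustrative assumptions, not Imagen's exact architecture.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Image feature-map tokens attend over the sequence of text embeddings."""
    def __init__(self, img_channels: int = 256, text_dim: int = 4096, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=img_channels, num_heads=heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; text_emb: (B, L, text_dim) token embeddings
        b, c, h, w = feat.shape
        q = feat.flatten(2).transpose(1, 2)            # (B, H*W, C): one query per pixel
        out, _ = self.attn(q, text_emb, text_emb)      # cross-attend to the text tokens
        return feat + out.transpose(1, 2).reshape(b, c, h, w)  # residual connection
```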
-
5. Experimental Setup
-
Datasets:
- Training: A large internal dataset of ~460 million image-text pairs, combined with the publicly available LAION-400M dataset (~400 million pairs). The authors note that these datasets are largely uncurated and contain undesirable content.
- Evaluation:
- MS-COCO: The standard benchmark for text-to-image evaluation. The authors evaluate in a "zero-shot" setting, meaning the model was not trained on any COCO data. They use 30,000 captions from the validation set.
- DrawBench: A new benchmark created by the authors, consisting of 200 challenging prompts organized into 11 categories. These prompts are designed to test specific model capabilities like compositionality (e.g., "a red cube on top of a blue sphere"), cardinality ("a photo of two dogs"), spatial relations, and handling of rare words.
-
Evaluation Metrics:
-
Fréchet Inception Distance (FID):
- Conceptual Definition: FID measures the quality and diversity of generated images by comparing the feature distributions of generated images and real images. A lower FID score indicates that the distribution of generated images is more similar to the distribution of real images, suggesting better quality. It is computed using features from a pre-trained Inception-v3 network.
- Mathematical Formula: $\text{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)$
- Symbol Explanation:
- $r$ and $g$ denote the sets of real and generated image features, respectively.
- $\mu_r$ and $\mu_g$ are the mean vectors of the features for real and generated images.
- $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of the features.
- $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.
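A minimal sketch of this formula, assuming the Inception-v3 features have already been extracted into two arrays; it uses `scipy.linalg.sqrtm` for the matrix square root.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet Inception Distance between two feature sets of shape (num_images, dim)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))
```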
-
CLIP Score:
- Conceptual Definition: CLIP Score measures how well an image aligns with a given text prompt. It calculates the cosine similarity between the CLIP embeddings of the generated image and the text prompt. A higher CLIP score indicates better image-text alignment.
- Mathematical Formula: $\text{CLIP score} = \cos\left( E_I, E_T \right) = \dfrac{E_I \cdot E_T}{\left\| E_I \right\| \left\| E_T \right\|}$
- Symbol Explanation:
- $E_I$ is the CLIP embedding of the generated image.
- $E_T$ is the CLIP embedding of the text prompt.
- $\cos(\cdot, \cdot)$ is the cosine similarity function.
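Since the score reduces to a cosine similarity, a minimal sketch only needs the two embeddings, which are assumed to have already been computed by a CLIP model.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between CLIP image and text embeddings;
    higher means better image-text alignment."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(np.dot(image_emb, text_emb))
```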
-
Human Evaluation:
- On COCO: Raters were asked two questions: 1) To assess photorealism, they chose between an Imagen-generated image and the ground-truth COCO image ("Which is more photorealistic?"). 2) To assess alignment, they rated how well a caption described an image (either generated or real) on a scale of "yes" (100), "somewhat" (50), or "no" (0).
- On DrawBench: For direct model comparison, raters were shown images from two models (e.g., Imagen vs. DALL-E 2) for the same prompt and asked to choose which model they preferred in terms of sample fidelity and image-text alignment.
-
-
Baselines: The paper compares Imagen against several contemporary state-of-the-art models:
-
DALL-E 2 (concurrent work)
GLIDE -
Latent Diffusion Models (LDM) -
VQ-GAN+CLIP
-
6. Results & Analysis
-
Core Results:
-
COCO Benchmark: Imagen established a new state-of-the-art for zero-shot text-to-image synthesis.
-
The following is a transcription of Table 1 from the paper.
| Model | FID-30K | Zero-shot FID-30K |
|---|---|---|
| AttnGAN [76] | 35.49 | |
| DM-GAN [83] | 32.64 | |
| DF-GAN [69] | 21.42 | |
| DM-GAN + CL [78] | 20.79 | |
| XMC-GAN [81] | 9.33 | |
| LAFITE [82] | 8.12 | |
| Make-A-Scene [22] | 7.55 | |
| DALL-E [53] | | 17.89 |
| LAFITE [82] | | 26.94 |
| GLIDE [41] | | 12.24 |
| DALL-E 2 [54] | | 10.39 |
| Imagen (Our Work) | | 7.27 |
Imagen's zero-shot FID score of 7.27 was significantly better than all previous models, including DALL-E 2 (10.39) and even models trained directly on COCO like Make-A-Scene (7.55).
-
Human evaluations on COCO (transcribed from Table 2) revealed that Imagen's alignment score (91.4) was on par with the ground-truth COCO images (91.9). For photorealism, it was preferred over real images 39.5% of the time, with this rate increasing to 43.9% when images with people were excluded, indicating a weakness in generating realistic humans.
| Model | Photorealism ↑ | Alignment ↑ |
|---|---|---|
| Original | 50.0% | 91.9 ± 0.42 |
| Imagen | 39.5 ± 0.75% | 91.4 ± 0.44 |
| Original (no people) | 50.0% | 92.2 ± 0.54 |
| Imagen (no people) | 43.9 ± 1.01% | 92.1 ± 0.55 |
-
-
DrawBench Benchmark: Human evaluations showed a strong preference for Imagen over competitors.
Figure 3 (Image 12): Human preference rates on DrawBench. In pairwise comparisons against DALL-E 2, GLIDE, VQ-GAN+CLIP, and Latent Diffusion, human raters overwhelmingly preferred Imagen's results for both image-text alignment and image fidelity.
-
-
Ablations / Parameter Sensitivity: The paper's strength lies in its thorough ablation studies, which support its key claims.
Figure 4 (Image 12): A summary of key ablation results, plotting FID vs. CLIP scores.
-
(a) Scaling Text Encoder vs. U-Net: The plots clearly show that increasing the text encoder size (from T5-Base to T5-XXL) results in a much larger improvement (lower FID, higher CLIP score) than increasing the U-Net size. This is the paper's central finding.
-
(b) Importance of U-Net Size: While less impactful than the text encoder, scaling the U-Net of the diffusion model also leads to consistent, albeit smaller, improvements in sample quality.
-
(c) Dynamic Thresholding: This plot demonstrates that dynamic thresholding achieves a significantly better Pareto frontier (better FID for any given CLIP score) compared to static thresholding or no thresholding, especially at high guidance weights (which produce high CLIP scores).
-
T5-XXL vs. CLIP: While automated metrics on COCO were similar, human evaluators on the more challenging DrawBench benchmark preferred T5-XXL over CLIP for both alignment and fidelity across all categories. This suggests T5's richer language understanding is crucial for complex prompts.
-
Other Findings: The paper also confirms that noise conditioning augmentation for the super-resolution models and the use of cross-attention for text conditioning are critical for achieving the best results.
-
7. Conclusion & Reflections
-
Conclusion Summary: The paper presents Imagen, a text-to-image model that set a new standard for photorealism and language understanding. Its main contribution is the discovery that large, general-purpose, text-only language models serve as extremely powerful text encoders for this task. The authors show that scaling the language model is more effective than scaling the image generation model. This finding, combined with architectural innovations like dynamic thresholding and an Efficient U-Net, enabled Imagen to achieve state-of-the-art results on COCO and outperform competitors in human evaluations on the new DrawBench benchmark.
-
Limitations & Future Work: The authors provide a remarkably candid and extensive discussion of the model's limitations and the broader societal impact.
- Data Bias: Imagen is trained on large, uncurated web datasets, which contain harmful stereotypes, oppressive viewpoints, and pornographic content. The model learns and can reproduce these biases, for instance showing a tendency to generate people with lighter skin tones and aligning professions with Western gender stereotypes.
- Representational Harm: The model's biases risk causing representational harm to marginalized communities.
- Malicious Use: Like any powerful generative tool, it could be misused for creating misinformation or harassment.
- Performance Gaps: The model's performance degrades when generating images of people, as confirmed by human evaluations.
- Decision Not to Release: Due to these significant ethical concerns and limitations, the authors made the decision not to release the model code or a public demo. They call for future work on responsible externalization frameworks and better methods for auditing and mitigating bias in text-to-image models.
-
Personal Insights & Critique:
- Impact and Novelty: This paper was a landmark in generative AI. The insight that a text-only LM could outperform a visually-trained LM like CLIP for text encoding was counter-intuitive but transformative. It simplified the conceptual model for text-to-image generation and highlighted that language understanding was a key bottleneck that could be solved by leveraging progress from the NLP field.
- Technical Rigor: The paper is an exemplar of rigorous empirical research. The ablation studies are comprehensive and convincingly support each of the key claims. The introduction of dynamic thresholding was a simple yet highly effective technical contribution that has been widely adopted.
- Contribution to Evaluation: The creation of DrawBench was a crucial contribution. It moved the field beyond the limitations of COCO and provided a more nuanced tool for measuring a model's compositional and creative abilities, which has influenced how subsequent models are evaluated.
- Responsible AI: The extensive and thoughtful section on societal impact set a high bar for responsible disclosure in AI research. By openly discussing the model's flaws and the risks of deployment, the authors fostered a critical conversation about the ethics of generative models at a time when the field was exploding in popularity. This responsible stance was as important as the technical contributions.