
High-Resolution Image Synthesis with Latent Diffusion Models

Published: 12/21/2021
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work introduces Latent Diffusion Models, which reduce computational costs by operating in the latent space of a pretrained autoencoder while preserving image detail. Leveraging cross-attention, they enable flexible conditional high-resolution image synthesis and set a new state of the art for image inpainting while remaining highly competitive on other synthesis tasks.

Abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .


In-depth Reading


1. Bibliographic Information

  • Title: High-Resolution Image Synthesis with Latent Diffusion Models

  • Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.

  • Affiliations: The authors are affiliated with Ludwig Maximilian University of Munich, IWR at Heidelberg University, and Runway ML. Their collective background is in computer vision and deep generative models.

  • Journal/Conference: The paper was submitted to arXiv, a preprint server. It was later presented at the Conference on Computer Vision and Pattern Recognition (CVPR) in 2022, which is a top-tier, highly competitive conference in the computer vision field.

  • Publication Year: 2021 (arXiv preprint), 2022 (CVPR conference version).

  • Abstract: The paper addresses the immense computational cost of training and running state-of-the-art Diffusion Models (DMs), which operate in the high-dimensional pixel space. The authors propose Latent Diffusion Models (LDMs), which perform the diffusion process in the more compact latent space of a powerful pretrained autoencoder. This approach drastically reduces computational requirements while preserving, and in some cases improving, visual quality. By incorporating a cross-attention mechanism, LDMs can be conditioned on various inputs like text or bounding boxes, enabling flexible and powerful synthesis. LDMs achieve new state-of-the-art results for image inpainting and highly competitive performance on tasks like text-to-image synthesis, unconditional generation, and super-resolution, all while being significantly more efficient than their pixel-based counterparts.

  • Original Source Link: The paper is available as a preprint at https://arxiv.org/abs/2112.10752.


2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Diffusion Models (DMs) have become the state-of-the-art in image synthesis, producing high-fidelity and diverse images. However, their main drawback is their massive computational demand. They are trained and run directly on images in pixel space, which is extremely high-dimensional. This leads to:
      1. Exorbitant Training Costs: Training powerful DMs can take hundreds of GPU days, making it inaccessible to most researchers and practitioners.
      2. Slow Inference: Generating a single image requires hundreds of sequential evaluation steps, making inference time-consuming.
    • Identified Gap: The paper observes that DMs spend a significant portion of their capacity modeling imperceptible, high-frequency details in images. This is computationally wasteful. Previous attempts to use a latent space (e.g., VQGAN) required very aggressive compression to make subsequent modeling (with autoregressive transformers) feasible, which in turn sacrificed image detail and reconstruction quality.
    • Innovative Angle: The authors propose to separate the image generation process into two distinct stages:
      1. Perceptual Compression: Use a powerful autoencoder to map images into a lower-dimensional latent space that is perceptually equivalent to the original but computationally cheaper.
      2. Semantic Compression & Generation: Train a Diffusion Model in this compact latent space to learn the semantic and conceptual composition of the data. Because the DM's UNet architecture is well-suited for spatial data, it does not require the aggressive compression of prior methods, striking a better balance between efficiency and quality.
  • Main Contributions / Findings (What):

    • Latent Diffusion Models (LDMs): The paper introduces a new class of generative models that apply the diffusion process in the latent space of a pretrained autoencoder, significantly reducing the computational cost of training and inference.

    • Improved Efficiency-Quality Tradeoff: LDMs find a "sweet spot" for compression, preserving high-frequency details in the autoencoder stage and allowing the diffusion model to focus on the semantic content. This leads to faster training and sampling with little to no loss in visual fidelity.

    • Flexible Conditioning Mechanism: A novel cross-attention based conditioning mechanism is introduced, which allows LDMs to be guided by various modalities like text, semantic maps, or bounding boxes, turning them into powerful and flexible multi-modal generators.

    • State-of-the-Art Performance: LDMs achieve state-of-the-art (SOTA) or highly competitive results across a range of tasks, including class-conditional image synthesis (ImageNet), image inpainting (Places), and unconditional generation (CelebA-HQ), while using a fraction of the computational resources of previous SOTA models.

    • High-Resolution Convolutional Synthesis: The paper demonstrates that LDMs can be trained on smaller image patches and then used to generate much larger, consistent images (e.g., 1024x1024 pixels) in a convolutional manner for spatially conditioned tasks.

    • Public Release of Models: The authors released their pretrained autoencoders and LDM models, which proved to be a foundational contribution to the open-source community, most notably forming the basis for the widely popular Stable Diffusion model.


3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Generative Models: Algorithms that learn the underlying distribution of a dataset (e.g., of images) to generate new, synthetic data samples that resemble the original data.
    • Diffusion Models (DMs): A class of generative models inspired by thermodynamics. They work in two steps:
      1. Forward Process: Gradually add Gaussian noise to a real image over a series of timesteps until it becomes pure noise. This process is fixed and does not involve learning.
      2. Reverse Process: Train a neural network (typically a UNet) to reverse this process. Starting from random noise, the network iteratively removes noise to generate a clean image. DMs are known for their high-quality, diverse outputs but are slow to sample from.
    • Autoencoder: A neural network architecture consisting of an encoder and a decoder. The encoder compresses an input (e.g., an image) into a low-dimensional representation called the latent space. The decoder then reconstructs the original input from this latent representation. The goal is to learn a meaningful, compressed representation.
    • Variational Autoencoder (VAE): A type of generative autoencoder where the encoder produces parameters (mean and variance) for a probability distribution in the latent space. This regularizes the latent space, allowing one to sample from it to generate new data.
    • Generative Adversarial Network (GAN): A framework where two networks, a Generator and a Discriminator, are trained in competition. The Generator creates fake images, and the Discriminator tries to distinguish them from real ones. GANs can produce realistic images quickly but can be unstable to train and may suffer from "mode collapse" (lack of diversity).
    • UNet: A convolutional neural network architecture with an encoder-decoder structure and "skip connections" that link layers in the encoder to corresponding layers in the decoder. These connections help preserve fine-grained details, making it very effective for image-to-image tasks like denoising.
    • Cross-Attention Mechanism: An extension of the standard self-attention mechanism. It allows a model to selectively focus on information from a different source. In this paper, it lets the UNet (processing the image latent) "pay attention" to relevant parts of a conditioning input (like a text sentence) at each generation step.
  • Previous Works and Differentiation:

    • Pixel-Based Diffusion Models: Models like DDPM and ADM perform the diffusion process directly in pixel space. The paper highlights their success but frames them as computationally prohibitive, serving as the primary motivation for LDMs.
    • Two-Stage Models (e.g., VQ-VAE, VQGAN): These models also use a two-stage approach. First, an autoencoder (like VQ-VAE or VQGAN) compresses an image into a discrete latent representation. Second, an autoregressive model (like a Transformer) is trained to predict the sequence of latent codes.
    • Differentiation from Two-Stage Models: The key difference is in the second stage.
      • Previous models used autoregressive Transformers in the latent space. Transformers are not inherently biased towards spatial data, so to make training tractable, the latent space had to be aggressively downsampled (e.g., by a factor of $f = 16$ or $f = 32$), leading to a significant loss of detail.

      • LDMs use a UNet-based Diffusion Model in the latent space. The UNet's convolutional structure is an excellent inductive bias for image-like data. This allows LDMs to work with much milder compression factors (e.g., $f = 4$ or $f = 8$), preserving fine details and achieving higher-fidelity reconstructions, as illustrated in Figure 1.

        Figure 1 (paper caption: "Boosting the upper bound on achievable quality with less aggressive downsampling…"): This figure illustrates the core advantage of LDMs. It compares reconstructions from different models with varying spatial downsampling factors $f$. The LDM approach, with a mild downsampling factor of $f = 4$, achieves significantly better reconstruction quality (higher PSNR, lower R-FID) than models like DALL-E and VQGAN, which require more aggressive compression. This shows that LDMs strike a better balance between complexity reduction and detail preservation.


4. Methodology (Core Technology & Implementation)

The paper proposes a two-stage approach to separate perceptual learning from generative/semantic learning. This is visualized in Figure 2, which conceptualizes model training as moving from perceptual compression (removing redundant, imperceptible details) to semantic compression (learning the high-level concepts in the data). LDMs offload the perceptual compression to a separate, efficient autoencoder.

Figure 2: This diagram illustrates the authors' motivation. Digital images contain a lot of perceptually meaningless information (high-frequency noise/details). While DMs can learn to ignore this, they still have to process all pixels during training and inference, which is wasteful. The proposed LDM approach separates the task: a dedicated compression stage first removes these imperceptible details, and the generative DM then operates on this simplified, lower-dimensional representation.

The core method of Latent Diffusion Models (LDMs) can be broken down into three key components.

4.1. Perceptual Image Compression

This is the first stage. Its goal is to create a lower-dimensional but perceptually equivalent representation of an image.

  • Architecture: An autoencoder is used, consisting of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$.
    • The encoder $\mathcal{E}$ takes an image $x \in \mathbb{R}^{H \times W \times 3}$ and maps it to a latent representation $z = \mathcal{E}(x)$, where $z \in \mathbb{R}^{h \times w \times c}$.
    • The spatial resolution is reduced by a downsampling factor $f = H/h = W/w$.
    • The decoder $\mathcal{D}$ reconstructs the image $\tilde{x} = \mathcal{D}(z)$ from the latent code.
  • Training: To ensure the reconstructions are realistic and detailed (not blurry), the autoencoder is trained with a combination of a perceptual loss (which measures similarity in a deep feature space) and a patch-based adversarial (GAN) loss. This combination forces the reconstructions to lie on the natural image manifold.
  • Latent Space Regularization: To prevent the latent space from having arbitrarily high variance, two regularization schemes are explored:
    1. KL-reg: A small Kullback-Leibler (KL) penalty is applied to the latent codes, encouraging them to follow a standard normal distribution, similar to a VAE.

    2. VQ-reg: A Vector Quantization layer is used, which makes this model similar to a VQGAN.

      This autoencoder is trained only once and then frozen. It can be reused for training many different LDMs.
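To make the shapes concrete, here is a minimal PyTorch sketch of a toy encoder/decoder with downsampling factor f = 8; the layer widths, activation choice, and latent channel count c = 4 are illustrative assumptions, not the paper's actual KL-/VQ-regularized architecture or its perceptual/adversarial training losses.

```python
import torch
import torch.nn as nn

f, c = 8, 4  # downsampling factor and latent channel count (illustrative choices)

# Toy encoder: three stride-2 convolutions give f = 2**3 = 8 spatial downsampling.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(128, c, 3, stride=2, padding=1),
)

# Toy decoder: mirrors the encoder with transposed convolutions.
decoder = nn.Sequential(
    nn.ConvTranspose2d(c, 128, 4, stride=2, padding=1), nn.SiLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)

x = torch.randn(1, 3, 256, 256)   # H x W = 256 x 256 input image
z = encoder(x)                    # latent: 1 x c x H/f x W/f = 1 x 4 x 32 x 32
x_rec = decoder(z)                # reconstruction back to 1 x 3 x 256 x 256
print(z.shape, x_rec.shape)
```

In the actual model this frozen autoencoder is what all subsequent LDMs are trained on top of.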

4.2. Latent Diffusion Models

This is the second stage, where a standard Diffusion Model is trained on the latent representations learned in the first stage.

  • Diffusion Process in Latent Space: Instead of applying the diffusion process to pixel-space images $x$, it is applied to the latent codes $z = \mathcal{E}(x)$.

  • Objective Function: The training objective for the DM is to predict the noise added to a noisy latent code $z_t$. This is adapted from the standard DM objective:
$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0, 1),\, t} \Big[ \| \epsilon - \epsilon_{\theta}(z_t, t) \|_2^2 \Big]$$

    • $\mathcal{E}(x) = z$: The clean latent code from the encoder.
    • $t$: A timestep sampled uniformly from $\{1, \dots, T\}$.
    • $\epsilon$: The noise sampled from a standard normal distribution.
    • $z_t$: The latent code $z$ with noise $\epsilon$ added, corresponding to timestep $t$.
    • $\epsilon_{\theta}(z_t, t)$: The neural network (a time-conditional UNet) which takes the noisy latent $z_t$ and the timestep $t$ as input and tries to predict the noise $\epsilon$.
  • Generation: To generate a new image, we first sample a random latent code from noise, $z_T \sim \mathcal{N}(0, 1)$, then iteratively denoise it using the trained network $\epsilon_{\theta}$ to obtain a clean latent code $z_0$. Finally, this latent code is passed through the decoder $\mathcal{D}$ once to get the final, full-resolution image $x = \mathcal{D}(z_0)$.
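The objective above can be sketched as a single training step. Here `encoder` (the frozen first stage) and `unet` (the time-conditional noise predictor) are placeholders, and the linear noise schedule is an illustrative assumption rather than the released implementation's exact configuration.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                    # illustrative DDPM-style schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ldm_training_step(x, encoder, unet):
    """One step of L_LDM = E_{E(x), eps, t} || eps - eps_theta(z_t, t) ||_2^2."""
    with torch.no_grad():
        z0 = encoder(x)                                  # frozen first stage: z = E(x)
    b = z0.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)      # uniformly sampled timesteps
    eps = torch.randn_like(z0)                           # Gaussian noise target
    abar = alpha_bars.to(z0.device)[t].view(b, 1, 1, 1)
    zt = abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps    # noisy latent z_t
    eps_pred = unet(zt, t)                               # time-conditional UNet prediction
    return F.mse_loss(eps_pred, eps)                     # squared error on the noise
```

At inference the loop runs in reverse: sample $z_T \sim \mathcal{N}(0, 1)$, iteratively denoise with $\epsilon_\theta$ (e.g., using DDIM steps), and decode the final $z_0$ once with $\mathcal{D}$.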

4.3. Conditioning Mechanisms

To control the image generation process, the paper introduces a flexible conditioning mechanism based on cross-attention, as depicted in Figure 3.

Figure 3: This diagram shows how conditioning is incorporated into the LDM. A conditioning input $y$ (e.g., text, a semantic map) is first projected into an embedding space by an encoder $\tau_\theta$. This embedding is then injected into the UNet backbone of the diffusion model via cross-attention layers, allowing the model to guide generation based on the conditioning information. Simpler conditioning can also be done via concatenation.

  • General Framework: The model is augmented to accept a conditioning input $y$ (e.g., a text prompt, a class label, a semantic map). The denoising network becomes $\epsilon_{\theta}(z_t, t, y)$.
  • Cross-Attention:
    • A domain-specific encoder $\tau_{\theta}$ first converts the conditioning input $y$ into an intermediate representation $\tau_{\theta}(y) \in \mathbb{R}^{M \times d_{\tau}}$. For text, $\tau_{\theta}$ can be a Transformer.
    • This representation is then injected into the UNet of the LDM via cross-attention layers. The attention mechanism is defined as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$$
    • Here, the Query ($Q$) comes from the UNet's internal representation of the image latent, while the Key ($K$) and Value ($V$) come from the conditioning representation:
      • $Q = W_Q^{(i)} \cdot \varphi_i(z_t)$
      • $K = W_K^{(i)} \cdot \tau_{\theta}(y)$
      • $V = W_V^{(i)} \cdot \tau_{\theta}(y)$
    • $\varphi_i(z_t)$ is an intermediate flattened feature map from the UNet, and $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$ are learnable projection matrices. This allows the model to correlate spatial regions of the image with parts of the conditioning input.
  • Conditional Objective: The final loss function is optimized jointly over the UNet ($\epsilon_{\theta}$) and the conditioning encoder ($\tau_{\theta}$):
$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, 1),\, t} \left[ \| \epsilon - \epsilon_{\theta}\bigl(z_t, t, \tau_{\theta}(y)\bigr) \|_2^2 \right]$$
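A minimal single-head sketch of the cross-attention computation above; the dimensions and the standalone module are illustrative simplifications (the actual UNet uses multi-head attention inside its intermediate blocks).

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: flattened image features attend to conditioning tokens."""

    def __init__(self, d_img: int, d_cond: int, d: int = 64):
        super().__init__()
        self.w_q = nn.Linear(d_img, d, bias=False)   # Q from UNet features phi_i(z_t)
        self.w_k = nn.Linear(d_cond, d, bias=False)  # K from tau_theta(y)
        self.w_v = nn.Linear(d_cond, d, bias=False)  # V from tau_theta(y)
        self.scale = d ** -0.5

    def forward(self, phi_zt: torch.Tensor, tau_y: torch.Tensor) -> torch.Tensor:
        # phi_zt: (B, N, d_img) flattened spatial features of the noisy latent
        # tau_y:  (B, M, d_cond) conditioning tokens (e.g. text embeddings)
        q, k, v = self.w_q(phi_zt), self.w_k(tau_y), self.w_v(tau_y)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                              # (B, N, d)

# Example shapes: a 32x32 latent flattened to 1024 tokens attending to 77 text tokens.
layer = CrossAttention(d_img=320, d_cond=768)
out = layer(torch.randn(2, 1024, 320), torch.randn(2, 77, 768))
print(out.shape)  # torch.Size([2, 1024, 64])
```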

5. Experimental Setup

  • Datasets: A wide variety of standard datasets were used to demonstrate the versatility of LDMs:

    • Unconditional Generation: CelebA-HQ (faces), FFHQ (faces), LSUN-Churches & LSUN-Bedrooms (scenes).
    • Class-Conditional Generation: ImageNet (1000 object classes).
    • Text-to-Image Generation: LAION-400M (large-scale image-text pairs) for training, MS-COCO for evaluation.
    • Layout-to-Image Generation: OpenImages and COCO-Stuff.
    • Super-Resolution: ImageNet.
    • Inpainting: Places.
  • Evaluation Metrics:

    1. Fréchet Inception Distance (FID): Measures the similarity between the distribution of generated images and real images. It computes the Fréchet distance between two multivariate Gaussians fitted to the feature representations of images from the InceptionV3 network. Lower is better.
      • Formula: $d^2 = \|\mu_x - \mu_y\|_2^2 + \text{Tr}\bigl(\Sigma_x + \Sigma_y - 2(\Sigma_x \Sigma_y)^{1/2}\bigr)$
      • Symbols: $\mu_x, \mu_y$ are the means and $\Sigma_x, \Sigma_y$ are the covariance matrices of the Inception features for real and generated images, respectively.
    2. Inception Score (IS): Measures both image quality and diversity. Higher is better. It is known to be less reliable than FID.
      • Formula: $\text{IS}(G) = \exp\bigl(\mathbb{E}_{x \sim p_g}\, D_{KL}(p(y|x)\,\|\,p(y))\bigr)$
      • Symbols: $p(y|x)$ is the conditional class distribution for a generated image $x$ from the Inception network, and $p(y)$ is the marginal class distribution.
    3. Precision and Recall: Used to separately measure the fidelity (realism) and diversity (coverage) of generated samples. Higher is better for both.
    4. Peak Signal-to-Noise Ratio (PSNR): Measures the ratio between the maximum possible power of a signal and the power of corrupting noise. Used for reconstruction tasks. Higher is better but doesn't always align with human perception.
      • Formula: $\text{PSNR} = 20 \cdot \log_{10}(\text{MAX}_I) - 10 \cdot \log_{10}(\text{MSE})$
      • Symbols: $\text{MAX}_I$ is the maximum pixel value (e.g., 255), and MSE is the Mean Squared Error between the original and reconstructed images.
    5. Structural Similarity Index (SSIM): Measures the similarity between two images based on luminance, contrast, and structure. Higher is better.
    6. Learned Perceptual Image Patch Similarity (LPIPS): A metric that compares images using features from a deep neural network, designed to align better with human perceptual judgment. Lower is better.
  • Baselines: The paper compares LDMs against a comprehensive set of contemporary SOTA models, including pixel-based DMs (ADM), GANs (StyleGAN2, ProjectedGAN), autoregressive models (ImageBART, CogView), and specialized task-specific models (SR3 for super-resolution, LaMa for inpainting).
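For reference, minimal NumPy/SciPy implementations of the FID and PSNR formulas listed above, operating on precomputed Inception features and raw images; these are textbook versions and not the exact evaluation code used in the paper.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to Inception features of shape (N, D)."""
    mu_x, mu_y = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_y = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma_x @ sigma_y).real       # matrix square root term
    return float(((mu_x - mu_y) ** 2).sum() + np.trace(sigma_x + sigma_y - 2 * covmean))

def psnr(img: np.ndarray, ref: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 20 * np.log10(max_val) - 10 * np.log10(mse)
```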


6. Results & Analysis

6.1. Perceptual Compression Tradeoffs

The authors first analyze how the choice of downsampling factor $f$ affects model performance.

  • Figure 6 shows the training progress on ImageNet for LDMs with different factors ($f \in \{1, 2, 4, 8, 16, 32\}$), where LDM-1 is a pixel-based DM.

    • Models with small factors (LDM-1, LDM-2) train slowly, as the diffusion model still has a heavy computational load.
    • Models with very large factors (LDM-32) train quickly but plateau at a lower quality, as the initial compression was too lossy.
    • LDM-4 and LDM-8 strike the best balance, achieving strong performance with efficient training.
  • Figure 7 compares sample throughput (speed) versus FID score (quality).

    • LDM-4 and LDM-8 consistently achieve better FID scores than the pixel-based LDM-1 while being significantly faster at generating samples. This empirically validates the core hypothesis of the paper.

      Figure 6: This graph shows that LDMs with moderate downsampling factors (LDM-4 to LDM-16) achieve better sample quality (lower FID) much faster than a pixel-based model (LDM-1). An overly aggressive factor (LDM-32) limits the final quality.

      Figure 7: This plot clearly shows the advantage of LDMs. LDM-4 and LDM-8 offer a much better tradeoff, achieving lower FID scores (better quality) at a much higher sampling throughput (faster generation) compared to the pixel-based LDM-1.

6.2. Image Generation with Latent Diffusion

  • Unconditional Generation: The results in Table 1 show that LDMs are highly competitive.
    • On CelebA-HQ, LDM-4 achieves a new SOTA FID of 5.11, outperforming both GANs and other likelihood-based models.
    • On other datasets, LDMs consistently achieve better Precision and Recall than GANs, confirming they capture the data distribution more fully.
    • Qualitative samples are shown in Figure 4.

(Manual Transcription of Table 1)

CelebA-HQ 256 × 256

Method | FID ↓ | Prec. ↑ | Recall ↑
DC-VAE [63] | 15.8 | - | -
VQGAN+T. [23] (k=400) | 10.2 | - | -
PGGAN [39] | 8.0 | - | -
LSGM [93] | 7.22 | - | -
UDM [43] | 7.16 | - | -
LDM-4 (ours, 500-s†) | 5.11 | 0.72 | 0.49

FFHQ 256 × 256

Method | FID ↓ | Prec. ↑ | Recall ↑
ImageBART [21] | 9.57 | - | -
U-Net GAN (+aug) [77] | 10.9 (7.6) | - | -
UDM [43] | 5.54 | - | -
StyleGAN [41] | 4.16 | 0.71 | 0.46
ProjectedGAN [76] | 3.08 | 0.65 | 0.46
LDM-4 (ours, 200-s) | 4.98 | 0.73 | 0.50

LSUN-Churches 256 × 256

Method | FID ↓ | Prec. ↑ | Recall ↑
DDPM [30] | 7.89 | - | -
ImageBART [21] | 7.32 | - | -
PGGAN [39] | 6.42 | - | -
StyleGAN [41] | 4.21 | - | -
StyleGAN2 [42] | 3.86 | - | -
ProjectedGAN [76] | 1.59 | 0.61 | 0.44
LDM-8* (ours, 200-s) | 4.02 | 0.64 | 0.52

LSUN-Bedrooms 256 × 256

Method | FID ↓ | Prec. ↑ | Recall ↑
ImageBART [21] | 5.51 | - | -
DDPM [30] | 4.9 | - | -
UDM [43] | 4.57 | - | -
StyleGAN [41] | 2.35 | 0.59 | 0.48
ADM [15] | 1.90 | 0.66 | 0.51
ProjectedGAN [76] | 1.52 | 0.61 | 0.34
LDM-4 (ours, 200-s) | 2.95 | 0.66 | 0.48

Figure 4: This figure shows high-quality samples generated by unconditional LDMs on various datasets, demonstrating their ability to synthesize diverse and detailed images of faces, churches, bedrooms, and general objects.

6.3. Conditional Latent Diffusion

This section showcases the flexibility of the cross-attention conditioning mechanism.

  • Text-to-Image Synthesis:
    • An LDM is trained on the LAION-400M dataset. Table 2 shows that when evaluated on MS-COCO, the guided LDM (LDM-KL-8-G) is on par with much larger models like GLIDE and Make-A-Scene (FID of 12.63 vs. 12.24 and 11.84, respectively) despite having significantly fewer parameters (1.45B vs. up to 6B).
    • Figure 5 shows impressive qualitative results for complex, user-defined text prompts.

(Manual Transcription of Table 2)


Text-Conditional Image Synthesis

Method | FID ↓ | IS ↑ | Nparams | Notes
CogView† [17] | 27.10 | 18.20 | 4B | self-ranking, rejection rate 0.017
LAFITE† [109] | 26.94 | 26.02 | 75M | -
GLIDE* [59] | 12.24 | - | 6B | 277 DDIM steps, c.f.g. [32] s = 3
Make-A-Scene* [26] | 11.84 | - | 4B | c.f.g. for AR models [98] s = 5
LDM-KL-8 | 23.31 | 20.03 ± 0.33 | 1.45B | 250 DDIM steps
LDM-KL-8-G* | 12.63 | 30.29 ± 0.42 | 1.45B | 250 DDIM steps, c.f.g. [32] s = 1.5

Figure 5: Samples generated from creative text prompts by the text-to-image LDM, showcasing its strong generalization capabilities.

  • Class-Conditional Synthesis: Table 3 shows results on ImageNet. The guided LDM-4-G outperforms the previous SOTA ADM-G on both FID (3.60 vs. 4.59) and IS (247.67 vs. 186.7), while using fewer parameters (400M vs. 608M) and far less compute.

(Manual Transcription of Table 3)

Method | FID ↓ | IS ↑ | Precision ↑ | Recall ↑ | Nparams | Notes
BigGAN-deep [3] | 6.95 | 203.6 ± 2.6 | 0.87 | 0.28 | 340M | -
ADM [15] | 10.94 | 100.98 | 0.69 | 0.63 | 554M | 250 DDIM steps
ADM-G [15] | 4.59 | 186.7 | 0.82 | 0.52 | 608M | 250 DDIM steps
LDM-4 (ours) | 10.56 | 103.49 ± 1.24 | 0.71 | 0.62 | 400M | 250 DDIM steps
LDM-4-G (ours) | 3.60 | 247.67 ± 5.59 | 0.87 | 0.48 | 400M | 250 steps, c.f.g. [32], s = 1.5
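The "c.f.g." entries in Tables 2 and 3 refer to classifier-free guidance [32], which mixes conditional and unconditional noise predictions at sampling time with a scale s. The sketch below assumes a `unet` that accepts an optional conditioning argument and a learned null conditioning; both are illustrative placeholders, not the paper's exact interface.

```python
import torch

def guided_eps(unet, zt, t, cond, null_cond, s: float = 1.5):
    """Classifier-free guidance: eps = eps_uncond + s * (eps_cond - eps_uncond)."""
    eps_cond = unet(zt, t, cond)          # prediction with the conditioning (e.g. text)
    eps_uncond = unet(zt, t, null_cond)   # prediction with a null/empty conditioning
    return eps_uncond + s * (eps_cond - eps_uncond)
```

With s = 1 this reduces to the conditional prediction; larger scales (e.g., s = 1.5 in Table 3) trade sample diversity for fidelity.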
  • Convolutional Sampling: Figure 9 demonstrates that an LDM trained for semantic synthesis on $256^2$ images can generate large, coherent $512 \times 1024$ images at inference time by applying the model convolutionally.

    Figure 9: This image shows a large landscape synthesized by an LDM. The model was trained on smaller images but can generalize to generate much higher-resolution images for spatially conditioned tasks, a key benefit of its convolutional architecture.

6.4. Super-Resolution with Latent Diffusion

  • LDMs are adapted for super-resolution by conditioning on low-resolution images. Table 5 shows LDM-SR is competitive with the specialized SR3 model, achieving a better FID score (4.8 vs. 5.2) with fewer parameters.
  • A user study (Table 4) confirms that LDM-4 generated images are preferred over a pixel-based DM baseline 70.6% of the time, validating its perceptual quality.

(Manual Transcription of Table 5)

Method | FID ↓ | IS ↑ | PSNR ↑ | SSIM ↑ | Nparams | Samples (*)
Image Regression [72] | 15.2 | 121.1 | 27.9 | 0.801 | 625M | N/A
SR3 [72] | 5.2 | 180.1 | 26.4 | 0.762 | 625M | N/A
LDM-4 (ours, 100 steps) | 2.8† / 4.8‡ | 166.3 | 24.4 ± 3.8 | 0.69 ± 0.14 | 169M | 4.62
LDM-4 (ours, big, 100 steps) | 2.4† / 4.3‡ | 174.9 | 24.7 ± 4.1 | 0.71 ± 0.15 | 552M | 4.5
LDM-4 (ours, 50 steps, guiding) | 4.4† / 6.4‡ | 153.7 | 25.8 ± 3.7 | 0.74 ± 0.12 | 184M | 0.38
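The super-resolution LDM conditions on the low-resolution image by concatenating it with the noisy latent as extra UNet input channels. The sketch below shows only this tensor plumbing; the resize step and channel counts are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def sr_unet_input(zt: torch.Tensor, low_res: torch.Tensor) -> torch.Tensor:
    """Concatenate the (resized) low-resolution conditioning image with the noisy latent z_t."""
    # Resize the low-res RGB image to the latent's spatial size (illustrative choice).
    cond = F.interpolate(low_res, size=zt.shape[-2:], mode="bilinear", align_corners=False)
    return torch.cat([zt, cond], dim=1)   # UNet then sees c_latent + 3 input channels

zt = torch.randn(1, 4, 64, 64)            # noisy latent
low_res = torch.randn(1, 3, 64, 64)       # 64x64 low-resolution input image
print(sr_unet_input(zt, low_res).shape)   # torch.Size([1, 7, 64, 64])
```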

6.5. Inpainting with Latent Diffusion

  • LDMs are highly efficient for inpainting, showing a 2.7x speedup in training throughput over a pixel-based DM (Table 6).
  • The main results in Table 7 show that a large, fine-tuned LDM sets a new SOTA on the Places dataset with an FID of 1.50, surpassing the specialized LaMa inpainting model.
  • The user study (Table 4) again shows a strong preference for LDM's results over LaMa (68.1% vs. 31.9%).

(Manual Transcription of Table 7)

Method | FID ↓ (40-50% masked) | LPIPS ↓ (40-50% masked) | FID ↓ (all samples) | LPIPS ↓ (all samples)
LDM-4 (ours, big, w/ ft) | 9.39 | 0.246 ± 0.042 | 1.50 | 0.137 ± 0.080
LDM-4 (ours, big, w/o ft) | 12.89 | 0.257 ± 0.047 | 2.40 | 0.142 ± 0.085
LDM-4 (ours, w/ attn) | 11.87 | 0.257 ± 0.042 | 2.15 | 0.144 ± 0.084
LDM-4 (ours, w/o attn) | 12.60 | 0.259 ± 0.041 | 2.37 | 0.145 ± 0.084
LaMa [88]† | 12.31 | 0.243 ± 0.038 | 2.23 | 0.134 ± 0.080
LaMa [88] | 12.0 | 0.24 | 2.21 | 0.14
CoModGAN [107] | 10.4 | 0.26 | 1.82 | 0.15
RegionWise [52] | 21.3 | 0.27 | 4.75 | 0.15
DeepFill v2 [104] | 22.1 | 0.28 | 5.20 | 0.16
EdgeConnect [58] | 30.5 | 0.28 | 8.37 | 0.16

Figure 11: Qualitative results for object removal using the best-performing inpainting LDM. The model successfully fills in the masked regions with plausible and coherent content.


7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces Latent Diffusion Models (LDMs), a simple yet powerful method for making high-resolution image synthesis with diffusion models computationally tractable. By performing the expensive generative process in a compressed latent space, LDMs drastically reduce training and inference costs. The addition of a flexible cross-attention conditioning mechanism transforms them into versatile tools for a wide range of tasks. The empirical results convincingly demonstrate that LDMs not only match but often exceed the performance of previous state-of-the-art models across multiple benchmarks, establishing a new and highly efficient paradigm for generative modeling.

  • Limitations & Future Work:

    • Inference Speed: While much faster than pixel-based DMs, LDM sampling is still a sequential process and slower than single-pass models like GANs.
    • Reconstruction Bottleneck: The final image quality is fundamentally limited by the reconstruction capability of the pretrained autoencoder. For tasks requiring extreme pixel-level precision, this could be a bottleneck.
  • Personal Insights & Critique:

    • Groundbreaking Impact: This paper is one of the most impactful generative AI papers of its time. The LDM architecture is the direct foundation for Stable Diffusion, the model that democratized high-resolution text-to-image generation for the public. The core idea of separating perceptual and semantic compression is elegant and has been profoundly validated by the subsequent explosion in LDM-based applications.
    • Design Elegance: The combination of a well-trained perceptual autoencoder, a UNet-based diffusion model, and a cross-attention conditioner is a masterclass in system design. It leverages the strengths of each component (perceptual realism from the GAN-trained autoencoder, powerful generative priors from DMs, and flexibility from attention) to create a whole that is greater than the sum of its parts.
    • Open Questions: The work relies on a fixed, pretrained autoencoder. Future research could explore jointly finetuning the autoencoder and LDM, or developing even more powerful perceptual compression models. The quality of the first stage directly gate-keeps the quality of the second, making it a critical component for further improvement. Overall, this work represents a pivotal moment in generative modeling, shifting the focus towards more efficient and accessible high-performance models.
