
Denoising Diffusion Probabilistic Models

Published: 06/20/2020
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents a novel denoising diffusion probabilistic model inspired by nonequilibrium thermodynamics, achieving high-quality image synthesis. By training on a weighted variational bound, it establishes a new connection with denoising score matching, attaining competitive sample quality, including an Inception score of 9.46 and a state-of-the-art FID of 3.17 on unconditional CIFAR10.

Abstract

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion


In-depth Reading


1. Bibliographic Information

1.1. Title

Denoising Diffusion Probabilistic Models

1.2. Authors

The paper lists Jonathan Ho (UC Berkeley), Ajay Jain (UC Berkeley), and Pieter Abbeel (UC Berkeley) as authors. Their affiliations suggest a strong background in machine learning and artificial intelligence research, particularly at a leading academic institution.

1.3. Journal/Conference

This paper was published at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada. NeurIPS is one of the most prestigious and influential conferences in the field of machine learning and computational neuroscience, indicating the high impact and quality of research presented at this venue.

1.4. Publication Year

2020

1.5. Abstract

The paper introduces high-quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by nonequilibrium thermodynamics. The core methodology involves training these models on a weighted variational bound, which is specifically designed based on a novel connection between diffusion models and denoising score matching with Langevin dynamics. A significant outcome is that these models naturally support a progressive lossy decompression scheme, interpretable as a generalization of autoregressive decoding. Experimentally, the models achieved an Inception score of 9.46 and a state-of-the-art FID score of 3.17 on the unconditional CIFAR10 dataset. For 256x256 LSUN datasets, the sample quality was comparable to ProgressiveGAN. The authors also made their implementation publicly available.

  • Abstract page: https://arxiv.org/abs/2006.11239 (the arXiv preprint, made publicly available alongside the NeurIPS 2020 publication).

  • PDF (version 2): https://arxiv.org/pdf/2006.11239v2.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the generation of high-quality synthetic images using a class of models known as diffusion probabilistic models (or diffusion models). While deep generative models like Generative Adversarial Networks (GANs), autoregressive models, flows, and variational autoencoders (VAEs) had already shown impressive results in various data modalities (images, audio), diffusion models, despite their theoretical elegance, had not yet demonstrated competitive sample quality. This presented a significant gap in the research landscape, as diffusion models offer advantages like straightforward definition and efficient training.

The importance of this problem lies in advancing the capabilities of generative AI. High-quality synthetic data has numerous applications, including data augmentation, content creation, conditional generation, and understanding data distributions. Overcoming the limitations of diffusion models in terms of sample quality would unlock a new, powerful tool for these tasks.

The paper's entry point or innovative idea is the demonstration that diffusion models are indeed capable of generating high-quality samples, often surpassing existing state-of-the-art methods in certain metrics. This is achieved through a novel parameterization of the model's reverse process and a carefully designed simplified training objective that emphasizes different aspects of reconstruction, leading to an equivalence with denoising score matching and annealed Langevin dynamics.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. High-Quality Sample Generation with Diffusion Models: It provides the first demonstration that diffusion models can generate images of exceptionally high quality, achieving state-of-the-art results on unconditional CIFAR10 (FID of 3.17) and comparable quality to ProgressiveGAN on 256x256 LSUN datasets. This fundamentally changes the perception of diffusion models from theoretically interesting but practically limited, to highly performant generative models.

  2. Novel Parameterization and Connection to Denoising Score Matching: The paper introduces a specific parameterization for the reverse process (specifically, predicting the noise epsilon) that reveals a novel equivalence between diffusion models and denoising score matching across multiple noise levels, coupled with annealed Langevin dynamics during sampling. This connection is key to achieving the improved sample quality.

  3. Simplified, Weighted Variational Bound Objective: A simplified training objective ($L_{\mathrm{simple}}$) is proposed. This objective, effectively a weighted variational bound, discards certain weighting factors of the standard Evidence Lower Bound (ELBO). This reweighting strategy is shown to significantly improve sample quality by focusing the model on more challenging denoising tasks at larger noise levels, despite potentially leading to worse log likelihoods.

  4. Progressive Lossy Decompression Scheme: The models naturally admit a progressive lossy decompression scheme. This can be interpreted as a generalization of autoregressive decoding, where images are generated or reconstructed progressively from coarse to fine details over time steps. This provides insights into the model's rate-distortion characteristics and its ability to learn conceptual compressions.

  5. Extensive Empirical Validation: The paper provides comprehensive experimental results on standard image synthesis benchmarks (CIFAR10, CelebA-HQ, LSUN) with detailed ablations on model parameterization and training objectives, solidifying the claims of improved sample quality and contributing to the understanding of diffusion model mechanisms.

    These findings solve the problem of achieving competitive generation quality with diffusion models, opening up a new avenue for research and application in generative AI, while also providing theoretical insights into their operation and connections to other established generative paradigms.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Denoising Diffusion Probabilistic Models (DDPMs), a reader needs familiarity with several core machine learning concepts:

  • Generative Models: These are models that learn the underlying distribution of a dataset and can then generate new samples that resemble the training data. Examples include GANs, VAEs, autoregressive models, and flows. DDPMs fall into this category.
  • Latent Variable Models: A class of generative models where the observed data x\mathbf{x} is assumed to be generated from a set of unobserved (latent) variables z\mathbf{z}. The model learns to map latent variables to data and vice-versa. DDPMs use a sequence of latent variables x1,,xT\mathbf{x}_1, \dots, \mathbf{x}_T to represent the noisy versions of the original data x0\mathbf{x}_0.
  • Markov Chain: A sequence of random variables where the probability of each variable depends only on the state of the previous variable, not on the entire history. In DDPMs, both the forward process (adding noise) and the reverse process (denoising) are modeled as Markov chains.
  • Variational Inference (VI): A technique for approximating intractable posterior distributions in Bayesian statistics. In generative models, it is often used to train latent variable models by optimizing a lower bound on the log-likelihood of the data, known as the Evidence Lower Bound (ELBO). DDPMs leverage VI to train their generative reverse process. The ELBO for a latent variable model $p_\theta(\mathbf{x})$ and an approximate posterior $q(\mathbf{z}|\mathbf{x})$ is: $ \log p_\theta(\mathbf{x}) \ge \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q(\mathbf{z}|\mathbf{x})} \right] = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} [\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q(\mathbf{z}|\mathbf{x}) \parallel p_\theta(\mathbf{z})) $ where:
    • $p_\theta(\mathbf{x})$ is the marginal likelihood of the data under the model (intractable to compute directly).
    • $p_\theta(\mathbf{x}, \mathbf{z})$ is the joint distribution of data and latent variables.
    • $q(\mathbf{z}|\mathbf{x})$ is the approximate posterior distribution of latent variables given data.
    • $D_{\mathrm{KL}}(q \parallel p)$ is the Kullback-Leibler (KL) divergence, a measure of how much a probability distribution $q$ diverges from a reference distribution $p$. It is defined as $ D_{\mathrm{KL}}(q(x) \parallel p(x)) = \sum_{x} q(x) \log \frac{q(x)}{p(x)} $ for discrete distributions, or $ D_{\mathrm{KL}}(q(x) \parallel p(x)) = \int_{-\infty}^{\infty} q(x) \log \frac{q(x)}{p(x)} \, dx $ for continuous distributions. A lower KL divergence means $q$ is a closer approximation of $p$.
  • Gaussian Noise: Random noise sampled from a Gaussian (normal) distribution. In DDPMs, Gaussian noise is gradually added to data in the forward process and removed in the reverse process. A Gaussian distribution is characterized by its mean (μ\mu) and variance (σ2\sigma^2), denoted as N(μ,σ2)\mathcal{N}(\mu, \sigma^2).
  • U-Net Architecture: A type of convolutional neural network initially developed for biomedical image segmentation. It features a symmetric encoder-decoder structure with "skip connections" that pass information from earlier encoder layers to corresponding decoder layers. This helps preserve fine-grained details during downsampling and upsampling operations. DDPMs commonly use a U-Net backbone for their noise prediction network.
  • Denoising Autoencoders: Neural networks trained to reconstruct an original input from a corrupted version of it (e.g., by adding noise). The objective is to learn robust representations by forcing the network to capture the essential features of the input. DDPMs draw a strong connection to denoising autoencoders, as their reverse process essentially denoises an image.
  • Score Matching: A technique for estimating the gradient of the log-probability density function of a data distribution, known as the score function xlogp(x)\nabla_{\mathbf{x}} \log p(\mathbf{x}). It avoids estimating the density explicitly. Denoising Score Matching trains a model to predict the score function by minimizing the difference between the model's score and the score of a noisy version of the data.
  • Langevin Dynamics: A continuous-time stochastic process often used in physics and statistics to sample from a probability distribution. It involves iteratively taking steps in the direction of the score function (gradient of the log-probability) plus some random noise. Annealed Langevin Dynamics applies this process over a sequence of noise scales, gradually reducing the noise to sample from complex distributions.
  • Inception Score (IS): A metric used to evaluate the quality of images generated by GANs and other generative models. It measures both the fidelity (quality) and diversity of generated samples; a higher Inception Score indicates better quality and diversity. It is calculated by passing generated images through a pre-trained Inception v3 network to obtain class probabilities. The IS is the exponentiated Kullback-Leibler (KL) divergence between the conditional class distribution of generated images $p(y|\mathbf{x})$ and the marginal class distribution p(y): $ \mathrm{IS}(G) = \exp \left( \mathbb{E}_{\mathbf{x} \sim p_g} \left[ D_{\mathrm{KL}}(p(y|\mathbf{x}) \parallel p(y)) \right] \right) $ where:
    • pgp_g is the distribution of generated images.
    • p(yx)p(y|\mathbf{x}) is the conditional class distribution given a generated image x\mathbf{x}.
    • p(y) is the marginal class distribution. A higher score is better.
  • Fréchet Inception Distance (FID): Another popular metric for evaluating the quality of generated images, often considered more robust than IS. FID measures the Fréchet distance between two Gaussian distributions fitted to the feature representations of real and generated images. These features are typically extracted from an intermediate layer of a pre-trained Inception v3 network. A lower FID score indicates better quality and similarity between real and generated image distributions. $ \mathrm{FID}(\mathbf{x}, \mathbf{g}) = ||\mu_{\mathbf{x}} - \mu_{\mathbf{g}}||^2 + \mathrm{Tr}(\Sigma_{\mathbf{x}} + \Sigma_{\mathbf{g}} - 2(\Sigma_{\mathbf{x}}\Sigma_{\mathbf{g}})^{1/2}) $ where:
    • μx\mu_{\mathbf{x}} and Σx\Sigma_{\mathbf{x}} are the mean and covariance of the feature vectors for real images.
    • μg\mu_{\mathbf{g}} and Σg\Sigma_{\mathbf{g}} are the mean and covariance of the feature vectors for generated images.
    • 2||\cdot||^2 is the squared Euclidean distance.
    • Tr()\mathrm{Tr}(\cdot) is the trace of a matrix. A lower score is better.
  • Negative Log Likelihood (NLL): A common metric for evaluating probabilistic models. It measures how well a model fits the observed data. For a given data point x\mathbf{x}, the log-likelihood is logp(x)\log p(\mathbf{x}). Minimizing the NLL (which is equivalent to maximizing the log-likelihood) means the model assigns higher probabilities to the observed data points. It is often reported in bits/dim (bits per dimension), which normalizes the NLL by the dimensionality of the data, making it comparable across different datasets or models.
  • Root Mean Squared Error (RMSE): A frequently used measure of the differences between values predicted by a model or an estimator and the values observed. It is the square root of the mean of the squared errors. For images, it quantifies the difference between generated and true pixel values. $ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} $ where:
    • NN is the number of observations (e.g., pixels).
    • yiy_i is the actual value.
    • y^i\hat{y}_i is the predicted value. A lower RMSE indicates better reconstruction quality.
  • Bits per dimension (bits/dim): A unit used to measure the codelength or compression efficiency of a probabilistic model, often for Negative Log Likelihood (NLL). It represents the average number of bits required to encode each dimension (e.g., each pixel channel in an image) of the data using the model's learned probability distribution. Lower bits/dim indicate more efficient encoding and a better model fit to the data distribution.

3.2. Previous Works

The paper contextualizes its contributions by referencing various established deep generative models:

  • Generative Adversarial Networks (GANs) [14, 27, 3]: GANs consist of a generator and a discriminator network that are trained in an adversarial manner. The generator tries to produce realistic samples, while the discriminator tries to distinguish between real and fake samples. Early GANs faced challenges with training stability and mode collapse, but ProgressiveGAN [27] and BigGAN [3] significantly improved sample quality and training stability for high-resolution images. StyleGAN2 [29] further advanced this, achieving state-of-the-art results.
  • Autoregressive Models [58, 38, 25]: These models generate data elements sequentially, where each element's generation is conditioned on the previously generated elements. Examples include PixelCNN [58] and PixelCNN++ [52], which are highly effective for image generation and density estimation. They are known for competitive log-likelihoods but can be slow for sampling. The paper draws a connection between its progressive decoding and a generalization of autoregressive decoding.
  • Flow-based Models [9, 10, 32]: Also known as Normalizing Flows, these models explicitly learn an invertible mapping from a simple base distribution (e.g., Gaussian) to the complex data distribution. This allows for exact log-likelihood computation and efficient sampling/inference. Real NVP [10] and Glow [32] are prominent examples.
  • Variational Autoencoders (VAEs) [33, 37]: VAEs learn a probabilistic mapping from data to a lower-dimensional latent space and back. They optimize the ELBO to learn both an encoder and a decoder. VAEs provide good latent space representations and are easier to train than GANs, but often produce blurrier samples.
  • Energy-Based Models (EBMs) and Score Matching [11, 55]: EBMs define a probability distribution using an energy function, where lower energy corresponds to higher probability. Score matching is a method to train EBMs or directly estimate the gradient of the log-density (the score function). NCSN [55] and NCSNv2 [56] are notable score-based generative models that achieve high sample quality comparable to GANs by estimating the score functions of data distributions perturbed by various noise levels and using annealed Langevin dynamics for sampling. The paper explicitly states a novel connection to denoising score matching.
  • Diffusion Probabilistic Models (Original) [53]: The foundational work by Sohl-Dickstein et al. (2015) introduced the concept of diffusion models, drawing inspiration from nonequilibrium thermodynamics. While theoretically sound, this early work did not demonstrate the high sample quality achieved by the current paper. This paper builds directly on that foundation, proving the practical viability of the approach.

3.3. Technological Evolution

The field of deep generative models has seen rapid evolution:

  1. Early 2010s: VAEs [33] emerged, providing a principled framework for latent variable modeling with variational inference. While theoretically elegant, initial VAEs often produced blurry samples.

  2. Mid-2010s: GANs [14] revolutionized image generation with their adversarial training paradigm, capable of producing remarkably realistic images. However, they were notoriously difficult to train, prone to mode collapse, and lacked direct likelihood estimation.

  3. Late 2010s: Improvements in GANs (e.g., ProgressiveGAN [27], BigGAN [3], StyleGAN [28], StyleGAN2 [30]) dramatically improved sample quality, stability, and control. Simultaneously, autoregressive models (PixelCNN++ [52]) achieved impressive likelihoods but suffered from slow sampling, and flow-based models (Real NVP [10], Glow [32]) offered exact likelihoods and invertible mappings.

  4. Very Late 2010s/Early 2020s: Score-based generative models (NCSN [55], NCSNv2 [56]) demonstrated sample quality competitive with GANs by modeling the score function and using Langevin dynamics.

    This paper, Denoising Diffusion Probabilistic Models (DDPMs), fits into this timeline by taking the theoretically grounded diffusion models from 2015 and, through significant algorithmic and architectural innovations (especially the epsilon-prediction parameterization and simplified objective), elevates them to a state-of-the-art position in image generation. It shows that diffusion models, previously overlooked for sample quality, can now rival or exceed the performance of leading GANs and score-based models, while retaining advantages like stable training and principled likelihood estimation.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of DDPMs are:

  • Compared to Original Diffusion Models [53]: This paper is a direct advancement, demonstrating for the first time that diffusion models can achieve high sample quality, which was not shown in the original work. The key innovations are the epsilon-prediction parameterization of the reverse process and the simplified training objective.
  • Compared to GANs (e.g., ProgressiveGAN, StyleGAN2):
    • Training Stability: DDPMs are generally more stable to train than GANs, avoiding issues like mode collapse or tricky adversarial dynamics.
    • Likelihood Estimation: Unlike GANs, DDPMs are likelihood-based models, allowing for explicit calculation of the variational bound on the log-likelihood (though the simplified objective often yields better samples at the cost of less competitive likelihoods).
    • Sample Quality: This paper shows DDPMs achieving competitive and even state-of-the-art FID scores, rivaling the best GANs.
  • Compared to Autoregressive Models (e.g., PixelCNN++):
    • Sampling Speed: DDPMs, while iterative, can potentially be faster for sampling high-resolution images than strictly autoregressive models that generate pixel-by-pixel.
    • Inductive Bias: The Gaussian diffusion process introduces a specific inductive bias that the authors argue is highly suitable for image data, potentially more natural than the masking noise often used in autoregressive models. The progressive decoding mechanism is a generalization of autoregressive decoding.
    • Log-Likelihood: While DDPMs provide a likelihood, autoregressive models typically achieve much better (lower) lossless codelengths. This paper notes that DDPMs excel as lossy compressors.
  • Compared to Flow-based Models:
    • Explicit Likelihood: Both offer explicit likelihoods.
    • Flexibility: Flows require carefully designed invertible architectures, which can be restrictive. Diffusion models allow for more flexible neural network architectures.
  • Compared to Score-based Generative Models (NCSN, NCSNv2):
    • Rigorous Sampler Derivation: A key distinction highlighted by the authors is that their Langevin-like sampler coefficients are rigorously derived from the forward process's βt\beta_t schedule and optimized directly via variational inference. In contrast, NCSN's sampler coefficients are often set heuristically post-hoc, and their training does not directly optimize a quality metric of the sampler in the same way. This means DDPMs train the sampler as an integral part of a latent variable model.

    • Forward Process Design: DDPMs use a forward process that scales down data to maintain consistent input scales for the network and ensures the final latent xTx_T has near-zero mutual information with x0x_0, unlike NCSN which omits this scaling.

    • Reversibility: The small βt\beta_t values in DDPMs ensure the forward process is approximately reversible by a Markov chain with conditional Gaussians, preventing distribution shift during sampling, which is a key theoretical underpinning for their approach.

      In essence, DDPMs combine the principled variational inference framework of latent variable models with the powerful denoising capabilities found in score-based models, all while demonstrating GAN-level sample quality and introducing a novel progressive decoding interpretation.

4. Methodology

4.1. Principles

The core idea of Denoising Diffusion Probabilistic Models (DDPMs) is inspired by statistical physics, specifically nonequilibrium thermodynamics. The method involves two main processes:

  1. Forward Diffusion Process (or Diffusion Process): This is a fixed, predefined Markov chain that gradually adds Gaussian noise to the data over TT timesteps. Starting from an original data sample x0\mathbf{x}_0, it generates a sequence of noisy samples x1,x2,,xT\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T, where xT\mathbf{x}_T is almost pure noise (a standard normal distribution). This process is denoted by q(x1:Tx0)q(\mathbf{x}_{1:T}|\mathbf{x}_0).

  2. Reverse Process: This is a learned Markov chain that aims to reverse the diffusion process. Starting from pure Gaussian noise xTN(0,I)\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), it gradually denoises the samples over TT timesteps to generate an clean data sample x0\mathbf{x}_0. This process is denoted by pθ(x0:T)p_\theta(\mathbf{x}_{0:T}), where θ\theta represents the parameters of the neural network that learns the denoising steps.

    The model is trained to make the learned reverse process pθp_\theta approximate the true (but intractable) reverse of the forward process q(xt1xt,x0)q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0). This training is performed by optimizing a variational bound on the negative log-likelihood of the data. The key insight is that if the noise added in each forward step is small, the reverse process transitions can also be approximated as Gaussian, simplifying the learning problem to predicting the mean and variance of these Gaussian transitions. The paper further refines this by showing a strong connection to denoising score matching and Langevin dynamics.

4.2. Core Methodology In-depth (Layer by Layer)

DDPMs are latent variable models defined by a joint distribution pθ(x0:T)p_\theta(\mathbf{x}_{0:T}) that factorizes into a Markov chain for the reverse process, starting from a standard normal prior p(xT)p(\mathbf{x}_T). The forward process q(x1:Tx0)q(\mathbf{x}_{1:T}|\mathbf{x}_0) is a fixed Markov chain that gradually adds Gaussian noise.

4.2.1. Overall Model Structure

The generative process, pθ(x0)p_\theta(\mathbf{x}_0), is defined by integrating out the latent variables x1:T\mathbf{x}_{1:T}: $ p _ { \theta } ( \mathbf { x } _ { 0 } ) : = \int p _ { \theta } ( \mathbf { x } _ { 0 : T } ) d \mathbf { x } _ { 1 : T } $ where x1,,xT\mathbf{x}_1, \ldots, \mathbf{x}_T are latents (unobserved variables) of the same dimensionality as the data x0\mathbf{x}_0.

4.2.2. Reverse Process

The joint distribution pθ(x0:T)p_\theta(\mathbf{x}_{0:T}), also known as the reverse process, is a Markov chain with learned Gaussian transitions. It starts from a simple prior distribution p(xT)=N(xT;0,I)p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I}) (a standard normal distribution), meaning that at the final timestep TT, the latent variable xT\mathbf{x}_T is pure Gaussian noise. The transitions from xt\mathbf{x}_t to xt1\mathbf{x}_{t-1} are learned:

$ p _ { \theta } ( \mathbf { x } _ { 0 : T } ) : = p ( \mathbf { x } _ { T } ) \prod _ { t = 1 } ^ { T } p _ { \theta } ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } ) $ Here, the conditional probability pθ(xt1xt)p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) is a Gaussian distribution: $ p _ { \theta } ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } ) : = \mathcal { N } ( \mathbf { x } _ { t - 1 } ; \mu _ { \theta } ( \mathbf { x } _ { t } , t ) , \pmb { \Sigma } _ { \theta } ( \mathbf { x } _ { t } , t ) ) $ where:

  • xt1\mathbf{x}_{t-1} is the denoised sample at timestep t-1.
  • xt\mathbf{x}_t is the noisy sample at timestep tt.
  • μθ(xt,t)\mu_\theta(\mathbf{x}_t, t) is the mean of the Gaussian distribution, predicted by a neural network parameterized by θ\theta. This network takes the noisy sample xt\mathbf{x}_t and the current timestep tt as input.
  • Σθ(xt,t)\pmb{\Sigma}_\theta(\mathbf{x}_t, t) is the covariance matrix of the Gaussian distribution, also potentially predicted by a neural network. In this paper, it's often fixed or simplified.
  • The notation N(variable;mean,covariance)\mathcal{N}(\text{variable}; \text{mean}, \text{covariance}) indicates a Gaussian distribution for variable with the given mean and covariance.

4.2.3. Forward Process

What distinguishes diffusion models is that the approximate posterior q(x1:Tx0)q(\mathbf{x}_{1:T}|\mathbf{x}_0), also called the forward process or diffusion process, is fixed (not learned) to a Markov chain that gradually adds Gaussian noise to the data. This process is defined by a variance schedule β1,,βT\beta_1, \ldots, \beta_T, where βt\beta_t dictates the amount of noise added at each step tt.

$ q ( \mathbf { x } _ { 1 : T } | \mathbf { x } _ { 0 } ) : = \prod _ { t = 1 } ^ { T } q ( \mathbf { x } _ { t } | \mathbf { x } _ { t - 1 } ) $ Each step of the forward process adds noise: $ q ( \mathbf { x } _ { t } | \mathbf { x } _ { t - 1 } ) : = \mathcal { N } ( \mathbf { x } _ { t } ; \sqrt { 1 - \beta _ { t } } \mathbf { x } _ { t - 1 } , \beta _ { t } \mathbf { I } ) $ where:

  • xt\mathbf{x}_t is the noisy sample at timestep tt.

  • xt1\mathbf{x}_{t-1} is the sample from the previous timestep.

  • 1βt\sqrt{1 - \beta_t} is a scaling factor that reduces the signal from xt1\mathbf{x}_{t-1} as noise is added.

  • βtI\beta_t \mathbf{I} is the covariance matrix of the Gaussian noise added. βt\beta_t is a predefined variance, and I\mathbf{I} is the identity matrix, meaning the noise is isotropic (equal variance in all dimensions).

    A notable property of the forward process is that it allows sampling $\mathbf{x}_t$ at an arbitrary timestep $t$ in closed form, given $\mathbf{x}_0$. This is crucial for efficient training. By defining $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$, we can directly sample $\mathbf{x}_t$ from $\mathbf{x}_0$: $ q ( \mathbf { x } _ { t } | \mathbf { x } _ { 0 } ) = \mathcal { N } ( \mathbf { x } _ { t } ; \sqrt { \bar { \alpha } _ { t } } \mathbf { x } _ { 0 } , ( 1 - \bar { \alpha } _ { t } ) \mathbf { I } ) $ This means $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. This reparameterization is fundamental for training.
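A minimal sketch of this closed-form sampling, using NumPy and assuming the linear $\beta_t$ schedule described in Section 4.2.5 below; the helper name `q_sample` is ours, not from the paper's released code:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule (see Section 4.2.5)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng=np.random):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed as in the paper."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps                       # eps is the regression target for eps_theta

# Usage: x0 is an image scaled to [-1, 1]
x0 = np.random.uniform(-1.0, 1.0, size=(32, 32, 3))
xt, eps = q_sample(x0, t=500)
```

Because any $\mathbf{x}_t$ can be drawn in one step, training can sample a random timestep per example instead of simulating the whole chain.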

4.2.4. Training Objective: Variational Bound

Training is performed by optimizing the usual variational bound on the negative log-likelihood. The general form of the Evidence Lower Bound (ELBO) for this model is: $ \mathbb { E } \left[ - \log p _ { \theta } ( \mathbf { x } _ { 0 } ) \right] \leq \mathbb { E } _ { q } \left[ - \log \frac { p _ { \theta } ( \mathbf { x } _ { 0 : T } ) } { q ( \mathbf { x } _ { 1 : T } | \mathbf { x } _ { 0 } ) } \right] = \mathbb { E } _ { q } \left[ - \log p ( \mathbf { x } _ { T } ) - \sum _ { t \geq 1 } \log \frac { p _ { \theta } ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } ) } { q ( \mathbf { x } _ { t } | \mathbf { x } _ { t - 1 } ) } \right] = : L $ where:

  • Eq[]\mathbb{E}_q[\cdot] denotes the expectation over the forward process distribution q(x0:T)q(\mathbf{x}_{0:T}).

  • The goal is to minimize this variational upper bound LL on the negative log-likelihood.

    For variance reduction, this bound can be rewritten using Kullback-Leibler (KL) divergence terms, which directly compare the learned reverse transitions with the (tractable) forward process posterior distributions. $ L = \mathbb { E } _ { q } \left[ \underbrace { D _ { \mathrm { K L } } ( q ( \mathbf { x } _ { T } | \mathbf { x } _ { 0 } ) \parallel p ( \mathbf { x } _ { T } ) ) } _ { L _ { T } } + \sum _ { t > 1 } \underbrace { D _ { \mathrm { K L } } ( q ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } , \mathbf { x } _ { 0 } ) \parallel p _ { \theta } ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } ) ) } _ { L _ { t - 1 } } \underbrace { - \log p _ { \theta } ( \mathbf { x } _ { 0 } | \mathbf { x } _ { 1 } ) } _ { L _ { 0 } } \right] $ Here:

  • LTL_T: This term measures how close the final latent q(xTx0)q(\mathbf{x}_T|\mathbf{x}_0) is to the prior p(xT)p(\mathbf{x}_T) (a standard normal). It encourages the forward process to completely destroy the signal from x0\mathbf{x}_0 by timestep TT.

  • Lt1L_{t-1}: These terms (for t>1t > 1) measure the difference between the true posterior of the forward process q(xt1xt,x0)q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) (which is tractable) and the learned reverse process transition pθ(xt1xt)p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t). This is where the neural network primarily learns to reverse the noise process.

  • L0L_0: This term represents the reconstruction likelihood of the data x0\mathbf{x}_0 given the first denoised latent x1\mathbf{x}_1. It is a negative log-likelihood term for the first step of the reverse process.

    The forward process posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is tractable and also Gaussian: $ q ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } , \mathbf { x } _ { 0 } ) = \mathcal { N } ( \mathbf { x } _ { t - 1 } ; \widetilde { \mu } _ { t } ( \mathbf { x } _ { t } , \mathbf { x } _ { 0 } ) , \widetilde { \beta } _ { t } \mathbf { I } ) $ where its mean $\widetilde{\mu}_t$ and variance $\widetilde{\beta}_t$ are given by: $ \widetilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) := \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t} \left( 1 - \bar{\alpha}_{t-1} \right)}{1 - \bar{\alpha}_t} \mathbf{x}_t \quad \text{and} \quad \widetilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t $ Since all KL divergences are between Gaussians, they can be computed in closed form, making training efficient.
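Since both distributions in each $L_{t-1}$ term are isotropic Gaussians, the KL divergence has a simple closed form. The helper below is our own illustration (not code from the paper), written for the isotropic case used here:

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ), summed over all dimensions.

    mu_q, mu_p: mean vectors (flattened); var_q, var_p: scalar variances.
    """
    d = mu_q.size
    return 0.5 * (d * np.log(var_p / var_q)
                  + d * var_q / var_p
                  + np.sum((mu_p - mu_q) ** 2) / var_p
                  - d)
```

When the reverse-process variance is fixed to $\sigma_t^2$, the $\theta$-dependent part of this expression reduces to $\|\widetilde{\mu}_t - \mu_\theta\|^2 / (2\sigma_t^2)$, which is exactly the form that appears in Section 4.2.6.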

4.2.5. Design Choices for Forward Process and LTL_T

The paper fixes the forward process variances $\beta_t$ to constants, meaning they are not learned parameters. This simplifies the model, as the approximate posterior $q$ has no learnable parameters. Consequently, the term $L_T$ is a constant during training and can be ignored, since it does not depend on $\theta$. The chosen schedule for $\beta_t$ is linear, increasing from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps. This choice ensures that each noise increment is small enough for the Gaussian approximation to hold and that the signal is effectively destroyed by step $T$.
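A quick numerical check, under the same schedule, that the signal is indeed destroyed by step $T$ (so that $q(\mathbf{x}_T|\mathbf{x}_0)$ is essentially the standard normal prior and $L_T$ is negligible):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar_T = np.prod(1.0 - betas)
print(alpha_bar_T)   # roughly 4e-5, so sqrt(alpha_bar_T) * x_0 contributes almost nothing to x_T
```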

4.2.6. Design Choices for Reverse Process and L1:T1L_{1:T-1}

For the reverse process pθ(xt1xt)=N(xt1;μθ(xt,t),Σθ(xt,t))p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \pmb{\Sigma}_\theta(\mathbf{x}_t, t)), two main choices are made:

  1. Fixed Covariance Matrix: The covariance matrix Σθ(xt,t)\pmb{\Sigma}_\theta(\mathbf{x}_t, t) is set to untrained, time-dependent constants. Two options were explored: σt2=βtI\sigma_t^2 = \beta_t \mathbf{I} (the variance of the forward process) and σt2=β~tI\sigma_t^2 = \tilde{\beta}_t \mathbf{I} (the variance of the forward process posterior). The paper found better sample quality with σt2=β~tI\sigma_t^2 = \tilde{\beta}_t \mathbf{I}.

  2. Mean Parameterization (Epsilon-Prediction): The most straightforward way to parameterize the mean $\mu_\theta(\mathbf{x}_t, t)$ would be to directly predict $\widetilde{\mu}_t$ (the mean of the forward process posterior). The $L_{t-1}$ term can be simplified to: $ L _ { t - 1 } = \mathbb { E } _ { q } \bigg [ \frac { 1 } { 2 \sigma _ { t } ^ { 2 } } \left\| \tilde { \pmb { \mu } } _ { t } ( \mathbf { x } _ { t } , \mathbf { x } _ { 0 } ) - \pmb { \mu } _ { \theta } ( \mathbf { x } _ { t } , t ) \right\| ^ { 2 } \bigg ] + C $ where $C$ is a constant that does not depend on $\theta$. This shows that minimizing $L_{t-1}$ is equivalent to making $\mu_\theta$ predict $\widetilde{\mu}_t$.

    However, the paper proposes a novel parameterization. By reparameterizing xt\mathbf{x}_t from q(xtx0)q(\mathbf{x}_t|\mathbf{x}_0) as xt=αˉtx0+1αˉtϵ\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon} (where ϵN(0,I)\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})), and substituting this into the expression for μ~t\widetilde{\mu}_t, it can be shown that μ~t\widetilde{\mu}_t can be expressed in terms of xt\mathbf{x}_t and ϵ\boldsymbol{\epsilon}. This leads to the insight that the model μθ\mu_\theta can be designed to predict ϵ\boldsymbol{\epsilon}.

    The proposed epsilon-prediction parameterization for the mean μθ(xt,t)\mu_\theta(\mathbf{x}_t, t) is: $ \mu _ { \theta } ( \mathbf { x } _ { t } , t ) = \tilde { \mu } _ { t } \bigg ( \mathbf { x } _ { t } , \frac { 1 } { \sqrt { \bar { \alpha } _ { t } } } ( \mathbf { x } _ { t } - \sqrt { 1 - \bar { \alpha } _ { t } } \epsilon _ { \theta } ( \mathbf { x } _ { t } ) ) \bigg ) = \frac { 1 } { \sqrt { \alpha _ { t } } } \left( \mathbf { x } _ { t } - \frac { \beta _ { t } } { \sqrt { 1 - \bar { \alpha } _ { t } } } \epsilon _ { \theta } ( \mathbf { x } _ { t } , t ) \right) $ where ϵθ(xt,t)\epsilon_\theta(\mathbf{x}_t, t) is a neural network (function approximator) trained to predict the noise ϵ\boldsymbol{\epsilon} that was added to x0\mathbf{x}_0 to get xt\mathbf{x}_t. The network ϵθ\epsilon_\theta takes xt\mathbf{x}_t and the timestep tt as input.

    This parameterization has two significant implications:

    • Connection to Langevin Dynamics: To sample xt1\mathbf{x}_{t-1} from pθ(xt1xt)p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t), one uses the mean μθ(xt,t)\mu_\theta(\mathbf{x}_t, t) and adds noise: xt1=μθ(xt,t)+σtz\mathbf{x}_{t-1} = \mu_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}, where zN(0,I)\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). Substituting the epsilon-prediction formula for μθ\mu_\theta yields: $ \mathbf { x } _ { t - 1 } = \frac { 1 } { \sqrt { \alpha _ { t } } } \left( \mathbf { x } _ { t } - \frac { 1 - \alpha _ { t } } { \sqrt { 1 - \bar { \alpha } _ { t } } } \mathbf { \epsilon } _ { \theta } ( \mathbf { x } _ { t } , t ) \right) + \sigma _ { t } \mathbf { z } $ This equation strongly resembles Langevin dynamics, where ϵθ\epsilon_\theta acts as a learned gradient of the data density (or score function).

    • Connection to Denoising Score Matching: With this parameterization, the term $L_{t-1}$ simplifies to: $ \mathbb { E } _ { \mathbf { x } _ { 0 } , \epsilon } \left[ \frac { \beta _ { t } ^ { 2 } } { 2 \sigma _ { t } ^ { 2 } \alpha _ { t } ( 1 - \bar { \alpha } _ { t } ) } \left\| \epsilon - \epsilon _ { \theta } ( \sqrt { \bar { \alpha } _ { t } } \mathbf { x } _ { 0 } + \sqrt { 1 - \bar { \alpha } _ { t } } \epsilon , t ) \right\| ^ { 2 } \right] $ This resembles a denoising score matching objective, where the model $\epsilon_\theta$ is trained to predict the noise $\boldsymbol{\epsilon}$ given a noisy version of $\mathbf{x}_0$. The argument $\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$ is precisely $\mathbf{x}_t$. This confirms that optimizing the variational bound under this epsilon-prediction parameterization is equivalent to training a denoising score matching model.

The training algorithm (Algorithm 1) focuses on this simplified objective:

Algorithm 1 Training
1: repeat
2:   $\mathbf{x}_0 \sim q(\mathbf{x}_0)$  (sample a data point)
3:   $t \sim \mathrm{Uniform}(\{1, \dots, T\})$  (sample a random timestep)
4:   $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$  (sample Gaussian noise)
5:   Take a gradient descent step on $ \nabla_\theta \left\| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t) \right\|^2 $  (optimize the network $\epsilon_\theta$ to predict the noise $\epsilon$ from $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$)
6: until converged
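In code, each training step of Algorithm 1 is a single noise-prediction regression. The following PyTorch sketch is ours; `eps_model` stands in for the U-Net $\epsilon_\theta(\mathbf{x}_t, t)$ and is an assumed interface, not the paper's exact architecture:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_step(eps_model, optimizer, x0):
    """One step of Algorithm 1. x0: batch of images scaled to [-1, 1], shape (B, C, H, W)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # uniform timestep (0-indexed here)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # closed-form forward sample
    loss = ((eps - eps_model(xt, t)) ** 2).mean()        # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```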

The sampling algorithm (Algorithm 2) leverages the learned ϵθ\epsilon_\theta to iteratively denoise.

Algorithm 2 Sampling
1: $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$  (start from pure Gaussian noise)
2: for $t = T, \dots, 1$ do  (iterate backward through timesteps)
3:   $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else $\mathbf{z} = \mathbf{0}$  (add fresh noise at every step except the last)
4:   $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}$  (denoise $\mathbf{x}_t$ to obtain $\mathbf{x}_{t-1}$ using the predicted noise $\epsilon_\theta$)
5: end for
6: return $\mathbf{x}_0$  (the final generated image)
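A matching sketch of Algorithm 2, again with the hypothetical `eps_model` and using the fixed reverse variance $\sigma_t^2 = \beta_t$ (one of the two options the paper considers):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample(eps_model, shape):
    """Algorithm 2: start from pure noise and iteratively denoise down to x_0."""
    x = torch.randn(shape)                                 # x_T ~ N(0, I)
    for i in reversed(range(T)):                           # i = T-1, ..., 0 (0-indexed timestep)
        t = torch.full((shape[0],), i, dtype=torch.long)
        alpha, a_bar = 1.0 - betas[i], alpha_bars[i]
        mean = (x - (1.0 - alpha) / (1.0 - a_bar).sqrt() * eps_model(x, t)) / alpha.sqrt()
        z = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + betas[i].sqrt() * z                      # sigma_t^2 = beta_t
    return x
```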

4.2.7. Data Scaling, Reverse Process Decoder, and L0L_0

  • Data Scaling: Image data, originally integers in {0,1,,255}\{0, 1, \ldots, 255\}, are scaled linearly to [1,1][-1, 1]. This normalization helps the neural network operate on consistently scaled inputs, starting from the standard normal prior p(xT)p(\mathbf{x}_T).
  • Reverse Process Decoder (L0L_0): To obtain discrete log-likelihoods for the final step (reconstructing x0\mathbf{x}_0 from x1\mathbf{x}_1), the last term of the reverse process pθ(x0x1)p_\theta(\mathbf{x}_0|\mathbf{x}_1) is set to an independent discrete decoder derived from the Gaussian N(x0;μθ(x1,1),σ12I)\mathcal{N}(\mathbf{x}_0; \mu_\theta(\mathbf{x}_1, 1), \sigma_1^2 \mathbf{I}). This means for each coordinate ii of the data, the probability is calculated by integrating the Gaussian PDF over the bin representing the discrete pixel value: $ p _ { \theta } ( \mathbf { x } _ { 0 } | \mathbf { x } _ { 1 } ) = \prod _ { i = 1 } ^ { D } \int _ { \delta _ { - } ( x _ { 0 } ^ { i } ) } ^ { \delta _ { + } ( x _ { 0 } ^ { i } ) } { \mathcal { N } } ( x ; \mu _ { \theta } ^ { i } ( \mathbf { x } _ { 1 } , 1 ) , \sigma _ { 1 } ^ { 2 } ) d x $ where:
    • DD is the data dimensionality (e.g., number of pixels ×\times channels).
    • x0ix_0^i is the ii-th coordinate of the original data x0\mathbf{x}_0.
    • μθi(x1,1)\mu_\theta^i(\mathbf{x}_1, 1) is the ii-th coordinate of the mean predicted by the network at timestep 1.
    • σ12\sigma_1^2 is the variance at timestep 1.
    • $\delta_-(x)$ and $\delta_+(x)$ define the integration bounds for the discrete pixel value $x_0^i$: $ \delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases} \qquad \delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases} $ These bounds are for data scaled to $[-1, 1]$, where $x = 1$ corresponds to a pixel value of 255 and $x = -1$ to 0. The half-width $1/255$ is half the spacing ($2/255$) between adjacent scaled pixel values, so each integral covers exactly one quantization bin. This construction ensures the variational bound is a lossless codelength of discrete data. At the end of sampling, $\mu_\theta(\mathbf{x}_1, 1)$ is used directly as the generated pixel values, without added noise.
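As an illustration of how this decoder can be evaluated in practice, the sketch below integrates the Gaussian over each pixel's quantization bin using the standard normal CDF; it is our own simplified version (real implementations typically add numerically stable log-CDF handling):

```python
import torch

def discrete_gaussian_log_likelihood(x0, mean, log_scale):
    """Per-pixel log p(x_0 | x_1) for the discrete decoder.

    x0: target image scaled to [-1, 1], values on the grid 2k/255 - 1.
    mean: predicted mu_theta(x_1, 1), same shape as x0.
    log_scale: log sigma_1 (scalar tensor).
    """
    std_normal = torch.distributions.Normal(torch.zeros_like(mean), torch.ones_like(mean))
    centered = x0 - mean
    inv_std = torch.exp(-log_scale)
    cdf_plus = std_normal.cdf((centered + 1.0 / 255) * inv_std)
    cdf_minus = std_normal.cdf((centered - 1.0 / 255) * inv_std)
    # Edge bins extend to +/- infinity, matching delta_+ and delta_- above.
    cdf_plus = torch.where(x0 >= 1.0 - 1e-3, torch.ones_like(cdf_plus), cdf_plus)
    cdf_minus = torch.where(x0 <= -1.0 + 1e-3, torch.zeros_like(cdf_minus), cdf_minus)
    return torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
```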

4.2.8. Simplified Training Objective

While the full variational bound $L$ is differentiable, the authors found it beneficial for sample quality (and simpler to implement) to train on a simplified objective, $L_{\mathrm{simple}}$: $ L _ { \mathrm { s i m p l e } } ( \theta ) : = \mathbb { E } _ { t , \mathbf { x } _ { 0 } , \epsilon } \Big [ \big \| \epsilon - \epsilon _ { \theta } \big ( \sqrt { \bar { \alpha } _ { t } } \mathbf { x } _ { 0 } + \sqrt { 1 - \bar { \alpha } _ { t } } \epsilon , t \big ) \big \| ^ { 2 } \Big ] $ where $t$ is sampled uniformly between 1 and $T$. This objective:

  • Corresponds to an unweighted version of Eq. (12) (the Lt1L_{t-1} terms), akin to the loss weighting used by NCSN models. It removes the complex weighting factors present in the full variational bound.
  • For t=1t=1, it corresponds to L0L_0 with an approximation of the integral in the discrete decoder, ignoring σ12\sigma_1^2 and edge effects.
  • By discarding the weighting from Eq. (12), $L_{\mathrm{simple}}$ becomes a weighted variational bound that implicitly re-weights the importance of different timesteps. Specifically, the diffusion setup causes this simplified objective to down-weight the loss terms corresponding to small $t$, which involve denoising data with only a small amount of noise. Down-weighting them lets the network focus on the more difficult denoising tasks at larger $t$ (higher noise levels), which empirically leads to better sample quality; the sketch after this list makes the re-weighting concrete.
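To see the re-weighting, one can compare the coefficient that the full bound places on each squared-error term, $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}$, against the constant weight of 1 used by $L_{\mathrm{simple}}$. A small sketch of ours, taking $\sigma_t^2 = \beta_t$ for simplicity:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Weight on ||eps - eps_theta||^2 in the true variational bound, with sigma_t^2 = beta_t:
w = betas / (2.0 * alphas * (1.0 - alpha_bars))
print(w[0], w[T // 2], w[-1])   # ~0.5 at t=1, then much smaller for large t
```

Relative to these decaying weights, the uniform weighting of $L_{\mathrm{simple}}$ emphasizes the larger-$t$, higher-noise terms.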

4.2.9. Neural Network Architecture

The reverse process network ϵθ\epsilon_\theta uses a U-Net backbone similar to an unmasked PixelCNN++ [52, 48], employing group normalization [66] throughout. Key architectural details include:

  • U-Net structure: symmetric encoder-decoder with skip connections.
  • Wide ResNet [72] blocks are used within the U-Net.
  • Self-attention blocks [63, 60] are used at the 16×1616 \times 16 feature map resolution.
  • Parameters are shared across time, and the timestep tt is specified to the network by adding the Transformer sinusoidal position embedding [60] into each residual block. This allows the single network to learn denoising at all noise levels.
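A sketch of one common variant of the sinusoidal timestep embedding (our own helper; in practice the embedding is usually passed through a small MLP before being added into each residual block):

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map integer timesteps t (shape (B,)) to sinusoidal embeddings of size dim (dim even)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                      # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)    # (B, dim)
```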

4.2.10. Connection to Autoregressive Decoding

The variational bound (Eq. 5) can also be rewritten as: $ L = D _ { \mathrm { K L } } ( q ( \mathbf { x } _ { T } ) \parallel p ( \mathbf { x } _ { T } ) ) + \mathbb { E } _ { q } \left[ \sum _ { t \geq 1 } D _ { \mathrm { K L } } ( q ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } ) \parallel p _ { \theta } ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } ) ) \right] + H ( \mathbf { x } _ { 0 } ) $ where H(x0)H(\mathbf{x}_0) is the entropy of the data distribution q(x0)q(\mathbf{x}_0). This form highlights that minimizing LL involves making the reverse transitions pθ(xt1xt)p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) match the forward process posteriors q(xt1xt)q(\mathbf{x}_{t-1}|\mathbf{x}_t).

The paper draws an interesting parallel: if one were to set the diffusion process length TT to the dimensionality of the data, define the forward process to mask out coordinates sequentially (e.g., q(xtxt1)q(\mathbf{x}_t|\mathbf{x}_{t-1}) masks out the tt-th coordinate), and make pθ(xt1xt)p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) a fully expressive conditional distribution, then minimizing the KL divergence terms would essentially train pθp_\theta to predict the tt-th coordinate given the (t+1)(t+1)-th to TT-th coordinates. This setup directly corresponds to training an autoregressive model.

Therefore, the Gaussian diffusion model can be interpreted as a form of autoregressive model but with a generalized bit ordering determined by the Gaussian noise levels, rather than a fixed spatial ordering. This Gaussian diffusion is hypothesized to serve a similar purpose to reorderings in autoregressive models, introducing an inductive bias beneficial for image data, potentially more effectively than masking noise. Moreover, the diffusion length TT is not tied to data dimensionality (e.g., T=1000T=1000 for 32×32×332 \times 32 \times 3 images), allowing flexibility.

4.3. Progressive Lossy Compression and Decoding

The model's iterative nature naturally supports a progressive lossy compression scheme. This is conceptualized through Algorithm 3: Sending x0 and Algorithm 4: Receiving. These algorithms iteratively transmit the noisy versions of the data xT,,x0\mathbf{x}_T, \ldots, \mathbf{x}_0.

Algorithm 3 Sending $\mathbf{x}_0$
1: Send $\mathbf{x}_T \sim q(\mathbf{x}_T|\mathbf{x}_0)$ using $p(\mathbf{x}_T)$
2: for $t = T - 1, \dots, 2, 1$ do
3:   Send $\mathbf{x}_t \sim q(\mathbf{x}_t|\mathbf{x}_{t+1}, \mathbf{x}_0)$ using $p_\theta(\mathbf{x}_t|\mathbf{x}_{t+1})$
4: end for
5: Send $\mathbf{x}_0$ using $p_\theta(\mathbf{x}_0|\mathbf{x}_1)$

Algorithm 4 Receiving
1: Receive $\mathbf{x}_T$ using $p(\mathbf{x}_T)$
2: for $t = T - 1, \dots, 1, 0$ do
3:   Receive $\mathbf{x}_t$ using $p_\theta(\mathbf{x}_t|\mathbf{x}_{t+1})$
4: end for
5: return $\mathbf{x}_0$

This process transmits intermediate representations. At any time tt, the receiver has xt\mathbf{x}_t and can estimate the original x0\mathbf{x}_0 using: $ \mathbf { x } _ { 0 } \approx \hat { \mathbf { x } } _ { 0 } = \left( \mathbf { x } _ { t } - \sqrt { 1 - \bar { \alpha } _ { t } } \pmb { \epsilon } _ { \theta } ( \mathbf { x } _ { t } ) \right) / \sqrt { \bar { \alpha } _ { t } } $ This progressive decoding allows for rate-distortion analysis, where rate is the cumulative bits received and distortion is the RMSE between x0\mathbf{x}_0 and x^0\hat{\mathbf{x}}_0. The paper finds that initial bits remove significant distortion (large-scale features), while later bits refine imperceptible details.
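The estimate $\hat{\mathbf{x}}_0$ is a direct rearrangement of the reparameterization $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$. A minimal sketch, reusing the same hypothetical `eps_model` and schedule as the earlier sketches:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def predict_x0(eps_model, xt, t):
    """Estimate x_0 from a noisy x_t at 0-indexed timestep t (used for rate-distortion curves)."""
    a_bar = alpha_bars[t]
    t_batch = torch.full((xt.shape[0],), t, dtype=torch.long)
    eps = eps_model(xt, t_batch)
    return (xt - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()
```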

5. Experimental Setup

5.1. Datasets

The experiments were conducted on several widely used image datasets:

  • CIFAR10: This dataset consists of 60,000 32×3232 \times 32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. It's a common benchmark for image classification and generative modeling, representing small, diverse natural images.

    An example data sample from CIFAR10 could be a 32×3232 \times 32 color image of a deer or an airplane.

  • CelebA-HQ (CelebFaces Attributes High-Quality): A high-resolution version of the CelebA dataset, containing 30,000 high-resolution images (1024×10241024 \times 1024) of celebrity faces. Experiments in the paper were performed on 256×256256 \times 256 resolution versions of these images. This dataset is excellent for evaluating generative models on human faces, which exhibit complex textures and structures.

    An example data sample would be a 256×256256 \times 256 color image of a person's face.

  • LSUN (Large-scale Scene Understanding) [71]: A large-scale image dataset with millions of labeled images for various scene categories. The paper specifically uses 256×256256 \times 256 resolution images from the LSUN Bedroom, LSUN Church, and LSUN Cat categories. These datasets are challenging due to their diverse appearances and complex scene compositions, making them good benchmarks for high-resolution image synthesis.

    An example data sample from LSUN Bedroom would be a 256×256256 \times 256 color image of a bedroom interior.

These datasets were chosen because they are standard benchmarks in generative modeling, allowing for direct comparison with previous state-of-the-art methods. They cover a range of resolutions and content complexities, validating the method's performance across different image generation tasks.

5.2. Evaluation Metrics

The performance of the models was evaluated using several standard metrics for generative image quality:

  • Inception Score (IS):

    1. Conceptual Definition: The Inception Score measures both the quality (how realistic/clear images are) and diversity (how many distinct object categories are represented) of generated images. It leverages a pre-trained Inception v3 classification model to assess these aspects. A higher IS implies better quality and diversity.
    2. Mathematical Formula: $ \mathrm{IS}(G) = \exp \left( \mathbb{E}_{\mathbf{x} \sim p_g} \left[ D_{\mathrm{KL}}(p(y|\mathbf{x}) \parallel p(y)) \right] \right) $
    3. Symbol Explanation:
      • GG: The generative model being evaluated.
      • pgp_g: The probability distribution of images generated by model GG.
      • xpg\mathbf{x} \sim p_g: A generated image sampled from GG.
      • p(yx)p(y|\mathbf{x}): The conditional class probability distribution output by a pre-trained Inception v3 model when classifying image x\mathbf{x}. This indicates how confident the classifier is that x\mathbf{x} belongs to a specific class. High entropy in p(yx)p(y|\mathbf{x}) (i.e., uniform distribution over classes) suggests a low-quality or ambiguous image. Low entropy suggests a clear, high-quality image.
      • p(y): The marginal class distribution over all generated images. This measures the diversity. A uniform p(y) across classes indicates good diversity; if p(y) is concentrated on a few classes, it indicates mode collapse.
      • DKL(p(yx)p(y))D_{\mathrm{KL}}(p(y|\mathbf{x}) \parallel p(y)): The Kullback-Leibler divergence between the conditional class distribution and the marginal class distribution. This term is high when images are clearly classifiable (low entropy p(yx)p(y|\mathbf{x})) and when generated classes are diverse (high entropy p(y)), indicating good quality and diversity.
      • Expg[]\mathbb{E}_{\mathbf{x} \sim p_g}[\cdot]: The expectation taken over all generated images.
      • exp()\exp(\cdot): The exponential function, used to scale the result.
  • Fréchet Inception Distance (FID):

    1. Conceptual Definition: FID quantifies the similarity between the distribution of real images and the distribution of generated images. It is often considered superior to IS because it compares entire distributions of feature representations rather than class labels alone. A lower FID score indicates that generated images are closer to real images in perceptual quality and diversity (a computation sketch is given after this metrics list).
    2. Mathematical Formula: $ \mathrm{FID}(\mathbf{x}, \mathbf{g}) = ||\mu_{\mathbf{x}} - \mu_{\mathbf{g}}||^2 + \mathrm{Tr}(\Sigma_{\mathbf{x}} + \Sigma_{\mathbf{g}} - 2(\Sigma_{\mathbf{x}}\Sigma_{\mathbf{g}})^{1/2}) $
    3. Symbol Explanation:
      • x\mathbf{x}: Real images.
      • g\mathbf{g}: Generated images.
      • μx\mu_{\mathbf{x}}: The mean feature vector of real images, obtained from an intermediate layer of a pre-trained Inception v3 network.
      • μg\mu_{\mathbf{g}}: The mean feature vector of generated images.
      • Σx\Sigma_{\mathbf{x}}: The covariance matrix of the feature vectors for real images.
      • Σg\Sigma_{\mathbf{g}}: The covariance matrix of the feature vectors for generated images.
      • 2||\cdot||^2: The squared Euclidean distance (or L2L_2 norm). This term measures the distance between the means of the feature distributions.
      • Tr()\mathrm{Tr}(\cdot): The trace of a matrix (sum of its diagonal elements).
      • (ΣxΣg)1/2(\Sigma_{\mathbf{x}}\Sigma_{\mathbf{g}})^{1/2}: The matrix square root of the product of the covariance matrices. This term, along with the sum of covariances, measures the difference in the shape and orientation of the feature distributions.
  • Negative Log Likelihood (NLL) / bits/dim:

    1. Conceptual Definition: Negative Log Likelihood (NLL) measures how well a probabilistic model assigns probability to observed data. A lower NLL means the model better explains the data. When normalized by the dimensionality of the data, it's expressed in bits/dim (bits per dimension), which can be interpreted as the average number of bits needed to encode each dimension of a data point using the model. Lower bits/dim imply a more efficient and accurate model of the data distribution.
    2. Mathematical Formula: For a model pθ(x)p_\theta(\mathbf{x}) and a dataset {x(1),,x(N)}\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}, the average NLL is: $ \mathrm{NLL} = - \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(\mathbf{x}^{(i)}) $ When reported in bits/dim, it is divided by the data dimensionality DD: $ \text{bits/dim} = \frac{\mathrm{NLL}}{D} $ For variational autoencoders and diffusion models, the ELBO is typically reported as an upper bound on the NLL, so a lower ELBO is better. The paper states their NLL values are upper bounds (\leq).
    3. Symbol Explanation:
      • NN: Number of data points in the dataset.
      • pθ(x(i))p_\theta(\mathbf{x}^{(i)}): The probability density (or mass) assigned to data point x(i)\mathbf{x}^{(i)} by the model with parameters θ\theta.
      • log\log: The natural logarithm.
      • DD: The dimensionality of each data point x\mathbf{x} (e.g., for a 32×32×332 \times 32 \times 3 image, D=32×32×3=3072D = 32 \times 32 \times 3 = 3072).
  • Root Mean Squared Error (RMSE):

    1. Conceptual Definition: RMSE is a measure of the difference between values predicted by a model and the actual values. It's frequently used to quantify the magnitude of the error. In the context of image reconstruction or progressive decoding, it measures how closely the reconstructed image matches the original image pixel-wise. Lower RMSE indicates higher reconstruction accuracy.
    2. Mathematical Formula: $ \mathrm{RMSE} = \sqrt{\frac{1}{D} \sum_{i=1}^{D} (x_{0,i} - \hat{x}_{0,i})^2} $
    3. Symbol Explanation:
      • $D$: The total number of dimensions (pixels × channels) in the image.
      • $x_{0,i}$: The $i$-th pixel value (or channel value) of the original image $\mathbf{x}_0$.
      • $\hat{x}_{0,i}$: The $i$-th pixel value (or channel value) of the reconstructed image $\hat{\mathbf{x}}_0$.
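
    A minimal NumPy helper for RMSE on the [0, 255] pixel scale used in Table 4 (the array names are placeholders):

```python
import numpy as np

def rmse(x0: np.ndarray, x0_hat: np.ndarray) -> float:
    """Root mean squared error between an original image x0 and its
    reconstruction x0_hat, both on a [0, 255] scale."""
    diff = x0.astype(np.float64) - x0_hat.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))
```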

5.3. Baselines

The paper compares its Denoising Diffusion Probabilistic Model against a wide array of state-of-the-art generative models across different paradigms, primarily on the CIFAR10 dataset for unconditional generation, and LSUN for larger image generation.

For CIFAR10 (Unconditional):

  • Diffusion (original) [53]: The original diffusion probabilistic model, representing the foundational work that this paper significantly improves upon.

  • Gated PixelCNN [59]: An early autoregressive model known for good likelihood.

  • Sparse Transformer [7]: A more recent and powerful autoregressive model.

  • PixelIQN [43]: An autoregressive quantile network.

  • Energy-Based Models (EBM) [11]: Represents a class of models that learn an energy function.

  • Score-based Generative Models:

    • NCSN [55]: Noise Conditional Score Networks, a pioneering score-based model.
    • NCSNv2 [56]: Improved techniques for training score-based models.
  • Generative Adversarial Networks (GANs):

    • SNGAN [39]: Spectral Normalization GAN, which improved GAN stability.

    • SNGAN-DDLS [4]: SNGAN with Discriminator Driven Latent Sampling.

    • StyleGAN2 + ADA (v1) [29]: A highly advanced GAN architecture with adaptive discriminator augmentation, representing the state-of-the-art in GANs at the time.

For CIFAR10 (Conditional): (Though the paper focuses on unconditional generation, it lists some conditional baselines for context, showing its unconditional model can sometimes outperform conditional ones.)

  • EBM [11]: Conditional Energy-Based Model.

  • JEM [17]: Joint Energy-based Model.

  • BigGAN [3]: A large-scale GAN known for high-quality conditional image generation.

  • StyleGAN2 + ADA (v1) [29]: State-of-the-art conditional GAN.

For LSUN 256x256 (Bedroom, Church, Cat):

  • ProgressiveGAN [27]: A GAN that generates images progressively from low to high resolution.

  • StyleGAN [28]: An advanced GAN known for disentangled latent space.

  • StyleGAN2 [30]: Further improvements over StyleGAN.

    These baselines are representative because they cover the dominant paradigms in generative modeling at the time (GANs, autoregressive models, VAEs, score-based models, and EBMs) and include methods that were considered state-of-the-art in terms of sample quality and/or likelihood performance on these specific datasets. Comparing against such a diverse and strong set of baselines robustly validates the performance and innovation of DDPMs.

6. Results & Analysis

6.1. Core Results Analysis

The main experimental results strongly validate the effectiveness of the proposed DDPM method, particularly in achieving state-of-the-art sample quality as measured by FID score.

The following are the results from Table 1 of the original paper:

| Model | IS | FID | NLL Test (Train) |
| --- | --- | --- | --- |
| Conditional | | | |
| EBM [11] | 8.30 | 37.9 | |
| JEM [17] | 8.76 | 38.4 | |
| BigGAN [3] | 9.22 | 14.73 | |
| StyleGAN2 + ADA (v1) [29] | 10.06 | 2.67 | |
| Unconditional | | | |
| Diffusion (original) [53] | | | ≤ 5.40 |
| Gated PixelCNN [59] | 4.60 | 65.93 | 3.03 (2.90) |
| Sparse Transformer [7] | | | 2.80 |
| PixelIQN [43] | 5.29 | 49.46 | |
| EBM [11] | 6.78 | 38.2 | |
| NCSNv2 [56] | | 31.75 | |
| NCSN [55] | 8.87±0.12 | 25.32 | |
| SNGAN [39] | 8.22±0.05 | 21.7 | |
| SNGAN-DDLS [4] | 9.09±0.10 | 15.42 | |
| StyleGAN2 + ADA (v1) [29] | 9.74±0.05 | 3.26 | |
| Ours (L, fixed isotropic Σ) | 7.67±0.13 | 13.51 | ≤ 3.70 (3.69) |
| Ours (Lsimple) | 9.46±0.11 | 3.17 | ≤ 3.75 (3.72) |

CIFAR10 Results Analysis:

  • State-of-the-Art FID: The most striking result is the FID score of 3.17 achieved by "Ours (Lsimple)" on unconditional CIFAR10. This was a new state of the art at the time of publication, outperforming even the highly sophisticated StyleGAN2 + ADA (v1) (3.26 FID) and beating all other unconditional baselines by a large margin (e.g., NCSN at 25.32 FID). This demonstrates unprecedented sample quality for diffusion models.

  • Inception Score: The Inception Score of 9.46 is also highly competitive, exceeding the conditional BigGAN (9.22) and approaching the conditional StyleGAN2 + ADA (v1) (10.06).

  • Negative Log Likelihood (NLL): While achieving superior sample quality, the NLL for DDPMs (3.75 bits/dim for Lsimple) is not as competitive as autoregressive models like Sparse Transformer (2.80 bits/dim) or Gated PixelCNN (3.03 bits/dim). This suggests that diffusion models excel at producing perceptually high-quality samples (good lossy compression) but might not capture every fine detail of the data distribution perfectly for lossless compression.

  • Comparison to Original Diffusion: The improvement over the original diffusion model (which reported an NLL of $\leq 5.40$ and no IS/FID) is substantial, underscoring the impact of the proposed architectural and objective changes.

    The following are the results from Table 3 of the original paper:

| Model | LSUN Bedroom (FID) | LSUN Church (FID) | LSUN Cat (FID) |
| --- | --- | --- | --- |
| ProgressiveGAN [27] | 8.34 | 6.42 | 37.52 |
| StyleGAN [28] | 2.65 | 4.21* | 8.53* |
| StyleGAN2 [30] | - | 3.86 | 6.93 |
| Ours (Lsimple) | 6.36 | 7.89 | 19.75 |
| Ours (Lsimple, large) | 4.90 | - | - |

LSUN Results Analysis:

  • Competitive Sample Quality: On 256x256 LSUN datasets, the DDPMs achieve sample quality similar to ProgressiveGAN. For LSUN Bedroom, the "Ours (Lsimple, large)" model achieves an FID of 4.90, better than ProgressiveGAN (8.34) but behind StyleGAN (2.65). On LSUN Church (FID 7.89) the model is close to ProgressiveGAN (6.42), and on LSUN Cat (FID 19.75) it clearly beats ProgressiveGAN (37.52), though StyleGAN and StyleGAN2 remain stronger on both. This demonstrates that the method scales to higher resolutions and more complex datasets.

    The high-quality samples generated by the model are visually compelling, as illustrated in the paper's figures. For instance, Figure 1 shows realistic faces from CelebA-HQ and diverse images from CIFAR10.


Figure 1: Generated samples on CelebA-HQ $256 \times 256$ (left) and unconditional CIFAR10 (right)

Further examples on LSUN datasets (Church and Bedroom) in Figures 3 and 4 also support the claim of high sample quality.


Figure 3: LSUN Church samples. FID = 7.89


Figure 4: LSUN Bedroom samples. FID = 4.90

The overall results indicate that DDPMs, with the proposed innovations, are a highly effective and competitive generative modeling approach, particularly for image synthesis.

6.2. Ablation Studies / Parameter Analysis

The paper conducts an ablation study on CIFAR10 to evaluate the effects of different reverse process parameterizations and training objectives.

The following are the results from Table 2 of the original paper:

| Objective | IS | FID |
| --- | --- | --- |
| $\tilde{\mu}$ prediction (baseline) | | |
| L, learned diagonal Σ | 7.28±0.10 | 23.69 |
| L, fixed isotropic Σ | 8.06±0.09 | 13.22 |
| $\lVert\tilde{\mu} - \mu_\theta\rVert^2$ | - | - |
| $\epsilon$ prediction (ours) | | |
| L, learned diagonal Σ | - | - |
| L, fixed isotropic Σ | 7.67±0.13 | 13.51 |
| $\lVert\tilde{\epsilon} - \epsilon_\theta\rVert^2$ (Lsimple) | 9.46±0.11 | 3.17 |

Ablation Study Analysis:

  • Impact of Reverse Process Variance ($\Sigma_\theta$):
    • Learning the diagonal covariance $\Sigma_\theta$ (by incorporating it into the variational bound) consistently leads to worse sample quality and unstable training, regardless of whether $\tilde{\mu}$ prediction or $\epsilon$ prediction is used. For example, L with learned diagonal Σ and $\tilde{\mu}$ prediction yields an FID of 23.69. This suggests that fixed variances are crucial for stable and high-quality generation.
    • Using a fixed isotropic Σ significantly improves performance: for $\tilde{\mu}$ prediction it drops FID to 13.22 (from 23.69), and for $\epsilon$ prediction it gives an FID of 13.51. This confirms the robustness of fixing the variance.
  • Impact of Mean Parameterization ($\mu_\theta$):
    • $\tilde{\mu}$ prediction (directly predicting the posterior mean) works reasonably well when trained on the full variational bound L with fixed isotropic Σ (FID 13.22). However, training it on an unweighted mean squared error objective ($\lVert\tilde{\mu} - \mu_\theta\rVert^2$) was unstable and produced poor samples.
    • $\epsilon$ prediction (predicting the noise term), the parameterization proposed by the authors, performs comparably to $\tilde{\mu}$ prediction when both are trained on the full variational bound L with fixed isotropic Σ (FID 13.51 vs. 13.22).
  • Impact of Training Objective:
    • The most significant finding is the performance of $\epsilon$ prediction when trained with the simplified objective (Lsimple, the unweighted $\lVert\tilde{\epsilon} - \epsilon_\theta\rVert^2$ loss). This combination yields the best FID score of 3.17, a dramatic improvement over all other configurations. It validates the paper's key claim that the simplified, re-weighted variational bound is essential for state-of-the-art sample quality: by down-weighting the loss at small $t$, the simplified objective lets the model focus on the more challenging denoising tasks at larger noise levels.

In summary, the ablation studies reveal that fixing the reverse process variances and, most importantly, combining the epsilon-prediction parameterization with the simplified training objective ($L_{simple}$) are critical for DDPMs to achieve their high sample quality. A minimal sketch of this simplified objective is given below.
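
To make the simplified objective concrete, here is a minimal PyTorch-style sketch of one training step under $L_{simple}$. The names `eps_model` (a noise-prediction network taking a noisy batch and its timesteps) and `alpha_bar` (a length-$T$ tensor holding $\bar{\alpha}_t = \prod_{s \le t}(1 - \beta_s)$) are hypothetical placeholders, image batches are assumed to have shape (B, C, H, W), and the snippet illustrates the idea rather than reproducing the authors' implementation.

```python
import torch

def l_simple_step(eps_model, x0, alpha_bar, T=1000):
    """One L_simple training step: sample a timestep and noise, diffuse x0
    to x_t in closed form, and regress the network output onto the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)       # uniform timestep per example
    eps = torch.randn_like(x0)                             # eps ~ N(0, I)
    a_bar = alpha_bar[t].view(b, 1, 1, 1)                  # cumulative product of (1 - beta)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # q(x_t | x_0) in closed form
    return ((eps - eps_model(x_t, t)) ** 2).mean()         # unweighted MSE on the noise
```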
6.3. Progressive Coding and Decoding

The paper discusses a progressive lossy compression scheme inherent to diffusion models, which also manifests as progressive generation.

The following are the results from Table 4 of the original paper:

| Reverse process time (T − t + 1) | Rate (bits/dim) | Distortion (RMSE, [0, 255] scale) |
| --- | --- | --- |
| 1000 | 1.77581 | 0.95136 |
| 900 | 0.11994 | 12.02277 |
| 800 | 0.05415 | 18.47482 |
| 700 | 0.02866 | 24.43656 |
| 600 | 0.01507 | 30.80948 |
| 500 | 0.00716 | 38.03236 |
| 400 | 0.00282 | 46.12765 |
| 300 | 0.00081 | 54.18826 |
| 200 | 0.00013 | 60.97170 |
| 100 | 0.00000 | 67.60125 |
| 1 | 0.00000 | 67.60125 |

Rate-Distortion Analysis:

  • Table 4 and Figure 5 illustrate the rate-distortion behavior on the CIFAR10 test set.

Figure 5: Unconditional CIFAR10 test set rate-distortion vs. time. Distortion is measured in root mean squared error on a [0, 255] scale. See Table 4 for details.

  • The distortion decreases steeply in the low-rate region: most of the reduction in RMSE (from about 67.6 down to about 12) is obtained with only about 0.12 bits/dim, accumulated over the first 900 reverse process steps. A small number of bits is thus sufficient to capture the most significant, coarse-grained structure of the image.
  • The majority of the lossless codelength describes imperceptible image details. Transmitting everything (reverse process time 1000) costs a total rate of 1.77581 bits/dim for a distortion of 0.95136, but by reverse process time 900, with a rate of only 0.11994 bits/dim, the distortion has already dropped to 12.02277. This suggests the model prioritizes transmitting structural information early and supports the view of diffusion models as excellent lossy compressors.

Progressive Generation:

  • The iterative sampling procedure itself can be viewed as progressive generation. Figures 6 and 14 show how images gradually form from noise during the reverse process: large-scale image features appear first, and fine details emerge in the later stages (smaller $t$).

Figure 6: Unconditional CIFAR10 progressive generation ($\hat{\mathbf{x}}_0$ over time, from left to right). Extended samples and sample quality metrics over time are in the appendix (Figs. 10 and 14).

Figure 14: Unconditional CIFAR10 progressive generation

  • Figure 10 provides a quantitative view of this, showing how IS (Inception Score) increases and FID (Fréchet Inception Distance) decreases as more reverse process steps are taken (as T − t increases, i.e., as $t$ approaches 0). This clearly demonstrates that image quality improves progressively.

Figure 10: Unconditional CIFAR10 progressive sampling quality over time. The left chart shows the Inception Score rising with the number of reverse process steps, while the right chart shows the FID falling as the reverse process steps increase.

The $\hat{\mathbf{x}}_0$ shown during progressive generation is the model's current estimate of the clean image; a sketch of how such an estimate can be computed appears below.
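
The estimate can be recovered by inverting the closed-form forward process around the predicted noise. A minimal sketch, reusing the hypothetical `eps_model` and `alpha_bar` names from the previous snippet and assuming data scaled to [-1, 1]:

```python
import torch

def predict_x0(eps_model, x_t, t, alpha_bar):
    """Estimate the clean image x0 from a noisy x_t by plugging the predicted
    noise into x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps and solving for x0."""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)
    eps_hat = eps_model(x_t, t)
    x0_hat = (x_t - (1.0 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
    return x0_hat.clamp(-1.0, 1.0)  # assumes data scaled to [-1, 1]
```

Rendering this estimate at successive timesteps produces the coarse-to-fine sequences shown in Figures 6 and 14.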
6.4. Interpolation

The paper demonstrates latent space interpolation between source images $\mathbf{x}_0$ and $\mathbf{x}_0'$ (from the CelebA-HQ dataset) by first encoding them into a common noisy latent space $\mathbf{x}_t, \mathbf{x}_t'$ using the forward process, then linearly interpolating between these latents, $\bar{\mathbf{x}}_t = (1-\lambda)\mathbf{x}_t + \lambda\mathbf{x}_t'$, and finally decoding $\bar{\mathbf{x}}_t$ back into image space using the reverse process.

  • Figure 8 illustrates this process. The left side shows the diffused source images and the interpolated latent; the right side displays the reconstructions and interpolations for various $\lambda$ values at $t = 500$.

Figure 8: Interpolations of CelebA-HQ 256x256 images with 500 timesteps of diffusion

  • The results show plausible interpolations that smoothly vary attributes like pose, skin tone, hairstyle, and expression. However, some attributes such as eyewear are not smoothly interpolated, suggesting the latent space does not disentangle all features equally.
  • The choice of timestep $t$ for interpolation is crucial. A larger $t$ means more noise is added, destroying more of the original structure and leading to coarser, more varied interpolations and potentially novel samples (as seen in Appendix Figure 9, where $t = 1000$ yields novel samples). Conversely, smaller $t$ values preserve more fine details.

Figure 9 (appendix): Reconstructions and interpolations for different numbers of diffusion steps (rows, from 1000 down to 0) and interpolation weights $\lambda$ (columns, from 0.1 to 0.9). This demonstrates the model's ability to complete structure and generate diverse, plausible images from partial or noisy information.

6.5. Latent Structure and Reverse Process Stochasticity

Figure 7 explores the latent structure of the model by conditioning multiple generated samples on the same intermediate latent $\mathbf{x}_t$.

Figure 7: When conditioned on the same latent, CelebA-HQ $256 \times 256$ samples share high-level attributes. Bottom-right quadrants are $\mathbf{x}_t$, and other quadrants are samples from $p_\theta(\mathbf{x}_0 | \mathbf{x}_t)$.

  • When samples are branched from a very early latent (shared $\mathbf{x}_{1000}$), they differ significantly, as $\mathbf{x}_{1000}$ is mostly noise.
  • However, when branched from an intermediate latent like $\mathbf{x}_{750}$ or $\mathbf{x}_{500}$, the generated images share high-level attributes such as gender, hair color, eyewear, saturation, pose, and facial expression. This suggests that these intermediate latent variables encode such semantic information, even though the latents themselves are imperceptible noisy images, and it hints at the model's conceptual compression capabilities. A sketch of the diffuse-and-decode procedure underlying both this experiment and the interpolations of Section 6.4 is given below.
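
The sketch assumes the hypothetical `alpha_bar` schedule from the earlier snippets and a hypothetical `reverse_process(x_t, t)` sampler that runs the learned reverse chain from step `t` down to 0; it is an illustration, not the authors' implementation.

```python
import torch

def diffuse(x0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    a_bar = alpha_bar[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)

def interpolate(x0_a, x0_b, lam, t, alpha_bar, reverse_process):
    """Encode two images to the same noise level t, mix the latents linearly,
    and decode the mixture with the learned reverse process (Section 6.4)."""
    x_t_a = diffuse(x0_a, t, alpha_bar)
    x_t_b = diffuse(x0_b, t, alpha_bar)
    x_t_mix = (1.0 - lam) * x_t_a + lam * x_t_b
    return reverse_process(x_t_mix, t)  # run p_theta from step t down to 0

def branch(x0, t, alpha_bar, reverse_process, n=3):
    """Decode several samples from one shared latent x_t (as in Figure 7)."""
    x_t = diffuse(x0, t, alpha_bar)
    return [reverse_process(x_t, t) for _ in range(n)]
```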

6.6. Nearest Neighbors

The paper includes nearest neighbor visualizations to show that the generated samples are not mere copies of the training data. Figures 12 and 15 show generated samples alongside their closest counterparts from the training set, in both pixel space and Inception feature space.


Figure 12: CelebA-HQ $256 \times 256$ nearest neighbors, computed on a $100 \times 100$ crop surrounding the faces. Generated samples are in the leftmost column, and training set nearest neighbors are in the remaining columns.


Figure 15: Unconditional CIFAR10 nearest neighbors. Generated samples are in the leftmost column, and training set nearest neighbors are in the remaining columns.

  • In all cases, the generated samples are distinct from the training examples, indicating that the model is indeed generating new data rather than simply memorizing and reconstructing existing ones. The nearest neighbors often share broad characteristics but differ in details, supporting the model's ability to generalize. This addresses a common concern with generative models (avoiding overfitting or simply copying).

6.7. Additional Samples

The paper includes numerous additional uncurated samples in the appendix for various datasets (CelebA-HQ, CIFAR10, LSUN Church, Bedroom, Cat), visually confirming the high quality and diversity of generated images.


Figure 11: CelebA-HQ $256 \times 256$ generated samples


Figure 13: Unconditional CIFAR10 generated samples


Figure 16: LSUN Church generated samples. FID = 7.89


Figure 17: LSUN Bedroom generated samples, large model. FID = 4.90


Figure 18: LSUN Bedroom generated samples, small model. FID = 6.36


Figure 19: LSUN Cat generated samples. FID = 19.75

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully demonstrates that Denoising Diffusion Probabilistic Models (DDPMs) are capable of generating high-quality images, achieving state-of-the-art FID scores on unconditional CIFAR10 (3.17) and sample quality comparable to ProgressiveGAN on 256x256 LSUN datasets. The key innovations contributing to this performance are a novel epsilon-prediction parameterization of the reverse process's mean function, which establishes a theoretical connection to denoising score matching and annealed Langevin dynamics, and a simplified, weighted variational bound objective that prioritizes learning challenging denoising tasks. The models also inherently support a progressive lossy decompression scheme, interpretable as a generalized autoregressive decoding, and exhibit structured latent spaces that encode high-level attributes.

7.2. Limitations & Future Work

Limitations pointed out by the authors:

  • Log-Likelihood: Despite excellent sample quality, the lossless codelengths (negative log-likelihoods) of their models are not competitive with other likelihood-based models (e.g., autoregressive models). The authors suggest that the majority of this codelength is consumed to describe imperceptible image details, implying the models are highly effective as lossy compressors rather than lossless ones.
  • Practical Compression System: The progressive lossy compression argument (Algorithms 3 and 4) is presented as a proof of concept. It currently relies on theoretical procedures like minimal random coding, which are not tractable for high-dimensional data, thus not yet forming a practical compression system.
  • Decoder Complexity: The reverse process decoder for x0\mathbf{x}_0 (Eq. 13) is a simple independent discrete decoder. They note that incorporating a more powerful decoder (e.g., a conditional autoregressive model) is left for future work.

Potential future research directions suggested by the authors:

  • Other Data Modalities: Investigate the utility of diffusion models in other data modalities beyond images (e.g., audio, text, video).
  • Components in Other Systems: Explore their use as components in other types of generative models or machine learning systems.
  • Subscale Orderings/Sampling Strategies: Leverage the progressive decoding concept to develop more general designs for subscale orderings or sampling strategies for autoregressive models.
  • Theoretical Implications: Further explore the connections between diffusion models, score matching, energy-based models, and Langevin dynamics.

7.3. Personal Insights & Critique

This paper represents a landmark contribution to the field of generative AI. Before this work, diffusion models were largely considered theoretically elegant but practically inferior to GANs for image generation. This paper decisively changed that perception, establishing diffusion models as a formidable class of generative models.

Inspirations drawn:

  • Power of Reparameterization: The epsilon-prediction parameterization is a brilliant insight. By re-framing the task from predicting the mean of a Gaussian transition to predicting the injected noise itself, the problem becomes directly analogous to denoising, leading to a simpler objective that aligns with score matching principles (see the expression after this list). This highlights how a clever reparameterization can unlock significant performance gains and theoretical connections.
  • Objective Function Design: The simplified training objective (LsimpleL_{simple}) is a powerful lesson in objective function design. By strategically re-weighting the variational bound (down-weighting early denoising steps), the authors implicitly guide the model to focus on more perceptually important features, leading to superior sample quality even if the NLL is less competitive. This underscores that maximizing likelihood isn't always perfectly correlated with perceptual quality, especially for images.
  • Inductive Bias of Diffusion: The idea that Gaussian diffusion introduces a beneficial inductive bias for image data, akin to a generalized autoregressive ordering, is thought-provoking. This suggests that the nature of noise and its removal can encode structural information that aids generation.
  • Progressive Generation as a First-Class Citizen: The inherent progressive generation capability, where images gradually materialize from coarse to fine details, is not just a side effect but a feature that provides insight into the model's learning process and its potential for applications like progressive rendering or efficient content streaming.
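
Concretely, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, the reparameterization derived in the paper expresses the reverse process mean through the predicted noise: $ \mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(\mathbf{x}_t, t) \right) $ so that each reverse step amounts to a single learned denoising of $\mathbf{x}_t$ followed by the addition of Gaussian noise with variance $\sigma_t^2$.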

Potential issues, unverified assumptions, or areas for improvement:

  • Sampling Speed: While the paper mentions that DDPMs can be "made shorter for fast sampling," the default $T = 1000$ steps still implies a relatively slow sampling process compared to single-pass GANs or flows. This was a known limitation of iterative generative models at the time. Further research (e.g., Denoising Diffusion Implicit Models, DDIMs, which came later) addresses this by allowing fewer sampling steps.

  • Computational Cost: Training large DDPMs, especially with $T = 1000$ steps and a U-Net architecture, can be computationally intensive, as evidenced by the TPU v3-8 usage and long training times.

  • Log-Likelihood vs. Perceptual Quality Trade-off: While the Lsimple objective yields excellent samples, its sub-optimal log-likelihood suggests it might not capture the true data distribution as accurately as some other likelihood-based models. For applications requiring precise density estimation (e.g., anomaly detection based on likelihood), this could be a drawback. The claim that imperceptible details consume most of the codelength is plausible but also highlights a potential area for future work: how to design objectives that balance perceptual quality with accurate density modeling.

  • Generalizability of Inductive Bias: While the Gaussian diffusion appears beneficial for image data, its effectiveness for other modalities (e.g., structured tabular data, graphs) might vary. The choice of noise schedule and type of noise is critical and might require careful tuning for different data types.

    Overall, this paper laid crucial groundwork for the widespread adoption and subsequent advancements of diffusion models, which have since become dominant in generative AI. It's a testament to rigorous theoretical connections combined with practical engineering insights.
