
Consistency Models

Published: 03/03/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces consistency models to address the slow generation speed of diffusion models, enabling fast one-step generation and multi-step sampling. They also support zero-shot data editing, outperforming existing techniques on benchmarks like CIFAR-10.

Abstract

Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Consistency Models."

1.2. Authors

The authors are:

  • Yang Song

  • Prafulla Dhariwal

  • Mark Chen

  • Ilya Sutskever

    All authors are affiliated with OpenAI, as indicated by the superscript "1" next to their names. Their research background is in generative modeling, particularly diffusion models, with significant contributions in this area.

1.3. Journal/Conference

The paper was published as a preprint on arXiv. While not explicitly stated as published in a journal or conference yet, arXiv is a reputable platform for disseminating cutting-edge research in machine learning and related fields, indicating the novelty and relevance of the work. The authors' affiliation with OpenAI suggests the work is from a leading AI research institution.

1.4. Publication Year

The paper was first posted to arXiv on 2023-03-02 (UTC), i.e., in 2023.

1.5. Abstract

This paper introduces consistency models, a novel family of generative models designed to address the slow generation speed of existing diffusion models, which rely on iterative sampling. Consistency models enable high-quality sample generation by directly mapping noise to data, supporting fast one-step generation by design. They also offer the flexibility of multi-step sampling to balance computational cost and sample quality. A significant advantage is their zero-shot data editing capabilities, including image inpainting, colorization, and super-resolution, without requiring specific training for these tasks.

The models can be trained in two ways: by distilling pre-trained diffusion models (consistency distillation) or as standalone generative models (consistency training). Experimental results demonstrate that consistency models outperform current diffusion model distillation techniques in one- and few-step sampling, achieving state-of-the-art Fréchet Inception Distance (FID) scores of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, they surpass existing one-step, non-adversarial generative models on standard benchmarks like CIFAR-10, ImageNet 64x64, and LSUN 256x256.

The official source link is https://arxiv.org/abs/2303.01469, and the PDF is available at https://arxiv.org/pdf/2303.01469v2.pdf. The paper is published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

2.1.1. Core Problem

The core problem the paper aims to solve is the slow generation speed of diffusion models. While diffusion models have achieved remarkable success in generating high-quality images, audio, and video, they inherently rely on an iterative sampling process. This process, which typically involves 10 to 2000 steps, makes generation computationally intensive and slow, limiting their application in real-time scenarios.

2.1.2. Importance and Gaps

Diffusion models, also known as score-based generative models, have demonstrated unprecedented success across various fields. Their iterative sampling provides a flexible trade-off between compute and sample quality and enables powerful zero-shot data editing capabilities (e.g., inpainting, colorization). However, their slow inference speed stands as a significant bottleneck compared to single-step generative models like Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), or Normalizing Flows. Existing methods for faster sampling, such as optimized ODE solvers or distillation techniques, often still require numerous steps or involve computationally expensive data generation prior to distillation. This leaves a gap for a generative model that can achieve efficient, single-step generation without sacrificing the advantages of iterative sampling (such as quality-compute trade-offs and zero-shot editing).

2.1.3. Paper's Entry Point / Innovative Idea

The paper's innovative idea is to learn a mapping that directly transforms any point on a Probability Flow (PF) Ordinary Differential Equation (ODE) trajectory to its origin (the clean data sample). This mapping function possesses a property called self-consistency: all points on the same ODE trajectory map to the same initial data point. By learning this consistency function, the models, termed consistency models, can generate data samples in a single step from random noise, effectively bypassing the iterative denoising process of traditional diffusion models. Crucially, they retain the ability to perform multi-step generation for quality improvements and zero-shot data editing by chaining their outputs.

2.2. Main Contributions / Findings

2.2.1. Primary Contributions

The paper's primary contributions are:

  • Introduction of Consistency Models: A new family of generative models designed for fast one-step generation while retaining the benefits of multi-step sampling and zero-shot data editing capabilities.
  • Two Training Paradigms:
    • Consistency Distillation (CD): A method to distill knowledge from pre-trained diffusion models into a consistency model, significantly outperforming existing diffusion distillation techniques.
    • Consistency Training (CT): A method to train consistency models from scratch as standalone generative models, independent of pre-trained diffusion models.
  • Zero-Shot Data Editing Capabilities: Demonstration that consistency models inherit and support various zero-shot data editing tasks (e.g., inpainting, colorization, super-resolution, denoising, interpolation, stroke-guided editing) without explicit training for these tasks.
  • Theoretical Foundations: Provides theoretical justifications for the training objectives, including an asymptotic analysis for consistency distillation and consistency training.
  • Architectural Flexibility: The proposed methods place minor architectural constraints, allowing the use of flexible neural networks common in diffusion models.

2.2.2. Key Conclusions / Findings

The paper reached several key conclusions:

  • State-of-the-Art One-Step Generation: Consistency models achieve new state-of-the-art FID scores for one-step generation (3.55 on CIFAR-10 and 6.20 on ImageNet 64x64), significantly outperforming previous diffusion distillation methods.
  • Superior to Other Distillation Techniques: In one- and few-step sampling, consistency distillation consistently outperforms existing diffusion distillation methods like progressive distillation.
  • Competitive Standalone Generative Models: When trained from scratch, consistency models match or surpass the quality of one-step samples from progressive distillation and outperform existing one-step, non-adversarial generative models (e.g., VAEs, normalizing flows) on standard benchmarks.
  • Retained Diffusion Model Advantages: Despite their single-step generation capability, consistency models effectively preserve the flexibility to trade compute for sample quality via multi-step sampling and enable diverse zero-shot data editing applications.
  • Connection to Other Fields: The underlying principles of consistency models show similarities to techniques in deep reinforcement learning (e.g., Deep Q-learning) and momentum-based contrastive learning, suggesting potential for cross-pollination of ideas.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following foundational concepts:

3.1.1. Generative Models

Generative models are a class of machine learning models that learn to generate new data instances that resemble the training data.

  • Goal: To learn the underlying probability distribution of a dataset so that new samples can be drawn from that learned distribution.
  • Examples: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Normalizing Flows, and Diffusion Models.

3.1.2. Diffusion Models

Diffusion models (also known as score-based generative models) are a class of generative models that work by systematically destroying training data through an iterative forward diffusion process (adding noise) and then learning to reverse this process to generate new data from noise.

  • Forward Process (Diffusion): Gradually adds Gaussian noise to data samples over several time steps, transforming them into pure noise. This can be described by a stochastic differential equation (SDE).
  • Reverse Process (Denoising): Learns to gradually denoise the pure noise back into data samples. This involves learning the score function, which is the gradient of the log probability density of the noisy data at each time step.
  • Sampling: To generate a sample, one starts with a pure noise vector and iteratively applies the learned denoising steps.
  • Slow Generation: The iterative nature of the denoising process is the primary reason for their slow sampling speed.
  • Probability Flow (PF) ODE: For every SDE used in the forward process, there exists a corresponding ordinary differential equation (ODE) whose trajectories have the same marginal distributions as the SDE. This PF ODE offers a deterministic way to move from noise to data (or vice versa), which is often used for sampling in continuous-time diffusion models.

3.1.3. Ordinary Differential Equations (ODEs) and Numerical Solvers

  • ODE: An equation relating a function to its derivatives. In continuous-time diffusion models, the process of diffusing or denoising data is often modeled as an ODE. For instance, the PF ODE describes how a data point changes continuously over time.
  • Numerical ODE Solvers: Since analytical solutions to ODEs are often complex or impossible to find, numerical methods (e.g., Euler's method, Heun's method) are used to approximate the solution by taking small, discrete steps.
    • Euler's Method: A first-order numerical procedure for solving ODEs with a given initial value. It approximates the next point using the current point and the derivative at the current point.
    • Heun's Method: A second-order numerical method that improves upon Euler's method by considering both the derivative at the current point and an estimate of the derivative at the next point (predictor-corrector approach). Higher-order solvers generally provide more accurate approximations for a given step size.
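To make the difference between the two solvers concrete, here is a minimal Python sketch (not from the paper) comparing one Euler step with one Heun step on a toy ODE dx/dt = -x; the toy ODE is chosen purely for illustration:

```python
def euler_step(f, x, t, dt):
    """One first-order Euler step for dx/dt = f(x, t)."""
    return x + dt * f(x, t)

def heun_step(f, x, t, dt):
    """One second-order Heun step: Euler predictor, then average the two slopes."""
    x_pred = x + dt * f(x, t)                            # predictor
    return x + 0.5 * dt * (f(x, t) + f(x_pred, t + dt))  # corrector

# Toy ODE dx/dt = -x with exact solution x(t) = x(0) * exp(-t).
f = lambda x, t: -x
print(euler_step(f, 1.0, 0.0, 0.5))  # 0.5
print(heun_step(f, 1.0, 0.0, 0.5))   # 0.625, closer to exp(-0.5) ≈ 0.607
```

For the same step size, the Heun step lands noticeably closer to the exact solution, which is why higher-order solvers are preferred for discretizing the PF ODE.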

3.1.4. Distillation Techniques

Distillation in machine learning refers to transferring knowledge from a larger, more complex model (teacher) to a smaller, simpler model (student) without significant performance loss. In the context of diffusion models, it aims to create faster generative models.

  • Goal: To reduce the number of sampling steps required to generate high-quality samples, thereby speeding up inference.
  • Progressive Distillation: A specific diffusion model distillation technique where a teacher diffusion model (which requires many steps) trains a student model that requires half the number of steps. This process can be repeated to progressively reduce the number of steps to one or two.

3.1.5. Score Function

The score function of a probability density $p(\mathbf{x})$ is defined as $\nabla_{\mathbf{x}} \log p(\mathbf{x})$. It points in the direction of increasing probability density. In score-based generative models (diffusion models), the model learns to estimate this score function at various noise levels to guide the reverse denoising process.

3.1.6. FID (Fréchet Inception Distance)

FID is a metric used to assess the quality of images generated by generative models.

  • Conceptual Definition: FID measures the "distance" between the distribution of generated images and the distribution of real images. It computes the Fréchet distance between two Gaussian distributions fitted to feature representations of the real and generated images, typically extracted from an Inception-v3 model's final pooling layer. A lower FID score indicates better quality and diversity of generated images, implying that the generated distribution is closer to the real data distribution.
  • Mathematical Formula: $ \mathrm{FID} = ||\mu_1 - \mu_2||_2^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}) $
  • Symbol Explanation:
    • $\mu_1$: The mean feature vector of real images.
    • $\mu_2$: The mean feature vector of generated images.
    • $\Sigma_1$: The covariance matrix of feature vectors of real images.
    • $\Sigma_2$: The covariance matrix of feature vectors of generated images.
    • $\|\cdot\|_2^2$: The squared Euclidean distance.
    • $\mathrm{Tr}(\cdot)$: The trace of a matrix.
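To make the formula concrete, here is a minimal NumPy/SciPy sketch (not from the paper) that computes FID between two sets of pre-extracted feature vectors; the Inception-v3 feature extraction step is omitted and the random arrays below are stand-ins:

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two feature sets of shape (num_samples, feat_dim)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)   # matrix square root of Sigma1 * Sigma2
    if np.iscomplexobj(covmean):
        covmean = covmean.real                # discard tiny imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Example with random stand-in features (real use: Inception-v3 pool features).
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(1000, 64)), rng.normal(0.1, 1.0, size=(1000, 64))))
```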

3.1.7. LPIPS (Learned Perceptual Image Patch Similarity)

LPIPS is a metric for measuring the perceptual similarity between two images.

  • Conceptual Definition: Unlike simple pixel-wise metrics (such as $\ell_1$ or $\ell_2$ distance), LPIPS uses features extracted from pre-trained deep neural networks (e.g., VGG, AlexNet) to assess how similar two images are in terms of human perception. It is designed to correlate well with human judgment of image similarity. A lower LPIPS score indicates higher perceptual similarity.
  • Mathematical Formula: $ \mathrm{LPIPS}(\mathbf{x}, \mathbf{x}_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(\mathbf{x})_{h,w} - \phi_l(\mathbf{x}_0)_{h,w} \right) \right\|_2^2 $
  • Symbol Explanation:
    • $\mathbf{x}$, $\mathbf{x}_0$: The two images being compared.
    • $\phi_l$: The feature stack from layer $l$ of a pre-trained network (e.g., AlexNet).
    • $w_l$: A learned scalar weight for each channel.
    • $H_l, W_l$: The height and width of the feature map at layer $l$.
    • $\odot$: Element-wise product.
    • $\|\cdot\|_2^2$: Squared Euclidean distance.
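In practice, LPIPS is usually computed with the reference `lpips` PyPI package rather than reimplemented. A minimal usage sketch, assuming that package and PyTorch are installed and that images are NCHW tensors scaled to [-1, 1] (these conventions are assumptions of this sketch, not details from the paper):

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")        # AlexNet backbone; VGG is another common choice
img0 = torch.rand(1, 3, 64, 64) * 2 - 1  # dummy images in [-1, 1]
img1 = torch.rand(1, 3, 64, 64) * 2 - 1
distance = loss_fn(img0, img1)           # lower = more perceptually similar
print(distance.item())
```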

3.2. Previous Works

The paper frames its work against the backdrop of diffusion models and score-based generative models, highlighting their success but also their computational bottleneck. Key previous works mentioned include:

  • Diffusion Models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; 2020; Ho et al., 2020; Song et al., 2021): These foundational works established the paradigm of diffusion models, showing their ability to generate high-quality samples.
    • Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) demonstrated a practical framework for training and sampling.
    • Score-Based Generative Modeling through Stochastic Differential Equations (SDEs) (Song et al., 2021) generalized diffusion models to continuous time, introducing Probability Flow ODEs for deterministic sampling. This paper directly builds upon the PF ODE concept.
  • Fast Sampling of Diffusion Models:
    • Faster Numerical ODE Solvers: (Song et al., 2020; Zhang & Chen, 2022; Lu et al., 2022; Dockhorn et al., 2022) aim to reduce the number of steps required for sampling by using more efficient numerical integration schemes for the PF ODE. Even with these, typically more than 10 steps are needed for competitive samples.
      • DDIM (Song et al., 2020) proposed a deterministic sampling process for diffusion models, which can accelerate sampling but still requires multiple steps.
      • DPM-Solver (Lu et al., 2022) is a family of fast ODE solvers that significantly reduces the number of sampling steps for diffusion models.
    • Distillation Techniques: (Luhman & Luhman, 2021; Salimans & Ho, 2022; Meng et al., 2022; Zheng et al., 2022) aim to compress the knowledge of a multi-step diffusion model into a few-step or one-step model.
      • Progressive Distillation (PD) (Salimans & Ho, 2022) is highlighted as the most directly comparable approach. It iteratively distills a diffusion model into a new model that requires half the sampling steps. A key aspect is that it does not require pre-generating a large synthetic dataset, unlike some other methods.
      • Other distillation methods like Knowledge Distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022) typically require extensive pre-generated datasets from the teacher model, which is computationally expensive.
  • Other Single-Step Generative Models:
    • GANs (Goodfellow et al., 2014): Generate samples in one pass but are known for training instability and mode collapse.
    • VAEs (Kingma & Welling, 2014; Rezende et al., 2014): Provide a probabilistic framework for generation but often produce blurrier samples than GANs or diffusion models.
    • Normalizing Flows (Dinh et al., 2015; 2017; Kingma & Dhariwal, 2018): Allow exact likelihood computation and efficient sampling but can be architecturally constrained.
    • Rectified Flow (Liu et al., 2022): A recent method for learning continuous-time generative models that can also perform one-step generation.

3.3. Technological Evolution

The evolution of generative models has progressed from early models like GANs and VAEs that offered either high-quality but unstable generation (GANs) or stable but lower-quality generation (VAEs), to Normalizing Flows which provided invertible mappings and exact likelihoods. Diffusion models emerged as a powerful paradigm, achieving state-of-the-art sample quality, surpassing even GANs in many benchmarks. However, this quality came at the cost of slow, iterative inference.

The field then focused on accelerating diffusion models, leading to advancements in faster ODE solvers and distillation techniques. These efforts aimed to reduce the number of inference steps from hundreds/thousands to tens or even single digits, but often faced trade-offs between speed, quality, and training complexity (e.g., needing large pre-generated datasets for distillation).

This paper's work (Consistency Models) fits into this timeline as a significant leap in diffusion model acceleration. It directly addresses the "slow sampling" problem by proposing a method for inherent one-step generation, moving beyond distillation as a secondary process and establishing consistency models as a new, independent family of generative models capable of both fast generation and retaining the critical advantages of diffusion models (multi-step quality trade-off, zero-shot editing) without their native slowness.

3.4. Differentiation Analysis

Compared to main methods in related work, the core differences and innovations of Consistency Models are:

  • Diffusion Models (Standard):

    • Difference: Standard diffusion models generate samples through hundreds or thousands of iterative denoising steps. Consistency Models are designed for one-step generation by directly mapping noise to data.
    • Innovation: Achieves orders of magnitude faster generation while maintaining high quality and retaining diffusion model benefits like zero-shot editing and multi-step quality trade-off.
  • Diffusion Model Distillation (e.g., Progressive Distillation):

    • Difference: Existing distillation methods typically aim to reduce the number of steps of an already trained diffusion model. While Consistency Distillation also distills, Consistency Models introduce a fundamentally new objective function (the consistency loss) that explicitly enforces the self-consistency property crucial for one-step generation. Consistency Training further allows training consistency models from scratch, making them independent.
    • Innovation: Consistency Models consistently outperform prior diffusion distillation techniques in one- and few-step sampling quality. They offer a more robust and effective way to accelerate diffusion-like generation.
  • Other Single-Step Generative Models (GANs, VAEs, Normalizing Flows):

    • Difference: These models are inherently single-step. GANs often suffer from training instability and mode collapse; VAEs generate blurrier images; Normalizing Flows can be architecturally restrictive. Consistency Models offer a new non-adversarial, single-step generative paradigm.
    • Innovation: Consistency Models achieve superior sample quality compared to existing non-adversarial, single-step generative models (VAEs, Normalizing Flows) and competitive quality against GANs on several benchmarks, without the adversarial training complexities. They also uniquely offer multi-step generation for quality improvements and zero-shot data editing, capabilities not typically found in these other single-step paradigms.
  • Rectified Flow:

    • Difference: Rectified Flow also learns continuous-time generative models that can enable one-step generation. However, the consistency training objective directly enforces the self-consistency property on PF ODE trajectories, mapping noisy points to $\mathbf{x}_{\epsilon}$, which is a distinct formulation from Rectified Flow's approach of learning straight paths.

    • Innovation: Consistency Models demonstrate competitive or superior performance to Rectified Flow when trained for one-step generation, particularly for distillation from diffusion models.

      In essence, Consistency Models represent an advancement that marries the quality and flexibility of diffusion models with the speed of single-step generation, surpassing previous efforts in both diffusion acceleration and standalone single-step generation without adversarial training.

4. Methodology

The core methodology of consistency models revolves around learning a consistency function that maps any point on a Probability Flow (PF) ODE trajectory to its origin (the clean data sample). This function is designed to be self-consistent, meaning all points on the same trajectory map to the same origin. The paper introduces two methods for training these models: Consistency Distillation (CD) and Consistency Training (CT).

4.1. Principles

The central idea is rooted in continuous-time diffusion models and their Probability Flow (PF) ODEs. A PF ODE describes a continuous path that transforms data ($\mathbf{x}_0$) into pure noise ($\mathbf{x}_T$) or vice versa. The key insight is that all points along a single trajectory of this PF ODE originate from the same initial data point $\mathbf{x}_0$ (or, more precisely, $\mathbf{x}_{\epsilon}$, where $\epsilon$ is a small positive number chosen to avoid numerical instability).

The consistency function, denoted $f: (\mathbf{x}_t, t) \mapsto \mathbf{x}_{\epsilon}$, formalizes this mapping. The self-consistency property states that for any two points $(\mathbf{x}_t, t)$ and $(\mathbf{x}_{t'}, t')$ on the same PF ODE trajectory, their mappings to the origin must be identical: $f(\mathbf{x}_t, t) = f(\mathbf{x}_{t'}, t')$.

The goal of a consistency model $f_{\theta}$ is to learn to estimate this consistency function from data. By learning this mapping, the model can generate a sample in a single step: start with a random noise vector $\mathbf{x}_T$ and simply evaluate $f_{\theta}(\mathbf{x}_T, T)$ to obtain the data sample $\mathbf{x}_{\epsilon}$. This bypasses the iterative denoising process of traditional diffusion models.

Furthermore, consistency models introduce a boundary condition: $f(\mathbf{x}_{\epsilon}, \epsilon) = \mathbf{x}_{\epsilon}$, meaning that at the smallest time step $\epsilon$ the function acts as the identity mapping, ensuring the model's output is the actual data point. This boundary condition is crucial for successful training and prevents trivial solutions.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Diffusion Models and the Empirical PF ODE

The paper builds on continuous-time diffusion models, which perturb data $\mathbf{x}$ into noisy $\mathbf{x}_t$ using an SDE. The corresponding PF ODE is central to sampling. Using the settings from Karras et al. (2022), with drift coefficient $\mu(\mathbf{x}, t) = \mathbf{0}$ and diffusion coefficient $\sigma(t) = \sqrt{2t}$, the PF ODE for generation is: $ \frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = -t\, \mathbf{s}_{\phi}(\mathbf{x}_t, t) $ Here, $\mathbf{s}_{\phi}(\mathbf{x}, t)$ is a score model that approximates the true score function $\nabla \log p_t(\mathbf{x})$. To generate samples, one typically initializes $\hat{\mathbf{x}}_T \sim \mathcal{N}(\mathbf{0}, T^2 I)$ and solves this empirical PF ODE backwards in time using a numerical ODE solver (e.g., Euler or Heun) to obtain $\hat{\mathbf{x}}_0$. The resulting $\hat{\mathbf{x}}_0$ (or $\hat{\mathbf{x}}_{\epsilon}$) is the generated sample. The iterative nature of these ODE solvers is what makes traditional diffusion model sampling slow.
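A minimal PyTorch-style sketch of this backwards ODE solve is shown below, using Heun's second-order method as in Karras et al. (2022); `score_model` stands in for a trained $\mathbf{s}_{\phi}(\mathbf{x}, t)$ and the decreasing time grid `ts` is assumed to be given:

```python
import torch

@torch.no_grad()
def sample_pf_ode(score_model, shape, ts):
    """Solve dx/dt = -t * s_phi(x, t) backwards along a decreasing time grid
    ts = [T, ..., epsilon] with Heun's second-order (predictor-corrector) method."""
    x = torch.randn(shape) * ts[0]                          # x_T ~ N(0, T^2 I)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        d_cur = -t_cur * score_model(x, t_cur)              # PF ODE derivative at t_cur
        x_euler = x + (t_next - t_cur) * d_cur              # Euler predictor
        d_next = -t_next * score_model(x_euler, t_next)     # derivative at predicted point
        x = x + (t_next - t_cur) * 0.5 * (d_cur + d_next)   # Heun corrector
    return x
```

Each loop iteration is one network evaluation (two for Heun), which is exactly the cost that consistency models aim to remove at sampling time.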

4.2.2. Consistency Model Definition and Parameterization

A consistency function $f: (\mathbf{x}_t, t) \mapsto \mathbf{x}_{\epsilon}$ maps any point on a PF ODE trajectory to its origin. The self-consistency property is $f(\mathbf{x}_t, t) = f(\mathbf{x}_{t'}, t')$ for all $t, t' \in [\epsilon, T]$ on the same trajectory. The goal of a consistency model $f_{\theta}$ is to learn this function.

To enforce the boundary condition $f(\mathbf{x}_{\epsilon}, \epsilon) = \mathbf{x}_{\epsilon}$, two parameterization methods are proposed:

  1. Conditional Definition: $ f_{\theta}(\mathbf{x}, t) = \begin{cases} \mathbf{x} & t = \epsilon \\ F_{\theta}(\mathbf{x}, t) & t \in (\epsilon, T] \end{cases} $ Here, $F_{\theta}(\mathbf{x}, t)$ is a free-form deep neural network. This method explicitly sets the output to $\mathbf{x}$ at $t = \epsilon$.

  2. Skip Connections (used in the experiments): $ f_{\theta}(\mathbf{x}, t) = c_{\mathrm{skip}}(t)\, \mathbf{x} + c_{\mathrm{out}}(t)\, F_{\theta}(\mathbf{x}, t) $ Here, $c_{\mathrm{skip}}(t)$ and $c_{\mathrm{out}}(t)$ are differentiable functions designed such that $c_{\mathrm{skip}}(\epsilon) = 1$ and $c_{\mathrm{out}}(\epsilon) = 0$. This ensures $f_{\theta}(\mathbf{x}, \epsilon) = 1 \cdot \mathbf{x} + 0 \cdot F_{\theta}(\mathbf{x}, \epsilon) = \mathbf{x}$. This parameterization is similar to successful diffusion models and allows for architectural reuse. The paper uses this second method.
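A minimal PyTorch sketch of the skip-connection parameterization is given below; `F_theta` stands in for any free-form network (e.g., a U-Net), and the specific $c_{\mathrm{skip}}$, $c_{\mathrm{out}}$, $\sigma_{\mathrm{data}} = 0.5$, and $\epsilon = 0.002$ follow the modified EDM coefficients quoted later in this section:

```python
import math
import torch

def consistency_forward(F_theta, x: torch.Tensor, t: float,
                        eps: float = 0.002, sigma_data: float = 0.5) -> torch.Tensor:
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t).

    Because c_skip(eps) = 1 and c_out(eps) = 0, f_theta(x, eps) = x exactly,
    which enforces the boundary condition without constraining F_theta.
    """
    c_skip = sigma_data ** 2 / ((t - eps) ** 2 + sigma_data ** 2)
    c_out = sigma_data * (t - eps) / math.sqrt(sigma_data ** 2 + t ** 2)
    return c_skip * x + c_out * F_theta(x, t)
```

The appeal of this choice is that $F_{\theta}$ can reuse standard diffusion model architectures unchanged; only the two scalar coefficients differ.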

4.2.3. Sampling with Consistency Models

  • One-Step Generation: Once $f_{\theta}(\cdot, \cdot)$ is trained, samples are generated by:

    1. Sampling initial noise $\hat{\mathbf{x}}_T \sim \mathcal{N}(\mathbf{0}, T^2 I)$.
    2. Evaluating the consistency model: $\hat{\mathbf{x}}_{\epsilon} = f_{\theta}(\hat{\mathbf{x}}_T, T)$. This involves only one forward pass, making it very fast.
  • Multistep Consistency Sampling (Algorithm 1): For improved sample quality or zero-shot data editing, the model can be evaluated multiple times, alternating denoising and noise-injection steps.

    Algorithm 1: Multistep Consistency Sampling
    Input: consistency model $f_{\theta}(\cdot, \cdot)$, sequence of time points $\tau_1 > \tau_2 > \cdots > \tau_{N-1}$, initial noise $\mathbf{x}_T$

    1. $\mathbf{x} \gets f_{\theta}(\mathbf{x}_T, T)$
    2. for $n = 1$ to $N-1$ do
    3.   Sample $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$
    4.   $\mathbf{x}_{\tau_n} \gets \mathbf{x} + \sqrt{\tau_n^2 - \epsilon^2}\, \mathbf{z}$
    5.   $\mathbf{x} \gets f_{\theta}(\mathbf{x}_{\tau_n}, \tau_n)$
    6. end for
    Output: $\mathbf{x}$

    Note: The $\sqrt{\tau_n^2 - \epsilon^2}$ term in step 4 perturbs the current estimate $\mathbf{x}$ (an approximation of $\mathbf{x}_{\epsilon}$) so that $\mathbf{x}_{\tau_n}$ is a noisy version of it at noise level $\tau_n$; in other words, the estimate is pushed back onto a trajectory at time $\tau_n$ before the consistency model is applied again. The formulation in Appendix D (Algorithm 4) follows the same pattern: add noise to $\mathbf{x}$, then apply $f_{\theta}$ again, similar to a reverse diffusion step.
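A direct transcription of Algorithm 1 into PyTorch-style code might look like the following; `f_theta` is assumed to be a trained consistency model taking `(x, t)`, and `taus` is the decreasing sequence of intermediate time points:

```python
import math
import torch

@torch.no_grad()
def multistep_consistency_sampling(f_theta, shape, T, taus, eps=0.002):
    """Algorithm 1: alternate denoising with f_theta and re-noising to decreasing
    time points taus = [tau_1 > tau_2 > ... > tau_{N-1}], all inside (eps, T)."""
    x = torch.randn(shape) * T          # x_T ~ N(0, T^2 I)
    x = f_theta(x, T)                   # one-step sample (already usable)
    for tau in taus:
        z = torch.randn_like(x)
        x_tau = x + math.sqrt(tau ** 2 - eps ** 2) * z  # push estimate back to noise level tau
        x = f_theta(x_tau, tau)                         # denoise again
    return x
```

With an empty `taus` list this reduces to one-step generation; each additional element trades one more network evaluation for higher sample quality.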

4.2.4. Training Consistency Models via Distillation (Consistency Distillation - CD)

This method distills knowledge from a pre-trained score model sϕ(x,t)\mathbf{s}_{\phi}(\mathbf{x}, t) (from a diffusion model).

  1. Generating Trajectory Pairs: The time horizon $[\epsilon, T]$ is discretized into $N-1$ sub-intervals with boundaries $\epsilon = t_1 < t_2 < \cdots < t_N = T$. The specific time points are $t_i = \left(\epsilon^{1/\rho} + \frac{i-1}{N-1}\left(T^{1/\rho} - \epsilon^{1/\rho}\right)\right)^{\rho}$ with $\rho = 7$. Given a data point $\mathbf{x} \sim p_{\mathrm{data}}$, a pair of adjacent points on a PF ODE trajectory, $(\hat{\mathbf{x}}_{t_n}^{\phi}, \mathbf{x}_{t_{n+1}})$, is generated.
    • $\mathbf{x}_{t_{n+1}}$ is sampled from the SDE transition density: $\mathbf{x}_{t_{n+1}} \sim \mathcal{N}(\mathbf{x}, t_{n+1}^2 I)$, i.e., $\mathbf{x}_{t_{n+1}} = \mathbf{x} + t_{n+1}\mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$.
    • $\hat{\mathbf{x}}_{t_n}^{\phi}$ is obtained by applying one step of a numerical ODE solver (e.g., Euler) backwards from $\mathbf{x}_{t_{n+1}}$ using the pre-trained score model $\mathbf{s}_{\phi}$. For the Euler solver: $ \hat{\mathbf{x}}_{t_n}^{\phi} = \mathbf{x}_{t_{n+1}} - (t_n - t_{n+1})\, t_{n+1}\, \mathbf{s}_{\phi}(\mathbf{x}_{t_{n+1}}, t_{n+1}) $ This equation is based on the empirical PF ODE $\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = -t\, \mathbf{s}_{\phi}(\mathbf{x}_t, t)$, whose estimated derivative at $t_{n+1}$ is $\Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}; \phi) = -t_{n+1}\, \mathbf{s}_{\phi}(\mathbf{x}_{t_{n+1}}, t_{n+1})$.
  2. Consistency Distillation Loss (Definition 1): The consistency model $f_{\theta}$ is trained to minimize the difference between its output for $\mathbf{x}_{t_{n+1}}$ and the output of an Exponential Moving Average (EMA) target network $f_{\theta^-}$ for $\hat{\mathbf{x}}_{t_n}^{\phi}$: $ \mathcal{L}_{\mathrm{CD}}^N(\pmb{\theta}, \pmb{\theta}^-; \phi) := \mathbb{E}\left[ \lambda(t_n)\, d\left( f_{\theta}(\mathbf{x}_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(\hat{\mathbf{x}}_{t_n}^{\phi}, t_n) \right) \right] $
    • $\mathbb{E}[\cdot]$: Expectation over $\mathbf{x} \sim p_{\mathrm{data}}$, $n \sim \mathcal{U}[\![1, N-1]\!]$, and $\mathbf{x}_{t_{n+1}} \sim \mathcal{N}(\mathbf{x}, t_{n+1}^2 I)$.
    • $\lambda(t_n)$: A positive weighting function (often $\lambda(t_n) \equiv 1$).
    • $d(\cdot, \cdot)$: A metric function (e.g., squared $\ell_2$ distance, $\ell_1$ distance, LPIPS).
    • $\pmb{\theta}$: Parameters of the online network $f_{\theta}$.
    • $\pmb{\theta}^-$: Parameters of the target network $f_{\theta^-}$, updated as an EMA of $\pmb{\theta}$: $\pmb{\theta}^- \gets \mathrm{stopgrad}(\mu \pmb{\theta}^- + (1-\mu)\pmb{\theta})$.
  3. Theorem 1 (Asymptotic Accuracy): If the loss $\mathcal{L}_{\mathrm{CD}}^N(\pmb{\theta}, \pmb{\theta}; \phi) = 0$, then the estimated consistency model $f_{\theta}$ can be arbitrarily accurate with respect to the true consistency function $f(\cdot, \cdot; \phi)$ of the empirical PF ODE. The error is $O((\Delta t)^p)$, where $\Delta t$ is the maximum step size and $p$ is the order of the ODE solver. This justifies the distillation process: by minimizing the loss, $f_{\theta}$ learns to map adjacent points on an ODE trajectory to the same origin, thereby enforcing self-consistency.
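Putting these pieces together, one consistency distillation training step could be sketched as below. This is a simplified illustration, not the paper's implementation: `f_theta`, `f_theta_ema`, and `score_model` are assumed callables, `ts` is the precomputed Karras time grid, the weighting is $\lambda \equiv 1$, and squared $\ell_2$ is used in place of LPIPS for brevity:

```python
import random
import torch

def cd_loss(f_theta, f_theta_ema, score_model, x, ts):
    """Consistency distillation loss for one batch, with a one-step Euler teacher solve.
    x: batch of clean data; ts: increasing time grid [t_1 = eps, ..., t_N = T]."""
    n = random.randrange(len(ts) - 1)                 # pick an adjacent pair (t_n, t_{n+1})
    t_n, t_np1 = ts[n], ts[n + 1]
    z = torch.randn_like(x)
    x_np1 = x + t_np1 * z                             # x_{t_{n+1}} ~ N(x, t_{n+1}^2 I)
    with torch.no_grad():
        # one Euler step of the empirical PF ODE, from t_{n+1} back towards t_n
        x_n_hat = x_np1 - (t_n - t_np1) * t_np1 * score_model(x_np1, t_np1)
        target = f_theta_ema(x_n_hat, t_n)            # EMA target network output
    online = f_theta(x_np1, t_np1)
    return ((online - target) ** 2).mean()            # lambda(t_n) = 1, d = squared l2
```

After each gradient step on $\pmb{\theta}$, the target parameters are updated as $\pmb{\theta}^- \gets \mathrm{stopgrad}(\mu \pmb{\theta}^- + (1-\mu)\pmb{\theta})$.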

4.2.5. Training Consistency Models in Isolation (Consistency Training - CT)

This method trains consistency models without a pre-trained diffusion model. It leverages the following identity for the score function: $ \nabla \log p_t(\mathbf{x}_t) = -\mathbb{E}\left[ \frac{\mathbf{x}_t - \mathbf{x}}{t^2} \,\middle|\, \mathbf{x}_t \right] $ This means that given a data point $\mathbf{x}$ and its noisy version $\mathbf{x}_t \sim \mathcal{N}(\mathbf{x}, t^2 I)$, we can estimate the score function $\nabla \log p_t(\mathbf{x}_t)$ with $-(\mathbf{x}_t - \mathbf{x})/t^2$. Using this estimate, the consistency training (CT) loss is defined as: $ \mathcal{L}_{\mathrm{CT}}^N(\pmb{\theta}, \pmb{\theta}^-) := \mathbb{E}\left[ \lambda(t_n)\, d\left( f_{\theta}(\mathbf{x} + t_{n+1}\mathbf{z}, t_{n+1}),\; f_{\theta^-}(\mathbf{x} + t_n\mathbf{z}, t_n) \right) \right] $

  • Here, $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$. The terms $\mathbf{x} + t_{n+1}\mathbf{z}$ and $\mathbf{x} + t_n\mathbf{z}$ represent noisy versions of the original data point $\mathbf{x}$ at noise levels $t_{n+1}$ and $t_n$, sharing the same noise vector $\mathbf{z}$; they implicitly approximate adjacent points on an ODE trajectory stemming from $\mathbf{x}$.
  • Theorem 2: For the Euler ODE solver and in the limit $N \to \infty$ (i.e., $\Delta t \to 0$), the consistency distillation loss $\mathcal{L}_{\mathrm{CD}}^N$ (with a ground-truth score model) approaches the consistency training loss $\mathcal{L}_{\mathrm{CT}}^N$ up to an $o(\Delta t)$ term. This justifies CT as a standalone training objective.
  • Progressive $N$ and $\mu$: For good practical performance, CT uses a progressively increasing number of discretization steps $N$ and an adaptive EMA decay rate $\mu$ during training. A small $N$ facilitates faster convergence early on (less variance, more bias), while a larger $N$ is preferred later for better sample quality (more variance, less bias).
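The consistency training loss differs from the distillation loss only in how the adjacent pair is formed: the same noise vector is reused at both noise levels and no score model is needed. A minimal sketch under the same assumptions (and simplifications) as the distillation sketch above:

```python
import random
import torch

def ct_loss(f_theta, f_theta_ema, x, ts):
    """Consistency training loss for one batch (no pre-trained diffusion model needed).
    x: batch of clean data; ts: increasing time grid [t_1 = eps, ..., t_N = T]."""
    n = random.randrange(len(ts) - 1)
    t_n, t_np1 = ts[n], ts[n + 1]
    z = torch.randn_like(x)                           # shared noise for both levels
    online = f_theta(x + t_np1 * z, t_np1)            # noisier point, online network
    with torch.no_grad():
        target = f_theta_ema(x + t_n * z, t_n)        # less noisy point, EMA target network
    return ((online - target) ** 2).mean()            # lambda = 1, d = squared l2
```

In the paper, `len(ts)` (i.e., $N$) and the EMA rate $\mu$ are not fixed but follow schedules that increase $N$ and adapt $\mu$ over the course of training.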

4.2.6. Continuous-Time Extensions (Appendix B)

The discrete consistency distillation and consistency training objectives can be extended to continuous-time limits ($N \to \infty$). These continuous-time objectives do not require specifying $N$ or discrete time steps and are analyzed for different scenarios ($\pmb{\theta}^- = \pmb{\theta}$ or $\pmb{\theta}^- = \mathrm{stopgrad}(\pmb{\theta})$). They involve Jacobian-vector products and require forward-mode automatic differentiation.

  • Consistency Distillation in Continuous Time:
    • When $\pmb{\theta}^- = \pmb{\theta}$ (no stopgrad): For $d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$, the continuous loss $\mathcal{L}_{\mathrm{CD}}^{\infty}(\pmb{\theta}, \pmb{\theta}; \phi)$ (derived in Theorem 3) involves the derivative of $f_{\theta}$ with respect to time and its Jacobian with respect to $\mathbf{x}_t$, ensuring $f_{\theta}$ satisfies the PF ODE: $ \mathcal{L}_{\mathrm{CD}}^{\infty}(\pmb{\theta}, \pmb{\theta}; \phi) = \mathbb{E}\left[ \frac{\lambda(t)}{[(\tau^{-1})'(t)]^2} \left\| \frac{\partial f_{\theta}(\mathbf{x}_t, t)}{\partial t} - t\, \frac{\partial f_{\theta}(\mathbf{x}_t, t)}{\partial \mathbf{x}_t}\, \mathbf{s}_{\phi}(\mathbf{x}_t, t) \right\|_2^2 \right] $ Here, $\tau(u)$ is a strictly monotonic function mapping $[0, 1]$ to $[\epsilon, T]$, used to define $t = \tau(u)$. This loss is minimized if and only if $f_{\theta}$ perfectly matches the consistency function of the empirical PF ODE.
    • When $\pmb{\theta}^- = \mathrm{stopgrad}(\pmb{\theta})$: A "pseudo-objective" $\mathcal{L}_{\mathrm{CD}}^{\infty}(\pmb{\theta}, \pmb{\theta}^-; \phi)$ is derived (Theorem 5) whose gradient in the limit matches the gradient of the discrete loss. This objective is important for practical training with EMA.
  • Consistency Training in Continuous Time:
    • When $\pmb{\theta}^- = \mathrm{stopgrad}(\pmb{\theta})$: A continuous pseudo-objective $\mathcal{L}_{\mathrm{CT}}^{\infty}(\pmb{\theta}, \pmb{\theta}^-)$ is derived (Theorem 6) that does not depend on $\phi$ (the diffusion model parameters) and can be optimized directly. For $d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$: $ \mathcal{L}_{\mathrm{CT}}^{\infty}(\pmb{\theta}, \pmb{\theta}^-) = 2\, \mathbb{E}\left[ \frac{\lambda(t)}{(\tau^{-1})'(t)}\, f_{\theta}(\mathbf{x}_t, t)^{\top} \left( \frac{\partial f_{\theta^-}(\mathbf{x}_t, t)}{\partial t} + \frac{\partial f_{\theta^-}(\mathbf{x}_t, t)}{\partial \mathbf{x}_t} \cdot \frac{\mathbf{x}_t - \mathbf{x}}{t} \right) \right] $ This objective allows consistency models to be trained from scratch.

4.2.7. Zero-Shot Data Editing (Algorithm 4)

Consistency Models leverage the multistep sampling procedure (Algorithm 1) to perform various zero-shot data editing tasks, similar to diffusion models. The core idea is an iterative replacement procedure where parts of the sample are updated based on the task constraints.

Algorithm 4: Zero-Shot Image Editing
Input: consistency model $f_{\theta}(\cdot, \cdot)$, sequence of time points $t_1 > t_2 > \cdots > t_N$, reference image $\mathbf{y}$, invertible linear transformation $\mathbf{A}$, and binary image mask $\mathbf{\Omega}$

  1. $\mathbf{y} \gets \mathbf{A}^{-1}\left[ (\mathbf{A}\mathbf{y}) \odot (1 - \mathbf{\Omega}) + \mathbf{0} \odot \mathbf{\Omega} \right]$
    • This step transforms the reference image $\mathbf{y}$ and zeroes out the masked parts (according to $\mathbf{\Omega}$) in the transformed space.
  2. Sample $\mathbf{x} \sim \mathcal{N}(\mathbf{y}, t_1^2 I)$
    • The initial noisy input $\mathbf{x}$ is sampled centered at the (partially masked) reference image $\mathbf{y}$.
  3. $\mathbf{x} \gets f_{\theta}(\mathbf{x}, t_1)$
    • The consistency model produces an initial denoised estimate.
  4. $\mathbf{x} \gets \mathbf{A}^{-1}\left[ (\mathbf{A}\mathbf{y}) \odot (1 - \mathbf{\Omega}) + (\mathbf{A}\mathbf{x}) \odot \mathbf{\Omega} \right]$
    • The "replacement" step: in the transformed space, the known (unmasked) parts selected by $(1 - \mathbf{\Omega})$ are taken from the reference $\mathbf{y}$, while the masked parts selected by $\mathbf{\Omega}$ are kept from the model's generation. This enforces consistency with the reference image.
  5. for $n = 2$ to $N$ do
  6.   Sample $\mathbf{x} \sim \mathcal{N}(\mathbf{x}, (t_n^2 - \epsilon^2) I)$
    • Noise with variance $t_n^2 - \epsilon^2$ is added to the current estimate, effectively "pushing" $\mathbf{x}$ back onto a noisy trajectory at time $t_n$.
  7.   $\mathbf{x} \gets f_{\theta}(\mathbf{x}, t_n)$
    • The consistency model is applied again to map the perturbed $\mathbf{x}$ back to an estimate of $\mathbf{x}_{\epsilon}$.
  8.   $\mathbf{x} \gets \mathbf{A}^{-1}\left[ (\mathbf{A}\mathbf{y}) \odot (1 - \mathbf{\Omega}) + (\mathbf{A}\mathbf{x}) \odot \mathbf{\Omega} \right]$
    • Another replacement step to enforce consistency with the reference $\mathbf{y}$.
  9. end for
  10. Output: $\mathbf{x}$
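For the simplest case, inpainting with $\mathbf{A}$ equal to the identity, Algorithm 4 reduces to the short loop sketched below (a minimal PyTorch-style illustration, not the paper's code); `f_theta` is a trained consistency model, and `mask` is assumed to be 1 where pixels must be generated and 0 where the reference is kept:

```python
import math
import torch

@torch.no_grad()
def zero_shot_inpaint(f_theta, y, mask, ts, eps=0.002):
    """Algorithm 4 with A = identity: iteratively denoise, then re-impose known pixels.
    y: reference image with holes; mask: 1 = generate, 0 = keep; ts: decreasing times."""
    y = y * (1 - mask)                          # zero out the unknown region
    x = y + ts[0] * torch.randn_like(y)         # x ~ N(y, t_1^2 I)
    x = f_theta(x, ts[0])
    x = y * (1 - mask) + x * mask               # replacement step
    for t in ts[1:]:
        x = x + math.sqrt(t ** 2 - eps ** 2) * torch.randn_like(x)  # re-noise to level t
        x = f_theta(x, t)
        x = y * (1 - mask) + x * mask           # enforce consistency with the reference
    return x
```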

Specific task implementations using Algorithm 4:

  • Inpainting: $\mathbf{A}$ is the identity transformation, $\mathbf{y}$ is the image with missing pixels masked out, and $\mathbf{\Omega}$ marks the missing pixels. The algorithm iteratively fills in the masked regions while keeping the known parts fixed.
  • Colorization: $\mathbf{y}$ is the grayscale image, $\mathbf{\Omega}$ is a mask that keeps the luminance channel fixed but allows changes in the chrominance channels, and $\mathbf{A}$ is a transformation from RGB to a color space like YCbCr, where Y is luminance and Cb/Cr are chrominance. The algorithm generates color information while preserving the original luminance.
  • Super-resolution: $\mathbf{y}$ is the low-resolution image, $\mathbf{\Omega}$ is a mask that preserves the information from the downsampled patches, and $\mathbf{A}$ is a transformation that encodes the averaging operation used for downsampling. The algorithm reconstructs high-frequency details.
  • Stroke-guided image generation (SDEdit): $\mathbf{y}$ is a stroke painting, $\mathbf{A}$ is the identity, and $\mathbf{\Omega}$ is a matrix of ones (meaning all pixels can be modified, but the generation is guided by the starting noise perturbed by the stroke). This allows the model to generate an image based on simple user strokes.
  • Denoising: If an image $\mathbf{x}$ is perturbed by noise $\mathcal{N}(\mathbf{0}, \sigma^2 I)$ with $\sigma \in [\epsilon, T]$, then simply evaluating $f_{\theta}(\mathbf{x}, \sigma)$ can produce the denoised image.
  • Interpolation: If $\mathbf{x}_1 = f_{\theta}(\mathbf{z}_1, T)$ and $\mathbf{x}_2 = f_{\theta}(\mathbf{z}_2, T)$, then spherical linear interpolation between $\mathbf{z}_1$ and $\mathbf{z}_2$ to get $\mathbf{z}$ (i.e., $\mathbf{z} = \frac{\sin[(1-\alpha)\psi]}{\sin(\psi)}\mathbf{z}_1 + \frac{\sin(\alpha\psi)}{\sin(\psi)}\mathbf{z}_2$), followed by evaluating $f_{\theta}(\mathbf{z}, T)$, yields an interpolation between $\mathbf{x}_1$ and $\mathbf{x}_2$ (see the sketch after this list).
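The spherical linear interpolation used in the last item can be implemented directly from the formula; a small NumPy sketch (a hypothetical helper, not from the paper's code) is:

```python
import numpy as np

def slerp(z1: np.ndarray, z2: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical linear interpolation between two noise vectors z1 and z2."""
    cos_psi = np.dot(z1.ravel(), z2.ravel()) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    psi = np.arccos(np.clip(cos_psi, -1.0, 1.0))          # angle between the two vectors
    return (np.sin((1 - alpha) * psi) * z1 + np.sin(alpha * psi) * z2) / np.sin(psi)

# Evaluating f_theta(slerp(z1, z2, alpha), T) for alpha swept over [0, 1]
# interpolates between the two corresponding one-step samples.
```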

Example of $c_{\mathrm{skip}}(t)$ and $c_{\mathrm{out}}(t)$ used in the experiments: the paper modifies the corresponding terms from EDM (Karras et al., 2022) to satisfy the boundary condition: $ c_{\mathrm{skip}}(t) = \frac{\sigma_{\mathrm{data}}^2}{(t - \epsilon)^2 + \sigma_{\mathrm{data}}^2}, \quad c_{\mathrm{out}}(t) = \frac{\sigma_{\mathrm{data}}\,(t - \epsilon)}{\sqrt{\sigma_{\mathrm{data}}^2 + t^2}} $ where $\sigma_{\mathrm{data}} = 0.5$. With these choices, $c_{\mathrm{skip}}(\epsilon) = 1$ and $c_{\mathrm{out}}(\epsilon) = 0$, ensuring $f_{\theta}(\mathbf{x}, \epsilon) = \mathbf{x}$.

5. Experimental Setup

5.1. Datasets

The experiments were conducted on several image datasets to evaluate the performance of consistency models.

  • CIFAR-10 (Krizhevsky et al., 2009):

    • Description: A widely used dataset for image classification, consisting of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

    • Characteristics: Small image size, diverse object categories (e.g., airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).

    • Why chosen: A common benchmark for generative models, allowing for direct comparison with many existing methods due to its moderate complexity.

    • Example data sample: Figure 14 from the paper shows uncurated samples from CIFAR-10 32x32.


      Figure 14 (a) EDM samples (FID=2.04) shows examples of various animals and vehicles.

  • ImageNet 64x64 (Deng et al., 2009):

    • Description: A subset of the larger ImageNet dataset, downsampled to 64x64 pixels. It contains millions of images across 1000 categories.

    • Characteristics: Higher resolution and more diverse content than CIFAR-10, representing a more challenging image generation task.

    • Why chosen: A standard benchmark for evaluating high-resolution image generation capabilities of generative models.

    • Example data sample: Figure 15 from the paper shows uncurated samples from ImageNet 64x64.


      Figure 15 (a) EDM samples (FID=2.44) depicts examples of various animals and objects.

  • LSUN Bedroom 256x256 (Yu et al., 2015):

    • Description: Part of the Large-scale Scene Understanding (LSUN) dataset, focusing specifically on bedroom scenes. Images are 256x256 pixels.

    • Characteristics: High resolution, complex indoor scenes, exhibiting structural regularity but also significant stylistic variation.

    • Why chosen: Tests the model's ability to generate coherent and realistic complex scenes at higher resolutions.

    • Example data sample: Figure 16 from the paper shows uncurated samples from LSUN Bedroom 256x256.


      Figure 18 (a) EDM samples (FID=3.57) illustrates different bedroom layouts and decor.

  • LSUN Cat 256x256 (Yu et al., 2015):

    • Description: Another subset of the LSUN dataset, featuring images of cats at 256x256 pixels.

    • Characteristics: High resolution, focusing on a specific object category, which often requires capturing fine details and variations in pose and appearance.

    • Why chosen: Provides a challenge in generating detailed and realistic images of a single, complex object class.

    • Example data sample: Figure 20 from the paper shows uncurated samples from LSUN Cat 256x256.


      Figure 20 (a) EDM samples (FID=6.69) displays various cat images.

5.2. Evaluation Metrics

The evaluation metrics used in the paper primarily focus on image quality and diversity, common in generative modeling.

5.2.1. Fréchet Inception Distance (FID)

  • Conceptual Definition: FID measures the similarity between the distribution of generated images and the distribution of real images. It calculates the Fréchet distance between two Gaussian distributions, which are fitted to the feature representations of real and generated images (typically extracted from the Inception-v3 model). A lower FID score indicates higher quality and more diverse generated images, meaning the generated distribution is closer to the real data distribution.
  • Mathematical Formula: $ \mathrm{FID} = ||\mu_1 - \mu_2||_2^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}) $
  • Symbol Explanation:
    • $\mu_1$: The mean feature vector of real images.
    • $\mu_2$: The mean feature vector of generated images.
    • $\Sigma_1$: The covariance matrix of feature vectors of real images.
    • $\Sigma_2$: The covariance matrix of feature vectors of generated images.
    • $\|\cdot\|_2^2$: The squared Euclidean distance between the mean vectors.
    • $\mathrm{Tr}(\cdot)$: The trace of a matrix, i.e., the sum of its diagonal elements.

5.2.2. Inception Score (IS)

  • Conceptual Definition: IS evaluates two aspects of generated images: image quality (how recognizable objects are in the generated images, measured by the confidence of an Inception-v3 classifier) and image diversity (whether the generative model produces a wide variety of images, measured by the entropy of the marginal class distribution). A higher Inception Score indicates better image quality and diversity.
  • Mathematical Formula: $ \mathrm{IS}(\mathcal{G}) = \exp\left( \mathbb{E}_{\mathbf{x} \sim \mathcal{G}} \left[ D_{\mathrm{KL}}\big(p(y|\mathbf{x}) \,\|\, p(y)\big) \right] \right) $
  • Symbol Explanation:
    • $\mathcal{G}$: The generative model.
    • $\mathbf{x} \sim \mathcal{G}$: An image generated by the model.
    • $p(y|\mathbf{x})$: The conditional class distribution given a generated image $\mathbf{x}$, obtained from a pre-trained Inception-v3 network. This reflects the quality (recognizability) of the image.
    • $p(y)$: The marginal class distribution, averaged over all generated images. This reflects the diversity of the images.
    • $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$: The Kullback-Leibler divergence.
    • $\mathbb{E}[\cdot]$: Expectation.
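Given the class probabilities $p(y|\mathbf{x})$ from Inception-v3 for a set of generated images, the score can be computed in a few lines; the NumPy sketch below (not from the paper) assumes these probabilities are already collected into a (num_samples, num_classes) array:

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ) for rows of class probabilities."""
    p_y = probs.mean(axis=0, keepdims=True)                         # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# If every image receives the same confident prediction, p(y|x) = p(y) and IS = 1;
# confident predictions spread over many classes push the score toward the number of classes.
```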

5.2.3. Precision (Prec.)

  • Conceptual Definition: Precision measures how well the generated samples align with the real data distribution. High precision means that most generated samples are similar to real samples, indicating good fidelity. It quantifies the proportion of generated samples that fall within the support of the real data distribution.
  • Mathematical Formula: $ \mathrm{Precision} = \frac{|\mathcal{S}_g \cap \mathcal{S}_r|}{|\mathcal{S}_g|} $
  • Symbol Explanation:
    • $\mathcal{S}_g$: The set of generated samples in the feature space.
    • $\mathcal{S}_r$: The set of real samples in the feature space.
    • $|\cdot|$: Cardinality of the set.
    • The intersection $\mathcal{S}_g \cap \mathcal{S}_r$ typically means that samples from $\mathcal{S}_g$ are "close enough" to samples in $\mathcal{S}_r$ (e.g., within a certain distance in feature space).

5.2.4. Recall (Rec.)

  • Conceptual Definition: Recall measures how well the generated samples cover the entire real data distribution. High recall means that the generated samples cover most of the modes (variations) present in the real data, indicating good diversity. It quantifies the proportion of real samples that fall within the support of the generated data distribution.
  • Mathematical Formula: $ \mathrm{Recall} = \frac{|\mathcal{S}_r \cap \mathcal{S}_g|}{|\mathcal{S}_r|} $
  • Symbol Explanation:
    • $\mathcal{S}_r$: The set of real samples in the feature space.
    • $\mathcal{S}_g$: The set of generated samples in the feature space.
    • $|\cdot|$: Cardinality of the set.
    • The intersection $\mathcal{S}_r \cap \mathcal{S}_g$ typically means that samples from $\mathcal{S}_r$ are "close enough" to samples in $\mathcal{S}_g$.

5.3. Baselines

The paper compares consistency models against various categories of generative models and acceleration techniques.

5.3.1. Diffusion Models with Fast Samplers

These are standard diffusion models that use optimized ODE/SDE solvers to reduce the number of sampling steps, but still require multiple steps.

  • DDIM (Song et al., 2020)
  • DPM-solver-2 (Lu et al., 2022)
  • DPM-solver-fast (Lu et al., 2022)
  • 3-DEIS (Zhang & Chen, 2022)
  • Score SDE (Song et al., 2021)
  • DDPM (Ho et al., 2020)
  • LSGM (Vahdat et al., 2021)
  • PFGM (Xu et al., 2022)
  • EDM (Karras et al., 2022): The diffusion model that consistency models are distilled from in consistency distillation experiments, and also used as a baseline for its few-step sampling performance.

5.3.2. Diffusion Model Distillation Techniques

These methods aim to accelerate diffusion models by compressing them into fewer steps.

  • Knowledge Distillation* (Luhman & Luhman, 2021): Requires synthetic data generation.
  • DFNO* (Zheng et al., 2022): Requires synthetic data generation.
  • 1/2/3-Rectified Flow (+distill)* (Liu et al., 2022): Also requires synthetic data generation.
  • Progressive Distillation (PD) (Salimans & Ho, 2022): Directly comparable, as it does not require synthetic data generation prior to distillation. This is a key comparison point for consistency distillation.

5.3.3. Direct Generation (One-Step) Models

These are models designed to generate samples in a single forward pass.

  • GANs:
    • BigGAN (Brock et al., 2019)
    • Diffusion GAN (Xiao et al., 2022)
    • AutoGAN (Gong et al., 2019)
    • E2GAN (Tian et al., 2020)
    • ViTGAN (Lee et al., 2021)
    • TransGAN (Jiang et al., 2021)
    • StyleGAN2-ADA (Karras et al., 2020)
    • StyleGAN-XL (Sauer et al., 2022)
    • PGGAN (Karras et al., 2018)
    • PG-SWGAN (Wu et al., 2019)
    • TDPM (GAN) (Zheng et al., 2023)
  • VAEs and Normalizing Flows:
    • 1-Rectified Flow (Liu et al., 2022)

    • Glow (Kingma & Dhariwal, 2018)

    • Residual Flow (Chen et al., 2019)

    • GLFlow (Xiao et al., 2019)

    • DenseFlow (Grcić et al., 2021)

    • DC-VAE (Parmar et al., 2021)

      The choice of baselines covers a broad spectrum of generative modeling, from state-of-the-art diffusion models with optimized samplers, to various diffusion distillation techniques, and leading one-step generative models (both adversarial and non-adversarial). This allows for a comprehensive evaluation of consistency models in terms of speed, quality, and training methodology.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that consistency models significantly advance the state-of-the-art in few-step image generation, both as a distillation method and as a standalone generative model.

6.1.1. Few-Step Image Generation (Distillation)

The paper extensively compares Consistency Distillation (CD) with Progressive Distillation (PD), which is the most directly comparable method because neither requires synthetic data generation prior to distillation.

The following are the results from Tables 1 and 2 of the original paper:

Table 1: Sample quality on CIFAR-10.

| METHOD | NFE (↓) | FID (↓) | IS (↑) |
| --- | --- | --- | --- |
| **Diffusion + Samplers** | | | |
| DDIM (Song et al., 2020) | 50 | 4.67 | |
| DDIM (Song et al., 2020) | 20 | 6.84 | |
| DDIM (Song et al., 2020) | 10 | 8.23 | |
| DPM-solver-2 (Lu et al., 2022) | 10 | 5.94 | |
| DPM-solver-fast (Lu et al., 2022) | 10 | 4.70 | |
| 3-DEIS (Zhang & Chen, 2022) | 10 | 4.17 | |
| **Diffusion + Distillation** | | | |
| Knowledge Distillation* (Luhman & Luhman, 2021) | 1 | 9.36 | |
| DFNO* (Zheng et al., 2022) | 1 | 4.12 | |
| 1-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 6.18 | 9.08 |
| 2-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 4.85 | 9.01 |
| 3-Rectified Flow (+distill)* (Liu et al., 2022) | 1 | 5.21 | 8.79 |
| PD (Salimans & Ho, 2022) | 1 | 8.34 | 8.69 |
| CD | 1 | 3.55 | 9.48 |
| PD (Salimans & Ho, 2022) | 2 | 5.58 | 9.05 |
| CD | 2 | 2.93 | 9.75 |
| **Direct Generation** | | | |
| BigGAN (Brock et al., 2019) | 1 | 14.7 | 9.22 |
| Diffusion GAN (Xiao et al., 2022) | 1 | 14.6 | 8.93 |
| AutoGAN (Gong et al., 2019) | 1 | 12.4 | 8.55 |
| E2GAN (Tian et al., 2020) | 1 | 11.3 | 8.51 |
| ViTGAN (Lee et al., 2021) | 1 | 6.66 | 9.30 |
| TransGAN (Jiang et al., 2021) | 1 | 9.26 | 9.05 |
| StyleGAN2-ADA (Karras et al., 2020) | 1 | 2.92 | 9.83 |
| StyleGAN-XL (Sauer et al., 2022) | 1 | 1.85 | |
| Score SDE (Song et al., 2021) | 2000 | 2.20 | 9.89 |
| DDPM (Ho et al., 2020) | 1000 | 3.17 | 9.46 |
| LSGM (Vahdat et al., 2021) | 147 | 2.10 | |
| PFGM (Xu et al., 2022) | 110 | 2.35 | 9.68 |
| EDM (Karras et al., 2022) | 35 | 2.04 | 9.84 |
| 1-Rectified Flow (Liu et al., 2022) | 1 | 378 | 1.13 |
| Glow (Kingma & Dhariwal, 2018) | 1 | 48.9 | 3.92 |
| Residual Flow (Chen et al., 2019) | 1 | 46.4 | |
| GLFlow (Xiao et al., 2019) | 1 | 44.6 | |
| DenseFlow (Grcić et al., 2021) | 1 | 34.9 | |
| DC-VAE (Parmar et al., 2021) | 1 | 17.9 | 8.20 |
| CT | 1 | 8.70 | 8.49 |
| CT | 2 | 5.83 | 8.85 |

Table 2: Sample quality on ImageNet 64x64, and LSUN Bedroom & Cat 256x256.

| METHOD | NFE (↓) | FID (↓) | Prec. (↑) | Rec. (↑) |
| --- | --- | --- | --- | --- |
| **ImageNet 64x64** | | | | |
| PD† (Salimans & Ho, 2022) | 1 | 15.39 | 0.59 | 0.62 |
| DFNO† (Zheng et al., 2022) | 1 | 8.35 | | |
| CD† | 1 | 6.20 | 0.68 | 0.63 |
| PD† (Salimans & Ho, 2022) | 2 | 8.95 | 0.63 | 0.65 |
| CD | 2 | 4.70 | 0.69 | 0.64 |
| ADM (Dhariwal & Nichol, 2021) | 250 | 2.07 | 0.74 | 0.63 |
| EDM (Karras et al., 2022) | 79 | 2.44 | 0.71 | 0.67 |
| BigGAN-deep (Brock et al., 2019) | 1 | 4.06 | 0.79 | 0.48 |
| CT | 1 | 13.0 | 0.71 | 0.47 |
| CT | 2 | 11.1 | 0.69 | 0.56 |
| **LSUN Bedroom 256x256** | | | | |
| PD† (Salimans & Ho, 2022) | 1 | 16.92 | 0.47 | 0.27 |
| PD† (Salimans & Ho, 2022) | 2 | 8.47 | 0.56 | 0.39 |
| CD† | 1 | 7.80 | 0.66 | 0.34 |
| CD† | 2 | 5.22 | 0.68 | 0.39 |
| DDPM (Ho et al., 2020) | 1000 | 4.89 | 0.60 | 0.45 |
| ADM (Dhariwal & Nichol, 2021) | 1000 | 1.90 | 0.66 | 0.51 |
| EDM (Karras et al., 2022) | 79 | 3.57 | 0.66 | 0.45 |
| PGGAN (Karras et al., 2018) | 1 | 8.34 | | |
| PG-SWGAN (Wu et al., 2019) | 1 | 8.0 | | |
| TDPM (GAN) (Zheng et al., 2023) | 1 | 5.24 | | |
| StyleGAN2 (Karras et al., 2020) | 1 | 2.35 | 0.59 | 0.48 |
| CT | 1 | 16.0 | 0.60 | 0.17 |
| CT | 2 | 7.85 | 0.68 | 0.33 |
| **LSUN Cat 256x256** | | | | |
| PD† (Salimans & Ho, 2022) | 1 | 29.6 | 0.51 | 0.25 |
| PD† (Salimans & Ho, 2022) | 2 | 15.5 | 0.59 | 0.36 |
| CD† | 1 | 11.0 | 0.65 | 0.36 |
| CD† | 2 | 8.84 | 0.66 | 0.40 |
| DDPM (Ho et al., 2020) | 1000 | 17.1 | 0.53 | 0.48 |
| ADM (Dhariwal & Nichol, 2021) | 1000 | 5.57 | 0.63 | 0.52 |
| EDM (Karras et al., 2022) | 79 | 6.69 | 0.70 | 0.43 |
| PGGAN (Karras et al., 2018) | 1 | 37.5 | | |
| StyleGAN2 (Karras et al., 2020) | 1 | 7.25 | 0.58 | 0.43 |
| CT | 1 | 20.7 | 0.56 | 0.23 |
| CT | 2 | 11.7 | 0.63 | 0.36 |

Analysis:

  • Superiority over Progressive Distillation (PD): CD consistently outperforms PD across all datasets (CIFAR-10, ImageNet 64x64, LSUN Bedroom, LSUN Cat) for both one-step and two-step generation.
    • On CIFAR-10, CD achieves an FID of 3.55 (1-step) vs. PD's 8.34, and 2.93 (2-step) vs. PD's 5.58.
    • On ImageNet 64x64, CD achieves an FID of 6.20 (1-step) vs. PD's 15.39, and 4.70 (2-step) vs. PD's 8.95.
    • Similar significant improvements are observed for the LSUN Bedroom and Cat datasets, with the only exception being single-step generation on LSUN Bedroom, where CD with the $\ell_2$ metric slightly underperforms PD with $\ell_2$.
  • Outperforming other distillation methods: CD also outperforms other distillation approaches that require prior synthetic dataset construction (e.g., Knowledge Distillation, DFNO, Rectified Flow (+distill)) in one-step generation on CIFAR-10.
  • Impact of LPIPS Metric: The paper notes that using the LPIPS metric universally improves PD's performance compared to the squared $\ell_2$ distance used in its original paper, suggesting LPIPS is a better metric for training generative models. CD also leverages LPIPS for optimal results.
  • Compute-Quality Trade-off: Both PD and CD show improved FID scores as the number of sampling steps (NFE) increases, demonstrating the ability to trade compute for sample quality. CD maintains its lead at higher NFE.
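
The compute-for-quality trade-off comes directly from the multistep sampling procedure (Algorithm 1 in the paper): each extra step re-noises the current sample to a smaller noise level and applies the consistency function again. Below is a minimal sketch of that loop, assuming a trained consistency function `f(x, t)` and an EDM-style noise range; the function, schedule values, and dummy model are placeholders, not the paper's exact implementation.

```python
import torch

def multistep_consistency_sampling(f, shape, taus, T=80.0, eps=0.002, device="cpu"):
    """Sketch of multistep consistency sampling.

    f     : trained consistency function f(x, t) mapping a noisy sample at
            noise level t directly to an estimate of the clean sample.
    taus  : decreasing sequence of intermediate noise levels tau_1 > ... > eps.
    T, eps: maximum and minimum noise levels (illustrative EDM-style values).
    """
    # One-step generation: map pure noise at level T straight to data.
    x = f(torch.randn(shape, device=device) * T, T)
    # Each additional step trades compute for quality: re-noise, then denoise again.
    for tau in taus:
        z = torch.randn_like(x)
        x_tau = x + (tau**2 - eps**2) ** 0.5 * z  # inject noise to reach level tau
        x = f(x_tau, tau)                          # map back toward the data manifold
    return x

if __name__ == "__main__":
    dummy_f = lambda x, t: x / (1.0 + t)  # placeholder, NOT a trained model
    sample = multistep_consistency_sampling(dummy_f, (1, 3, 32, 32), taus=[5.0, 1.0])
    print(sample.shape)
```

With an empty `taus` list this reduces to one-step generation (NFE = 1); each appended noise level adds one more network evaluation, which is exactly the knob the FID-vs-NFE comparisons above are turning.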

6.1.2. Few-Step Image Generation (Direct Generation - CT)

Consistency Training (CT) refers to training consistency models from scratch, without distilling from a pre-trained diffusion model.

Analysis:

  • Superiority over non-adversarial single-step models: CT significantly outperforms existing single-step, non-adversarial generative models (VAEs and normalizing flows) on CIFAR-10. For example, CT (1-step) achieves an FID of 8.70, which is vastly better than 1-Rectified Flow (378), Glow (48.9), etc.
  • Competitive with distilled models: CT achieves sample quality comparable to one-step samples from PD, despite not having access to pre-trained diffusion models. For instance, on CIFAR-10, CT (1-step) FID of 8.70 is close to PD (1-step) FID of 8.34.
  • Comparison to GANs: CT (1-step) FID of 8.70 on CIFAR-10 is not yet competitive with the best GANs like StyleGAN2-ADA (2.92) or StyleGAN-XL (1.85), but it outperforms several other GANs (BigGAN, Diffusion GAN, AutoGAN, E2GAN, TransGAN).
  • Multi-step improvement for CT: Similar to CD, CT also benefits from multi-step generation. On CIFAR-10, CT improves from FID 8.70 (1-step) to 5.83 (2-step).

6.1.3. Zero-Shot Image Editing

Consistency Models demonstrate strong zero-shot data editing capabilities. The models were not explicitly trained on these tasks but can perform them by modifying the multistep sampling process (Algorithm 4); a minimal sketch of this constrained sampling loop follows the list. Examples shown in the paper include:

  • Colorization (Figure 8): A consistency model trained on LSUN Bedroom can colorize grayscale images at test time.

  • Super-resolution (Figure 9): The same model can generate high-resolution images from low-resolution inputs.

  • Stroke-guided image editing (Figure 6c): The model can generate images based on user-provided strokes, similar to SDEdit.

  • Inpainting (Figure 10), Interpolation (Figure 11), Denoising (Figure 12): Additional results in the appendix further confirm these capabilities.

    These results highlight a key advantage of consistency models: they inherit the powerful zero-shot editing capabilities of diffusion models while achieving significantly faster generation.
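
As referenced above, the common recipe behind these edits is to run multistep sampling while repeatedly clamping the parts of the sample that are already known. The sketch below illustrates this for inpainting; it is a simplification of the paper's Algorithm 4, not a faithful reproduction, and `f` again denotes a trained consistency function.

```python
import torch

def zero_shot_inpaint(f, y, mask, taus, T=80.0, eps=0.002):
    """Sketch of zero-shot inpainting with a consistency model.

    f    : trained consistency function f(x, t).
    y    : reference image; values inside the masked-out region are ignored.
    mask : 1 where pixels are known and kept, 0 where they should be generated.
    taus : decreasing sequence of intermediate noise levels.
    """
    # Start from a one-step sample, then clamp the known pixels to the reference.
    x = f(torch.randn_like(y) * T, T)
    x = mask * y + (1 - mask) * x
    for tau in taus:
        x_tau = x + (tau**2 - eps**2) ** 0.5 * torch.randn_like(x)
        x = f(x_tau, tau)
        x = mask * y + (1 - mask) * x  # re-impose the known pixels after each step
    return x
```

Colorization and super-resolution follow the same pattern, except that the constraint is imposed in a transformed space (grayscale channels or a low-resolution projection) rather than on raw pixels.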

6.2. Data Presentation (Tables)

The data tables for sample quality on CIFAR-10, ImageNet 64x64, and LSUN Bedroom & Cat 256x256 have been presented in the Core Results Analysis section (6.1.1).

6.3. Ablation Studies / Parameter Analysis

The paper includes an experimental section (6.1. Training Consistency Models) dedicated to understanding the effect of various hyperparameters on the performance of consistency models trained by CD and CT.

The following are the results from Figure 3 of the original paper:

The figure is a chart showing how the FID metric evolves over training iterations for CD (Consistency Distillation) and CT (Consistency Training). It contains multiple curves for different methods and hyperparameter settings, giving a direct visual comparison of their training behavior.

Figure 3: Various factors affecting consistency distillation (CD) and consistency training (CT). The best configuration for CD is LPIPS, the Heun ODE solver, and $N=18$. The adaptive schedule functions for $N$ and $\mu$ make CT converge significantly faster than fixing them to constants during optimization.

Analysis of Consistency Distillation (CD):

  • Metric Function $d(\cdot, \cdot)$ (Figure 3a):
    • Compares the squared $\ell_2$ distance, the $\ell_1$ distance, and LPIPS.
    • Finding: LPIPS consistently outperforms both $\ell_1$ and $\ell_2$ by a large margin across all training iterations (lower FID). This is expected because LPIPS is designed to measure perceptual similarity between natural images, aligning better with the goal of high-quality image generation than pixel-wise distances.
  • ODE Solver and Number of Discretization Steps $N$ (Figures 3b and 3c):
    • Compares the Euler and Heun ODE solvers.
    • Compares different values of $N \in \{9, 12, 18, 36, 50, 60, 80, 120\}$.
    • Finding (Figure 3b): The Heun ODE solver uniformly outperforms Euler's first-order solver for the same $N$. This corroborates Theorem 1, which implies that higher-order ODE solvers (like Heun) should yield smaller estimation errors for the same $N$.
    • Finding (Figure 3c): $N=18$ emerges as the best choice for this setup, and CD's performance becomes less sensitive to $N$ once it is sufficiently large. These findings align with the recommendations of Karras et al. (2022) for diffusion models, suggesting that such insights transfer.
  • Optimal CD Configuration: Based on these findings, the optimal configuration for CD uses LPIPS as the metric, the Heun ODE solver, and $N=18$ (for CIFAR-10).

Analysis of Consistency Training (CT):

  • Schedule Functions for $N$ and $\mu$ (Figure 3d):
    • Compares CT with fixed $N$ and $\mu$ against adaptive schedule functions $N(k)$ and $\mu(k)$.
    • Finding: The convergence of CT is highly sensitive to $N$. Smaller $N$ leads to faster initial convergence but ultimately worse sample quality, while larger $N$ gives slower convergence but better final sample quality. This aligns with Theorem 2's analysis of the $O(\Delta t)$ bias and the variance.
    • Innovation: Adaptive schedules $N(k)$ and $\mu(k)$, which progressively increase $N$ and adjust $\mu$ over the training iterations $k$, significantly improve both the convergence speed and the final sample quality of CT, balancing fast initial convergence against high final quality. (An illustrative sketch of such schedules appears after this analysis.)
  • Metric Function: As with CD, LPIPS is adopted for CT due to its effectiveness.
  • ODE Solver: CT does not rely on a numerical ODE solver in its loss function, so the choice of solver (e.g., Heun) does not directly apply to CT's loss computation.

Overall hyperparameter insights:

  • The choice of perceptual metric (LPIPS) is critical for training high-quality image generative models.
  • Higher-order ODE solvers are beneficial for consistency distillation when approximating PF ODE trajectories.
  • Adaptive scheduling of the training parameters ($N$, $\mu$) is essential for optimizing consistency training, balancing convergence speed against final sample quality.

Experimental Verifications for Continuous-Time Objectives (Figure 7):

The following are the results from Figure 7 of the original paper:

![Figure 7: Comparing discrete consistency distillation/training algorithms with continuous counterparts.](/files/papers/693e91768423067332d514b1/images/7.jpg)

Figure 7: Comparing discrete consistency distillation/training algorithms with continuous counterparts.

Analysis:

  • Figure 7 shows the FID curves over training iterations for different discrete and continuous-time loss functions.
  • For Consistency Distillation (CD) (Figure 7a):
    • CD (LPIPS) performs best among the discrete CD objectives, reinforcing LPIPS as the metric of choice.
    • $\mathrm{CD}^\infty$ (stopgrad, LPIPS) performs best among the continuous-time CD objectives, even outperforming discrete CD (LPIPS). This suggests that the continuous-time pseudo-objective with stopgrad is very effective.
    • The continuous-time objectives generally achieve lower FID values, especially with LPIPS and stopgrad.
  • For Consistency Training (CT) (Figure 7b):
    • CT with adaptive $N$ and $\mu$ (labeled CT (LPIPS)) performs best, reinforcing the importance of scheduling.
    • The continuous-time CT objectives, $\mathrm{CT}^\infty$ ($\ell_2$) and $\mathrm{CT}^\infty$ (LPIPS), do not perform as well as discrete-time CT with adaptive schedules when models are randomly initialized. This suggests that while theoretically sound, practical challenges (e.g., variance reduction for continuous-time CT) need to be addressed before standalone CT can fully leverage its continuous formulation.
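
Returning to the adaptive schedules discussed under Figure 3d, the sketch below illustrates the general idea: grow the number of discretization steps $N$ as training progresses and tie the EMA rate $\mu$ to the current $N$. The functional form and the constants (`s0`, `s1`, `mu0`) here are illustrative assumptions, not the exact schedules from the paper's appendix.

```python
import math

def n_schedule(k, K, s0=2, s1=150):
    """Illustrative adaptive schedule: grow the number of discretization steps
    from roughly s0 at iteration 0 toward roughly s1 at the final iteration K."""
    frac = min(max(k / K, 0.0), 1.0)
    return math.ceil(math.sqrt(frac * ((s1 + 1) ** 2 - s0**2) + s0**2) - 1) + 1

def mu_schedule(k, K, mu0=0.9, s0=2):
    """Illustrative EMA schedule: push the target-network decay rate toward 1
    as N grows, so the target parameters move more slowly late in training."""
    return math.exp(s0 * math.log(mu0) / n_schedule(k, K))

# Example: inspect the schedules at a few points of a 400k-iteration run.
K = 400_000
for k in (0, 100_000, 200_000, 400_000):
    print(k, n_schedule(k, K), round(mu_schedule(k, K), 4))
```

The qualitative behavior matches the analysis above: early in training, small $N$ (low variance, fast convergence); late in training, large $N$ and $\mu$ close to 1 (low bias, better final sample quality).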

6.4. Additional Samples

The paper provides extensive visual samples generated by EDM, CD, and CT across different datasets and NFE values in Appendix E. Figure 5 (in the main paper) shows a direct comparison:


Figure 5: Samples generated by EDM (top), CT + single-step generation (middle), and CT + 2-step generation (bottom). All corresponding images are generated from the same initial noise.

Analysis:

  • The samples generated by CT (both 1-step and 2-step) maintain significant structural similarity to the EDM samples when initialized with the same noise vector. This implies that consistency models effectively capture the underlying data manifold learned by diffusion models.
  • The visual quality of CT samples, especially with 2-step generation, appears high and plausible, further supporting the quantitative FID results.
  • The fact that CT (trained in isolation) can produce samples structurally similar to EDM (a powerful diffusion model) suggests that consistency models are less prone to issues like mode collapse, a common problem in GANs.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces Consistency Models as a novel and highly effective family of generative models. They tackle the long-standing challenge of slow generation speed in diffusion models by enabling fast one-step sampling, while crucially preserving the ability to trade computation for quality via multi-step sampling and supporting versatile zero-shot data editing capabilities. The authors demonstrate two powerful training paradigms: Consistency Distillation (CD), which significantly outperforms existing diffusion distillation techniques, and Consistency Training (CT), which allows for consistency models to be trained as competitive standalone generative models from scratch. Empirically, consistency models achieve new state-of-the-art FID scores for one-step generation on standard benchmarks and surpass other non-adversarial, single-step generative models. The work establishes consistency models as a promising new direction that harmonizes the high quality and flexibility of diffusion models with the efficiency of single-step generation.

7.2. Limitations & Future Work

The authors implicitly or explicitly point out several limitations and suggest future work:

  • Multistep ODE Solvers for Distillation: The current consistency distillation framework primarily considers one-step ODE solvers (e.g., Euler, Heun). Generalizing the framework to multistep ODE solvers is left as future work, which could potentially improve distillation accuracy.
  • Time Point Selection for Multistep Sampling: The selection of time points $\{\tau_1, \tau_2, \dots, \tau_{N-1}\}$ in Multistep Consistency Sampling (Algorithm 1) currently uses a greedy algorithm with ternary search to optimize FID (a toy sketch of this search follows the list). Exploring better, more robust strategies for selecting these time points is suggested as future work.
  • Variance Reduction for Continuous-Time CT: While continuous-time CT objectives are theoretically appealing (reducing bias), experimental results show that discrete-time CT with adaptive schedules performs better when models are randomly initialized. Addressing variance reduction techniques for continuous-time CT is an area for future research.
  • Comparison to Top GANs in CT: While CT outperforms other non-adversarial, single-step models, its one-step generation quality (e.g., FID on CIFAR-10) is still not as good as the absolute best GANs (e.g., StyleGAN-XL). Further improvements to CT could aim to close this gap.
  • Generalization of Theorem 3: A more general version of Theorem 3 (for consistency distillation in continuous time) that applies to more general ODE solvers is left for future work.
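
The greedy time-point selection mentioned above can be pictured as a one-dimensional search per time point: holding the earlier points fixed, ternary search narrows down the next $\tau$ that minimizes FID. A minimal sketch, where `fid_of(tau)` is a hypothetical callable that runs multistep sampling with the candidate time point and returns the resulting FID:

```python
def ternary_search_tau(fid_of, lo, hi, iters=20):
    """Ternary search for the time point tau in [lo, hi] minimizing an
    (assumed unimodal) FID objective. `fid_of` is a hypothetical callable."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if fid_of(m1) < fid_of(m2):
            hi = m2  # the minimum lies in [lo, m2]
        else:
            lo = m1  # the minimum lies in [m1, hi]
    return (lo + hi) / 2

# Example with a toy quadratic standing in for the expensive FID evaluation:
tau_star = ternary_search_tau(lambda t: (t - 2.5) ** 2 + 3.0, lo=0.002, hi=80.0)
print(round(tau_star, 3))  # close to 2.5
```

Because each FID evaluation requires generating and scoring a full batch of samples, more sample-efficient search strategies are a natural direction for the future work noted above.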

7.3. Personal Insights & Critique

7.3.1. Inspirations and Applications

This paper offers several profound inspirations:

  • Bridging the Gap: Consistency models elegantly bridge the gap between the high-quality outputs of diffusion models and the high-speed requirements of real-time applications. This is a crucial step towards making advanced generative AI more practical.
  • Elegance of Self-Consistency: The core idea of self-consistency is remarkably elegant. It transforms the problem of iterative denoising into a direct mapping, simplifying the inference process without sacrificing key features. The boundary condition is a clever architectural constraint that grounds the model.
  • Zero-Shot Capabilities as a Benchmark: The emphasis on zero-shot data editing as a key feature highlights the versatility of latent representations learned by these models. This capability is highly valuable and suggests that the learned consistency function effectively encodes the semantic meaning of data.
  • Cross-Pollination Potential: The observed similarities to deep Q-learning (target networks, EMA) and momentum-based contrastive learning (EMA) open exciting avenues for research. Insights from these fields could further stabilize and improve consistency model training, and vice-versa. For instance, techniques for exploration-exploitation balance from RL might be relevant to how consistency models handle noise injection in multi-step sampling.
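
The EMA/target-network parallel noted in the last point is concrete in the training loop: the target parameters used to produce consistency targets are an exponential moving average of the online parameters, updated outside the gradient path. A minimal PyTorch sketch, assuming `online` and `target` are two copies of the same network (the decay value is illustrative):

```python
import torch

@torch.no_grad()  # the EMA update happens outside the gradient path (a "stopgrad")
def ema_update(target, online, mu=0.999):
    """Move each target parameter toward its online counterpart."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(mu).add_(p_o, alpha=1 - mu)
```

This is the same mechanism used for target networks in deep Q-learning and for momentum encoders in contrastive learning, which is what makes the cross-pollination argument plausible.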

7.3.2. Potential Issues, Unverified Assumptions, or Areas for Improvement

  • Complexity of Training Schedules: While effective, the adaptive scheduling of $N$ and $\mu$ for CT suggests a degree of hyperparameter tuning complexity. Developing more robust or self-adapting schedules, or exploring architectural inductive biases that reduce this sensitivity, could be beneficial. The fact that continuous-time CT didn't immediately outperform discrete CT with adaptive schedules indicates that practical implementation details, beyond theoretical formulation, are critical.
  • Theoretical vs. Practical Performance Discrepancy (Continuous-Time CT): The continuous-time CT loss is theoretically appealing for its lack of bias. However, its lower empirical performance compared to discrete CT with schedules points to practical challenges, possibly related to optimization stability, gradient estimation, or the effectiveness of forward-mode automatic differentiation in current frameworks. More research into efficient and stable optimization of these continuous objectives is warranted.
  • Dependence on "Good" Diffusion Models for CD: Consistency Distillation heavily relies on the quality of the pre-trained diffusion model. If the teacher model itself has limitations (e.g., specific biases or failure modes), these might be inherited by the consistency model. However, this is a general limitation of distillation techniques.
  • Interpretation of $\mathbf{x}_{\tau_n} \gets \mathbf{x} + \sqrt{\tau_n^2 - \epsilon^2}\, \mathbf{z}$: In Algorithm 1, the noise injection step is slightly ambiguous in its exact interpretation. Does $\mathbf{x}$ represent $\mathbf{x}_{\epsilon}$ before noise is added? If so, the formulation is sound for creating a noisy sample at time $\tau_n$ (see the short variance check after this list). Clarifying this and its implications for trajectory adherence could be valuable for newcomers.
  • Scalability for Very High Resolutions: While demonstrated on 256x256 images, scaling to even higher resolutions (e.g., 1024x1024 or larger) might introduce new challenges. The skip connection parameterization is helpful, but consistency loss computation involving Jacobians (in continuous-time formulations) or large batches for discrete losses could become computationally prohibitive.
  • Beyond Image Generation: The paper focuses on images. Exploring consistency models for other data modalities (audio, video, text, 3D data) could be a rich area for future work, potentially revealing new challenges and requiring different architectural adaptations.
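
For the noise-injection step questioned above, a short variance calculation shows why the $\sqrt{\tau_n^2 - \epsilon^2}$ factor is the natural choice, under the assumption that $\mathbf{x}$ already carries independent noise at level $\epsilon$:

$$\operatorname{Var}\!\left[\mathbf{x}_{\tau_n} \mid \mathbf{x}_0\right] = \epsilon^2 \mathbf{I} + \left(\tau_n^2 - \epsilon^2\right)\mathbf{I} = \tau_n^2 \mathbf{I},$$

so $\mathbf{x}_{\tau_n}$ is distributed as a sample at noise level $\tau_n$, which is exactly the input the consistency function expects at that step.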
