
Back to Basics: Let Denoising Generative Models Denoise


TL;DR Summary

This paper advocates for denoising generative models that directly predict clean images rather than noise. It introduces a simplified Transformer model, JiT, which operates without pre-training or tokenizers, and demonstrates competitive generation quality in high-dimensional pixel space, where noise prediction fails.

Abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.


1. Bibliographic Information

1.1. Title

Back to Basics: Let Denoising Generative Models Denoise

1.2. Authors

  • Tianhong Li: A PhD student at MIT, focusing on computer vision and machine learning, particularly generative models.

  • Kaiming He: A renowned research scientist, currently at MIT. He is one of the most influential figures in modern computer vision, best known for his foundational work on Deep Residual Networks (ResNet), Mask R-CNN, Momentum Contrast (MoCo), and a series of other impactful papers. His involvement signals a high-quality, fundamental contribution to the field.

    Both authors are affiliated with the Massachusetts Institute of Technology (MIT).

1.3. Journal/Conference

The paper was submitted to a preprint server, arXiv. As of the time of this analysis, it has not yet been peer-reviewed or published at a major conference. However, given the authors' reputation and the quality of the work, it is highly likely to be presented at a top-tier computer vision or machine learning conference such as CVPR, ICCV, ECCV, or NeurIPS. These venues are the most prestigious in the field.

1.4. Publication Year

The paper was posted to arXiv on November 17, 2025.

1.5. Abstract

The abstract challenges the current standard practice in denoising diffusion models, which typically predict noise (ε-prediction) or a noised quantity (v-prediction) rather than the clean image itself (x-prediction). The authors argue that, under the manifold assumption, these are fundamentally different tasks: natural data lies on a low-dimensional manifold, whereas noise fills the high-dimensional ambient space and is off-manifold. Predicting the clean data is therefore a more tractable task for a capacity-limited network. They propose a simple model called JiT (Just image Transformers), which is a standard Vision Transformer operating on large patches of raw pixels. Without any tokenizers, pre-training, or auxiliary losses, JiT achieves competitive results on high-resolution ImageNet (256x256 and 512x512), a setting where predicting noise with a similar architecture fails catastrophically. The paper advocates for a self-contained, "back to basics" paradigm for generative modeling.

  • Original Source Link: https://arxiv.org/abs/2511.13720

  • PDF Link: https://arxiv.org/pdf/2511.13720v1.pdf

  • Publication Status: This is a preprint available on arXiv and has not yet undergone formal peer review.


2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is a foundational design choice in modern denoising diffusion models. Since their popularization, these models have achieved state-of-the-art results in image generation. However, a key innovation that led to their success was to train the underlying neural network to predict the noise that was added to an image, rather than predicting the clean image directly. This is known as ε-prediction. Subsequent work also introduced predicting a "velocity" term (v-prediction), which is a combination of the clean image and noise.

The authors argue that while these approaches work, they are fundamentally at odds with a long-standing principle in machine learning: the manifold assumption. This hypothesis states that real-world, high-dimensional data (like the pixels of a natural image) actually populates a much lower-dimensional, non-linear structure called a manifold.

The motivation stems from the following gap:

  • A clean image $\mathbf{x}$ lies on this low-dimensional manifold.

  • Random noise $\boldsymbol{\epsilon}$ or the velocity $\mathbf{v}$ is distributed throughout the entire high-dimensional space and is thus off-manifold.

    Training a neural network to predict an off-manifold quantity is a much harder task. It forces the network to preserve high-dimensional information, requiring a very high capacity. This is especially problematic when working directly in the high-dimensional pixel space, which has led to the dominance of Latent Diffusion Models (LDMs). LDMs first compress images into a small latent space using a pre-trained autoencoder (a "tokenizer"), effectively sidestepping the high-dimensionality problem instead of solving it.

The paper's innovative entry point is to challenge the status quo and ask: What if we just let the denoising model do what its name suggests—denoise, by directly predicting the clean image?

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  1. Re-establishes the Superiority of x-prediction in High Dimensions: The authors demonstrate empirically and through a toy experiment that directly predicting the clean image (x-prediction) is fundamentally more suitable for diffusion models operating on high-dimensional data, especially when using capacity-limited networks like Transformers. In contrast, ε-prediction and v-prediction fail catastrophically in these settings.

  2. Proposes a Simple and Self-Contained Model (JiT): They introduce "Just image Transformers" (JiT), a plain Vision Transformer (ViT) that operates directly on large patches of raw pixels. This approach is conceptually simple and self-contained, meaning it requires:

    • No pre-trained VAE/tokenizer (unlike LDMs).
    • No perceptual or adversarial losses (which require pre-trained classifiers like VGG).
    • No self-supervised pre-training (unlike other recent methods).
  3. Achieves Competitive Performance with High Efficiency: JiT achieves strong results on high-resolution ImageNet (256x256 and 512x512) while being significantly more computationally efficient than competing pixel-space models. By using large patches, it avoids the quadratic scaling of compute with image resolution that plagues dense convolutional models.

  4. Provides Evidence for the Manifold Hypothesis in Diffusion Models: The authors show that introducing an information bottleneck in the network's input layer can actually improve generation quality. This counter-intuitive finding strongly supports their core argument: since the target (the clean image) is low-dimensional, forcing the network to filter out high-dimensional noise through a bottleneck is beneficial.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Denoising Diffusion Models

Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that learn to create data by reversing a gradual noising process. The process has two parts:

  • Forward Process (Noising): This is a fixed process where a clean image $\mathbf{x}_0$ is gradually corrupted by adding Gaussian noise over a series of $T$ timesteps. At any timestep $t$, the noised image $\mathbf{x}_t$ is a combination of the original image and noise. The paper uses a continuous-time formulation where a noisy sample $\mathbf{z}_t$ is an interpolation between a clean image $\mathbf{x}$ and a noise vector $\boldsymbol{\epsilon}$.
  • Reverse Process (Denoising): This is the generative part. A neural network is trained to "denoise" the corrupted data at each timestep. Starting from pure random noise $\mathbf{x}_T$, the model iteratively applies this learned denoising function to gradually produce a clean image $\mathbf{x}_0$.

3.1.2. The Manifold Assumption

The manifold assumption is a core concept in machine learning that posits that high-dimensional data, such as images, audio, or text, is not uniformly distributed throughout its ambient space. Instead, it concentrates near a much lower-dimensional, non-linear sub-manifold.

  • Example: Consider 3D space. A 2D manifold could be a curved sheet of paper floating in that space. A 1D manifold could be a tangled string.

  • For Images: A 256x256 pixel color image exists in a space of $256 \times 256 \times 3 = 196{,}608$ dimensions. However, the space of "natural-looking images" is a tiny, intricately structured subset of this vast space. A random point in this space would just look like TV static. The manifold assumption states that all realistic images lie on a complex, low-dimensional surface within this high-dimensional pixel space.

    The following figure from the paper illustrates this concept. The clean image $\mathbf{x}$ is on the manifold, while the noise $\boldsymbol{\epsilon}$ and the velocity term $\mathbf{v}$ are off-manifold vectors in the high-dimensional ambient space.

    Figure 1. The manifold assumption [4] hypothesizes that natural images lie on a low-dimensional manifold within the high-dimensional pixel space. While a clean image $\mathbf{x}$ can be modeled as on-manifold, the noise $\boldsymbol{\epsilon}$ or flow velocity $\mathbf{v}$ (e.g., $\mathbf{v} = \mathbf{x} - \boldsymbol{\epsilon}$) is inherently off-manifold. Training a neural network to predict a clean image ($x$-prediction) is fundamentally different from training it to predict noise or a noised quantity ($\epsilon$/$v$-prediction).

3.1.3. Prediction Targets in Diffusion Models

The neural network in a diffusion model needs to predict something to perform the denoising step. There are three common choices:

  1. x-prediction ($\mathbf{x}_\theta$): The network directly predicts the clean image $\mathbf{x}$. This is the most intuitive "denoising" task.

  2. ε-prediction ($\boldsymbol{\epsilon}_\theta$): The network predicts the noise $\boldsymbol{\epsilon}$ that was added to the image. This became the standard after the DDPM paper showed it produced better results in their setup.

  3. v-prediction ($\mathbf{v}_\theta$): The network predicts a "velocity" vector $\mathbf{v} = \mathbf{x} - \boldsymbol{\epsilon}$. This was introduced as a way to unify different formulations and improve sampling.

    The key insight of this paper is that the choice between these targets is not arbitrary, especially when considering the manifold assumption.

3.1.4. Vision Transformer (ViT)

The Vision Transformer is an architecture that applies the Transformer model, originally designed for natural language processing, to image data.

  • How it works:
    1. Patching: The input image is divided into a grid of non-overlapping square patches (e.g., 16x16 pixels).
    2. Linear Embedding: Each patch is flattened into a long vector and then linearly projected into a token embedding.
    3. Positional Encoding: Since Transformers are permutation-invariant, positional information is added to each token embedding to retain spatial awareness.
    4. Transformer Encoder: This sequence of tokens is fed into a standard Transformer encoder, which uses self-attention mechanisms to model relationships between all pairs of patches.
    5. Output: The processed tokens can then be used for a downstream task, such as classification or, in this case, image generation.
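To make the patching and embedding steps concrete, here is a minimal PyTorch sketch of the ViT tokenization path; the image size, patch size, and `hidden_dim` values are illustrative assumptions, not a specific model's configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, hidden_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride-p convolution is the standard trick for "flatten each patch,
        # then apply a shared linear projection".
        self.proj = nn.Conv2d(in_chans, hidden_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, hidden_dim))

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.proj(x)                       # (B, hidden_dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, hidden_dim)
        return tokens + self.pos_embed              # add positional information

tokens = PatchEmbed()(torch.randn(2, 3, 256, 256))  # -> (2, 256, 768)
```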

3.2. Previous Works

  • DDPM (Ho et al., 2020): This seminal work revitalized interest in diffusion models. A crucial finding reported was that making the network predict the noise (ε-prediction) led to substantially better image quality than predicting the clean image (x-prediction). This observation established ε-prediction as the de facto standard.
  • Progressive Distillation (Salimans & Ho, 2022): This work introduced v-prediction and provided a theoretical framework that connected $x$-, $\epsilon$-, and $v$-predictions. It showed that they could be transformed into one another and that choosing a different prediction target was equivalent to re-weighting the training loss. However, this analysis was done in low-dimensional settings (CIFAR-10) and did not consider the capacity limitations of networks in high-dimensional spaces.
  • Latent Diffusion Models (LDM, Rombach et al., 2022): LDMs became the dominant architecture for large-scale image generation (e.g., Stable Diffusion). They work by first training a powerful autoencoder (like a VAE) to compress images into a low-dimensional latent space. The diffusion model then operates entirely within this compact space, avoiding the computational cost and instability of working with high-dimensional pixels. This paper argues that LDMs hide the dimensionality problem rather than solving it, and rely heavily on pre-training.
  • Pixel-space Diffusion Models (e.g., SiD2, PixelFlow): There have been recent efforts to build powerful diffusion models that work directly on pixels, avoiding the need for a pre-trained tokenizer. These models often use complex, hierarchical, or dense convolutional architectures to handle the high dimensionality. This paper's JiT model contrasts with these by using a much simpler, standard Transformer architecture.

3.3. Technological Evolution

  1. Early Diffusion Models: Proposed denoising as the core mechanism but were not widely adopted.
  2. DDPM Era: ε-prediction becomes the standard, leading to a massive jump in generation quality. Models typically used U-Net architectures operating on pixels.
  3. Rise of Latent Diffusion: To scale to high-resolution images, LDMs shift the entire process to a pre-trained latent space. This becomes the dominant paradigm.
  4. Return to Pixel Space: A recent trend explores getting rid of the pre-trained latent space for a more "end-to-end" or "self-contained" approach. These models often require sophisticated architectural designs to cope with the high dimensionality of pixels.
  5. This Paper's Position: "Back to Basics" argues that the pixel-space approach is viable and can be simple, provided we revisit the fundamental choice of the prediction target. By switching back to x-prediction, they claim we can unlock the power of simple architectures like ViT for direct pixel-space generation.

3.4. Differentiation Analysis

Compared to related work, this paper's core innovations are:

  • vs. LDMs (DiT, SiT): JiT is self-contained. It does not require a pre-trained VAE or any other external model. This simplifies the training pipeline and makes the approach more generalizable to non-image domains where tokenizers are hard to design.

  • vs. other Pixel-Space Models (SiD2, PixelFlow): JiT is architecturally simpler and more efficient. It uses a standard, non-hierarchical ViT with large patches, which is computationally cheaper than the dense U-Nets or multi-scale Transformers used in other pixel-space models.

  • vs. DDPM and its successors: The core differentiation is the conceptual shift from ε/v-prediction back to x-prediction, justified by the manifold assumption. While x-prediction existed before, this paper is the first to systematically show its critical importance in high-dimensional settings and connect it to the network's architectural capacity.


4. Methodology

4.1. Principles

The central principle of the paper is the manifold assumption. The authors hypothesize that a neural network with finite capacity can more easily learn to map a noisy, off-manifold point back to the low-dimensional data manifold (i.e., predict the clean image $\mathbf{x}$) than it can learn to predict a high-dimensional, off-manifold noise or velocity vector. Predicting noise requires the network to act as a lossless channel for high-dimensional information, which is a demanding task. In contrast, predicting the clean image allows the network to act as a filter, discarding the high-dimensional noise and retaining only the low-dimensional information relevant to the data manifold.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology combines a flow-based diffusion framework with a specific choice of prediction target and a simple Transformer architecture.

4.2.1. Integrated Explanation: Diffusion Formulation and Prediction Spaces

The authors start with a continuous-time, flow-based formulation of diffusion.

Step 1: The Noising Process. During training, a noisy sample $\mathbf{z}_t$ is created by linearly interpolating a clean image $\mathbf{x}$ from the data distribution $p_{\mathrm{data}}$ and a random noise vector $\boldsymbol{\epsilon}$ from a noise distribution $p_{\mathrm{noise}}$ (typically $\mathcal{N}(\mathbf{0}, \mathbf{I})$). The interpolation is controlled by a time variable $t \in [0, 1]$:

$$\mathbf{z}_t = t\,\mathbf{x} + (1-t)\,\boldsymbol{\epsilon}.$$

Here, when $t=0$, $\mathbf{z}_0 = \boldsymbol{\epsilon}$ (pure noise), and when $t=1$, $\mathbf{z}_1 = \mathbf{x}$ (clean data).

Step 2: Defining the Flow Velocity. The goal of the generative model is to learn a vector field that can transport a point from the noise distribution to the data distribution. This is captured by the time derivative of $\mathbf{z}_t$, called the flow velocity $\mathbf{v}$. Differentiating the equation above with respect to $t$ gives:

$$\mathbf{v} = \frac{d\mathbf{z}_t}{dt} = \mathbf{x} - \boldsymbol{\epsilon}.$$

This simple relationship connects the clean image $\mathbf{x}$, the noise $\boldsymbol{\epsilon}$, and the velocity $\mathbf{v}$.

Step 3: The Loss Function. The model is trained to predict the true velocity $\mathbf{v}$ given the noisy input $\mathbf{z}_t$ and time $t$. The network, parameterized by $\theta$, outputs a predicted velocity $\mathbf{v}_\theta$. The training objective is to minimize the mean squared error between the predicted and true velocity. This is called the v-loss:

$$\mathcal{L} = \mathbb{E}_{t, \mathbf{x}, \boldsymbol{\epsilon}} \left\| \mathbf{v}_\theta(\mathbf{z}_t, t) - \mathbf{v} \right\|^2.$$
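As a minimal illustration of Steps 1–3, the sketch below builds the interpolated sample $\mathbf{z}_t$ and the target velocity $\mathbf{v}$ for a batch of clean images. The uniform sampling of $t$ and the function name are simplifying assumptions for this analysis; the paper's ablations (Table 3) additionally explore shifting the noise level.

```python
import torch

def make_training_pair(x, *, eps=None):
    """Given clean images x of shape (B, C, H, W), build (z_t, v, t) for the v-loss.

    z_t = t*x + (1-t)*eps   and   v = dz_t/dt = x - eps.
    Uniform t in [0, 1] is a simplifying assumption, not the paper's exact schedule.
    """
    b = x.shape[0]
    eps = torch.randn_like(x) if eps is None else eps
    t = torch.rand(b, 1, 1, 1, device=x.device)   # broadcastable time variable
    z_t = t * x + (1.0 - t) * eps
    v = x - eps
    return z_t, v, t.flatten()
```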

Step 4: Connecting Loss Space and Prediction Space. This is the paper's crucial insight. The network's direct output, $\mathrm{net}_\theta(\mathbf{z}_t, t)$, does not have to be the velocity $\mathbf{v}_\theta$. It can instead be the clean image $\mathbf{x}_\theta$ or the noise $\boldsymbol{\epsilon}_\theta$. If the network predicts one of these quantities, the other two can be algebraically derived using the relationships from Steps 1 and 2.

The paper presents all possible combinations in Table 1. Let's analyze the key cases:

  • Case (a): x-prediction (the paper's proposed method). The network's direct output is the clean image: $\mathbf{x}_\theta := \mathrm{net}_\theta(\mathbf{z}_t, t)$. Using the known equations, we can derive the corresponding predicted velocity $\mathbf{v}_\theta$ to plug into the v-loss:

    • From $\mathbf{z}_t = t\,\mathbf{x}_\theta + (1-t)\,\boldsymbol{\epsilon}_\theta$, we get $\boldsymbol{\epsilon}_\theta = (\mathbf{z}_t - t\,\mathbf{x}_\theta)/(1-t)$.
    • Substituting into $\mathbf{v}_\theta = \mathbf{x}_\theta - \boldsymbol{\epsilon}_\theta$ gives $\mathbf{v}_\theta = (\mathbf{x}_\theta - \mathbf{z}_t)/(1-t)$.
  • Case (b): ε-prediction (the standard DDPM approach). The network directly outputs the noise: $\boldsymbol{\epsilon}_\theta := \mathrm{net}_\theta(\mathbf{z}_t, t)$. The predicted velocity is then $\mathbf{v}_\theta = (\mathbf{z}_t - \boldsymbol{\epsilon}_\theta)/t$.

  • Case (c): v-prediction (the standard flow-matching approach). The network directly outputs the velocity: $\mathbf{v}_\theta := \mathrm{net}_\theta(\mathbf{z}_t, t)$.

    The following table, transcribed from Table 1 of the paper, summarizes these transformations across different loss and prediction spaces.

| | (a) x-pred: $\mathbf{x}_\theta := \mathrm{net}_\theta(\mathbf{z}_t, t)$ | (b) ε-pred: $\boldsymbol{\epsilon}_\theta := \mathrm{net}_\theta(\mathbf{z}_t, t)$ | (c) v-pred: $\mathbf{v}_\theta := \mathrm{net}_\theta(\mathbf{z}_t, t)$ |
|---|---|---|---|
| (1) x-loss: $\mathbb{E}\Vert\mathbf{x}_\theta - \mathbf{x}\Vert^2$ | $\mathbf{x}_\theta$ | $\mathbf{x}_\theta = (\mathbf{z}_t - (1-t)\boldsymbol{\epsilon}_\theta)/t$ | $\mathbf{x}_\theta = (1-t)\mathbf{v}_\theta + \mathbf{z}_t$ |
| (2) ε-loss: $\mathbb{E}\Vert\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon}\Vert^2$ | $\boldsymbol{\epsilon}_\theta = (\mathbf{z}_t - t\mathbf{x}_\theta)/(1-t)$ | $\boldsymbol{\epsilon}_\theta$ | $\boldsymbol{\epsilon}_\theta = \mathbf{z}_t - t\mathbf{v}_\theta$ |
| (3) v-loss: $\mathbb{E}\Vert\mathbf{v}_\theta - \mathbf{v}\Vert^2$ | $\mathbf{v}_\theta = (\mathbf{x}_\theta - \mathbf{z}_t)/(1-t)$ | $\mathbf{v}_\theta = (\mathbf{z}_t - \boldsymbol{\epsilon}_\theta)/t$ | $\mathbf{v}_\theta$ |
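A small sketch of the algebra in Table 1: given any one of the three network outputs together with $\mathbf{z}_t$ and $t$, the predicted velocity follows directly. The function and argument names are illustrative, and the small epsilon used to guard the divisions is an assumption for numerical safety, not part of the paper's formulation.

```python
def to_velocity(output, z_t, t, kind):
    """Convert a network output to a predicted velocity v_theta (Table 1, row 3).

    kind: 'x' (clean-image prediction), 'eps' (noise prediction), or 'v'.
    A tiny constant guards the divisions near t=1 (for 'x') and t=0 (for 'eps').
    """
    tiny = 1e-5
    if kind == 'x':
        return (output - z_t) / (1.0 - t + tiny)
    if kind == 'eps':
        return (z_t - output) / (t + tiny)
    if kind == 'v':
        return output
    raise ValueError(kind)
```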

Step 5: The "Just Image Transformer" (JiT) Architecture The authors implement their x-prediction strategy using a simple, standard Vision Transformer. The architecture is shown in the figure below.

Figure 3. The "Just image Transformer" (JiT) architecture: simply a plain ViT [13] on patches of pixels for $x$-prediction. The diagram shows, from left to right, the input pixel patches, a linear embedding, a stack of Transformer blocks, and a final linear layer producing the $x$-prediction.

The data flow is as follows:

  1. Input: The noisy image $\mathbf{z}_t$ is divided into a sequence of large, non-overlapping patches (e.g., 16x16 or 32x32 pixels).
  2. Embedding: Each patch, which is a high-dimensional vector (e.g., $16 \times 16 \times 3 = 768$ dimensions), is passed through a linear projection layer. Positional embeddings are added.
  3. Transformer Blocks: The sequence of patch tokens is processed by a series of standard Transformer blocks, which use self-attention to model global dependencies. The model is conditioned on the time $t$ and the class label using an adaptive normalization scheme (adaLN-Zero).
  4. Output: A final linear layer projects each output token back to the original patch dimension ($p \times p \times 3$). These patches are then reassembled to form the predicted clean image $\mathbf{x}_\theta$.
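Below is a highly simplified sketch of this data flow using standard PyTorch layers. It omits the adaLN-Zero conditioning on time and class label and uses a vanilla `TransformerEncoder` purely for illustration, so it should be read as a sketch of the described pipeline rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyJiT(nn.Module):
    """Patchify -> Transformer blocks -> linear head -> unpatchify (x-prediction).

    Conditioning on time/class (adaLN-Zero in the paper) is omitted for brevity.
    """
    def __init__(self, img_size=256, patch=16, dim=768, depth=4, heads=12):
        super().__init__()
        self.patch, self.grid = patch, img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch * patch * 3)        # back to pixel space

    def forward(self, z_t):                                   # z_t: (B, 3, H, W)
        tok = self.embed(z_t).flatten(2).transpose(1, 2) + self.pos
        tok = self.blocks(tok)
        patches = self.head(tok)                              # (B, N, p*p*3)
        b = patches.shape[0]
        x_pred = patches.view(b, self.grid, self.grid, self.patch, self.patch, 3)
        x_pred = x_pred.permute(0, 5, 1, 3, 2, 4)             # (B, 3, gh, p, gw, p)
        return x_pred.reshape(b, 3, self.grid * self.patch, self.grid * self.patch)
```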

Step 6: The Final Algorithm. The paper's recommended algorithm combines x-prediction with the v-loss; this specific combination was found to perform best empirically. The final loss function being optimized is:

$$\mathcal{L} = \mathbb{E}_{t, \mathbf{x}, \boldsymbol{\epsilon}} \left\| \mathbf{v}_\theta(\mathbf{z}_t, t) - \mathbf{v} \right\|^2, \quad \text{where } \mathbf{v}_\theta(\mathbf{z}_t, t) = \big(\mathrm{net}_\theta(\mathbf{z}_t, t) - \mathbf{z}_t\big) / (1-t).$$

Here, $\mathrm{net}_\theta$ is the JiT model that directly outputs the predicted clean image $\mathbf{x}_\theta$.
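Putting Steps 1–6 together, a single training step under the recommended combination (x-prediction by the network, v-loss as the objective) could look roughly like this. `make_training_pair` and `to_velocity` are the illustrative helpers sketched earlier in this analysis, and `model` is any x-prediction network (conditioning omitted); none of these names come from the paper.

```python
import torch.nn.functional as F

def training_step(model, x, optimizer):
    """One optimization step: an x-prediction network trained with the v-loss."""
    z_t, v_target, t = make_training_pair(x)
    t_ = t.view(-1, 1, 1, 1)
    x_pred = model(z_t)                               # network outputs the clean image
    v_pred = to_velocity(x_pred, z_t, t_, kind='x')   # v = (x_pred - z_t) / (1 - t)
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```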

Step 7: Sampling. During inference, generation starts with a pure noise sample $\mathbf{z}_0$. The model then solves an ordinary differential equation (ODE) to transform this noise into a clean image:

$$\frac{d\mathbf{z}_t}{dt} = \mathbf{v}_\theta(\mathbf{z}_t, t).$$

This is integrated numerically with a solver such as Heun's method over a series of steps (e.g., 50) from $t=0$ to $t=1$. At each step, the network takes the current $\mathbf{z}_t$, predicts the clean image $\mathbf{x}_\theta$, converts it to the corresponding velocity $\mathbf{v}_\theta$, and uses that velocity to update $\mathbf{z}_t$ for the next step.
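A sketch of the sampling loop with Heun's (second-order) method, under the same illustrative assumptions as the training sketches above; the step count, the output shape, and the clamp that keeps $1-t$ away from zero are arbitrary choices rather than the paper's exact settings.

```python
import torch

@torch.no_grad()
def sample(model, shape=(4, 3, 256, 256), steps=50, device='cpu'):
    """Integrate dz/dt = v_theta(z, t) from t=0 (noise) to t=1 (data) with Heun's method."""
    z = torch.randn(shape, device=device)                  # z_0: pure noise
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t0, t1 = ts[i], ts[i + 1]
        dt = t1 - t0
        v0 = to_velocity(model(z), z, t0.clamp(max=0.999), kind='x')
        z_euler = z + dt * v0                              # Euler predictor
        v1 = to_velocity(model(z_euler), z_euler, t1.clamp(max=0.999), kind='x')
        z = z + dt * 0.5 * (v0 + v1)                       # Heun corrector
    return z
```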


5. Experimental Setup

5.1. Datasets

  • ImageNet (ILSVRC 2012): This is the primary dataset used for experiments. It is a large-scale, high-resolution dataset for image classification and generation.
    • Source: Assembled by researchers at Stanford University and Princeton University.
    • Scale: Contains over 1.2 million training images and 50,000 validation images.
    • Characteristics: The images belong to 1,000 different object categories, exhibiting significant diversity in subject matter, background, and composition.
    • Domain: Natural images.
    • Usage: The authors conduct experiments at multiple resolutions: 64x64 (for ablation), 256x256, 512x512, and 1024x1024.
  • Why this dataset? ImageNet is the standard benchmark for class-conditional image generation. Its high resolution and complexity make it an excellent testbed for evaluating the scalability and quality of generative models, especially for testing the authors' hypothesis about high-dimensional data.

5.2. Evaluation Metrics

5.2.1. Fréchet Inception Distance (FID)

  • Conceptual Definition: FID measures the quality and diversity of generated images by comparing the feature distributions of generated images to real images. The features are extracted from an intermediate layer of a pre-trained InceptionV3 network. A lower FID score indicates that the distribution of generated images is closer to the distribution of real images, implying higher quality and diversity. It is the most common metric for evaluating generative models.
  • Mathematical Formula: $$\text{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \text{Tr}\left( \Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2} \right)$$
  • Symbol Explanation:
    • $\mu_r$ and $\mu_g$: The mean vectors of the Inception features for real and generated images, respectively.
    • $\Sigma_r$ and $\Sigma_g$: The covariance matrices of the Inception features for real and generated images, respectively.
    • $\|\cdot\|_2^2$: The squared L2 norm (measures the difference in means, i.e., quality).
    • $\text{Tr}(\cdot)$: The trace of a matrix (measures the difference in covariances, i.e., diversity).
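Given Inception features for real and generated images, the formula above can be evaluated as in this sketch (NumPy and SciPy); feature extraction from an InceptionV3 network is assumed to have been done elsewhere, and the function name is illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    covmean = covmean.real                                    # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```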

5.2.2. Inception Score (IS)

  • Conceptual Definition: IS measures both the quality (clarity and object-likeness) and diversity of generated images. It uses a pre-trained Inception network to classify each generated image. A high IS means that for each image, the classifier is very confident about which object class it belongs to (high quality), and across all images, the distribution of predicted classes is uniform (high diversity).
  • Mathematical Formula: $$\text{IS}(G) = \exp\left( \mathbb{E}_{\mathbf{x} \sim p_g}\, D_{KL}\big(p(y|\mathbf{x}) \,\|\, p(y)\big) \right)$$
  • Symbol Explanation:
    • $\mathbf{x} \sim p_g$: An image $\mathbf{x}$ sampled from the generator's distribution $p_g$.
    • $p(y|\mathbf{x})$: The conditional class distribution predicted by the Inception network for image $\mathbf{x}$.
    • $p(y)$: The marginal class distribution, obtained by averaging $p(y|\mathbf{x})$ over all generated images.
    • $D_{KL}(\cdot \| \cdot)$: The Kullback-Leibler (KL) divergence, which measures how different two distributions are.
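Similarly, a minimal sketch of the Inception Score from per-image class probabilities $p(y|\mathbf{x})$ (already obtained from an Inception classifier); the common practice of averaging the score over several splits of the sample set is omitted here for brevity.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) softmax outputs of an Inception classifier."""
    p_y = probs.mean(axis=0, keepdims=True)               # marginal class distribution
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))           # exp of the mean KL divergence
```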

5.2.3. Precision and Recall

  • Conceptual Definition: These metrics provide a more disentangled view of sample quality and diversity than FID.
    • Precision: Measures the fraction of generated images that are realistic (i.e., fall within the distribution of real images). It is an indicator of sample quality.
    • Recall: Measures the fraction of real images that are represented in the generated set. It is an indicator of sample diversity.
  • Mathematical Formula: These metrics are calculated by constructing manifolds around feature representations of real and generated data and measuring the overlap. The exact formulas are complex, but conceptually they assess whether generated samples are "close" to real ones (precision) and whether all "types" of real samples have corresponding generated ones nearby (recall).

5.3. Baselines

The paper compares JiT against a comprehensive set of state-of-the-art models, which can be grouped into three categories:

  1. Latent-space Diffusion Models: These are the dominant SOTA models that operate in a pre-trained latent space.

    • Examples: DiT, SiT, DDT, RAE.
    • Why representative? They represent the prevailing and highest-performing paradigm in diffusion-based image generation.
  2. Pixel-space (non-diffusion) Models: These are generative models that work on pixels but are not based on diffusion.

    • Examples: JetFormer (autoregressive), FractalMAR.
    • Why representative? They show how other pixel-based generative approaches perform.
  3. Pixel-space Diffusion Models: These are previous attempts at building diffusion models directly on pixels.

    • Examples: ADM-G, RIN, SiD, SiD2, PixelFlow, PixNerd.

    • Why representative? They are the most direct competitors to JiT, often employing more complex architectures to solve the same problem.


6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Toy Experiment: The Curse of Dimensionality

The paper first validates its core hypothesis with a toy experiment. Data is generated from a 2D spiral manifold, which is then randomly projected into a much higher $D$-dimensional space. A simple MLP is trained to generate data in this $D$-dimensional space. The results are visualized by projecting the generated samples back to 2D.

The following figure (Figure 2 from the original paper) shows the results.


  • Observation: When the ambient dimension $D$ is low (e.g., $D=2$), all prediction methods (x-pred, ε-pred, v-pred) work. However, as $D$ increases to 512, ε-pred and v-pred fail catastrophically, producing meaningless noise. Only x-pred is able to recover the underlying spiral structure.
  • Analysis: This strongly supports the paper's thesis. The MLP has a fixed hidden dimension of 256, which is smaller than the input dimension of 512, so it acts as an information bottleneck. This bottleneck prevents it from preserving the high-dimensional information needed to predict $\boldsymbol{\epsilon}$ or $\mathbf{v}$. However, since the clean data $\mathbf{x}$ is intrinsically low-dimensional (2D), the network can successfully learn to project the noisy input back onto the manifold, even with limited capacity. A sketch of the data construction follows.
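The data construction for such a toy experiment can be reproduced in a few lines; the specific spiral parameters, jitter, and random projection below are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def make_spiral_in_high_dim(n=10_000, D=512, seed=0):
    """A 2D spiral manifold embedded in a D-dimensional ambient space by a random linear map."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 4.0 * np.pi, size=n)
    spiral_2d = np.stack([theta * np.cos(theta), theta * np.sin(theta)], axis=1)
    spiral_2d += 0.05 * rng.standard_normal(spiral_2d.shape)   # small on-manifold jitter
    projection = rng.standard_normal((2, D)) / np.sqrt(D)      # random 2 -> D embedding
    return spiral_2d @ projection                              # data of shape (n, D)
```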

6.1.2. Main Experiment: x-prediction is Critical for High-Dimensional Patches

The most crucial experiment compares the nine combinations of prediction space and loss space on ImageNet. The results are shown in Table 2 from the paper.

The following are the results from Table 2 of the original paper:

(a) ImageNet 256x256, JiT-B/16 (768-dim patch). FID (lower is better) for each combination of loss space (rows) and prediction space (columns):

| | x-pred | ε-pred | v-pred |
|---|---|---|---|
| x-loss | 10.14 | 379.21 | 107.55 |
| ε-loss | 10.45 | 394.58 | 126.88 |
| v-loss | 8.62 | 372.38 | 96.53 |

(b) ImageNet 64x64, JiT-B/4 (48-dim patch):

| | x-pred | ε-pred | v-pred |
|---|---|---|---|
| x-loss | 5.76 | 6.20 | 6.12 |
| ε-loss | 3.56 | 4.02 | 3.76 |
| v-loss | 3.55 | 3.63 | 3.46 |
  • Observation 1 (High Dimension): In Table 2(a), the patch dimension is 768, which matches the Transformer's hidden dimension. Here, only x-prediction produces good results (low FID scores). Both ε-prediction and v-prediction fail completely, with extremely high FID scores.
  • Observation 2 (Low Dimension): In Table 2(b), the patch dimension is only 48, which is much smaller than the hidden dimension of 768. In this case, all combinations work well.
  • Analysis: This is the central evidence of the paper. When the input patch dimension is high and comparable to the network's capacity, the choice of prediction target is critical. The network cannot handle the task of predicting high-dimensional, off-manifold quantities. When the input dimension is low, the network has excess capacity and can succeed regardless of the prediction target. This explains why ε-prediction worked well in earlier works that used low-resolution datasets or latent diffusion models with low-dimensional latents.

6.2. Data Presentation (Tables)

6.2.1. Scalability and High-Resolution Generation

The paper shows that JiT scales gracefully to larger model sizes and higher resolutions.

The following are the results from Table 6 of the original paper:

FID on ImageNet at 256x256 (patch size 16) and 512x512 (patch size 32), after 200 and 600 training epochs:

| 256x256 | 200 ep | 600 ep | 512x512 | 200 ep | 600 ep |
|---|---|---|---|---|---|
| JiT-B/16 | 4.37 | 3.66 | JiT-B/32 | 4.64 | 4.02 |
| JiT-L/16 | 2.79 | 2.36 | JiT-L/32 | 3.06 | 2.53 |
| JiT-H/16 | 2.29 | 1.86 | JiT-H/32 | 2.51 | 1.94 |
| JiT-G/16 | 2.15 | 1.82 | JiT-G/32 | 2.11 | 1.78 |
  • Analysis: The FID consistently improves as the model size increases from Base (B) to Giant (G) and as training proceeds for more epochs. Importantly, the performance at 512x512 resolution is nearly as good as at 256x256, and for the largest model (JiT-G), it is even slightly better. This is achieved with almost the same computational cost because the sequence length remains the same (256 tokens), only the patch dimension changes. This demonstrates the method's excellent scalability and efficiency.
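The efficiency argument comes down to simple arithmetic: doubling the resolution while doubling the patch size keeps the token count constant and only grows the per-token patch dimension, as this small check (illustrative helper, not from the paper) shows.

```python
def tokens_and_patch_dim(resolution, patch):
    """Number of ViT tokens and raw patch dimensionality for a square RGB image."""
    num_tokens = (resolution // patch) ** 2
    patch_dim = patch * patch * 3
    return num_tokens, patch_dim

print(tokens_and_patch_dim(256, 16))   # (256, 768)   -> JiT-*/16 at 256x256
print(tokens_and_patch_dim(512, 32))   # (256, 3072)  -> JiT-*/32 at 512x512
```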

6.2.2. Comparison with State-of-the-Art

The following are the results from Table 8 of the original paper, comparing models on ImageNet 512x512:

| ImageNet 512x512 | tokenizer | perceptual | self-sup. | params | GFLOPs | FID↓ | IS↑ |
|---|---|---|---|---|---|---|---|
| Latent-space diffusion | | | | | | | |
| DiT-XL/2 [46] | SD-VAE | VGG | – | 675+49M | 525 | 3.04 | 240.8 |
| RAE [78], DiTDH-XL/2 | RAE | VGG | DINOv2 | 839+415M | 642 | 1.13 | 259.6 |
| Pixel-space diffusion | | | | | | | |
| SiD2 [26], UViT/2 | – | – | – | N/A | 653 | 1.48 | – |
| JiT-B/32 | – | – | – | 133M | 26 | 4.02 | 271.0 |
| JiT-L/32 | – | – | – | 462M | 89 | 2.53 | 299.9 |
| JiT-H/32 | – | – | – | 956M | 183 | 1.94 | 309.1 |
| JiT-G/32 | – | – | – | 2B | 384 | 1.78 | 306.8 |

The "tokenizer", "perceptual", and "self-sup." columns indicate the pre-trained components each method relies on.
  • Analysis: JiT achieves competitive FID scores compared to other methods. While the absolute best latent-space models (like RAE) that use extensive pre-training still achieve lower FID, JiT-G/32 (FID 1.78) is highly competitive with other top pixel-space models like SiD2 (FID 1.48) but does so with a much simpler, self-contained setup. Crucially, JiT models are orders of magnitude more computationally efficient. For example, JiT-H/32 achieves a 1.94 FID with only 183 GFLOPs, whereas DiT-XL/2 requires 525 GFLOPs for a 3.04 FID.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Bottleneck Can Be Beneficial

A surprising and powerful result comes from replacing the initial linear patch embedding layer with a bottleneck structure (two linear layers that first reduce and then expand the dimension).

The following chart (Figure 4 from the original paper) shows the effect of the bottleneck dimension on FID.

Figure 4. Bottleneck linear embedding. Results are for JiT-B/16 on ImageNet 256x256. A raw patch is 768-dim (16x16x3) and is embedded by two sequential linear layers with an intermediate bottleneck dimension $d' < 768$. Here, bottleneck embedding is generally beneficial, and the x-prediction model can work decently even with aggressive bottlenecks as small as 32 or 16. Settings (the same as Tab. 3): 200 epochs, with CFG. The chart plots FID-50K against the bottleneck dimension on a log scale: the bottleneck variants achieve lower FID than the plain 768-dim linear embedding across roughly 32 to 256 dimensions, with the best value around 64.

  • Observation: Reducing the dimension from the raw 768-dim patch to an intermediate dimension between 32 and 512 improves the FID score. Even an aggressive bottleneck to 16 dimensions does not cause catastrophic failure.
  • Analysis: This strongly supports the manifold hypothesis. The bottleneck forces the network to discard irrelevant, high-dimensional noise information and focus only on the low-dimensional features that define the data manifold. This acts as a form of regularization that helps the model learn a better representation of the clean data.
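The bottleneck variant simply replaces the single linear patch embedding with two linear layers through a narrow intermediate dimension. A minimal sketch, with dimensions taken from Figure 4 and all other details assumed:

```python
import torch.nn as nn

class BottleneckPatchEmbed(nn.Module):
    """Embed a flattened p*p*3 patch via a narrow intermediate dimension d'."""
    def __init__(self, patch_dim=768, bottleneck_dim=64, hidden_dim=768):
        super().__init__()
        self.down = nn.Linear(patch_dim, bottleneck_dim)   # e.g. 768 -> 64
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # e.g. 64 -> 768

    def forward(self, patches):                            # patches: (B, N, patch_dim)
        return self.up(self.down(patches))
```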

6.3.2. Noise Level is Not a Fix

The authors tested whether simply increasing the noise level could fix the failure of ε/v-prediction, as suggested by prior work for high-resolution generation.

The following are the results from Table 3 of the original paper:

FID under different time-shift values µ, which shift the effective noise level:

| t-shift (µ) | x-pred | ε-pred | v-pred |
|---|---|---|---|
| 0.0 (lower noise) | 14.44 | 464.25 | 120.03 |
| -0.4 | 9.79 | 372.91 | 109.93 |
| -0.8 | 8.62 | 372.36 | 96.53 |
| -1.2 (higher noise) | 8.99 | 355.25 | 106.85 |
  • Analysis: While adjusting the noise level (controlled by $\mu$) improves the x-prediction model, it does nothing to rescue the ε/v-prediction models from their catastrophic failure. This shows that the problem is not about the signal-to-noise ratio but is inherent to the prediction task itself.


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents a compelling argument, backed by strong empirical evidence, to "go back to basics" in designing denoising diffusion models. The authors conclude that the common practice of predicting noise or velocity (ε/v-prediction) is fundamentally flawed when dealing with high-dimensional data under the manifold assumption. Instead, models should be trained to perform classical denoising by directly predicting the clean image (x-prediction).

This simple conceptual shift enables a plain, large-patch Vision Transformer, dubbed JiT, to become a highly effective and efficient generative model for high-resolution images. The proposed approach is self-contained, requiring no pre-training or auxiliary losses, marking a significant step towards a simpler and more general paradigm for generative modeling on raw, natural data.

7.2. Limitations & Future Work

  • Limitations noted by authors: The paper focuses on establishing the core principle and does not explore all possible avenues for performance improvement. For instance, their brief experiment with an additional classification loss shows that there is potential for further gains by incorporating other objectives. The performance of JiT, while competitive, still slightly trails the most complex state-of-the-art systems that leverage extensive pre-training.
  • Future Work: The authors suggest that their minimalist design can serve as a strong foundation for future research. One clear direction is to combine JiT's x-prediction framework with other techniques like auxiliary losses or architectural refinements. The most exciting future direction is applying this self-contained "Diffusion + Transformer" paradigm to other scientific domains (e.g., protein modeling, weather forecasting) where designing a domain-specific tokenizer is a major bottleneck.

7.3. Personal Insights & Critique

This paper is an excellent example of research that delivers high impact through conceptual clarity rather than architectural complexity.

  • Inspiration: The key takeaway is that sometimes revisiting and questioning foundational assumptions can lead to significant breakthroughs. The de facto standard of ε-prediction was established based on empirical results in a specific context (lower-resolution images, U-Net architectures), and this paper shows that those findings do not generalize to the high-dimensional, ViT-based regime. It highlights the importance of aligning the learning objective with the underlying structure of the data (the manifold).
  • Potential Application: The JiT framework is extremely promising for scientific applications. Many scientific datasets consist of high-dimensional, raw sensor data where the concept of a "tokenizer" is not well-defined. A self-contained model that can learn directly from this raw data is highly valuable.
  • Critique and Nuance:
    • The paper's claim of "catastrophic failure" for ε/v-prediction is specific to their architectural setup (a plain ViT with large patches). Other pixel-space models using dense U-Nets (like ADM) have successfully used ε-prediction. The paper implicitly addresses this by suggesting that dense architectures avoid the information bottlenecks that make ε-prediction so difficult for a ViT. This is a crucial nuance: the recommendation for x-prediction is most critical for architectures that inherently create information bottlenecks, which is a common and efficient design pattern.

    • While the model is "self-contained," it still relies on large-scale labeled data (ImageNet) for class-conditional generation. Exploring its effectiveness in a completely unsupervised setting would be an interesting next step.

      Overall, "Back to Basics" is a strong, well-argued, and timely paper that provides a clear and powerful principle for designing the next generation of generative models. Its emphasis on simplicity, efficiency, and first principles is a welcome contribution to a field that can sometimes be dominated by ever-increasing complexity.
