
Diffusion Transformers with Representation Autoencoders

Published: 10/14/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces Representation Autoencoders (RAEs) that replace traditional Variational Autoencoders (VAEs) with pretrained representation encoders, enhancing image generation quality in Diffusion Transformers (DiTs). RAEs achieve high-quality reconstructions and semantically rich latent spaces, and the authors show how to train DiTs effectively in these high-dimensional latents, reaching state-of-the-art FID on ImageNet.

Abstract

Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Diffusion Transformers with Representation Autoencoders".

1.2. Authors

The authors are Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie, all affiliated with New York University.

1.3. Journal/Conference

The paper is currently published as a preprint on arXiv. While it doesn't specify an official journal or conference publication yet, arXiv is a highly respected platform for disseminating cutting-edge research in computer science, particularly in machine learning and artificial intelligence. The authors' affiliations with New York University suggest a strong academic background.

1.4. Publication Year

The paper was published on arXiv on October 13, 2025 (timestamp 2025-10-13T17:51:39 UTC).

1.5. Abstract

The paper addresses the limitations of traditional Variational Autoencoders (VAEs) in Diffusion Transformers (DiTs), which often use outdated backbones, low-dimensional latent spaces, and weak reconstruction-based representations, thereby compromising generative quality. The authors propose replacing VAEs with what they term Representation Autoencoders (RAEs). RAEs utilize frozen pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders to achieve both high-quality reconstructions and semantically rich, typically high-dimensional, latent spaces. A key challenge of operating DiTs in these high-dimensional latent spaces is analyzed, and theoretically motivated solutions are proposed and empirically validated. Their approach leads to faster convergence without needing auxiliary representation alignment losses. Using a DiT variant with a lightweight, wide DDT head, their method achieves strong image generation results on ImageNet, including 1.51 FID at 256×256 (without guidance) and 1.13 FID at both 256×256 and 512×512 (with guidance). The authors conclude that RAEs offer clear advantages and should become the new default for diffusion transformer training.

The original source link is https://arxiv.org/abs/2510.11690. The PDF link is https://arxiv.org/pdf/2510.11690v1.pdf. The paper is published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The field of generative modeling has seen significant advancements with latent diffusion models (LDMs) and Diffusion Transformers (DiTs), which achieve high visual fidelity and efficiency by operating in a learned latent space rather than raw pixels. However, the autoencoder component responsible for defining this latent space has remained largely stagnant. Most DiTs still rely on the original VAE encoder, which presents several critical limitations:

  • Outdated Backbones: The VAEs often use legacy convolutional designs, which can be computationally inefficient and compromise architectural simplicity compared to modern Transformer-based models.

  • Low-Dimensional Latent Spaces: Traditional VAEs typically produce heavily compressed, low-dimensional latent spaces. While intended for efficiency, this can restrict information capacity, leading to latents that capture local appearance but lack crucial global semantic structure needed for generalization and high-quality generation.

  • Weak Representations: VAEs are primarily trained with a reconstruction-only objective. This often results in latent representations that are weak or less semantically meaningful, ultimately limiting the generative quality of the downstream diffusion model.

    Meanwhile, visual representation learning has rapidly evolved, with self-supervised and multimodal encoders (e.g., DINO, SigLIP, MAE) learning semantically rich features. However, latent diffusion has largely been isolated from these advances, continuing to diffuse in reconstruction-trained VAE spaces. Existing attempts to bridge this gap, such as REPA-style alignment, introduce complexity with extra training stages and auxiliary losses.

The core problem the paper aims to solve is to modernize and enhance the autoencoder component in Diffusion Transformers to overcome the limitations of traditional VAEs, thereby improving generative quality and efficiency. The paper's innovative idea is to directly integrate advanced, pretrained visual representation encoders into the latent diffusion pipeline.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • Introduction of Representation Autoencoders (RAEs): The authors propose RAEs as a new class of autoencoders that replace the traditional VAE with a frozen pretrained representation encoder (e.g., DINO, SigLIP, MAE) combined with a lightweight, trained decoder. RAEs generate semantically rich and structurally coherent latent spaces while offering high-quality reconstructions.
  • Challenging Existing Assumptions: The work demonstrates that pretrained semantic encoders, often believed to be unsuitable for faithful reconstruction or high-dimensional diffusion, can indeed produce superior reconstructions and be effectively used in high-dimensional latent diffusion.
  • Theoretically Motivated Solutions for High-Dimensional Latents: The paper identifies and addresses the challenges of training Diffusion Transformers in RAE's high-dimensional latent spaces. It proposes three key solutions:
    1. Matching DiT Width to Token Dimensionality: It is shown that the Diffusion Transformer's width must match or exceed the RAE's token dimension for effective generation, providing both empirical evidence and a theoretical justification (Theorem 1).
    2. Dimension-Dependent Noise Schedule Shift: The paper generalizes resolution-based noise schedule shifts to account for the effective data dimension (number of tokens times their dimensionality), significantly improving performance in high-dimensional RAE spaces.
    3. Noise-Augmented Decoder Training: To mitigate out-of-distribution issues when the diffusion model generates noisy latents, the RAE decoder is trained with additive noise, enhancing its generalization capabilities to continuous latent distributions.
  • Introduction of DiT^DH (Wide Diffusion Head): A new DiT variant is introduced, which augments the standard DiT architecture with a shallow yet wide DDT head. This allows for increased model width and better denoising capabilities without incurring quadratic computational costs, especially beneficial for high-dimensional RAE latents.
  • State-of-the-Art Image Generation Results: The proposed RAE-based DiT^DH model achieves strong image generation results on ImageNet, setting new state-of-the-art FID scores:
    • 1.51 FID at 256×256 (no guidance)
    • 1.13 FID at 256×256 and 512×512 (with guidance). These results demonstrate significantly faster convergence and better generative quality compared to prior VAE-based and representation-aligned methods.
  • Reframing Autoencoding: The work redefines autoencoding from merely a compression mechanism to a foundational representation, enabling more efficient training and effective generation for Diffusion Transformers.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Autoencoders (AE)

An Autoencoder (AE) is a type of artificial neural network used to learn efficient data codings (representations) in an unsupervised manner. The goal of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". It consists of two main parts:

  • Encoder: This part compresses the input data into a latent-space representation (also called a latent vector or bottleneck).
  • Decoder: This part reconstructs the input data from the latent-space representation. The autoencoder is trained to minimize the difference between its input and its output (reconstruction loss).

3.1.2. Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) are a type of generative model that extend autoencoders by introducing a probabilistic approach to the latent space. Instead of mapping inputs directly to a fixed latent vector, a VAE's encoder maps inputs to the parameters of a probability distribution (typically a Gaussian distribution, defined by a mean and a variance) in the latent space. The decoder then samples from this distribution to reconstruct the input. This probabilistic formulation allows VAEs to generate new data samples by sampling from the learned latent distribution. VAEs are trained with two losses:

  • Reconstruction Loss: Measures how accurately the decoder reconstructs the input from the latent sample.
  • Kullback-Leibler (KL) Divergence: Regularizes the latent distribution to be close to a prior distribution (e.g., a standard normal distribution), ensuring a well-structured and generative latent space.
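
To make these two terms concrete, here is a minimal PyTorch sketch of a VAE training objective (reconstruction plus KL against a standard normal prior). The function names, the MSE reconstruction term, and the `beta` weight are illustrative choices, not taken from the paper.

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so gradients flow through the encoder."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Reconstruction loss + KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch."""
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="mean")
    # Closed-form KL divergence for a diagonal Gaussian posterior against N(0, I)
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return recon + beta * kl
```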

3.1.3. Diffusion Models

Diffusion Models are a class of generative models that learn to generate data by reversing a diffusion process. This process gradually adds noise to data until it becomes pure noise (forward process). The model is then trained to predict and remove this noise (reverse process), effectively learning to transform random noise into meaningful data. Key concepts:

  • Forward Diffusion Process: A fixed Markov chain that gradually adds Gaussian noise to an image $\mathbf{x}_0$ over $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. As $T \to \infty$, $\mathbf{x}_T$ approaches pure Gaussian noise.
  • Reverse Diffusion Process: A learned Markov chain that starts from pure noise $\mathbf{x}_T$ and iteratively denoises it over $T$ steps to generate a clean data sample $\mathbf{x}_0$. This is where the neural network (e.g., a U-Net or Transformer) comes in, learning to predict the noise added at each step.
  • Latent Diffusion Models (LDMs): Instead of diffusing in the pixel space, LDMs perform the diffusion process in a compressed latent space learned by an autoencoder (often a VAE). This significantly reduces computational costs and allows for high-resolution image generation.

3.1.4. Transformers

Transformers are a neural network architecture introduced in 2017, primarily used for sequence-to-sequence tasks, notably in natural language processing (NLP). Their key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element. This enables Transformers to capture long-range dependencies in data more effectively than recurrent neural networks. Key components:

  • Self-Attention: A mechanism that computes a weighted sum of all other elements in the input sequence, where the weights are learned based on the relevance of each element. The formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
    • $d_k$ is the dimension of the query and key vectors; dividing by $\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates and gradients vanish.
    • $\mathrm{softmax}$ normalizes the attention scores.
  • Multi-Head Attention: Allows the model to jointly attend to information from different representation subspaces at different positions.
  • Feed-Forward Networks: Position-wise fully connected layers applied to each position independently.
  • Positional Encoding: Adds information about the absolute or relative position of elements in the sequence, as Transformers themselves are permutation-invariant.
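
As a concrete illustration of the scaled dot-product attention formula above, here is a small self-contained NumPy sketch (single head; shapes and the toy usage are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n_queries, d_v)

# Toy usage: 4 tokens with d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```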

3.1.5. Diffusion Transformers (DiTs)

Diffusion Transformers (DiTs) adapt the Transformer architecture to serve as the backbone for diffusion models, particularly latent diffusion models. Instead of the traditional U-Net architecture often used for diffusion models, DiTs process the latent representations (tokens) generated by an autoencoder using a Transformer network. This leverages the scalability and expressive power of Transformers for denoising tasks within the latent space, leading to impressive results in image generation.

3.1.6. Flow Matching

Flow Matching is a training objective for generative models that frames the denoising process as learning a continuous-time vector field (or "flow") that transports samples from a simple prior distribution (e.g., Gaussian noise) to the complex data distribution. Unlike diffusion models that often involve stochastic differential equations (SDEs), flow matching can often be trained to learn an ordinary differential equation (ODE) that directly maps noise to data, potentially leading to faster and more stable sampling during inference. The objective typically involves predicting the velocity of interpolated samples between noise and data.

3.1.7. Fréchet Inception Distance (FID)

Fréchet Inception Distance (FID) is a widely used metric to assess the quality of images generated by generative models. It measures the "distance" between the distribution of features extracted from real images and those extracted from generated images. A lower FID score indicates higher quality and diversity in the generated images, suggesting they are closer to the real image distribution. The features are typically extracted from an intermediate layer of a pre-trained Inception-v3 network.

3.2. Previous Works

The paper extensively references prior research in representation learning and generative modeling, highlighting the evolution and current state of the art.

3.2.1. Representation for Reconstruction

Prior work has explored enhancing VAEs with semantic representations:

  • VA-VAE (Yao et al., 2025): This method aligns VAE latents with a pretrained representation encoder. The paper contrasts this by stating that while VA-VAE improves reconstruction and generation, it still relies on heavily compressed, low-dimensional latents, limiting fidelity and representation quality. The proposed RAEs, in contrast, reconstruct directly from representation encoder features without compression.

  • MAETok (Chen et al., 2025a), DC-AE 1.5 (Chen et al., 2025d), i-DEtok (Yang et al., 2025): These works incorporate MAE- or DAE-inspired objectives into VAE training. The paper notes that these methods also use heavily compressed latents, which RAEs aim to overcome by using frozen, powerful encoders.

    The common belief that representation encoders are unsuitable for reconstruction because they "emphasize high-level semantics while downplaying low-level details" (Tang et al., 2025; Yu et al., 2024b) is challenged. The paper demonstrates that with a properly trained decoder, frozen representation encoders like DINOv2 and SigLIP2 can serve as strong encoders for the diffusion latent space, yielding reconstructions on par with or even better than SD-VAE.

3.2.2. Representation for Generation

Previous research also focuses on using semantic representations to improve generative modeling:

  • REPA (Yu et al., 2025): Accelerates DiT convergence by aligning its middle block with representation encoder features. The paper differentiates its approach by training diffusion models directly on representation encoders (RAE) rather than aligning with an external encoder, achieving faster convergence.

  • DDT (Wang et al., 2025c): Further improves convergence by decoupling DiT into an encoder-decoder and applying REPA loss to the encoder output. The current paper takes inspiration from DDT by introducing a DDT head but applies it within the RAE framework for different design motivations.

  • REG (Wu et al., 2025): Introduces a learnable token into the DiT sequence and explicitly aligns it with a representation encoder's representation.

  • ReDi (Kouzelis et al., 2025b): Generates both VAE latents and PCA components of DINOv2 features within a diffusion model.

    These methods often involve complex alignment procedures or additional training stages. The paper's approach aims for a more direct integration, training diffusion models directly on RAE latents, leading to faster convergence without auxiliary losses.

3.3. Technological Evolution

The field of generative modeling has evolved from early pixel-space models directly capturing image statistics to latent diffusion models that operate in a learned, compact representation space.

  • Early Models: Focused on generating images directly in pixel space, which is computationally expensive and struggles with high-resolution images.

  • VAEs (Kingma & Welling, 2014): Introduced probabilistic latent spaces, allowing for generation and better data representation.

  • GANs (Goodfellow et al., 2014): Achieved impressive photorealism through adversarial training, but often suffered from training instability and mode collapse.

  • Diffusion Models (Ho et al., 2020; Dhariwal & Nichol, 2021): Emerged as powerful generative models, capable of high-quality image synthesis by iteratively denoising data.

  • Latent Diffusion Models (Rombach et al., 2022): Combined diffusion models with autoencoders (typically VAEs) to perform diffusion in a compressed latent space, significantly improving efficiency and enabling high-resolution generation.

  • Diffusion Transformers (DiT) (Peebles & Xie, 2023; Ma et al., 2024): Replaced the traditional U-Net backbone in diffusion models with Transformers, leveraging their scalability for improved performance.

    Parallel to this, visual representation learning has seen a rapid transformation:

  • Self-supervised Learning: Models like DINO (Oquab et al., 2023) and MAE (He et al., 2021) learn rich visual features from unlabeled data.

  • Multimodal Learning: Models like CLIP and SigLIP (Radford et al., 2021; Tschannen et al., 2025) learn representations that bridge vision and language.

    The current paper's work (RAE) fits within this timeline by aiming to bridge the gap between latent diffusion models (specifically DiTs) and the advancements in visual representation learning. It seeks to upgrade the autoencoder component of LDMs by integrating modern, semantically rich representation encoders, thereby moving beyond the limitations of reconstruction-only VAEs.

3.4. Differentiation Analysis

The core differences and innovations of this paper's approach, RAE, compared to main methods in related work, can be summarized as follows:

  • Direct Use of Frozen Pretrained Encoders: Unlike VA-VAE or REPA which align VAE latents with external encoders or use auxiliary losses, RAE directly uses frozen pretrained representation encoders (e.g., DINOv2, SigLIP, MAE) as the encoder component. This means the latent space inherently possesses the rich semantic structure learned by these powerful models without additional alignment training.
  • No Aggressive Latent Compression: Traditional VAEs (like SD-VAE) rely on heavy channel-wise compression, leading to low-capacity latents. RAEs, by leveraging the features of modern Transformer-based encoders, can maintain higher-dimensional, richer latent spaces without explicit compression-driven objectives, challenging the belief that low-dimensionality is always better for diffusion.
  • Focus on Decoder for Reconstruction: The RAE framework demonstrates that even representation encoders optimized for semantics can achieve excellent reconstruction quality when paired with a properly trained lightweight decoder. This contradicts the long-standing assumption that semantic encoders are unsuited for faithful pixel-level reconstruction.
  • Addressing High-Dimensional Latent Challenges Directly: The paper proactively tackles the perceived incompatibility of Diffusion Transformers with high-dimensional latent spaces. It proposes specific architectural and training adjustments (matching DiT width, dimension-dependent noise schedules, noise-augmented decoding) to make diffusion stable and efficient in these richer spaces, rather than avoiding them.
  • Integrated Generative and Semantic Modeling: RAEs intrinsically link semantic modeling (via the frozen encoder) and generative modeling (via the trained decoder and DiT) through a shared latent representation. This is a more direct and arguably cleaner integration compared to methods that introduce REPA-style alignment losses.
  • Enhanced Computational Efficiency: The RAE approach, particularly with the DiTDH variant, offers significant computational efficiency improvements. The RAE decoders are shown to be much more efficient than SD-VAE counterparts, and the DiTDH allows for scaling model width effectively without quadratic cost increases, leading to faster convergence and state-of-the-art results with less compute.

4. Methodology

The core methodology of this work revolves around replacing the traditional VAE in Diffusion Transformers with a Representation Autoencoder (RAE) and then adapting the Diffusion Transformer to effectively operate within the RAE's high-dimensional latent space.

4.1. Representation Autoencoders (RAEs)

The central idea is to use frozen, pretrained representation encoders (which are typically Transformer-based and optimized for learning rich visual semantics) as the encoder component of an autoencoder, and then train a lightweight decoder to reconstruct the original image from these representation features.

4.1.1. RAE Architecture and Training

The RAE consists of two main parts:

  1. Frozen Representation Encoder ($E$): This can be any powerful, pretrained visual encoder such as DINOv2-B, SigLIP2-B, or MAE-B. These encoders take an input image $\mathbf{x} \in \mathbb{R}^{3 \times H \times W}$ (where $H$, $W$ are height and width, and 3 is the number of RGB channels) and produce a sequence of $N$ tokens (features) in a latent space, each with channel dimension $d$. Specifically, if $p_e$ is the patch size of the encoder, then $N = HW / p_e^2$. The encoder is kept frozen during RAE training. Any [CLS] or [REG] tokens produced by the encoder are discarded, and only the patch tokens are used. A layer normalization is applied to each token independently to ensure zero mean and unit variance across channels.

  2. Trained ViT-based Decoder ($D$): A ViT (Vision Transformer) decoder is trained to map these $N$ latent tokens back to pixel space. The decoder uses a patch size $p_d$, and by default $p_d = p_e$. For 256×256 images, the encoder typically produces 256 tokens. The decoder reconstructs the image at a resolution of $3 \times \frac{H p_d}{p_e} \times \frac{W p_d}{p_e}$. A learnable [CLS] token is prepended to the decoder's input sequence, similar to MAE, but is discarded after decoding.

     The RAE decoder $D$ is trained with a combination of L1 loss, LPIPS (Learned Perceptual Image Patch Similarity) loss, and an adversarial loss. With $z = E(x)$ and $\hat{x} = D(z)$, the objective is
     $$\mathcal{L}_{rec}(x) = \omega_L\,\mathrm{LPIPS}(\hat{x}, x) + \mathrm{L1}(\hat{x}, x) + \omega_G\,\lambda\,\mathrm{GAN}(\hat{x}, x),$$
     where (a code sketch of this objective follows the symbol list below):

  • $x$: The input image.

  • $E(x)$: The frozen representation encoder's output, i.e., the latent tokens $z$.

  • $D(z)$: The decoder's reconstruction of the image, denoted $\hat{x}$.

  • $\mathcal{L}_{rec}(x)$: The total reconstruction loss for an input image $x$.

  • $\mathrm{LPIPS}(\hat{x}, x)$: The Learned Perceptual Image Patch Similarity loss, which measures perceptual similarity between the reconstructed image $\hat{x}$ and the original image $x$. $\omega_L$ is its weight (set to 1).

  • $\mathrm{L1}(\hat{x}, x)$: The L1 pixel-wise loss, measuring the absolute difference between $\hat{x}$ and $x$.

  • $\mathrm{GAN}(\hat{x}, x)$: The adversarial loss component, which encourages the reconstructed image $\hat{x}$ to be indistinguishable from real images by a discriminator. $\omega_G$ is its weight (set to 0.75).

  • $\lambda$: An adaptive weight for the GAN loss, defined as $\lambda = \frac{\lVert \nabla_{\hat{x}} \mathcal{L}_{rec} \rVert}{\lVert \nabla_{\hat{x}} \mathrm{GAN}(\hat{x}, x) \rVert + \epsilon}$, which balances the scales of the reconstruction and adversarial losses.
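
A minimal sketch of one RAE decoder training step under the objective above, assuming a frozen encoder `E`, a trainable decoder `D`, and placeholder `lpips_fn` and `disc` (discriminator) modules; all names and the non-saturating GAN form are illustrative, and for simplicity the adaptive weight takes gradients with respect to the reconstruction rather than the decoder's last layer.

```python
import torch

def rae_decoder_loss(x, E, D, lpips_fn, disc, w_lpips=1.0, w_gan=0.75, eps=1e-6):
    """Sketch of L_rec = w_L * LPIPS + L1 + w_G * lambda * GAN with a frozen encoder."""
    with torch.no_grad():
        z = E(x)                           # frozen representation encoder (patch tokens)
    x_hat = D(z)                           # trainable ViT decoder
    rec = w_lpips * lpips_fn(x_hat, x).mean() + (x_hat - x).abs().mean()
    gan = -disc(x_hat).mean()              # generator-side adversarial term (illustrative)
    # Adaptive lambda balances the gradient magnitudes of the two terms
    g_rec = torch.autograd.grad(rec, x_hat, retain_graph=True)[0].norm()
    g_gan = torch.autograd.grad(gan, x_hat, retain_graph=True)[0].norm()
    lam = (g_rec / (g_gan + eps)).detach()
    return rec + w_gan * lam * gan
```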

    The following are the results from Table 1 of the original paper:

    (a) Encoder choice (rFID; all RAE encoders outperform SD-VAE):
    Model | rFID
    DINOv2-B | 0.49
    SigLIP2-B | 0.53
    MAE-B | 0.16
    SD-VAE | 0.62

    (b) Decoder scaling (larger decoders improve rFID while remaining far more efficient than SD-VAE):
    Decoder | rFID | GFLOPs
    ViT-B | 0.58 | 22.2
    ViT-L | 0.50 | 78.1
    ViT-XL | 0.49 | 106.7
    SD-VAE | 0.62 | 310.4

    (c) Encoder scaling (rFID is stable across RAE encoder sizes):
    Encoder | rFID
    DINOv2-S | 0.52
    DINOv2-B | 0.49
    DINOv2-L | 0.52

    (d) Representation quality (RAEs have much higher linear probing accuracy than VAEs):
    Model | Top-1 Acc.
    DINOv2-B | 84.5
    SigLIP2-B | 79.1
    MAE-B | 68.0
    SD-VAE | 8.0

The results in Table 1 (above) demonstrate that RAEs consistently achieve better reconstruction quality (rFID) and representation quality (linear probing accuracy) compared to SD-VAE, while being more computationally efficient. For example, RAE with MAE-B achieves an rFID of 0.16, significantly outperforming SD-VAE's 0.62. Additionally, ViT-XL decoder achieves an rFID of 0.49 with 106.7 GFLOPs, which is much more efficient than SD-VAE's 310.4 GFLOPs.

As can be seen from the results in Figure 8, all RAEs achieve satisfactory reconstruction fidelity.

Figure 8: Reconstruction examples. From left to right: input image, RAE (DINOv2-B), RAE (SigLIP2-B), RAE (MAE-B), SD-VAE. Zoom in for details.

4.2. Taming Diffusion Transformers for RAE

The paper highlights that standard Diffusion Transformer (DiT) training recipes fail when applied directly to RAE's high-dimensional latent spaces. To address this, three theoretically motivated solutions are proposed.

4.2.1. Scaling DiT Width to Match Token Dimensionality

The first crucial insight is that for generation in RAE's latent space to succeed, the Diffusion Model's width must match or exceed the RAE's token dimension.

  • Problem: When DiTs are designed for compact SD-VAEs, they struggle with the increased dimensionality of RAE tokens. If the DiT's width is smaller than the RAE's token dimension, the model fails to learn effectively.

  • Empirical Observation: Experiments on overfitting a single image showed that sample quality is poor when the DiT's hidden dimension $d_h$ is less than the RAE's token dimension $n$, but improves sharply once $d_h \geq n$. Increasing depth alone did not resolve the issue.

    As can be seen from the results in Figure 3, increasing model width leads to lower loss and better sample quality, while changing model depth has marginal effect on overfitting results.

    Figure 3: Overfitting to a single sample. Left: increasing model width leads to lower loss and better sample quality; Right: changing model depth has a marginal effect on overfitting results.

  • Theoretical Justification (Theorem 1): The paper provides a theoretical lower bound on the training loss when the model's effective width is smaller than the data dimension.

    Theorem 1. Assume $\mathbf{x} \sim p(\mathbf{x}) \in \mathbb{R}^n$, $\varepsilon \sim \mathcal{N}(0, \mathbf{I}_n)$, and $t \in [0, 1]$. Let $\mathbf{x}_t = (1-t)\mathbf{x} + t\varepsilon$, and consider the function family
    $$\mathcal{G}_d = \{\, g(\mathbf{x}_t, t) = B f(A\mathbf{x}_t, t) : A \in \mathbb{R}^{d \times n},\; B \in \mathbb{R}^{n \times d},\; f : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d \,\}$$
    where $d < n$, $f$ is a stack of standard DiT blocks whose width is smaller than the token dimension of the representation encoder, and $A$, $B$ denote the input and output linear projections, respectively. Then for any $g \in \mathcal{G}_d$,
    $$\mathcal{L}(g, \theta) = \int_0^1 \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}),\, \varepsilon \sim \mathcal{N}(0, \mathbf{I}_n)} \big[\, \| g(\mathbf{x}_t, t) - (\varepsilon - \mathbf{x}) \|^2 \,\big]\, \mathrm{d}t \;\geq\; \sum_{i=d+1}^{n} \lambda_i,$$
    where $\lambda_i$ are the eigenvalues of the covariance matrix of the random variable $W = \varepsilon - \mathbf{x}$. Notably, when $d \geq n$, $\mathcal{G}_d$ contains the unique minimizer of $\mathcal{L}(g, \theta)$.

    Proof Explanation:

    1. Objective: The goal is to minimize the expected squared difference between the model's prediction $g(\mathbf{x}_t, t)$ and the target $(\varepsilon - \mathbf{x})$, averaged over time $t$. This target is the "velocity" needed to transform noisy data $\mathbf{x}_t$ back to clean data $\mathbf{x}$.
    2. Function Family $\mathcal{G}_d$: This family represents Diffusion Transformers whose internal processing dimension ($d$) is less than the input data dimension ($n$). $A$ and $B$ are linear projections that map the $n$-dimensional input to $d$ dimensions and back to $n$ dimensions, respectively. $f$ represents the main DiT blocks operating in the $d$-dimensional space.
    3. Dimensionality Constraint: The key constraint is $d < n$. This means the DiT is bottlenecked by a lower-dimensional internal representation.
    4. Implication of the Bottleneck: Due to the bottleneck, the output of $g(\mathbf{x}_t, t)$ is restricted to a subspace of $\mathbb{R}^n$ with dimension at most $d$. However, the target $(\varepsilon - \mathbf{x})$ generally lies in the full $n$-dimensional space.
    5. Lower Bound: The theorem states that if $d < n$, there is an inherent irreducible error in approximating the target, bounded below by the sum of the smallest $n - d$ eigenvalues of the covariance matrix of $W = \varepsilon - \mathbf{x}$. These eigenvalues capture the variance in the directions that the model cannot represent due to its lower dimensionality.
    6. Full Capacity: When $d \geq n$, the model has enough capacity to represent the full $n$-dimensional target, so the loss can reach the minimum possible error under the flow matching objective.
    • In simpler terms: If the Transformer is too "thin" (hidden dimension $d$) to fully capture the "width" (token dimension $n$) of the data it processes, it will inevitably miss some information, leading to a higher minimum possible error. This justifies scaling the DiT's width to match or exceed the RAE's token dimension; a small numerical illustration follows this list.
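
The following NumPy sketch illustrates the intuition behind Theorem 1 under strong simplifications (it is not the theorem's proof): even if the target $W = \varepsilon - \mathbf{x}$ is projected onto its best possible $d$-dimensional subspace, the mean squared residual equals the sum of the $n - d$ smallest eigenvalues of $\mathrm{Cov}(W)$, so any width-$d$ bottleneck incurs at least this error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num = 16, 8, 200_000

# Synthetic target W = eps - x with an anisotropic covariance.
A_mix = rng.normal(size=(n, n)) / np.sqrt(n)
x = rng.normal(size=(num, n)) @ A_mix.T        # correlated "data"
eps = rng.normal(size=(num, n))                # white Gaussian noise
W = eps - x

# Eigen-decomposition of Cov(W).
cov = np.cov(W, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # ascending order
lam_desc = eigvals[::-1]                       # descending order

# Best rank-d linear approximation: project W onto its top-d principal directions.
top_d = eigvecs[:, -d:]
residual = np.mean(np.sum((W - W @ top_d @ top_d.T) ** 2, axis=1))

print("sum of the n-d smallest eigenvalues:", lam_desc[d:].sum())
print("residual of the best rank-d projection:", residual)   # approximately equal
```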

      The following are the results from Table 3 of the original paper:

      RAE encoder | DiT-S | DiT-B | DiT-L
      DINOv2-S | 3.6e-2 ✓ | 1.0e-3 | 9.7e-4
      DINOv2-B | 5.2e-1 ✗ | 2.4e-2 ✓ | 1.3e-3
      DINOv2-L | 6.5e-1 ✗ | 2.7e-1 ✗ | 2.2e-2 ✓

The results in Table 3 (above) demonstrate that convergence (indicated by '✓') occurs only when the DiT model's width is at least as large as the RAE token dimension. Conversely, when the DiT model's width is smaller, the loss fails to converge (indicated by 'X').

4.2.2. Dimension-Dependent Noise Schedule Shift

  • Problem: Prior noise scheduling strategies (e.g., resolution-based shifts) were derived for pixel-based or VAE-based inputs with few channels. These strategies did not account for the high dimensionality of RAE tokens, where the "effective resolution" per token increases with the number of channels, reducing information corruption at the same noise level.
  • Solution: The paper generalizes existing resolution-dependent strategies to a dimension-dependent shift. This means the noise schedule is adjusted based on the effective data dimension, defined as the number of tokens multiplied by their dimensionality.
  • Method: The shifting strategy from Esser et al. (2024) is adopted. For a schedule $t_n \in [0, 1]$, a base dimension $n$ (here 4096), and an effective RAE data dimension $m$, the shifted timestep $t_m$ is defined as $t_m = \frac{\alpha t_n}{1 + (\alpha - 1) t_n}$, where $\alpha = \sqrt{m/n}$ is a dimension-dependent scaling factor. This adjustment appropriately manages the noise levels for high-dimensional RAE latents and leads to significant performance gains; a code sketch follows below.
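
A minimal sketch of the dimension-dependent timestep shift defined above; the base dimension of 4096 comes from the text, and the example value of m assumes DINOv2-B latents (256 tokens × 768 channels), which is an illustrative choice.

```python
import numpy as np

def shift_timestep(t_n, m, n_base=4096):
    """Map a base-schedule timestep t_n in [0, 1] to the shifted timestep t_m
    for an effective data dimension m = (number of tokens) x (token dimension)."""
    alpha = np.sqrt(m / n_base)
    return alpha * t_n / (1.0 + (alpha - 1.0) * t_n)

# Example: 256 tokens x 768 channels = 196,608 effective dimensions.
t = np.linspace(0.0, 1.0, 5)
print(shift_timestep(t, m=256 * 768))   # timesteps shifted toward the noisier end
```

Since α > 1 when m exceeds the base dimension, the shift moves every intermediate timestep toward the noisier end of the schedule, compensating for the fact that high-dimensional latents are corrupted less at a given noise level.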

The following are the results from Table 4 of the original paper:

gFID
w/o shift 23.08
w/ shift 4.81

The results in Table 4 (above) show that applying the dimension-dependent noise schedule shift yields significant performance gains, reducing gFID from 23.08 to 4.81.

4.2.3. Noise-Augmented Decoding

  • Problem: Unlike VAEs that map to continuous latent distributions (e.g., Gaussian), RAE decoders are trained to reconstruct from a discrete distribution of clean latent features from the encoder. However, Diffusion Models at inference time can generate noisy or slightly deviated latents. This mismatch can cause out-of-distribution (OOD) issues for the RAE decoder, degrading sampling quality.

  • Solution: To make the RAE decoder robust to these noisy latents, its training is augmented by adding Gaussian noise $\mathbf{n} \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ to the clean latent representations $z$ before decoding. The decoder $D$ is then trained on this smoothed distribution $p_{\mathbf{n}}(z)$.

  • Method: The decoder is trained on $z + \mathbf{n}$ instead of just $z$. The noise level $\sigma$ is made stochastic by sampling it from $|\mathcal{N}(0, \tau^2)|$. This regularizes training and improves robustness to the continuous outputs of diffusion models (see the sketch after this list).
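
A hedged sketch of the noise augmentation applied to latents during decoder training; the value of `tau` and the helper name are illustrative, not the paper's settings.

```python
import torch

def noise_augment(z, tau=0.5):
    """Add Gaussian noise with a stochastic per-sample level sigma ~ |N(0, tau^2)|,
    so the decoder is trained on the smoothed latent distribution p_n(z)."""
    b = z.shape[0]
    sigma = (tau * torch.randn(b, device=z.device)).abs()       # sigma ~ |N(0, tau^2)|
    sigma = sigma.view(b, *([1] * (z.dim() - 1)))               # broadcast over tokens/channels
    return z + sigma * torch.randn_like(z)

# During decoder training (sketch): x_hat = D(noise_augment(E(x)))
```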

    The following are the results from Table 5 of the original paper:

    Decoder input | rFID | gFID
    z ∼ p(z) | 0.49 | 4.81
    z ∼ p_n(z) | 0.57 | 4.28

The results in Table 5 (above) show that noise-augmented decoding improves gFID (from 4.81 to 4.28) but slightly worsens rFID (from 0.49 to 0.57). This trade-off is expected, as smoothing the latent distribution for better generalization might reduce exact reconstruction accuracy.

4.3. DiT^DH: Improving Model Scalability with a Wide Diffusion Head

To overcome the computational expense of scaling the entire DiT backbone to handle higher-dimensional RAE latents, the paper introduces a new DiT variant called DiTDH (Diffusion Transformer with a Wide Diffusion Head). This design is inspired by DDT (Wang et al., 2025c).

4.3.1. Wide DDT Head Architecture

A DiTDH model consists of:

  1. Base DiT (MM): A standard Diffusion Transformer backbone.
  2. Additional Wide, Shallow Transformer Head ($H$): This is a lightweight Transformer module specifically dedicated to denoising. It is shallow (few layers) but wide (large hidden dimension). The DDT head receives the output of the base DiT along with the noisy input and timestep information. The combined model predicts the velocity $v_t$ as $z_t = M(x_t \mid t, y)$ and $v_t = H(x_t \mid z_t, t)$, where (a structural sketch follows the symbol list below):
  • $x_t$: The noisy input at timestep $t$.

  • $t$: The current timestep.

  • $y$: An optional class label for conditional generation.

  • $M(x_t \mid t, y)$: The base DiT model, which processes the noisy input $x_t$ conditioned on $t$ and $y$, producing an intermediate representation $z_t$.

  • $H(x_t \mid z_t, t)$: The DDT head, which takes the original noisy input $x_t$, the intermediate representation $z_t$ from the base DiT, and the timestep $t$, and predicts the final velocity $v_t$.
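
A structural sketch of the DiT^DH factorization above, with toy stand-in modules so it runs end to end; the real M and H are Transformer stacks, and all module names here are hypothetical.

```python
import torch
import torch.nn as nn

class DiTWithWideHead(nn.Module):
    """Base DiT M produces z_t; a shallow, wide head H predicts the velocity v_t."""
    def __init__(self, backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone      # M(x_t | t, y): standard DiT blocks
        self.head = head              # H(x_t | z_t, t): e.g. 2 layers of width 2048

    def forward(self, x_t, t, y):
        z_t = self.backbone(x_t, t, y)    # intermediate representation from the base DiT
        return self.head(x_t, z_t, t)     # final velocity prediction v_t

class _ToyBlock(nn.Module):
    """Stand-in for a Transformer stack: a single linear layer over token features."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, cond_a, cond_b):
        return self.proj(x)               # conditioning inputs ignored in this toy module

model = DiTWithWideHead(_ToyBlock(768), _ToyBlock(768))
x_t = torch.randn(2, 256, 768)            # (batch, tokens, token dimension), e.g. DINOv2-B
v_t = model(x_t, torch.rand(2), torch.zeros(2, dtype=torch.long))
print(v_t.shape)                          # torch.Size([2, 256, 768])
```

Widening only the shallow head increases denoising capacity without paying the cost of widening every layer of the backbone.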

    The figure below illustrates the Wide DDT Head and its connections within the Diffusion Transformer framework.

    Figure 5: The Wide DDT Head. Inputs $x_t$, $t$, and $y$ flow into the DiT backbone; its output, together with $x_t$ and $t$, feeds the DDT head, which produces the velocity $v_t$.

The DDT head allows the model to effectively increase its width (capacity to process high-dimensional information) without incurring the quadratic computational costs that would arise from scaling the entire base DiT (due to self-attention). This design is particularly effective for RAE's high-dimensional latent spaces.

The following are the results from Table 16 of the original paper:

Depth Width GFLOPs FID↓
6 1152 (XL) 25.65 2.36
4 2048 (G) 53.14 2.31
2 2048 (G) 26.78 2.16

The results in Table 16 (above) indicate that a wide and shallow DDT head is more effective for denoising. A 2-layer, 2048-dim (G) head outperforms deeper (4-layer, G-width) or narrower (6-layer, XL-width) heads, even at similar GFLOPs.

The following are the results from Table 17 of the original paper:

Encoder \ Head (depth-width) | 2-768 | 2-1536 | 2-2048 | 2-2688
DINOv2-S | 2.66 | 2.47 | 2.42 | 2.43
DINOv2-B | 2.49 | 2.24 | 2.16 | 2.22
DINOv2-L | N/A | 2.95 | 2.73 | 2.64

The results in Table 17 (above) show that the optimal DDT head width increases with the RAE encoder size. Larger RAE encoders (e.g., DINOv2-L) benefit more from wider DDT heads, suggesting better utilization of the richer latent representations.

4.4. Flow-Based Models (Generative Process)

The paper adopts a flow matching objective for training the Diffusion Transformer. This involves a continuous-time formulation where samples are interpolated between clean data and Gaussian noise.

  • Interpolation: A linear interpolation is used to generate noisy samples $\mathbf{x}_t$ at any timestep $t \in [0, 1]$: $\mathbf{x}_t = (1 - t)\mathbf{x} + t\varepsilon$, where:
    • $\mathbf{x}$: A clean data sample drawn from the real data distribution $p(\mathbf{x})$.
    • $\varepsilon$: Pure Gaussian noise drawn from $\mathcal{N}(0, \mathbf{I})$.
    • $t$: A timestep varying from 0 to 1. At $t = 0$, $\mathbf{x}_0 = \mathbf{x}$ (clean data); at $t = 1$, $\mathbf{x}_1 = \varepsilon$ (pure noise).
  • Velocity Prediction: The model is trained to predict the velocity vector $v(\mathbf{x}_t, t)$ that transports a sample from $\mathbf{x}_t$ toward the clean data distribution. This velocity is formally the conditional expectation of $(\varepsilon - \mathbf{x})$ given $\mathbf{x}_t$: $v(\mathbf{x}_t, t) = \mathbb{E}[\varepsilon - \mathbf{x} \mid \mathbf{x}_t]$.
  • Training Objective: The network $v_\theta$ (the DiT or DiTDH) is trained to minimize the squared difference between its prediction and the velocity target $(\varepsilon - \mathbf{x})$:
    $$\mathcal{L}_{\mathrm{velocity}}(\theta) = \int_0^1 \mathbb{E}_{\mathbf{x}, \varepsilon} \big[ \| v_\theta(\mathbf{x}_t, t) - (\varepsilon - \mathbf{x}) \|^2 \big] \, \mathrm{d}t,$$
    where $\mathbb{E}_{\mathbf{x}, \varepsilon}$ denotes expectation over $\mathbf{x} \sim p(\mathbf{x})$ and $\varepsilon \sim \mathcal{N}(0, \mathbf{I}_n)$.
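
A minimal sketch of this flow-matching training loss in the latent space; `v_model` stands for any velocity-prediction network (e.g. a DiT over RAE tokens), and the names are illustrative.

```python
import torch

def flow_matching_loss(v_model, x):
    """L = E || v_theta(x_t, t) - (eps - x) ||^2 with x_t = (1 - t) x + t * eps."""
    b = x.shape[0]
    t = torch.rand(b, device=x.device)                  # t ~ U[0, 1]
    eps = torch.randn_like(x)                           # Gaussian noise
    t_b = t.view(b, *([1] * (x.dim() - 1)))             # broadcast t over latent dims
    x_t = (1.0 - t_b) * x + t_b * eps                   # linear interpolation
    target = eps - x                                    # velocity target
    return ((v_model(x_t, t) - target) ** 2).mean()

# Usage sketch: x is a batch of RAE latents, e.g. shape (B, 256, 768) for DINOv2-B.
```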

4.5. Guidance

The paper explores two types of guidance mechanisms to improve sample quality, especially for conditional generation: AutoGuidance and Classifier-Free Guidance (CFG).

4.5.1. AutoGuidance

AutoGuidance (Karras et al., 2025) is the primary guidance method used. The core idea is to use a weaker, typically earlier checkpoint of the diffusion model itself, to guide a stronger diffusion model.

  • Principle: Similar to CFG, it leverages the intuition that a less capable model or an earlier-trained checkpoint can provide useful directional cues to a more capable model. Weaker models or early checkpoints often capture broader structures and make bolder predictions, which can help guide the generation process more effectively without being overly prescriptive.
  • Implementation: A smaller DiTDH variant (e.g., DiTDH-S) or an earlier checkpoint of the main model is used as the "guidance model." This guidance model helps steer the sampling process of the main, more powerful DiTDH-XL model.
  • Benefits: Easier to tune than CFG with interval and generally delivers better performance.

4.5.2. Classifier-Free Guidance (CFG)

Classifier-Free Guidance (Ho & Salimans, 2022) is a common technique in diffusion models to improve the quality and adherence to conditions (e.g., class labels).

  • Principle: It combines predictions from a conditional diffusion model (trained with class labels) and an unconditional diffusion model (trained without class labels) to exaggerate the effect of the condition.
  • Equation for CFG: The guided velocity prediction $v_{\mathrm{guided}}$ is typically calculated as $v_{\mathrm{guided}}(\mathbf{x}_t, t, y) = v_\theta(\mathbf{x}_t, t, \emptyset) + s \cdot \big(v_\theta(\mathbf{x}_t, t, y) - v_\theta(\mathbf{x}_t, t, \emptyset)\big)$, where (see the sketch after this list):
    • $v_\theta(\mathbf{x}_t, t, y)$: The velocity predicted by the model conditioned on class label $y$.
    • $v_\theta(\mathbf{x}_t, t, \emptyset)$: The velocity predicted by the model with the empty (unconditional) label $\emptyset$.
    • $s$: The guidance scale, a hyperparameter controlling the strength of the guidance. A higher $s$ pushes the generation more strongly toward the conditional input but can also reduce diversity or cause mode collapse.
  • Guidance Interval: CFG can be applied with Guidance Interval (Kynkäänniemi et al., 2024), where guidance is only applied during specific timesteps (intervals) of the sampling process. This can prevent over-guidance in early or late stages.
  • Observation in Paper: The paper notes that CFG without interval does not improve FID and can even increase it. While CFG with Guidance Interval can achieve competitive FID after careful tuning, AutoGuidance generally performs better for the final model and has lower tuning overhead.
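
A sketch of both guidance rules. The CFG function follows the equation above; the AutoGuidance function uses the same extrapolation pattern but replaces the unconditional prediction with a weaker model's prediction (a smaller variant or earlier checkpoint), which is a simplified reading of Karras et al. (2025) rather than the paper's exact recipe.

```python
def cfg_velocity(v_model, x_t, t, y, null_label, scale):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    v_cond = v_model(x_t, t, y)
    v_uncond = v_model(x_t, t, null_label)
    return v_uncond + scale * (v_cond - v_uncond)

def autoguidance_velocity(v_strong, v_weak, x_t, t, y, scale):
    """AutoGuidance (simplified): guide a strong model away from a weaker one's prediction."""
    v_good = v_strong(x_t, t, y)
    v_bad = v_weak(x_t, t, y)
    return v_bad + scale * (v_good - v_bad)
```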

5. Experimental Setup

5.1. Datasets

The primary dataset used for both decoder training and diffusion model training is ImageNet-1K.

  • Source: ImageNet-1K is a subset of the larger ImageNet dataset, consisting of 1,000 object categories.

  • Scale: Contains over 1.2 million training images and 50,000 validation images.

  • Characteristics and Domain: It is a large-scale dataset of natural images, covering a wide range of object categories (animals, vehicles, everyday objects, etc.). The images are diverse in content, style, and complexity, making it a standard benchmark for image generation and classification tasks.

  • Resolution: Most experiments are conducted at a resolution of 256×256. For 512×512 synthesis without decoder upsampling, decoders and diffusion models are trained directly on 512×512 images.

  • Dataset Balancing: The paper notes that the original ImageNet training set is inherently unbalanced, with class sizes ranging from approximately 732 to 1,300 samples. However, 895 classes contain exactly 1,300 samples, indicating a high degree of near-equivalence among most classes.

    An example image from ImageNet-1K could be:

    Figure 10: Uncurated 512×512 DiTDH-XL samples, AutoGuidance scale = 1.5, class label = "golden retriever" (207). This image (Figure 10 from the original paper) shows examples of the "golden retriever" class, a common category in ImageNet-1K.

5.2. Evaluation Metrics

The paper uses several standard metrics to evaluate the quality and diversity of generated images.

5.2.1. Fréchet Inception Distance (FID)

  • Conceptual Definition: FID measures the "distance" between the feature distributions of real and generated images. It quantifies how similar the generated images are to real images in terms of their perceptual quality and diversity. A lower FID score indicates that the generated images are more realistic and diverse, thus closer to the distribution of real images.
  • Mathematical Formula: $ \text{FID} = \lVert \mu_x - \mu_g \rVert^2 + \text{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}) $
  • Symbol Explanation:
    • $\mu_x$: The mean of the feature vectors for real images.
    • $\mu_g$: The mean of the feature vectors for generated images.
    • $\Sigma_x$: The covariance matrix of the feature vectors for real images.
    • $\Sigma_g$: The covariance matrix of the feature vectors for generated images.
    • $\text{Tr}(\cdot)$: The trace of a matrix. The features are typically extracted from the penultimate layer of a pretrained Inception-v3 network.
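
A compact sketch of this FID formula, computed from precomputed Inception feature arrays (feature extraction itself is omitted; SciPy is assumed for the matrix square root):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_features(feats_real, feats_gen):
    """FID = ||mu_x - mu_g||^2 + Tr(Sigma_x + Sigma_g - 2 (Sigma_x Sigma_g)^{1/2})."""
    mu_x, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):          # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```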

5.2.2. Inception Score (IS)

  • Conceptual Definition: IS measures both the sharpness (quality) and diversity of generated images. It relies on a pretrained Inception-v3 network to classify generated images. A high IS indicates that generated images are both clearly recognizable as specific objects (low entropy of conditional class probabilities, meaning high sharpness) and that there is a wide variety of generated objects (high entropy of marginal class probabilities, meaning high diversity).
  • Mathematical Formula: $ \text{IS} = \exp\big( \mathbb{E}_x \big[ D_{KL}\big(p(y|x) \,\Vert\, p(y)\big) \big] \big) $
  • Symbol Explanation:
    • $\mathbb{E}_x$: Expectation over generated image samples $x$.
    • $p(y|x)$: The conditional class distribution (softmax output) predicted by an Inception model for a generated image $x$.
    • $p(y)$: The marginal class distribution, obtained by averaging $p(y|x)$ over all generated samples.
    • $D_{KL}(\cdot \Vert \cdot)$: The Kullback-Leibler (KL) divergence.
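
A sketch of the Inception Score computed from an array of per-image class probabilities p(y|x) (softmax outputs of an Inception classifier); the common practice of averaging over several splits is omitted for brevity.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ) for probs of shape (num_images, num_classes)."""
    p_y = probs.mean(axis=0, keepdims=True)                                 # marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)  # per-image KL divergence
    return float(np.exp(kl.mean()))
```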

5.2.3. Precision and Recall

  • Conceptual Definition: These metrics, in the context of generative models, quantify how well the generated data distribution covers the true data distribution (recall) and how much of the generated data is realistic (precision).
    • Precision: Reflects the fraction of generated images that appear realistic or belong to the manifold of real data. High precision means fewer "junk" samples.
    • Recall: Reflects the portion of the training data manifold covered by generated samples. High recall means the model can generate a wide variety of real-like images, not just a few modes.
  • Mathematical Formulas: (These are derived from nearest neighbor distances in feature space, typically using Inception features. The paper does not provide explicit formulas, but refers to Kynkäänniemi et al., 2019.) Typically, these are computed by embedding real and generated images into a feature space (e.g., Inception features) and then finding nearest neighbors.
    • For Precision: Calculate for each generated image, its distance to the nearest real image. If this distance is below a threshold, the generated image is considered "realistic." Precision is the proportion of realistic generated images.
    • For Recall: Calculate for each real image, its distance to the nearest generated image. If this distance is below a threshold, the real image's mode is considered "covered." Recall is the proportion of covered real images.
  • Symbol Explanation: While specific symbols are not provided in the paper's abstract, the underlying concepts involve distances in a feature space and counting ratios based on proximity thresholds.
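
A simplified sketch of the thresholded nearest-neighbor precision and recall described above; it uses a single global distance threshold rather than the per-point k-NN radii of Kynkäänniemi et al. (2019), so it only illustrates the idea.

```python
import numpy as np

def precision_recall(feats_real, feats_gen, threshold):
    """Precision: fraction of generated samples within `threshold` of some real sample.
    Recall: fraction of real samples within `threshold` of some generated sample."""
    # Pairwise Euclidean distances: rows = generated, columns = real.
    d = np.linalg.norm(feats_gen[:, None, :] - feats_real[None, :, :], axis=-1)
    precision = float((d.min(axis=1) <= threshold).mean())
    recall = float((d.min(axis=0) <= threshold).mean())
    return precision, recall
```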

5.3. Baselines

The paper compares its method against a wide range of state-of-the-art generative models across different paradigms:

5.3.1. Autoregressive Models

These models generate images pixel by pixel or token by token, sequentially.

  • VAR (Tian et al., 2024): Visual Autoregressive modeling.
  • MAR (Li et al., 2024b): Autoregressive image generation without vector quantization.
  • xAR (Ren et al., 2025): Next-x prediction for autoregressive visual generation.

5.3.2. Pixel Diffusion Models

These are diffusion models that operate directly in the pixel space.

  • ADM (Dhariwal & Nichol, 2021): Improved denoising diffusion probabilistic models.
  • RIN (Jabri et al., 2023): Scalable adaptive computation for iterative generation.
  • PixelFlow (Chen et al., 2025e): Pixel-space generative models with flow.
  • PixNerd (Wang et al., 2025b): Pixel neural field diffusion.
  • SiD2 (Hoogeboom et al., 2025): Simpler diffusion (sid2): 1.5 FID on ImageNet512 with pixel-space diffusion.

5.3.3. Latent Diffusion with VAE

These are latent diffusion models that use traditional VAEs to define their latent space.

  • DiT (Peebles & Xie, 2023): Scalable diffusion models with Transformers.

  • MaskDiT (Zheng et al.): (Full reference not provided in bibliography, but context suggests a Masked Diffusion Transformer variant).

  • SiT (Ma et al., 2024): Exploring flow and diffusion-based generative models with scalable interpolant transformers.

  • MDTv2 (Gao et al., 2023): Masked diffusion transformer is a strong image synthesizer.

  • VA-VAE (Yao et al., 2025): Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models.

  • REPA (Yu et al., 2025): Representation alignment for generation: Training diffusion transformers is easier than you think.

  • DDT (Wang et al., 2025c): Decoupled diffusion transformer.

  • REPA-E (Leng et al., 2025): Unlocking VAE for end-to-end tuning with latent diffusion transformers.

    These baselines are representative because they cover the major paradigms in image generation (autoregressive, pixel diffusion, latent diffusion) and include state-of-the-art methods within each category, especially focusing on Diffusion Transformers and methods that attempt to improve latent space quality. Comparing against these diverse baselines allows the paper to demonstrate the superiority of RAE-based DiTs across different computational costs and architectural choices.

5.4. Training Details

  • Flow Matching Objective: Adopted with linear interpolation $\mathbf{x}_t = (1-t)\mathbf{x} + t\varepsilon$.
  • Model Backbone: LightningDiT (Yao et al., 2025), a variant of DiT (Peebles & Xie, 2023).
  • Patch Size: 1 for all RAE-based models (resulting in 256 tokens for 256×256 images). For VAE and pixel inputs, patch sizes of 2 and 16 are used, respectively. The computational cost for the DiT backbone remains similar across these settings due to the fixed token length.
  • Timestep Input: Continuous-time formulation with input values in [0, 1]. A Gaussian Fourier embedding layer replaces the standard timestep embedding.
  • Positional Embeddings: Absolute Positional Embeddings (APE) are added in addition to RoPE (Rotary Positional Embeddings), though their impact was not significant.
  • Optimization (DiT): AdamW optimizer, constant learning rate of $2.0 \times 10^{-4}$, batch size of 1024, EMA weight of 0.9999.
  • Optimization (DiTDH): Linear learning rate decay from $2.0 \times 10^{-4}$ to $2.0 \times 10^{-5}$ with a constant warmup of 40 epochs. The EMA weight is changed to 0.9995. Gradient clipping of 1.0 is used.
  • Sampling: Standard ODE sampling with Euler sampler and 50 steps by default.
  • Computation: PyTorch/XLA on TPU for RAE training and inference. Evaluation uses one v6e-8 for 50k samples.

5.5. FID Evaluation Protocol

  • Sample Generation: For conditional FID evaluation, 50 images are sampled from each class for a total of 50,000 images (class-balanced sampling). The paper notes that some prior works used uniform random sampling across 1,000 class labels, which can yield slightly different (~0.1 lower FID) scores.
  • Reference Statistics: Taken from ADM pre-computed statistics (Dhariwal & Nichol, 2021) over the full ImageNet dataset.
  • Re-evaluation of Baselines: To ensure fair comparison, several recent methods with accessible checkpoints are re-evaluated using class-balanced sampling and their reported scores updated.

6. Results & Analysis

6.1. Core Results Analysis

The paper demonstrates that RAE-based Diffusion Transformers achieve state-of-the-art performance, significantly outperforming prior methods in terms of FID and convergence speed.

The following are the results from Table 8 of the original paper:

Method | Epochs | #Params | gFID↓ | IS↑ | Prec.↑ | Rec.↑ | gFID↓ | IS↑ | Prec.↑ | Rec.↑
(columns 4-7: Generation@256 w/o guidance; columns 8-11: Generation@256 w/ guidance; "-" = not reported)
Autoregressive
VAR (Tian et al., 2024) | 350 | 2.0B | 1.92 | 323.1 | 0.82 | 0.59 | 1.73 | 350.2 | 0.82 | 0.60
MAR (Li et al., 2024b) | 800 | 943M | 2.35 | 227.8 | 0.79 | 0.62 | 1.55 | 303.7 | 0.81 | 0.62
xAR (Ren et al., 2025) | 800 | 1.1B | - | - | - | - | 1.24 | 301.6 | 0.83 | 0.64
Pixel Diffusion
ADM (Dhariwal & Nichol, 2021) | 400 | 554M | 10.94 | 101.0 | 0.69 | 0.63 | 3.94 | 215.8 | 0.83 | 0.53
RIN (Jabri et al., 2023) | 480 | 410M | 3.42 | 182.0 | - | - | - | - | - | -
PixelFlow (Chen et al., 2025e) | 320 | 677M | - | - | - | - | 1.98 | 282.1 | 0.81 | 0.60
PixNerd (Wang et al., 2025b) | 160 | 700M | - | - | - | - | 2.15 | 297.0 | 0.79 | 0.59
SiD2 (Hoogeboom et al., 2025) | 1280 | - | - | - | - | - | 1.38 | - | - | -
Latent Diffusion with VAE
DiT (Peebles & Xie, 2023) | 1400 | 675M | 9.62 | 121.5 | 0.67 | 0.67 | 2.27 | 278.2 | 0.83 | 0.57
MaskDiT (Zheng et al.) | 1600 | 675M | 5.69 | 177.9 | 0.74 | 0.60 | 2.28 | 276.6 | 0.80 | 0.61
SiT (Ma et al., 2024) | 1400 | 675M | 8.61 | 131.7 | 0.68 | 0.67 | 2.06 | 270.3 | 0.82 | 0.59
MDTv2 (Gao et al., 2023) | 1080 | 675M | - | - | - | - | 1.58 | 314.7 | 0.79 | 0.65
VA-VAE (Yao et al., 2025) | 80 | 675M | 4.29 | - | - | - | - | - | - | -
VA-VAE (Yao et al., 2025) | 800 | 675M | 2.17 | 205.6 | 0.77 | 0.65 | 1.35 | 295.3 | 0.79 | 0.65
REPA (Yu et al., 2025) | 80 | 675M | 7.90 | 122.6 | 0.70 | 0.65 | - | - | - | -
REPA (Yu et al., 2025) | 800 | 675M | 5.78 | 158.3 | 0.70 | 0.68 | 1.29 | 306.3 | 0.79 | 0.64
DDT (Wang et al., 2025c) | 80 | 675M | 6.62 | 135.2 | 0.69 | 0.67 | 1.52 | 263.7 | 0.78 | 0.63
DDT (Wang et al., 2025c) | 400 | 675M | 6.27 | 154.7 | 0.68 | 0.69 | 1.26 | 310.6 | 0.79 | 0.65
REPA-E (Leng et al., 2025) | 80 | 675M | 3.46 | 159.8 | 0.77 | 0.63 | 1.67 | 266.3 | 0.80 | 0.63
REPA-E (Leng et al., 2025) | 800 | 675M | 1.70 | 217.3 | 0.77 | 0.66 | 1.15 | 304.0 | 0.79 | 0.66
Latent Diffusion with RAE (Ours)
DiT-XL (DINOv2-S) | 800 | 676M | 1.87 | 209.7 | 0.80 | 0.63 | 1.41 | 309.4 | 0.80 | 0.63
DiTDH-XL (DINOv2-B) | 20 | 839M | 3.71 | 198.7 | 0.86 | 0.50 | - | - | - | -
DiTDH-XL (DINOv2-B) | 80 | 839M | 2.16 | 214.8 | 0.82 | 0.59 | - | - | - | -
DiTDH-XL (DINOv2-B) | 800 | 839M | 1.51 | 242.9 | 0.79 | 0.63 | 1.13 | 262.6 | 0.78 | 0.67

The results in Table 8 (above) present a comprehensive comparison of RAE-based DiTDH-XL against various autoregressive, pixel diffusion, and latent diffusion methods on ImageNet 256×256.

  • State-of-the-Art FID: DiTDH-XL (DINOv2-B) at 800 epochs achieves a gFID of 1.51 without guidance and 1.13 with guidance. These numbers are significantly better than all other methods reported. For instance, the closest competitors, REPA-E (1.70 w/o guidance, 1.15 w/ guidance) and VA-VAE (2.17 w/o guidance, 1.35 w/ guidance), trail behind.

  • Efficiency: Even at 80 epochs, DiTDH-XL (DINOv2-B) reaches a gFID of 2.16 (w/o guidance), which is already competitive with or better than many methods trained for much longer (e.g., VAR at 350 epochs with 1.92, VA-VAE at 800 epochs with 2.17). This indicates remarkable training efficiency.

  • IS, Precision, and Recall: The DiTDH-XL model also shows strong IS scores (242.9 w/o guidance) and competitive Precision and Recall scores, suggesting high-quality and diverse generations.

The following are the results from Table 7 of the original paper (Generation@512, with guidance):

| Method | gFID↓ | IS↑ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|
| BigGAN-deep (Brock et al., 2019) | 8.43 | 177.9 | 0.88 | 0.29 |
| StyleGAN-XL (Sauer et al., 2022) | 2.41 | 267.8 | 0.77 | 0.52 |
| VAR (Tian et al., 2024) | 2.63 | 303.2 | - | - |
| MAGVIT-v2 (Yu et al., 2024a) | 1.91 | 324.3 | - | - |
| xAR (Ren et al., 2025) | 1.70 | 281.5 | - | - |
| ADM | 3.85 | 221.7 | 0.84 | 0.53 |
| SiD2 | 1.50 | - | - | - |
| DiT | 3.04 | 240.8 | 0.84 | 0.54 |
| SiT | 2.62 | 252.2 | 0.84 | 0.57 |
| DiffiT (Hatamizadeh et al., 2024) | 2.67 | 252.1 | 0.83 | 0.55 |
| REPA | 2.08 | 274.6 | 0.83 | 0.58 |
| DDT | 1.28 | 305.1 | 0.80 | - |
| EDM2 (Karras et al., 2024) | 1.25 | - | - | - |
| DiTDH-XL (DINOv2-B) | 1.13 | 259.6 | 0.80 | 0.63 |

The results in Table 7 (above) compare DiTDH-XL on ImageNet $512 \times 512$ with guidance.

  • New SOTA at $512 \times 512$: DiTDH-XL (DINOv2-B) achieves a gFID of 1.13, surpassing the previous best, EDM2 at 1.25, by a notable margin. This confirms the method's effectiveness at higher resolutions.

As can be seen from the results in Figure 5, DiTDH models consistently achieve lower FID scores than DiT and VAE-based methods, even with fewer training GFLOPs. This indicates that DiTDH with RAE offers superior performance and computational efficiency across model scales.

Figure 5: (a) FID versus training GFLOPs for different DiT architectures; (b) RAE-based training converges faster than traditional VAE-based methods; (c) across model scales, DiT with RAE reaches lower FID than VAE-based counterparts (bubble area indicates model compute).

As can be seen from the results in Figure 4, DiT with RAE demonstrates significantly faster convergence and better FID performance compared to SiT or REPA. The DiT with RAE (DINOv2-B) achieves an FID of 2.39 after 720 epochs, a substantial improvement over SiT-XL and REPA-XL.

Figure 4: DiT w/ RAE reaches much faster convergence and better FID than SiT or REPA. The plot shows FID versus training epochs for SiT-XL, REPA-XL, and DiT-XL (RAE: DINOv2-B); relative to the SiT-XL and REPA-XL runs at 1400 epochs, the RAE-based model achieves speedups of 16× and 47× over these baselines.

The qualitative samples shown in Figure 7 demonstrate strong diversity, fine-grained detail, and high visual quality, consistent with the achieved state-of-the-art FID scores.

Figure 7: Qualitative samples from our model trained at $512 \times 512$ resolution with AutoGuidance. The RAE-based DiT demonstrates strong diversity, fine-grained detail, and high visual quality.

6.1.1. Unconditional Generation

The following are the results from Table 18 of the original paper:

| Method | gFID↓ | IS↑ |
|---|---|---|
| DiT-XL + VAE | 30.68 | 32.73 |
| DiTDH-XL + DINOv2-B (w/ AG) | 4.96 | 123.12 |
| RCG + DiT-XL | 4.89 | 143.2 |

The results in Table 18 (above) demonstrate that RAE-based DiTDH-XL also performs exceptionally well in unconditional generation. It achieves a gFID of 4.96 and an IS of 123.12, significantly better than DiT-XL + VAE (gFID 30.68, IS 32.73) and competitive with RCG + DiT-XL (gFID 4.89, IS 143.2), a method specifically designed for unconditional generation.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Encoder Choice and Noisy-Robust Decoding

The following are the results from Table 15 of the original paper:

(a) gFID and rFID of different encoders, reported as without / with noise-augmented decoding:

| Model | gFID (w/o / w/) | rFID (w/o / w/) |
|---|---|---|
| DINOv2-B | 4.81 / 4.28 | 0.49 / 0.57 |
| SigLIP2-B | 6.69 / 4.93 | 0.53 / 0.82 |
| MAE-B | 16.14 / 8.38 | 0.16 / 0.28 |

(b) gFID and rFID of different DINOv2 sizes, reported as without / with noise-augmented decoding:

| Model | gFID (w/o / w/) | rFID (w/o / w/) |
|---|---|---|
| DINOv2-S | 3.83 / 3.50 | 0.52 / 0.64 |
| DINOv2-B | 4.81 / 4.28 | 0.49 / 0.57 |
| DINOv2-L | 6.77 / 6.09 | 0.52 / 0.59 |

The results in Table 15a (above) show that DINOv2-B achieves the best generation performance (gFID) among the encoders, despite MAE-B having the lowest reconstruction rFID. This indicates that low rFID alone does not guarantee good generation quality, and DINOv2-B is therefore chosen as the default encoder. Tables 15a and 15b also demonstrate the effectiveness of noise-augmented decoding: for every encoder and DINOv2 size, adding noise during decoder training consistently improves gFID at the cost of a slight increase in rFID, supporting the idea that decoders must be robust to the noisy latent outputs of diffusion models. A minimal sketch of such a decoder training step follows.
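
As a rough illustration of noise-augmented decoding, the sketch below (PyTorch; the noise-scale distribution and the plain L1 objective are simplifying assumptions, since the actual decoder training also uses perceptual and adversarial losses) perturbs the frozen encoder's latents before computing the reconstruction loss:

```python
import torch
import torch.nn.functional as F

def decoder_step(encoder, decoder, optimizer, images, sigma_max=0.3):
    """One noise-augmented decoder update: perturb the frozen encoder's latents
    with Gaussian noise of a randomly drawn scale so the decoder learns to
    tolerate the imperfect latents a diffusion model will later produce.
    sigma_max and the L1 loss are placeholders, not the paper's exact recipe."""
    with torch.no_grad():
        z = encoder(images)                       # frozen representation encoder, (B, N, D)
    sigma = torch.rand(z.size(0), 1, 1, device=z.device) * sigma_max
    z_noisy = z + sigma * torch.randn_like(z)     # noise-augmented latents
    recon = decoder(z_noisy)                      # trainable decoder -> pixels
    loss = F.l1_loss(recon, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point is that only the decoder ever sees noisy latents; the encoder stays frozen, so the latent space used by the diffusion model is unchanged.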

6.2.2. Scaling RAE to High Resolutions

The following are the results from Table 9 of the original paper:

| Method | #Tokens | gFID↓ | rFID↓ |
|---|---|---|---|
| Direct | 1024 | 1.13 | 0.53 |
| Upsample | 256 | 1.61 | 0.97 |

The results in Table 9 (above) explore scaling to $512 \times 512$ resolution. The "Upsample" method, which diffuses over the 256-token latents of $256 \times 256$ inputs and relies on the decoder to upsample directly to $512 \times 512$ images, achieves a competitive gFID of 1.61 while being $4\times$ more efficient (256 vs. 1024 tokens) than direct training at $512 \times 512$ resolution (gFID 1.13). This highlights RAE's flexibility in handling high resolutions efficiently by decoupling the decoder from the diffusion process; the token-count arithmetic is sketched below.
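
The efficiency gap comes down to token counts; the small sketch below assumes a patch size of 16 only because it reproduces the 256- and 1024-token figures in the table:

```python
def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Patch tokens produced by a ViT-style encoder on a square image."""
    return (image_size // patch_size) ** 2

# Direct path: the diffusion transformer operates on latents of 512x512 inputs.
print(num_tokens(512))  # 1024 tokens
# Upsample path: diffuse over latents of 256x256 inputs, decode straight to 512x512 pixels.
print(num_tokens(256))  # 256 tokens, i.e. 4x fewer tokens for the diffusion model
```

Since self-attention cost grows quadratically with token count, the 4× reduction in tokens yields an even larger saving inside the transformer blocks.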

6.2.3. Does DiTDH Work Without RAE?

The following are the results from Table 10 of the original paper:

| gFID | VAE | DINOv2-B |
|---|---|---|
| DiT-XL | 7.13 | 4.28 |
| DiTDH-XL | 11.70 | 2.16 |

The results in Table 10 (above) compare DiT-XL and DiTDH-XL on SD-VAE latents versus DINOv2-B (RAE) latents. DiTDH-XL performs worse than DiT-XL on SD-VAE (11.70 vs. 7.13 gFID), despite using extra compute. This suggests that the DDT head offers little benefit in low-dimensional VAE latent spaces and its primary strength is in the high-dimensional diffusion tasks introduced by RAE.

6.2.4. Role of Structured Representation

The following are the results from Table 11 of the original paper:

| gFID | Pixel | DINOv2-B |
|---|---|---|
| DiT-XL | 51.09 | 4.28 |
| DiTDH-XL | 30.56 | 2.16 |

The results in Table 11 (above) compare DiT and DiTDH directly on raw pixels versus DINOv2-B (RAE) latents. Both models perform significantly worse on pixels (gFID 51.09 for DiT-XL, 30.56 for DiTDH-XL) than on RAE latents (gFID 4.28 for DiT-XL, 2.16 for DiTDH-XL). This confirms that high dimensionality alone is not sufficient; the structured representation provided by RAE is crucial for strong performance gains.

6.3. Scaling Results

As can be seen from the results in Figure 9, increasing the model's computational capacity leads DiTDH to converge faster and reach a lower final loss.

Figure 9: Training loss of $\mathrm{DiT^{DH}}$ models (S, B, L, XL) on DINOv2-B latents; curves are smoothed with an EMA coefficient of 0.9. Loss decreases steadily with training iterations, and larger models reach lower values.

6.4. FID Evaluation Remarks

The paper highlights an inconsistency in FID evaluation protocols across prior literature regarding sample generation:

  • Some works use class-balanced sampling (50 images per class for 1,000 classes, i.e., 50,000 samples).
  • Others draw the 50,000 class labels uniformly at random.

The authors observed that class-balanced sampling consistently yields approximately 0.1 lower FID scores. To ensure fair comparisons, they re-evaluated several methods using class-balanced sampling and updated the reported scores. This careful attention to evaluation protocols adds rigor to the comparisons.

The following are the results from Table 14 of the original paper:

Metrics marked (rand.) use uniform random label sampling and (bal.) use class-balanced sampling; (w/o) and (w/) denote Generation@256 without and with guidance.

| Method | Epochs | gFID↓ (rand., w/o) | IS↑ (rand., w/o) | gFID↓ (bal., w/o) | IS↑ (bal., w/o) | gFID↓ (rand., w/) | IS↑ (rand., w/) | gFID↓ (bal., w/) | IS↑ (bal., w/) |
|---|---|---|---|---|---|---|---|---|---|
| Autoregressive | | | | | | | | | |
| VAR (Tian et al., 2024) | 350 | - | - | 1.92 | 323.1 | - | - | 1.73 | 350.2 |
| MAR (Li et al., 2024b) | 800 | - | - | 2.35 | 227.8 | - | - | 1.55 | 303.7 |
| xAR-H (Ren et al., 2025) | 800 | - | - | - | - | - | - | 1.24 | 301.6 |
| Latent Diffusion with VAE | | | | | | | | | |
| SiT (Ma et al., 2024) | 1400 | 8.61 | 131.7 | 8.54 | 132.0 | 2.06 | 270.3 | 1.95 | 259.5 |
| REPA (Yu et al., 2025) | 800 | 5.90 | 157.8 | 5.78 | 158.3 | 1.42 | 305.7 | 1.29 | 306.3 |
| DDT (Wang et al., 2025c) | 400 | - | - | 6.27 | 154.7 | 1.40 | 303.6 | 1.26 | 310.6 |
| REPA-E (Leng et al., 2025) | 800 | 1.83 | 217.3 | 1.70 | 217.3 | 1.26 | 314.9 | 1.15 | 304.0 |
| Latent Diffusion with RAE (Ours) | | | | | | | | | |
| DiTDH-XL (DINOv2-B) | 800 | 1.60 | 242.7 | 1.51 | 242.9 | 1.28 | 262.9 | 1.13 | 262.6 |

The results in Table 14 (above) confirm that class-balanced sampling generally leads to slightly better gFID scores compared to random sampling across all methods, including DiTDH-XL. This highlights the importance of consistent evaluation protocols for fair comparisons in generative modeling.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work fundamentally challenges conventional wisdom in latent generative modeling by demonstrating the efficacy of pretrained representation encoders for both high-fidelity reconstruction and superior image generation. The authors introduce Representation Autoencoders (RAEs), which pair frozen semantic encoders (like DINO, SigLIP, MAE) with a lightweight, trained decoder. This approach yields semantically rich, high-dimensional latent spaces, overcoming the limitations of traditional VAEs (outdated backbones, low capacity, weak representations).

The paper rigorously addresses the challenges of operating Diffusion Transformers (DiTs) within these high-dimensional RAE latent spaces. It proposes and validates three key methodological advancements:

  1. DiT Width Matching: Scaling DiT's width to match or exceed the RAE token dimensionality, supported by both empirical evidence and theoretical proof, is crucial for effective learning.

  2. Dimension-Dependent Noise Schedule: Adapting the noise schedule based on the effective data dimension (token count times dimensionality) significantly improves training; a minimal sketch of one such timestep shift follows below.

  3. Noise-Augmented Decoding: Training the RAE decoder with added noise enhances its robustness and generalization to the imperfect, continuous latent outputs of diffusion models.

Furthermore, the paper introduces DiTDH, a DiT variant equipped with a shallow yet wide DDT head, which efficiently increases model capacity for denoising high-dimensional latents without incurring quadratic computational overhead.
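
For the second point above, one concrete way to realize a dimension-dependent schedule is the timestep-shifting rule used for resolution changes in rectified-flow models, with the shift factor set by the ratio of effective data dimensions. The sketch below assumes that specific form and a convention in which t = 1 is pure noise; the paper's exact schedule may differ in detail:

```python
import math

def shift_timestep(t: float, dim_new: int, dim_base: int = 32 * 32 * 4) -> float:
    """Shift a flow-matching timestep t in [0, 1] toward the high-noise end when
    the effective data dimension grows, using alpha = sqrt(dim_new / dim_base).
    dim_base defaults to a 32x32x4 VAE latent purely for illustration."""
    alpha = math.sqrt(dim_new / dim_base)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

# Example: RAE latents with 256 tokens of width 768 vs. the baseline VAE latent.
print(round(shift_timestep(0.5, dim_new=256 * 768), 3))  # ~0.874
```

The usual motivation is that the same per-coordinate noise level destroys less overall information in higher-dimensional data, so timesteps must be biased toward the high-noise end to keep the denoising task comparably hard.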

Empirically, the RAE-based DiTDH achieves state-of-the-art ImageNet generation results, with 1.51 FID at $256 \times 256$ (without guidance) and 1.13 FID at both $256 \times 256$ and $512 \times 512$ (with guidance). These results demonstrate substantially faster convergence and higher generative quality compared to previous VAE-based and representation-aligned methods. The work redefines autoencoding as a representation foundation, advocating for RAE latents as the new default for diffusion transformer training.

7.2. Limitations & Future Work

The authors implicitly or explicitly point out several limitations and suggest future directions:

  • Computational Cost of Scaling: While DiTDH addresses the quadratic cost for the DiT backbone, handling very high-resolution images by directly increasing the number of tokens (e.g., training RAE at $512 \times 512$ with 1024 tokens) remains more computationally expensive than using decoder upsampling. This suggests a trade-off between gFID and efficiency at extremely high resolutions.
  • Impact of Encoder Choice: Although DINOv2-B was found to be the best for generation, the varying performance across MAE-B, SigLIP2-B, and DINOv2-B highlights that not all representation encoders are equally suited for generative tasks, even if they have good reconstruction quality. Further understanding of what makes a representation encoder "generative-friendly" could be explored.
  • Role of Structured Representation: While the paper conclusively shows that structured representations from RAEs are crucial (as raw pixel diffusion performs poorly), a deeper theoretical understanding of why certain semantic structures are more beneficial for diffusion models could be investigated.
  • Generalizability of DiTDH: The DiTDH's benefit was shown to be specific to high-dimensional RAE latents, performing worse on low-dimensional VAE latents. This implies it's not a universally optimal architectural modification but rather context-dependent. Future work could explore how to adapt DDT heads to be beneficial across a wider range of latent space dimensionalities.
  • Guidance Mechanism: While AutoGuidance performs well, the paper notes that classifier-free guidance (CFG) without a guidance interval does not improve FID, and CFG with an interval requires careful tuning. Further research into more robust and universally effective guidance mechanisms for RAE-based DiTs would be valuable; a minimal sketch of interval-limited CFG follows this list.
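
A minimal, generic sketch of interval-limited CFG is given below; the model signature, interval bounds, and guidance scale are illustrative assumptions rather than the paper's settings:

```python
def guided_prediction(model, x_t, t, cond, uncond, scale=2.0, lo=0.1, hi=0.7):
    """Classifier-free guidance applied only inside the timestep interval [lo, hi];
    outside it the conditional prediction is returned unguided."""
    pred_cond = model(x_t, t, cond)
    if not (lo <= t <= hi):
        return pred_cond
    pred_uncond = model(x_t, t, uncond)
    return pred_uncond + scale * (pred_cond - pred_uncond)
```

Whether such guidance helps then depends on where the interval sits relative to the noise levels at which class information is decided, which is presumably why the paper finds it needs careful tuning.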

7.3. Personal Insights & Critique

This paper offers a highly impactful contribution by modernizing the autoencoder component in latent diffusion models. The core idea of leveraging powerful, frozen representation encoders is elegant and intuitive, effectively inheriting rich semantic knowledge into the generative pipeline. This paradigm shift from compression-centric VAEs to representation-centric RAEs is a significant step forward.

The paper's rigorous analysis of the challenges posed by high-dimensional latent spaces and its theoretically motivated solutions are particularly strong. The empirical validation of needing DiT width to match token dimensionality (Theorem 1) is a crucial insight that explains past difficulties and guides future architectural design. The noise-augmented decoding and dimension-dependent noise schedule are practical and effective techniques that address common pitfalls in diffusion training.

The introduction of DiTDH is also a clever architectural modification that resolves the scalability issue of Transformers in high-dimensional settings without incurring prohibitive costs. The achieved state-of-the-art results on ImageNet validate the effectiveness of the entire framework.

Potential Issues/Areas for Improvement:

  • Computational Cost of RAE Training: While the RAE decoder is lightweight, the initial training of the pretrained representation encoder itself requires massive computational resources. Although these encoders are frozen and pre-existing, the overall "cost of knowledge acquisition" for the RAE system is still very high, implicitly relying on the vast compute used for models like DINOv2. This is an inherent trade-off of leveraging large pretrained models, but worth noting for practitioners with limited resources.
  • Encoder Generalization: While freezing the encoder is efficient, it might limit the model's ability to adapt the latent space to specific downstream tasks or data distributions that significantly differ from the encoder's pretraining data. Exploring fine-tuning strategies for the encoder, even partially, could be a future direction, though it would add complexity.
  • Interpretability of High-Dimensional Latents: While RAE's latents are semantically rich, their high dimensionality might make them less interpretable than highly compressed VAE latents. Further work could explore methods for analyzing or visualizing these complex latent spaces to gain deeper insights into the learned representations.
  • Sensitivity to Hyperparameters: Diffusion models and GANs (used in RAE decoder training) are often sensitive to hyperparameters. While the paper provides detailed training recipes, the optimal values of $\omega_L$, $\omega_G$, and $\tau$ for noise-augmented decoding, as well as the guidance scales, may vary across datasets or model sizes.

Transferability & Applications: The RAE concept is highly transferable. This framework could be applied to:

  • Video Generation: RAEs could provide semantically rich latent representations for video frames, leading to higher quality and more coherent video generation in latent video diffusion models.

  • 3D Content Generation: Integrating RAEs with 3D encoders (e.g., for point clouds or NeRFs) could enable more semantically aware and high-fidelity 3D generative models.

  • Cross-Modal Generation: The use of multimodal encoders like SigLIP suggests RAEs could be powerful for text-to-image or text-to-video generation, where the rich semantic alignment could enhance the generation quality significantly.

  • Image Editing: The semantically meaningful latent space could facilitate more intuitive and controllable image editing applications, allowing users to manipulate high-level concepts rather than low-level pixels.

In conclusion, this paper delivers a robust and highly effective solution to a critical bottleneck in Diffusion Transformers, positioning RAEs as a strong foundation for the next generation of generative models.
