Diffusion Transformers with Representation Autoencoders
TL;DR Summary
This paper introduces Representation Autoencoders (RAEs) that replace traditional Variational Autoencoders (VAEs) with pretrained representation encoders, enhancing image generation quality in Diffusion Transformers (DiTs). RAEs achieve high-quality reconstructions and semantically rich latent spaces.
Abstract
Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Diffusion Transformers with Representation Autoencoders".
1.2. Authors
The authors are Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie, all affiliated with New York University.
1.3. Journal/Conference
The paper is currently published as a preprint on arXiv. While it doesn't specify an official journal or conference publication yet, arXiv is a highly respected platform for disseminating cutting-edge research in computer science, particularly in machine learning and artificial intelligence. The authors' affiliations with New York University suggest a strong academic background.
1.4. Publication Year
The paper was published on October 13, 2025 (timestamp 2025-10-13T17:51:39 UTC).
1.5. Abstract
The paper addresses the limitations of traditional Variational Autoencoders (VAEs) in Diffusion Transformers (DiTs), which often use outdated backbones, low-dimensional latent spaces, and weak reconstruction-based representations, thereby compromising generative quality. The authors propose replacing VAEs with what they term Representation Autoencoders (RAEs). RAEs utilize frozen pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders to achieve both high-quality reconstructions and semantically rich, typically high-dimensional, latent spaces. A key challenge of operating DiTs in these high-dimensional latent spaces is analyzed, and theoretically motivated solutions are proposed and empirically validated. Their approach leads to faster convergence without needing auxiliary representation alignment losses. Using a DiT variant with a lightweight, wide DDT head, their method achieves strong image generation results on ImageNet, including 1.51 FID at 256x256 (without guidance) and 1.13 FID at both 256x256 and 512x512 (with guidance). The authors conclude that RAEs offer clear advantages and should become the new default for diffusion transformer training.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2510.11690. The PDF link is https://arxiv.org/pdf/2510.11690v1.pdf. The paper is published as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The field of generative modeling has seen significant advancements with latent diffusion models (LDMs) and Diffusion Transformers (DiTs), which achieve high visual fidelity and efficiency by operating in a learned latent space rather than raw pixels. However, the autoencoder component responsible for defining this latent space has remained largely stagnant. Most DiTs still rely on the original VAE encoder, which presents several critical limitations:
- Outdated Backbones: The VAEs often use legacy convolutional designs, which can be computationally inefficient and compromise architectural simplicity compared to modern Transformer-based models.
- Low-Dimensional Latent Spaces: Traditional VAEs typically produce heavily compressed, low-dimensional latent spaces. While intended for efficiency, this can restrict information capacity, leading to latents that capture local appearance but lack the global semantic structure needed for generalization and high-quality generation.
- Weak Representations: VAEs are primarily trained with a reconstruction-only objective. This often results in latent representations that are weak or less semantically meaningful, ultimately limiting the generative quality of the downstream diffusion model.

Meanwhile, visual representation learning has rapidly evolved, with self-supervised and multimodal encoders (e.g., DINO, SigLIP, MAE) learning semantically rich features. However, latent diffusion has largely been isolated from these advances, continuing to diffuse in reconstruction-trained VAE spaces. Existing attempts to bridge this gap, such as REPA-style alignment, introduce complexity with extra training stages and auxiliary losses.
The core problem the paper aims to solve is to modernize and enhance the autoencoder component in Diffusion Transformers to overcome the limitations of traditional VAEs, thereby improving generative quality and efficiency. The paper's innovative idea is to directly integrate advanced, pretrained visual representation encoders into the latent diffusion pipeline.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Introduction of Representation Autoencoders (RAEs): The authors propose RAEs as a new class of autoencoders that replace the traditional VAE with a frozen pretrained representation encoder (e.g., DINO, SigLIP, MAE) combined with a lightweight, trained decoder. RAEs generate semantically rich and structurally coherent latent spaces while offering high-quality reconstructions.
- Challenging Existing Assumptions: The work demonstrates that pretrained semantic encoders, often believed to be unsuitable for faithful reconstruction or high-dimensional diffusion, can indeed produce superior reconstructions and be effectively used in high-dimensional latent diffusion.
- Theoretically Motivated Solutions for High-Dimensional Latents: The paper identifies and addresses the challenges of training Diffusion Transformers in RAE's high-dimensional latent spaces. It proposes three key solutions:
  - Matching DiT Width to Token Dimensionality: It is shown that the Diffusion Transformer's width must match or exceed the RAE's token dimension for effective generation, providing both empirical evidence and a theoretical justification (Theorem 1).
  - Dimension-Dependent Noise Schedule Shift: The paper generalizes resolution-based noise schedule shifts to account for the effective data dimension (number of tokens times their dimensionality), significantly improving performance in high-dimensional RAE spaces.
  - Noise-Augmented Decoder Training: To mitigate out-of-distribution issues when the diffusion model generates noisy latents, the RAE decoder is trained with additive noise, enhancing its generalization to continuous latent distributions.
- Introduction of DiTDH (Wide Diffusion Head): A new DiT variant is introduced, which augments the standard DiT architecture with a shallow yet wide DDT head. This allows for increased model width and better denoising capability without incurring quadratic computational costs, which is especially beneficial for high-dimensional RAE latents.
- State-of-the-Art Image Generation Results: The proposed RAE-based DiTDH model achieves strong image generation results on ImageNet, setting new state-of-the-art FID scores:
  - 1.51 FID at 256x256 (no guidance)
  - 1.13 FID at both 256x256 and 512x512 (with guidance).
  These results demonstrate significantly faster convergence and better generative quality compared to prior VAE-based and representation-aligned methods.
- Reframing Autoencoding: The work redefines autoencoding from merely a compression mechanism to a foundational representation, enabling more efficient training and effective generation for Diffusion Transformers.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Autoencoders (AE)
An Autoencoder (AE) is a type of artificial neural network used to learn efficient data codings (representations) in an unsupervised manner. The goal of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise".
It consists of two main parts:
- Encoder: This part compresses the input data into a latent-space representation (also called a latent vector or bottleneck).
- Decoder: This part reconstructs the input data from the latent-space representation.

The autoencoder is trained to minimize the difference between its input and its output (reconstruction loss).
3.1.2. Variational Autoencoders (VAEs)
Variational Autoencoders (VAEs) are a type of generative model that extend autoencoders by introducing a probabilistic approach to the latent space. Instead of mapping inputs directly to a fixed latent vector, a VAE's encoder maps inputs to the parameters of a probability distribution (typically a Gaussian distribution, defined by a mean and a variance) in the latent space. The decoder then samples from this distribution to reconstruct the input. This probabilistic formulation allows VAEs to generate new data samples by sampling from the learned latent distribution.
VAEs are trained with two losses:
- Reconstruction Loss: Measures how accurately the decoder reconstructs the input from the latent sample.
- Kullback-Leibler (KL) Divergence: Regularizes the latent distribution to be close to a prior distribution (e.g., a standard normal distribution), ensuring a well-structured and generative latent space.
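To make these two terms concrete, here is a minimal, illustrative PyTorch sketch of a VAE forward pass and its loss. The module name and layer sizes (`TinyVAE`, `in_dim`, `latent_dim`) are hypothetical and not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE: the encoder predicts a Gaussian over the latent space,
    and the decoder reconstructs from a sample of that Gaussian."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)   # outputs mean and log-variance
        self.dec = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    recon = F.mse_loss(x_hat, x, reduction="mean")                  # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL to N(0, I)
    return recon + beta * kl
```

The reparameterization trick (sampling via `mu + sigma * noise`) keeps the sampling step differentiable, so the encoder can be trained by backpropagation through the reconstruction loss.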
3.1.3. Diffusion Models
Diffusion Models are a class of generative models that learn to generate data by reversing a diffusion process. This process gradually adds noise to data until it becomes pure noise (forward process). The model is then trained to predict and remove this noise (reverse process), effectively learning to transform random noise into meaningful data.
Key concepts:
- Forward Diffusion Process: A fixed Markov chain that gradually adds Gaussian noise to an image over $T$ steps, producing a sequence of noisy samples $x_1, \dots, x_T$. As $t \to T$, $x_t$ approaches pure Gaussian noise.
- Reverse Diffusion Process: A learned Markov chain that starts from pure noise $x_T$ and iteratively denoises it over $T$ steps to generate a clean data sample $x_0$. This is where the neural network (e.g., a U-Net or Transformer) comes in, learning to predict the noise added at each step.
- Latent Diffusion Models (LDMs): Instead of diffusing in the pixel space, LDMs perform the diffusion process in a compressed latent space learned by an autoencoder (often a VAE). This significantly reduces computational costs and allows for high-resolution image generation.
3.1.4. Transformers
Transformers are a neural network architecture introduced in 2017, primarily used for sequence-to-sequence tasks, notably in natural language processing (NLP). Their key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element. This enables Transformers to capture long-range dependencies in data more effectively than recurrent neural networks.
Key components:
- Self-Attention: A mechanism that computes a weighted sum of all other elements in the input sequence, where the weights are learned based on the relevance of each element. The formula for
scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
  - $d_k$ is the dimension of the query and key vectors, used for scaling to prevent vanishing gradients.
  - $\mathrm{softmax}$ normalizes the attention scores (a code sketch follows this list).
- Multi-Head Attention: Allows the model to jointly attend to information from different representation subspaces at different positions.
- Feed-Forward Networks: Position-wise fully connected layers applied to each position independently.
- Positional Encoding: Adds information about the absolute or relative position of elements in the sequence, as
Transformers themselves are permutation-invariant.
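As a concrete illustration of the scaled dot-product attention formula above, here is a minimal PyTorch sketch; the tensor shapes are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for tensors shaped (batch, seq, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                 # attention weights sum to 1 per query
    return weights @ v

q = k = v = torch.randn(2, 256, 64)                 # e.g., 256 tokens with 64-dim heads
out = scaled_dot_product_attention(q, k, v)         # (2, 256, 64)
```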
3.1.5. Diffusion Transformers (DiTs)
Diffusion Transformers (DiTs) adapt the Transformer architecture to serve as the backbone for diffusion models, particularly latent diffusion models. Instead of the traditional U-Net architecture often used for diffusion models, DiTs process the latent representations (tokens) generated by an autoencoder using a Transformer network. This leverages the scalability and expressive power of Transformers for denoising tasks within the latent space, leading to impressive results in image generation.
3.1.6. Flow Matching
Flow Matching is a training objective for generative models that frames the denoising process as learning a continuous-time vector field (or "flow") that transports samples from a simple prior distribution (e.g., Gaussian noise) to the complex data distribution. Unlike diffusion models that often involve stochastic differential equations (SDEs), flow matching can often be trained to learn an ordinary differential equation (ODE) that directly maps noise to data, potentially leading to faster and more stable sampling during inference. The objective typically involves predicting the velocity of interpolated samples between noise and data.
3.1.7. Fréchet Inception Distance (FID)
Fréchet Inception Distance (FID) is a widely used metric to assess the quality of images generated by generative models. It measures the "distance" between the distribution of features extracted from real images and those extracted from generated images. A lower FID score indicates higher quality and diversity in the generated images, suggesting they are closer to the real image distribution. The features are typically extracted from an intermediate layer of a pre-trained Inception-v3 network.
3.2. Previous Works
The paper extensively references prior research in representation learning and generative modeling, highlighting the evolution and current state of the art.
3.2.1. Representation for Reconstruction
Prior work has explored enhancing VAEs with semantic representations:
-
VA-VAE (Yao et al., 2025): This method aligns VAE latents with a pretrained representation encoder. The paper contrasts this by noting that while VA-VAE improves reconstruction and generation, it still relies on heavily compressed, low-dimensional latents, limiting fidelity and representation quality. The proposed RAEs, in contrast, reconstruct directly from representation encoder features without compression.
- MAETok (Chen et al., 2025a), DC-AE 1.5 (Chen et al., 2025d), i-DEtok (Yang et al., 2025): These works incorporate MAE- or DAE-inspired objectives into VAE training. The paper notes that these methods also use heavily compressed latents, which RAEs aim to overcome by using frozen, powerful encoders.

The common belief that representation encoders are unsuitable for reconstruction because they "emphasize high-level semantics while downplaying low-level details" (Tang et al., 2025; Yu et al., 2024b) is challenged. The paper demonstrates that with a properly trained decoder, frozen representation encoders like DINOv2 and SigLIP2 can serve as strong encoders for the diffusion latent space, yielding reconstructions on par with or even better than SD-VAE.
3.2.2. Representation for Generation
Previous research also focuses on using semantic representations to improve generative modeling:
-
REPA (Yu et al., 2025): Accelerates DiT convergence by aligning its middle block with representation encoder features. The paper differentiates its approach by training diffusion models directly on representation encoder latents (RAE) rather than aligning with an external encoder, achieving faster convergence.
- DDT (Wang et al., 2025c): Further improves convergence by decoupling DiT into an encoder-decoder and applying the REPA loss to the encoder output. The current paper takes inspiration from DDT by introducing a DDT head, but applies it within the RAE framework with different design motivations.
- REG (Wu et al., 2025): Introduces a learnable token into the DiT sequence and explicitly aligns it with a representation encoder's representation.
- ReDi (Kouzelis et al., 2025b): Generates both VAE latents and PCA components of DINOv2 features within a diffusion model.

These methods often involve complex alignment procedures or additional training stages. The paper's approach aims for a more direct integration, training diffusion models directly on RAE latents, leading to faster convergence without auxiliary losses.
3.3. Technological Evolution
The field of generative modeling has evolved from early pixel-space models directly capturing image statistics to latent diffusion models that operate in a learned, compact representation space.
-
Early Models: Focused on generating images directly in pixel space, which is computationally expensive and struggles with high-resolution images.
- VAEs (Kingma & Welling, 2014): Introduced probabilistic latent spaces, allowing for generation and better data representation.
- GANs (Goodfellow et al., 2014): Achieved impressive photorealism through adversarial training, but often suffered from training instability and mode collapse.
- Diffusion Models (Ho et al., 2020; Dhariwal & Nichol, 2021): Emerged as powerful generative models, capable of high-quality image synthesis by iteratively denoising data.
- Latent Diffusion Models (Rombach et al., 2022): Combined diffusion models with autoencoders (typically VAEs) to perform diffusion in a compressed latent space, significantly improving efficiency and enabling high-resolution generation.
- Diffusion Transformers (DiT) (Peebles & Xie, 2023; Ma et al., 2024): Replaced the traditional U-Net backbone in diffusion models with Transformers, leveraging their scalability for improved performance.

Parallel to this, visual representation learning has seen a rapid transformation:

- Self-supervised Learning: Models like DINO (Oquab et al., 2023) and MAE (He et al., 2021) learn rich visual features from unlabeled data.
- Multimodal Learning: Models like CLIP and SigLIP (Radford et al., 2021; Tschannen et al., 2025) learn representations that bridge vision and language.

The current paper's work (RAE) fits within this timeline by aiming to bridge the gap between latent diffusion models (specifically DiTs) and the advances in visual representation learning. It seeks to upgrade the autoencoder component of LDMs by integrating modern, semantically rich representation encoders, thereby moving beyond the limitations of reconstruction-only VAEs.
3.4. Differentiation Analysis
The core differences and innovations of this paper's approach, RAE, compared to main methods in related work, can be summarized as follows:
- Direct Use of Frozen Pretrained Encoders: Unlike VA-VAE or REPA, which align VAE latents with external encoders or use auxiliary losses, RAE directly uses frozen pretrained representation encoders (e.g., DINOv2, SigLIP, MAE) as the encoder component. This means the latent space inherently possesses the rich semantic structure learned by these powerful models, without additional alignment training.
- No Aggressive Latent Compression: Traditional VAEs (like SD-VAE) rely on heavy channel-wise compression, leading to low-capacity latents. RAEs, by leveraging the features of modern Transformer-based encoders, can maintain higher-dimensional, richer latent spaces without explicit compression-driven objectives, challenging the belief that low dimensionality is always better for diffusion.
- Focus on Decoder for Reconstruction: The RAE framework demonstrates that even representation encoders optimized for semantics can achieve excellent reconstruction quality when paired with a properly trained lightweight decoder. This contradicts the long-standing assumption that semantic encoders are unsuited for faithful pixel-level reconstruction.
- Addressing High-Dimensional Latent Challenges Directly: The paper proactively tackles the perceived incompatibility of Diffusion Transformers with high-dimensional latent spaces. It proposes specific architectural and training adjustments (matching DiT width, dimension-dependent noise schedules, noise-augmented decoding) to make diffusion stable and efficient in these richer spaces, rather than avoiding them.
- Integrated Generative and Semantic Modeling: RAEs intrinsically link semantic modeling (via the frozen encoder) and generative modeling (via the trained decoder and DiT) through a shared latent representation. This is a more direct and arguably cleaner integration than methods that introduce REPA-style alignment losses.
- Enhanced Computational Efficiency: The RAE approach, particularly with the DiTDH variant, offers significant computational efficiency improvements. The RAE decoders are shown to be much more efficient than their SD-VAE counterparts, and DiTDH allows model width to scale effectively without quadratic cost increases, leading to faster convergence and state-of-the-art results with less compute.
4. Methodology
The core methodology of this work revolves around replacing the traditional VAE in Diffusion Transformers with a Representation Autoencoder (RAE) and then adapting the Diffusion Transformer to effectively operate within the RAE's high-dimensional latent space.
4.1. Representation Autoencoders (RAEs)
The central idea is to use frozen, pretrained representation encoders (which are typically Transformer-based and optimized for learning rich visual semantics) as the encoder component of an autoencoder, and then train a lightweight decoder to reconstruct the original image from these representation features.
4.1.1. RAE Architecture and Training
The RAE consists of two main parts:
-
Frozen Representation Encoder ($E$): This can be any powerful, pretrained visual encoder like DINOv2-B, SigLIP2-B, or MAE-B. These encoders take an input image $x \in \mathbb{R}^{H \times W \times 3}$ (where $H$, $W$ are height and width, and 3 is for RGB channels) and produce a sequence of $N$ tokens (or features) in a latent space, each with a channel dimension $c$. Specifically, if $p_e$ is the patch size of the encoder, then $N = HW / p_e^2$. The encoder is kept frozen during the RAE training process. Any [CLS] or [REG] tokens produced by the encoder are discarded, and only the patch tokens are used. A layer normalization is applied to each token independently to ensure zero mean and unit variance across channels.
- Trained ViT-based Decoder ($D$): A ViT (Vision Transformer) decoder is trained to map these latent tokens back to the pixel space. The decoder uses its own patch size $p_d$. For 256x256 images, the encoder typically produces 256 tokens, and the decoder reconstructs the image at the original resolution. A learnable [CLS] token is prepended to the decoder's input sequence, similar to MAE, but is discarded after decoding.
The RAE decoder is trained with a combination of L1 loss, LPIPS (Learned Perceptual Image Patch Similarity) loss, and adversarial losses. The loss function is defined as:

$ \mathcal{L}(x) = \mathcal{L}_{1}(x, \hat{x}) + \lambda_{\mathrm{LPIPS}} \mathcal{L}_{\mathrm{LPIPS}}(x, \hat{x}) + \lambda_{\mathrm{adv}} \lambda_{\mathrm{adapt}} \mathcal{L}_{\mathrm{adv}}(\hat{x}), \qquad \hat{x} = D(E(x)) $

Where:
- $x$: The input image.
- $E(x)$: The frozen representation encoder's output, representing the latent tokens $z$.
- $D(z)$: The decoder's reconstruction of the image, denoted as $\hat{x}$.
- $\mathcal{L}(x)$: The total reconstruction loss for an input image $x$.
- $\mathcal{L}_{\mathrm{LPIPS}}$: The Learned Perceptual Image Patch Similarity loss, which measures perceptual similarity between the reconstructed image $\hat{x}$ and the original image $x$. $\lambda_{\mathrm{LPIPS}}$ is its weight (set to 1).
- $\mathcal{L}_{1}$: The L1 pixel-wise loss, measuring the absolute difference between $x$ and $\hat{x}$.
- $\mathcal{L}_{\mathrm{adv}}$: The adversarial loss component, which encourages the reconstructed image to be indistinguishable from real images by a discriminator. $\lambda_{\mathrm{adv}}$ is its weight (set to 0.75).
- $\lambda_{\mathrm{adapt}}$: An adaptive weight for the GAN loss, which balances the scales of the reconstruction and adversarial losses (a code sketch of this objective follows Table 1 below).

The following are the results from Table 1 of the original paper:
(a) Encoder choice. All encoders outperform SD-VAE.

| Model | rFID |
|---|---|
| DINOv2-B | 0.49 |
| SigLIP2-B | 0.53 |
| MAE-B | 0.16 |
| SD-VAE | 0.62 |

(b) Decoder scaling. Larger decoders improve rFID while remaining much more efficient than VAEs.

| Decoder | rFID | GFLOPs |
|---|---|---|
| ViT-B | 0.58 | 22.2 |
| ViT-L | 0.50 | 78.1 |
| ViT-XL | 0.49 | 106.7 |
| SD-VAE | 0.62 | 310.4 |

(c) Encoder scaling. rFID is stable across RAE sizes.

| Encoder | rFID |
|---|---|
| DINOv2-S | 0.52 |
| DINOv2-B | 0.49 |
| DINOv2-L | 0.52 |

(d) Representation quality. RAEs have much higher linear probing accuracy than VAEs.

| Model | Top-1 Acc. |
|---|---|
| DINOv2-B | 84.5 |
| SigLIP2-B | 79.1 |
| MAE-B | 68.0 |
| SD-VAE | 8.0 |
The results in Table 1 (above) demonstrate that RAEs consistently achieve better reconstruction quality (rFID) and representation quality (linear probing accuracy) compared to SD-VAE, while being more computationally efficient. For example, RAE with MAE-B achieves an rFID of 0.16, significantly outperforming SD-VAE's 0.62. Additionally, ViT-XL decoder achieves an rFID of 0.49 with 106.7 GFLOPs, which is much more efficient than SD-VAE's 310.4 GFLOPs.
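To make the composite decoder objective above concrete, here is a hedged PyTorch sketch. The VQGAN-style gradient-ratio form of the adaptive weight, the `lpips` package, and the `discriminator` / `last_layer` arguments are assumptions for illustration; the paper specifies only the L1, LPIPS (weight 1), and adversarial (weight 0.75) terms with an adaptive balancing weight.

```python
import torch
import lpips  # pip install lpips (perceptual similarity package)

lpips_fn = lpips.LPIPS(net="vgg")   # L_LPIPS
l1_fn = torch.nn.L1Loss()

def adaptive_gan_weight(rec_loss, gan_loss, last_layer, delta=1e-4):
    """Assumed VQGAN-style balancing: ratio of gradient norms at the decoder's last layer."""
    g_rec = torch.autograd.grad(rec_loss, last_layer, retain_graph=True)[0]
    g_gan = torch.autograd.grad(gan_loss, last_layer, retain_graph=True)[0]
    return (g_rec.norm() / (g_gan.norm() + delta)).clamp(0.0, 1e4).detach()

def rae_decoder_loss(x, x_hat, discriminator, last_layer, w_lpips=1.0, w_adv=0.75):
    # x, x_hat: images in [-1, 1], shaped (B, 3, H, W); last_layer: a decoder parameter tensor
    rec = l1_fn(x_hat, x) + w_lpips * lpips_fn(x_hat, x).mean()
    adv = -discriminator(x_hat).mean()            # generator-side adversarial term (assumed form)
    lam = adaptive_gan_weight(rec, adv, last_layer)
    return rec + w_adv * lam * adv
```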
As can be seen from the results in Figure 8, all RAEs achieve satisfactory reconstruction fidelity.
Figure 8: Reconstruction examples. From left to right: the input image, reconstructions from RAEs based on DINOv2-B, SigLIP2-B, and MAE-B, and the SD-VAE result.
4.2. Taming Diffusion Transformers for RAE
The paper highlights that standard Diffusion Transformer (DiT) training recipes fail when applied directly to RAE's high-dimensional latent spaces. To address this, three theoretically motivated solutions are proposed.
4.2.1. Scaling DiT Width to Match Token Dimensionality
The first crucial insight is that for generation in RAE's latent space to succeed, the Diffusion Model's width must match or exceed the RAE's token dimension.
-
Problem: When DiTs are designed for compact SD-VAEs, they struggle with the increased dimensionality of RAE tokens. If the DiT's width is smaller than the RAE's token dimension, the model fails to learn effectively.
- Empirical Observation: Experiments on overfitting a single image showed that sample quality is poor when the DiT's hidden dimension $d$ is less than the RAE's token dimension $n$, but improves sharply once $d \geq n$. Increasing depth alone did not resolve the issue.

As can be seen from the results in Figure 3, increasing model width leads to lower loss and better sample quality, while changing model depth has a marginal effect on the overfitting results.
Figure 3: Two sub-plots studying how DiT width and depth affect overfitting a single sample. Left: the loss decreases as width increases, and models that are too narrow cannot overfit a single sample. Right: increasing the number of layers alone still does not enable overfitting. These results relate model capacity to generation quality.
Theoretical Justification (Theorem 1): The paper provides a theoretical lower bound for the training loss when the model's effective dimension is smaller than the data's dimension. Theorem 1. Assuming . Let , consider the function family where refers to a stack of standard DiT blocks whose width is smaller than the token dimension from the representation encoder, and
A , Bdenote the input and output linear projections, respectively. Then for any where are the eigenvalues of the covariance matrix of the random variable . Notably, when , contains the unique minimizer to .Proof Explanation:
- Objective: The goal is to minimize the expected squared difference between the model's prediction and the target , averaged over time . This target is the "velocity" needed to transform noisy data back to clean data .
- Function Family : This family represents
Diffusion Transformerswhose internal processing dimension () is less than the input data dimension (). and are linear projections that map the -dimensional input to -dimensions and back to -dimensions, respectively. represents the mainDiTblocks operating in the -dimensional space. - Dimensionality Constraint: The key constraint is . This means the
DiTis bottlenecked by a lower-dimensional internal representation. - Implication of Bottleneck: Due to the bottleneck, the output of is restricted to a subspace of with dimension at most . However, the target generally lies in the full -dimensional space.
- Lower Bound: The theorem states that if , there's an inherent irreducible error in approximating the target. This error is bounded below by the sum of the smallest
n-deigenvalues of the covariance matrix of . These eigenvalues capture the "variance" or "spread" in the dimensions that the model cannot represent due to its lower dimensionality. - Full Capacity: When , the model has enough capacity (or can be configured to have enough capacity) to represent the full -dimensional target, meaning the loss can theoretically reach zero (or the minimum possible error given the
flow matchingobjective).
-
In simpler terms: If your
Transformeris too "thin" (lowhidden dimension) to fully capture the "width" (token dimension ) of the data it's trying to process, it will inevitably miss some information, leading to a higher minimum possible error. This justifies scalingDiT's width to match or exceed theRAE's token dimension.The following are the results from Table 3 of the original paper:
| | DiT-S | DiT-B | DiT-L |
|---|---|---|---|
| DINOv2-S | 3.6e-2 ✓ | 1.0e-3 | 9.7e-4 |
| DINOv2-B | 5.2e-1 ✗ | 2.4e-2 ✓ | 1.3e-3 |
| DINOv2-L | 6.5e-1 ✗ | 2.7e-1 ✗ | 2.2e-2 ✓ |
The results in Table 3 (above) demonstrate that convergence (indicated by '✓') occurs only when the DiT model's width is at least as large as the RAE token dimension. Conversely, when the DiT model's width is smaller, the loss fails to converge (indicated by 'X').
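The intuition behind Theorem 1 can be checked numerically: the best $d$-dimensional linear bottleneck of an $n$-dimensional target cannot reduce the mean squared error below the sum of the $n-d$ smallest eigenvalues of the target's covariance. The NumPy sketch below uses synthetic data (not the paper's setup) to illustrate this floor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num = 768, 384, 10000            # token dim n, bottleneck width d < n

# Synthetic "velocity targets" with an anisotropic covariance.
A = rng.normal(size=(n, n)) / np.sqrt(n)
v = rng.normal(size=(num, n)) @ A.T

cov = np.cov(v, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))         # ascending
floor = eigvals[: n - d].sum()                     # sum of the n-d smallest eigenvalues

# The best d-dimensional linear bottleneck (PCA projection) cannot beat this floor.
mean = v.mean(axis=0)
_, _, Vt = np.linalg.svd(v - mean, full_matrices=False)
P = Vt[:d].T @ Vt[:d]                              # projector onto the top-d principal subspace
recon = (v - mean) @ P + mean
mse = ((v - recon) ** 2).sum(axis=1).mean()
print(f"lower bound {floor:.3f} vs best rank-{d} MSE {mse:.3f}")   # approximately equal
```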
4.2.2. Dimension-Dependent Noise Schedule Shift
- Problem: Prior noise scheduling strategies (e.g., resolution-based shifts) were derived for pixel-based or VAE-based inputs with few channels. These strategies did not account for the high dimensionality of RAE tokens, where the "effective resolution" per token increases with the number of channels, reducing information corruption at the same noise level.
- Solution: The paper generalizes existing resolution-dependent strategies to a dimension-dependent shift. This means the noise schedule is adjusted based on the effective data dimension, defined as the number of tokens multiplied by their dimensionality.
- Method: The shifting strategy from Esser et al. (2024) is adopted. For a schedule $t \in [0, 1]$ and input dimensions $n$ (base dimension, here 4096) and $m$ (effective data dimension of the RAE), the shifted timestep is defined as:
  $ t_m = \frac{\alpha t}{1 + (\alpha - 1) t}, \qquad \alpha = \sqrt{m / n} $
  where $\alpha$ is a dimension-dependent scaling factor. This adjustment helps to appropriately manage the noise levels for high-dimensional RAE latents, leading to significant performance gains (a code sketch follows Table 4 below).
The following are the results from Table 4 of the original paper:
| | gFID |
|---|---|
| w/o shift | 23.08 |
| w/ shift | 4.81 |
The results in Table 4 (above) show that applying the dimension-dependent noise schedule shift yields significant performance gains, reducing gFID from 23.08 to 4.81.
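A small sketch of the dimension-dependent timestep shift described above, assuming the $t \mapsto \alpha t / (1 + (\alpha - 1)t)$ form with $\alpha = \sqrt{m/n}$; the function name and example values are illustrative.

```python
import math

def shift_timestep(t, effective_dim, base_dim=4096):
    """Shift a timestep t in [0, 1] toward higher noise for higher-dimensional data.
    Assumed form following the resolution shift of Esser et al. (2024), with the
    resolution ratio replaced by the effective-dimension ratio."""
    alpha = math.sqrt(effective_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

# Example: an RAE with 256 tokens of 768 channels (e.g., DINOv2-B at 256x256).
t_shifted = shift_timestep(0.5, effective_dim=256 * 768)
print(t_shifted)   # about 0.87: the same nominal timestep maps to a noisier point
```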
4.2.3. Noise-Augmented Decoding
-
Problem: Unlike VAEs that map to continuous latent distributions (e.g., Gaussian), RAE decoders are trained to reconstruct from a discrete distribution of clean latent features from the encoder. However, diffusion models at inference time can generate noisy or slightly deviated latents. This mismatch can cause out-of-distribution (OOD) issues for the RAE decoder, degrading sampling quality.
- Solution: To make the RAE decoder robust to these noisy latents, its training is augmented by adding Gaussian noise to the clean latent representations before decoding. The decoder is then trained on this smoothed distribution $p_n(z)$.
- Method: The decoder is trained on noise-augmented latents $\tilde{z} = z + \sigma \epsilon$ (with $\epsilon \sim \mathcal{N}(0, I)$) instead of just $z$. The noise level $\sigma$ is made stochastic rather than fixed by sampling it during training. This helps regularize training and improves robustness to the continuous outputs of diffusion models (see the sketch after Table 5 below).

The following are the results from Table 5 of the original paper:
| | rFID | gFID |
|---|---|---|
| $z \sim p(z)$ | 0.49 | 4.81 |
| $z \sim p_n(z)$ | 0.57 | 4.28 |
The results in Table 5 (above) show that noise-augmented decoding improves gFID (from 4.81 to 4.28) but slightly worsens rFID (from 0.49 to 0.57). This trade-off is expected, as smoothing the latent distribution for better generalization might reduce exact reconstruction accuracy.
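A minimal sketch of noise-augmented decoder training, assuming simple additive Gaussian noise with a randomly drawn per-sample scale; `sigma_max` is a hypothetical value, and the exact distribution of the noise level in the paper is not reproduced here.

```python
import torch

def noise_augment(z, sigma_max=0.8):
    """Add Gaussian noise with a random per-sample scale to clean encoder latents
    before they are fed to the decoder. z is shaped (B, N, C)."""
    sigma = torch.rand(z.size(0), 1, 1, device=z.device) * sigma_max
    return z + sigma * torch.randn_like(z)

# Inside the decoder training loop (sketch):
# z = encoder(x).detach()                 # frozen representation encoder
# x_hat = decoder(noise_augment(z))
# loss = rae_decoder_loss(x, x_hat, ...)  # composite reconstruction + adversarial loss
```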
4.3. DiTDH: Improving Model Scalability with a Wide Diffusion Head
To overcome the computational expense of scaling the entire DiT backbone to handle higher-dimensional RAE latents, the paper introduces a new DiT variant called DiTDH (Diffusion Transformer with a Wide Diffusion Head). This design is inspired by DDT (Wang et al., 2025c).
4.3.1. Wide DDT Head Architecture
A DiTDH model consists of:
- Base DiT ($\mathrm{DiT}$): A standard Diffusion Transformer backbone.
- Additional Wide, Shallow Transformer Head ($\mathrm{Head}$): A lightweight Transformer module specifically dedicated to denoising. It is shallow (fewer layers) but wide (larger hidden dimension). The DDT head receives the output of the base DiT along with the noisy input and timestep information.

The combined model predicts the velocity as:
$ v(x_t, t, y) = \mathrm{Head}\big(x_t, \ \mathrm{DiT}(x_t, t, y), \ t\big) $
Where:
- $x_t$: The noisy input at timestep $t$.
- $t$: The current timestep.
- $y$: An optional class label for conditional generation.
- $\mathrm{DiT}$: The base DiT model, which processes the noisy input $x_t$ conditioned on $t$ and $y$, producing an intermediate representation.
- $\mathrm{Head}$: The DDT head model, which takes the original noisy input $x_t$, the intermediate representation from the base DiT, and the timestep $t$, to predict the final velocity $v(x_t, t, y)$.

Figure 6 illustrates how the wide DDT head connects to the base DiT within the Diffusion Transformer framework.
Figure 6: Diagram of the relationship between the wide DDT head and the Diffusion Transformer (DiT), showing how the input flows into the DiT module, how the DiT output connects to the DDT head, and how the final output is produced.
The DDT head allows the model to effectively increase its width (capacity to process high-dimensional information) without incurring the quadratic computational costs that would arise from scaling the entire base DiT (due to self-attention). This design is particularly effective for RAE's high-dimensional latent spaces.
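The sketch below illustrates one way such a shallow, wide head could be wired on top of a base DiT in PyTorch. It is a schematic interpretation, not the authors' implementation: the conditioning scheme (concatenation plus an additive time embedding) and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class WideDiffusionHead(nn.Module):
    """Shallow-but-wide transformer head that refines base-DiT features into a velocity."""
    def __init__(self, token_dim=768, head_dim=2048, depth=2, num_heads=16):
        super().__init__()
        self.proj_in = nn.Linear(2 * token_dim, head_dim)   # concat(noisy tokens, DiT features)
        layer = nn.TransformerEncoderLayer(head_dim, num_heads, dim_feedforward=4 * head_dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.time_mlp = nn.Sequential(nn.Linear(1, head_dim), nn.SiLU(), nn.Linear(head_dim, head_dim))
        self.proj_out = nn.Linear(head_dim, token_dim)       # back to the RAE token dimension

    def forward(self, x_t, h, t):
        # x_t: noisy latents (B, N, C); h: base-DiT features (B, N, C); t: timesteps (B,)
        tokens = self.proj_in(torch.cat([x_t, h], dim=-1))
        tokens = tokens + self.time_mlp(t[:, None].float())[:, None, :]   # broadcast over tokens
        return self.proj_out(self.blocks(tokens))

# v = head(x_t, base_dit(x_t, t, y), t)   # combined velocity prediction
```

Widening only a two-layer head keeps the extra cost small compared to widening every block of the backbone.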
The following are the results from Table 16 of the original paper:
| Depth | Width | GFLops | FID ↓ |
|---|---|---|---|
| 6 | 1152 (XL) | 25.65 | 2.36 |
| 4 | 2048 (G) | 53.14 | 2.31 |
| 2 | 2048 (G) | 26.78 | 2.16 |
The results in Table 16 (above) indicate that a wide and shallow DDT head is more effective for denoising. A 2-layer, 2048-dim (G) head outperforms deeper (4-layer, 2048-dim) and narrower (6-layer, 1152-dim XL) ones, even at similar GFLOPs.
The following are the results from Table 17 of the original paper:
| Encoder / head (depth-width) | 2-768 | 2-1536 | 2-2048 | 2-2688 |
|---|---|---|---|---|
| Dino-S | 2.66 | 2.47 | 2.42 | 2.43 |
| Dino-B | 2.49 | 2.24 | 2.16 | 2.22 |
| Dino-L | N/A | 2.95 | 2.73 | 2.64 |
The results in Table 17 (above) show that the optimal DDT head width increases with the RAE encoder size. Larger RAE encoders (e.g., DINOv2-L) benefit more from wider DDT heads, suggesting better utilization of the richer latent representations.
4.4. Flow-Based Models (Generative Process)
The paper adopts a flow matching objective for training the Diffusion Transformer. This involves a continuous-time formulation where samples are interpolated between clean data and Gaussian noise.
- Interpolation: A linear interpolation is used to generate noisy samples at any timestep $t$:
  $ x_t = (1 - t) \, x_0 + t \, \epsilon $
  Where:
  - $x_0$: A clean data sample drawn from the real data distribution $p(x_0)$.
  - $\epsilon$: Pure Gaussian noise drawn from $\mathcal{N}(0, I)$.
  - $t$: A timestep varying from 0 to 1. At $t = 0$, $x_t = x_0$ (clean data). At $t = 1$, $x_t = \epsilon$ (pure noise).
- Velocity Prediction: The model is trained to predict the velocity vector that would transport a sample from $x_t$ towards the clean data distribution. This velocity is formally defined as the conditional expectation of $\epsilon - x_0$ given $x_t$:
  $ v(x_t, t) = \mathbb{E}\left[ \epsilon - x_0 \mid x_t \right] $
- Training Objective: The model $v_\theta$ (the DiT or DiTDH network) is trained to minimize the squared difference between its prediction and the true velocity target:
  $ \mathcal{L}(\theta) = \mathbb{E}_{x_0, \epsilon, t} \left\| v_\theta(x_t, t) - (\epsilon - x_0) \right\|^2 $
  where $\mathbb{E}$ denotes expectation over $x_0$, $\epsilon$, and $t$.
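A compact sketch of a flow-matching training step under the linear interpolation above; the `model` signature, the sign convention of the velocity target, and the optional `dim_shift` hook are assumptions for illustration.

```python
import torch

def flow_matching_step(model, x0, y, dim_shift=None):
    """One flow-matching training step with x_t = (1 - t) x0 + t eps and target eps - x0."""
    b = x0.size(0)
    t = torch.rand(b, device=x0.device)               # t ~ U[0, 1]
    if dim_shift is not None:
        t = dim_shift(t)                              # dimension-dependent schedule shift
    eps = torch.randn_like(x0)
    t_ = t.view(b, *([1] * (x0.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * eps                    # linear interpolant
    v_target = eps - x0                               # d x_t / d t
    v_pred = model(x_t, t, y)
    return ((v_pred - v_target) ** 2).mean()
```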
4.5. Guidance
The paper explores two types of guidance mechanisms to improve sample quality, especially for conditional generation: AutoGuidance and Classifier-Free Guidance (CFG).
4.5.1. AutoGuidance
AutoGuidance (Karras et al., 2025) is the primary guidance method used. The core idea is to use a weaker, typically earlier checkpoint of the diffusion model itself, to guide a stronger diffusion model.
- Principle: Similar to
CFG, it leverages the intuition that a less capable model or an earlier-trained checkpoint can provide useful directional cues to a more capable model. Weaker models or early checkpoints often capture broader structures and make bolder predictions, which can help guide the generation process more effectively without being overly prescriptive. - Implementation: A smaller
DiTDH variant (e.g., DiTDH-S) or an earlier checkpoint of the main model is used as the "guidance model." This guidance model helps steer the sampling process of the main, more powerful DiTDH-XL model. - Benefits: Easier to tune than
CFG with a guidance interval, and generally delivers better performance.
4.5.2. Classifier-Free Guidance (CFG)
Classifier-Free Guidance (Ho & Salimans, 2022) is a common technique in diffusion models to improve the quality and adherence to conditions (e.g., class labels).
- Principle: It combines predictions from a conditional diffusion model (trained with class labels) and an unconditional diffusion model (trained without class labels) to exaggerate the effect of the condition.
- Equation for CFG: The guided velocity prediction is typically calculated as:
$
v_{\mathrm{guided}}(\mathbf{x}_t, t, y) = v_\theta(\mathbf{x}_t, t, \emptyset) + s \cdot \left( v_\theta(\mathbf{x}_t, t, y) - v_\theta(\mathbf{x}_t, t, \emptyset) \right)
$
Where:
- $v_\theta(\mathbf{x}_t, t, y)$: The velocity predicted by the model conditioned on class label $y$.
- $v_\theta(\mathbf{x}_t, t, \emptyset)$: The velocity predicted by the model conditioned on an empty (unconditional) label $\emptyset$.
- $s$: The guidance scale, a hyperparameter that controls the strength of the guidance. A higher $s$ pushes the generation more strongly towards the conditional input but can also lead to mode collapse or reduced diversity (a code sketch appears at the end of this section).
- Guidance Interval:
CFG can be applied with a Guidance Interval (Kynkäänniemi et al., 2024), where guidance is only applied during specific timesteps (intervals) of the sampling process. This can prevent over-guidance in early or late stages. - Observation in Paper: The paper notes that
CFG without an interval does not improve FID and can even increase it. While CFG with a Guidance Interval can achieve competitive FID after careful tuning, AutoGuidance generally performs better for the final model and has lower tuning overhead.
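For illustration, a minimal sketch of CFG for a velocity model with an optional guidance interval; the interval bounds and scale are placeholder values rather than the paper's settings, and AutoGuidance (not shown) would instead replace the unconditional branch with a weaker model's prediction.

```python
import torch

def guided_velocity(model, x_t, t, y, scale=1.5, interval=(0.0, 0.75)):
    """Classifier-free guidance for a velocity model, applied only inside a timestep interval."""
    v_cond = model(x_t, t, y)
    lo, hi = interval
    if not (lo <= float(t.mean()) <= hi):       # outside the interval: no guidance
        return v_cond
    v_uncond = model(x_t, t, None)              # None stands in for the empty label
    return v_uncond + scale * (v_cond - v_uncond)
```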
5. Experimental Setup
5.1. Datasets
The primary dataset used for both decoder training and diffusion model training is ImageNet-1K.
-
Source:
ImageNet-1K is a subset of the larger ImageNet dataset, consisting of 1,000 object categories.
Scale: Contains over 1.2 million training images and 50,000 validation images.
-
Characteristics and Domain: It is a large-scale dataset of natural images, covering a wide range of object categories (animals, vehicles, everyday objects, etc.). The images are diverse in content, style, and complexity, making it a standard benchmark for image generation and classification tasks.
-
Resolution: Most experiments are conducted at a resolution of . For resolution synthesis without decoder upsampling,
decodersanddiffusion modelsare trained directly on images. -
Dataset Balancing: The paper notes that the original
ImageNettraining set is inherently unbalanced, with class sizes ranging from approximately 732 to 1,300 samples. However, 895 classes contain exactly 1,300 samples, indicating a high degree of near-equivalence among most classes.An example image from ImageNet-1K could be:
This image (Figure 10 from the original paper) shows an example of a "golden retriever" from ImageNet-1K, which is a common class in the dataset.
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate the quality and diversity of generated images.
5.2.1. Fréchet Inception Distance (FID)
- Conceptual Definition:
FID measures the "distance" between the feature distributions of real and generated images. It quantifies how similar the generated images are to real images in terms of their perceptual quality and diversity. A lower FID score indicates that the generated images are more realistic and diverse, thus closer to the distribution of real images. - Mathematical Formula: $ \text{FID} = \lVert \mu_x - \mu_g \rVert^2 + \text{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}) $
- Symbol Explanation:
- $\mu_x$: The mean of the feature vectors for real images.
- $\mu_g$: The mean of the feature vectors for generated images.
- $\Sigma_x$: The covariance matrix of the feature vectors for real images.
- $\Sigma_g$: The covariance matrix of the feature vectors for generated images.
- $\text{Tr}(\cdot)$: The trace of a matrix.

The features are typically extracted from the penultimate layer of a pretrained Inception-v3 network.
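A compact NumPy/SciPy sketch of the FID formula above, operating on pre-extracted Inception features (feature extraction itself is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """FID between two feature sets, each shaped (num_samples, feat_dim)."""
    mu_x, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_x = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_x @ cov_g)
    if np.iscomplexobj(covmean):           # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(((mu_x - mu_g) ** 2).sum() + np.trace(cov_x + cov_g - 2 * covmean))
```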
5.2.2. Inception Score (IS)
- Conceptual Definition:
IS measures both the sharpness (quality) and diversity of generated images. It relies on a pretrained Inception-v3 network to classify generated images. A high IS indicates that generated images are both clearly recognizable as specific objects (low entropy of conditional class probabilities, meaning high sharpness) and that there is a wide variety of generated objects (high entropy of marginal class probabilities, meaning high diversity). - Mathematical Formula: $ \text{IS} = e^{\mathbb{E}_x [D_{KL}(p(y|x) \Vert p(y))]} $
- Symbol Explanation:
  - $\mathbb{E}_x$: Expectation over generated image samples $x$.
  - $p(y|x)$: The conditional class distribution (softmax output) predicted by an Inception model for a generated image $x$.
  - $p(y)$: The marginal class distribution, calculated by averaging $p(y|x)$ over all generated samples.
  - $D_{KL}$: The Kullback-Leibler (KL) divergence.
5.2.3. Precision and Recall
- Conceptual Definition: These metrics, in the context of
generative models, quantify how well the generated data distribution covers the true data distribution (recall) and how much of the generated data is realistic (precision).
  - Precision: Reflects the fraction of generated images that appear realistic or belong to the manifold of real data. High precision means fewer "junk" samples.
  - Recall: Reflects the portion of the training data manifold covered by generated samples. High recall means the model can generate a wide variety of real-like images, not just a few modes.
- Mathematical Formulas: (These are derived from nearest neighbor distances in feature space, typically using
Inception features. The paper does not provide explicit formulas, but refers to Kynkäänniemi et al., 2019.) Typically, these are computed by embedding real and generated images into a feature space (e.g., Inception features) and then finding nearest neighbors. - For
Precision: Calculate for each generated image, its distance to the nearest real image. If this distance is below a threshold, the generated image is considered "realistic." Precision is the proportion of realistic generated images. - For
Recall: Calculate for each real image, its distance to the nearest generated image. If this distance is below a threshold, the real image's mode is considered "covered." Recall is the proportion of covered real images.
- For
- Symbol Explanation: While specific symbols are not provided in the paper's abstract, the underlying concepts involve distances in a feature space and counting ratios based on proximity thresholds.
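Since the paper leaves the formulas implicit, here is a brute-force sketch of the k-nearest-neighbor precision/recall estimator in the spirit of Kynkäänniemi et al. (2019); the choice of k and the O(N²) distance computation are simplifications for illustration only.

```python
import numpy as np

def knn_radius(feats, k=3):
    """Distance from each point to its k-th nearest neighbor within the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]          # column 0 is the point itself (distance 0)

def precision_recall(real, gen, k=3):
    """Improved precision/recall sketch; feasible only for small feature sets."""
    r_real, r_gen = knn_radius(real, k), knn_radius(gen, k)
    d_gr = np.linalg.norm(gen[:, None, :] - real[None, :, :], axis=-1)   # gen-to-real distances
    precision = float((d_gr <= r_real[None, :]).any(axis=1).mean())      # gen points inside real manifold
    recall = float((d_gr.T <= r_gen[None, :]).any(axis=1).mean())        # real points inside gen manifold
    return precision, recall
```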
5.3. Baselines
The paper compares its method against a wide range of state-of-the-art generative models across different paradigms:
5.3.1. Autoregressive Models
These models generate images pixel by pixel or token by token, sequentially.
- VAR (Tian et al., 2024): Visual Autoregressive modeling.
- MAR (Li et al., 2024b): Autoregressive image generation without vector quantization.
- xAR (Ren et al., 2025): Next-x prediction for autoregressive visual generation.
5.3.2. Pixel Diffusion Models
These are diffusion models that operate directly in the pixel space.
- ADM (Dhariwal & Nichol, 2021): Improved
denoising diffusion probabilistic models. - RIN (Jabri et al., 2023): Scalable adaptive computation for iterative generation.
- PixelFlow (Chen et al., 2025e): Pixel-space
generative modelswith flow. - PixNerd (Wang et al., 2025b): Pixel neural field diffusion.
- SiD2 (Hoogeboom et al., 2025): Simpler diffusion (sid2): 1.5
FIDon ImageNet512 withpixel-space diffusion.
5.3.3. Latent Diffusion with VAE
These are latent diffusion models that use traditional VAEs to define their latent space.
-
DiT (Peebles & Xie, 2023): Scalable diffusion models with Transformers.
- MaskDiT (Zheng et al.): (Full reference not provided in the bibliography, but context suggests a Masked Diffusion Transformer variant.)
- SiT (Ma et al., 2024): Exploring flow- and diffusion-based generative models with scalable interpolant transformers.
- MDTv2 (Gao et al., 2023): Masked diffusion transformer is a strong image synthesizer.
- VA-VAE (Yao et al., 2025): Reconstruction vs. generation: Taming the optimization dilemma in latent diffusion models.
- REPA (Yu et al., 2025): Representation alignment for generation: Training diffusion transformers is easier than you think.
- DDT (Wang et al., 2025c): Decoupled diffusion transformer.
- REPA-E (Leng et al., 2025): Unlocking VAE for end-to-end tuning with latent diffusion transformers.

These baselines are representative because they cover the major paradigms in image generation (autoregressive, pixel diffusion, latent diffusion) and include state-of-the-art methods within each category, especially those focusing on Diffusion Transformers and methods that attempt to improve latent space quality. Comparing against these diverse baselines allows the paper to demonstrate the superiority of RAE-based DiTs across different computational costs and architectural choices.
5.4. Training Details
- Flow Matching Objective: Adopted with linear interpolation $x_t = (1 - t) x_0 + t \epsilon$.
- Model Backbone: LightningDiT (Yao et al., 2025), a variant of DiT (Peebles & Xie, 2023).
- Patch Size: 1 for all RAE-based models (resulting in 256 tokens for 256x256 images). For VAE and pixel inputs, patch sizes of 2 and 16 are used, respectively. The computational cost for the DiT backbone remains similar across these settings due to the fixed token length.
- Timestep Input: Continuous-time formulation with input values in [0, 1]. A Gaussian Fourier embedding layer replaces the standard timestep embedding.
- Positional Embeddings: Absolute Positional Embeddings (APE) are added in addition to RoPE (Rotary Positional Embeddings), though their impact was not significant.
- Optimization (DiT): AdamW optimizer, constant learning rate, batch size of 1024, EMA weight of 0.9999.
- Optimization (DiTDH): Linear learning rate decay with a constant warmup of 40 epochs. EMA weight changed to 0.9995. Gradient clipping of 1.0 is used.
- Sampling: Standard ODE sampling with an Euler sampler and 50 steps by default.
- Computation: PyTorch/XLA on TPU for RAE training and inference. Evaluation uses one v6e-8 for 50k samples.
5.5. FID Evaluation Protocol
- Sample Generation: For conditional
FID evaluation, 50 images are sampled from each class for a total of 50,000 images (class-balanced sampling). The paper notes that some prior works used uniform random sampling across 1,000 class labels, which can yield slightly different scores (class-balanced sampling gives roughly 0.1 lower FID). - Reference Statistics: Taken from ADM pre-computed statistics (Dhariwal & Nichol, 2021) over the full ImageNet dataset. - Re-evaluation of Baselines: To ensure fair comparison, several recent methods with accessible checkpoints are re-evaluated using class-balanced sampling and their reported scores updated.
6. Results & Analysis
6.1. Core Results Analysis
The paper demonstrates that RAE-based Diffusion Transformers achieve state-of-the-art performance, significantly outperforming prior methods in terms of FID and convergence speed.
The following are the results from Table 8 of the original paper:
| Method | Epochs | #Params | Generation@256 w/o guidance | Generation@256 w/ guidance | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| gFID↓ | IS↑ | Prec.↑ | Rec.↑ | gFID↓ | IS↑ | Prec.↑ | Rec.↑ | ||||
| Autoregressive | |||||||||||
| VAR (Tian et al., 2024) | 350 | 2.0B | 1.92 | 323.1 | 0.82 | 0.59 | 1.73 | 350.2 | 0.82 | 0.60 | |
| MAR (Li et al., 2024b) | 800 | 943M | 2.35 | 227.8 | 0.79 | 0.62 | 1.55 | 303.7 | 0.81 | 0.62 | |
| xAR (Ren et al., 2025) | 800 | 1.1B | - | - | - | - | 1.24 | 301.6 | 0.83 | 0.64 | |
| Pixel Diffusion | |||||||||||
| ADM (Dhariwal & Nichol, 2021) | 400 | 554M | 10.94 | 101.0 | 0.69 | 0.63 | 3.94 | 215.8 | 0.83 | 0.53 | |
| RIN (Jabri et al., 2023) | 480 | 410M | 3.42 | 182.0 | - | - | - | - | - | - | |
| PixelFlow (Chen et al., 2025e) | 320 | 677M | - | - | - | - | 1.98 | 282.1 | 0.81 | 0.60 | |
| PixNerd (Wang et al., 2025b) | 160 | 700M | - | - | 2.15 | 297.0 | 0.79 | 0.59 | |||
| SiD2 (Hoogeboom et al., 2025) | 1280 | - | - | - | - | - | 1.38 | - | - | - | |
| Latent Diffusion with VAE | |||||||||||
| DiT (Peebles & Xie, 2023) | 1400 | 675M | 9.62 | 121.5 | 0.67 | 0.67 | 2.27 | 278.2 | 0.83 | 0.57 | |
| MaskDiT (Zheng et al.) | 1600 | 675M | 5.69 | 177.9 | 0.74 | 0.60 | 2.28 | 276.6 | 0.80 | 0.61 | |
| SiT (Ma et al., 2024) | 1400 | 675M | 8.61 | 131.7 | 0.68 | 0.67 | 2.06 | 270.3 | 0.82 | 0.59 | |
| MDTv2 (Gao et al., 2023) | 1080 | 675M | - | - | - | - | 1.58 | 314.7 | 0.79 | 0.65 | |
| VA-VAE (Yao et al., 2025) | 80 | 675M | 4.29 | - | - | - | - | - | - | - | |
| 800 | 2.17 | 205.6 | 0.77 | 0.65 | 1.35 | 295.3 | 0.79 | 0.65 | |||
| REPA (Yu et al., 2025) | 80 | 675M | 7.90 | 122.6 | 0.70 | 0.65 | - | - | - | - | |
| 800 | 5.78 | 158.3 | 0.70 | 0.68 | 1.29 | 306.3 | 0.79 | 0.64 | |||
| DDT (Wang et al., 2025c) | 80 | 675M | 6.62 | 135.2 | 0.69 | 0.67 | 1.52 | 263.7 | 0.78 | 0.63 | |
| 400 | 6.27 | 154.7 | 0.68 | 0.69 | 1.26 | 310.6 | 0.79 | 0.65 | |||
| REPA-E (Leng et al., 2025) | 80 | 675M | 3.46 | 159.8 | 0.77 | 0.63 | 1.67 | 266.3 | 0.80 | 0.63 | |
| 800 | 1.70 | 217.3 | 0.77 | 0.66 | 1.15 | 304.0 | 0.79 | 0.66 | |||
| Latent Diffusion with RAE (Ours) | |||||||||||
| DiT-XL (DINOv2-S) | 800 | 676M | 1.87 | 209.7 | 0.80 | 0.63 | 1.41 | 309.4 | 0.80 | 0.63 | |
| DiTDH-XL (DINOv2-B) | 20 | | 3.71 | 198.7 | 0.86 | 0.50 | − | − | − | − | |
| 80 | 839M | 2.16 | 214.8 | 0.82 | 0.59 | ||||||
| 800 | 1.51 | 242.9 | 0.79 | 0.63 | 1.13 | 262.6 | 0.78 | 0.67 | |||
The results in Table 8 (above) present a comprehensive comparison of RAE-based DiTDH-XL against various autoregressive, pixel diffusion, and latent diffusion methods on ImageNet 256x256.
-
State-of-the-Art FID:
DiTDH-XL (DINOv2-B) with 800 epochs achieves a gFID of 1.51 without guidance and 1.13 with guidance. These numbers are significantly better than all other methods reported. For instance, the closest competitors, REPA-E and VA-VAE, achieve 1.70 (w/o guidance) and 1.15 (w/ guidance), and 2.17 (w/o guidance) and 1.35 (w/ guidance), respectively.
Efficiency: Even at 80 epochs,
DiTDH-XL (DINOv2-B) reaches a gFID of 2.16 (w/o guidance), which is already competitive with or better than many methods trained for much longer (e.g., VAR at 350 epochs with 1.92, VA-VAE at 800 epochs with 2.17). This indicates remarkable training efficiency.
IS, Precision, and Recall: The
DiTDH-XL model also shows strong IS scores (242.9 w/o guidance) and competitive Precision and Recall scores, suggesting high-quality and diverse generations.

The following are the results from Table 7 of the original paper:
Generation@512 (with guidance):

| Method | gFID↓ | IS↑ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|
| BigGAN-deep (Brock et al., 2019) | 8.43 | 177.9 | 0.88 | 0.29 |
| StyleGAN-XL (Sauer et al., 2022) | 2.41 | 267.8 | 0.52 | 0.77 |
| VAR (Tian et al., 2024) | 2.63 | 303.2 | - | - |
| MAGVIT-v2 (Yu et al., 2024a) | 1.91 | 324.3 | - | - |
| xAR (Ren et al., 2025) | 1.70 | 281.5 | - | - |
| ADM | 3.85 | 221.7 | 0.84 | 0.53 |
| SiD2 | 1.50 | - | - | - |
| DiT | - | - | 0.84 | 0.54 |
| SiT | 3.04 | 240.8 | - | 0.57 |
| DiffiT (Hatamizadeh et al., 2024) | 2.62 | 252.2 | 0.84 | 0.55 |
| REPA | 2.67 | 252.1 | 0.83 | - |
| DDT | 2.08 | 274.6 | 0.83 | 0.58 |
| EDM2 (Karras et al., 2024) | 1.28 | 305.1 | 0.80 | - |
| | 1.25 | - | - | - |
| DiTDH-XL (DINOv2-B) | 1.13 | 259.6 | 0.80 | 0.63 |
The results in Table 7 (above) compare DiTDH-XL on ImageNet 512x512 with guidance.
-
New SOTA at 512x512:
DiTDH-XL (DINOv2-B) achieves a gFID of 1.13, surpassing the previous best performance of EDM2 (1.25) by a notable margin. This confirms the method's effectiveness at higher resolutions.

As can be seen from the results in Figure 5, DiTDH models consistently achieve lower FID scores than DiT and VAE-based methods, even with fewer GFLOPs. This indicates that DiTDH with RAE offers superior performance and computational efficiency across various model scales.
Figure 5: FID comparison of Diffusion Transformers (DiT) with Representation Autoencoders (RAE) on ImageNet. Panel (a) plots FID against training GFLOPs for different DiT architectures; panel (b) highlights that RAE converges faster than traditional VAE-based methods; panel (c) uses a bubble chart to show that, across model scales, DiT with RAE achieves better FID than VAE-based methods, with bubble area indicating model compute.
As can be seen from the results in Figure 4, DiT with RAE demonstrates significantly faster convergence and better FID performance compared to SiT or REPA. The DiT with RAE (DINOv2-B) achieves an FID of 2.39 after 720 epochs, a substantial improvement over SiT-XL and REPA-XL.
Figure 4: Line chart of FID versus training epochs for SiT-XL, REPA-XL, and DiT-XL. DiT-XL (RAE: DINOv2-B) converges dramatically faster than SiT-XL and REPA-XL (trained up to 1400 epochs), with efficiency gains of 16x and 47x, respectively.
The qualitative samples shown in Figure 7 demonstrate strong diversity, fine-grained detail, and high visual quality, consistent with the achieved state-of-the-art FID scores.
Figure 7: Qualitative samples from the model, trained at 512x512 resolution and sampled with AutoGuidance. The RAE-based DiT exhibits strong diversity, fine-grained detail, and high visual quality.
6.1.1. Unconditional Generation
The following are the results from Table 18 of the original paper:
| Method | gFID ↓ | IS ↑ |
|---|---|---|
| DiT-XL + VAE | 30.68 | 32.73 |
| DiTDH-XL + DINOv2-B (w/ AG) | 4.96 | 123.12 |
| RCG + DiT-XL | 4.89 | 143.2 |
The results in Table 18 (above) demonstrate that RAE-based DiTDH-XL also performs exceptionally well in unconditional generation. It achieves a gFID of 4.96 and an IS of 123.12, which is significantly better than DiT-XL + VAE (gFID 30.68, IS 32.73) and competitive with RCG + DiT-XL (gFID 4.89, IS 143.2), a method specifically designed for unconditional generation.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Encoder Choice and Noisy-Robust Decoding
The following are the results from Table 15 of the original paper:
(a) gFID and rFID of different encoders with and without noisy-robust decoding (reported as w/o / w/).

| Model | gFID | rFID |
|---|---|---|
| DINOv2-B | 4.81 / 4.28 | 0.49 / 0.57 |
| SigLIP2-B | 6.69 / 4.93 | 0.53 / 0.82 |
| MAE-B | 16.14 / 8.38 | 0.16 / 0.28 |

(b) gFID and rFID of different DINOv2 sizes with and without noisy-robust decoding (reported as w/o / w/).

| Model | gFID | rFID |
|---|---|---|
| DINOv2-S | 3.83 / 3.50 | 0.52 / 0.64 |
| DINOv2-B | 4.81 / 4.28 | 0.49 / 0.57 |
| DINOv2-L | 6.77 / 6.09 | 0.52 / 0.59 |
The results in Table 15a (above) show that DINOv2-B achieves the best overall generation performance (gFID), despite MAE-B having the lowest reconstruction rFID. This indicates that low rFID alone does not guarantee good generation quality. DINOv2-B is chosen as the default encoder.
Table 15a and 15b also demonstrate the effectiveness of noise-augmented decoding. For all encoders and DINOv2 sizes, adding noise during decoder training (w/ noisy-robust decoding) consistently improves gFID at the cost of a slight increase in rFID. This supports the idea that decoders need to be robust to noisy latent outputs from diffusion models.
6.2.2. Scaling RAE to High Resolutions
The following are the results from Table 9 of the original paper:
| Method | #Tokens | gFID ↓ | rFID |
|---|---|---|---|
| Direct | 1024 | 1.13 | 0.53 |
| Upsample | 256 | 1.61 | 0.97 |
The results in Table 9 (above) explore scaling to 512x512 resolution. The "Upsample" method, which uses a decoder to upsample 256-resolution latents (256 tokens) to 512x512 images, achieves a competitive gFID of 1.61 while being more efficient than direct training at 512x512 resolution (1024 tokens, gFID 1.13). This highlights RAE's flexibility in handling high resolutions efficiently by decoupling the decoder from the diffusion process.
6.2.3. Does DiTDH Work Without RAE?
The following are the results from Table 10 of the original paper:
| | VAE | DINOv2-B |
|---|---|---|
| DiT-XL | 7.13 | 4.28 |
| DiTDH-XL | 11.70 | 2.16 |
The results in Table 10 (above) compare DiT-XL and DiTDH-XL on SD-VAE latents versus DINOv2-B (RAE) latents. DiTDH-XL performs worse than DiT-XL on SD-VAE (11.70 vs. 7.13 gFID), despite using extra compute. This suggests that the DDT head offers little benefit in low-dimensional VAE latent spaces and its primary strength is in the high-dimensional diffusion tasks introduced by RAE.
6.2.4. Role of Structured Representation
The following are the results from Table 11 of the original paper:
| gFID ↓ | Pixel | DINOv2-B |
|---|---|---|
| DiT-XL | 51.09 | 4.28 |
| DiTDH-XL | 30.56 | 2.16 |
The results in Table 11 (above) compare DiT and DiTDH directly on raw pixels versus DINOv2-B (RAE) latents. Both models perform significantly worse on pixels (gFID 51.09 for DiT-XL, 30.56 for DiTDH-XL) than on RAE latents (gFID 4.28 for DiT-XL, 2.16 for DiTDH-XL). This confirms that high dimensionality alone is not sufficient; the structured representation provided by RAE is crucial for strong performance gains.
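To illustrate what the two diffusion targets look like in practice, the sketch below contrasts flattened raw-pixel patches with patch-token features from a frozen DINOv2-B. The torch.hub entry point is DINOv2's public one, but the feature-dictionary key and preprocessing details (e.g., ImageNet normalization) are assumptions that should be checked against the DINOv2 repository.

```python
import torch

def pixel_patch_tokens(images, patch_size=16):
    """Raw-pixel targets: split images into flattened patches [B, N, p*p*3]."""
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)

@torch.no_grad()
def rae_latent_tokens(images):
    """RAE-style targets: patch-token features from a frozen DINOv2-B.

    Assumes `images` are already resized/normalized for DINOv2 (e.g. 224x224,
    ImageNet statistics); the feature-dict key follows the DINOv2 reference
    code but should be double-checked against the repo.
    """
    encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
    feats = encoder.forward_features(images)          # images: [B, 3, 224, 224]
    return feats["x_norm_patchtokens"]                # [B, 256, 768]
```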
6.3. Scaling Results
As can be seen from the results in Figure 9, increasing the model's computational capacity leads DiTDH to converge faster and reach a lower final loss.
Figure 9: Training-loss curves for models of different sizes, showing each model's training loss decreasing steadily as the number of iterations grows.
6.4. FID Evaluation Remarks
The paper highlights an inconsistency in FID evaluation protocols across prior literature regarding sample generation:
- Some works use class-balanced sampling (50 images per class for 1,000 classes = 50,000 samples).
- Others use uniform random sampling of class labels, drawn 50,000 times.

The authors observed that class-balanced sampling consistently yields FID scores approximately 0.1 lower. To ensure fair comparisons, they re-evaluated several methods using class-balanced sampling and updated the reported scores. This careful attention to evaluation protocol adds rigor to their comparisons.
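The two protocols are easy to pin down in code. Below is a small sketch, assuming 1,000 ImageNet classes and 50,000 generated samples; the function names are ours, not the paper's evaluation code.

```python
import numpy as np

def balanced_labels(num_classes=1000, per_class=50, seed=0):
    """Class-balanced protocol: exactly `per_class` samples for every class."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(num_classes), per_class)   # 50,000 labels total
    rng.shuffle(labels)                                      # randomize generation order
    return labels

def random_labels(num_classes=1000, num_samples=50_000, seed=0):
    """Uniform-random protocol: each label drawn independently, so per-class
    counts only fluctuate around 50 rather than being exactly 50."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=num_samples)
```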
The following are the results from Table 14 of the original paper (all numbers are for 256×256 generation; "Random" vs. "Balanced" indicates the class-label sampling protocol):

| Method | Epochs | gFID↓ w/o guid. (Random) | IS↑ w/o guid. (Random) | gFID↓ w/o guid. (Balanced) | IS↑ w/o guid. (Balanced) | gFID↓ w/ guid. (Random) | IS↑ w/ guid. (Random) | gFID↓ w/ guid. (Balanced) | IS↑ w/ guid. (Balanced) |
|---|---|---|---|---|---|---|---|---|---|
| Autoregressive | |||||||||
| VAR (Tian et al., 2024) | 350 | 1.92 | 323.1 | 1.73 | 350.2 | ||||
| MAR (Li et al., 2024b) | 800 | 2.35 | 227.8 | 1.55 | 303.7 | ||||
| xAR-H (Ren et al., 2025) | 800 | - | - | - | - | - | - | 1.24 | 301.6 |
| Latent Diffusion with VAE | |||||||||
| SiT (Ma et al., 2024) | 1400 | 8.61 | 131.7 | 8.54 | 132.0 | 2.06 | 270.3 | 1.95 | 259.5 |
| REPA (Yu et al., 2025) | 800 | 5.90 | 157.8 | 5.78 | 158.3 | 1.42 | 305.7 | 1.29 | 306.3 |
| DDT (Wang et al., 2025c) | 400 | - | - | 6.27 | 154.7 | 1.40 | 303.6 | 1.26 | 310.6 |
| REPA-E (Leng et al., 2025) | 800 | 1.83 | 217.3 | 1.70 | 217.3 | 1.26 | 314.9 | 1.15 | 304.0 |
| Latent Diffusion with RAE (Ours) | |||||||||
| DiTDH-XL (DINOv2-B) | 800 | 1.60 | 242.7 | 1.51 | 242.9 | 1.28 | 262.9 | 1.13 | 262.6 |
The results in Table 14 (above) confirm that class-balanced sampling generally leads to slightly better gFID scores compared to random sampling across all methods, including DiTDH-XL. This highlights the importance of consistent evaluation protocols for fair comparisons in generative modeling.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work fundamentally challenges conventional wisdom in latent generative modeling by demonstrating the efficacy of pretrained representation encoders for both high-fidelity reconstruction and superior image generation. The authors introduce Representation Autoencoders (RAEs), which pair frozen semantic encoders (like DINO, SigLIP, MAE) with a lightweight, trained decoder. This approach yields semantically rich, high-dimensional latent spaces, overcoming the limitations of traditional VAEs (outdated backbones, low capacity, weak representations).
The paper rigorously addresses the challenges of operating Diffusion Transformers (DiTs) within these high-dimensional RAE latent spaces. It proposes and validates three key methodological advancements:
1. DiT Width Matching: Scaling the DiT's width to match or exceed the RAE token dimensionality, supported by both empirical evidence and theoretical proof, is crucial for effective learning.
2. Dimension-Dependent Noise Schedule: Adapting the noise schedule to the effective data dimension (token count times token dimensionality) significantly improves training (a sketch of this shift follows below the list).
3. Noise-Augmented Decoding: Training the RAE decoder with added noise enhances its robustness and generalization to the imperfect, continuous latent outputs of diffusion models.

Furthermore, the paper introduces DiTDH, a DiT variant equipped with a shallow yet wide DDT head, which efficiently increases model capacity for denoising high-dimensional latents without incurring quadratic computational overhead.
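As an illustration of the second point, here is a minimal sketch of a dimension-dependent timestep shift in the style of the resolution-dependent shifting popularized by flow-matching models, with the shift factor derived from the effective data dimension. The reference dimension and the square-root rule are assumptions for illustration, not necessarily the paper's exact formula.

```python
import math

def dimension_shift_factor(num_tokens, token_dim, base_dim=32 * 32 * 4):
    """Shift factor alpha = sqrt(effective_dim / base_dim).

    base_dim is a hypothetical reference (here a 32x32x4 VAE latent);
    the paper's reference dimension may differ.
    """
    return math.sqrt(num_tokens * token_dim / base_dim)

def shift_timestep(t, alpha):
    """Map a uniform t in [0, 1] toward the high-noise end for higher-dimensional
    data, using the shifting function t' = alpha*t / (1 + (alpha - 1)*t)
    (under the convention that t = 1 corresponds to pure noise)."""
    return alpha * t / (1 + (alpha - 1) * t)

# Example: 256 DINOv2-B tokens of dimension 768 vs. a 32x32x4 VAE latent.
alpha = dimension_shift_factor(num_tokens=256, token_dim=768)
print(shift_timestep(0.5, alpha))   # t = 0.5 is pushed strongly toward 1
```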
Empirically, the RAE-based DiTDH achieves state-of-the-art ImageNet generation results, with 1.51 FID at 256×256 (without guidance) and 1.13 FID at both 256×256 and 512×512 (with guidance). These results demonstrate substantially faster convergence and higher generative quality compared to previous VAE-based and representation-aligned methods. The work redefines autoencoding as a representation foundation, advocating for RAE latents as the new default for diffusion transformer training.
7.2. Limitations & Future Work
The authors implicitly or explicitly point out several limitations and suggest future directions:
- Computational Cost of Scaling: While DiTDH addresses the quadratic cost for the DiT backbone, handling very high resolutions by directly increasing the number of tokens (e.g., training the RAE at 512×512 with 1024 tokens) remains more computationally expensive than decoder upsampling. This implies a trade-off between gFID and efficiency at extremely high resolutions.
- Impact of Encoder Choice: Although DINOv2-B was found to be the best for generation, the varying performance across MAE-B, SigLIP2-B, and DINOv2-B shows that not all representation encoders are equally suited for generative tasks, even when their reconstruction quality is good. What makes a representation encoder "generative-friendly" could be explored further.
- Role of Structured Representation: While the paper conclusively shows that structured representations from RAEs are crucial (raw pixel diffusion performs poorly), a deeper theoretical understanding of why certain semantic structures benefit diffusion models could be investigated.
- Generalizability of DiTDH: The DiTDH's benefit is specific to high-dimensional RAE latents; it performs worse on low-dimensional VAE latents. It is therefore not a universally optimal architectural modification but a context-dependent one. Future work could explore how to adapt DDT heads to a wider range of latent-space dimensionalities.
- Guidance Mechanism: While AutoGuidance performs well, the paper notes that Classifier-Free Guidance without an interval does not improve FID, and CFG with an interval requires careful tuning (the interval idea is sketched below). Further research into more robust and universally effective guidance mechanisms for RAE-based DiTs would be valuable.
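To illustrate the "CFG with interval" idea mentioned above, here is a minimal sketch that applies classifier-free guidance only when the timestep falls inside a chosen interval. The bounds, scale, and model signature are hypothetical, and the paper's best results instead use AutoGuidance, which guides with a separate, weaker model.

```python
def cfg_with_interval(model, x_t, t, class_labels, scale=2.0,
                      t_lo=0.2, t_hi=0.8):
    """Classifier-free guidance applied only for t in [t_lo, t_hi] (sketch).

    model(x_t, t, labels) is assumed to return the conditional prediction and
    model(x_t, t, None) the unconditional one; bounds and scale are illustrative.
    """
    cond = model(x_t, t, class_labels)
    if t_lo <= float(t) <= t_hi:                 # guide only inside the interval
        uncond = model(x_t, t, None)
        return uncond + scale * (cond - uncond)
    return cond                                  # outside the interval: no guidance
```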
7.3. Personal Insights & Critique
This paper offers a highly impactful contribution by modernizing the autoencoder component in latent diffusion models. The core idea of leveraging powerful, frozen representation encoders is elegant and intuitive, effectively inheriting rich semantic knowledge into the generative pipeline. This paradigm shift from compression-centric VAEs to representation-centric RAEs is a significant step forward.
The paper's rigorous analysis of the challenges posed by high-dimensional latent spaces and its theoretically motivated solutions are particularly strong. The empirical validation of needing DiT width to match token dimensionality (Theorem 1) is a crucial insight that explains past difficulties and guides future architectural design. The noise-augmented decoding and dimension-dependent noise schedule are practical and effective techniques that address common pitfalls in diffusion training.
The introduction of DiTDH is also a clever architectural modification that resolves the scalability issue of Transformers in high-dimensional settings without incurring prohibitive costs. The achieved state-of-the-art results on ImageNet validate the effectiveness of the entire framework.
Potential Issues/Areas for Improvement:
- Computational Cost of RAE Training: While the RAE decoder is lightweight, training the pretrained representation encoder itself required massive compute. Although these encoders are frozen and pre-existing, the overall "cost of knowledge acquisition" for the RAE system is still very high, implicitly relying on the vast compute used for models like DINOv2. This is an inherent trade-off of leveraging large pretrained models, but worth noting for practitioners with limited resources.
- Encoder Generalization: Freezing the encoder is efficient, but it may limit the model's ability to adapt the latent space to downstream tasks or data distributions that differ substantially from the encoder's pretraining data. Exploring fine-tuning strategies for the encoder, even partial ones, could be a future direction, though it would add complexity.
- Interpretability of High-Dimensional Latents: While RAE latents are semantically rich, their high dimensionality may make them less interpretable than highly compressed VAE latents. Further work could explore methods for analyzing or visualizing these latent spaces to gain deeper insight into the learned representations.
- Sensitivity to Hyperparameters: Diffusion models and the GAN losses used in RAE decoder training are often sensitive to hyperparameters. While the paper provides detailed training recipes, the optimal settings for noise-augmented decoding and guidance scales may vary across datasets and model sizes.
Transferability & Applications:
The RAE concept is highly transferable. This framework could be applied to:
- Video Generation: RAEs could provide semantically rich latent representations for video frames, leading to higher-quality and more coherent results in latent video diffusion models.
- 3D Content Generation: Integrating RAEs with 3D encoders (e.g., for point clouds or NeRFs) could enable more semantically aware, high-fidelity 3D generative models.
- Cross-Modal Generation: The use of multimodal encoders like SigLIP suggests RAEs could be powerful for text-to-image or text-to-video generation, where rich semantic alignment could substantially enhance generation quality.
- Image Editing: The semantically meaningful latent space could facilitate more intuitive and controllable image editing, allowing users to manipulate high-level concepts rather than low-level pixels.

In conclusion, this paper delivers a robust and highly effective solution to a critical bottleneck in Diffusion Transformers, positioning RAEs as a strong foundation for the next generation of generative models.