PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
TL;DR Summary
PixArt-α uses a decomposed training strategy, efficient Transformer design, and rich auto-labeled data to produce high-quality 1024px text-to-image synthesis with only 10.8% of Stable Diffusion's training cost.
Abstract
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
- Authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li.
- Affiliations: The authors are affiliated with Huawei Noah's Ark Lab, Dalian University of Technology, The University of Hong Kong (HKU), and The Hong Kong University of Science and Technology (HKUST). This indicates a strong collaboration between industry research and academia.
- Journal/Conference: The paper is available on arXiv, which is a preprint server. This means it has not yet undergone formal peer review for a conference or journal publication at the time of this version's release, but it allows for rapid dissemination of research findings.
- Publication Year: 2023
- Abstract: The paper introduces PIXART-α, a text-to-image (T2I) diffusion model based on the Transformer architecture. It aims to produce photorealistic, high-resolution (1024px) images that are competitive with state-of-the-art models like Imagen and Midjourney, but at a fraction of the training cost. The core innovations are a decomposed three-stage training strategy, an efficient T2I Transformer architecture, and the use of high-quality, auto-labeled training data. As a result, PIXART-α requires only 10.8% of the training time of Stable Diffusion v1.5, significantly reducing financial costs and CO2 emissions. The authors hope this work will enable smaller teams and startups to develop high-quality generative models.
- Original Source Link:
- Official Link: https://arxiv.org/abs/2310.00426
- PDF Link: http://arxiv.org/pdf/2310.00426v3
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Training state-of-the-art text-to-image (T2I) models is prohibitively expensive. For example, Stable Diffusion v1.5 required over 6,000 A100 GPU days (costing ~$320,000), and RAPHAEL required 60,000 A100 GPU days (~$3,000,000).
- Importance: These high costs create a significant barrier to entry for academic labs, startups, and individual researchers, stifling innovation in the AIGC (AI Generated Content) community. Additionally, the immense energy consumption leads to substantial CO2 emissions.
- Fresh Angle: The paper challenges the "bigger is better" paradigm that relies on massive datasets and computational brute force. It asks: can we develop a high-quality image generator with affordable resource consumption? The innovation lies in a strategic, efficiency-focused approach that decomposes the complex T2I task into simpler, manageable stages and improves both the model architecture and data quality.
- Main Contributions / Findings (What):
- Training Strategy Decomposition: Instead of training a model from scratch on all objectives simultaneously, the authors break the process into three distinct stages: (1) learning pixel dependencies on generic images, (2) learning text-image alignment with high-quality data, and (3) fine-tuning for aesthetic quality and high resolution. This disentangled approach dramatically improves training efficiency.
- Efficient T2I Transformer: The paper adapts the Diffusion Transformer (DiT) architecture for T2I generation. It injects text conditions via `cross-attention` layers and streamlines the conditioning mechanism with a novel `adaLN-single` module, reducing parameter count and GPU memory usage without sacrificing quality. A re-parameterization trick allows this efficient model to leverage weights from a pre-trained class-conditional model.
- High-Informative Data Curation: The authors identify that common datasets like LAION have noisy and sparse captions. They propose an auto-labeling pipeline using a Vision-Language Model (LLaVA) to generate dense, detailed pseudo-captions for images from the SAM dataset. This high "concept density" data accelerates the learning of text-image alignment.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Text-to-Image (T2I) Synthesis: The task of generating a visual image that semantically corresponds to a given text prompt.
- Denoising Diffusion Probabilistic Models (DDPMs): A class of generative models that work by a two-step process. First, a "forward process" gradually adds Gaussian noise to an image until it becomes pure noise. Second, a "reverse process" learns to denoise it step-by-step, effectively generating an image from random noise (see the short code sketch after this list).
- Latent Diffusion Models (LDMs): An efficiency-focused improvement on DDPMs, popularized by Stable Diffusion. Instead of applying the diffusion process in the high-dimensional pixel space, LDMs first compress the image into a lower-dimensional latent space using a Variational Autoencoder (VAE). The diffusion process happens in this compact space, significantly reducing computational requirements. The final latent representation is then decoded back into a full-resolution image.
- Transformer: An architecture originally designed for natural language processing (NLP) that relies on a self-attention mechanism. Unlike CNNs, which have a local receptive field, Transformers can model long-range dependencies between all elements in a sequence (e.g., words in a sentence or patches in an image), making them highly scalable and powerful.
- Diffusion Transformer (DiT): A model proposed by Peebles & Xie (2023) that replaces the conventional U-Net backbone in diffusion models with a Transformer. This demonstrates that Transformers are a scalable and effective architecture for diffusion-based image generation.
- Cross-Attention: A mechanism within the Transformer architecture that allows one sequence to "attend to" or incorporate information from another sequence. In T2I models, it's the key mechanism used to inject the conditioning information from the text prompt into the image generation process.
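To make the forward/reverse process concrete, here is a minimal PyTorch sketch of the DDPM forward-noising step. This is generic background rather than code from the paper; the beta-schedule values and tensor shapes are illustrative assumptions.

```python
import torch

# Illustrative linear beta schedule; the concrete values are assumptions, not taken from the paper.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product \bar{alpha}_t

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0): x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    abar = alpha_bars[t].view(-1, 1, 1, 1)            # broadcast over (B, C, H, W)
    xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    return xt, eps                                     # a denoiser is trained to predict eps from (xt, t)

# Usage: noise a batch at random timesteps (in an LDM, x0 would be VAE latents rather than pixels).
x0 = torch.randn(4, 4, 32, 32)
t = torch.randint(0, T, (4,))
xt, target_eps = forward_diffuse(x0, t)
```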
- Previous Works:
- DALL·E 2, Imagen, Stable Diffusion (LDM): These are seminal T2I models that demonstrated the power of diffusion models for photorealistic image synthesis. However, as the paper points out, their training is computationally intensive and relies on vast, often noisy, web-scraped data.
- RAPHAEL: A more recent and larger model that achieves state-of-the-art image quality but at an even higher computational cost (60,000 A100 GPU days), exemplifying the problem PixArt-α aims to solve.
- Diffusion Transformer (DiT): PixArt-α builds directly upon the DiT architecture. DiT showed that a Transformer backbone could outperform the traditional U-Net for class-conditional image generation (e.g., generating an image of a 'dog' from the ImageNet classes). PixArt-α adapts this class-conditional model for the more complex text-conditional task.
- Technological Evolution: The field has moved from early generative models like GANs and VAEs to the more stable and high-fidelity DDPMs. LDMs made these models practical by operating in a latent space. DiT further advanced the architecture by replacing the U-Net with a more scalable Transformer. PixArt-α represents the next logical step: optimizing the DiT architecture and the entire training pipeline for efficiency and effectiveness in the T2I domain.
- Differentiation:
- Training Paradigm: Unlike models trained end-to-end on massive, noisy datasets, PixArt-α uses a decomposed, multi-stage strategy that progressively builds capabilities, starting from basic image structure and moving to complex text alignment and aesthetics.
- Data Strategy: Instead of relying on raw web captions, PixArt-α emphasizes data quality over quantity. It curates a smaller but higher-quality dataset with dense captions generated by a powerful VLM, accelerating learning.
- Architectural Efficiency: While based on DiT, it introduces specific modifications like `adaLN-single` to reduce parameters and memory footprint, making the model more lightweight without compromising performance.
4. Methodology (Core Technology & Implementation)
The core methodology of PixArt-α is built on three pillars designed to tackle the high cost of T2I training.
1. Training Strategy Decomposition
The authors argue that training a T2I model involves learning three distinct things: pixel dependencies (how to make a realistic image), text-image alignment (how to follow the prompt), and aesthetic quality (how to make it look good). Training all three simultaneously from scratch is inefficient. They propose a three-stage approach:
- Stage 1: Pixel Dependency Learning: The model first learns the fundamental structure of natural images. This is done by training a class-conditional DiT model on the ImageNet dataset. ImageNet provides a diverse set of objects, and class-conditional training is computationally cheaper than full text-to-image training. This stage provides the model with a strong "prior" on what realistic images look like, serving as a powerful initialization for the next stage.
- Stage 2: Text-Image Alignment Learning: This is the most critical and challenging part. The goal is to teach the model to understand text prompts. The key insight is to use a dataset with high concept density. The authors use the SAM dataset (which is rich in objects) and generate new, highly detailed captions using the LLaVA model. This "information-rich" data allows the model to learn more concepts per training iteration and reduces ambiguity, leading to faster and more accurate alignment between text and images.
- Stage 3: High-Resolution and Aesthetic Image Generation: After the model has learned image structure and text alignment, it is fine-tuned on a curated dataset of high-quality, aesthetically pleasing images (JourneyDB and an internal dataset). This final stage polishes the model's output, improves its artistic quality, and adapts it to generate high-resolution images (up to 1024x1024). A schematic sketch of the three stages follows this list.
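The decomposition can be summarized schematically. The sketch below is illustrative only: the dataset names and sizes follow the paper's description, but the `Stage`/`train` interface and the resolution values are assumptions, not PixArt-α's actual training code.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str
    condition: str     # what the diffusion model is conditioned on
    resolution: int    # illustrative; the paper moves from low to high resolution
    goal: str

# Decomposed training schedule as described in the paper (interface is hypothetical).
PIPELINE = [
    Stage("pixel_dependency", "ImageNet (1M images)", "class label", 256,
          "learn the structure of natural images cheaply"),
    Stage("text_image_alignment", "SAM-LLaVA (10M dense pseudo-captions)", "T5 text embedding", 512,
          "learn to follow prompts from concept-dense captions"),
    Stage("aesthetics_high_res", "JourneyDB + internal data (14M images)", "T5 text embedding", 1024,
          "polish aesthetics and adapt to high resolution"),
]

def train(model, stage: Stage):
    """Placeholder loop: each stage fine-tunes the weights produced by the previous one."""
    print(f"[{stage.name}] {stage.data} @ {stage.resolution}px -> {stage.goal}")
    return model

model = None  # stands in for the DiT-based backbone
for stage in PIPELINE:
    model = train(model, stage)
```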
2. Efficient T2I Transformer
PixArt-α modifies the standard DiT architecture for efficiency and T2I capabilities.
(Figure, caption translated): Schematic of the PixArt-α architecture, showing the multi-head cross-attention module added to each Transformer block to inject text conditions, the adaLN-single time-conditioning parameters shared across all blocks, and the distinction between trainable and frozen parameters.
- Cross-Attention Layer: To inject text conditioning, a standard multi-head `cross-attention` layer is added to each Transformer block, positioned after the `self-attention` layer. This allows the image representation at each block to query the text embeddings (from a T5 text encoder) and incorporate semantic information.
- adaLN-single: The original DiT uses an `adaLN` (adaptive Layer Normalization) module in which conditioning information (the timestep and class label) is processed by a block-specific MLP to produce scaling and shifting parameters for the normalization layer. This is parameter-heavy. PixArt-α proposes `adaLN-single`, which streamlines the process: a single global MLP processes the time embedding once to produce a global set of scale/shift parameters; this global set is shared across all Transformer blocks; and each block adds a small, layer-specific trainable embedding to the shared parameters to obtain its final values. This design replaces many large MLPs with one global MLP and several small embeddings, significantly reducing the model's parameter count (by 26%) and GPU memory usage (by 21%).
- Re-parameterization: To enable loading the weights from the Stage 1 class-conditional model, the authors devise an initialization trick. The layer-specific embeddings are initialized such that, for a chosen timestep, the resulting scale/shift parameters are identical to those of the original DiT model (with a null class condition). This ensures a smooth transition from the pre-trained model to the T2I model, preserving the learned knowledge of image structure. A simplified sketch of such a block follows.
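The following PyTorch sketch is a simplified reconstruction of such a block based on the description above, not the official implementation; the hidden size, head count, MLP ratio, and the exact placement of gates are assumptions.

```python
import torch
import torch.nn as nn

class PixArtBlockSketch(nn.Module):
    """DiT-style block with cross-attention and adaLN-single (illustrative reconstruction)."""

    def __init__(self, dim: int = 1152, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN-single: no per-block conditioning MLP, only a small learnable table of offsets
        # (shift/scale/gate for the attention path and for the MLP path -> 6 groups).
        # Re-parameterization (per the paper): this table can be initialized so that, at a chosen
        # timestep, the combined parameters match the pretrained class-conditional DiT's adaLN output.
        self.scale_shift_table = nn.Parameter(torch.zeros(6, dim))

    def forward(self, x, text_tokens, global_ss):
        # global_ss: (B, 6, dim), computed ONCE per denoising step by a single global MLP
        # from the time embedding and shared by every block in the network.
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = (
            self.scale_shift_table[None] + global_ss
        ).chunk(6, dim=1)
        h = self.norm1(x) * (1 + scale_a) + shift_a
        h, _ = self.self_attn(h, h, h)
        x = x + gate_a * h
        # Cross-attention injects the text condition: image tokens query the T5 text tokens.
        h, _ = self.cross_attn(x, text_tokens, text_tokens)
        x = x + h
        h = self.norm2(x) * (1 + scale_m) + shift_m
        x = x + gate_m * self.mlp(h)
        return x

# Usage with dummy tensors: 256 image tokens and 120 (projected) T5 text tokens.
block = PixArtBlockSketch()
out = block(torch.randn(2, 256, 1152), torch.randn(2, 120, 1152), torch.randn(2, 6, 1152))
```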
3. Dataset Construction
The paper emphasizes that data quality is as important as model architecture.
- Image-text pair auto-labeling: The authors find captions in large public datasets like LAION to be problematic: they are often misaligned with the image, describe only a small part of the content, and use a limited vocabulary. To fix this, they use LLaVA, a powerful vision-language model, to generate new, detailed captions. Using the prompt "Describe this image and its style in a very detailed manner", LLaVA produces captions that are much richer and more accurate.
- SAM Dataset: Instead of using LAION, which contains many simple product images, they apply the auto-labeling pipeline to the SAM dataset. The SAM dataset was created for segmentation and thus contains images with a rich variety of objects and complex scenes, making it ideal for learning complex text-image compositions.
- Vocabulary Analysis: The effectiveness of this data strategy is shown in Table 1, where VN is the number of valid distinct nouns (appearing >10 times) and DN is the total number of distinct nouns.

Manually transcribed Table 1: Statistics of noun concepts for different datasets.

| Dataset | VN/DN | Total Nouns | Average |
|---|---|---|---|
| LAION | 210K/2461K = 8.5% | 72.0M | 6.4/Img |
| LAION-LLaVA | 85K/646K = 13.3% | 233.9M | 20.9/Img |
| SAM-LLaVA | 23K/124K = 18.6% | 327.9M | 29.3/Img |
| Internal | 152K/582K = 26.1% | 136.6M | 12.2/Img |
- The SAM-LLaVA dataset has a much higher average number of nouns per image (29.3) compared to the original LAION (6.4).
- It also has a higher ratio of valid nouns (18.6%), indicating a more balanced and less sparse vocabulary. This high "concept density" means the model learns to associate more objects and attributes with text in every training step, dramatically improving efficiency.
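The "concept density" statistic in Table 1 can be approximated with a simple noun count over captions. The sketch below uses spaCy part-of-speech tagging as an assumed stand-in for the paper's exact counting procedure; the ">10 occurrences" threshold follows the table's definition of a valid noun, and the sample captions are invented.

```python
from collections import Counter

import spacy  # assumes the small English model is installed: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def noun_stats(captions: list[str]) -> dict:
    """Distinct nouns (DN), valid nouns (VN, >10 occurrences), and average nouns per caption."""
    counts, total = Counter(), 0
    for doc in nlp.pipe(captions):
        nouns = [tok.lemma_.lower() for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
        counts.update(nouns)
        total += len(nouns)
    dn = len(counts)
    vn = sum(1 for c in counts.values() if c > 10)
    return {
        "DN": dn,
        "VN": vn,
        "VN/DN": vn / dn if dn else 0.0,
        "avg_nouns_per_caption": total / len(captions) if captions else 0.0,
    }

# Toy usage with two dense pseudo-captions; on real data this would run over the whole caption set.
print(noun_stats([
    "A wooden table with a ceramic teapot, two cups, and a vase of tulips near a window.",
    "A busy street market with fruit stalls, bicycles, umbrellas, and people in raincoats.",
]))
```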
5. Experimental Setup
- Datasets:
- Training:
- Stage 1: 1M images from ImageNet.
- Stage 2: 10M images from the SAM dataset, captioned with LLaVA (SAM-LLaVA).
- Stage 3: 14M high-quality images from JourneyDB (4M) and an internal dataset (10M).
- Evaluation:
- MS-COCO: Used for zero-shot FID evaluation. It's a standard benchmark for image generation quality.
- T2I-CompBench: A specialized benchmark designed to evaluate a model's ability to handle complex compositional prompts (e.g., attributes, spatial relations).
- Evaluation Metrics:
- Fréchet Inception Distance (FID):
- Conceptual Definition: FID measures the quality and diversity of generated images by comparing the statistical distribution of generated images to that of real images. It uses features extracted from an InceptionV3 network. A lower FID score indicates that the two distributions are more similar, meaning the generated images are closer to real images in terms of quality and diversity.
- Mathematical Formula: $\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$
- Symbol Explanation:
- $r$ and $g$ index the sets of real and generated image features, respectively.
- $\mu_r$ and $\mu_g$ are the mean vectors of the features for real and generated images.
- $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of the features.
- $\lVert \cdot \rVert_2^2$ denotes the squared L2 norm.
- $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.
- A small numerical sketch of this computation appears at the end of this section, after the baselines.
- T2I-CompBench Score:
- Conceptual Definition: This is not a single score but a suite of metrics that evaluate compositional abilities. It tests for:
- Attribute Binding: Can the model correctly assign attributes (e.g., color, shape, texture) to the right objects? (e.g., "a red cube and a blue sphere").
- Object Relationship: Can the model correctly depict spatial (e.g., "a cat on top of a car") and non-spatial relationships?
- Complex Compositions: Can the model handle prompts with multiple objects, attributes, and relationships simultaneously? Higher scores are better for all sub-metrics.
- Human Preference Rate:
- Conceptual Definition: A metric derived from user studies where human evaluators are asked to choose which model's output they prefer based on criteria like image quality (fidelity) and alignment with the text prompt. It is expressed as the percentage of times a model's output was chosen as the best.
- Baselines: The paper compares PixArt-α against a wide range of prominent T2I models, including:
- Major Diffusion Models: DALL·E, DALL·E 2, Imagen, LDM (Stable Diffusion v1.5, SDXL, SDv2), DeepFloyd.
- GAN-based Model: GigaGAN.
- High-Cost SOTA Model: RAPHAEL.
- Compositionality-focused Models: Composable v2, Structured v2, etc., on the T2I-CompBench benchmark.
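As promised in the metrics discussion above, here is a minimal NumPy/SciPy sketch of the FID computation; the random feature arrays are placeholders for InceptionV3 features of real and generated images.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID = ||mu_r - mu_g||_2^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^{1/2})."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):  # tiny imaginary parts can appear from numerical error
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy usage: 64-dimensional stand-in features for 500 "real" and 500 "generated" images.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```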
6. Results & Analysis
Core Results
- Fidelity and Training Efficiency (Table 2):

Manually transcribed Table 2: Comparison with recent T2I models.

| Method | Type | #Params | #Images | FID-30K↓ | GPU days |
|---|---|---|---|---|---|
| DALL·E | Diff | 12.0B | 250M | 27.50 | - |
| GLIDE | Diff | 5.0B | 250M | 12.24 | - |
| LDM | Diff | 1.4B | 400M | 12.64 | - |
| DALL·E 2 | Diff | 6.5B | 650M | 10.39 | 41,667 A100 |
| SDv1.5 | Diff | 0.9B | 2000M | 9.62 | 6,250 A100 |
| GigaGAN | GAN | 0.9B | 2700M | 9.09 | 4,783 A100 |
| Imagen | Diff | 3.0B | 860M | 7.27 | 7,132 A100 |
| RAPHAEL | Diff | 3.0B | 5000M+ | 6.61 | 60,000 A100 |
| PIXART-α | Diff | 0.6B | 25M | 7.32 | 753 A100 |

The results are striking. PIXART-α achieves an FID score of 7.32, which is competitive with Imagen (7.27) and far better than SDv1.5 (9.62). However, it does so with drastically fewer resources:
- Training Cost: 753 A100 GPU days, which is only 12% of SDv1.5's cost and about 1% of RAPHAEL's cost.
- Data Usage: Only 25M images, compared to 2B for SDv1.5 and 5B+ for RAPHAEL.
- Model Size: The smallest at 0.6B parameters. This table provides the strongest evidence for the paper's central claim of achieving SOTA-competitive quality with extreme efficiency.
- Alignment Assessment (Table 3):

Manually transcribed Table 3: Alignment evaluation on T2I-CompBench.

| Model | Color↑ | Shape↑ | Texture↑ | Spatial↑ | Non-Spatial↑ | Complex↑ |
|---|---|---|---|---|---|---|
| Stable v1.4 | 0.3765 | 0.3576 | 0.4156 | 0.1246 | 0.3079 | 0.3080 |
| Stable v2 | 0.5065 | 0.4221 | 0.4922 | 0.1342 | 0.3096 | 0.3386 |
| Composable v2 | 0.4063 | 0.3299 | 0.3645 | 0.0800 | 0.2980 | 0.2898 |
| Structured v2 | 0.4990 | 0.4218 | 0.4900 | 0.1386 | 0.3111 | 0.3355 |
| Attn-Exct v2 | 0.6400 | 0.4517 | 0.5963 | 0.1455 | 0.3109 | 0.3401 |
| GORS | 0.6603 | 0.4785 | 0.6287 | 0.1815 | 0.3193 | 0.3328 |
| DALL·E 2 | 0.5750 | 0.5464 | 0.6374 | 0.1283 | 0.3043 | 0.3696 |
| SDXL | 0.6369 | 0.5408 | 0.5637 | 0.2032 | 0.3110 | 0.4091 |
| PIXART-α | 0.6886 | 0.5582 | 0.7044 | 0.2082 | 0.3179 | 0.4117 |

Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships. PIXART-α achieves the best or second-best scores across all six categories, outperforming larger models like SDXL and DALL·E 2. This demonstrates the success of its Stage 2 alignment learning with high-density captions, which gives it superior control over compositional semantics.
- User Study:
(Figure, caption translated): User preference results on 300 fixed prompts; the y-axis shows the preference rate (%), and PIXART-α outperforms the compared models on both image quality and text alignment.
The user study shows that human evaluators strongly prefer PIXART-α over other accessible top-tier models such as SDXL, DALL·E 2, and DeepFloyd in terms of both image quality and text alignment. For instance, it was preferred more than four times as often as SDv2 for alignment. This complements the quantitative metrics and confirms its near-commercial quality.
Ablations / Parameter Sensitivity
(Figure, caption translated): Ablation comparison of different method variants, showing generated samples on the left and the corresponding zero-shot FID-2K and GPU memory usage on the right; the proposed method saves 21% GPU memory while keeping FID low.
The ablation study in Figure 6 validates the key design choices:
- `w/o re-param`: A model trained from scratch without the Stage 1 pre-training and re-parameterization trick performs terribly. The generated images are distorted and lack detail, proving that the decomposed training strategy and knowledge transfer are essential for efficiency and quality.
- `adaLN` vs. `adaLN-single`: The model with the original `adaLN` from DiT and the model with the proposed `adaLN-single` produce visually comparable results and similar FID scores. However, `adaLN-single` uses 21% less GPU memory (23GB vs. 29GB) and has 26% fewer parameters (611M vs. 833M). This confirms that the `adaLN-single` module is a successful optimization that improves efficiency without harming performance. A rough parameter-count sketch follows.
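The parameter saving is easy to see with back-of-the-envelope arithmetic. The sketch below compares per-block adaLN conditioning projections with one shared projection plus small per-block tables; the width and depth are illustrative DiT-XL-like assumptions and the result is not meant to reproduce the paper's exact 833M vs. 611M totals.

```python
def adaln_params(dim: int, depth: int) -> int:
    """Original DiT adaLN: every block owns a Linear(dim -> 6*dim) conditioning projection."""
    per_block = dim * 6 * dim + 6 * dim   # weights + bias
    return depth * per_block

def adaln_single_params(dim: int, depth: int) -> int:
    """adaLN-single: one shared Linear(dim -> 6*dim) plus a small (6, dim) table per block."""
    shared = dim * 6 * dim + 6 * dim
    return shared + depth * 6 * dim

dim, depth = 1152, 28  # illustrative DiT-XL-like width and depth (assumption)
a, b = adaln_params(dim, depth), adaln_single_params(dim, depth)
print(f"adaLN: {a / 1e6:.1f}M params, adaLN-single: {b / 1e6:.1f}M params, saved: {(a - b) / 1e6:.1f}M")
```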
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that it is possible to train a high-quality, photorealistic text-to-image model without exorbitant computational resources. PIXART-α achieves performance competitive with or superior to leading models like SDXL and Imagen at a fraction of the training cost (~10% of SDv1.5, ~1% of RAPHAEL). The success is attributed to a trifecta of innovations: a decomposed training strategy, an efficient Transformer architecture, and a data-centric approach using high-quality, auto-labeled captions.
- Limitations & Future Work: The authors honestly acknowledge several limitations (Appendix A.11):
- Counting: The model struggles to accurately generate a specific number of objects (e.g., "three books").
- Fine Details: It can fail to render complex details accurately, particularly human hands, a common problem in T2I models.
- Text Generation: The model's ability to render coherent text within images is weak, likely due to a lack of such data in its training set. The authors plan to address these issues in future work.
- Personal Insights & Critique:
- Democratizing AI: This paper is a significant contribution to the AIGC community. By providing a clear and reproducible roadmap for efficient T2I training, it lowers the barrier to entry and empowers smaller research teams and startups to innovate in this space. It's a welcome counter-narrative to the trend of ever-larger models from a few big labs.
- Data-Centric AI: The emphasis on data quality over quantity is a powerful takeaway. The auto-labeling pipeline using LLaVA is a clever and effective strategy that could be applied to many other domains beyond T2I. It shows that investing in curating a smaller, higher-quality dataset can be far more efficient than brute-forcing with a massive, noisy one.
- Architectural Elegance: The `adaLN-single` module is a simple yet effective architectural tweak that highlights the importance of thoughtful model design for efficiency.
- Reproducibility Concern: A minor critique is the use of a "10M internal dataset" in the final aesthetic tuning stage. While common in corporate research, this component is not publicly available, which slightly hinders full reproducibility by the external community.
- Flexibility: The successful application of DreamBooth and ControlNet (Appendix A.9) demonstrates that the PIXART-α base model is not just a one-trick pony but a flexible foundation that can be extended with popular customization techniques. This greatly enhances its practical value.