SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
TL;DR Summary
SDXL improves high-resolution text-to-image synthesis with a threefold larger UNet, novel conditioning, and multi-aspect ratio training, plus a refinement model enhancing visual fidelity; it rivals top closed-source generators and is fully open source.
Abstract
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.
- Affiliations: The authors are affiliated with Stability AI's Applied Research team.
- Journal/Conference: The paper was released as a preprint on arXiv. It is a technical report detailing the model, not a peer-reviewed conference or journal publication. The reputation of Stability AI as the creator of Stable Diffusion lends significant weight to this report.
- Publication Year: 2023 (Initial version submitted in July 2023).
- Abstract: The paper introduces SDXL, a new latent diffusion model for generating images from text. SDXL improves upon previous Stable Diffusion models with a UNet backbone that is three times larger, featuring more attention blocks and a larger context from using a second text encoder. The authors introduce novel conditioning methods, train the model on multiple aspect ratios, and add an optional refinement model to enhance visual fidelity post-generation. They report that SDXL significantly outperforms older versions of Stable Diffusion and is competitive with leading closed-source, state-of-the-art models. The paper emphasizes its commitment to open research by releasing the code and model weights.
- Original Source Link: https://arxiv.org/abs/2307.01952 (PDF: http://arxiv.org/pdf/2307.01952v1)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Previous open-source text-to-image models, including earlier versions of Stable Diffusion (1.5, 2.1), struggled with image quality, prompt understanding, and compositional accuracy compared to closed-source, "black-box" models like Midjourney and DALL-E 2. Specific issues included generating anatomically incorrect figures, poor text rendering, and artifacts like cropped subjects.
- Importance & Gaps: The dominance of closed-source models hinders research reproducibility, community innovation, and objective assessment of model biases and limitations. There was a clear need for a powerful, open-source foundation model that could close the performance gap.
- Innovation: SDXL introduces a combination of architectural scaling, novel data conditioning techniques, and a two-stage generation process to drastically improve image quality. The key innovations are a larger and more powerful UNet, the use of two text encoders for richer text conditioning, methods to handle variable image sizes and prevent cropping artifacts, and a separate refinement model.
- Main Contributions / Findings (What):
- Scaled-Up Architecture: SDXL employs a UNet with 2.6 billion parameters, roughly 3x larger than its predecessors, combined with two separate text encoders (OpenCLIP and CLIP) for a much larger and more expressive conditioning context.
- Novel Micro-Conditioning Techniques: The paper introduces two simple yet effective methods:
- Size-Conditioning: The model is conditioned on the original resolution of training images, preventing data loss from discarding small images and avoiding artifacts from upscaling.
- Crop-Conditioning: The model is conditioned on cropping coordinates, which prevents the model from learning to generate cropped objects and gives users control over framing.
- Multi-Aspect Ratio Training: The model is fine-tuned on images of various aspect ratios (e.g., portrait, landscape), allowing it to generate non-square images more naturally.
- Two-Stage Pipeline with Refiner: An optional refinement model is introduced, which takes the output of the base SDXL model and applies a specialized noising-denoising process to add high-frequency details and improve overall visual fidelity, particularly for faces and complex textures.
- State-of-the-Art Open-Source Performance: User studies show that SDXL is strongly preferred over previous Stable Diffusion versions and achieves results competitive with, and in some cases superior to, leading closed-source models like Midjourney v5.1.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Diffusion Models (DMs): Generative models that learn to create data by reversing a gradual noising process. They start with pure noise and iteratively denoise it over a series of steps to produce a clean sample (e.g., an image). The model that predicts the noise (or the clean image) at each step is typically a deep neural network.
- Latent Diffusion Models (LDMs): A specific type of diffusion model, on which Stable Diffusion is based. Instead of operating in the high-dimensional space of pixels, LDMs work in a compressed "latent" space. An autoencoder (specifically, a Variational Autoencoder or VAE) is first trained to compress images into a smaller latent representation and then reconstruct them. The diffusion process then happens in this computationally cheaper latent space.
- UNet: A deep convolutional neural network architecture with an encoder-decoder structure and "skip connections" that link layers from the encoder to corresponding layers in the decoder. This architecture is highly effective at tasks requiring pixel-level output and is the standard backbone for most diffusion models.
- Cross-Attention: A mechanism that allows the UNet to incorporate external conditioning information, such as text. During the denoising process, the UNet can "pay attention" to specific parts of the text prompt's embedding to guide the image generation, ensuring the output matches the text description.
- Text Encoders (CLIP & OpenCLIP): Models that are trained on vast amounts of image-text pairs to learn a shared embedding space where similar concepts in text and images are close together. Their text encoders are used to convert a user's prompt into a numerical vector (embedding) that the diffusion model's cross-attention mechanism can understand.
- Classifier-Free Guidance (CFG): A technique used during inference to improve the adherence of the generated image to the text prompt. It works by blending the model's prediction for the given prompt with its prediction for an empty (unconditional) prompt. A higher CFG scale pushes the generation to be more faithful to the prompt, often at the cost of diversity.
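For reference, the standard classifier-free guidance update (a general formulation, not something specific to SDXL) combines the conditional and unconditional noise predictions at each denoising step as follows:

```latex
% s: CFG scale, c: text conditioning, \varnothing: the empty (unconditional) prompt
\hat{\epsilon}_\theta(x_t, c) \;=\; \epsilon_\theta(x_t, \varnothing)
  \;+\; s \cdot \bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```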
- Previous Works:
- Stable Diffusion 1.x & 2.x: These are the direct predecessors of SDXL. They established the LDM framework as a powerful and efficient open-source tool for text-to-image generation. However, they were smaller (around 860M UNet parameters) and had limitations in image quality and prompt understanding that SDXL aims to solve.
- Imagen & DALL-E 2: Foundational, large-scale text-to-image models from Google and OpenAI, respectively. These models demonstrated that massive scale and sophisticated architectures could achieve unprecedented levels of photorealism and compositional ability, but they remained closed-source.
- Midjourney: A popular, high-quality, closed-source text-to-image service. At the time of the paper, it was considered a benchmark for aesthetic quality, making it a key competitor for SDXL to be evaluated against.
- Technological Evolution: The paper follows an established trend in deep generative modeling: scaling up the model size (more parameters) leads to better performance. It also shows a move towards more sophisticated model conditioning, not just on text but also on metadata like image size and cropping, treating them as first-class citizens in the training process.
- Differentiation: SDXL's key innovation lies in its holistic approach. It doesn't just scale up one component but intelligently combines architectural improvements (larger UNet, dual text encoders) with novel and practical training techniques (size-conditioning, crop-conditioning, multi-aspect training) and an optional refinement stage. Crucially, it delivers these improvements in an open-source package, democratizing access to state-of-the-art generation capabilities.
4. Methodology (Core Technology & Implementation)
SDXL introduces several modular improvements to the Stable Diffusion architecture.
4.1. Architecture & Scale
SDXL's core is a significantly larger and more powerful UNet and text encoder system. The key changes are detailed in Table 1.
(Manual Transcription of Table 1)
| Model | SDXL | SD 1.4/1.5 | SD 2.0/2.1 |
|---|---|---|---|
| # of UNet params | 2.6B | 860M | 865M |
| Transformer blocks | [0, 2, 10] | [1, 1, 1, 1] | [1, 1, 1, 1] |
| Channel mult. | [1, 2, 4] | [1, 2, 4, 4] | [1, 2, 4, 4] |
| Text encoder | CLIP ViT-L & OpenCLIP ViT-bigG | CLIP ViT-L | OpenCLIP ViT-H |
| Context dim. | 2048 | 768 | 1024 |
| Pooled text emb. | OpenCLIP ViT-bigG | N/A | N/A |
- Larger UNet (2.6B parameters): The UNet is approximately three times larger than previous models. The parameter increase comes primarily from using more Transformer blocks (which contain the self-attention and cross-attention layers). Unlike older models which distributed these blocks evenly, SDXL concentrates them at lower spatial resolutions (10 blocks at the second-to-last level), where more complex semantic processing occurs.
- Dual Text Encoders: SDXL uses two different text encoders: the standard CLIP ViT-L and the much larger OpenCLIP ViT-bigG. The outputs from the penultimate layer of both encoders are concatenated along the channel dimension, yielding a combined context dimension of 2048 (768 from CLIP ViT-L + 1280 from OpenCLIP ViT-bigG) and providing a richer, more nuanced text representation for the cross-attention layers.
- Pooled Text Embedding Conditioning: In addition to cross-attention, SDXL uses the pooled text embedding from the OpenCLIP ViT-bigG model as a further conditioning signal. This single vector, which represents the entire prompt, is added to the timestep embedding, giving the model a global sense of the prompt's content.
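A minimal PyTorch sketch of this text-conditioning scheme is shown below; the tensor widths follow Table 1, but the projection layer and the timestep-embedding size are illustrative assumptions rather than the actual Stability AI implementation.

```python
import torch

# Assumed penultimate-layer token embeddings from both encoders and the pooled
# OpenCLIP ViT-bigG embedding (random placeholders standing in for real encoder outputs).
tokens_clip_l   = torch.randn(1, 77, 768)    # CLIP ViT-L tokens
tokens_clip_big = torch.randn(1, 77, 1280)   # OpenCLIP ViT-bigG tokens
pooled_big      = torch.randn(1, 1280)       # pooled OpenCLIP ViT-bigG embedding
timestep_emb    = torch.randn(1, 1280)       # UNet timestep embedding (hypothetical width)

# 1) Cross-attention context: concatenate along the channel dimension -> 768 + 1280 = 2048.
context = torch.cat([tokens_clip_l, tokens_clip_big], dim=-1)        # (1, 77, 2048)

# 2) Global conditioning: project the pooled embedding and add it to the timestep embedding.
pooled_proj = torch.nn.Linear(1280, 1280)                            # hypothetical projection
global_cond = timestep_emb + pooled_proj(pooled_big)                 # (1, 1280)
```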
4.2. Micro-Conditioning
These are novel techniques that condition the model on image metadata during training.
- Conditioning the Model on Image Size:
- Problem: Datasets contain images of varying resolutions. Standard practice is to either discard images smaller than a target resolution (e.g., 512x512), losing valuable data (see Figure 2), or upscale them, which can introduce blurriness and artifacts.
(Figure 2: Height/width distribution of the pretraining dataset. The dashed box marks the roughly 39% of the data that would be discarded without size-conditioning because at least one edge is shorter than 256 pixels; color intensity indicates sample count.)
- Solution: SDXL conditions the UNet on the original resolution of the training image, $c_{\text{size}} = (h_{\text{original}}, w_{\text{original}})$.
- Implementation: The original height and width values are encoded using a Fourier feature encoding (a method to represent continuous values as a high-dimensional vector that neural networks can process more easily). This vector is then added to the model's timestep embedding. At inference, the user can specify a target "apparent resolution," influencing the level of detail in the generated image, as shown in the examples below.
(Figure: Street-graffiti images of a blue robot on a grey brick wall and sidewalk, generated with size-conditioning values of (64, 64), (128, 128), (256, 256), and (512, 512); detail increases with the conditioned size.)
(Figure: Four anime-style illustrations of a panda scientist in a white lab coat handling chemical reagents in a colorful, playful laboratory scene.)
- Conditioning the Model on Cropping Parameters:
- Problem: Random cropping is a standard data augmentation technique, but it can cause the model to learn to generate truncated or badly framed subjects (e.g., a person with their head cut off).
(Figure 4: Samples of a cat and a dragon generated from the same prompts by SDXL and by previous Stable Diffusion versions; 3 random samples per model, 50 DDIM steps, cfg-scale 8.0.)
- Solution: During training, when an image is randomly cropped, the model is conditioned on the crop coordinates, $c_{\text{crop}} = (c_{\text{top}}, c_{\text{left}})$, which specify the pixels cropped from the top and left.
- Implementation: Similar to size-conditioning, these coordinates are encoded via Fourier features and added to the timestep embedding. During inference, these values are set to $(c_{\text{top}}, c_{\text{left}}) = (0, 0)$, instructing the model to generate a well-centered, uncropped image. The user can also manually adjust these values to control the framing, as demonstrated below.
(Figure: SDXL samples of an astronaut and a pig generated with different crop-coordinate values $c_{\text{crop}}$, showing how the framing of the composition shifts.)
(Figure: Macro photographs of LEGO animal figurines on grass, shot from different angles, contrasting the models' texture with the background.)
The overall conditioning pipeline is summarized in Algorithm 1.
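The sketch below illustrates how such a combined conditioning embedding could be assembled in PyTorch, in the spirit of Algorithm 1; the Fourier frequencies, embedding widths, and final projection are illustrative assumptions, not a transcription of the paper's code.

```python
import math
import torch

def fourier_features(values: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal (Fourier) embedding of scalar conditioning values."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)   # (half,)
    angles = values[..., None].float() * freqs                          # (..., half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # (..., dim)

c_size   = torch.tensor([[1024.0, 1024.0]])  # original (h, w) of the training image
c_crop   = torch.tensor([[0.0, 0.0]])        # (c_top, c_left); set to (0, 0) at inference
c_target = torch.tensor([[1024.0, 1024.0]])  # target resolution (multi-aspect bucket)
pooled_text  = torch.randn(1, 1280)          # pooled OpenCLIP ViT-bigG embedding
timestep_emb = torch.randn(1, 1280)          # UNet timestep embedding (assumed width)

# Fourier-encode each micro-conditioning, concatenate with the pooled text embedding,
# project to the timestep-embedding width, and add.
micro = torch.cat([fourier_features(c).flatten(1) for c in (c_size, c_crop, c_target)], dim=-1)
cond  = torch.cat([pooled_text, micro], dim=-1)
project = torch.nn.Linear(cond.shape[-1], timestep_emb.shape[-1])    # assumed projection
conditioned_emb = timestep_emb + project(cond)
```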
4.3. Multi-Aspect Training
- Problem: Most text-to-image models are trained and generate only square images (e.g., 1024x1024), which is restrictive and unnatural for many use cases (e.g., landscape photos, portrait shots).
- Solution: After initial pretraining, SDXL is fine-tuned on data "bucketed" by different aspect ratios. These buckets are chosen to keep the total area close to $1024^2$ pixels (e.g., 1216x832, 896x1152, 1536x640).
- Implementation: During training, each batch consists of images from a single aspect ratio bucket. The model is conditioned on the target resolution of the bucket, $c_{\text{ar}} = (h_{\text{tgt}}, w_{\text{tgt}})$, again using Fourier feature embeddings. This allows a single model to generate images in a wide variety of dimensions natively.
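For intuition, a minimal bucketing helper might look like the sketch below; the bucket list and the selection rule are illustrative assumptions, since the paper only specifies that bucket areas stay close to $1024^2$ pixels.

```python
# Sketch (assumed, not from the paper's code) of assigning an image to an
# aspect-ratio bucket whose pixel area stays close to 1024^2.
BUCKETS = [  # (height, width)
    (1024, 1024), (1152, 896), (896, 1152), (1216, 832),
    (832, 1216), (1536, 640), (640, 1536),
]

def nearest_bucket(height: int, width: int) -> tuple[int, int]:
    """Pick the bucket whose aspect ratio best matches the input image."""
    ratio = width / height
    return min(BUCKETS, key=lambda hw: abs(hw[1] / hw[0] - ratio))

print(nearest_bucket(2000, 3000))  # a 2000x3000 landscape photo -> (832, 1216)
```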
4.4. Improved Autoencoder
The autoencoder (VAE) is responsible for encoding images into the latent space and decoding latents back into images. While most of the generation is handled by the UNet, a better VAE can improve fine details and reduce artifacts. SDXL's VAE was trained from scratch with a larger batch size (256 vs. 9 for the original SD VAE) and used an exponential moving average (EMA) of the model's weights, a technique that often leads to better generalization. As shown in Table 3, the resulting SDXL-VAE significantly outperforms the VAEs from SD 1.x and 2.x in standard reconstruction metrics.
(Manual Transcription of Table 3)
| model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ |
|---|---|---|---|---|
| SDXL-VAE | 24.7 | 0.73 | 0.88 | 4.4 |
| SD-VAE 1.x | 23.4 | 0.69 | 0.96 | 5.0 |
| SD-VAE 2.x | 24.5 | 0.71 | 0.92 | 4.7 |
4.5. Putting Everything Together: The Full Pipeline
The final SDXL model is trained in a multi-stage process and used in a two-stage pipeline for best results.
- Training Stages:
- Pretraining: The base model is first trained on 256x256 images using size- and crop-conditioning.
- Higher-Resolution Training: Training continues on 512x512 images.
- Finetuning: The model is finally fine-tuned using multi-aspect training on various resolutions with an area close to $1024^2$ pixels. This stage also uses offset-noise, a technique that improves the model's ability to generate very dark and very bright regions.
- Inference Pipeline (with Refinement):
- Base Model: The base SDXL model generates a latent from a text prompt at the desired resolution (e.g., 1024x1024).
- Refinement Model: A separate, specialized LDM is then used. It takes the latent from the base model, adds a small amount of noise (for the first 200 timesteps), and then denoises it using the same text prompt. This process, similar to SDEdit, does not change the image's composition but adds significant high-frequency detail and polish. The effect of the refiner is shown below, followed by a usage sketch.
(Figure 1: Left, user-preference win rates of SDXL, with and without the refinement model, versus Stable Diffusion 1.5 and 2.1. Right, the two-stage generation pipeline, in which a 128x128 latent from the base model is refined SDEdit-style and decoded by the autoencoder into a 1024x1024 image.)
(Figure: 1024x1024 SDXL renderings of New York City flooded and overgrown with vegetation, without (left) and with (right) the refinement model; red and blue boxes mark zoomed regions highlighting the difference in detail.)
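For readers who want to try the two-stage pipeline, the sketch below follows the Hugging Face diffusers documentation for SDXL; the model identifiers, argument names, and the 0.8 hand-off fraction come from that documentation (not from the paper) and may change across library versions.

```python
import torch
from diffusers import DiffusionPipeline

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share the OpenCLIP ViT-bigG encoder
    vae=base.vae,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a futuristic city overgrown with vegetation, photorealistic"

# The base model handles the first ~80% of the denoising steps and returns latents;
# the refiner finishes the last ~20%, adding high-frequency detail (SDEdit-style).
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
image = refiner(prompt=prompt, image=latents, denoising_start=0.8).images[0]
image.save("sdxl_refined.png")
```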
5. Experimental Setup
- Datasets:
- Pretraining: A large-scale internal dataset was used. Its resolution distribution is shown in Figure 2.
- Ablation Study: ImageNet was used for a controlled experiment to validate the size-conditioning technique.
- User Studies: Prompts were sourced from the PartiPrompts (P2) benchmark, which is designed to test models on challenging compositional and creative prompts.
- Automated Metrics: The COCO dataset caption-image pairs were used for zero-shot evaluation of FID and CLIP scores.
- Evaluation Metrics:
- Human Preference Score: Users are shown images from different models and asked to pick their favorite. The result is reported as a win rate percentage. This is considered the most reliable evaluation method.
- FID (Fréchet Inception Distance): Measures the distance between the distribution of features from real images and generated images, as extracted by a pretrained InceptionV3 network. Lower is better, indicating the generated images are statistically similar to real ones.
- Conceptual Definition: It assesses both the quality and diversity of generated images. A low FID suggests the generated images are both realistic and varied.
- Mathematical Formula: $\text{FID} = \lVert \mu_x - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2\left(\Sigma_x \Sigma_g\right)^{1/2}\right)$
- Symbol Explanation:
- $\mu_x, \mu_g$: Mean of feature vectors for real (x) and generated (g) images.
- $\Sigma_x, \Sigma_g$: Covariance matrices of feature vectors for real and generated images.
- $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of diagonal elements).
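A compact reference implementation of this formula (feature extraction omitted; array sizes are illustrative) could look like:

```python
import numpy as np
from scipy import linalg

def fid(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """FID between two sets of image features, each of shape (num_samples, feature_dim)."""
    mu_x, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_x = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)  # matrix square root
    covmean = covmean.real                                    # drop tiny imaginary parts
    return float(np.sum((mu_x - mu_g) ** 2) + np.trace(sigma_x + sigma_g - 2.0 * covmean))

# Toy example with random features; in practice these would be 2048-dim InceptionV3 features.
print(fid(np.random.randn(1000, 64), np.random.randn(1000, 64)))
```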
- IS (Inception Score): Measures the quality and diversity of generated images. Higher is better.
- Conceptual Definition: It favors images that are clearly classifiable by an Inception network (high quality) and where the distribution of predicted classes is uniform (high diversity).
- Mathematical Formula: $\text{IS} = \exp\left(\mathbb{E}_{x \sim p_G}\left[D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\right]\right)$
- Symbol Explanation:
- $x \sim p_G$: An image sampled from the generator G.
- $p(y \mid x)$: The conditional class distribution given image $x$, predicted by the Inception model.
- $p(y)$: The marginal class distribution, averaged over all generated images.
- $D_{\mathrm{KL}}$: Kullback-Leibler divergence, which measures how much $p(y \mid x)$ differs from $p(y)$.
- CLIP Score: Measures the semantic similarity between a text prompt and the image generated from it. Higher is better.
- Conceptual Definition: It calculates the cosine similarity between the CLIP embeddings of the prompt and the generated image.
- Mathematical Formula: $\text{CLIP score} = \cos(E_I, E_T) = \frac{E_I \cdot E_T}{\lVert E_I \rVert \, \lVert E_T \rVert}$
- Symbol Explanation:
- $E_I$: Image embedding from CLIP's image encoder.
- $E_T$: Text embedding from CLIP's text encoder.
- $\cos(\cdot,\cdot)$: Cosine similarity function.
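The sketch below computes this cosine similarity with the Hugging Face transformers CLIP classes; the checkpoint name, prompt, and image path are placeholders chosen for illustration, not values taken from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("generated_sample.png")          # hypothetical generated image
prompt = "a macro photo of a lego animal on grass"  # hypothetical prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

clip_score = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(f"CLIP score: {clip_score:.3f}")
```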
- VAE Reconstruction Metrics: PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), LPIPS (Learned Perceptual Image Patch Similarity), and rFID (reconstruction FID). These measure how faithfully the VAE can reconstruct an image after encoding and decoding it.
- Baselines:
- Open-Source: Stable Diffusion 1.5, Stable Diffusion 2.1.
- Closed-Source/State-of-the-Art: Midjourney (v5.1 and v5.2), DALL-E 2, DeepFloyd IF, Bing Image Creator.
6. Results & Analysis
- Core Results:
- Superiority over Previous SD Versions: The user study (Figure 1, left) is decisive: SDXL with the refiner was preferred 48.44% of the time, and the SDXL base model was preferred 36.93% of the time. In contrast, SD 1.5 and SD 2.1 were chosen only 7.91% and 6.71% of the time, respectively. This shows an overwhelming user preference for SDXL.
- Qualitative Improvements: Figures 4, 14, and 15 visually demonstrate SDXL's superiority. It generates more detailed, coherent, and aesthetically pleasing images, correctly interprets complex prompts, and avoids the object-cropping artifacts that plagued older models.
(Figure 14: Comparison of SDXL with earlier Stable Diffusion versions (SD 1.5 and SD 2.1) on two subjects, a goat figurine and a Sydney Opera House backdrop; each model shows 3 random samples per prompt.)
(Figure 15: Comparison of SDXL with previous Stable Diffusion versions on the same prompts; left, a black-and-white forest cabin, right, a badger holding a yellow rose; 3 random samples per group, 50 DDIM steps, cfg-scale 8.0.)
- Competitiveness with State-of-the-Art: In a large-scale user study against Midjourney v5.1 using the PartiPrompts benchmark, SDXL was preferred 54.9% of the time overall (Figure 9). When broken down by category (Figures 10 & 11), SDXL outperformed or was statistically equal to Midjourney in most categories testing complex prompt understanding, such as those involving abstract concepts, specific wording, or complex scenes.
(Figure: Horizontal bar chart showing the user-preference split, in percent, for the overall ("Vanilla") comparison.)
(Figure: Per-category user-preference comparison between SDXL (without the refinement model) and Midjourney V5.1; SDXL is preferred in all but two categories.)
(Figure: Preference comparison on complex prompts between SDXL (with the refinement model) and Midjourney V5.1; SDXL performs better than, or not significantly different from, Midjourney V5.1 in 7 of 10 categories.)
- Ablations / Parameter Sensitivity:
- Impact of Size-Conditioning: Table 2 shows that on ImageNet, the model trained with size-conditioning (CIN-size-cond) achieves a significantly better (lower) FID of 36.53 and a higher IS of 215.34, compared to a model that discards small images (CIN-512-only, FID 43.84) or one that uses upscaled images without conditioning (CIN-nocond, FID 39.76). This confirms the effectiveness of the technique.
(Manual Transcription of Table 2)
| model | FID-5k ↓ | IS-5k ↑ |
|---|---|---|
| CIN-512-only | 43.84 | 110.64 |
| CIN-nocond | 39.76 | 211.50 |
| CIN-size-cond | 36.53 | 215.34 |
- Impact of Refinement Stage: Figures 6 and 13 clearly visualize the benefit of the refiner. It adds crisp, high-frequency details to faces, textures, and backgrounds, transforming a good image from the base model into an excellent one without altering its fundamental composition.
(Figure: Code listing showing how SDXL combines the size, crop, and aspect-ratio conditionings with the text-encoder outputs by concatenating their embeddings, including the Fourier-feature computation; the core function is concat_embeddings.)
(Figure 13: An SDXL-generated image without (left) and with (right) the post-hoc refinement model; red, blue, and yellow boxes mark zoomed face and dinner-table regions where the refiner improves sharpness and detail.)
Critique of Automated Metrics (FID/CLIP): Appendix F and Figure 12 deliver a crucial insight. Despite being vastly superior in human evaluations, SDXL's FID scores on the COCO dataset are worse than those of SD 1.5 and SD 2.1. Its CLIP scores are only marginally better. This finding strongly supports the growing consensus that automated metrics like FID are not well-correlated with human perception of image quality and aesthetics for modern foundation models.
(Figure 12: Scatter/line plots of FID versus CLIP ViT-g/14 score for sd1-5, sd2-1, and sdxl under different guidance weights, at 40 and 50 sampling steps.)
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents SDXL, a major leap forward for open-source text-to-image generation. By combining a scaled-up architecture with intelligent conditioning on text and image metadata (size, crop, aspect ratio), and adding a post-hoc refinement stage, SDXL achieves performance that is competitive with top-tier closed-source models. The release of the model and its weights is a significant contribution to the research community, promoting transparency and enabling further innovation.
- Limitations & Future Work:
- Authors' Acknowledged Limitations (Appendix B):
- Complex Structures: The model still struggles with fine-grained, complex structures like human hands (see Figure 7).
- Photorealism: It achieves high realism but not perfect photorealism; subtle lighting and textures can be missed.
- Bias: As it is trained on large, uncurated datasets, it can reproduce and amplify social biases.
- Concept Bleeding: Attributes of one object in a prompt can sometimes "bleed" onto another (e.g., "a red hat and blue gloves" might produce a blue hat and red gloves).
- Text Rendering: While improved, rendering long, legible text remains a challenge.
(Figure 7: Failure cases generated by SDXL. Despite overall improvements, the model still struggles with complex spatial descriptions and fine details, e.g., malformed hands and concept bleeding.)
- Future Work (Section 3):
- Single-Stage Model: Develop a single model that achieves the quality of the base+refiner pipeline to reduce memory usage and complexity.
- Improved Text Synthesis: Incorporate techniques like byte-level tokenizers to improve text rendering.
- Architecture Exploration: Further investigate pure Transformer-based architectures like DiT.
- Distillation: Use knowledge distillation to create smaller, faster versions of SDXL for more accessible deployment (e.g., on mobile devices).
- Improved Training Framework: Adopt more modern diffusion frameworks like EDM (Elucidating the Design Space of Diffusion-Based Generative Models) for better sampling flexibility.
- Personal Insights & Critique:
- Impact of Openness: The decision to release SDXL as an open model is its most impactful contribution. It has fueled a massive wave of innovation in the open-source AI community, enabling countless applications, research projects, and artistic endeavors that would be impossible with closed-source models.
- Pragmatic Engineering: The success of SDXL is a testament to smart, pragmatic engineering. The size- and crop-conditioning techniques are not theoretically profound but are highly effective solutions to real-world data-handling problems. This focus on practical improvements is a key lesson.
- The Evaluation Dilemma: The paper's candid discussion on the failure of FID scores is a critical contribution to the field. It underscores the urgent need for better, more human-aligned automated metrics for evaluating generative models, as reliance on human preference studies is expensive and slow.
- Two-Stage Pipeline Trade-off: The base+refiner pipeline is a clever way to achieve top-tier quality but comes with a significant cost in terms of memory (VRAM) and inference speed. It represents a trade-off between peak performance and accessibility, one that the authors rightly identify as a priority for future work.