
Kandinsky 3.0 Technical Report

Published: 12/06/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Kandinsky 3.0, a latent diffusion text-to-image model, advances image realism and text understanding with efficient architecture, super-resolution, editing, and video generation, plus a 20× faster distilled version, outperforming competitors in human evaluations.

Abstract

We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. In this report we describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. We also describe extensions and applications of our model, including super resolution, inpainting, image editing, image-to-video generation, and a distilled version of Kandinsky 3.0 - Kandinsky 3.1, which does inference in 4 steps of the reverse process and 20 times faster without visual quality decrease. By side-by-side human preferences comparison, Kandinsky becomes better in text understanding and works better on specific domains. The code is available at https://github.com/ai-forever/Kandinsky-3

In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: Kandinsky 3.0 Technical Report
  • Authors: Vladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev, Anastasia Maltseva, Said Azizov, Igor Pavlov, Julia Agafonova, Andrey Kuznetsov, Denis Dimitrov.
  • Affiliations: The authors are from Sber AI and the Artificial Intelligence Research Institute (AIRI).
  • Journal/Conference: This paper is a technical report published on arXiv, a preprint server. This means it has not undergone formal peer review for a conference or journal at the time of publication but serves as a way to quickly disseminate research findings.
  • Publication Year: 2023 (submitted in December).
  • Abstract: The authors introduce Kandinsky 3.0, a large-scale latent diffusion model for text-to-image generation. It aims for higher image quality, realism, and better text understanding compared to previous models in the Kandinsky series. The report details the model's architecture, data collection, training methods, and user interaction systems. Key architectural components that significantly improved quality are highlighted. The paper also covers extensions like a 20x faster distilled version (Kandinsky 3.1), super-resolution, inpainting, image editing, and image-to-video capabilities. Human preference studies show that Kandinsky 3.0 surpasses its predecessors and competitors in text understanding and performance on specific domains.

2. Executive Summary

Background & Motivation (Why)

The field of text-to-image generation has advanced rapidly, with models now capable of producing highly realistic and complex images from textual descriptions. However, challenges remain in achieving perfect alignment between the generated image and the user's prompt (text understanding) and in pushing the boundaries of photorealism.

This paper introduces Kandinsky 3.0, the next iteration in a series of models developed by Sber AI. The primary motivation was to build upon their previous work (Kandinsky 2.x series) to significantly improve image quality and text comprehension. The authors identified a need to simplify the architecture of prior Kandinsky models, which used a more complex two-stage pipeline, and to leverage recent advancements in large language models (LLMs) for better text encoding. By making the model and its code publicly available, the authors also aim to foster openness and further innovation in the research community.

Main Contributions / Findings (What)

The paper makes several key contributions:

  1. A Novel Large-Scale Model Architecture: Kandinsky 3.0 is an 11.9 billion parameter latent diffusion model featuring three core innovations:

    • A massive 8.6B parameter text encoder (Flan-UL2) for superior text understanding.
    • A deep 3.0B parameter U-Net based on modified BigGAN-deep blocks, enhancing its generative capacity.
    • A custom high-fidelity 270M parameter image decoder (Sber-MoVQGAN) for state-of-the-art image reconstruction.
  2. Extensive Model Extensions and Applications: The report details several powerful extensions built upon the base model:

    • Kandinsky 3.1: A distilled version of Kandinsky 3.0 that uses Adversarial Diffusion Distillation to achieve a 20x speedup in inference (generating an image in just 4 steps) with negligible loss in visual quality.
    • Kandinsky SuperRes: A specialized model for upscaling images to 4K resolution, which outperforms existing methods like Real-ESRGAN and Stable Diffusion's upscaler.
    • Practical Applications: The paper demonstrates how the model can be used for inpainting, outpainting, image editing (via an IP-Adapter), and both image-to-video and text-to-video generation.
  3. Comprehensive Human Evaluation: The authors conducted extensive side-by-side comparisons against leading models (Kandinsky 2.2, SDXL, DALL-E 3). The results show Kandinsky 3.0 is significantly preferred over its predecessor and SDXL in both visual quality and text alignment. While DALL-E 3 still leads in text comprehension, Kandinsky 3.0 is competitive in visual quality.

  4. Open-Source Release: The code and model weights for Kandinsky 3.0 and its extensions have been made publicly available, promoting transparency and further research.

3. Prerequisite Knowledge & Related Work

Foundational Concepts

To understand this paper, one should be familiar with the following concepts:

  • Text-to-Image Generation: The task of creating an image from a textual description (a "prompt").
  • Diffusion Models: A class of generative models inspired by thermodynamics. They work in two steps: a forward process that gradually adds noise to an image until it becomes pure noise, and a reverse process where a neural network learns to progressively remove the noise to generate a clean image. The model is often conditioned on extra information, like text.
  • Latent Diffusion Models (LDMs): A highly efficient variant of diffusion models. Instead of running the computationally expensive diffusion process on high-resolution images, LDMs first compress the image into a smaller, lower-dimensional "latent space" using an autoencoder (like a VQGAN). The diffusion process happens in this latent space, and the final latent representation is then upscaled back to a full image by the decoder. Kandinsky 3.0 is an LDM.
  • U-Net: A neural network architecture with an encoder-decoder structure. The encoder downsamples the input (e.g., a noisy latent image), and the decoder upsamples it back to the original size. A key feature is skip connections, which connect layers from the encoder to corresponding layers in the decoder, helping preserve fine-grained details during reconstruction. It is the standard architecture for the denoising network in diffusion models.
  • Text Encoder: A model that converts a text prompt into a numerical vector (an "embedding"). This embedding contains the semantic meaning of the text and is used to guide the diffusion model's generation process. Early models used encoders like CLIP, but Kandinsky 3.0 uses a much larger and more powerful language model, Flan-UL2.
  • Cross-Attention: A mechanism within the U-Net that allows the model to "pay attention" to different parts of the text embedding at different stages of image generation. This is how the text prompt conditions or guides the image creation process, ensuring the output aligns with the prompt. A minimal sketch appears after this list.
  • VQGAN (Vector Quantized Generative Adversarial Network): An autoencoder that learns to compress an image into a grid of discrete codes from a learned "codebook." The MoVQGAN used in this paper is an advanced version that improves reconstruction quality.
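
To make the cross-attention mechanism described above concrete, here is a minimal single-head sketch in PyTorch. The dimensions, the residual wiring, and the class name are illustrative assumptions, not Kandinsky 3.0's actual layer configuration.

```python
import torch
from torch import nn

class TextCrossAttention(nn.Module):
    """Sketch of cross-attention: image-latent tokens query the text embeddings.
    The single-head setup and all sizes are simplifications for illustration."""

    def __init__(self, latent_dim: int, text_dim: int, attn_dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, attn_dim)   # queries come from the image latents
        self.to_k = nn.Linear(text_dim, attn_dim)     # keys come from the text embeddings
        self.to_v = nn.Linear(text_dim, attn_dim)     # values come from the text embeddings
        self.to_out = nn.Linear(attn_dim, latent_dim)

    def forward(self, latents: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(latents), self.to_k(text_emb), self.to_v(text_emb)
        attn = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
        return latents + self.to_out(attn @ v)  # residual update conditioned on the prompt

# Example: 1024 latent tokens (e.g., a 32x32 grid) attend to 77 text tokens.
latents = torch.randn(1, 1024, 320)
text_emb = torch.randn(1, 77, 4096)
print(TextCrossAttention(320, 4096)(latents, text_emb).shape)  # torch.Size([1, 1024, 320])
```

In practice such layers use multi-head attention; a single head is kept here for clarity.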

Technological Evolution & Differentiation

The paper positions Kandinsky 3.0 within the lineage of powerful diffusion models like GLIDE, DALL-E 2, Imagen, and Stable Diffusion (SDXL).

  • Previous Kandinsky Models (2.x): The earlier versions used a two-stage pipeline. First, a Diffusion Mapping model translated a text embedding into an image embedding. Then, a separate latent diffusion model generated the image from that image embedding. Kandinsky 3.0 simplifies this by adopting a single-stage latent diffusion architecture, similar to Stable Diffusion, where the text embedding directly conditions the U-Net.
  • Differentiation from Competitors (SDXL, DALL-E 3):
    • Text Encoder: Kandinsky 3.0's most significant differentiator is its use of the massive 8.6B parameter Flan-UL2 as its text encoder. In contrast, SDXL uses a combination of two smaller CLIP encoders (totaling ~0.8B parameters). The hypothesis is that a more powerful language model provides a richer, more nuanced understanding of the text prompt.
    • U-Net Architecture: While most models use a U-Net, Kandinsky 3.0's is specifically built with modified BigGAN-deep residual blocks, a choice aimed at increasing network depth efficiently for better generative performance.
    • Image Decoder: The paper emphasizes its custom Sber-MoVQGAN, which is shown to achieve state-of-the-art image reconstruction quality, contributing to the final image's clarity and detail.

4. Methodology (Core Technology & Implementation)

Kandinsky 3.0's architecture consists of three main components, some of which are frozen during diffusion training while others are trained, as shown in Image 1.

Figure (Image 1): The overall pipeline architecture of Kandinsky 3.0, which consists of a text encoder, a latent conditional diffusion model, and an image decoder. The input text is processed by the FLAN-UL2 text encoder, passed through the latent-diffusion deep U-Net, and decoded into an image by the Sber-MoVQGAN decoder, as in the example of a corgi in a sushi house.

The following table, transcribed from the paper's Table 1, compares the parameter counts of Kandinsky 3.0 with its predecessor and SDXL. Kandinsky 3.0 is substantially larger, primarily due to its text encoder.

| | Kandinsky 2.2 [11] | SDXL [9] | Kandinsky 3.0 |
| --- | --- | --- | --- |
| Model type | Latent Diffusion | Latent Diffusion | Latent Diffusion |
| Total parameters | 4.6B | 3.33B | 11.9B |
| Text encoder | 0.62B | 0.8B | 8.6B |
| Diffusion Mapping [11] | 1.0B | | |
| Denoising U-Net | 1.2B | 2.5B | 3.0B |
| Image decoder | 0.08B | 0.08B | 0.27B |

This table was manually transcribed from the source paper.

1. Text Encoder: Flan-UL2

Kandinsky 3.0 uses the encoder part of the Flan-UL2 20B model, which itself has 8.6 billion parameters. Flan-UL2 is a powerful language model pre-trained on a vast array of language tasks using an instruction-tuning methodology called "Flan Prompting." The authors found that this extensive language understanding capability significantly improves the model's ability to generate images that accurately reflect complex prompts. During the training of the diffusion model, the text encoder is frozen, meaning its weights are not updated.
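
As a rough illustration of how a frozen Flan-UL2 encoder can supply prompt embeddings, here is a minimal sketch using the Hugging Face `google/flan-ul2` checkpoint; the tokenization settings (such as the 128-token length) and the dtype choice are assumptions for illustration, not the paper's configuration.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load only the encoder half of Flan-UL2 (~8.6B parameters) and freeze it,
# mirroring how the text encoder stays fixed during diffusion training.
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
text_encoder = T5EncoderModel.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16)
text_encoder.requires_grad_(False)
text_encoder.eval()

prompt = "a corgi in a sushi house, photorealistic"
tokens = tokenizer(prompt, return_tensors="pt", padding="max_length",
                   max_length=128, truncation=True)
with torch.no_grad():
    # (batch, seq_len, hidden) embeddings that later condition the U-Net via cross-attention
    text_embeddings = text_encoder(**tokens).last_hidden_state
```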

2. Denoising U-Net Architecture

The core of the generative process is a 3.0B parameter U-Net, which is trained to predict and remove noise from the latent representation. Its architecture, shown in Image 12, has several key features:

Figure: A sample high-resolution generation showing two slices of orange layer cake on a blue plate against dark-blue fabric.

  • Modified BigGAN-deep Blocks: Instead of standard residual blocks, the U-Net is built from blocks inspired by the BigGAN-deep architecture. These blocks use a bottleneck design: a 1×1 convolution reduces the number of channels, followed by a standard 3×3 convolution, and another 1×1 convolution restores the channel count (see the sketch after this list). This allows for a deeper network (more layers) without a proportional explosion in parameters, which the authors found improves training.
  • Architectural Modifications: The authors modified the original BigGAN-deep blocks in three ways:
    1. They use Group Normalization instead of Batch Normalization, as it is more stable for this type of training.
    2. They use the SiLU (Sigmoid-weighted Linear Unit) activation function instead of ReLU.
    3. They implement standard skip connections for better feature flow.
  • Hybrid Attention: The U-Net uses a combination of convolutional blocks and transformer layers. At higher resolutions (less compressed latents), it relies on convolutional blocks. At lower resolutions (more compressed latents), the features are also fed into transformer layers containing cross-attention (to incorporate text information) and self-attention (to model global relationships between different parts of the image).
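
Below is a minimal PyTorch sketch of a bottleneck residual block with the modifications listed above (GroupNorm and SiLU). The channel ratio, group count, and class name are illustrative assumptions rather than the paper's exact block definition.

```python
import torch
from torch import nn

class BottleneckResBlock(nn.Module):
    """BigGAN-deep-style bottleneck residual block with the stated modifications
    (GroupNorm instead of BatchNorm, SiLU instead of ReLU). Ratios are guesses."""

    def __init__(self, channels: int, bottleneck_ratio: int = 4, groups: int = 32):
        super().__init__()
        hidden = channels // bottleneck_ratio
        self.block = nn.Sequential(
            nn.GroupNorm(groups, channels),
            nn.SiLU(),
            nn.Conv2d(channels, hidden, kernel_size=1),            # 1x1: reduce channels
            nn.GroupNorm(min(groups, hidden), hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),   # 3x3: spatial mixing
            nn.GroupNorm(min(groups, hidden), hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),            # 1x1: restore channels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard skip connection around the bottleneck path.
        return x + self.block(x)

# Example: a 64x64 latent feature map with 512 channels keeps its shape.
x = torch.randn(1, 512, 64, 64)
print(BottleneckResBlock(512)(x).shape)  # torch.Size([1, 512, 64, 64])
```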

3. Image Decoder: Sber-MoVQGAN

To transform the final denoised latent back into a pixel-based image, Kandinsky 3.0 uses a custom-trained 270M parameter Sber-MoVQGAN. This is an advanced version of a VQGAN autoencoder.

  • Key Innovation (Mo): The "Mo" stands for Modulating. It introduces spatially conditional normalization, which works similarly to Adaptive Instance Normalization (AdaIN) in StyleGAN. The quantized latent codes from the codebook modulate the feature maps in the decoder. This is described by the formula:

    $$F^{i} = \phi_{\gamma}(z_q)\,\frac{F^{i-1} - \mu(F^{i-1})}{\sigma(F^{i-1})} + \phi_{\beta}(z_q)$$

    • $F^{i-1}$ is the feature map from the previous layer of the decoder.
    • $\mu(F^{i-1})$ and $\sigma(F^{i-1})$ are its mean and standard deviation; the term $\frac{F^{i-1} - \mu(F^{i-1})}{\sigma(F^{i-1})}$ normalizes the feature map.
    • $z_q$ is the quantized latent code from the codebook.
    • $\phi_{\gamma}(z_q)$ and $\phi_{\beta}(z_q)$ are learnable transformations that convert $z_q$ into scaling and shifting factors, respectively. This lets the decoder use the high-level information in $z_q$ to finely control the style and features of the reconstructed image. A minimal sketch of this modulated normalization appears after the table below.
  • Performance: The authors trained several versions of Sber-MoVQGAN and show in Table 2 (transcribed below) that their 270M version achieves state-of-the-art results in image reconstruction on the ImageNet dataset, outperforming other popular autoencoders. This high-quality decoder is crucial for generating sharp and detailed final images.

| Model | Latent size | Num Z | Train steps | FID ↓ | SSIM ↑ | PSNR ↑ | L1 ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-VQGAN [31] | 32x32 | 8192 | 500,000 | 1.28 | | | |
| RQ-VAE [33] | 8x8x16 | 16384 | 10 epochs | 1.83 | | | |
| Mo-VQGAN [29] | 16x16x4 | 1024 | 40 epochs | 1.12 | 0.673 | 22.42 | |
| VQ CompVis [34] | 32x32 | 16384 | 971,043 | 1.34 | 0.650 | 23.85 | 0.0533 |
| KL CompVis [34] | 32x32 | | 246,803 | 0.968 | 0.692 | 25.11 | 0.0474 |
| Sber-VQGAN | 32x32 | 8192 | 1 epoch | 1.44 | 0.682 | 24.31 | 0.0503 |
| Sber-MoVQGAN 67M | 32x32 | 1024 | 5,000,000 | 1.34 | 0.704 | 25.68 | 0.0451 |
| Sber-MoVQGAN 67M | 32x32 | 16384 | 2,000,000 | 0.965 | 0.725 | 26.45 | 0.0415 |
| Sber-MoVQGAN 102M | 32x32 | 16384 | 2,360,000 | 0.776 | 0.737 | 26.89 | 0.0398 |
| Sber-MoVQGAN 270M | 32x32 | 16384 | 1,330,000 | 0.686 | 0.741 | 27.04 | 0.0393 |

This table was manually transcribed from the source paper's Table 2.
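
The following is a minimal sketch of the spatially conditional (modulated) normalization formula given above. How $\phi_{\gamma}$ and $\phi_{\beta}$ are parameterized (here as 1×1 convolutions) and how $z_q$ is resized to the feature map are assumptions for illustration, not the exact Sber-MoVQGAN implementation.

```python
import torch
from torch import nn

class SpatiallyConditionalNorm(nn.Module):
    """Sketch of F^i = phi_gamma(z_q) * norm(F^{i-1}) + phi_beta(z_q).
    The 1x1-convolution parameterization of phi_gamma / phi_beta is assumed."""

    def __init__(self, feat_channels: int, code_channels: int):
        super().__init__()
        self.to_gamma = nn.Conv2d(code_channels, feat_channels, kernel_size=1)
        self.to_beta = nn.Conv2d(code_channels, feat_channels, kernel_size=1)

    def forward(self, f_prev: torch.Tensor, z_q: torch.Tensor) -> torch.Tensor:
        # Normalize the incoming feature map per channel (instance-norm style statistics).
        mu = f_prev.mean(dim=(2, 3), keepdim=True)
        sigma = f_prev.std(dim=(2, 3), keepdim=True) + 1e-6
        normalized = (f_prev - mu) / sigma
        # Resize the quantized code to the feature map's spatial size, then modulate.
        z_q = nn.functional.interpolate(z_q, size=f_prev.shape[-2:], mode="nearest")
        return self.to_gamma(z_q) * normalized + self.to_beta(z_q)

# Example: modulate a 128-channel decoder feature map with a 4-channel quantized latent.
f = torch.randn(1, 128, 64, 64)
z = torch.randn(1, 4, 32, 32)
print(SpatiallyConditionalNorm(128, 4)(f, z).shape)  # torch.Size([1, 128, 64, 64])
```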

4. Data and Training Strategy

The model was trained on a massive internal dataset of 1.5 billion text-image pairs. This dataset was curated from various sources, including Common Crawl and other "dirty" internet data, and filtered for high aesthetic quality.

  • Culturally Specific Data: To improve performance on Russian-related content, the team collected and labeled a custom dataset of 100,000 images of Russian cartoons, famous people, and landmarks.
  • Multi-Stage Training: Training was conducted in five stages, progressively increasing the image resolution. This "curriculum learning" approach helps the model first learn coarse features and then refine fine details, leading to more stable training and better final quality.
    1. 256x256: 1.1B pairs, 600k steps
    2. 384x384: 768M pairs, 500k steps
    3. 512x512: 450M pairs, 400k steps
    4. 768x768: 224M pairs, 250k steps
    5. Mixed High-Resolution (up to 1024x1024): 280M pairs, 350k steps
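
For reference, the five-stage resolution curriculum above can be written out as a simple schedule; the dictionary structure below is purely illustrative and not taken from the paper's training code.

```python
# Illustrative representation of the five-stage resolution curriculum reported above.
training_schedule = [
    {"stage": 1, "resolution": "256x256", "pairs": "1.1B", "steps": 600_000},
    {"stage": 2, "resolution": "384x384", "pairs": "768M", "steps": 500_000},
    {"stage": 3, "resolution": "512x512", "pairs": "450M", "steps": 400_000},
    {"stage": 4, "resolution": "768x768", "pairs": "224M", "steps": 250_000},
    {"stage": 5, "resolution": "mixed, up to 1024x1024", "pairs": "280M", "steps": 350_000},
]

for stage in training_schedule:
    print(f"Stage {stage['stage']}: {stage['resolution']}, "
          f"{stage['pairs']} pairs, {stage['steps']:,} steps")
```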

5. Experimental Setup

Datasets

  • Human Evaluation: A custom, balanced set of 2,100 prompts across 11 categories (e.g., characters, objects, landscapes, art styles) was created for side-by-side (SBS) human evaluation.
  • Super-Resolution Evaluation: Standard academic datasets were used:
    • RealSR (V3) and Set14: Commonly used benchmarks for super-resolution.
    • Wikidata 5k: A custom dataset of 5,000 images from Wikipedia.

Evaluation Metrics

The paper relies heavily on human evaluation but also uses standard quantitative metrics for specific tasks.

  • Human Preference Rate: For text-to-image, users were shown two images generated by different models for the same prompt and asked to choose the best one based on two criteria:
    1. Alignment: How well the image matches the text prompt.
    2. Visual Quality: The aesthetic appeal, realism, and coherence of the image.
  • Quantitative Metrics (for Super-Resolution):
    1. FID (Fréchet Inception Distance): Measures the perceptual quality and realism of generated images by comparing the distribution of deep features from an InceptionV3 network with that of real images. A lower FID indicates higher similarity and better quality.
      • Formula: $d^2 = \|\mu_x - \mu_y\|^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_y - 2(\Sigma_x \Sigma_y)^{1/2}\big)$
      • Symbols: $\mu_x$ and $\mu_y$ are the means of the feature vectors for real and generated images, $\Sigma_x$ and $\Sigma_y$ are their covariance matrices, and $\mathrm{Tr}$ is the trace of a matrix.
    2. PSNR (Peak Signal-to-Noise Ratio): Measures the quality of image reconstruction by comparing the maximum possible pixel value to the mean squared error between the original and reconstructed images. Higher is better.
      • Formula: $\mathrm{PSNR} = 20 \cdot \log_{10}(MAX_I) - 10 \cdot \log_{10}(\mathrm{MSE})$
      • Symbols: $MAX_I$ is the maximum pixel value (e.g., 255), and $\mathrm{MSE}$ is the mean squared error between the images.
    3. SSIM (Structural Similarity Index Measure): Measures the perceptual difference between two images based on luminance, contrast, and structure. Values range from -1 to 1, with 1 indicating a perfect match. Higher is better.
    4. L1 Loss (Mean Absolute Error): Calculates the average absolute difference between corresponding pixels in two images. It is a direct measure of reconstruction error. Lower is better.
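
As a small worked example of the pixel-level metrics from this list, the sketch below computes PSNR and L1 with NumPy; FID and SSIM need a deep feature extractor and a structural-similarity implementation, respectively, and are omitted here.

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_value: float = 255.0) -> float:
    """PSNR = 20*log10(MAX_I) - 10*log10(MSE), as in the formula above."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float(20 * np.log10(max_value) - 10 * np.log10(mse))

def l1_error(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Mean absolute error between corresponding pixels (images scaled to [0, 1])."""
    return float(np.mean(np.abs(original / 255.0 - reconstructed / 255.0)))

# Example on random 8-bit "images"; a real evaluation would use aligned image pairs.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(256, 256, 3))
b = np.clip(a + rng.integers(-5, 6, size=a.shape), 0, 255)
print(psnr(a, b), l1_error(a, b))
```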

Baselines

  • Text-to-Image: Kandinsky 2.2, SDXL, DALL-E 3.
  • Super-Resolution: Real-ESRGAN, Stable Diffusion x4 Upscaler.

6. Results & Analysis

Core Results: Human Evaluation

The human preference comparisons reveal Kandinsky 3.0's strengths and weaknesses.

  • Kandinsky 3.0 vs. Kandinsky 2.2 (Images 40-42): Kandinsky 3.0 is overwhelmingly preferred, winning in ~60-70% of comparisons for both text comprehension and visual quality. This demonstrates a massive improvement over the previous generation.
  • Kandinsky 3.0 vs. SDXL (Images 47-49): Kandinsky 3.0 shows a clear advantage, being preferred in over 50% of comparisons for both text and visuals, while SDXL wins in only ~20-25% of cases. This suggests its architectural choices, particularly the powerful text encoder, yield better results than a leading open-source competitor.
  • Kandinsky 3.0 vs. DALL-E 3 (Images 43-44, 46): DALL-E 3 wins overall, especially in text comprehension (preferred in ~48% of cases vs. K3.0's ~30%). However, for visual quality, the gap narrows significantly, with DALL-E 3 only slightly ahead and a large percentage of votes for "Both" or "No one," indicating comparable aesthetic quality in many cases.

Analysis of Extensions

  • Kandinsky 3.1 (Distilled Version):

    • Method: The authors used Adversarial Diffusion Distillation. A student model (Kandinsky 3.1) is trained to reproduce the teacher's (Kandinsky 3.0) outputs in only a few reverse-diffusion steps. A discriminator, constructed from the frozen downsampling blocks of the original U-Net (Image 23), provides an adversarial loss that keeps quality high. As a result, the distilled model generates images in just 4 steps, roughly a 20x speedup over the typical 50-100 steps.

    • Results (Images 50-52): Human evaluation shows the full Kandinsky 3.0 is still preferred, but Kandinsky 3.1 is highly competitive. The quality drop is surprisingly small for such a massive speedup, making it a very practical model for real-world applications.

      Figure: A sample generation showing a colorful hot-air balloon floating over open fields and a village at sunrise or sunset.

  • Kandinsky SuperRes:

    • Method: This model operates in pixel space (not latent space) and uses an Efficient U-Net, which is more memory-friendly for high resolutions. It is trained to predict the original image ($x_0$) directly, which helps preserve color fidelity. For 4K generation, the MultiDiffusion algorithm is used to stitch together overlapping patches.

    • Results (Images 34, 45): Quantitative metrics in Table 3 (transcribed below) show that Kandinsky SuperRes significantly outperforms Real-ESRGAN and Stable Diffusion's upscaler across all metrics (FID, PSNR, SSIM, L1) on three different test datasets. Visual examples confirm its ability to generate crisp, detailed high-resolution images.

| Dataset | Model | FID ↓ | PSNR ↑ | SSIM ↑ | L1 ↓ |
| --- | --- | --- | --- | --- | --- |
| Wikidata 5k | Real-ESRGAN | 9.96 | 24.48 | 0.73 | 0.0428 |
| Wikidata 5k | Stable Diffusion | 3.04 | 25.05 | 0.67 | 0.0435 |
| Wikidata 5k | Kandinsky SuperRes | 0.89 | 28.52 | 0.81 | 0.0257 |
| RealSR (V3) [42] | Real-ESRGAN | 73.26 | 23.12 | 0.72 | 0.0610 |
| RealSR (V3) [42] | Stable Diffusion | 47.79 | 24.85 | 0.67 | 0.0493 |
| RealSR (V3) [42] | Kandinsky SuperRes | 47.37 | 25.05 | 0.75 | 0.0462 |
| Set14 [43] | Real-ESRGAN | 115.94 | 22.88 | 0.62 | 0.0561 |
| Set14 [43] | Stable Diffusion | 76.32 | 23.60 | 0.57 | 0.0520 |
| Set14 [43] | Kandinsky SuperRes | 61.00 | 25.70 | 0.70 | 0.0390 |

    This table was manually transcribed from the source paper's Table 3.

    Figure: A super-resolution example showing a lamb standing on green grass with a blurred lawn in the background.

  • Prompt Beautification (Images 19-21): The authors used a fine-tuned Mistral-7B LLM to automatically expand simple user prompts into more descriptive ones. Human evaluations (Images 53-59) show this technique provides a major quality boost for both Kandinsky 3.0 and 3.1. The improvement is especially pronounced for the distilled model (Kandinsky 3.1), suggesting that better prompts can help compensate for a slightly weaker text understanding capability. A rough sketch of this idea appears after this list.

  • Other Applications: The paper demonstrates the model's versatility with examples of:

    • Inpainting/Outpainting: Editing parts of an image or extending its boundaries (Image 56).
    • Image Editing with IP-Adapter: Generating variations of an input image or combining its style with a text prompt (Images 30-36).
    • Image-to-Video: Animating a static image based on a depth map and camera movement instructions (Image 60).
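
To illustrate the prompt-beautification idea mentioned above, here is a rough sketch that asks an instruction-tuned LLM to expand a short prompt. The paper fine-tunes Mistral-7B for this purpose; the generic `mistralai/Mistral-7B-Instruct-v0.2` checkpoint, the instruction text, and the sampling settings below are stand-in assumptions, not the authors' pipeline.

```python
from transformers import pipeline

# Stand-in prompt "beautification": a generic instruction-tuned checkpoint is used
# here purely as an illustrative substitute for the paper's fine-tuned Mistral-7B.
beautifier = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def beautify(prompt: str) -> str:
    instruction = (
        "Rewrite the following text-to-image prompt so it is more detailed and descriptive, "
        "adding style, lighting, and composition cues. Return only the rewritten prompt.\n\n"
        f"Prompt: {prompt}"
    )
    outputs = beautifier(instruction, max_new_tokens=120, do_sample=True, temperature=0.7)
    # Note: the raw output of a text-generation pipeline includes the instruction text itself.
    return outputs[0]["generated_text"]

print(beautify("a corgi in a sushi house"))
```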

7. Conclusion & Reflections

Conclusion Summary

The "Kandinsky 3.0 Technical Report" presents a significant advancement in open-source text-to-image generation. By integrating a massive language model (Flan-UL2) for text encoding, a deep U-Net, and a high-fidelity decoder, Kandinsky 3.0 achieves substantial improvements in both text alignment and visual quality over its predecessor and strong competitors like SDXL. The paper also introduces a suite of powerful and practical extensions, including a highly efficient distilled version (Kandinsky 3.1) and a state-of-the-art super-resolution model, all of which are released publicly.

Limitations & Future Work

The authors acknowledge that despite its strengths, the model is not perfect. Key areas for future improvement include:

  • Semantic Coherence: Further improving the alignment between complex prompts and the generated image.
  • Fine-Grained Control: Enhancing control over specific image attributes like object positioning, camera angles, and lighting.
  • Ethical Considerations: While content filters are in place, malicious or biased outputs remain a concern, requiring ongoing development of safety mechanisms.

Personal Insights & Critique

  • The Power of the Text Encoder: The paper's results strongly validate the hypothesis that a larger, more capable text encoder is a critical driver of performance in text-to-image models. Kandinsky 3.0's success over SDXL, despite both being latent diffusion models, can be largely attributed to its 8.6B parameter Flan-UL2 encoder versus SDXL's smaller CLIP encoders. This aligns with the trend seen in closed-source models like DALL-E 3, which leverages GPT-4 for prompt enrichment.
  • A Complete Ecosystem: This report is more than just a model release; it presents a full ecosystem. The inclusion of a distilled model (Kandinsky 3.1), a super-resolution model, and various application pipelines (editing, video) makes the Kandinsky 3.0 suite highly versatile and practical for both researchers and end-users. The distilled version, in particular, is a crucial contribution that makes this large model accessible to those with limited computational resources.
  • Openness as a Driver of Progress: By releasing the models and code, the authors provide a powerful open-source alternative to closed-source giants like Midjourney and DALL-E 3. This enables the community to build upon their work, verify their findings, and explore new applications, accelerating progress in the field.
  • Evaluation Methodology: The heavy reliance on human evaluation is appropriate for this domain, as automated metrics often fail to capture the nuances of aesthetic quality and text alignment. The detailed breakdown of preferences across different categories provides valuable insights into the model's specific strengths and weaknesses.
