Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
TL;DR Summary
Taiyi-Diffusion-XL extends CLIP and Stable-Diffusion-XL via bilingual continuous pretraining, expanding Chinese vocabulary and enriching prompts with large vision-language models, significantly improving bilingual text-to-image generation quality and image-text retrieval.
Abstract
Recent advancements in text-to-image models have significantly enhanced image generation capabilities, yet a notable gap persists in open-source models for bilingual or Chinese language support. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model developed by extending the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training. This approach includes the efficient expansion of vocabulary by integrating the most frequently used Chinese characters into CLIP's tokenizer and embedding layers, coupled with an absolute position encoding expansion. Additionally, we enrich text prompts with large vision-language models, leading to better image captions and higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval. Furthermore, the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass previous models. This research leads to the development and open-sourcing of the Taiyi-Diffusion-XL model, representing a notable advancement in the field of image generation, particularly for Chinese language applications. This contribution is a step forward in addressing the need for more diverse language support in multimodal research. The model and demonstration are made publicly available at https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/, fostering further research and collaboration in this domain.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
1.2. Authors
Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, Yan Song. The authors are affiliated with the International Digital Economy Academy, South China University of Technology, and the University of Science and Technology of China. Their research backgrounds appear to be in areas related to artificial intelligence, natural language processing, computer vision, and multimodal learning, particularly with a focus on large models and their applications.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server for electronic preprints of scientific papers. As arXiv is a preprint server, it means the paper has not yet undergone or completed peer review for formal publication in a journal or conference. However, arXiv is widely recognized and used in the research community for rapid dissemination of findings, making it influential, especially in fast-moving fields like AI.
1.4. Publication Year
2024 (published on arXiv on 2024-01-26, UTC).
1.5. Abstract
The paper introduces Taiyi-Diffusion-XL, a novel bilingual (Chinese and English) text-to-image (T2I) model designed to bridge the gap in open-source models supporting Chinese language. The core methodology involves extending the capabilities of CLIP and Stable-Diffusion-XL through a process called bilingual continuous pre-training. This includes efficiently expanding CLIP's vocabulary by integrating frequently used Chinese characters into its tokenizer and embedding layers, alongside an absolute position encoding expansion. Furthermore, the model enriches text prompts using a large vision-language model (LVLM) to generate better image captions and achieve higher visual quality. Empirical results demonstrate that the developed CLIP model excels in bilingual image-text retrieval, and Taiyi-Diffusion-XL outperforms previous models in bilingual image generation. The research contributes to the development and open-sourcing of this model, marking a significant advancement in image generation, particularly for Chinese language applications, and fostering broader language support in multimodal research.
1.6. Original Source Link
https://arxiv.org/abs/2401.14688
The paper is available as a preprint on arXiv.
PDF Link: http://arxiv.org/pdf/2401.14688v3
2. Executive Summary
2.1. Background & Motivation
The field of text-to-image (T2I) generation has seen rapid advancements with powerful diffusion models. However, a significant limitation persists: most prominent open-source T2I models predominantly support English, leaving a notable gap for bilingual or specifically Chinese language support. This creates a challenge for users and researchers in Chinese-speaking regions, as relying on translation software to convert Chinese prompts to English can lead to a loss of meaning, cultural nuances, and emotional context, ultimately resulting in suboptimal image generation. While some prior works like Taiyi-Diffusion and Alt-Diffusion have attempted to adapt T2I models for Chinese, they often achieve this by replacing the original English text encoders with Chinese-specific ones, which inadvertently leads to a loss of the model's original English understanding capabilities.
The core problem the paper aims to solve is the lack of a high-quality, open-source T2I model that robustly supports both Chinese and English without sacrificing the capabilities of either language. The motivation stems from the need for more inclusive and effective tools for diverse linguistic contexts, ensuring that the richness of language is preserved in the image generation process. The paper's innovative idea is to extend existing powerful English-centric models like CLIP and Stable-Diffusion-XL to become truly bilingual through continuous pre-training, efficient vocabulary expansion, and the strategic use of large vision-language models for prompt enrichment, rather than replacing core components.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of bilingual text-to-image generation:
- Efficient Algorithms for Bilingual Expansion: The authors developed and applied efficient algorithms for expanding the vocabulary and position encoding within CLIP's tokenizer and embedding layers. This enables the model to effectively process both Chinese and English, facilitating more accurate and culturally sensitive image generation without discarding the original language capabilities.
- Enrichment of Text Prompts by Large Vision-Language Models (LVLMs): The research introduces an approach to enhance text prompts by leveraging LVLMs. This technique generates richer and more accurate image captions, which in turn leads to higher visual quality and better alignment between generated images and complex textual descriptions. This innovative use of LVLMs significantly improves the model's ability to interpret and visualize intricate text.
- Creation and Open-Sourcing of Bilingual Models: The paper culminates in the development and open-sourcing of the Taiyi-Diffusion-XL model, along with an enhanced CLIP model (Our-CLIP). These models represent a substantial advancement in the research and application of bilingual text-to-image generation, providing a powerful tool for the community.

The key conclusions and findings of the paper are:

- The developed Our-CLIP model demonstrates superior performance in bilingual image-text retrieval tasks, significantly outperforming the previous CLIP and AltCLIP models on both English and Chinese datasets (Flickr30K-CN, MSCOCO-CN).
- Taiyi-Diffusion-XL exhibits advanced bilingual image generation capabilities, surpassing previous open-source models such as Alt-Diffusion, Pai-Diffusion, and Taiyi-v0.1 in terms of CLIP Sim, Inception Score (IS), and Fréchet Inception Distance (FID) on both English (COCO) and Chinese (COCO-CN) datasets.
- Qualitative human evaluation indicates that Taiyi-Diffusion-XL achieves a photographic style comparable to Midjourney and significantly outperforms other open-source bilingual models, though a gap with commercial models such as DALL-E 3 and Midjourney remains, primarily attributed to differences in training data scale and quality.

These findings collectively address the need for more diverse language support in multimodal research, making high-quality text-to-image generation accessible for Chinese language applications while maintaining strong English capabilities.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the Taiyi-Diffusion-XL paper, a foundational grasp of several key concepts in deep learning and multimodal AI is necessary.
- Text-to-Image (T2I) Generation: This is the task of generating an image from a natural language description (text prompt). Recent advancements allow for the creation of photorealistic and stylistically diverse images based on user input.
- Diffusion Models: A class of generative models that learn to reverse a diffusion process. They start with a noisy image and progressively denoise it over several steps to generate a clear image.
  - Diffusion Process (Forward Process): Gradually adds Gaussian noise to an image until it becomes pure noise.
  - Reverse Process (Denoising Process): Learns to predict and remove noise at each step, transforming pure noise back into a coherent image.
- Latent Diffusion Models (LDMs): A specific type of diffusion model (e.g., Stable Diffusion) that performs the diffusion process in a lower-dimensional latent space rather than directly in pixel space. This significantly reduces computational cost and memory requirements while maintaining high image quality. An autoencoder (composed of an encoder and a decoder) maps images to and from this latent space.
- CLIP (Contrastive Language-Image Pre-training): A neural network trained on a vast dataset of image-text pairs to learn highly effective visual representations from natural language. CLIP consists of two main components:
  - Image Encoder: Maps an image into a high-dimensional embedding space.
  - Text Encoder: Maps a text description into the same high-dimensional embedding space.
  The training objective is to maximize the similarity (e.g., cosine similarity) between correct image-text pairs and minimize it for incorrect pairs. This allows CLIP to perform zero-shot tasks such as image-text retrieval (finding images for a given text or vice versa) and image classification.
- Stable-Diffusion-XL (SD-XL): An advanced version of the Stable Diffusion models, known for generating high-resolution images with improved aesthetic quality and better prompt adherence. SD-XL uses a more complex architecture and training strategy than earlier versions, often incorporating multiple text encoders and operating at higher resolutions.
- UNet: A convolutional neural network architecture originally developed for biomedical image segmentation. It has an encoder-decoder structure with "skip connections" that directly link corresponding layers in the encoder and decoder. In diffusion models, a UNet is commonly used as the noise predictor ($\epsilon_\theta$), tasked with predicting the noise component in the latent space at each denoising step.
- Tokenizer: A component of natural language processing (NLP) models that breaks raw text into smaller units called tokens. These tokens can be words, subwords, or characters. For Chinese, character-based or subword tokenization (e.g., BPE or WordPiece) is common because of the lack of explicit word boundaries.
- Embedding Layers: After tokenization, tokens are converted into numerical vectors called embeddings. Embedding layers map discrete tokens to continuous vector representations that capture semantic meaning.
- Position Encoding: In transformer-based models (such as CLIP's text encoder), position encoding is added to token embeddings to provide information about the relative or absolute position of tokens in a sequence. This is crucial because transformers process tokens in parallel without inherent sequential understanding. Absolute position encoding assigns a unique positional vector to each position.
- Large Vision-Language Models (LVLMs): Models that integrate capabilities from both large language models (LLMs) and computer vision models. They can understand and generate content across modalities, performing tasks such as visual question answering, image captioning, and multimodal reasoning. The paper uses an LVLM to enrich text prompts by generating more detailed and accurate image descriptions.
3.2. Previous Works
The paper contextualizes its contributions by discussing the landscape of text-to-image (T2I) generation and bilingual models.
- Early Generative Models:
  - Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017): Consist of a generator and a discriminator network that compete, leading the generator to create increasingly realistic data.
  - Variational Autoencoders (VAEs) (Kingma & Welling, 2013): Learn a probabilistic mapping from data to a latent space and back, allowing for data generation.
  - Flow-based models (Rezende & Mohamed, 2015): Explicitly model the probability density function of data, enabling exact likelihood computation and efficient sampling.
  - Autoregressive models (Ramesh et al., 2021; Ding et al., 2021; 2022): Generate data sequentially, pixel by pixel or token by token. DALL-E (Ramesh et al., 2021) is an early example.
- Advancements in Diffusion Models:
  - The paper highlights the rise of diffusion models as the leading technology (Vincent, 2011; Ho et al., 2020; Song et al., 2020; Cao et al., 2022).
  - DALL-E 2 (Ramesh et al., 2022): Utilizes a hierarchical diffusion model with CLIP latents.
  - Imagen (Saharia et al., 2022) and Deepfloyd-IF (Shonenkov et al., 2023): Demonstrate photorealistic image generation with deep language understanding using diffusion models.
  - Latent Diffusion Model (LDM) (Rombach et al., 2022): The foundation for the Stable Diffusion variants (SD-v1.5, SD-v2.1, SD-XL), which perform diffusion in a latent space and integrate CLIP text models for textual feature extraction to reduce computational overhead.
- Text-to-Image Models in a Bilingual Context:
  - Initial Chinese CLIP variants: Taiyi-CLIP (Zhang et al., 2022), Chinese-CLIP (Yang et al., 2022), and Alt-CLIP (Chen et al., 2022) focused on replacing CLIP's text encoder with a Chinese-specific one and pre-training on Chinese datasets for image-text matching.
  - Chinese Diffusion Models: Following the CLIP adaptation, Taiyi-Diffusion (Zhang et al., 2022), Alt-Diffusion (Ye et al., 2023), and Pai-Diffusion (Wang et al., 2023) replaced the text encoder in Stable Diffusion and continued training on Chinese image-text datasets to generate images from Chinese prompts.
  - Limitation of previous Chinese T2I models: A crucial drawback of these models is that replacing the original CLIP text encoder often causes the loss of the model's English language capabilities, making them Chinese-centric rather than truly bilingual.
- Text-Image Datasets:
  - Traditional datasets: COCO (Lin et al., 2014) and Flickr (Young et al., 2014) for English, and COCO-CN (Li et al., 2019) and Flickr-CN (Li et al., 2016) for Chinese, provide foundational data but are limited in scale (below one million entries).
  - Web-crawled datasets: Large-scale datasets such as Laion (Schuhmann et al., 2021) (English, up to billions of pairs) and Wukong (Gu et al., 2022) (Chinese, 100 million pairs) are critical for training modern diffusion models.
3.3. Technological Evolution
The evolution of text-to-image generation has moved from early generative models (GANs, VAEs, Autoregressive models) towards the highly effective diffusion models. Initially, these models were largely English-centric. The next wave of innovation focused on adapting these powerful models for other languages, particularly Chinese, due to its distinct linguistic characteristics. Early attempts at bilingual support often involved replacing English-specific components with language-specific alternatives, which, while effective for the target language, compromised the model's original capabilities.
This paper's work, Taiyi-Diffusion-XL, fits within this technological timeline as a refinement step. Instead of replacement, it focuses on extension and continuous pre-training, aiming to achieve true bilingualism where both English and Chinese capabilities are maintained and enhanced. The integration of Large Vision-Language Models (LVLMs) for prompt enrichment also signifies a step towards more sophisticated understanding and utilization of textual input in multimodal generation. This represents an evolution from merely translating text or swapping encoders to deeply integrating multilingual understanding and leveraging advanced LLMs for better input conditioning.
3.4. Differentiation Analysis
Compared to the main methods in related work, Taiyi-Diffusion-XL presents several core differences and innovations:
- Bilingual Capability Preservation vs. Replacement:
  - Previous Chinese T2I models (e.g., Taiyi-Diffusion, Alt-Diffusion, Pai-Diffusion): These models typically replace the English CLIP text encoder with a Chinese-specific encoder, leading to a loss of English language understanding.
  - Taiyi-Diffusion-XL's Innovation: It extends the capabilities of the original English-centric CLIP and Stable-Diffusion-XL through bilingual continuous pre-training. This approach integrates Chinese characters and expands CLIP's vocabulary and position encoding rather than replacing the encoder, so the model retains its strong English capabilities while gaining robust Chinese support.
- Prompt Enrichment with LVLMs:
  - Standard T2I models: Rely directly on user-provided text prompts, which can be brief or ambiguous.
  - Taiyi-Diffusion-XL's Innovation: It leverages Large Vision-Language Models (LVLMs) to enrich text prompts. The LVLM generates more detailed and accurate image captions from initial prompts and even imperfect web-crawled captions. This leads to a richer understanding of the desired image content, resulting in higher visual quality and better alignment with complex descriptions. This is a novel way to improve input conditioning for T2I models.
- Open-Source Stable-Diffusion-XL Basis:
  - The paper leverages the powerful Stable-Diffusion-XL as its backbone, building on its advanced architecture and training methodologies. This ensures Taiyi-Diffusion-XL starts from a strong foundation for high-resolution image generation, unlike some earlier Chinese models that were based on older Stable Diffusion versions or custom architectures.

In essence, Taiyi-Diffusion-XL innovates by moving beyond mere language adaptation through component replacement to a more sophisticated strategy of bilingual augmentation and AI-driven prompt enhancement, positioning it as a leading open-source solution for truly bilingual text-to-image generation.
4. Methodology
The methodology for Taiyi-Diffusion-XL involves a systematic approach to adapt and enhance existing state-of-the-art models for bilingual text-to-image generation. It can be broadly divided into Dataset Preparation, CLIP Training for bilingual understanding, and Taiyi-XL Training for image generation.
4.1. Principles
The core idea behind the Taiyi-Diffusion-XL method is to achieve robust bilingual (Chinese and English) text-to-image generation by extending the capabilities of existing powerful models rather than replacing their core components. The theoretical basis and intuition are rooted in the effectiveness of continuous pre-training for language adaptation and the benefits of rich, contextually relevant prompts for generative models. The key principles include:
- Bilingual Continuous Pre-training: Starting with well-established English-centric models (CLIP and Stable-Diffusion-XL), the goal is to expand their linguistic understanding to include Chinese without forgetting their original English knowledge. This is more efficient and effective than training from scratch or replacing components.
- Efficient Vocabulary Expansion: For CLIP's text encoder, directly integrating frequently used Chinese characters into its existing tokenizer and embedding layers, combined with an absolute position encoding expansion, allows efficient processing of mixed-language inputs. This avoids the pitfalls of language-specific encoders that might lose cross-lingual alignment.
- Prompt Enrichment via Large Vision-Language Models (LVLMs): Recognizing that the quality of generated images depends heavily on the clarity and detail of text prompts, the method employs LVLMs to refine and enrich user-provided or web-crawled captions. This leverages the advanced understanding capabilities of LVLMs to produce high-quality, detailed descriptions that guide the image generation process more effectively.
- Latent Diffusion for Efficiency: Maintaining the latent diffusion model paradigm ensures computational efficiency by performing the denoising process in a compressed latent space, making training and inference practical for high-resolution image generation.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Dataset Preparation
The first crucial step in the methodology is curating a high-quality dataset of image-text pairs, denoted as (X, Y), where X is an image and Y is its descriptive text.
- Emphasis on Comprehensive Descriptions: Unlike traditional datasets that might rely on simple tags, Taiyi-Diffusion-XL prioritizes comprehensive descriptions that capture materials, styles, colors, and spatial layouts of the images.
- Addressing Web-Crawled Data Limitations: Web-crawled resources often contain noisy, irrelevant, or inaccurate tags. To overcome this, the paper employs vision-language large models (LVLMs) to generate synthetic captions.
- LVLM for Caption Generation: Specifically, the Lyrics model (Lu et al., 2023b;a) is utilized. This model inherits the language capabilities of bilingual large language models (Gan et al., 2023) and extends the visual capabilities of LLMs (a small assembly sketch follows below).
  - Inputs to Lyrics: The Lyrics model takes three inputs: the image X, the web-crawled caption (potentially imperfect), and instructions for generating the description.
  - Instructions:
    - For Chinese: "请详细描述图片内容" (Please describe the image content in detail).
    - For English: "Write a detailed description of the given image."
  - Output of Lyrics: The Lyrics model generates new, accurate descriptive text (Y') by extracting features from the images and distilling useful information from the imperfect web-crawled captions. This process effectively enhances the richness, relevance, and detail of the descriptions.
- Final Dataset: The generated high-quality text is then combined with the original images to form enhanced image-text pairs (X, Y'), which are subsequently used for training the Taiyi-XL model.
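The following is a minimal sketch of how such a caption-refinement request could be assembled. The `query_lvlm` callable is a hypothetical stand-in for the Lyrics model's inference interface (no public API is assumed here); only the two instruction strings are taken from the paper.

```python
# Minimal sketch of assembling the caption-enrichment request described above.
# `query_lvlm` is a hypothetical stand-in for the Lyrics model's inference call.

def build_refinement_prompt(web_caption: str, language: str = "zh") -> str:
    """Combine the paper's instruction with the imperfect web-crawled caption."""
    instruction = (
        "请详细描述图片内容" if language == "zh"
        else "Write a detailed description of the given image."
    )
    # The noisy web caption is passed as auxiliary context for distillation.
    return f"{instruction}\nReference caption: {web_caption}"

def enrich_caption(image_path: str, web_caption: str, language: str, query_lvlm) -> str:
    """Return a refined caption Y' for image X, given an LVLM callable."""
    prompt = build_refinement_prompt(web_caption, language)
    return query_lvlm(image_path, prompt)  # hypothetical LVLM inference call
```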
4.2.2. CLIP Training
The foundation of Taiyi-Diffusion-XL's bilingual understanding lies in its enhanced CLIP model.
- Starting Point: The process begins with a pre-trained English-only CLIP model (Radford et al., 2021). This provides a strong initial understanding of English image-text relationships.
- Bilingual Adaptation and Continuous Pre-training: The CLIP model is extended to accommodate bilingual adaptation and the requirements of high-quality image-text data. This is achieved through bilingual continuous pre-training.
  - Stage 1: Large-scale Bilingual Training: In the first stage, the CLIP model is continuously pre-trained on a large-scale bilingual dataset, including prominent datasets such as Laion (Schuhmann et al., 2021) (primarily English) and Wukong (Gu et al., 2022) (primarily Chinese). This stage also involves crucial data cleaning and quality enhancement to ensure effective learning from diverse sources.
    - Loss Function: A contrastive loss function is employed. It brings the embeddings of matching image-text pairs closer in the latent space while pushing non-matching pairs apart (see the sketch after this list).
    - Training Approach: A distributed, memory-efficient training approach (Chen et al., 2023) is used to handle the massive datasets involved.
  - Stage 2: Enriched Dataset Training: The second stage continues training on the enriched dataset (the one prepared using LVLMs, as described in Section 4.2.1). This stage leverages the diverse perspectives and detailed information captured in these high-quality image-text pairs to further refine the CLIP model's understanding.
- Vocabulary and Position Encoding Expansion (Implicit in CLIP Training): While not detailed as a separate step here, the abstract mentions the "efficient expansion of vocabulary by integrating the most frequently used Chinese characters into CLIP's tokenizer and embedding layers, coupled with an absolute position encoding expansion." This expansion is a key part of how CLIP adapts to bilingual contexts during continuous pre-training: it ensures the text encoder can properly tokenize and embed Chinese characters and understand their positional relationships within a sentence (an illustrative sketch also follows below).
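For reference, a symmetric image-text contrastive loss of the kind described above can be written as follows. This is a generic InfoNCE-style sketch in PyTorch, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)             # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)          # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

The vocabulary and position-encoding expansion can likewise be sketched with Hugging Face `transformers`. The checkpoint name, the handful of sample characters, and the target of 256 positions are illustrative assumptions; the paper does not specify these values.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Vocabulary expansion: add frequently used Chinese characters as new tokens
#    (this tiny character list is a placeholder, not the paper's actual list).
new_tokens = ["山", "水", "墨", "画", "熊", "猫"]
tokenizer.add_tokens(new_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))  # grows the token embedding matrix

# 2) Absolute position-encoding expansion: enlarge the 77-slot position table,
#    copying the pretrained rows so existing positions keep their weights.
emb = text_encoder.text_model.embeddings
old_pos = emb.position_embedding
new_max_len = 256  # assumed target length, not stated in the paper
new_pos = torch.nn.Embedding(new_max_len, old_pos.embedding_dim)
new_pos.weight.data[: old_pos.num_embeddings] = old_pos.weight.data
emb.position_embedding = new_pos
emb.register_buffer("position_ids", torch.arange(new_max_len).unsqueeze(0), persistent=False)
text_encoder.config.max_position_embeddings = new_max_len
# Note: the tokenizer's model_max_length would also need to be raised accordingly.
```

Both the new token embeddings and the new position slots would then be trained during the continuous pre-training stages described above.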
4.2.3. Taiyi-XL Training
The Taiyi-XL training process is the core of the text-to-image generation methodology, building upon the bilingual CLIP model.
4.2.3.1. Initialization and Training
- Model Initialization: The Taiyi-XL model is initialized with several key components:
  - A noise predictor $\epsilon_\theta$: typically implemented as a time-conditional UNet (Ronneberger et al., 2015), responsible for predicting the noise component at each step of the diffusion process.
  - A CLIP text encoder $\tau_\theta$: the bilingual CLIP text encoder trained in Section 4.2.2, which maps textual descriptions into a latent representation.
  - A latent encoder $\mathcal{E}$: part of the Variational Autoencoder (VAE) (Kingma & Welling, 2013), which compresses an input image into a lower-dimensional latent representation.
  - A dataset of image-text pairs $(x, y)$, where $x$ is an image and $y$ is its corresponding textual descriptor.
- Training Phase: The model is trained at mixed image resolutions. The objective is to minimize a loss function that guides the image denoising process. The latent representations $z_t$ are derived from $\mathcal{E}(x)$ during training.
- Loss Function: The loss function is defined as:
$
L(\theta) := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_{\theta}\!\left(z_{t}, t, \tau_{\theta}(y)\right)\right\|_{2}^{2}\right]
$
  - Explanation of Symbols:
    - $\theta$: All the trainable parameters of the Taiyi-XL model.
    - $\mathbb{E}[\cdot]$: The expectation (average) over the sampled images, texts, noise, and timesteps.
    - $\mathcal{E}(x)$: The latent representation of an image $x$, obtained by passing the image through the latent encoder.
    - $y$: The textual descriptor (prompt) corresponding to the image $x$.
    - $\epsilon \sim \mathcal{N}(0,1)$: Pure Gaussian noise sampled from a standard normal distribution; this is the target noise that the model aims to predict.
    - $t$: The current timestep in the diffusion process, typically ranging from 1 to $T$.
    - $\epsilon_{\theta}(z_t, t, \tau_{\theta}(y))$: The noise predictor (UNet) with parameters $\theta$. It takes the noisy latent representation $z_t$, the current timestep $t$, and the text embedding $\tau_{\theta}(y)$ (obtained from the CLIP text encoder) as input, and predicts the noise component present in $z_t$.
    - $\|\cdot\|_2^2$: The squared $L_2$ norm, i.e., the squared Euclidean distance between the actual noise and the predicted noise; the goal is to minimize this difference.
  - Function: This loss is the standard objective of denoising diffusion probabilistic models. It trains the noise predictor to accurately predict the noise that was added to a clean latent at a given timestep, conditioned on the text prompt.
- Model Implementation: The noise predictor is implemented as a time-conditional UNet (Ronneberger et al., 2015).
- Latent Representations: The latent representations are efficiently derived from $\mathcal{E}(x)$ during training. After generation, they are decoded back to image space using the VAE decoder.
- Text Encoder Optimization: The text encoder $\tau_\theta$, parameterized as a transformer model (part of CLIP), is optimized jointly with the noise predictor as per Equation 1, ensuring that it provides relevant conditioning information for the image generation task.
- Optimization: Model parameters are iteratively updated using gradient descent to minimize the loss function (a minimal training-step sketch is given at the end of Section 4.2.3.2):
$
\theta_{e+1} = \theta_{e} - \eta \cdot \nabla_{\theta} L\left(\theta_{e}, e\right)
$
  - Explanation of Symbols:
    - $\theta_e$: The model parameters at epoch (or iteration) $e$.
    - $\theta_{e+1}$: The updated model parameters for the next epoch or iteration.
    - $\eta$: The learning rate, which controls the step size of parameter updates.
    - $\nabla_{\theta} L(\theta_e, e)$: The gradient of the loss function with respect to the model parameters at epoch $e$. It indicates the direction of steepest ascent of the loss; subtracting it adjusts the parameters in the direction that minimizes the loss.
4.2.3.2. Text-to-Image Generation
Once the Taiyi-XL model is trained, the text-to-image generation process involves:
- Feature Extraction: The trained bilingual text encoder is used to extract textual features from the user's input text prompt $y$, producing the text embedding $\tau_\theta(y)$.
- Latent Diffusion Process: These extracted textual features are then integrated into the latent diffusion process. This integration is crucial for guiding the image generation towards the content described in the prompt.
- Iterative Denoising: Generation starts from the last timestep $T$ with pure noise (a randomly sampled latent vector). The model iteratively denoises this input, step by step, gradually removing noise and adding coherent image features.
- Convergence to a Clean Image: The process continues until it converges to $x_0$, the clean generated image, as described by:
$
x_{t-1} = x_{t} - \epsilon_{\theta}\left(x_{t}, t, \tau_{\theta}(y)\right), \quad \lim_{t \rightarrow 0} x_{t} = x_{0}
$
  - Explanation of Symbols:
    - $x_t$: The noisy latent representation at timestep $t$.
    - $x_{t-1}$: The denoised latent representation at the previous timestep $t-1$.
    - $\epsilon_{\theta}(x_t, t, \tau_{\theta}(y))$: The noise predictor (UNet) prediction of the noise component in $x_t$, conditioned on the timestep $t$ and the text embedding $\tau_{\theta}(y)$.
    - $x_t - \epsilon_{\theta}(\cdot)$: This operation subtracts the predicted noise from the current noisy latent, effectively denoising it.
    - $\lim_{t \to 0} x_t = x_0$: As the timestep approaches 0 (the process moves from pure noise towards a clean image), the latent representation converges to the final clean image $x_0$.
  - Function: This equation describes the reverse diffusion (denoising) process, in which the model iteratively refines the latent representation guided by the predicted noise and the text conditioning, ultimately producing a coherent image. Performing this process in latent space keeps it computationally efficient compared to pixel-space diffusion models. Illustrative training and inference sketches follow below.
5. Experimental Setup
5.1. Datasets
The experiments for Taiyi-Diffusion-XL utilized a combination of large-scale web-crawled datasets and established benchmark datasets for evaluation.
- Training Datasets:
  - Laion (Schuhmann et al., 2021): A massive dataset consisting primarily of English image-text pairs, often containing hundreds of millions or billions of entries. It is a widely used resource for training CLIP and latent diffusion models.
  - Wukong (Gu et al., 2022): A large-scale Chinese cross-modal pre-training benchmark, primarily containing Chinese image-text pairs, with sizes up to 100 million entries.
  - Enriched Dataset (Internal): An internal dataset curated by the authors, in which vision-language large models (specifically Lyrics) were used to generate high-quality, comprehensive descriptions for images, distilling information from web-crawled captions and image features. This dataset plays a crucial role in CLIP and Taiyi-XL training, emphasizing detailed captions over simple tags.
- Evaluation Datasets:
  - Flickr30K (Young et al., 2014): A standard English image captioning dataset containing 30,000 images, each paired with five human-annotated captions.
  - MSCOCO (Lin et al., 2014): Microsoft Common Objects in Context, a large-scale object detection, segmentation, and captioning dataset for English. It contains hundreds of thousands of images with multiple captions per image.
  - Flickr30K-CN (Li et al., 2016): The Chinese version of Flickr30K, providing Chinese captions for the same set of images, used to evaluate Chinese image-text retrieval.
  - MSCOCO-CN (Li et al., 2019): The Chinese version of MSCOCO, offering Chinese captions, used for evaluating Chinese image-text retrieval and image generation.

These datasets were chosen to cover both large-scale pre-training (Laion, Wukong, and the internal enriched dataset) and rigorous evaluation on established benchmarks in both English and Chinese (Flickr30K, MSCOCO, and their Chinese counterparts). This allows a comprehensive validation of the model's performance across languages and tasks.
5.2. Evaluation Metrics
The evaluation framework encompassed both machine and human evaluations. For machine evaluation, several quantitative metrics were employed to assess different aspects of the model's performance.
5.2.1. CLIP Performance Evaluation (Image-to-Text Retrieval and Text-to-Image Retrieval)
These metrics assess the ability of the CLIP model to correctly match images with their corresponding text descriptions, and vice-versa, in a zero-shot setting.
- Recall@K (R@K): This metric measures the proportion of queries for which the correct item (image or text) is found among the top K retrieved results. R@1 specifically asks whether the correct item is the top-1 retrieved result.
  - Conceptual Definition: Recall@K evaluates retrieval accuracy. For image-to-text retrieval, it asks: "If I query with an image, is its correct caption among the top K retrieved captions?" For text-to-image retrieval, it asks: "If I query with a text, is its correct image among the top K retrieved images?" A higher R@K indicates better retrieval performance.
  - Mathematical Formula (for a single query and a set of candidates):
$
\text{Recall@K} = \begin{cases} 1 & \text{if the correct item is in the top } K \text{ retrieved candidates} \\ 0 & \text{otherwise} \end{cases}
$
  The reported Recall@K is typically the average over all queries in the test set (a small computational sketch follows this list).
  - Symbol Explanation:
    - $K$: An integer giving the number of top retrieved results considered (e.g., 1, 5, 10).
    - correct item: The ground-truth image or text corresponding to a given query.
    - top K retrieved candidates: The K items (images or texts) with the highest similarity scores to the query, according to the model.
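As a concrete reference, Recall@K can be computed from a query-by-candidate similarity matrix as in the following sketch, assuming the i-th candidate is the ground truth for the i-th query (the standard setup in these retrieval benchmarks).

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """sim[i, j] = similarity(query_i, candidate_j); ground truth lies on the diagonal."""
    topk = sim.topk(k, dim=1).indices                       # (num_queries, k) best candidates
    targets = torch.arange(sim.size(0)).unsqueeze(1)         # correct index per query
    hits = (topk == targets).any(dim=1).float()              # 1 if the correct item is in the top k
    return hits.mean().item()                                # average over all queries
```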
5.2.2. CLIP Similarity (CLIP Sim)
- Conceptual Definition: CLIP Similarity measures the semantic alignment between a generated image and its guiding text description. It quantifies how well the generated image visually represents the concepts expressed in the text prompt. A higher CLIP Sim score indicates better text-image consistency.
- Mathematical Formula:
$
\text{CLIP Sim} = \frac{1}{N} \sum_{i=1}^{N} \cos\!\left( E_{I}\big(G(y_i)\big),\; E_{T}(y_i) \right)
$
  - Symbol Explanation:
    - $N$: The total number of generated image-text pairs being evaluated.
    - $i$: The index of each pair.
    - $G(y_i)$: The image generated by the text-to-image model (e.g., Taiyi-XL) from the text prompt $y_i$.
    - $E_I$: The image encoder component of the CLIP model.
    - $E_T$: The text encoder component of the CLIP model.
    - $\cos(u, v)$: The cosine similarity between two vectors $u$ and $v$; a value closer to 1 indicates higher similarity.
  - Function: This metric uses the CLIP model's learned multimodal embedding space to evaluate the semantic coherence between a generated image and its prompt (a computational sketch follows below).
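For illustration, the per-pair term of CLIP Sim can be computed with an off-the-shelf CLIP checkpoint as below. The checkpoint id, file name, and prompt are placeholders; the paper's evaluation presumably uses its own bilingual CLIP.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")          # a generated image G(y)
prompt = "a cat reading a book"              # the prompt y used to generate it
inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Normalize the projected embeddings and take their dot product (cosine similarity).
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clip_sim = (img_emb * txt_emb).sum().item()
print(clip_sim)
```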
5.2.3. Inception Score (IS)
- Conceptual Definition: The Inception Score (IS) evaluates the quality and diversity of generated images. A higher IS suggests that the generated images are both "sharper" (high quality, recognizable objects) and "more diverse" (covering a wide range of categories). It does not directly evaluate text-image alignment.
- Mathematical Formula:
$
\text{IS} = \exp\left( \mathbb{E}_{x} \left[ D_{\mathrm{KL}}\big(p(y|x) \,\|\, p(y)\big) \right] \right)
$
  - Symbol Explanation:
    - $\exp$: The exponential function.
    - $\mathbb{E}_x$: The expectation over generated images $x$.
    - $D_{\mathrm{KL}}(\cdot \| \cdot)$: The Kullback-Leibler (KL) divergence between two probability distributions.
    - $p(y|x)$: The conditional class probability distribution for a generated image $x$, obtained by feeding $x$ into a pre-trained Inception v3 network and taking its softmax output. This reflects how likely the image contains a recognizable object.
    - $p(y)$: The marginal class probability distribution, i.e., the average of $p(y|x)$ over all generated images. This reflects the diversity of objects in the generated set.
  - Function: IS rewards images that contain easily recognizable objects (low-entropy $p(y|x)$) and a wide variety of such objects across the generated set (high-entropy $p(y)$).
5.2.4. Fréchet Inception Distance (FID)
- Conceptual Definition: The Fréchet Inception Distance (FID) measures the "distance" between the distribution of generated images and the distribution of real images. It is a popular metric for assessing the realism and diversity of generated samples. A lower FID score indicates that the generated images are closer to real images in visual quality and statistical properties.
- Mathematical Formula:
$
\text{FID} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\!\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_1 \Sigma_2\right)^{1/2}\right)
$
  - Symbol Explanation:
    - $\mu_1$: The mean activation vector of the real images in a specific layer (e.g., the last pooling layer) of a pre-trained Inception v3 network.
    - $\mu_2$: The mean activation vector of the generated images in the same layer of the Inception v3 network.
    - $\|\cdot\|_2^2$: The squared Euclidean distance.
    - $\mathrm{Tr}(\cdot)$: The trace of a matrix (the sum of its diagonal elements).
    - $\Sigma_1$: The covariance matrix of the real images' activations.
    - $\Sigma_2$: The covariance matrix of the generated images' activations.
  - Function: FID compares the first (mean) and second (covariance) moments of the feature distributions of real and generated images. It is considered more robust than IS at capturing both image quality and diversity (a library-based sketch follows below).
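FID is rarely computed by hand; a library implementation such as torchmetrics (assumed installed with its image extras) can be used, as in this sketch. The random tensors and tiny batch sizes stand in for real and generated image batches purely for illustration.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # Inception v3 pool features

# Placeholder uint8 image batches of shape (N, 3, H, W); use real/generated images in practice.
real_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))  # lower is better; meaningful only with many samples
```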
5.2.5. Human Preference Evaluation
- Conceptual Definition: This involves human raters assessing the quality, aesthetic appeal, and prompt adherence of generated images. Because of the inherent subjectivity, the paper primarily employs a case analysis approach.
- Function: Instead of quantitative scoring, human evaluation focuses on qualitative examination, highlighting unique attributes, performance nuances, and distinct characteristics of the images produced by different models. This helps discern subtle differences in style, detail, and prompt following that quantitative metrics might miss.
5.3. Baselines
For comparative analysis, Taiyi-XL was benchmarked against a range of established and state-of-the-art models, including both commercial and open-source solutions:
- Commercial/Closed-Source Models:
  - Midjourney (https://www.midjourney.com/): A highly regarded text-to-image generative AI program known for its high aesthetic quality and artistic output.
  - DALL-E 3 (Betker et al., 2023; https://openai.com/dall-e-3): OpenAI's latest iteration of the DALL-E series, recognized for its exceptional prompt-following capabilities and image quality.
- Open-Source Stable Diffusion Variants:
  - SD-XL (Podell et al., 2023): Stable Diffusion XL, the foundation model upon which Taiyi-XL is built; a powerful version of Stable Diffusion designed for high-resolution synthesis.
  - SD-v1.5 (Rombach et al., 2022): An earlier, widely used version of Stable Diffusion.
- Open-Source Chinese/Bilingual T2I Models:
  - Taiyi-v0.1 (Wang et al., 2022): The authors' previous work on Chinese text-to-image generation.
  - Alt-Diffusion (Ye et al., 2023): A multilingual text-to-image diffusion model.
  - Pai-Diffusion (Wang et al., 2023): Another Chinese diffusion model developed for text-to-image synthesis.
- CLIP Models (for retrieval evaluation):
  - CLIP (Radford et al., 2021): The original English-only CLIP model.
  - AltCLIP (Chen et al., 2022): A CLIP variant designed for extended language capabilities, including Chinese.
  - Our-CLIP: The CLIP model developed in this paper, which serves as the text encoder for Taiyi-XL.

These baselines span the spectrum of current T2I capabilities, from general-purpose (SD-XL), to cutting-edge commercial (DALL-E 3, Midjourney), to Chinese/bilingual open-source models, allowing a comprehensive evaluation of Taiyi-XL's advancements, particularly in bilingual generation and fidelity to textual prompts.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the superior performance of Taiyi-XL and its underlying Our-CLIP model, especially in bilingual contexts.
6.1.1. CLIP Model Evaluation
The paper first evaluates the CLIP model, which is the backbone for text encoding in Taiyi-XL. The results, presented in Table 1, highlight the Our-CLIP model's exemplary performance in zero-shot image-text retrieval tasks across both English and Chinese datasets.
The following are the results from Table 1 of the original paper:
| Model | Flickr30K Image→Text | | | Flickr30K Text→Image | | | MSCOCO Image→Text | | | MSCOCO Text→Image | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CLIP (Radford et al., 2021) | 85.1 | 97.3 | 99.2 | 65.0 | 87.1 | 92.2 | 56.4 | 79.5 | 86.5 | 36.5 | 61.1 | 71.1 |
| AltCLIP (Chen et al., 2022) | 86.0 | 98.0 | 99.1 | 72.5 | 91.6 | 95.4 | 58.6 | 80.6 | 87.8 | 42.9 | 68.0 | 77.4 |
| Our-CLIP | 88.4 | 98.8 | 99.9 | 75.7 | 93.8 | 96.9 | 61.2 | 84.8 | 90.3 | 49.2 | 70.3 | 79.6 |
| | Flickr30K-CN Image→Text | | | Flickr30K-CN Text→Image | | | MSCOCO-CN Image→Text | | | MSCOCO-CN Text→Image | | |
| | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CLIP (Radford et al., 2021) | 2.3 | 8.1 | 12.6 | 0 | 2.4 | 4.0 | 0.6 | 4.1 | 7.1 | 1.8 | 6.7 | 11.9 |
| AltCLIP (Chen et al., 2022) | 69.8 | 89.9 | 94.7 | 84.8 | 97.4 | 98.8 | 63.9 | 87.2 | 93.9 | 62.8 | 88.8 | 95.5 |
| Our-CLIP | 73.2 | 90.3 | 96.5 | 88.1 | 98.2 | 99.1 | 66.0 | 91.1 | 96.6 | 69.7 | 91.3 | 96.8 |
Analysis:
- English Datasets (Flickr30K, MSCOCO): Our-CLIP consistently outperforms the original CLIP and AltCLIP across all R@K metrics for both image-to-text and text-to-image retrieval. For instance, on MSCOCO text-to-image R@1, Our-CLIP achieves 49.2%, significantly higher than CLIP's 36.5% and AltCLIP's 42.9%. This indicates that the continuous pre-training and enhancements applied to CLIP not only maintained but improved its English comprehension.
- Chinese Datasets (Flickr30K-CN, MSCOCO-CN): The original CLIP model performs very poorly on Chinese datasets, with R@1 scores often in the single digits or even 0%, highlighting its English-centric nature. AltCLIP shows a substantial improvement, demonstrating its multilingual adaptation. However, Our-CLIP further surpasses AltCLIP on Chinese tasks: for text-to-image R@1 on Flickr30K-CN, Our-CLIP achieves 88.1% compared to AltCLIP's 84.8%, and on MSCOCO-CN text-to-image R@1 it reaches 69.7%, exceeding AltCLIP's 62.8%.

This superior performance of Our-CLIP in bilingual contexts is critical because it forms the text encoder for Taiyi-XL. A robust CLIP model ensures a more nuanced understanding of user prompts in both languages, directly leading to better-aligned and higher-quality image generation.
6.1.2. Diffusion Model Evaluation
The next set of results focuses on the image generation capabilities of Taiyi-XL compared to other diffusion models using metrics like CLIP Sim, FID, and IS.
The following are the results from Table 2 of the original paper:
| Model | CLIP Sim (↑) | FID (↓) | IS (↑) |
|---|---|---|---|
| English Dataset (COCO) | | | |
| Alt-Diffusion (Ye et al., 2023) | 0.220 | 27.600 | 31.577 |
| SD-v1.5 (Rombach et al., 2022) | 0.225 | 25.342 | 32.876 |
| SD-XL (Podell et al., 2023) | 0.231 | 23.887 | 33.793 |
| Taiyi-XL | 0.254 | 22.543 | 35.465 |
| Chinese Dataset (COCO-CN) | | | |
| Taiyi-v0.1 (Wang et al., 2022) | 0.197 | 69.226 | 21.060 |
| Alt-Diffusion (Ye et al., 2023) | 0.220 | 68.488 | 22.126 |
| Pai-Diffusion (Wang et al., 2023) | 0.196 | 72.572 | 19.145 |
| Taiyi-XL | 0.225 | 67.675 | 22.965 |
Analysis:
- English Dataset (COCO): Taiyi-XL demonstrates superior performance across all three metrics. It achieves the highest CLIP Sim (0.254), indicating strong semantic alignment between generated images and English text prompts; the lowest FID (22.543), meaning its generated images are statistically closer to real images (better realism and diversity); and the highest IS (35.465), signifying high quality and diversity of generated content. This indicates that Taiyi-XL not only preserves but enhances English generation capabilities compared with SD-XL and the other baselines.
- Chinese Dataset (COCO-CN): Similar to the English results, Taiyi-XL leads all other models on the Chinese dataset, with the highest CLIP Sim (0.225), the lowest FID (67.675), and the highest IS (22.965). The improvement over other Chinese-specific models such as Taiyi-v0.1, Alt-Diffusion, and Pai-Diffusion is evident across all metrics, confirming Taiyi-XL's robust bilingual capabilities and making it a leading solution for high-fidelity image generation from both English and Chinese prompts.

Overall, these quantitative results strongly validate the effectiveness of Taiyi-XL's approach, particularly its ability to excel in both English and Chinese image generation tasks simultaneously.
6.1.3. Human Preference Evaluation
Qualitative assessments provide further insights into the model's performance nuances. Figure 3 illustrates Chinese text-to-image generation, and Figure 4 shows English text-to-image generation comparisons.
The following figure (Figure 3 from the original paper) shows the comparison of Different Models in Chinese Text-to-Image Generation Performance:
This figure is a chart comparing images generated by five models (Taiyi-v0.1, Alt-Diffusion, Pai-Diffusion, DALL·E 3, and Taiyi-XL) for three Chinese prompts, illustrating the differences in detail and image quality across the models.
The following figure (Figure 4 from the original paper) shows the comparison of Different Models in English Text-to-Image Generation Performance:

Analysis:
- Improvements of XL versions: Both figures show that the XL versions (SD-XL, Taiyi-XL) generally produce significantly better images than their 1.5-generation counterparts (SD-v1.5, Alt-Diffusion). This reflects the benefits of larger model capacity, improved algorithms, and advanced training methodologies.
- Prompt Following: DALL-E 3 is noted for its exceptional prompt-following capability, setting a high benchmark for aligning images precisely with textual descriptions, though it occasionally produces overly vivid colors.
- Aesthetics and Style: Taiyi-XL is characterized by a photographic style and is described as closely paralleling Midjourney in aesthetic appeal, particularly for English generation (Figure 4).
- Bilingual Support: A key strength of Taiyi-XL is its enhanced support for bilingual (Chinese and English) text-to-image generation. This is crucial for diverse linguistic contexts and is a major differentiator compared with models that are either English-only or lose English capabilities when adapted to Chinese. As seen in Figure 3, Taiyi-XL generates visually coherent and prompt-aligned images from Chinese text, outperforming Taiyi-v0.1, Alt-Diffusion, and Pai-Diffusion.
- Gap with Commercial Models: While Taiyi-XL significantly surpasses current bilingual open-source models, the authors acknowledge a gap with commercial models such as DALL-E 3 and Midjourney. This disparity is attributed primarily to differences in the quantity, quality, and diversity of the image-text data used for training. Taiyi-XL was trained exclusively on copyright-compliant image-text data, highlighting the ongoing challenges related to data access and copyright in AIGC development.
6.1.4. Impact of Latent Consistency Models (LCM)
The paper also briefly discusses the impact of using Latent Consistency Models (LCM) to accelerate the image generation process, as visualized in Figure 5.
The following figure (Figure 5 from the original paper) shows Taiyi-XL generation examples with Latent Consistency Model:

Analysis:
- Inference Steps vs. Image Quality: A notable observation from the LCM tests is the correlation between reducing the number of inference steps and a consequent decline in image quality.
  - Single-step generation: When constrained to a single generation step, the resulting images predominantly exhibit only basic outlines and lack finer details; for example, in Figure 5, the "astronaut riding a white horse" generated in one step is barely recognizable.
  - Eight-step generation: Extending the generation process to 8 steps yields considerably higher-quality images, preserving satisfactory levels of detail and overall image fidelity; the 8-step images in Figure 5 are significantly more detailed and visually appealing.
- Balancing Speed and Quality: This finding suggests that while LCM can effectively speed up generation, a balance must be struck: a minimum number of steps (around eight) is needed to maintain acceptable image quality (a hypothetical usage sketch follows below).
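If one wanted to reproduce this speed/quality trade-off with the released checkpoint, a hypothetical setup could look like the following, assuming compatible LCM weights or an LCM-LoRA are available for the model (the paper does not specify how its LCM variant was obtained).

```python
import torch
from diffusers import StableDiffusionXLPipeline, LCMScheduler

# Assumes the checkpoint is diffusers-compatible and that LCM weights/LoRA exist for it.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# Eight steps preserved acceptable quality in the paper's tests; one step did not.
image = pipe("宇航员骑着一匹白马", num_inference_steps=8, guidance_scale=1.0).images[0]
image.save("astronaut.png")
```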
6.2. Ablation Studies / Parameter Analysis
The paper does not explicitly detail ablation studies on components of Taiyi-XL (e.g., the effect of LVLM-enriched prompts vs. original prompts, or specific aspects of the vocabulary expansion). However, the comparison of Our-CLIP with CLIP and AltCLIP (Table 1) implicitly serves as an ablation for the effectiveness of the bilingual continuous pre-training and vocabulary expansion strategies on the CLIP text encoder. The superior performance of Our-CLIP suggests these modifications are highly effective.
The analysis of Latent Consistency Models (LCM) (Figure 5) can be seen as a parameter analysis regarding the number of sampling steps. It shows that 8 steps are significantly better than 4 steps, and 4 steps are better than 1 step, directly illustrating the impact of this hyper-parameter on image quality for accelerated generation.
7. Conclusion & Reflections
7.1. Conclusion Summary
This research successfully introduces Taiyi-Diffusion-XL, a significant advancement in bilingual (Chinese and English) text-to-image generation. By implementing a novel bilingual continuous pre-training approach that efficiently expands CLIP's vocabulary and position encoding, the authors have developed an Our-CLIP model that excels in bilingual image-text retrieval. This enhanced CLIP model then serves as the foundation for Taiyi-Diffusion-XL, which further benefits from Large Vision-Language Models (LVLMs) to enrich text prompts, leading to higher visual quality and better semantic alignment in generated images. Empirical results quantitatively demonstrate Taiyi-Diffusion-XL's superior performance over previous open-source models across both English and Chinese datasets, while qualitative evaluations highlight its aesthetic quality comparable to commercial tools like Midjourney for specific styles. The project also contributes by open-sourcing the model, fostering future research and collaboration in diverse linguistic contexts within multimodal AI.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation:
- Data Quantity, Quality, and Diversity: The performance gap between Taiyi-XL and state-of-the-art commercial models such as DALL-E 3 and Midjourney is attributed primarily to differences in the quantity, quality, and diversity of the image-text data used for training. Taiyi-XL was trained exclusively on copyright-compliant data, which, while ethically sound, may limit the scale and breadth of the training data compared with what is potentially available to commercial entities.

Based on these limitations and the paper's overall direction, potential future research directions could include:

- Data Curation Strategies: Exploring novel methods for curating larger, higher-quality, and more diverse copyright-compliant bilingual datasets to further close the gap with commercial models. This could involve more advanced LVLM-driven synthetic data generation or improved filtering techniques for public data.
- Advanced LVLM Integration: Further integrating LVLMs for more complex prompt understanding, style transfer, or even interactive image generation, moving beyond mere caption enrichment.
- Cross-Cultural Nuances: Enhancing the model's understanding of cross-cultural nuances and less common linguistic expressions to generate images that are not only grammatically correct but also culturally sensitive and relevant.
- Efficiency for Low-Resource Languages: Applying similar bilingual expansion techniques to other low-resource languages to broaden the scope of multilingual text-to-image generation.
- Ethical AI and Bias Mitigation: Investigating and mitigating potential biases that might arise from training data in bilingual contexts, ensuring fair and inclusive image generation.
7.3. Personal Insights & Critique
Taiyi-Diffusion-XL represents a commendable and practical step forward in making advanced text-to-image generation truly accessible and effective for bilingual users, particularly those working with Chinese. The core innovation of extending existing powerful models through continuous pre-training and careful vocabulary expansion, rather than component replacement, is a robust and resource-efficient strategy. This preserves valuable pre-trained knowledge, which is crucial for achieving high performance in multiple languages. The integration of Large Vision-Language Models (LVLMs) for prompt enrichment is also a particularly insightful move. It addresses a common bottleneck in T2I models – the quality of the input prompt – by automatically enhancing it, thereby potentially improving user experience and generation quality without requiring users to craft overly elaborate prompts.
A potential area for deeper exploration, not explicitly detailed in the paper, is the specific technical implementation of the vocabulary and position encoding expansion. While mentioned in the abstract, a more detailed breakdown of how Chinese characters were integrated into CLIP's tokenizer (e.g., character-level, subword-level, or a hybrid approach) and how the absolute position encoding was adapted would be beneficial for other researchers looking to replicate or extend this work to other languages.
The acknowledged limitation regarding training data quantity, quality, and diversity is a common challenge for open-source models competing with well-funded commercial counterparts. This highlights an unverified assumption in the open-source community: that models can quickly catch up with commercial ones solely through architectural improvements. Often, the sheer scale and proprietary nature of commercial training datasets provide an almost insurmountable advantage. Taiyi-Diffusion-XL's focus on copyright-compliant data is highly responsible but underscores the broader ethical and practical dilemma in AI development regarding data access.
The methods and conclusions of this paper could be transferred to other domains requiring multilingual multimodal understanding. For instance, similar techniques could be used to build bilingual video generation models, or to improve multilingual visual question-answering systems. The LVLM-driven prompt enrichment could also inspire better data annotation strategies for multimodal datasets. Overall, Taiyi-Diffusion-XL is not just an improvement for Chinese T2I; it offers a generalizable blueprint for building robust, genuinely bilingual multimodal AI systems.