
Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation

Published: 03/29/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

CLoRA is a training-free test-time method that refines attention maps of multiple LoRA models to fuse semantic features, overcoming attention overlap and concept omission issues, significantly improving multi-concept image generation accuracy and quality.

Abstract

Low-Rank Adaptation (LoRA) has emerged as a powerful and popular technique for personalization, enabling efficient adaptation of pre-trained image generation models for specific tasks without comprehensive retraining. While employing individual pre-trained LoRA models excels at representing single concepts, such as those representing a specific dog or a cat, utilizing multiple LoRA models to capture a variety of concepts in a single image still poses a significant challenge. Existing methods often fall short, primarily because the attention mechanisms within different LoRA models overlap, leading to scenarios where one concept may be completely ignored (e.g., omitting the dog) or where concepts are incorrectly combined (e.g., producing an image of two cats instead of one cat and one dog). We introduce CLoRA, a training-free approach that addresses these limitations by updating the attention maps of multiple LoRA models at test-time, and leveraging the attention maps to create semantic masks for fusing latent representations. This enables the generation of composite images that accurately reflect the characteristics of each LoRA. Our comprehensive qualitative and quantitative evaluations demonstrate that CLoRA significantly outperforms existing methods in multi-concept image generation using LoRAs.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation

1.2. Authors

  • Tuna Han Salih Meral
  • Enis Simsar
  • Federico Tombari
  • Pinar Yanardag

Affiliations:

  • Virginia Tech (Tuna Han Salih Meral, Pinar Yanardag)
  • ETH Zürich (Enis Simsar)
  • TUM (Federico Tombari)
  • Google (Federico Tombari)

1.3. Journal/Conference

The paper is published as a preprint on arXiv. arXiv is an open-access repository for scholarly articles, primarily in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It allows researchers to share their work rapidly, often before peer review and formal publication in a journal or conference. This means the work has been made public but may not yet have undergone the full peer-review process typical of established academic venues, though it is widely recognized and used within the research community.

1.4. Publication Year

2024 (specifically, published at UTC: 2024-03-28T18:58:43.000Z, version 2).

1.5. Abstract

The paper addresses the challenge of composing multiple Low-Rank Adaptation (LoRA) models for image generation. While individual LoRA models excel at representing single concepts (e.g., a specific dog or cat), combining them to create images with multiple distinct concepts is difficult. Existing methods fail because the attention mechanisms of different LoRA models often overlap, leading to concepts being ignored or incorrectly combined. The authors introduce CLoRA, a training-free approach that tackles these issues by updating the attention maps of multiple LoRA models at test-time. It then leverages these refined attention maps to create semantic masks, which are used to fuse latent representations. This process enables the generation of composite images that accurately reflect the characteristics of each LoRA. Extensive qualitative and quantitative evaluations demonstrate that CLoRA significantly outperforms existing methods in multi-concept image generation using LoRAs.

Original Source: https://arxiv.org/abs/2403.19776v2

PDF: https://arxiv.org/pdf/2403.19776v2.pdf

Publication Status: This is a preprint, indicating it has been made publicly available on arXiv but may not yet have completed peer review for formal publication.

2. Executive Summary

2.1. Background & Motivation

The proliferation of diffusion text-to-image models (e.g., Stable Diffusion, DALL-E 2) has revolutionized image generation, leading to widespread applications beyond mere image creation, including editing, inpainting, and object detection. A critical aspect of these models is personalization, allowing users to tailor image generation to specific preferences. Low-Rank Adaptation (LoRA) has emerged as a powerful and popular technique for this, enabling efficient adaptation of pre-trained diffusion models for specific tasks (like generating a particular character or style) without extensive retraining.

However, a significant challenge arises when attempting to combine multiple pre-trained LoRA models to create a single image with diverse, distinct concepts. For instance, a user might have one LoRA for a specific cat and another for a specific dog, and want to generate an image featuring both. Existing methods often fall short in this multi-concept composition task due to two primary issues:

  1. Attention Overlap: The attention mechanisms within different LoRA models, which dictate where the model focuses its generation efforts, often overlap. This can lead to one LoRA concept dominating or skewing the output, potentially ignoring other intended concepts entirely (e.g., generating two cats instead of a cat and a dog).

  2. Attribute Binding: Features meant to represent distinct subjects can blend indistinctly, failing to maintain the integrity and recognizability of each concept. For example, a dog LoRA might adopt plushie features from a bunny LoRA instead of maintaining its own distinct characteristics.

    These issues limit the compositionality of LoRA models, making it difficult to leverage their full potential for blending diverse elements into cohesive visual narratives. The current field lacks a robust, training-free, and control-free method for effectively composing multiple LoRA models while preserving the distinct identity of each concept.

2.2. Main Contributions / Findings

The paper introduces CLoRA, a novel approach designed to overcome the limitations of multi-concept image generation using LoRA models. Its primary contributions and findings are:

  • Training-Free and Control-Free Approach: CLoRA is a training-free method that integrates multiple content and style LoRAs simultaneously at test-time, without requiring any additional training of new models or specifying explicit controls (like bounding boxes or segmentation masks), unlike many existing methods.
  • Contrastive Objective for Attention Map Refinement: It proposes a contrastive objective to dynamically update attention maps during test-time. This objective encourages attention maps associated with distinct LoRA concepts to repel each other (preventing overlap) while attracting attention maps related to the same concept (ensuring coherence). This directly addresses the attention overlap and attribute binding issues.
  • Masked Latent Fusion: CLoRA leverages the refined attention maps to create semantic masks. These masks are then used to selectively fuse latent representations generated by different LoRA models, ensuring that each LoRA influences only the relevant regions of the image.
  • Compatibility with Community LoRAs: Unlike methods that require specialized LoRA variants (e.g., EDLoRAs), CLoRA can directly utilize off-the-shelf, widely available LoRA models (such as those found on Civit.ai) in a plug-and-play manner.
  • Superior Qualitative and Quantitative Performance: Comprehensive evaluations demonstrate that CLoRA significantly outperforms existing baselines across various metrics.
    • Qualitatively: It generates compositions that faithfully represent multiple LoRA concepts without omissions or incorrect blending (e.g., correctly generating a cat and a dog instead of two cats).
    • Quantitatively: It achieves higher DINO-based similarity (reflecting visual content alignment), CLIP-I (image-to-image similarity), and CLIP-T (image-to-text similarity) scores, indicating better alignment with both visual references and textual prompts.
    • User Study: A user study confirmed CLoRA's superiority in faithfully representing LoRA concepts in composite images.
  • Efficiency and Scalability: The method is highly efficient in both memory usage and runtime, capable of scaling to handle multiple LoRA models (up to eight LoRAs were tested with predictable increases in computational demands).
  • Publicly Shared Resources: The authors commit to publicly sharing their source code and LoRA collection to promote transparency, reproducibility, and further research.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand CLoRA, a grasp of several fundamental concepts in generative AI, particularly diffusion models and attention mechanisms, is essential.

3.1.1. Diffusion Models

Diffusion models are a class of generative models that have shown remarkable success in generating high-quality, diverse images from noise. The core idea is inspired by non-equilibrium thermodynamics.

  • Forward Diffusion Process: This process gradually adds Gaussian noise to an input data point (e.g., an image) over a series of time steps $t$. Starting from an original data point $x_0$, at each step $t$ a small amount of noise is added, transforming $x_{t-1}$ into $x_t$. After many steps, $x_T$ becomes pure Gaussian noise.

  • Reverse Denoising Process: This is the generative part. The model learns to reverse the forward process. Given a noisy input $x_t$, a neural network (often a UNet) is trained to predict and remove the noise, step by step, gradually transforming pure noise back into a coherent image. The objective is to learn the conditional probability distribution $p(x_{t-1} \mid x_t)$, effectively denoising the image.

  • Latent Space: Models like Stable Diffusion operate in a latent space. An autoencoder consists of an encoder ($\mathcal{E}$) that compresses a high-dimensional image $x$ into a lower-dimensional latent code $z = \mathcal{E}(x)$, and a decoder ($\mathcal{D}$) that reconstructs the image from the latent code, such that $\mathcal{D}(z) \approx x$. The diffusion process (adding and removing noise) happens entirely within this compressed latent space, making it more computationally efficient.

  • Denoiser ($\epsilon_\theta$): The UNet-based neural network responsible for predicting the noise that needs to be removed at each step. Its parameters are denoted by $\theta$.

  • Training Objective: The model is trained to minimize the difference between the actual noise added ($\epsilon$) and the noise predicted by the denoiser ($\epsilon_\theta$). The common objective for diffusion models is:

    $$\mathcal{L} = \mathbb{E}_{z_t,\, \epsilon \sim \mathcal{N}(0, \mathrm{I}),\, c(\mathcal{P}),\, t}\left[ \left\| \epsilon - \epsilon_\theta\big(z_t, c(\mathcal{P}), t\big) \right\|^2 \right]$$

    • $\mathcal{L}$: The loss function being minimized.
    • $\mathbb{E}$: Expectation (average) over the noisy latents, noise samples, conditions, and timesteps.
    • $z_t$: The noisy latent code at timestep $t$.
    • $\epsilon \sim \mathcal{N}(0, \mathrm{I})$: The actual Gaussian noise sampled from a standard normal distribution (mean 0, identity covariance) that was added at timestep $t$.
    • $c(\mathcal{P})$: The conditional information derived from the text prompt $\mathcal{P}$. This guides the generation process to match the text.
    • $t$: The current timestep in the diffusion process.
    • $\epsilon_\theta(z_t, c(\mathcal{P}), t)$: The denoiser model (e.g., UNet) with parameters $\theta$, which predicts the noise from the noisy latent code $z_t$, conditioned on $c(\mathcal{P})$ and $t$.
    • $\| \cdot \|^2$: The squared L2 norm, measuring the squared difference between the true noise and the predicted noise.
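
To make the objective concrete, the following is a minimal PyTorch sketch of one training step under this loss; `denoiser`, `text_cond`, and `scheduler_alphas` are illustrative placeholders rather than the actual Stable Diffusion training code.

```python
import torch

def diffusion_training_loss(denoiser, z0, text_cond, scheduler_alphas):
    """One training step of the denoising objective (minimal sketch).

    denoiser(z_t, text_cond, t) is assumed to predict the added noise;
    scheduler_alphas holds the cumulative alpha-bar values of the noise schedule.
    """
    b = z0.shape[0]
    t = torch.randint(0, len(scheduler_alphas), (b,), device=z0.device)  # random timestep per sample
    eps = torch.randn_like(z0)                                           # true Gaussian noise epsilon
    alpha_bar = scheduler_alphas[t].view(b, 1, 1, 1)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * eps           # forward diffusion q(z_t | z_0)
    eps_pred = denoiser(z_t, text_cond, t)                               # epsilon_theta(z_t, c(P), t)
    return torch.mean((eps - eps_pred) ** 2)                             # || eps - eps_theta ||^2
```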

3.1.2. Low-Rank Adaptation (LoRA)

LoRA is a parameter-efficient fine-tuning technique designed to adapt large pre-trained models (like diffusion models or Large Language Models) to specific downstream tasks or concepts without modifying all of their original weights.

  • Concept: Instead of fine-tuning the entire weight matrix $W$ of a pre-trained model (which can be massive), LoRA introduces a small, low-rank update $W_{LoRA}$ that is added to the original weight matrix. $W_{LoRA}$ is decomposed into two smaller matrices, $W_{in}$ and $W_{out}$, such that $W_{LoRA} = W_{in}W_{out}$.
  • Mechanism: When LoRA is applied, the original weight matrix $W$ of the pre-trained model is frozen. Only the newly introduced low-rank matrices ($W_{in}$ and $W_{out}$) are trained. During inference, the updated weights $W'$ are calculated as $W' = W + W_{in}W_{out}$ (a minimal sketch follows this list).
  • Benefits:
    • Efficiency: Drastically reduces the number of trainable parameters, leading to faster training and less memory usage.
    • Portability: The LoRA adaptation (the $W_{in}$ and $W_{out}$ matrices) is very small (e.g., 10-100 MB) compared to the base model (several GB), making it easy to store and share.
    • Personalization: Enables fine-tuning models for specific subjects, styles, or concepts with minimal computational resources.
  • Application in Diffusion Models: In Stable Diffusion, LoRA is primarily applied to the cross-attention layers within the UNet, as these layers are crucial for connecting the text prompt (conditioning) with the visual features being generated.
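
As a rough illustration of this mechanism, the sketch below wraps a frozen linear layer with trainable low-rank factors; it is a simplified stand-in, not the paper's or any specific library's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer: W' = W + W_in @ W_out."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze the pre-trained weight W
        self.W_in = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.W_out = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = scale                               # only W_in and W_out are trained

    def forward(self, x):
        # base path uses the frozen W; the low-rank path adds W_in @ W_out
        return self.base(x) + self.scale * (x @ self.W_in @ self.W_out)
```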

3.1.3. Cross-Attention Mechanisms

Attention mechanisms are fundamental in transformer architectures, which are widely used in diffusion models. They allow the model to focus on specific parts of the input when processing another part. Cross-attention specifically links different modalities, such as text and image features.

  • Queries (Q), Keys (K), Values (V): In cross-attention, one modality (e.g., the latent representation from the UNet) provides the Queries (Q), while another modality (e.g., the text embedding from CLIP) provides the Keys (K) and Values (V).
  • Calculation: The attention score measures the similarity between a Query and all Keys. These scores are then normalized (typically with a softmax function) and used to weight the Values. The output is a weighted sum of Values, allowing the Query to "attend" to the most relevant Values.
  • Formula (General Attention): The general Scaled Dot-Product Attention formula is:

    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

    • $Q$: Query matrix.
    • $K$: Key matrix.
    • $V$: Value matrix.
    • $d_k$: Dimensionality of the Key vectors, used for scaling to prevent very large dot products that push the softmax into regions with tiny gradients.
    • $QK^T$: Dot product between Queries and Keys, measuring similarity.
    • $\mathrm{softmax}(\cdot)$: Normalization function, converting scores into probabilities that sum to 1.
  • In Stable Diffusion: The text embedding sequence $c$ (from CLIP) is linearly projected into Keys (K) and Values (V). The UNet's intermediate representation is projected into Queries (Q). The resulting attention maps (specifically, $A_t = \mathrm{Softmax}(QK^T/\sqrt{d})$ as mentioned in the paper) indicate which parts of the text prompt are most relevant to generating specific regions of the image at each diffusion step. The paper specifically uses $16 \times 16$ attention maps, as they are known to capture semantically meaningful information.
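
The sketch below illustrates how such token-to-pixel attention maps can be computed from projected image features and text embeddings; the projection matrices and shapes are illustrative assumptions, not the exact UNet internals.

```python
import torch
import torch.nn.functional as F

def cross_attention_maps(image_feats, text_embeds, W_q, W_k):
    """Sketch of the attention maps A_t = softmax(Q K^T / sqrt(d_k)).

    image_feats: (hw, d_model) flattened spatial features (e.g., 16x16 = 256 positions)
    text_embeds: (n_tokens, d_text) CLIP token embeddings
    W_q, W_k:    projection matrices mapping both modalities to dimension d_k
    """
    Q = image_feats @ W_q                  # (hw, d_k) queries from image features
    K = text_embeds @ W_k                  # (n_tokens, d_k) keys from text tokens
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5          # (hw, n_tokens) similarity of every position to every token
    return F.softmax(scores, dim=-1)       # each spatial position distributes attention over tokens
```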

3.1.4. Contrastive Learning (InfoNCE Loss)

Contrastive learning is a powerful technique for learning representations by pulling similar (positive) data points closer together in an embedding space and pushing dissimilar (negative) data points farther apart.

  • Core Idea: Given an anchor data point, contrastive learning aims to make its embedding similar to the embedding of its positive pairs and dissimilar to the embeddings of its negative pairs.

  • InfoNCE (Noise-Contrastive Estimation) Loss: Also known as NT-Xent (Normalized Temperature-scaled Cross-Entropy Loss), it is a widely used contrastive loss function. It treats the learning problem as a classification task where the goal is to classify the positive sample among a set of negative samples.

    For a given anchor $x^j$ and its positive pair $x^{j^+}$, along with $N$ negative pairs $x^{j_1^-}, \dots, x^{j_N^-}$, the InfoNCE loss for $x^j$ is typically expressed as:

    $$\mathcal{L} = -\log \frac{\exp\big(\mathrm{sim}(f(x^j), f(x^{j^+}))/\tau\big)}{\sum_{n \in \{j^+, j_1^-, \dots, j_N^-\}} \exp\big(\mathrm{sim}(f(x^j), f(x^n))/\tau\big)}$$

    • $f(\cdot)$: An encoder function that maps input data points to embedding vectors.
    • $\mathrm{sim}(u, v)$: A similarity function (e.g., cosine similarity) that measures the likeness between two embedding vectors $u$ and $v$: $\mathrm{sim}(\boldsymbol{u}, \boldsymbol{v}) = \boldsymbol{u}^T \cdot \boldsymbol{v} / \Vert \boldsymbol{u} \Vert \Vert \boldsymbol{v} \Vert$.
      • $\boldsymbol{u}^T \cdot \boldsymbol{v}$: Dot product of vectors $\boldsymbol{u}$ and $\boldsymbol{v}$.
      • $\Vert \boldsymbol{u} \Vert$: L2 norm (magnitude) of vector $\boldsymbol{u}$.
    • $\tau$: A temperature parameter that scales the arguments of the softmax. A smaller $\tau$ makes the distribution sharper, emphasizing the hardest negatives more.
    • The numerator measures the similarity between the anchor and its positive pair.
    • The denominator sums the similarities between the anchor and all other samples (one positive and $N$ negatives) in the batch.
    • Minimizing this loss encourages the similarity of positive pairs to be high relative to negative pairs.
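
A minimal PyTorch sketch of this loss for a single anchor, assuming precomputed embedding vectors, is shown below.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.5):
    """Minimal InfoNCE sketch: anchor/positive are (d,) embeddings, negatives is (N, d)."""
    pos = F.cosine_similarity(anchor, positive, dim=0) / tau                  # sim(anchor, positive) / tau
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / tau    # sim(anchor, negative_n) / tau
    logits = torch.cat([pos.unsqueeze(0), neg])                               # positive first, then N negatives
    return -torch.log_softmax(logits, dim=0)[0]                               # -log p(positive | all candidates)
```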

3.2. Previous Works

The paper positions CLoRA within several streams of research:

3.2.1. Attention-Based Methods for Improved Fidelity

  • Problem: Text-to-image diffusion models often struggle with complex prompts, leading to neglected tokens (concepts ignored) or attribute binding issues (features blending incorrectly).
  • Prior Approaches:
    • A-Star [1] and DenseDiffusion [29]: Refine attention during image generation.
    • [5] (Attend-and-Excite): Addresses neglected tokens by boosting attention for under-represented concepts.
    • [34]: Proposes separate objective functions for missing objects and incorrect attribute binding.
    • [55] and [39]: Use bounding boxes as additional constraints to manage multiple subjects in specific areas.
    • [37] (Conform): Uses a contrastive approach on a single diffusion model to improve fidelity.
  • CLoRA's Differentiation: While these methods address attention overlap and attribute binding within a single diffusion model, CLoRA uniquely tackles these issues across multiple LoRA models, each fine-tuned for distinct objects or attributes. CLoRA also uses a contrastive approach but applies it to multiple diffusion models (i.e., LoRAs).

3.2.2. Personalized Image Generation

  • Evolution: This field has progressed from image-based style transfer ([11, 19]) to GAN-based approaches ([8, 13, 26, 27, 33]) and now heavily relies on diffusion models ([20, 44, 52]).
  • Personalization Techniques:
    • Textual Inversion [12] & DreamBooth [46]: Focus on learning specific subject representations from a few images.
    • LoRA [47] & StyleDrop [51]: Optimize for style personalization.
    • Custom Diffusion [32]: Aims for multi-concept learning but faces challenges with joint training and style disentanglement.
    • [58]: Uses attention calibration to disentangle multiple concepts from a single image for personalization.
  • CLoRA's Role: CLoRA builds upon LoRA's personalization capabilities by enabling the composition of multiple personalized elements, which LoRA itself was not originally designed for in a multi-concept setting.

3.2.3. Merging Multiple LoRA Models

This is the most direct area of comparison for CLoRA.

  • Weighted Summation [47]: A simple method of combining LoRAs by summing their weights with scalar coefficients.
    • Limitation: Often yields suboptimal results, frequently leading to one concept being ignored or attributes blending incorrectly.
  • Mix-of-Show [16]: Requires retraining specific EDLoRA (Embedding-Decomposed LoRAs) variants for each concept and depends on ControlNet conditions (e.g., regions).
    • Limitation: Not compatible with existing community LoRAs (e.g., Civit.ai models) and restricts condition-free generation.
  • [54] (Mole): Proposes composing LoRAs through a mixture of experts, but requires learnable gating functions trained for each domain.
    • Limitation: Complexity and training overhead for new domains.
  • Test-time LoRA Composition (Multi LoRA Composite and Switch by [59]): These are inference-time methods.
    • Limitation: Do not operate on attention maps directly, leading to unsatisfactory results in complex multi-concept scenarios.
  • ZipLoRA [49]: Synthesizes a new LoRA model from a style and a content LoRA.
    • Limitation: Falls short in handling multiple content LoRAs simultaneously.
  • OMG [30]: Utilizes off-the-shelf segmentation methods to isolate subjects during generation.
    • Limitation: Performance is heavily dependent on the accuracy of the underlying segmentation model and the multi-object generation fidelity of the base diffusion model (if the base model doesn't generate an object, segmentation won't help).
  • [56] (LoRA-Composer): A training-free approach tackling concept confusion using user-provided bounding boxes for injection and isolation constraints.
    • Limitation: Requires explicit user controls (bounding boxes), limiting ease of use, and sometimes restricted to specific LoRA types (ED-LoRA). Also has higher memory demands.
  • Orthogonal Adaptation [40]: Reduces interference by separating attributes across LoRAs and merges them via a weighted average.
    • Limitation: Complicates training by modifying fine-tuning and requiring original data. Can also require additional conditions (sketches, key points) to work effectively.
  • LoRACLR [50]: Merges pre-trained LoRA models using a contrastive objective for scalable multi-concept generation with minimal interference.
    • Limitation: Relies on user-provided control conditions during inference.

3.3. Technological Evolution

The field has evolved from generic text-to-image generation to highly personalized generation. Early methods focused on style transfer or subject inversion. The advent of diffusion models significantly improved image quality and control. LoRA brought efficiency to personalization, allowing many specific LoRAs to be trained and shared. The current challenge, addressed by this paper, is to unlock the full compositional power of these individual LoRAs by enabling them to work together harmoniously in a single image. CLoRA represents a step forward by providing a training-free, control-free, and highly compatible solution for this multi-concept composition problem, bridging the gap between single-model attention refinement and effective LoRA model composition.

3.4. Differentiation Analysis

Compared to the main methods in related work, CLoRA offers several core differences and innovations:

  • Training-Free and Control-Free: A major distinction is that CLoRA operates entirely at test-time (inference), without needing any additional training or fine-tuning for multi-concept composition. Crucially, it also does not require explicit user controls such as bounding boxes, segmentation masks, or ControlNet conditions, which are often necessary for baselines like Mix-of-Show, OMG, LoRA-Composer, or Orthogonal Adaptation to achieve separation. This makes CLoRA significantly more user-friendly and plug-and-play.

  • Direct Attention Map Manipulation for Multiple LoRAs: While some methods (e.g., Attend-and-Excite, A-Star) manipulate attention maps for fidelity within a single diffusion model, CLoRA specifically targets attention maps to resolve overlap and attribute binding across multiple distinct LoRA models. This is a novel application of attention refinement to the multi-LoRA composition problem.

  • Contrastive Objective for Disentanglement: CLoRA uses a contrastive objective not just for representation learning (like LoRACLR or Conform), but specifically to disentangle the attention maps of different LoRA concepts during the diffusion process. This proactive separation ensures that each LoRA maintains its distinct identity and focuses on its intended subject.

  • Masked Latent Fusion: The combination of attention map refinement with semantic masking for latent fusion is key. This ensures that the influence of each LoRA is spatially localized to its intended object, preventing bleed-over or mixing of features that are not desired.

  • Compatibility with Off-the-Shelf LoRAs: Unlike methods like Mix-of-Show (which requires EDLoRAs) or LoRA-Composer (which can be restricted to specific types), CLoRA is fully compatible with standard, pre-trained LoRA models widely available in the community (e.g., on Civit.ai). This broad compatibility significantly enhances its practical applicability.

  • Scalability: CLoRA demonstrates robust performance and predictable scaling with an increasing number of LoRAs, maintaining high fidelity even with 3-5 LoRA models without the performance degradation or instability seen in some merging approaches.

    In essence, CLoRA innovates by offering a practical, efficient, and robust solution for multi-concept LoRA composition that bypasses the need for re-training or auxiliary controls, making it highly accessible and effective for users to combine diverse personalized elements.

4. Methodology

4.1. Principles

The core idea of CLoRA is to enable harmonious composition of multiple Low-Rank Adaptation (LoRA) models at test-time (inference) without requiring any additional training or explicit user controls. It addresses the critical problems of attention overlap and attribute binding that arise when combining LoRAs. The method operates on two main principles:

  1. Contrastive Attention Refinement: During the diffusion process, CLoRA uses a contrastive objective to dynamically adjust the cross-attention maps generated by the diffusion model. This objective encourages attention maps corresponding to different LoRA concepts to become distinct (repel each other), preventing them from focusing on the same regions or confusing attributes. Conversely, attention maps related to the same LoRA concept (e.g., the LoRA itself and its target subject token) are encouraged to be similar (attract each other), ensuring the LoRA's influence is correctly aligned with its intended subject.

  2. Masked Latent Fusion: After refining the attention maps, CLoRA utilizes them to create semantic masks. These masks delineate the specific regions of the image where each LoRA concept should exert its influence. This allows for a targeted fusion of latent representations—combining the latent codes from the base diffusion model with those specifically conditioned by individual LoRAs—ensuring that each LoRA contributes only to its designated part of the image, thereby maintaining concept integrity and avoiding unwanted blending of features.

    By combining these principles, CLoRA effectively disentangles the contributions of multiple LoRAs, enabling the generation of composite images that accurately reflect all specified concepts.

4.2. Core Methodology In-depth (Layer by Layer)

The CLoRA methodology is implemented during the inference phase of a Stable Diffusion model. It involves several key steps: prompt variations, concept grouping, a contrastive objective for latent optimization, and masked latent fusion. The overall process is visualized in Figure 2.

The following figure (Figure 2 from the original paper) illustrates the CLoRA workflow:

Figure 2. Overview of CLoRA, a training-free, test-time approach for composing multiple LoRA models. The method accepts a user-provided text prompt, such as 'An $L_1$ cat and an $L_2$ dog,' along with the corresponding LoRAs $L_1$ and $L_2$. CLoRA applies test-time optimization to the attention maps with a contrastive objective to resolve attention overlap and attribute binding, and finally fuses the latents via semantic masks to generate the multi-concept image.

4.2.1. Prompt Variations and Text Embeddings

Given a user-provided text prompt that includes multiple LoRA models (e.g., 'An $L_1$ cat and an $L_2$ dog', where $L_1$ and $L_2$ refer to specific LoRA models, and cat and dog are the concepts $S_1$ and $S_2$ they represent), CLoRA generates three variations of this prompt. These variations are crucial for forming positive and negative pairs in the contrastive objective.

  1. Original Prompt (Base): The full prompt as provided by the user, e.g., 'An $S_1$ and an $S_2$'. This prompt applies both LoRA models implicitly or through the base model's understanding.

  2. $L_1$-applied Prompt: A prompt specifically designed to activate only LoRA $L_1$ for its concept $S_1$, while LoRA $L_2$ is not explicitly applied, e.g., 'An $L_1$ $S_1$ and an $S_2$'.

  3. $L_2$-applied Prompt: Similarly, a prompt activating only LoRA $L_2$ for its concept $S_2$, e.g., 'An $S_1$ and an $L_2$ $S_2$'.

    These prompt variations are then used to generate corresponding text embeddings using the CLIP model (or a fine-tuned text encoder if the LoRA training included text encoder fine-tuning). These text embeddings serve as the conditional information $c(\mathcal{P})$ for the diffusion model and are crucial for calculating cross-attention maps.
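
The snippet below sketches how the three prompt variants could be built and encoded with the Stable Diffusion v1.5 text encoder via Hugging Face transformers; the trigger strings `<lora1>` and `<lora2>` are hypothetical placeholders for whatever tokens the LoRAs were actually trained with.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Hypothetical trigger words for each LoRA; real LoRAs define their own tokens.
L1, S1 = "<lora1>", "cat"
L2, S2 = "<lora2>", "dog"

prompts = [
    f"An {S1} and an {S2}",        # base prompt
    f"An {L1} {S1} and an {S2}",   # L1-applied variant
    f"An {S1} and an {L2} {S2}",   # L2-applied variant
]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(prompts, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeds = text_encoder(tokens.input_ids).last_hidden_state  # (3, 77, 768) conditioning c(P)
```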

4.2.2. Concept Grouping for Cross-Attention Maps

During the image generation process, the Stable Diffusion model produces cross-attention maps at each diffusion step. These maps indicate the spatial relationship between text tokens and image regions. To prepare for the contrastive objective, CLoRA categorizes these cross-attention maps into concept groups.

For a scenario with two LoRA models ($L_1$ for concept $S_1$ and $L_2$ for concept $S_2$):

  • Concept Group $C_1$ (for $S_1$ and $L_1$): This group includes cross-attention maps that should be associated with $S_1$ and $L_1$. Specifically, it contains:
    • The cross-attention map for $S_1$ from the original prompt.
    • The cross-attention maps for LoRA $L_1$ and $S_1$ from the $L_1$-applied prompt.
    • The cross-attention map for $S_1$ from the $L_2$-applied prompt.

  • Concept Group $C_2$ (for $S_2$ and $L_2$): Similarly, this group includes cross-attention maps that should be associated with $S_2$ and $L_2$. It contains:
    • The cross-attention map for $S_2$ from the original prompt.
    • The cross-attention map for $S_2$ from the $L_1$-applied prompt.
    • The cross-attention maps for LoRA $L_2$ and $S_2$ from the $L_2$-applied prompt.

    This grouping ensures that the diffusion process maintains a consistent understanding of each concept and helps prevent attention overlap by clearly defining which attention maps should be related and which should be distinct.

4.2.3. CLoRA - Contrastive Objective

The core of CLoRA's attention refinement is a contrastive objective applied during inference, utilizing the InfoNCE loss due to its fast convergence properties. This loss function operates on the grouped cross-attention maps.

The objective is designed to:

  • Attract Positive Pairs: Make attention maps within the same concept group (e.g., all attention maps related to cat and L1) similar to each other. This aligns the LoRA's influence with its intended subject.

  • Repel Negative Pairs: Make attention maps from different concept groups (e.g., attention maps for cat vs. attention maps for dog) dissimilar, thereby resolving attention overlap and attribute binding issues.

    The loss function for a single positive pair is expressed as:

    $$\mathcal{L} = -\log \frac{\exp\big(\mathrm{sim}(A^j, A^{j^+})/\tau\big)}{\sum_{n \in \{j^+, j_1^-, \dots, j_N^-\}} \exp\big(\mathrm{sim}(A^j, A^n)/\tau\big)}$$

    Where:

  • $\mathcal{L}$: The InfoNCE loss for a given anchor attention map $A^j$. The overall InfoNCE loss is averaged across all positive pairs.

  • $A^j$: An anchor cross-attention map from one of the concept groups (e.g., the attention map for 'cat' from the original prompt).

  • $A^{j^+}$: A positive cross-attention map, which is another attention map belonging to the same concept group as $A^j$ (e.g., the attention map for '$L_1$' from the $L_1$-applied prompt, if $A^j$ was for 'cat' from the original prompt).

  • $A^n$: A negative cross-attention map, which is an attention map from a different concept group than $A^j$ (e.g., the attention map for 'dog' from the original prompt).

  • $N$: The number of negative pairs included in the denominator for $A^j$.

  • $\mathrm{sim}(u, v)$: The cosine similarity between two attention maps $u$ and $v$, defined as $\mathrm{sim}(\boldsymbol{u}, \boldsymbol{v}) = \boldsymbol{u}^T \cdot \boldsymbol{v} / \Vert \boldsymbol{u} \Vert \Vert \boldsymbol{v} \Vert$.
    • $\boldsymbol{u}, \boldsymbol{v}$: The vectorized cross-attention maps.
    • $\boldsymbol{u}^T \cdot \boldsymbol{v}$: The dot product between the two vectorized attention maps.
    • $\Vert \boldsymbol{u} \Vert, \Vert \boldsymbol{v} \Vert$: The Euclidean (L2) norms of the vectorized attention maps.

  • $\tau$: A temperature parameter that controls the sharpness of the softmax distribution. A value of 0.5 is used in the experiments.

  • $\exp(\cdot)$: The exponential function.

    This loss is calculated at specific diffusion steps (e.g., $i \in \{0, 10, 20\}$) to guide the latent optimization.
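A minimal sketch of this objective over grouped attention maps is given below; the grouping structure and averaging follow the description above, but the variable names and shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_attention_loss(groups, tau=0.5):
    """Sketch of the contrastive objective over grouped cross-attention maps.

    groups: list of lists; groups[c] holds flattened (1-D) attention maps belonging
    to concept group c. Maps in the same group are positives for each other; maps
    from other groups are negatives. The loss is averaged over all positive pairs.
    """
    losses = []
    for c, group in enumerate(groups):
        negatives = [a for o, g in enumerate(groups) if o != c for a in g]
        for i, anchor in enumerate(group):
            for j, positive in enumerate(group):
                if i == j:
                    continue
                candidates = torch.stack([positive] + negatives)               # positive first
                sims = F.cosine_similarity(anchor.unsqueeze(0), candidates, dim=1) / tau
                losses.append(-torch.log_softmax(sims, dim=0)[0])              # InfoNCE term
    return torch.stack(losses).mean()
```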

4.2.4. Latent Optimization

The calculated contrastive loss $\mathcal{L}$ is then used to update the latent representation $z_t$ during the diffusion process. This is an iterative optimization step that steers the latent code in a direction that minimizes the contrastive loss, thereby refining the attention maps and guiding the generation towards the desired multi-concept composition.

The latent representation is updated using a gradient descent-like step:

$$z_t' = z_t - \alpha_t \nabla_{z_t} \mathcal{L}$$

Where:

  • $z_t'$: The updated latent representation at timestep $t$.
  • $z_t$: The current latent representation at timestep $t$ before optimization.
  • $\alpha_t$: The learning rate (or step size) at timestep $t$, controlling how much the latent is adjusted based on the gradient.
  • $\nabla_{z_t} \mathcal{L}$: The gradient of the contrastive loss $\mathcal{L}$ with respect to the latent representation $z_t$. This gradient indicates the direction in which $z_t$ should be changed to minimize $\mathcal{L}$.

    This latent optimization is performed for a limited number of early diffusion steps (e.g., up to $i = 25$) to prevent artifacts, as suggested by prior work.
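
The update can be sketched in PyTorch as a single gradient step on the latent; this assumes the latent was tracked with autograd and that the contrastive loss was computed from attention maps depending on it, and is not the authors' exact implementation.

```python
import torch

def latent_update(z_t, loss, alpha_t):
    """One guidance step z_t' = z_t - alpha_t * grad_{z_t}(L) (minimal sketch).

    z_t must have requires_grad=True so the gradient of the contrastive loss
    with respect to the latent can be taken.
    """
    grad = torch.autograd.grad(loss, z_t)[0]
    return (z_t - alpha_t * grad).detach().requires_grad_(True)
```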

4.2.5. Masked Latent Fusion

After the latent optimization step in the diffusion process, CLoRA combines the latent representations to ensure each LoRA's influence is confined to its relevant regions. This is achieved through a masking mechanism based on the refined attention maps.

Steps for Masked Latent Fusion:

  1. Extract LoRA-Specific Attention Maps: For each LoRA (e.g., $L_1$), attention maps are extracted for the relevant tokens from its corresponding LoRA-applied prompt. For $L_1$, this includes attention maps for the tokens '$L_1$' and '$S_1$' from the prompt 'an $L_1$ $S_1$ and an $S_2$'.

  2. Create Binary Masks: A thresholding operation is applied to these attention maps to convert them into binary masks. This process is similar to semantic segmentation.

    $$M[x, y] = \mathbb{I}\left(A[x, y] \geq \lambda \max_{i,j} A[i, j]\right)$$

    Where:
    • $M[x, y]$: The binary mask value at position $(x, y)$. It will be 1 (active) or 0 (inactive).
    • $A[x, y]$: The attention map value at position $(x, y)$ for the corresponding token.
    • $\mathbb{I}(\cdot)$: The indicator function. It outputs 1 if the condition inside the parentheses is true, and 0 otherwise.
    • $\lambda$: A threshold value between 0 and 1. This parameter determines the sensitivity of the mask. Only attention values exceeding $\lambda$ times the maximum attention value in the map will be included in the mask.
    • $\max_{i,j} A[i, j]$: The maximum attention value across all positions $(i, j)$ in the attention map $A$.

  3. Union Operation for Multiple Tokens: If a single LoRA concept is represented by multiple tokens (e.g., both '$L_1$' and '$S_1$' contribute to the cat LoRA), a union operation is performed on their individual binary masks. This ensures that any region receiving significant attention from either token is included in the final mask for that LoRA.

  4. Latent Fusion: The generated latent representations from the base Stable Diffusion model and from the LoRA-conditioned paths are combined using these masks. In regions where LoRA $L_1$'s mask is active, the latent features from $L_1$ are predominantly used, while in regions where $L_2$'s mask is active, $L_2$'s latent features are used. This prevents bleed-over and ensures accurate spatial localization of each LoRA's effect.

    Style LoRA Usage Modes: The masking procedure also offers flexibility for style LoRAs. They can be applied:
  • Globally: Affecting the entire composition by having a mask that covers the whole image.
  • Subject-Specific: Restricted to specific subjects by applying their mask only to those regions, allowing for selective stylization.

    By meticulously integrating these steps, CLoRA achieves a robust and accurate multi-concept image generation capability with LoRA models, effectively addressing the challenges of attention overlap and attribute binding.
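
The following sketch puts the thresholding, mask union, and fusion steps together; it assumes the attention maps have already been resized to the latent resolution, and the threshold `lam` is an illustrative value rather than the paper's setting.

```python
import torch

def binary_mask(attn_map, lam=0.3):
    """M[x, y] = 1 if A[x, y] >= lam * max(A), else 0 (lam is a hypothetical threshold)."""
    return (attn_map >= lam * attn_map.max()).float()

def fuse_latents(z_base, lora_latents, lora_attn_maps, lam=0.3):
    """Sketch of masked latent fusion.

    z_base:         latent from the base model path, shape (C, H, W)
    lora_latents:   list of latents, one per LoRA-conditioned path, same shape
    lora_attn_maps: list of lists of attention maps (H, W) for each LoRA's tokens
    """
    z_out = z_base.clone()
    for z_lora, maps in zip(lora_latents, lora_attn_maps):
        # union of the binary masks of all tokens belonging to this LoRA
        mask = torch.stack([binary_mask(m, lam) for m in maps]).amax(dim=0)
        z_out = mask * z_lora + (1 - mask) * z_out   # LoRA latent only inside its mask
    return z_out
```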

5. Experimental Setup

5.1. Datasets

Due to the absence of standardized benchmarks for multi-LoRA composition, the authors compiled a specialized set of LoRA models and prompts for their evaluations.

  • LoRA Models: A total of 131 pre-trained LoRA models were used, derived from two sources:
    • Custom Collection: 20 unique characters were generated using the "character sheet trick" (a common method in the Stable Diffusion community to train LoRAs for specific characters). A LoRA was trained for each character.
    • CustomConcept101 [32]: This popular dataset, which includes 101 diverse concepts such as plushie bunny, flower, and chair, was also utilized. LoRAs were trained for all 101 concepts.
    • Total: The combination of these sources yielded 131 LoRA models.
  • Text Prompts: 200 text prompts were created, designed to represent various multi-concept composition scenarios. Each prompt specified at least two concepts, each corresponding to a LoRA model.
    • Example Data Samples:
      • 'A plushie bunny and a flower in the forest' (using plushie_bunny and flower_1 LoRAs)
      • 'A cat and a dog in the mountain' (using blackcat and browndog LoRAs)
      • 'A cat watching a garden scene intently from behind a window, eager to explore.' (using blackcat and scene_garden LoRAs)
      • 'A cat playfully batting at a Pikachu toy on the floor of a child's room.' (using blackcat and toy-pikachu1 LoRAs)
    • These prompts typically involve two subjects and a background, or more complex arrangements as shown in the additional results.

    The datasets were chosen specifically to demonstrate CLoRA's ability to combine distinct concepts, including various characters, objects, and scenes, and to evaluate its performance against the challenges of attention overlap and attribute binding in diverse scenarios.

5.2. Evaluation Metrics

The paper employs a combination of DINO-based metrics, CLIP-based similarities, and a user study to quantitatively assess CLoRA's performance.

5.2.1. DINO-based Similarity

DINO (a self-supervised Vision Transformer) provides a hierarchical representation of image content, allowing for detailed analysis of visual similarities. For CLoRA, DINO features are used to compare the generated merged image with reference images generated by individual LoRAs.

To calculate the DINO-based metrics:

  1. Separate reference images are generated for each individual LoRA based on prompt subcomponents (e.g., 'an $L_1$ cat' and 'an $L_2$ flower').
  2. DINO features are extracted for the merged image (generated by CLoRA or baselines) and for each single-LoRA reference output.
  3. Cosine similarity is calculated between the DINO features of the merged image and the corresponding features from each single-LoRA output.

    The cosine similarity metric is defined as:

    $$\mathrm{CosineSimilarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$

  • $\mathbf{A}$, $\mathbf{B}$: Two feature vectors (e.g., DINO embeddings of images).
  • $\mathbf{A} \cdot \mathbf{B}$: The dot product of vectors $\mathbf{A}$ and $\mathbf{B}$.
  • $\|\mathbf{A}\|$, $\|\mathbf{B}\|$: The Euclidean norms (magnitudes) of vectors $\mathbf{A}$ and $\mathbf{B}$.

    Three DINO-based metrics are reported:

  • Average DINO Similarity: Reflects the overall alignment between the merged image and the individual LoRA concepts, averaged across all LoRAs used in the composition. A higher value indicates better overall representation of all concepts.
  • Minimum DINO Similarity: Measures the cosine similarity between the merged image and the least similar LoRA reference output. This metric is crucial for identifying whether any LoRA concept was neglected or poorly represented. A higher minimum indicates that even the least represented concept is still well preserved.
  • Maximum DINO Similarity: Identifies the LoRA reference image whose influence is most represented in the merged image. A higher maximum suggests strong incorporation of at least one LoRA.

    For each LoRA model and composition prompt, 50 reference images are generated for the DINO similarity calculation.
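
Given precomputed DINO feature vectors, the three reported statistics reduce to simple cosine-similarity aggregates, as in this minimal sketch (feature extraction itself is assumed to have been done separately).

```python
import torch
import torch.nn.functional as F

def dino_scores(merged_feat, reference_feats):
    """Min/Avg/Max cosine similarity between a merged image's DINO feature (d,)
    and the features of each single-LoRA reference image (list of (d,) tensors)."""
    sims = torch.stack([F.cosine_similarity(merged_feat, ref, dim=0)
                        for ref in reference_feats])
    return {"min": sims.min().item(),
            "avg": sims.mean().item(),
            "max": sims.max().item()}
```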

5.2.2. CLIP-I (Image-to-Image Similarity)

CLIP (Contrastive Language-Image Pre-training) models learn a joint embedding space for images and text. CLIP-I measures the semantic similarity between two images.

  • Conceptual Definition: Quantifies how semantically similar a generated image is to a reference image, based on CLIP's multimodal understanding.
  • Mathematical Formula: The CLIP-I score is typically the cosine similarity between the CLIP image embeddings of the two images. Let $E_I(I)$ be the CLIP image embedding function.

    $$\mathrm{CLIP\text{-}I}(I_1, I_2) = \mathrm{CosineSimilarity}(E_I(I_1), E_I(I_2))$$

    • $I_1, I_2$: The two images being compared.
    • $E_I(\cdot)$: The CLIP image encoder.
    • $\mathrm{CosineSimilarity}$: As defined above.

5.2.3. CLIP-T (Image-to-Text Similarity)

CLIP-T measures the semantic alignment between a generated image and its guiding text prompt.

  • Conceptual Definition: Evaluates how well the generated image corresponds to the textual description, reflecting the model's ability to follow the prompt.
  • Mathematical Formula: The CLIP-T score is typically the cosine similarity between the CLIP image embedding of the generated image and the CLIP text embedding of the prompt. Let $E_T(T)$ be the CLIP text embedding function.

    $$\mathrm{CLIP\text{-}T}(I, T) = \mathrm{CosineSimilarity}(E_I(I), E_T(T))$$

    • $I$: The generated image.
    • $T$: The text prompt.
    • $E_T(\cdot)$: The CLIP text encoder.
    • $\mathrm{CosineSimilarity}$: As defined above.
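
Both scores can be computed with an off-the-shelf CLIP model; the sketch below uses the Hugging Face CLIPModel with an illustrative checkpoint, which may differ from the exact CLIP variant used in the paper. The images are expected as PIL images.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(generated_img, reference_img, prompt):
    """CLIP-I and CLIP-T as cosine similarities in CLIP's joint embedding space (sketch)."""
    inputs = processor(text=[prompt], images=[generated_img, reference_img],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs.pixel_values)
        txt_emb = model.get_text_features(input_ids=inputs.input_ids,
                                          attention_mask=inputs.attention_mask)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # normalize so dot product = cosine
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    clip_i = (img_emb[0] @ img_emb[1]).item()   # image-to-image similarity
    clip_t = (img_emb[0] @ txt_emb[0]).item()   # image-to-text similarity
    return clip_i, clip_t
```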

5.2.4. User Study

A user study was conducted to gather human perception of the generated image quality and fidelity to the LoRA concepts.

  • Methodology: 50 participants were recruited via the Prolific platform [41]. Each participant was shown four generated images per composition (from different methods) and asked to rate how faithfully each method preserved the concepts represented by the LoRAs.
  • Rating Scale: A Likert scale from 1 to 5 was used, where 1 = "Not faithful" (concepts not faithfully represented) and 5 = "Very faithful" (both concepts faithfully represented). The order of images was randomized per participant.

5.3. Baselines

CLoRA was compared against several existing methods for combining or adapting LoRA models:

  • LoRA-Merge [47]: This method combines LoRAs by performing a weighted linear combination of their parameters. It's a simple, straightforward approach.
  • ZipLoRA [49]: Synthesizes a new LoRA model based on a style and a content LoRA. The paper implies it struggles with multiple content LoRAs.
  • Mix-of-Show [16]: Requires training specific EDLoRA variants for each concept and, in its full form, relies on external controls like ControlNet regions. For fair comparison, the main paper's experiments for Mix-of-Show did not use its additional conditioning to evaluate its core LoRA merging capability more directly.
  • Custom Diffusion [32]: A method for multi-concept customization of text-to-image diffusion models.
  • MultiLoRA [59]: This method has two variants, 'Composite' and 'Switch', designed for test-time LoRA composition. The paper states 'Composite' outperformed 'Switch', so 'Composite' was primarily used for comparison.
  • OMG [30]: Uses off-the-shelf segmentation methods to isolate subjects during generation, meaning its performance is tied to the accuracy of the segmentation model.
  • LoRA-Composer [56]: A training-free approach that addresses concept confusion but requires user-provided bounding boxes for injection and isolation constraints.
  • Orthogonal Adaptation [40]: Aims to reduce interference by separating attributes across LoRAs and merges them via a weighted average. It can require additional user-provided conditions (sketches, key points) for accurate multi-concept compositions.

5.4. Implementation Details

  • Base Model: Stable Diffusion v1.5 (SDv1.5) was used as the underlying text-to-image generation framework.
  • Random Seeds: For each prompt, 10 random seeds were used to generate multiple outputs and assess robustness.
  • Iterations: 50 iterations (presumably diffusion steps) were run per generation.
  • Optimization Schedule: Following Attend-and-Excite [5], latent optimization (using the contrastive objective) was applied only at specific diffusion steps: $i \in \{0, 10, 20\}$. Optimization was stopped after $i = 25$ to prevent the introduction of artifacts.
  • Temperature Parameter ($\tau$): For the contrastive learning objective (Equation 2 in the paper, the InfoNCE loss), the temperature was set to $\tau = 0.5$.
  • Hardware: Image generation was performed on an NVIDIA V100 GPU.
  • Runtime: CLoRA takes approximately 25 seconds to compose two LoRAs. It demonstrated the ability to combine up to eight LoRAs on an NVIDIA H100 GPU.

6. Results & Analysis

6.1. Core Results Analysis

The qualitative and quantitative evaluations demonstrate CLoRA's significant advantages in multi-concept image generation using LoRAs, especially in overcoming attention overlap and attribute binding.

6.1.1. Qualitative Results

The following figures (Figure 1 and Figure 4 from the original paper) illustrate the qualitative performance:

Figure 1. Comparison of CLoRA with conventional LoRA-merging approaches for multi-concept image generation, showing generated results and attention heatmaps for different concepts (a woman, a flower, a style, and a cat and a dog), and highlighting how CLoRA achieves more accurate concept fusion by updating the attention maps of multiple models.

Figure 4. Multi-concept generations in which different LoRA models control a cat, a dog, a canal, a bicycle, a jacket, a panda, shoes, and a plant across various scene combinations; annotations indicate the concept each LoRA model represents, illustrating CLoRA's application to multi-concept composition.

As seen in Figure 1 and Figure 4, CLoRA successfully composes images with multiple content LoRAs (e.g., a cat and a dog) in varied backgrounds (mountain, moon). It also effectively combines a content LoRA with a scene LoRA (e.g., placing a cat within a specific canal). The method handles diverse LoRA combinations, such as a cat with a bicycle or clothing. Notably, it scales to compositions involving more than two LoRAs, as evidenced by the panda, shoe, and plant example in the bottom right of Figure 4. This showcases CLoRA's versatility and robustness across different types of LoRA concepts.

6.1.2. Qualitative Comparison with Baselines

The following figure (Figure 3 from the original paper) provides a qualitative comparison with several baseline methods:

Figure 3. Qualitative Comparison of CLoRA, Mix-of-Show, MultiLoRA, LoRA-Merge, ZipLoRA, and Custom Diffusion. Our method can generate compositions that faithfully represent the LoRA concepts, whereas other methods often overlook one of the LoRAs and generate a single LoRA concept for both subjects.

Figure 3 highlights CLoRA's superiority. For instance, with the prompt "An $L_1$ cat and an $L_2$ penguin in the house":

  • Mix-of-Show often blends objects or ignores one, producing two plush penguins while omitting the cat, or a single cat with plush-like features.

  • MultiLoRA fails to resemble the specific LoRA models, yielding two cats or two penguins.

  • LoRA-Merge captures the cat LoRA somewhat but misses the penguin.

  • ZipLoRA frequently fails to incorporate the plush penguin, creating two cats.

  • Custom Diffusion often completely overlooks the cat LoRA, focusing only on the plush penguin.

    In contrast, CLoRA faithfully captures both concepts, representing the specific cat and plush penguin distinctly without attention overlap or attribute binding issues. Similar observations are made for object-object compositions (shoes and a purse) and scenarios where attributes might blend (a bunny LoRA blending with a dog LoRA to create a dog with bunny's plushie features).

6.1.3. Composition with Three LoRA Models

The following figure (Figure 5a from the original paper) evaluates CLoRA with more complex compositions:

Figure 5. Experimental results showing three-LoRA compositions (a), compositions with multiple human subjects (b), and compositions of a woman with various styles (c), highlighting CLoRA's advantages in multi-concept image synthesis.

Figure 5a demonstrates CLoRA's ability to handle more than two LoRAs, maintaining the characteristics of each in the composite image. Other methods often struggle, blending multiple models incoherently. Figure 5c further shows compositions using 3 LoRAs for style, object, and human subjects, showcasing the method's versatility.

6.1.4. Composition with Human Subjects and Style LoRAs

Figure 5b compares the composition of human subjects, where CLoRA seamlessly integrates subjects with objects, preserving the distinct properties of each LoRA. Other methods typically struggle with effective integration. The paper also highlights CLoRA's capability to blend style and concept LoRAs (e.g., flower and human with a consistent style LoRA across the image).

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Metric | LoRA Merge | Composite | Switch | ZipLoRA | Mix-of-Show | Ours |
|---|---|---|---|---|---|---|
| DINO Min. | 0.376 ± 0.041 | 0.288 ± 0.049 | 0.307 ± 0.055 | 0.369 ± 0.036 | 0.407 ± 0.035 | 0.447 ± 0.035 |
| DINO Avg. | 0.472 ± 0.036 | 0.379 ± 0.045 | 0.395 ± 0.053 | 0.496 ± 0.030 | 0.526 ± 0.024 | 0.554 ± 0.028 |
| DINO Max. | 0.504 ± 0.038 | 0.417 ± 0.046 | 0.432 ± 0.055 | 0.533 ± 0.032 | 0.564 ± 0.024 | 0.593 ± 0.024 |
| CLIP-I Min. | 0.641 ± 0.029 | 0.614 ± 0.035 | 0.619 ± 0.039 | 0.659 ± 0.022 | 0.664 ± 0.023 | 0.683 ± 0.017 |
| CLIP-I Avg. | 0.683 ± 0.029 | 0.654 ± 0.035 | 0.659 ± 0.036 | 0.707 ± 0.021 | 0.712 ± 0.022 | 0.725 ± 0.017 |
| CLIP-I Max. | 0.714 ± 0.028 | 0.690 ± 0.033 | 0.695 ± 0.036 | 0.740 ± 0.021 | 0.744 ± 0.023 | 0.756 ± 0.017 |
| CLIP-T | 0.814 ± 0.054 | 0.833 ± 0.091 | 0.822 ± 0.089 | 0.767 ± 0.081 | 0.760 ± 0.074 | 0.862 ± 0.052 |
| User Study | 2.0 ± 1.10 | 2.11 ± 1.12 | 1.98 ± 1.14 | 2.81 ± 1.18 | 2.03 ± 1.12 | 3.32 ± 1.13 |

Table 1a. Quantitative evaluation: average, minimum, and maximum DINO image-image similarities, CLIP-I and CLIP-T metrics between the merged outputs and the individual LoRA models, and user study ratings. For all metrics, higher is better.

| Method | CivitAI | VRAM | Runtime |
|---|---|---|---|
| Custom Diffusion | ✗ | 28GB + 8GB | 4.2 min + 3.5s |
| LoRA Merge | ✓ | 7GB | 3.2s |
| Composite | ✓ | 7GB | 3.4s |
| Switch | ✓ | 7GB | 4.8s |
| Mix-of-Show | ✗ | 10GB + 10GB | 10 min + 3.3s |
| ZipLoRA | ✓ | 39GB + 17GB | 8 min + 4.2s |
| OMG | ✓ | 30GB | 62s |
| LoRA-Composer | ✗ | 51GB | 35s |
| Ours | ✓ | 25GB | 24s |

Table 1b. Comparison of methods in terms of CivitAI compatibility, VRAM usage, and runtime (Finetuning and/or Inference).

6.2.1. Quantitative Comparison (Table 1a)

DINO-based Similarities (Min., Avg., Max.):

  • CLoRA consistently outperforms all baselines across Minimum, Average, and Maximum DINO Similarity metrics. This indicates that CLoRA not only achieves a high overall alignment with the individual LoRA concepts (Avg. DINO: 0.554), but also significantly reduces the risk of neglecting any single concept (Min. DINO: 0.447). The higher minimum DINO score for CLoRA is particularly important, demonstrating its ability to robustly represent all concepts within the composition, avoiding the omission issues observed in other methods.

CLIP-I (IPD Avg., Min., Max.):

  • While labeled as IPD Avg., the context implies CLIP-I (Image-to-Image similarity), often used interchangeably with perceptual distance. CLoRA again shows superior performance with an Avg. CLIP-I of 0.725, suggesting its generated images are perceptually closer to the reference LoRA outputs than those of other methods. This aligns with DINO results, reinforcing CLoRA's fidelity.

CLIP-T (Image-to-Text Similarity):

  • CLoRA achieves the highest CLIP-T score of 0.862. This metric directly measures how well the generated image semantically aligns with the input text prompt. A higher score signifies that CLoRA produces images that are more accurately described by the prompt, which is crucial for multi-concept generation where the prompt explicitly states multiple subjects or styles.

User Study:

  • The user study results confirm CLoRA's qualitative advantages, with a score of 3.32. This is notably higher than all baselines, including ZipLoRA (2.81) which was the next best. This human evaluation validates that CLoRA successfully creates composite images where LoRA concepts are faithfully and accurately represented to human observers.

    Overall, CLoRA clearly demonstrates superior performance across all quantitative metrics, confirming its effectiveness in multi-concept image generation using LoRAs.

6.2.2. Runtime Comparison (Table 1b)

CivitAI Compatibility:

  • CLoRA is fully compatible with CivitAI (indicated by '✓'), unlike Custom Diffusion, Mix-of-Show, and LoRA-Composer. This is a significant practical advantage, as it allows users to directly utilize the vast collection of community-trained LoRA models without requiring specialized variants or retraining.

VRAM Usage and Runtime:

  • CLoRA shows a favorable balance between VRAM usage (25GB) and runtime (24s for composing two LoRAs).
    • While some methods like LoRA Merge, Composite, and Switch have lower VRAM (7GB) and faster runtimes (3-5s), their quantitative performance is significantly lower, suggesting they achieve efficiency at the cost of quality and fidelity.
    • Other higher-performing or more complex methods like Custom Diffusion (28GB + 8GB VRAM, 4.2 min + 3.5s), ZipLoRA (39GB + 17GB VRAM, 8 min + 4.2s), OMG (30GB VRAM, 62s), and LoRA-Composer (51GB VRAM, 35s) often demand substantially more VRAM or runtime, or both, especially when considering fine-tuning time.
  • CLoRA's 25GB VRAM and 24s runtime (on NVIDIA H100) position it as an efficient and effective solution for multi-concept generation, particularly given its superior output quality.

6.2.3. Ablation Study

The following figure (Figure 6 from the original paper) illustrates the ablation study:

Figure 6. Ablation Study. Using the $L_1$ cat and $L_2$ dog LoRAs, the effects of two key components (latent update and latent masking) can be observed.

The ablation study highlights the importance of CLoRA's two key components:

  • Latent Update (Contrastive Objective): Without the latent update guided by the contrastive objective, the model fails to correctly direct attention. As shown in Figure 6, this can lead to erroneous generation such as duplicate objects (e.g., two dogs instead of a cat and a dog) or incorrect attribute connections. This confirms that the contrastive loss is essential for disentangling concepts and ensuring each LoRA focuses on its designated subject.

  • Latent Masking: Latent masking is crucial for preserving the identity and spatial integrity of each subject. Without masking, every pixel would be influenced by all prompts and LoRAs, leading to inconsistencies, loss of identity, and unwanted feature blending in the final image. The masking mechanism ensures that each LoRA's influence is localized to the relevant regions.

    Together, these components work synergistically to enhance the composition process, enabling the faithful and accurate integration of specific styles or variations from multiple LoRAs into designated regions of the image.
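
To illustrate the two components, here are two hedged sketches. The first shows one plausible way to formulate a contrastive objective over attention maps: maps belonging to the same concept group are pulled together and maps from different groups are pushed apart (an InfoNCE-style formulation; the paper's exact loss may differ, and all names are hypothetical).

```python
# Hypothetical sketch of a contrastive loss over attention maps.
import torch
import torch.nn.functional as F

def contrastive_attention_loss(attn_maps: torch.Tensor, groups: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # attn_maps: (K, H*W) flattened attention maps for K tracked tokens
    # groups:    (K,) concept-group index for each token
    feats = F.normalize(attn_maps, dim=-1)
    sim = feats @ feats.T / temperature                   # (K, K) cosine similarities
    loss, count = attn_maps.new_zeros(()), 0
    for i in range(len(groups)):
        same = groups == groups[i]
        pos = same & (torch.arange(len(groups), device=groups.device) != i)
        if not pos.any():
            continue
        logits = sim[i].clone()
        logits[i] = float("-inf")                         # exclude self-similarity
        log_prob = logits - torch.logsumexp(logits, dim=0)
        loss = loss - log_prob[pos].mean()                # pull same-group maps together
        count += 1
    return loss / max(count, 1)
```

The second sketch illustrates masked latent fusion: each spatial location is assigned to the LoRA whose subject token attends to it most strongly, and only where that attention exceeds a threshold. Again, the variable names (`latents`, `attn_maps`, `lam`) are illustrative and the paper's aggregation and thresholding rule may differ.

```python
# Illustrative sketch of masked latent fusion driven by per-LoRA attention maps.
import torch

def fuse_latents(latents: list[torch.Tensor],
                 attn_maps: list[torch.Tensor],
                 lam: float = 0.3) -> torch.Tensor:
    """latents[i]: (C, H, W) latent from LoRA i; attn_maps[i]: (H, W) attention of its subject token."""
    # Normalize each attention map to [0, 1] so maps are comparable across LoRAs.
    norm = [(a - a.min()) / (a.max() - a.min() + 1e-8) for a in attn_maps]
    stacked = torch.stack(norm)                           # (N, H, W)
    winner = stacked.argmax(dim=0)                        # LoRA with strongest attention per pixel
    fused = latents[0].clone()                            # fall back to the first latent elsewhere
    for i, latent in enumerate(latents):
        # Keep LoRA i's latent only where it wins and its attention exceeds the threshold.
        mask = (winner == i) & (stacked[i] > lam)
        fused = torch.where(mask.unsqueeze(0), latent, fused)
    return fused
```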

6.2.4. Scalability Analysis

The following figures (Figure 7 and Table 2 from the original paper) provide a scalability analysis:

Figure 7. Analysis on Runtime. Number of LoRAs vs. VRAM usage and inference time.

The following are the results from Table 2 of the original paper:

| Metric | | Mix-of-Show | Orthogonal Adaptation | Ours |
| --- | --- | --- | --- | --- |
| CLIP-I | Max. | 0.688 ± 0.042 | 0.668 ± 0.075 | 0.668 ± 0.065 |
| | Avg. | 0.490 ± 0.031 | 0.524 ± 0.042 | 0.525 ± 0.039 |
| | Min. | 0.371 ± 0.032 | 0.395 ± 0.033 | 0.396 ± 0.034 |
| DINO | Max. | 0.574 ± 0.078 | 0.548 ± 0.093 | 0.543 ± 0.080 |
| | Avg. | 0.351 ± 0.039 | 0.343 ± 0.066 | 0.347 ± 0.054 |
| | Min. | 0.155 ± 0.046 | 0.158 ± 0.058 | **0.161 ± 0.058** |

Table 2. Quantitative comparison of Mix-of-Show (MoS), Orthogonal Adaptation (Orth) and our method on 3-, 4-, 5-subject generation using CLIP-I and DINO similarity metrics.

Figure 7 illustrates the scaling behavior of CLoRA. While VRAM usage and inference time naturally increase with the number of LoRAs, this growth remains predictable, underscoring the practicality of the approach. For instance, 2 LoRAs use 25GB VRAM and take 24s, while 8 LoRAs consume 80GB VRAM and take 96s.
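
Assuming roughly linear growth between the two reported operating points (2 LoRAs: 25GB/24s; 8 LoRAs: 80GB/96s), a back-of-the-envelope estimate for intermediate LoRA counts looks like the sketch below; actual scaling may deviate from linearity.

```python
# Linear interpolation between the two reported points; purely an estimate.
def estimate_resources(n_loras: int) -> tuple[float, float]:
    vram_gb = 25 + (80 - 25) * (n_loras - 2) / (8 - 2)     # ~9.2 GB per extra LoRA
    runtime_s = 24 + (96 - 24) * (n_loras - 2) / (8 - 2)   # ~12 s per extra LoRA
    return vram_gb, runtime_s

print(estimate_resources(4))  # ≈ (43.3 GB, 48 s)
```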

Table 2 presents quantitative results for more complex compositions involving 3-5 LoRA models. CLoRA maintains comparable performance to state-of-the-art methods like Mix-of-Show and Orthogonal Adaptation even without auxiliary conditions (like ControlNet key-points) that these baselines often rely on for separation. CLoRA achieves CLIP-I (Max: 0.668, Avg: 0.525, Min: 0.396) and DINO (Max: 0.543, Avg: 0.347, Min: 0.161) scores that are competitive or slightly better in some cases, demonstrating that its contrastive test-time strategy effectively handles increased subject complexity without strong priors.

6.2.5. Additional Qualitative Results

The paper provides further qualitative evidence for CLoRA's capabilities:

  • Similar Subject Compositions (Figure 10): CLoRA can generate images with multiple objects from the same super-class (e.g., two people or multiple cats) by assigning respective LoRAs to individual tokens. Its separation mechanism relies on LoRA-specific attention groups rather than coarse class labels, allowing it to handle such cases effectively.

  • Complex and Interacting Scenes (Figure 11): CLoRA is capable of merging LoRAs in visually complex scenes with many objects (e.g., bottles, plates, sea in the background, or ship and ball). Each LoRA subject retains its unique attributes without cross-subject leakage.

  • Comparison with LoRA-Composer (Figure 12): This comparison highlights CLoRA's advantage of not requiring user-provided bounding boxes. LoRA-Composer needs these boxes to ensure accurate depictions, and without them, its results can be poor. CLoRA consistently produces coherent multi-concept compositions without such manual input, and also boasts better VRAM efficiency and Civit.ai compatibility.

  • Comparison with OMG (Figure 15): OMG relies on off-the-shelf segmentation models to isolate subjects. If the segmentation model fails (e.g., misses an object) or if the base Stable Diffusion model doesn't generate the intended objects initially (due to attention overlap or attribute binding), OMG's performance suffers. CLoRA bypasses this dependency by directly updating attention maps and fusing latent representations, ensuring more robust concept capture even in challenging scenarios.

  • Comparison with Orthogonal Adaptation (Figure 14): Orthogonal Adaptation may require additional conditions like sketches or key points for accurate multi-concept compositions. Without these user-provided conditions, Orthogonal Adaptation can struggle to generate accurate depictions. CLoRA, in contrast, achieves consistent multi-concept compositions without such extra input, preserving individual attributes more effectively.

    The following are the results from Table 3 of the original paper:

| Metric | | Merge | Composite | ZipLoRA | Mix-of-Show | Ours |
| --- | --- | --- | --- | --- | --- | --- |
| CIPD | Min. | 76.0% ± 8.7% | 76.2% ± 7.2% | 73.4% ± 8.1% | 75.2% ± 9.5% | 83.3% ± 5.5% |
| | Avg. | 79.5% ± 8.3% | 79.7% ± 6.8% | 77.1% ± 7.6% | 78.7% ± 9.2% | 87.1% ± 4.9% |
| | Max. | 82.5% ± 8.1% | 82.5% ± 6.7% | 80.6% ± 7.6% | 81.7% ± 9.2% | 89.8% ± 4.8% |
| ONIA | Min. | 37.0% ± 15% | 30.3% ± 13% | 36.9% ± 13% | 37.5% ± 17% | 47.2% ± 14% |
| | Avg. | 43.7% ± 17% | 38.5% ± 13% | 49.6% ± 15% | 48.0% ± 22% | 57.3% ± 14% |
| | Max. | 50.5% ± 17% | 49.5% ± 14% | 53.3% ± 16% | 55.6% ± 23% | 69.1% ± 14% |

Table 3. Instance-level CLIP-I (CIPD) and DINO (ONIA) similarity scores.

6.2.6. Additional Quantitative Analysis (Table 3)

The paper provides an additional quantitative analysis using instance segmentation (SEEM [60]) to isolate objects within composed images and calculate similarity metrics at a more granular level. This is performed on 700 images per method.

  • Instance-level CLIP-I (CIPD): CLoRA significantly outperforms all methods across Min., Avg., and Max. CLIP-I scores (e.g., Avg. CIPD: 87.1% vs. 79.7% for Composite). This indicates that not only is the overall image good, but the individual objects within the image, as segmented, are also much more semantically aligned with their LoRA references when generated by CLoRA.

  • Instance-level DINO (ONIA): Similarly, CLoRA achieves the highest scores for Min., Avg., and Max. DINO similarity (Avg. ONIA: 57.3% vs. 49.6% for ZipLoRA). This further corroborates CLoRA's ability to maintain high fidelity for individual LoRA concepts at the object level, confirming that its attention refinement and masked latent fusion effectively prevent concept leakage and neglect.

    This detailed instance-level analysis provides stronger evidence for CLoRA's effectiveness in multi-concept composition, demonstrating its ability to preserve the integrity and characteristics of each LoRA concept, even when multiple objects are present in a complex scene.
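
To clarify the instance-level protocol, here is a minimal sketch: segment the composed image into object instances, crop each instance, and compare the crops against the LoRA reference images. The binary masks are assumed to come from an off-the-shelf segmentation model such as SEEM, `similarity` stands in for an image-image score such as the `clip_i` or `dino_similarity` functions sketched earlier, and the best-crop matching rule is an assumption rather than the paper's exact procedure.

```python
# Sketch of instance-level scoring over segmented objects (assumed protocol).
import numpy as np
from PIL import Image

def crop_to_mask(image: Image.Image, mask: np.ndarray) -> Image.Image:
    # Tight bounding-box crop around the mask's nonzero pixels.
    ys, xs = np.nonzero(mask)
    return image.crop((int(xs.min()), int(ys.min()),
                       int(xs.max()) + 1, int(ys.max()) + 1))

def instance_scores(image, masks, references, similarity):
    # masks: list of binary arrays, one per detected object instance
    # references: list of reference images, one per LoRA concept
    # similarity: image-image scoring function (e.g., clip_i or dino_similarity)
    crops = [crop_to_mask(image, m) for m in masks]
    # For each concept, report the best-matching instance crop.
    return [max(similarity(crop, ref) for crop in crops) for ref in references]
```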

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces CLoRA, a novel, training-free, and test-time method for composing multiple Low-Rank Adaptation (LoRA) models for image generation. CLoRA effectively addresses the long-standing challenges of attention overlap and attribute binding that plague existing multi-LoRA composition techniques. By employing a contrastive objective to dynamically update attention maps and leveraging these maps to create semantic masks for latent fusion, CLoRA ensures that each LoRA accurately guides the diffusion process towards its designated subject.

The comprehensive qualitative and quantitative evaluations consistently demonstrate CLoRA's superior performance over existing baselines across metrics such as DINO-based similarity, CLIP alignment, and user study evaluations. A key strength of CLoRA is its compatibility with a wide range of community-developed LoRAs (e.g., from Civit.ai) without requiring specific LoRA variants or additional user controls like bounding boxes or segmentation masks. Furthermore, CLoRA maintains efficiency in memory usage and runtime, showcasing predictable scalability for an increasing number of LoRAs. This work significantly advances the capability of personalized image generation by enabling robust and high-fidelity multi-concept compositions.

7.2. Limitations & Future Work

The authors acknowledge several considerations:

  • Ethical Concerns: While CLoRA democratizes creativity, the ease of generating personalized images also carries inherent risks of misuse, such as the creation of deepfakes [31]. The authors emphasize the necessity of thoughtful discourse around ethical use and note that their user study adheres to anonymity protocols. This is a general concern for powerful generative AI, not specific to CLoRA itself, but amplified by its enhanced personalization capabilities.
  • Lack of Standardized Benchmarks: The absence of standardized benchmarks for composing multiple LoRA models required the authors to compile a custom dataset of 131 LoRA models and 200 text prompts. This makes direct comparison across different research efforts more challenging, as each might use its own evaluation set. Future work could benefit from the development of widely accepted multi-LoRA composition benchmarks.
  • Computational Demands with Increasing LoRAs: While CLoRA scales predictably, the VRAM and inference time still increase with the number of LoRAs (e.g., 80GB VRAM and 96s for 8 LoRAs on an H100 GPU). For very high numbers of LoRAs or in resource-constrained environments, these demands could become a practical limitation. Optimizing memory usage or exploring more efficient scaling strategies could be future directions.
  • Focus on Attention Mechanism: The method heavily relies on attention maps. While effective, it might not capture all nuances of concept disentanglement that could be addressed by other model components.

7.3. Personal Insights & Critique

CLoRA represents a significant and practical advancement in personalized image generation. Its training-free and control-free nature is a major selling point, democratizing multi-concept composition for a broader user base, from artists to casual users, who might lack the technical expertise or computational resources for complex training or manual control inputs. The ability to directly use off-the-shelf LoRAs from platforms like Civit.ai further enhances its utility and broad adoption potential.

The core innovation lies in the intelligent application of a contrastive objective and masked latent fusion to manage attention dynamics across multiple LoRAs. This directly tackles the fundamental issues of concept entanglement and neglected attributes in a principled way. The ablation study clearly validates the necessity of both latent update and latent masking, demonstrating a well-thought-out design.

Potential Issues/Areas for Improvement:

  • Threshold Parameter (λ\lambda): The threshold parameter λ\lambda for mask generation is a hyper-parameter. While not discussed in detail, its optimal value could be task-dependent and might require tuning, potentially impacting the quality of semantic masks and thus latent fusion.
  • Complex Scene Interaction: While CLoRA handles complex scenes well, situations involving highly nuanced physical interactions or occlusions between objects might still pose challenges. For instance, correctly rendering shadows or reflections between two LoRA-generated objects could be difficult if the underlying diffusion model struggles, or if the masks don't perfectly capture interaction regions.
  • Prompt Sensitivity: The effectiveness of the prompt variations and the subsequent concept grouping relies on the CLIP text encoder's ability to correctly parse and disentangle the specified LoRA and subject tokens. Complex or ambiguous phrasing in the text prompt might still lead to suboptimal results, even with the contrastive objective.
  • Generalizability to Other Diffusion Architectures: While demonstrated on SDv1.5, it would be interesting to see CLoRA's performance on newer diffusion models like SDXL or other latent diffusion architectures.

Applicability to Other Domains: The core idea of using contrastive objectives to disentangle representations and attention maps for masked fusion could potentially be extended beyond LoRAs in image generation. For example:

  • Multi-Agent Control: In robotics or multi-agent systems, contrastive attention could help individual agents focus on their specific tasks while avoiding interference from others.

  • Medical Image Analysis: Combining multiple specialized models (similar to LoRAs) for different pathologies or organs within a single medical image could benefit from masked fusion to ensure each model's expertise is applied only to relevant regions.

  • Personalized Video Generation: Extending CLoRA to video diffusion models to compose multiple character LoRAs or style LoRAs in a consistent temporal sequence would be a logical next step.

    Overall, CLoRA is a highly impactful paper that provides an elegant and effective solution to a pressing problem in generative AI, paving the way for more sophisticated and user-friendly multi-concept content creation.
