FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time
TL;DR Summary
FreeFuse enables multi-subject LoRA fusion via automatic, context-aware masks from cross-attention weights at test time, requiring no extra training or auxiliary models, improving multi-subject text-to-image generation quality and usability over existing methods.
Abstract
This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference well approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Alternatively, it only requires users to provide the LoRA activation words for seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability under the multi-subject generation tasks. The project page is at https://future-item.github.io/FreeFuse/
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time
1.2. Authors
- Yaoli Liu (State Key Laboratory of CAD&CG, Zhejiang University)
- Yao-Xiang Ding (State Key Laboratory of CAD&CG, Zhejiang University)
- Kun Zhou (State Key Laboratory of CAD&CG, Zhejiang University)
1.3. Journal/Conference
The paper is published as a preprint on arXiv. The listed publication date (2025-10-27T16:54:08.000Z) suggests it is a very recent or upcoming work. While arXiv is a reputable platform for disseminating research quickly, it is not a peer-reviewed journal or conference proceeding itself. Papers published on arXiv are typically submitted to conferences or journals later. The affiliations indicate a strong academic background from Zhejiang University, which is a well-regarded institution in computer science and related fields.
1.4. Publication Year
2025 (as indicated by the arXiv timestamp).
1.5. Abstract
This paper introduces FreeFuse, a novel approach for generating images with multiple subjects from text prompts. It works by automatically fusing multiple subject LoRAs (Low-Rank Adaptations) without requiring any additional training. Unlike existing methods that merge LoRA weights before inference or use complex techniques like segmentation models and noise blending, FreeFuse leverages the insight that dynamic, context-aware subject masks can be automatically extracted from cross-attention layer weights during inference. Mathematical analysis supports that applying these masks directly to LoRA outputs during inference closely approximates the ideal scenario where each subject LoRA is used individually for its masked region within the diffusion model. FreeFuse is highly practical and efficient, as it needs no extra training, no modifications to existing LoRAs, no auxiliary models, and no specific user-defined prompt templates or region specifications. Users only need to provide LoRA activation words to integrate it into standard workflows. Extensive experiments show that FreeFuse surpasses current methods in both generation quality and ease of use for multi-subject generation tasks.
1.6. Original Source Link
https://arxiv.org/abs/2510.23515 (Preprint) PDF Link: https://arxiv.org/pdf/2510.23515v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem FreeFuse aims to solve is the effective and high-quality generation of multi-subject text-to-image content using Low-Rank Adaptations (LoRAs). Large-scale Text-to-Image (T2I) models like FLUX.1-dev have shown impressive performance in general image generation. To personalize these models, LoRAs have become a popular choice due to their efficiency and precision in fine-tuning.
However, a significant challenge arises when attempting to combine multiple subject LoRAs (e.g., for different characters) within a single image. A straightforward combination often leads to performance degradation, characterized by feature conflicts and quality deterioration. This happens because LoRAs tend to compete strongly in key regions (like faces), causing features from one subject to bleed into another or leading to a confused, averaged appearance.
Prior research has attempted to address this with various techniques:
- Pre-inference LoRA weight merging (e.g., ZipLoRA).
- Methods requiring retraining or additional trainable parameters (e.g., Mix-of-Show).
- Approaches relying on external segmentation models or complex noise blending (e.g., OMG, Concept Weaver).
- Methods demanding user-defined template prompts or region specifications (e.g., CLoRA, Mix-of-Show).

These existing solutions often struggle with complex multi-subject interactions, lack efficiency, or impose significant burdens on the user, making them less practical for everyday use. The paper identifies a clear gap: the need for a training-free, user-friendly, and efficient method that can effectively mitigate feature conflicts in multi-subject generation with advanced DiT models.
2.2. Main Contributions / Findings
The paper's primary contributions are threefold:
- Analysis of Feature Conflicts in Multi-Subject LoRA Inference: The authors provide an in-depth analysis revealing that the main issue stems from subject LoRAs interfering with regions designated for other subjects during joint inference. They mathematically demonstrate that constraining each LoRA's output to its target region via masks can effectively alleviate these conflicts. This finding underpins the entire FreeFuse approach.
- General Solution for Mitigating LoRA Interference: FreeFuse offers a novel solution that effectively isolates conflicts between LoRAs in DiT models. It preserves the original LoRA weights, even in cases of overfitting, by using masks automatically derived from attention maps. Crucially, this solution requires no new trainable parameters, no modifications to LoRA modules, no auxiliary models, and no additional prompts from users, making it highly compatible and practical.
- Portable and Highly Efficient Framework (FreeFuse): The paper introduces FreeFuse as an end-to-end framework for multi-subject scene generation. It operates in two stages:
  - Mask Generation: Automatically computes high-quality subject masks from cross-attention maps by addressing the attention sink phenomenon, exploiting the locality of self-attention, and applying patch-level voting. This is done very efficiently, using only one step and one attention block during inference.
  - Mask Application: Directly constrains LoRA outputs to their respective masked regions during inference, avoiding complex strategies like feature replacement or noise blending.

Experimental results demonstrate that FreeFuse significantly outperforms previous methods in alleviating feature conflicts, enhancing image quality, maintaining subject fidelity, and adhering to complex prompts, all while being highly practical and seamlessly integratable into standard T2I workflows.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand FreeFuse, a foundational understanding of Text-to-Image (T2I) diffusion models and Low-Rank Adaptations (LoRAs) is essential.
- Diffusion Models: These are a class of generative models that learn to reverse a gradual diffusion (noise-adding) process. During training, they learn to denoise data (e.g., images) corrupted with Gaussian noise. During inference, they start with random noise and iteratively remove noise, guided by a text prompt, to generate a coherent image. The core idea is to transform noise into meaningful data over several steps.
  - Denoising Diffusion Probabilistic Models (DDPMs): A prominent type of diffusion model that performs denoising over many discrete steps.
  - Latent Diffusion Models (LDMs): A variant that applies the diffusion process in a compressed latent space rather than directly in pixel space, significantly reducing computational cost and allowing for higher-resolution image generation. Models like Stable Diffusion are LDMs.
- Text-to-Image (T2I) Generation: The task of generating an image from a given text description. Diffusion models are currently state-of-the-art for T2I generation.
- Diffusion Transformers (DiT): An advanced architecture for diffusion models that replaces the traditional U-Net backbone with a Transformer network. Transformers are powerful neural network architectures that have shown great success in natural language processing and, more recently, in computer vision. DiT models process image latents as sequences of patches (tokens), allowing them to scale efficiently to higher resolutions and leverage the Transformer's capacity for global reasoning. FLUX.1-dev is an example of a DiT-based model.
- Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning technique. Instead of fine-tuning all the weights of a large pre-trained model (like a T2I model), LoRA injects small, trainable rank-decomposition matrices into the Transformer layers. For a pre-trained weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$, the update is represented as $\mathbf{W}_0 + \Delta \mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$, where $\mathbf{B} \in \mathbb{R}^{d \times r}$ and $\mathbf{A} \in \mathbb{R}^{r \times k}$ are low-rank matrices with $r \ll \min(d, k)$. Only $\mathbf{A}$ and $\mathbf{B}$ are trained, significantly reducing the number of trainable parameters and computational cost while retaining high performance (see the code sketch after this list). LoRAs are widely used for personalized generation, allowing users to train specific concepts (e.g., a particular character, object, or style) efficiently.
- Cross-Attention Mechanism: A key component in Transformer architectures, especially in T2I models. It allows information to flow between different modalities, typically between text embeddings (from the prompt) and image latent tokens. In cross-attention, Query (Q) vectors are derived from one modality (e.g., image latents), and Key (K) and Value (V) vectors are derived from another modality (e.g., text embeddings). The attention weights indicate how much each image latent token "attends" to each word in the text prompt, effectively localizing semantic information in the image.
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
where:
  - $Q$ is the Query matrix.
  - $K$ is the Key matrix.
  - $V$ is the Value matrix.
  - $d_k$ is the dimension of the Key vectors, used for scaling to prevent very small gradients.
  - $\mathrm{softmax}$ normalizes the attention scores.
  - The attention mechanism computes a weighted sum of Value vectors, where the weights are determined by the similarity between Query and Key vectors.
- Self-Attention Mechanism: Similar to cross-attention, but here the Query, Key, and Value vectors all come from the same modality (e.g., image latent tokens). It allows each latent token to attend to all other latent tokens within the same image, capturing long-range dependencies and global contextual information within the image itself.
- Attention Sink: A phenomenon observed in Transformer-based models, particularly in attention maps. It refers to the tendency for a few tokens (often boundary or trivial tokens) to accumulate a disproportionately high amount of attention weight, effectively "sinking" attention that should be distributed more meaningfully across important semantic regions. This can lead to misleading or noisy attention maps if not handled.
- Superpixels (SLIC, Simple Linear Iterative Clustering): A method for over-segmenting an image into small, perceptually uniform regions (superpixels). Instead of processing individual pixels, SLIC groups pixels into compact, nearly uniform regions, which can then be treated as individual units. This reduces the number of entities to process and often aligns better with object boundaries than rigid grid-based segmentation. It is often used for simplifying image processing tasks and improving computational efficiency by leveraging local image coherence.
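To make the LoRA update above concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. It is an illustrative example rather than the paper's code; the rank and scaling convention are common defaults, not values taken from FreeFuse.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W_0^T + (alpha / r) * ((x A^T) B^T), with the pretrained W_0 frozen."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                        # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)  # A: r x k
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))        # B: d x r (zero init => no-op at start)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (x @ self.A.T) @ self.B.T    # low-rank update, the "Delta h" of the later analysis
        return self.base(x) + self.scale * delta

# Usage: wrap an existing projection layer and run a dummy batch of token embeddings.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 77, 768))
```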
3.2. Previous Works
The paper contextualizes FreeFuse by discussing existing approaches to multi-concept generation using LoRAs in T2I models, drawing parallels with similar challenges in Large Language Models (LLMs).
- In LLMs: Multi-LoRA inference in LLMs also faced quality drops. Solutions included:
  - Clustering LoRAs in advance [Zhao et al., 2024].
  - Introducing gating networks [Wu et al., 2024].
  - Retraining with conflict-mitigation objectives [Feng et al., 2025].
- In T2I Models (LoRA-based):
  - Pre-inference LoRA Weight Merging: Methods like ZipLoRA [Shah et al., 2024] and K-LoRA [Ouyang et al., 2025] fuse multiple LoRAs before inference. While successful for tasks like style transfer, they struggle with multi-concept generation (especially characters) due to inherent conflicts that arise from simply summing weights.
  - Inference-time Conflict Mitigation:
    - Multi-LoRA [Zhong et al., 2024]: Uses switch and composite strategies during inference. Showed promise for character-object compositions but struggled with multi-character scenarios.
    - OMG (Occlusion-friendly Multi-concept Generation) [Kong et al., 2024]: Introduces an auxiliary model for character localization and applies noise blending. Its reliance on an external model and LoRA's redraw tendency can lead to failures with complex interactions.
    - Mix-of-Show [Gu et al., 2023]: Requires retraining LoRAs and manually specifying spatial constraints (e.g., bounding boxes). This is labor-intensive and lacks cross-LoRA awareness, failing in close-interaction scenes.
    - Concept Weaver [Kwon et al., 2024]: Mitigates issues with Fusion Sampling but still heavily relies on the quality of segmentation.
    - CLoRA [Meral et al., 2024]: Leverages attention maps to derive concept masks. However, it requires template prompts for mask extraction, limiting its performance in complex multi-concept scenes.
3.3. Technological Evolution
The field of image generation has seen rapid advancements:
- Early Generative Models (GANs): Starting with Generative Adversarial Networks (GANs) [Goodfellow et al., 2014], models learned to generate realistic images by having two networks (generator and discriminator) compete. Subsequent improvements like WGANs [Arjovsky et al., 2017] and StyleGANs [Karras et al., 2019, 2020] pushed the boundaries of image quality and control.
- U-Net Based Diffusion Models: The emergence of Denoising Diffusion Probabilistic Models (DDPMs) [Ho et al., 2020] marked a paradigm shift. These models, often using a U-Net [Ronneberger et al., 2015] as their backbone, showed superior synthesis quality and diversity. Latent Diffusion Models (LDMs) [Rombach et al., 2022] further refined this by performing diffusion in a latent space, making them computationally more efficient for high-resolution images.
- DiT-Based Diffusion Models: The latest evolution involves replacing the U-Net backbone with Transformer architectures, leading to Diffusion Transformers (DiT) [Podell et al., 2023; Peebles and Xie, 2023]. Models like FLUX.1-dev [Labs, 2024; Labs et al., 2025] leverage the scalability and global reasoning capabilities of Transformers for even better performance in T2I tasks.
- Personalized Generation and LoRAs: Alongside these architectural advancements, personalized generation has become a major focus. Techniques like Textual Inversion [Gal et al., 2022], DreamBooth [Ruiz et al., 2023], IP-Adapter [Ye et al., 2023], and especially LoRA [Hu et al., 2022] have enabled users to customize T2I models with specific concepts, objects, or styles without extensive retraining. LoRA has become the most widely adopted due to its efficiency.

FreeFuse fits into this timeline by addressing a critical challenge—multi-subject generation with LoRAs—specifically within the context of the powerful and efficient DiT-based diffusion models, building upon the capabilities of LoRA while overcoming its inherent limitations in multi-concept scenarios.
3.4. Differentiation Analysis
FreeFuse differentiates itself from existing multi-LoRA multi-concept generation methods primarily through its emphasis on practicality, efficiency, and training-free operation, particularly for DiT models.
The following table, adapted from Table 1 in the paper, summarizes the key differentiators:
The following are the results from Table 1 of the original paper:
| | Adaptive Mask generation | No external model required | Cross-LoRA awareness | No template prompt required | LoRA usable as-is |
|---|---|---|---|---|---|
| LoRA Merge | X | X | X | X | X |
| Mix-of-Show | X | X | X | ||
| ZipLoRA | X | X | X | X | |
| OMG | X | X | X | ||
| Concept Weaver | |||||
| CLoRA | X | X | |||
| FreeFuse (Ours) | ✅ | ✅ | ✅ | ✅ | ✅ |
- Training-Free and No LoRA Modification: Unlike Mix-of-Show, which requires retraining LoRAs, or OMG, which relies on specific LoRA redraw tendencies, FreeFuse operates entirely without retraining or modifying the existing LoRAs. This makes it highly portable and compatible with any pre-trained LoRA.
- No Auxiliary Models: OMG utilizes an auxiliary model for character localization, adding complexity and potential points of failure. FreeFuse generates its masks solely from the internal attention mechanisms of the diffusion model itself, avoiding external dependencies.
- No User-Defined Prompts/Regions: Mix-of-Show demands manual spatial constraints, and CLoRA requires template prompts to derive masks. FreeFuse automatically derives dynamic, context-aware subject masks without any explicit user input regarding regions or specific prompt structures, requiring only LoRA activation words.
- Cross-LoRA Awareness: Many methods, particularly those relying on static merging or rigid regional constraints, lack awareness of how different LoRAs might interact. FreeFuse implicitly handles this by dynamically generating masks based on attention maps, which reflect the interplay of LoRAs during inference.
- Efficiency: FreeFuse extracts subject masks from a single attention block at a single denoising step (e.g., the 17th block at the 6th step for FLUX.1-dev), offering significant efficiency gains over approaches that repeatedly update attention maps or require more intensive computation throughout inference.
- Applicability to DiT Models: Most prior multi-LoRA methods were implemented on U-Net-based models. FreeFuse explicitly addresses and demonstrates its capabilities within the more advanced DiT-based models (e.g., FLUX.1-dev), which are becoming the new standard.

In essence, FreeFuse provides a highly practical, efficient, and user-friendly solution that automates the complex task of multi-subject LoRA fusion by intelligently leveraging the intrinsic attention mechanisms of modern DiT models, distinguishing itself through its "training-free" and "no-user-intervention" philosophy.
4. Methodology
4.1. Principles
The core principle behind FreeFuse is to mitigate feature conflicts during multi-subject LoRA joint inference by ensuring each subject LoRA only influences its intended region within the generated image. The key insight is that context-aware dynamic subject masks can be automatically and efficiently derived from the cross-attention layer weights of the diffusion model.
The theoretical basis is that by applying a spatial mask to the output of a LoRA, the influence on the masked region is nearly identical to the scenario where that LoRA was used individually for that region. This approximation holds because:
- LoRAs primarily inject semantic features into the feed-forward (FF) and value (V) layers of the Transformer.
- LoRA outputs ($\Delta \mathbf{h}$) are generally much smaller in magnitude than the base model's outputs.
- Attention mechanisms exhibit locality, meaning tokens within a target region primarily attend to each other, with minimal influence from outside the region.

This allows FreeFuse to "localize" the effect of each LoRA without modifying the LoRA itself or retraining, effectively preventing feature bleeding and competition.
4.2. Core Methodology In-depth (Layer by Layer)
FreeFuse operates in a two-stage pipeline:
- Stage 1: Subject Mask Calculation: Automatic derivation of high-quality subject masks from attention maps.
- Stage 2: Mask Application: Repeated application of these masks to LoRA outputs during inference.

The pipeline is illustrated in Figure 5 from the original paper:
Figure 5: Pipeline. Our pipeline consists of two stages: the first derives subject masks from attention maps, and the second applies these masks to LoRA outputs, ensuring that each LoRA only operates within its corresponding subject region.
4.2.1. Analysis of Feature Conflicts and Masking Justification
The paper first highlights the problem: when multiple subject LoRAs are applied, they compete in key regions (like faces), leading to feature conflicts and confusion, as shown in Figure 3a.

Figure 3: Conflicts Analysis
To address this, the proposed solution is to apply a spatial mask to the LoRA output for each subject.
The masked LoRA output, denoted as $\tilde{\Delta \mathbf{h}}$, is calculated as:
$
\tilde{\Delta \mathbf{h}} = \mathbf{M} \odot \Delta \mathbf{h}
$
where:
- $\tilde{\Delta \mathbf{h}}$ is the masked LoRA output.
- $\mathbf{M}$ is a binary spatial mask, whose elements are 1 within the specified target region for a subject and 0 elsewhere.
- $\Delta \mathbf{h}$ is the original output of the LoRA (the change it proposes to the base model's hidden states).
- $\odot$ denotes the element-wise (Hadamard) product.
The authors justify this by observing that LoRAs primarily modify the feed-forward (FF) and value (V) layers. This is supported by experiments shown in Figure 4, which demonstrate that disabling LoRA components in FF or V layers causes significant semantic loss, indicating these are the primary injection points for semantic information.
Figure 4: Left: Experiments show that removing LoRA from the feedforward (FF) and value (V) layers causes relatively significant semantic loss than removing it from other layers. Right: We randomly downloaded 45 FLUX-based LoRAs from Civitai and sampled 225 images. Results show that disabling the FF or V layers causes a large increase in L2 loss, while other layers have limited effect, indicating that semantic information is primarily injected through the V and FF layers.
Furthermore, attention mechanisms in DiT models exhibit locality. Tokens within a subject's region primarily attend to other tokens within that same region. The attention output of a LoRA in one attention block can be represented as:
$
\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V} + \Delta \mathbf{V}) = \mathrm{softmax} \left( \frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{D}} \right) (\mathbf{V} + \Delta \mathbf{V})
$
where:
- $\mathrm{Attn}$ represents the attention mechanism.
- $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are the Query, Key, and Value matrices from the base diffusion model.
- $\Delta \mathbf{V}$ captures the contribution of the LoRA (from both its FF layer in the previous block and its V layer in the current block).
- $D$ is the token dimension (used for scaling).
The locality of attention means that for a mask $\mathbf{M}$, the sum of attention weights between tokens within the masked region is much greater than the sum of attention weights from tokens inside the region to tokens outside it:
$ \sum_{i : \mathbf{M}_i = 1} \sum_{j : \mathbf{M}_j = 1} \alpha_{ij} \gg \sum_{i : \mathbf{M}_i = 1} \sum_{j : \mathbf{M}_j = 0} \alpha_{ij} $
where $\alpha_{ij}$ is the attention weight from token $i$ to token $j$.
This strong locality implies that the semantic representation of target tokens is dominated by FF and value semantics from within the masked region. Therefore, masking the LoRA output ($\tilde{\Delta \mathbf{h}}$) leads to a result within the masked region that is approximately equivalent to using the full LoRA output:
$
\mathbf{M} \odot f(\mathbf{h} + \Delta \mathbf{h}) \approx \mathbf{M} \odot f(\mathbf{h} + \tilde{\Delta \mathbf{h}})
$
where:
- $f$ is the diffusion generation model.
- $\mathbf{h}$ is the hidden state before LoRA application.
- $\Delta \mathbf{h}$ is the full LoRA output.
- $\tilde{\Delta \mathbf{h}}$ is the masked LoRA output.

This mathematical justification supports the idea that mask-based LoRA output fusion is a valid and effective approach to address feature conflicts. The challenge then becomes how to generate these LoRA masks automatically and robustly.
4.2.2. Stage 1: Subject Mask Calculation
This stage automatically generates subject masks using cross-attention maps. This process happens once at a specific early denoising step (e.g., step 6 for FLUX.1-dev).
4.2.2.1. Cross-Attention Map Computation and Attention Sink Handling
First, cross-attention maps are computed between text queries and image keys.
$
\mathbf{A}_{\mathrm{cross}} = \mathrm{softmax}\left(\frac{\mathbf{Q}_{\mathrm{text}} \mathbf{K}_{\mathrm{img}}^T}{\sqrt{D}}\right)
$
where:
- $\mathbf{A}_{\mathrm{cross}}$ is the cross-attention map.
- $\mathbf{Q}_{\mathrm{text}}$ is the Query matrix derived from text embeddings.
- $\mathbf{K}_{\mathrm{img}}$ is the Key matrix derived from image tokens (latents).
- $B$ is the batch size.
- $N_{\mathrm{text}}$ is the sequence length of the text tokens.
- $N_{\mathrm{img}}$ is the sequence length of the image tokens.
- $D$ is the token dimension.
Raw attention maps often suffer from the attention sink phenomenon, where boundary or uninformative pixels accumulate excessive attention. To counteract this, a heuristic filtering mechanism is applied, combining Top-K thresholding with spatial edge detection:
$ \mathcal{M}_{\mathrm{topk}}(i, j) = \mathbb{I}[\mathbf{A}(i, j) \geq \tau_k] $
$ \mathcal{M}_{\mathrm{edge}}(i, j) = \mathbb{I}[(i, j) \in \mathcal{E}] $
$ \mathcal{M}_{\mathrm{handle\_sink}} = \neg (\mathcal{M}_{\mathrm{topk}} \wedge \mathcal{M}_{\mathrm{edge}}) $
where:
- $\mathcal{M}_{\mathrm{topk}}$ is a binary mask indicating pixels $(i, j)$ whose attention value is above a Top-K threshold.
- $\mathbb{I}[\cdot]$ is the indicator function, which returns 1 if the condition is true and 0 otherwise.
- $\tau_k$ is the $k$-th largest attention value. In practice, $k$ is chosen as 1% of the pixels (i.e., the top 1% of pixels with the highest attention).
- $\mathcal{M}_{\mathrm{edge}}$ is a binary mask indicating pixels $(i, j)$ that belong to edge regions.
- $\mathcal{E}$ represents the edge pixel regions.
- $\mathcal{M}_{\mathrm{handle\_sink}}$ is the final attention sink handling mask. The operation masks out regions that are both Top-K and edge regions, preventing attention sinks at the boundaries.

The filtered attention map is then normalized:
$ \tilde{\mathbf{A}} = \frac{\mathbf{A} \odot \mathcal{M}_{\mathrm{handle\_sink}}}{\sum_j (\mathbf{A} \odot \mathcal{M}_{\mathrm{handle\_sink}})_{ij}} $
where:
- $\tilde{\mathbf{A}}$ is the normalized, attention sink-handled cross-attention map.
- The denominator ensures that the attention weights for each text token sum to 1 after filtering.
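A minimal NumPy sketch of this filtering step, assuming a per-token cross-attention map of shape (N_text, H, W); treating the one-pixel image border as the "edge region" is an illustrative assumption, not the paper's exact edge-detection heuristic:

```python
import numpy as np

def handle_attention_sink(attn: np.ndarray, top_frac: float = 0.01) -> np.ndarray:
    """attn: cross-attention map of shape (N_text, H, W).
    Zeroes out pixels that are both in the top attention fraction and on an edge, then renormalizes."""
    n_text, h, w = attn.shape
    k = max(1, int(h * w * top_frac))

    # Top-K mask per text token: pixels whose attention is among the k largest values.
    flat = attn.reshape(n_text, -1)
    thresh = np.partition(flat, -k, axis=1)[:, -k][:, None, None]   # k-th largest value per token
    m_topk = attn >= thresh

    # Edge mask (assumption: a one-pixel border stands in for the edge region E).
    m_edge = np.zeros((h, w), dtype=bool)
    m_edge[0, :] = m_edge[-1, :] = m_edge[:, 0] = m_edge[:, -1] = True

    # Keep everything except pixels that are both top-K and on an edge (the "sink" pixels).
    keep = ~(m_topk & m_edge[None, :, :])
    filtered = attn * keep
    return filtered / (filtered.sum(axis=(1, 2), keepdims=True) + 1e-8)

attn_filtered = handle_attention_sink(np.random.rand(77, 64, 64))
```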
4.2.2.2. LoRA Activation Word Attention Map Derivation
Given a set of LoRA activation words (e.g., "harry_potter", "daiyulin") and their corresponding token position sets within the prompt, the cross-attention map for each LoRA is extracted by averaging over its token positions:
$
\mathbf{M}_l = \frac{1}{|\mathcal{T}_l|} \sum_{idx \in \mathcal{T}_l} \tilde{\mathbf{A}}[idx, :]
$
where:
- $\mathbf{M}_l$ is the preliminary attention map for the $l$-th LoRA.
- $|\mathcal{T}_l|$ is the number of tokens in the token position set $\mathcal{T}_l$ for the $l$-th LoRA.
- $\tilde{\mathbf{A}}[idx, :]$ refers to the attention weights from the $idx$-th text token to all image tokens.

However, these cross-attention maps can still exhibit interference between LoRAs. To enhance locality and produce more cohesive attention patterns, self-attention maps are leveraged. First, the most salient regions are identified by selecting the top 1% of pixels from the cross-attention map:
$ \mathcal{T}_{1\%} = \mathrm{TopK}(\mathbf{M}_l, K = \lfloor N_{\mathrm{img}} \times 0.01 \rfloor) $
where:
- $\mathcal{T}_{1\%}$ is the set of indices for the top 1% of pixels.
- The $\mathrm{TopK}$ function returns the indices of the top $K$ values.

The final attention map for the $l$-th LoRA then uses the self-attention weights originating from these salient regions:
$ \mathbf{M}_l^{\mathrm{self\_attn}} = \frac{1}{|\mathcal{T}_{1\%}|} \sum_{i \in \mathcal{T}_{1\%}} \mathbf{A}_{\mathrm{self}}[i, :] $
where:
- $\mathbf{M}_l^{\mathrm{self\_attn}}$ is the enhanced spatial attention distribution for the $l$-th LoRA activation word.
- $\mathbf{A}_{\mathrm{self}}$ is the self-attention map computed between image tokens. This map shows how much each image token attends to other image tokens.
- Averaging self-attention over the salient cross-attention regions provides a more focused and localized map.
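The two steps above might be sketched as follows (illustrative PyTorch, not the authors' implementation), assuming a filtered cross-attention map of shape (N_text, N_img), a self-attention map of shape (N_img, N_img), and a mapping from each LoRA's activation words to their text-token indices:

```python
import torch

def lora_attention_maps(attn_cross, attn_self, token_positions, top_frac=0.01):
    """attn_cross: (N_text, N_img) filtered cross-attention; attn_self: (N_img, N_img) self-attention.
    token_positions: dict mapping LoRA name -> list of text-token indices for its activation words."""
    n_img = attn_cross.shape[1]
    k = max(1, int(n_img * top_frac))
    maps = {}
    for name, idxs in token_positions.items():
        m_l = attn_cross[idxs].mean(dim=0)            # average over the activation-word tokens (M_l)
        salient = torch.topk(m_l, k).indices          # indices of the top 1% most salient image tokens
        maps[name] = attn_self[salient].mean(dim=0)   # average self-attention rows from those tokens
    return maps

maps = lora_attention_maps(torch.rand(77, 4096), torch.rand(4096, 4096),
                           {"harry_potter": [3, 4], "daiyulin": [9]})
```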
4.2.2.3. Superpixel-based Ensemble Masking
Pixel-wise competition between LoRA attention maps can lead to fragmented or "holey" masks. To create spatially coherent masks, a superpixel-based ensemble approach is used.
At a designated denoising step (e.g., step 6), the predicted sample (the image latent at that step) is used to generate superpixels via SLIC segmentation:
$
\mathcal{R} = \mathrm{SLIC}(\mathbf{x}_0, n_{\mathrm{segments}}, \mathrm{compactness}, \sigma)
$
where:
- $\mathcal{R}$ is the set of superpixel regions.
- $\mathbf{x}_0$ is the predicted image sample at the current denoising step.
- $n_{\mathrm{segments}}$ is the desired number of superpixels. In practice, it is set to the square root of the total image area (number of pixels).
- $\mathrm{compactness}$ (e.g., 10) controls the balance between color similarity and spatial proximity in superpixel formation.
- $\sigma$ is the standard deviation for Gaussian smoothing.

For each superpixel region $r_j$, an aggregated attention score is computed for each LoRA:
$ s_{l,j} = \sum_{(u, v) \in r_j} \mathbf{M}_l^{\mathrm{self\_attn} \uparrow}(u, v) $
where:
- $s_{l,j}$ is the aggregated attention score for the $l$-th LoRA in superpixel region $r_j$.
- $\mathbf{M}_l^{\mathrm{self\_attn} \uparrow}$ is the enhanced self-attention map (upsampled to the image resolution if necessary) for the $l$-th LoRA.
- $(u, v)$ represents pixel coordinates within the superpixel region.

The ownership of superpixel region $r_j$ is then assigned to the LoRA with the maximum aggregated score:
$ l^* = \arg\max_l s_{l,j} $
The final binary mask for the $l$-th LoRA is constructed as:
$ \mathbf{F}_l(u, v) = \begin{cases} 1, & \mathrm{if}~ (u, v) \in r_j ~\mathrm{and}~ l^* = l \\ 0, & \mathrm{otherwise} \end{cases} $
This superpixel-based voting ensures that masks are spatially coherent and have fewer artifacts compared to pixel-wise approaches.
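A sketch of the superpixel voting step using scikit-image's SLIC (function and variable names here are illustrative), assuming each per-LoRA attention map has already been upsampled to the resolution of the decoded preview image:

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_vote(image, lora_maps, compactness=10.0, sigma=1.0):
    """image: (H, W, 3) predicted sample x0; lora_maps: dict of (H, W) attention maps, one per LoRA.
    Returns one binary mask per LoRA; each SLIC region is assigned to the LoRA with the highest score."""
    h, w = image.shape[:2]
    n_segments = int(np.sqrt(h * w))                      # paper's setting: sqrt of the pixel count
    regions = slic(image, n_segments=n_segments, compactness=compactness, sigma=sigma)

    names = list(lora_maps)
    masks = {name: np.zeros((h, w), dtype=bool) for name in names}
    for r in np.unique(regions):
        in_region = regions == r
        scores = [lora_maps[name][in_region].sum() for name in names]  # aggregated attention s_{l,j}
        masks[names[int(np.argmax(scores))]][in_region] = True         # winner takes the region
    return masks

masks = superpixel_vote(np.random.rand(512, 512, 3),
                        {"harry_potter": np.random.rand(512, 512),
                         "daiyulin": np.random.rand(512, 512)})
```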
4.2.3. Stage 2: Mask Application
After the subject masks are generated in Stage 1 (e.g., at step 6 of a 28-step inference), they are repeatedly applied during the remaining denoising steps (e.g., steps 7-28). For each LoRA $l$, its output $\Delta \mathbf{h}_l$ is multiplied by its corresponding mask $\mathbf{F}_l$ before being added to the base model's hidden state. This ensures that each LoRA's influence is strictly confined to its designated region, effectively mitigating feature conflicts. The process is highly efficient, as the mask calculation is a one-time operation early in the inference.
The practical implementation notes that masks are extracted from the 17th Double Stream Block of the FLUX.1-dev model at the 6th denoising step, and n_segments for SLIC is the square root of total image pixels.
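One way Stage 2 could be realized in code is to wrap each LoRA-bearing linear layer so that its low-rank contribution is multiplied by the subject's token mask before being added back. This is a hedged sketch of the idea only; the module and attribute names are hypothetical and not taken from the released implementation:

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """Adds the LoRA delta only on tokens inside the subject's mask (Stage 2, sketched)."""
    def __init__(self, base: nn.Linear, lora_down: nn.Linear, lora_up: nn.Linear, scale: float = 1.0):
        super().__init__()
        self.base, self.down, self.up, self.scale = base, lora_down, lora_up, scale
        self.token_mask = None                    # (N_tokens,) binary mask F_l, set after the mask step

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, N_tokens, d)
        delta = self.up(self.down(x)) * self.scale          # LoRA output Delta h
        if self.token_mask is not None:
            delta = delta * self.token_mask.to(delta.dtype).view(1, -1, 1)   # confine to masked region
        return self.base(x) + delta

# Usage sketch: after mask generation, flatten the (H, W) subject mask into token order
# and assign it to every wrapped layer belonging to that subject's LoRA.
layer = MaskedLoRALinear(nn.Linear(64, 64), nn.Linear(64, 8, bias=False), nn.Linear(8, 64, bias=False))
layer.token_mask = torch.rand(16) > 0.5
out = layer(torch.randn(1, 16, 64))
```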
5. Experimental Setup
5.1. Datasets
The experiments focused on character-centric tasks as these often present the most severe LoRA conflicts. To ensure a fair comparison, the authors prepared identical sets of 5-character LoRAs for each method pipeline, resulting in a total of 20 LoRAs.
- Character LoRAs: Each of the 5 characters had a dedicated LoRA.
- Training Data: For each LoRA, 15 high-quality images covering multiple angles and diverse outfits were collected.
- Prompt Generation: Gemini-2.5 [Comanici et al., 2025] was used to generate the prompts for training the LoRAs.
- Training Process: Each LoRA was trained on the same dataset using the optimal training method recommended by the respective method's base pipeline, until convergence.
- Evaluation Prompts: A challenging set of 50 prompt sets was prepared (detailed in Appendix B). These prompts involved:
  - Close Character Interactions: e.g., hugging, kissing, caressing a face, whispering, tending a wound.
  - Complex Actions: e.g., pillow fights, arm wrestling, eating pizza.
  - Intricate Lighting Conditions: e.g., faces illuminated by a campfire or lantern.

This challenging set was designed to thoroughly examine each method's performance on complex multi-subject generation tasks.
An example of a prompt used for evaluation (from Appendix B) is:
harry-potter tucking a flower in daiyulin's hair, both smiling warmly face-to-face.
5.2. Evaluation Metrics
The paper employed four distinct evaluation metrics, plus a custom VLM Score, to comprehensively assess method performance. For all metrics, higher scores indicate better performance.
- LVFace (Facial Similarity Scoring):
  - Conceptual Definition: Measures how well each method preserves the unique character-specific features (facial identity) of the subjects. It assesses the similarity between the generated faces and the original reference faces. This directly addresses evaluation objective (1): "Ability to best preserve subject characteristics in complex scenes."
  - Mathematical Formula: The paper refers to LVFace [You et al., 2025] as a state-of-the-art face recognition model for scoring, but does not provide its internal mathematical formula. LVFace likely outputs a similarity score (e.g., cosine similarity of face embeddings) between two given face images. If $f_E$ is the embedding function of LVFace, $I_{gen}$ is a generated face image, and $I_{ref}$ is its reference, the similarity score would generally be:
$ \mathrm{Similarity}(I_{gen}, I_{ref}) = \mathrm{cosine\_similarity}(f_E(I_{gen}), f_E(I_{ref})) $
  - Symbol Explanation:
    - $I_{gen}$: The face image generated by the model.
    - $I_{ref}$: The reference face image of the character.
    - $f_E$: The embedding function of the LVFace model, which extracts a high-dimensional feature vector (embedding) from a face image.
    - $\mathrm{cosine\_similarity}$: A common metric to measure the similarity between two non-zero vectors.
- DINOv3 (Subject Region Feature Similarity):
  - Conceptual Definition: DINOv3 [Siméoni et al., 2025] is a self-supervised visual Transformer model capable of detecting subject regions and extracting feature embeddings. This metric evaluates how well the features of the generated subjects align with the features of the training images. It addresses evaluation objective (2): "Ability to generate images with quality closest to the pretraining data."
  - Mathematical Formula: The paper doesn't provide a specific formula for how DINOv3 is used to calculate this score. Typically, DINOv3 might be used to extract feature vectors for detected subjects in both generated and training images, after which a similarity metric (e.g., Euclidean distance or cosine similarity) is computed. Let $F(I, R)$ be the DINOv3 feature extractor for region $R$ in image $I$:
$ \mathrm{DINOv3\_Score} = \mathrm{Similarity}(F(I_{gen}, R_{gen}), F(I_{train}, R_{train})) $
  - Symbol Explanation:
    - $I_{gen}$: The generated image.
    - $R_{gen}$: The detected subject region in the generated image.
    - $I_{train}$: A training image for the subject.
    - $R_{train}$: The subject region in the training image.
    - $\mathrm{Similarity}$: A function to compare the feature embeddings.
- DreamSim (Human Preference Alignment for Image Quality):
  - Conceptual Definition: DreamSim [Fu et al., 2023] is a metric designed to align better with human preferences for image quality than traditional metrics. It measures the aesthetic quality and realism of the generated images, thereby addressing evaluation objective (2).
  - Mathematical Formula: Not provided in the paper, but DreamSim typically uses a learned perceptual similarity metric derived from synthetic data, aiming to capture how humans perceive image similarity and quality. It outputs a scalar score.
- HPSv3 (Human Preference Score):
  - Conceptual Definition: HPSv3 [Ma et al., 2025] is a state-of-the-art human preference alignment model, proven effective in reinforcement learning [Xue et al., 2025]. It assesses both the overall image quality and how well the generated image follows the instructions in the prompt. This metric directly addresses evaluation objectives (3) "Robustness in adhering to complex prompts" and (4) "Alignment with human preference in terms of lighting, details, realism, and artifact-free generation."
  - Mathematical Formula: Not provided. HPSv3 is a learned model that predicts a score reflecting human preference given an image and a text prompt.
- VLM Score (Vision Language Model Scoring):
  - Conceptual Definition: A custom metric defined in the paper, utilizing Gemini-2.5 [Comanici et al., 2025] (a Vision Language Model) as the scoring model. It evaluates image quality across three dimensions: character consistency, prompt consistency, and image clarity/quality. This holistic metric addresses multiple evaluation objectives.
- Character presence and clarity: 50 points
- Prompt adherence: 25 points
- Image clarity and quality: 25 points
- Total score: 100 points
- Prompt for VLM Scoring (from Appendix C):
You are an image quality evaluator specializing in character generation and image quality assessment. Please evaluate the quality of the last image (the generated image) based on the following criteria: Reference images: The first {len(reference_images)} images show reference characters <A> and <B> that should appear in the generated image. Target image: The last image is the generated image that should be evaluated. Generation prompt: "{prompt_text}" Evaluation criteria (total 100 points): 1. Character presence and clarity (50 points): Both characters from the reference images appear in the target image with clear and recognizable features. 2. Prompt adherence (25 points): The generated image follows the requirements described in the prompt. 3. Image clarity and quality (25 points): The image is clear, not blurry, and free of artifacts. Please provide: 1. Detailed analysis for each criterion 2. Score for each criterion (out of the maximum points) 3. Total score (sum of all criteria scores) 4. Brief reasoning for the scores Format your response as: Character Analysis: [your analysis] Character Score: [0-50] Prompt Analysis: [your analysis] Prompt Score: [0-25] Clarity Analysis: [your analysis] Clarity Score: [0-25] Total Score: [0-100] Reasoning: [brief explanation]
5.2.1. Experimental Protocol
- Character Pairs: The 5 prepared character LoRAs were paired pairwise to form 10 unique pairs.
- Seeds: For each prompt and character pair, results were generated using 10 different random seeds (ranging from 42 to 52).
- Total Images: Each method generated the full set of images across all prompt, pair, and seed combinations.
- Averaging: Both global averages and 10-Pass averages were calculated. The 10-Pass average takes the best result from the 10 outputs per prompt (likely based on a quality metric such as the ones used in the qualitative comparisons) for averaging, reflecting the best achievable quality.
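A small illustrative sketch of the two aggregation schemes described above (the scores and grouping are placeholders; the exact selection criterion for the best-of-10 image is not spelled out here):

```python
import numpy as np

def global_and_best_of_k(scores: np.ndarray):
    """scores: (n_prompt_pairs, n_seeds) metric values for one method.
    Returns (global average over all images, average of the per-prompt best score)."""
    global_avg = float(scores.mean())
    best_of_k_avg = float(scores.max(axis=1).mean())   # "10-Pass": keep the best of the 10 seeds
    return global_avg, best_of_k_avg

g_avg, pass10_avg = global_and_best_of_k(np.random.rand(500, 10))
```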
5.3. Baselines
The FreeFuse method was compared against a set of representative baselines and competing methods for multi-subject generation:
- Direct LoRA Joint Inference (LoRA Merge): This serves as the primary baseline. It represents the naive approach of simply applying multiple LoRAs simultaneously to the diffusion model without any specific conflict mitigation strategy. The paper refers to it as "direct LoRA joint inference" or "LoRA Merge".
- ZipLoRA [Shah et al., 2024]: A method that fuses multiple LoRAs before inference. While effective for style transfer, it is known to have limitations in multi-concept generation.
- OMG (Occlusion-friendly Multi-concept Generation) [Kong et al., 2024]: Introduces an auxiliary model to localize character regions and uses noise blending to mitigate conflicts.
- Mix-of-Show [Gu et al., 2023]: Requires retraining LoRAs and manually specifying spatial constraints (e.g., bounding boxes) for each LoRA.
- CLoRA [Meral et al., 2024]: Leverages attention maps to derive concept masks but requires template prompts as a basis for mask extraction.
5.4. Implementation Details
- Base Model: FLUX.1-dev, a DiT-based text-to-image model.
- Framework: Code built on Hugging Face Diffusers [Hugging Face, 2025].
- LoRA Training: The LoRAs used in the experiments were trained with the AI Toolkit [ostris, 2025] framework.
- Mask Extraction Timing:
  - Standard 28-step inference process.
  - No intervention during the first 6 steps.
  - At step 6, the subject mask is extracted by computing the Attention Map from the 17th double_stream_block.
- Superpixel Parameters: For superpixel-level voting, n_segments was set to the square root of the total image pixels, and compactness was set to 10.
- Mask Application: During the remaining denoising steps (steps 7-28), each LoRA output is multiplied by its corresponding mask until inference completes.
- Hardware: Experiments were conducted on a single NVIDIA L20 GPU with 48GB VRAM.
- Inference Time: Achieved an average inference time of 37 seconds.
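For context, the baseline "direct LoRA joint inference" setup that FreeFuse improves on can be reproduced with Hugging Face Diffusers roughly as follows; the LoRA paths are illustrative placeholders, and FreeFuse's masking hooks are not part of this sketch:

```python
import torch
from diffusers import FluxPipeline

# Baseline "LoRA Merge"-style joint inference: both subject LoRAs are active over the whole image.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",
                                    torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("path/to/harry_potter_lora", adapter_name="harry_potter")  # hypothetical path
pipe.load_lora_weights("path/to/daiyulin_lora", adapter_name="daiyulin")          # hypothetical path
pipe.set_adapters(["harry_potter", "daiyulin"], adapter_weights=[1.0, 1.0])

image = pipe(
    "harry_potter tucking a flower in daiyulin's hair, both smiling warmly face-to-face",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("joint_inference.png")
```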
6. Results & Analysis
6.1. Core Results Analysis
The experimental results validate that FreeFuse significantly outperforms existing approaches in multi-subject generation tasks across various quantitative metrics and qualitative assessments.
The following are the results from Table 2 of the original paper:
| | LoRA Merge | ZipLoRA | OMG | Mix-of-Show | CLoRA | Ours |
|---|---|---|---|---|---|---|
| EAONI Avg. | 0.5314 | 0.4781 | 0.4457 | 0.5284 | 0.4452 | 0.5397 |
| EAONI 10-Pass | 0.5946 | 0.5256 | 0.5045 | 0.5789 | 0.4953 | 0.5949 |
| cemm Avg. | 0.7242 | 0.6648 | 0.6292 | 0.7324 | 0.6413 | 0.7368 |
| cemm 10-Pass | 0.7683 | 0.7187 | 0.7025 | 0.7921 | 0.7037 | 0.8052 |
| Tee Avg. | 0.2876 | 0.2037 | 0.2179 | 0.3430 | 0.1837 | 0.3302 |
| Tee 10-Pass | 0.3698 | 0.2720 | 0.3018 | 0.4417 | 0.2625 | 0.4685 |
| EASdH Avg. | 9.128 | 9.024 | 9.052 | 6.868 | 5.526 | 10.63 |
| EASdH 10-Pass | 10.71 | 10.92 | 10.80 | 8.644 | 9.383 | 12.25 |
| VLM Score | 51.94 | 49.97 | 53.02 | 57.74 | 23.56 | 74.03 |
Quantitative Results Analysis (Table 2):
- VLM Score: FreeFuse achieves a significantly higher VLM Score of 74.03, compared to the next best, Mix-of-Show, at 57.74. This metric, derived from Gemini-2.5 evaluating character consistency, prompt adherence, and image quality, indicates FreeFuse's superior performance across these critical aspects. The large margin suggests that FreeFuse excels at producing images that are both visually high-quality and semantically aligned with complex multi-subject prompts.
- LVFace (Facial Similarity): For the LVFace-AVG metric (average facial similarity), FreeFuse scores competitively (e.g., 0.5949 in the 10-Pass for the cemm row, versus 0.5946 for LoRA Merge in the EAONI row, though the mapping of table rows to LVFace is not fully clear in the presented format, assuming the last two rows before VLM Score are the relevant ones). The paper notes that Mix-of-Show might slightly outperform FreeFuse in LVFace-AVG due to its reliance on user-specified rectangular regions, which may simplify face detection. However, FreeFuse shows superior performance in 10-Pass tasks, implying that when given multiple attempts, its adaptive masks (which can capture more complex interactions) lead to better results. For EASdH, FreeFuse achieves 12.25 (10-Pass), significantly higher than all others, indicating excellent subject characteristic preservation.
- DINOv3 and DreamSim (Image Quality and Feature Alignment): While DINOv3 can sometimes score high for artifact-heavy images, DreamSim aligns better with human preferences. The provided table appears to report these metrics under the EAONI, cemm, and Tee rows, although the row labels are not clearly mapped to the metric names given in the setup. FreeFuse consistently shows higher 10-Pass scores across these rows (e.g., 0.5949 for EAONI, 0.8052 for cemm, 0.4685 for Tee), indicating better alignment with training data features and higher perceived image quality according to human preferences.
- HPSv3 (Human Preference Score): The HPSv3 score for FreeFuse (e.g., 12.25 for EASdH 10-Pass, assuming that row corresponds to HPSv3) is notably higher than competitors, demonstrating its superior ability to generate aesthetically pleasing images that follow instructions well.

Overall, FreeFuse demonstrates clear advantages in image quality, character feature preservation, and alignment with human preferences and prompts, especially under challenging multi-subject interaction scenarios.
Qualitative Results Analysis (Figure 6):

Figure 6: Qualitative results
Figure 6 presents a qualitative comparison, showing that FreeFuse produces superior images compared to baseline and competing methods.
- Visual Fidelity: FreeFuse generates images with higher detail, better lighting, and greater realism.
- Character Interaction: It effectively handles complex interactions, such as physical contact, which other methods struggle with. The example shows FreeFuse successfully depicting "harry-potter tucking a flower in daiyulin's hair" with warm smiles and natural facial expressions, while other methods show merged, distorted, or incorrect features (e.g., Mix-of-Show struggles with the interaction, CLoRA has poor character quality, OMG shows feature blending).
- Subject Fidelity: The characters generated by FreeFuse more closely resemble their original LoRA representations, without feature bleeding or distortion.
6.2. Ablation Studies / Parameter Analysis
The paper includes an ablation study (Figure 7) to demonstrate the importance of each component of FreeFuse's mask generation pipeline.

Figure 7: Ablation studies demonstrate that each step of our method is essential for producing highly usable subject masks.
The study investigates the impact of removing three key components: attention sink handling, the use of self-attention maps, and block-level voting.
- Without Attention Sink Handling: If attention sink handling is omitted, the cross-attention maps become noisy. This often causes one LoRA to over-focus on trivial boundary elements (sinks), leaving most of the meaningful regions to be dominated by another LoRA. This leads to poor mask separation and reduced generation quality.
- Without Self-Attention Maps: Removing the use of self-attention maps (relying solely on cross-attention) results in masks with severe cross-intrusion. This means the masks for different subjects overlap significantly or are intermingled, leading to feature conflicts and confusion similar to basic LoRA merging. The self-attention component is crucial for enhancing the locality and cohesiveness of each subject's attention region.
- Without Block-Level Voting (Superpixel-based Ensemble Masking): Without superpixel-based ensemble masking (or block-level voting), the masks generated at the pixel level contain numerous holes and are fragmented. This lack of spatial coherence in the masks translates into artifacts and inconsistencies in the generated image. The superpixel approach effectively smooths and consolidates these regions.

The ablation study clearly demonstrates that each step—attention sink handling, self-attention map integration, and superpixel-based ensemble masking—is critical for generating the high-quality, usable subject masks that underpin FreeFuse's success in multi-subject generation.
7. Conclusion & Reflections
7.1. Conclusion Summary
FreeFuse presents a highly practical and efficient method for multi-concept generation by fusing multiple subject LoRAs in text-to-image diffusion models. The core insight is that feature conflicts during joint inference can be effectively mitigated by constraining each LoRA to its target region using automatically derived dynamic subject masks. The mathematical analysis confirms that masking LoRA outputs well approximates individual LoRA application within specified regions. FreeFuse achieves this training-free, requiring no LoRA modifications, auxiliary models, or user-defined prompt templates or regions. Instead, it intelligently leverages cross-attention layer weights, enhanced by attention sink handling, self-attention maps, and superpixel-based block voting, to generate high-quality masks from a single attention block at an early denoising step. Extensive experiments demonstrate that FreeFuse surpasses previous methods in subject fidelity, prompt adherence, and overall generation quality, particularly in complex character-centric tasks.
7.2. Limitations & Future Work
The authors acknowledge a key limitation: the effectiveness of FreeFuse's theoretical foundation—that applying subject masks approximates individual LoRA usage—gradually diminishes as the number of subject LoRAs increases. This occurs because each LoRA is assigned an increasingly smaller region, thereby creating more opportunities for features from other LoRAs to intrude into the target region. The authors identify addressing this feature intrusion with a higher number of subjects as a goal for future improvements.
7.3. Personal Insights & Critique
FreeFuse represents a significant step forward in the practicality and usability of multi-subject generation with LoRAs. Its "training-free" and "no auxiliary model" nature is a major advantage, drastically lowering the barrier to entry for users who want to combine multiple personalized concepts. The efficiency gains from extracting masks once, early in the inference process, are also noteworthy.
One key insight drawn from this paper is the power of leveraging the intrinsic attention mechanisms of Transformer-based models for semantic control. The ability to automatically derive masks from cross-attention without explicit training or external models is elegant and robust. The ablation study effectively demonstrates the synergistic effect of its components, particularly the attention sink handling and superpixel-based voting, which address common issues in raw attention maps.
Potential Issues/Areas for Improvement:
- Generalization Beyond Characters: While the paper focuses on character-centric tasks, it would be interesting to see how FreeFuse performs with other types of subjects or concepts (e.g., specific objects, styles, or environments) in multi-concept scenarios. Do cross-attention maps for non-character LoRAs exhibit similar locality and separability?
- Increased Subject Count: The acknowledged limitation regarding an increasing number of subject LoRAs is critical. If masks become too small or fragmented, feature intrusion will indeed become a bigger problem. Future work could explore more advanced mask refinement techniques, or incorporate some form of soft masking or attention control that allows for more nuanced blending at boundaries without full intrusion.
- Ambiguous Prompts: The reliance on LoRA activation words and cross-attention might face challenges with ambiguous prompts or LoRAs whose activation words overlap with common vocabulary, potentially leading to incorrect mask assignments.
- DiT Model Dependency: The method is developed and tested on DiT-based models. While DiT is advanced, its direct applicability to U-Net-based models (which are still widely used) would need further validation. The specific attention block and denoising step chosen might also be sensitive to the base model.
Transferability/Applicability to Other Domains:
The core idea of deriving semantic masks from attention mechanisms and using them to constrain modular components (LoRAs) could be applied to other areas:
- Local Editing: FreeFuse's mask generation could be adapted for local image editing, allowing users to precisely control which parts of an image are affected by an edit (e.g., changing color, style, or adding elements to a specific region).
- Style Transfer: When combining multiple style LoRAs, the same masking principle could ensure that different styles are applied to distinct regions of an image without bleeding into each other.
- Fine-Grained Control: This method offers a pathway towards more fine-grained control in generative models, allowing users to dictate how different elements of a prompt (or different LoRAs) manifest spatially without needing complex segmentation tools.
- Explainability: The generated attention-based masks can also provide a degree of explainability, showing which regions of the image correspond to which LoRA or prompt token, which is valuable for understanding model behavior.

In conclusion, FreeFuse offers a pragmatic and powerful solution that simplifies the complex task of multi-subject generation, setting a high bar for efficiency and user-friendliness in the evolving landscape of personalized text-to-image synthesis.