Infinite-Story: A Training-Free Consistent Text-to-Image Generation
TL;DR Summary
Infinite-Story is a training-free framework for consistent text-to-image generation in multi-prompt scenarios, addressing identity and style inconsistencies. With its key techniques, it achieves state-of-the-art performance and is over 6X faster at inference than existing consistent T2I models.
Abstract
We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Infinite-Story: A Training-Free Consistent Text-to-Image Generation". This title indicates the paper introduces a novel framework for generating a sequence of images from text prompts that maintain high consistency in identity and style, without requiring any model training or fine-tuning, focusing on visual storytelling scenarios.
1.2. Authors
The authors are Jihun Park, Kyoungmin Lee, Jongmin Gim, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, and Sunghoon Im (corresponding author, denoted by †*). All authors are affiliated with DGIST, South Korea. Their research backgrounds appear to be in computer vision, machine learning, and generative AI, specifically focusing on text-to-image generation and consistency in visual outputs.
1.3. Journal/Conference
The paper does not explicitly state the journal or conference it was published in. However, given the nature of the research (text-to-image generation, computer vision, machine learning) and the typical publication venues for such work (e.g., CVPR, ICCV, NeurIPS, ICLR, ACM SIGGRAPH), it is likely intended for a highly reputable conference or journal in these fields. The inclusion of an arXiv preprint suggests it's awaiting or has been submitted to a peer-reviewed venue.
1.4. Publication Year
The preprint was posted to arXiv on 2025-11-17.
1.5. Abstract
This paper introduces Infinite-Story, a training-free framework designed for consistent text-to-image (T2I) generation in multi-prompt storytelling. Built upon a scale-wise autoregressive model, it addresses identity inconsistency and style inconsistency. The framework employs three main techniques: Identity Prompt Replacement to mitigate context bias in text encoders and align identity attributes, and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation to enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based methods that require fine-tuning or are slow, Infinite-Story operates entirely at test time, offering high identity and style consistency across diverse prompts. Experimental results show state-of-the-art generation performance and over 6X faster inference (1.72 seconds per image) compared to existing consistent T2I models, making it practical for real-world visual storytelling.
1.6. Original Source Link
Original Source Link: https://arxiv.org/abs/2511.13002 PDF Link: https://arxiv.org/pdf/2511.13002v1.pdf This is a preprint publication on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the lack of consistency in generated images from Text-to-Image (T2I) models, particularly in scenarios requiring coherence across multiple images, such as visual storytelling, comic strip generation, and character-driven content. While large-scale diffusion-based T2I models (e.g., Rombach et al. 2022, Ramesh et al. 2021) have achieved remarkable performance in generating high-quality individual images, they struggle to maintain a consistent subject identity (e.g., the same character looks different in different scenes) and visual style (e.g., varying art styles or lighting across a sequence) when generating multiple images from different prompts.
This problem is important because inconsistent outputs severely limit the usability of T2I models for practical applications that require visual narratives. For instance, creating a comic strip where the main character's appearance changes from panel to panel would disrupt the story's coherence. Prior research in consistent text-to-image generation has largely focused on identity consistency but often overlooks style consistency, which is equally crucial for visually coherent narratives.
Another significant challenge is the inference speed of existing consistent T2I models, especially those based on diffusion models. These models typically take over 10 seconds per image, which is too slow for interactive applications and can lead to user frustration, as per Nielsen's usability guidelines. While scale-wise autoregressive models have emerged as faster alternatives, they also face consistency issues.
The paper's entry point is to leverage the speed benefits of scale-wise autoregressive models while introducing a training-free framework to address both identity and style inconsistencies. The innovative idea is to achieve this consistency by judiciously manipulating prompt embeddings and attention mechanisms during inference, without the need for extensive fine-tuning or re-training the base generative model. This makes the solution highly efficient and practical.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- First Training-Free Scale-Wise Autoregressive Framework for Consistent T2I Generation: The paper introduces Infinite-Story, the first framework that achieves consistent text-to-image generation in a training-free manner using a scale-wise autoregressive model (specifically, based on the Infinity architecture). This is a significant departure from many prior diffusion-based methods that require fine-tuning or suffer from slow inference.
- Identity Prompt Replacement (IPR) Technique: A novel technique that aligns identity attributes across prompts by unifying identity prompt embeddings. This mitigates context bias in text encoders, which often causes the same identity to appear differently depending on the surrounding descriptive words in a prompt.
- Unified Attention Guidance Mechanism: This mechanism consists of two complementary parts:
  - Adaptive Style Injection (ASI): Enhances both identity appearance and global visual style consistency by injecting reference features into early-stage self-attention layers, using a similarity-guided adaptive operation to smoothly align appearance.
  - Synchronized Guidance Adaptation (SGA): Ensures prompt fidelity by applying the same feature adaptation (derived from ASI) to both the conditional and unconditional branches of Classifier-Free Guidance (CFG), maintaining the balance necessary for accurately reflecting text prompts while preserving consistency.
- State-of-the-Art Performance with High Efficiency: Infinite-Story achieves state-of-the-art generation quality in terms of both identity and style consistency, as demonstrated by quantitative metrics (highest DINO similarity, CLIP-I, and DreamSim scores) and qualitative results. Crucially, it offers significantly faster inference at approximately 1.72 seconds per image, over 6 times faster than the existing fastest diffusion-based consistent T2I models.

These findings address the problems of identity and style inconsistency in multi-prompt T2I generation and overcome the slow inference of previous approaches. The training-free nature makes the method highly practical and adaptable, enabling real-time and interactive visual storytelling applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner needs to grasp several foundational concepts in generative AI and natural language processing:
- Text-to-Image (T2I) Generation: The task of generating photorealistic or artistic images from natural language descriptions (text prompts). Models learn to map text semantics to visual features.
- Generative Models: Algorithms that create new data instances resembling the training data. Examples include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models.
- Diffusion Models: A class of generative models that learn to reverse a diffusion process. They iteratively denoise a random signal (e.g., Gaussian noise) into a coherent image, guided by a text prompt. They are known for high-quality image generation but can be computationally intensive and slow during inference.
- Autoregressive Models: Models that generate data point by point, where each new point is conditioned on the previously generated points. For images, this can mean generating pixels or tokens sequentially or in blocks. Traditional autoregressive models can be very slow for high-resolution images.
- Scale-Wise Autoregressive Models: An evolution of autoregressive models that generate images by iteratively predicting features at increasing scales (e.g., from low resolution to high resolution). This next-scale prediction paradigm improves efficiency over pixel-by-pixel autoregression or diffusion models by processing information coarsely at early stages and refining it later. The paper uses the Infinity architecture (Han et al. 2024), a scale-wise autoregressive model, as its backbone.
- Text Encoders: Neural networks (often Transformers) that convert text (words, sentences) into numerical representations called embeddings or contextual embeddings. These embeddings capture the semantic meaning of the text and are used to condition image generation models. The paper uses Flan-T5 (Chung et al. 2024), a powerful pre-trained language model, as the text encoder.
- Embeddings: Numerical vectors that represent words, sentences, or images in a continuous vector space. Similar items have embeddings that are close to each other in this space.
- Transformer Architecture: A neural network architecture (Vaswani et al. 2017) that relies heavily on self-attention mechanisms. It has become dominant in natural language processing and is increasingly used in computer vision.
- Self-Attention Mechanism: A core component of Transformers. It allows the model to weigh the importance of different parts of the input sequence (or image features) when processing each part. For a given query, it computes attention scores against all keys and uses these scores to form a weighted sum of values: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $, where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the dimension of the keys.
- Classifier-Free Guidance (CFG): A technique used in conditional generative models (like T2I models) to improve the alignment of generated samples with the conditioning input (e.g., a text prompt). It runs two parallel generation passes, one conditioned on the prompt and one unconditional (conditioned on an empty prompt), and extrapolates the output in the direction of the conditional pass, effectively amplifying the influence of the prompt and improving prompt fidelity.
- Fine-tuning: Further training a pre-trained model on a smaller, task-specific dataset, which adjusts the model's weights. Training-free methods, like Infinite-Story, avoid this step, making them faster to deploy.
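For readers who want to see the two most load-bearing of these concepts concretely, here is a minimal PyTorch sketch of scaled dot-product self-attention and of classifier-free guidance. The function names and the guidance scale are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq, seq) attention scores
    return F.softmax(scores, dim=-1) @ v            # weighted sum of values

def classifier_free_guidance(pred_cond, pred_uncond, scale=4.0):
    # Extrapolate from the unconditional prediction toward the conditional one.
    # `scale` is an illustrative guidance weight, not a value from the paper.
    return pred_uncond + scale * (pred_cond - pred_uncond)
```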
3.2. Previous Works
The paper categorizes related work into Text-to-Image generation, Personalized image generation, and Consistent text-to-image generation.
3.2.1. Text-to-Image Generation
- Early T2I Models: These include GAN-based models (Kang et al. 2023) and autoregressive (AR)-based models (Chang et al. 2023, Han et al. 2024, Tang et al. 2024). AR models evolved from next-token prediction (Van Den Oord, Vinyals et al. 2017; Esser, Rombach, and Ommer 2021) to masked token generation (Chang et al. 2022, 2023) and then to next-scale prediction (Tian et al. 2024), the latter significantly improving efficiency.
- Diffusion Models: Currently dominant, offering strong synthesis quality. Key examples include Ramesh et al. 2021 (DALL-E), Rombach et al. 2022 (Stable Diffusion), Saharia et al. 2022 (Imagen), Podell et al. 2023 (SDXL), and Labs 2024 (FLUX). While excellent for single-image quality, they are known for slow inference and struggle with subject identity consistency across multiple images.
- This paper's fit: Infinite-Story builds upon the faster scale-wise autoregressive paradigm (Han et al. 2024) to address the latency issue while focusing on consistency, a common limitation of all T2I models.
3.2.2. Personalized Image Generation
This field focuses on generating images with user-specified features.
- Subject-Driven Methods: Aim to inject specific concept embeddings (e.g., a particular person or object) from reference images. Examples include Li, Li, and Hoi 2023, Gal et al., Wei et al. 2023, Ye et al. 2023, and Ruiz et al. 2023. Many require fine-tuning or parameter-efficient fine-tuning (Nam et al. 2024, Kumari et al. 2023), often needing external datasets.
- Style-Driven Methods: Focus on maintaining visual consistency in terms of style. This can involve LoRA-based tuning (Frenkel et al. 2024, Shah et al. 2024, Sohn et al. 2023, Hu et al. 2022, Ryu 2022) or adapting attention mechanisms (Hertz et al. 2024, Park et al. 2025) for stylistic coherence.
- Limitation: Most personalized generation methods rely on diffusion models, inheriting their slow inference speed.
- This paper's fit: Infinite-Story tackles both subject (identity) and style consistency, but does so in a training-free and fast manner, making it more suitable for interactive use cases than prior personalized generation methods.
3.2.3. Consistent Text-to-Image Generation
This is a more specific area within personalized image generation, focusing on preserving identity across multiple images.
- Attention-based Consistency: Many recent works manipulate attention-layer weights to modulate identity (Kumari et al. 2023, Li et al. 2024, Zhou et al. 2024b, Tewel et al. 2024).
- Structured Control: Incorporating additional controls to aid identity preservation (Mou et al. 2023; Zhang, Rao, and Agrawala 2023).
- Textual Conditioning: Leveraging the linguistic strength of transformer-based text encoders (Radford et al. 2021, Vaswani et al. 2017, Devlin et al. 2019, Chen et al. 2025, Raffel et al. 2023) and enhanced textual conditioning (Hertz et al. 2022, Gal et al. 2022) to improve consistency. Liu et al. 2025 specifically use prompt embedding variations.
- Limitations: While these methods address identity consistency, they often overlook style consistency and are typically diffusion-based, leading to slow inference.
- This paper's fit: Infinite-Story is directly inspired by these insights, manipulating prompt embeddings and attention mechanisms, but uniquely extends them to a training-free and fast scale-wise autoregressive setting, ensuring both identity and style consistency.
3.3. Technological Evolution
The field of text-to-image generation has evolved from early attempts with GANs and autoregressive models generating lower-resolution or less coherent images to the current era dominated by powerful diffusion models capable of producing stunning high-resolution, photorealistic outputs. The Transformer architecture has played a pivotal role, first revolutionizing NLP and then becoming integral to modern generative models, including diffusion models and advanced autoregressive variants.
A key evolutionary step was the move from token-by-token or pixel-by-pixel generation in early autoregressive models to masked token generation and then next-scale prediction. This shift aimed to tackle the inherent slowness of autoregressive generation for complex data like images, making them more competitive in terms of speed while maintaining quality.
Initially, the focus was primarily on generating high-quality individual images. The subsequent evolution addressed specific challenges like personalization (making models generate specific subjects or styles) and consistency (maintaining those specific subjects/styles across multiple generations). Many solutions for personalization and consistency involved fine-tuning pre-trained diffusion models, which, while effective, reintroduced the problem of slow inference and high computational cost.
This paper's work fits within this technological timeline by addressing the efficiency and consistency gap. It leverages the latest advancements in fast scale-wise autoregressive models (Infinity) and combines them with training-free techniques for identity and style consistency, offering a practical solution for real-world storytelling applications that demand both speed and coherence.
3.4. Differentiation Analysis
Compared to the main methods in related work, Infinite-Story offers several core differences and innovations:
- Training-Free Nature: Unlike many diffusion-based approaches for consistent T2I generation (e.g., IP-Adapter, PhotoMaker, OneActor, The Chosen One), which often require fine-tuning or dedicated training steps to achieve personalization or consistency, Infinite-Story operates entirely at test time. This significantly reduces computational overhead and increases flexibility for users. ConsiStory and 1Prompt1Story are also training-free, but they remain diffusion-based.
- Leveraging Scale-Wise Autoregressive Models: Most existing consistent T2I methods are built upon diffusion models (e.g., Stable Diffusion XL). While diffusion models excel in quality, they are inherently slow. Infinite-Story uniquely builds upon a scale-wise autoregressive model (Infinity), which is designed for much faster inference, thereby addressing the critical latency issue.
- Dual Focus on Identity and Style Consistency: While many prior works in consistent text-to-image generation primarily focus on identity consistency (e.g., ConsiStory, 1Prompt1Story, OneActor), they often overlook style consistency across generated images. Infinite-Story explicitly tackles both challenges through its Adaptive Style Injection and Synchronized Guidance Adaptation, ensuring a unified visual narrative.
- Novel Techniques for Consistency:
  - Identity Prompt Replacement (IPR): Addresses context bias in text encoders, which can subtly alter identity attributes based on surrounding descriptive words. This is a targeted form of prompt-embedding engineering for consistency.
  - Unified Attention Guidance (ASI + SGA): Directly manipulates self-attention layers and integrates with Classifier-Free Guidance to propagate identity and style from a reference image, a novel combination for achieving comprehensive consistency without training.
- Superior Efficiency: The paper explicitly highlights over 6X faster inference than the fastest existing consistent T2I models (1.72 seconds per image vs. 10+ seconds), making it highly suitable for interactive and real-time applications where others fall short.

In essence, Infinite-Story innovates by combining a fast generative backbone with lightweight, training-free interventions that simultaneously ensure identity and style consistency, a comprehensive solution not fully achieved by prior slower or less holistic methods.
4. Methodology
4.1. Principles
The core idea behind Infinite-Story is to achieve consistent subject identity and overall visual style across a sequence of images generated from multiple text prompts, all without requiring any additional training or fine-tuning of the underlying generative model. This is particularly challenging because standard Text-to-Image (T2I) models, when given different prompts, tend to produce images where the "same" subject or artistic style varies subtly, or even significantly, due to the contextual nature of text embeddings and the stochasticity of the generation process.
The theoretical basis or intuition is that consistency can be imposed at critical junctures of the generation process by leveraging a "reference" image or its extracted features. This reference acts as an anchor for identity and style. Specifically, the model focuses on three key areas for intervention:
- Text Encoding: The way text prompts are encoded can introduce biases, where the same character description may yield different visual attributes depending on the other words in the prompt. This suggests that manipulating prompt embeddings can unify identity representations.
- Early Generation Steps: The early stages of image generation (e.g., lower-resolution feature maps or initial diffusion steps) are largely responsible for establishing global structure and style. Intervening here can set a consistent visual foundation.
- Attention Mechanism: The self-attention mechanism within Transformer-based models determines how different parts of the image (or its features) relate to each other and to the text prompt. Guiding this attention with reference features encourages the model to focus on and preserve consistent visual elements.

By operating on a scale-wise autoregressive model (which builds images incrementally from coarse to fine scales), these interventions can be strategically applied at early, global-affecting scales to propagate consistency efficiently.
4.2. Core Methodology In-depth (Layer by Layer)
Infinite-Story is built upon the Infinity architecture, a scale-wise autoregressive model that employs a next-scale prediction scheme. The overall pipeline involves a text encoder, a transformer for autoregressive feature map prediction, and a decoder for image reconstruction.
The framework aims to generate multiple images, denoted $I = \{I_i\}_{i=1}^{B}$, from corresponding text prompts $t = \{t_i\}_{i=1}^{B}$. Each prompt $t_i$ is conceptually composed of an identity prompt $t_i^{id}$ and an expression prompt $t_i^{ex}$. The goal is to ensure consistency in both identity and overall style across all generated images. All prompts are processed in parallel as a batch.

The generation process of the base Infinity model is given by:

$ I = \mathcal{D}(f_K), \qquad f_k = \sum_{j=1}^{k} \mathrm{up}\big(r_j, (h_K, w_K)\big), \qquad r_k = \mathcal{G}\big(f_{k-1}, T\big), $

where:

- $I$ represents the final generated images.
- $\mathcal{D}$ is the image decoder that reconstructs images from the final feature maps.
- $f_K$ denotes the final feature map at the last step $K$.
- $f_k$ is the cumulative feature map at step $k$, formed by upsampling and summing the residual feature maps from previous steps.
- $\mathrm{up}(\cdot, (h_K, w_K))$ is a bilinear upsampling function that scales a feature map to the target image size $(h_K, w_K)$.
- $r_k$ is the quantized residual feature map at step $k$, predicted by the transformer $\mathcal{G}$. It has dimensions $B \times h_k \times w_k$, where $B$ is the batch size (number of images) and $(h_k, w_k)$ are the spatial sizes at step $k$.
- $\mathcal{G}$ is the transformer that autoregressively predicts the residual feature map $r_k$, conditioned on the previous cumulative feature map $f_{k-1}$ and the text embeddings $T$.
- $T = E_T(t)$ represents the encoded identity and expression features, obtained by the text encoder $E_T$ from the input text prompts $t$, with $T = \{(T_i^{id}, T_i^{ex})\}_{i=1}^{B}$, where $T_i^{id}$ are identity embeddings and $T_i^{ex}$ are expression embeddings for each prompt $t_i$.
- The initial feature map $f_1$ is initialized from $r_1$.
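As an illustration of this next-scale prediction scheme (not the official Infinity implementation), a schematic PyTorch-style loop might look as follows; `transformer`, `decoder`, the channel count, and the shapes are placeholders.

```python
import torch
import torch.nn.functional as F

def generate(transformer, decoder, text_emb, scales, channels=32):
    """Schematic next-scale prediction loop under the assumptions stated above.

    scales: list of (h_k, w_k) spatial sizes, coarse to fine.
    text_emb: batched identity/expression embeddings T, shape (B, ...).
    """
    B = text_emb.shape[0]
    h_K, w_K = scales[-1]
    f = torch.zeros(B, channels, h_K, w_K)          # cumulative feature map
    for (h_k, w_k) in scales:
        # The transformer predicts the quantized residual at this scale,
        # conditioned on the cumulative map so far and the text embeddings.
        r_k = transformer(f, text_emb, out_size=(h_k, w_k))
        # Upsample the residual to the final resolution and accumulate it.
        f = f + F.interpolate(r_k, size=(h_K, w_K), mode="bilinear",
                              align_corners=False)
    return decoder(f)                                # decode final feature map to images
```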
The overall pipeline of Infinite-Story (Figure 4 from the paper) integrates three complementary techniques into this base generation process:
This figure is a schematic diagram of the overall Infinite-Story pipeline. It shows the text encoder, the Identity Prompt Replacement module, and the unified attention guidance mechanism, which together process the text prompts and generate images. The text encoder converts a set of prompts into contextual embeddings, to which Identity Prompt Replacement is applied to enhance consistency. The residual feature maps produced by the transformer are then decoded into the final images, with Adaptive Style Injection and Synchronized Guidance Adaptation jointly providing identity and style guidance.
Figure 4: Overall pipeline of our method. The text encoder $E_T$ (Chung et al. 2024) processes a set of text prompts $t$, producing contextual embeddings $T$ that condition the transformer. Identity Prompt Replacement is applied to $T$ before generation; the transformer then produces residual feature maps, which are decoded into the final images $I$ via the image decoder.
4.2.1. Identity Prompt Replacement (IPR)
The first step is to apply Identity Prompt Replacement (IPR) to the text embeddings before the image generation process begins. This technique addresses context bias, a phenomenon where the same identity description (e.g., "a dog") can lead to different visual attributes (e.g., breed, age, gender) depending on the surrounding words in the prompt (e.g., "a dog springing toward a frisbee" vs. "a dog on a porch swing"). This bias arises from the self-attention mechanism within text encoders, where the representation of an identity is influenced by its context.
IPR aims to align identity-related attributes across all prompts by enforcing a consistent representation of identity. It does this by replacing all identity embeddings in the batch with the identity embedding of a designated reference instance (typically the first sample in the batch). To maintain the semantic relationship between identity and expression, a proportional scaling is applied to the expression embeddings.
The Identity Prompt Replacement is formulated as:

$ \tilde{T}_i = \big(\tilde{T}_i^{id}, \tilde{T}_i^{ex}\big), \qquad \tilde{T}_i^{id} = T_1^{id}, \qquad \tilde{T}_i^{ex} = \frac{\lVert T_1^{id} \rVert}{\lVert T_i^{id} \rVert} \, T_i^{ex}, $

where:

- $\tilde{T}_i$ represents the modified text embeddings after IPR for the $i$-th sample, consisting of the modified identity embedding $\tilde{T}_i^{id}$ and the modified expression embedding $\tilde{T}_i^{ex}$.
- $T_1^{id}$ is the identity embedding of the reference instance (the first sample in the batch), used as the canonical identity for all samples.
- $T_i^{id}$ is the original identity embedding of the $i$-th sample.
- $T_i^{ex}$ is the original expression embedding of the $i$-th sample.
- The ratio $\lVert T_1^{id} \rVert / \lVert T_i^{id} \rVert$ acts as a scaling factor: the expression embedding is scaled proportionally to the ratio of the reference identity embedding to the sample's own identity embedding, adapting the expression to the unified identity representation while preserving the overall semantic relationship between identity and expression.
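A minimal sketch of how IPR could be applied to batched prompt embeddings is shown below; the tensor layout and the norm-ratio form of the scaling factor are assumptions for illustration, not the authors' exact implementation.

```python
import torch

def identity_prompt_replacement(id_emb, ex_emb, ref_idx=0, eps=1e-8):
    """Sketch of Identity Prompt Replacement (IPR), assuming a norm-ratio scaling.

    id_emb: (B, L_id, D) identity-token embeddings per prompt
    ex_emb: (B, L_ex, D) expression-token embeddings per prompt
    """
    ref_id = id_emb[ref_idx:ref_idx + 1]                      # (1, L_id, D) reference identity
    # Replace every sample's identity embedding with the reference one.
    new_id = ref_id.expand_as(id_emb).clone()
    # Scale each expression embedding by the ratio of the reference identity norm
    # to the sample's own identity norm (assumed to be a single scalar per sample).
    ref_norm = ref_id.flatten(1).norm(dim=-1)                  # (1,)
    own_norm = id_emb.flatten(1).norm(dim=-1).clamp_min(eps)   # (B,)
    scale = (ref_norm / own_norm).view(-1, 1, 1)               # (B, 1, 1)
    new_ex = scale * ex_emb
    return new_id, new_ex
```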
4.2.2. Unified Attention Guidance
Even with IPR, which aligns context-level identity attributes, preserving the precise visual appearance (appearance-level identity) and the global visual style (e.g., art style, lighting) can still be a challenge. To address this, a Unified Attention Guidance mechanism is introduced, comprising Adaptive Style Injection (ASI) and Synchronized Guidance Adaptation (SGA). These are applied to the self-attention layers within the transformer during the early generation steps (the coarse, low-resolution scales), as these steps are crucial for establishing global visual properties.
4.2.2.1. Adaptive Style Injection (ASI)
ASI aims to align both the appearance of the identity and the overall scene style. It operates within the self-attention layers. In self-attention, input features are transformed into Query (Q), Key (K), and Value (V) matrices. ASI modifies the Key and Value features for each sample in the batch, using the features from the reference instance.
The Adaptive Style Injection is defined as follows for a given self-attention layer at step $k$:

$ \tilde{K}_i^k = K_1^k, \qquad \tilde{V}_i^k = \alpha_i^k \, V_i^k + \big(1 - \alpha_i^k\big) \, V_1^k, \qquad \alpha_i^k = \lambda \cdot \cos\!\big(V_1^k, V_i^k\big), $

where:

- $\tilde{K}_i^k$ and $\tilde{V}_i^k$ are the modified Key and Value features for the $i$-th sample at step $k$.
- $K_1^k$ and $V_1^k$ denote the Key and Value features of the reference sample (the first sample in the batch).
- $V_i^k$ is the original Value feature of the $i$-th sample at step $k$.
- The Key features of all samples are replaced with the Key features of the reference sample ($\tilde{K}_i^k = K_1^k$). This encourages all samples to attend to semantically consistent regions as defined by the reference, ensuring structural alignment.
- The Value features are adaptively interpolated between the original Value feature of the $i$-th sample ($V_i^k$) and the Value feature of the reference sample ($V_1^k$).
- $\alpha_i^k$ is an adaptive interpolation weight, computed from the cosine similarity between the reference's Value features and the $i$-th sample's Value features: $\alpha_i^k = \lambda \cdot \cos(V_1^k, V_i^k)$.
- $\lambda$ is a scaling coefficient (a hyperparameter, set to 0.85 in the experiments) that controls the strength of the style injection, and $\cos$ denotes the cosine similarity function. A higher cosine similarity means $V_1^k$ and $V_i^k$ are already similar, leading to a larger $\alpha_i^k$ and thus more weight on the sample's own features; if they are dissimilar, more weight goes to the reference's features, promoting stronger alignment. This similarity-guided interpolation allows smooth and proportional alignment of appearance and global style.
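A minimal sketch of this key replacement and similarity-guided value interpolation is shown below; the single scalar weight per sample and the (B, N, D) tensor layout are assumptions made for illustration and may differ from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def adaptive_style_injection(K, V, lam=0.85, ref_idx=0):
    """Sketch of Adaptive Style Injection inside one self-attention layer.

    K, V: (B, N, D) key/value features of the batch at the current scale.
    """
    K_ref = K[ref_idx:ref_idx + 1]                      # (1, N, D) reference keys
    V_ref = V[ref_idx:ref_idx + 1]                      # (1, N, D) reference values

    # Replace every sample's keys with the reference keys (structural alignment).
    K_new = K_ref.expand_as(K)

    # alpha_i = lam * cos(V_ref, V_i), computed here over flattened features;
    # handling of negative similarities is not specified here.
    cos = F.cosine_similarity(V.flatten(1),
                              V_ref.flatten(1).expand(V.size(0), -1), dim=-1)
    alpha = (lam * cos).view(-1, 1, 1)                  # (B, 1, 1)

    # Interpolate values between each sample and the reference.
    V_new = alpha * V + (1 - alpha) * V_ref
    return K_new, V_new, alpha
```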
4.2.2.2. Synchronized Guidance Adaptation (SGA)
Adaptive Style Injection (ASI) applies modifications only to the conditional branch of the generation process (the path guided by the text prompt). However, generative models often use Classifier-Free Guidance (CFG) to enhance prompt fidelity. CFG works by combining outputs from a conditional branch (guided by the prompt) and an unconditional branch (guided by a null or empty prompt). Applying ASI only to the conditional branch can disrupt the delicate balance between these two branches, potentially degrading prompt fidelity.
To resolve this, Synchronized Guidance Adaptation (SGA) applies the same operations to the unconditional branch using the identical interpolation weights ($\alpha_i^k$) computed from the conditional path. This ensures that the guidance applied for consistency is synchronized across both branches, preserving the intended effect of CFG.
For the unconditional branch, the Key and Value features are modified as:

$ \tilde{K}_i^{k,\mathrm{uc}} = K_1^{k,\mathrm{uc}}, \qquad \tilde{V}_i^{k,\mathrm{uc}} = \alpha_i^k \, V_i^{k,\mathrm{uc}} + \big(1 - \alpha_i^k\big) \, V_1^{k,\mathrm{uc}}, $

where:

- $\tilde{K}_i^{k,\mathrm{uc}}$ and $\tilde{V}_i^{k,\mathrm{uc}}$ are the modified Key and Value features for the $i$-th sample in the unconditional branch at step $k$.
- $K_1^{k,\mathrm{uc}}$ and $V_1^{k,\mathrm{uc}}$ denote the Key and Value features of the reference sample in the unconditional branch.
- $V_i^{k,\mathrm{uc}}$ is the original Value feature of the $i$-th sample in the unconditional branch at step $k$.
- Crucially, $\alpha_i^k$ is the same adaptive weight shared from the conditional branch (calculated in ASI). This synchronization ensures that any consistency-inducing modification of the conditional path is mirrored in the unconditional path, maintaining the relative scaling and direction of the CFG mechanism.

By combining IPR, ASI, and SGA, Infinite-Story achieves consistent identity and style across generated images while maintaining high prompt fidelity, all during the inference phase and without any additional training.
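To make the synchronization concrete, here is a hedged, self-contained sketch that applies the ASI-style key/value modification to both CFG branches with a single shared weight; the per-sample scalar weight, tensor layout, and function name are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def unified_attention_guidance(K_c, V_c, K_u, V_u, lam=0.85, ref_idx=0):
    """Sketch of ASI + SGA across the two CFG branches.

    K_c, V_c: conditional-branch key/value features, shape (B, N, D)
    K_u, V_u: unconditional-branch key/value features, shape (B, N, D)
    """
    # Adaptive weight alpha_i = lam * cos(V_ref, V_i), computed once on the
    # conditional branch (Adaptive Style Injection).
    V_ref_c = V_c[ref_idx:ref_idx + 1]
    cos = F.cosine_similarity(V_c.flatten(1),
                              V_ref_c.flatten(1).expand(V_c.size(0), -1), dim=-1)
    alpha = (lam * cos).view(-1, 1, 1)

    def inject(K, V):
        # Key replacement plus value interpolation toward the reference sample.
        K_ref, V_ref = K[ref_idx:ref_idx + 1], V[ref_idx:ref_idx + 1]
        return K_ref.expand_as(K), alpha * V + (1 - alpha) * V_ref

    # Synchronized Guidance Adaptation: the *same* alpha is applied to both
    # the conditional and the unconditional branch.
    return inject(K_c, V_c), inject(K_u, V_u)
```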
5. Experimental Setup
5.1. Datasets
The paper does not use a specific image dataset for training or fine-tuning, as Infinite-Story is a training-free framework built upon a pre-trained scale-wise autoregressive model. Instead, the evaluation focuses on a benchmark dataset of text prompts designed for consistent T2I generation.
- ConsiStory+ Benchmark: The evaluation follows the protocol proposed in 1Prompt1Story (Liu et al. 2025), which extends the original ConsiStory benchmark (Tewel et al. 2024).
  - Characteristics: ConsiStory+ is effective for validating the method's performance because it specifically targets the core problem Infinite-Story addresses: evaluating both prompt alignment (how well images match their text descriptions) and the consistency of subject identity and style over a diverse set of prompts, which is crucial for storytelling scenarios.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate prompt fidelity, identity consistency, and style consistency.
5.2.1. Prompt Fidelity (CLIP-T)
- Conceptual Definition: CLIP-T measures how well a generated image aligns semantically with its corresponding text prompt, i.e., the degree to which the visual content of the image matches the textual description.
- Mathematical Formula: The CLIP-T score is based on the cosine similarity between the CLIP image embedding and the CLIP text embedding: $ \mathrm{CLIP\text{-}T} = \mathrm{cosine\_similarity}\big(E_I(I_{gen}), E_T(t_{prompt})\big) $, where the cosine similarity is scaled by a factor of 2.5 and the prefix "A photo depicts" is prepended to each prompt, following the 1Prompt1Story protocol.
- Symbol Explanation:
  - $I_{gen}$: a generated image.
  - $t_{prompt}$: the text prompt corresponding to $I_{gen}$.
  - $E_I$: the CLIP image encoder (CLIP ViT-B/32).
  - $E_T$: the CLIP text encoder (CLIP ViT-B/32).
  - $\mathrm{cosine\_similarity}(\cdot,\cdot)$: the cosine of the angle between two vectors, indicating their semantic similarity.
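For illustration only (this is not the authors' evaluation script), a CLIP-T-style score could be computed with the public Hugging Face CLIP ViT-B/32 checkpoint roughly as follows, applying the 2.5 scaling and the "A photo depicts" prefix described above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t(image: Image.Image, prompt: str) -> float:
    # Prefix and 2.5 scaling follow the 1Prompt1Story protocol described above.
    inputs = processor(text=["A photo depicts " + prompt], images=image,
                       return_tensors="pt", padding=True)
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    sim = torch.nn.functional.cosine_similarity(img_feat, txt_feat).item()
    return 2.5 * sim
```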
5.2.2. Identity Consistency (CLIP-I and DreamSim)
To measure identity consistency, two metrics are used:
5.2.2.1. CLIP-I
- Conceptual Definition: CLIP-I measures the average pairwise visual similarity between images generated from the same identity prompt, quantifying how consistently a subject's appearance is maintained across different generated scenes. To ensure it reflects only the subject's identity, backgrounds are removed.
- Mathematical Formula: $ \mathrm{CLIP\text{-}I} = \frac{1}{M(M-1)/2} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \mathrm{cosine\_similarity}\big(E_I(I_i^{no\_bg}), E_I(I_j^{no\_bg})\big) $, where the average is taken over all unique pairs of images generated for the same identity.
- Symbol Explanation:
  - $M$: the number of images generated for a single identity prompt.
  - $I_i^{no\_bg}$: the $i$-th generated image for a given identity, with its background removed using CarveKit.
  - $E_I$: the CLIP image encoder (ViT-B/16).
  - $\mathrm{cosine\_similarity}(\cdot,\cdot)$: the cosine similarity function.
5.2.2.2. DreamSim
- Conceptual Definition: DreamSim is a perceptual similarity metric designed to correlate well with human judgments of visual similarity; it measures the perceptual distance (dissimilarity) between images. For evaluation, it is converted to a similarity measure, $1 - \mathrm{DreamSim}$. As with CLIP-I, backgrounds are removed to focus on subject identity.
- Mathematical Formula: DreamSim outputs a distance, so for consistency evaluation it is converted to a similarity: $ \mathrm{DreamSim_{similarity}} = 1 - \mathrm{DreamSim_{distance}}(I_i^{no\_bg}, I_j^{no\_bg}) $. The average pairwise similarity is then computed: $ \frac{1}{M(M-1)/2} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \big(1 - \mathrm{DreamSim_{distance}}(I_i^{no\_bg}, I_j^{no\_bg})\big) $.
- Symbol Explanation:
  - $M$: the number of images generated for a single identity prompt.
  - $I_i^{no\_bg}$: the $i$-th generated image for a given identity, with its background removed.
  - $\mathrm{DreamSim_{distance}}(\cdot,\cdot)$: the DreamSim perceptual distance between two images.
5.2.3. Style Consistency (DINO)
- Conceptual Definition: DINO similarity (computed from the CLS token of a DINO ViT-B/8 model) captures global visual similarity, including rendering, background, and texture. It is used to assess how consistent the overall artistic or visual style is among images conditioned on the same identity prompt.
- Mathematical Formula: Similar to CLIP-I, DINO similarity is the average pairwise cosine similarity between DINO embeddings of images for the same identity: $ \mathrm{DINO} = \frac{1}{M(M-1)/2} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \mathrm{cosine\_similarity}\big(E_D(I_i), E_D(I_j)\big) $.
- Symbol Explanation:
  - $M$: the number of images generated for a single identity prompt.
  - $I_i$: the $i$-th generated image for a given identity (full image, no background removal).
  - $E_D$: the DINO image encoder (DINO ViT-B/8), specifically its CLS-token embedding.
  - $\mathrm{cosine\_similarity}(\cdot,\cdot)$: the cosine similarity function.
5.2.4. Harmonic Score ($S_H$)
- Conceptual Definition: The harmonic score is an aggregate metric that combines CLIP-T, CLIP-I, DreamSim (converted to a similarity), and DINO using the harmonic mean. The harmonic mean is sensitive to low values, so poor performance on any single metric significantly lowers the overall score, giving a balanced view of prompt fidelity and of identity and style consistency.
- Mathematical Formula: $ S_H = \mathrm{HM}\big(\mathrm{CLIP\text{-}T},\ \mathrm{CLIP\text{-}I},\ 1 - \mathrm{DreamSim},\ \mathrm{DINO}\big) $, where $\mathrm{HM}$ denotes the harmonic mean. For positive numbers $x_1, \ldots, x_k$, $ \mathrm{HM}(x_1, x_2, \ldots, x_k) = \frac{k}{\frac{1}{x_1} + \frac{1}{x_2} + \ldots + \frac{1}{x_k}} $.
- Symbol Explanation:
  - $\mathrm{CLIP\text{-}T}$: the CLIP-T score (prompt fidelity).
  - $\mathrm{CLIP\text{-}I}$: the CLIP-I score (identity consistency).
  - $1 - \mathrm{DreamSim}$: the DreamSim distance converted to a similarity score (identity consistency).
  - $\mathrm{DINO}$: the DINO similarity score (style consistency).
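A minimal sketch of how the pairwise consistency scores and the harmonic score could be assembled from pre-extracted embeddings (not the authors' evaluation code):

```python
import torch
import torch.nn.functional as F

def avg_pairwise_cosine(embs: torch.Tensor) -> float:
    """Average pairwise cosine similarity over M embeddings of shape (M, D),
    as used for CLIP-I / DINO-style consistency scores."""
    e = F.normalize(embs, dim=-1)
    sim = e @ e.T                                    # (M, M) cosine-similarity matrix
    M = e.size(0)
    iu = torch.triu_indices(M, M, offset=1)          # indices of unique pairs only
    return sim[iu[0], iu[1]].mean().item()

def harmonic_score(clip_t, clip_i, dreamsim_dist, dino):
    """Harmonic mean of CLIP-T, CLIP-I, (1 - DreamSim distance), and DINO."""
    vals = [clip_t, clip_i, 1.0 - dreamsim_dist, dino]
    return len(vals) / sum(1.0 / v for v in vals)
```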
5.2.5. Implementation Details
- The evaluation scripts from 1Prompt1Story (Liu et al. 2025) are adapted, with the addition of the DINO metric.
- All metrics are computed on a single A6000 GPU using PyTorch.
- Background removal using CarveKit (Selin 2023) is applied consistently for the identity-based metrics (CLIP-I and DreamSim) to isolate subject appearance.
- All features are extracted following the standard preprocessing pipelines provided by each model (CLIP, DINO).
5.3. Baselines
The paper compares Infinite-Story against a variety of state-of-the-art consistent text-to-image generation models, including both training-based and training-free approaches, leveraging different underlying generative backbones (primarily Stable Diffusion XL and Infinity).
5.3.1. General Baselines
- Vanilla SDXL (Podell et al. 2023): A standard diffusion model without any consistency enhancements. It serves as a general baseline for image quality and illustrates the inherent inconsistency of a standalone T2I model.
- Vanilla Infinity (Han et al. 2024): The base scale-wise autoregressive model that Infinite-Story is built upon, also without consistency enhancements. It highlights the improvements brought by Infinite-Story's techniques.
5.3.2. Image-Based Consistent Text-to-Image Models (Diffusion-based, often requiring external reference images)
These models typically use Stable Diffusion XL as their backbone and are provided with a reference image to guide consistency.
- IP-Adapter (Ye et al. 2023): Adapts text prompts with image prompts using a separate adapter module, allowing identity preservation. (Training-based, using official code.)
- PhotoMaker (Li et al. 2024): Customizes realistic human photos by stacking ID embeddings. (Training-based, using official code.)
- The Chosen One (Avrahami et al. 2024): A diffusion-based method for consistent characters in text-to-image generation. (Training-based, using unofficial code.)
- OneActor (Wang et al. 2024): Focuses on consistent subject generation via cluster-conditioned guidance. (Training-based, using the official repository.)
- StoryDiffusion (Zhou et al. 2024b): A training-free method using consistent self-attention for long-range image and video generation. (Training-free, using the official repository.)

For these image-based models, a reference image is generated by providing only the identity portion of the full prompt to their respective base models (e.g., for "A graceful unicorn galloping through a flower field," the reference is generated from "A graceful unicorn"). This reference image is then used consistently across all prompts in the same sequence.
5.3.3. Non-Reference Consistent Text-to-Image Models (Diffusion-based, often training-free)
These models aim to achieve consistency without an explicit external reference image, often through internal mechanisms.
- ConsiStory (Tewel et al. 2024): A training-free consistent text-to-image generation method. (Training-free, using official code.)
- 1Prompt1Story (Liu et al. 2025): A training-free method using a single-prompt structure for consistent T2I generation. (Training-free, using the official repository.)
5.3.4. Inference Settings
- For all methods, the DDIM sampling settings provided in their open-source implementations are adopted.
- The number of DDIM sampling steps is fixed to 50 for all models, including The Chosen One (unofficial implementation), to ensure consistency in comparisons.

These baselines are representative, covering a wide range of state-of-the-art approaches to consistent T2I generation, including both training-based and training-free methods and models that focus on identity, style, or both. Comparing against both Vanilla SDXL and Vanilla Infinity clearly isolates the contribution of Infinite-Story's consistency techniques.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that Infinite-Story achieves state-of-the-art performance in consistent text-to-image generation, particularly excelling in balancing identity consistency, style consistency, and prompt fidelity while offering significantly faster inference.
The following are the results from Table 1 of the original paper:
| Method | Train-Free | S_H ↑ | DINO ↑ | CLIP-T ↑ | CLIP-I ↑ | DreamSim ↓ | Inference Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| Vanilla SDXL (Podell et al. 2023) | - | 0.7408 | 0.6067 | 0.9074 | 0.8793 | 0.3385 | 10.27 |
| Vanilla Infinity (Han et al. 2024) | - | 0.7891 | 0.6965 | 0.8836 | 0.8955 | 0.2780 | 1.71 |
| IP-Adapter (Ye et al. 2023) | X | 0.8323 | 0.7834 | 0.8661 | 0.9243 | 0.2266 | 10.40 |
| PhotoMaker (Li et al. 2024) | X | 0.7223 | 0.6516 | 0.8651 | 0.8465 | 0.3996 | 19.52 |
| The Chosen One (Avrahami et al. 2024) | X | 0.6494 | 0.5824 | 0.8162 | 0.7943 | 0.4893 | 13.47 |
| OneActor (Wang et al. 2024) | X | 0.8088 | 0.7172 | 0.8859 | 0.9070 | 0.2423 | 24.94 |
| ConsiStory (Tewel et al. 2024) | ✓ | 0.7902 | 0.6895 | 0.9019 | 0.8954 | 0.2787 | 37.76 |
| StoryDiffusion (Zhou et al. 2024b) | ✓ | 0.7634 | 0.6783 | 0.8403 | 0.8917 | 0.3212 | 23.68 |
| 1Prompt1Story (Liu et al. 2025) | ✓ | 0.8395 | 0.7687 | 0.8942 | 0.9117 | 0.1993 | 22.57 |
| Ours | ✓ | 0.8538 | 0.8089 | 0.8732 | 0.9267 | 0.1834 | 1.72 |
Key observations:
- Overall Performance ($S_H$): Infinite-Story (Ours) achieves the highest harmonic score ($S_H$) at 0.8538, indicating the best balance across all core metrics (prompt fidelity, identity consistency, and style consistency). This is higher than the next best (1Prompt1Story at 0.8395) and significantly better than the vanilla models.
- Identity Consistency (CLIP-I and DreamSim): Our method achieves the highest CLIP-I score (0.9267) and the lowest DreamSim score (0.1834), both indicating superior identity consistency. This validates the effectiveness of Identity Prompt Replacement and Adaptive Style Injection in maintaining subject appearance.
- Style Consistency (DINO): Infinite-Story also obtains the highest DINO similarity (0.8089), demonstrating its ability to maintain a consistent global visual style across generated images. This highlights the success of Adaptive Style Injection.
- Prompt Fidelity (CLIP-T): While CLIP-T (0.8732) is not the absolute highest (Vanilla SDXL and ConsiStory score slightly higher), it remains very competitive, especially given the strong consistency achieved. This indicates that Synchronized Guidance Adaptation successfully preserves prompt fidelity despite the strong consistency injections.
- Inference Time: This is where Infinite-Story shows a stark advantage. At 1.72 seconds per image, it is dramatically faster than all other consistent T2I models. The next fastest, Vanilla Infinity (1.71 s), achieves similar speed but significantly lower consistency ($S_H$ 0.7891 vs. 0.8538). Compared to other training-free methods such as 1Prompt1Story (22.57 s) or ConsiStory (37.76 s), Infinite-Story is over 13 times faster, making it practical for real-time applications; even compared to IP-Adapter (10.40 s), it is over 6 times faster.
- Training-Free Advantage: The table clearly shows that Infinite-Story achieves top-tier performance while being training-free, distinguishing it from strong training-based contenders like IP-Adapter and OneActor that may achieve good CLIP-I but are much slower and require training.

These results strongly validate the effectiveness of Infinite-Story by demonstrating its superior balance across all evaluation criteria, especially its combination of state-of-the-art consistency and fidelity with dramatically lower inference cost.
The following figure (Figure 3 from the original paper) provides a visual comparison of inference time and harmonic score, further emphasizing the efficiency of Infinite-Story.

This figure is a scatter plot showing the relationship between inference time and harmonic score for the different models. Our model (labeled 'Ours') reaches an inference time of 1.72 seconds and outperforms the other models, reflecting its efficiency and consistency.
Figure 3: Comparison of inference time and harmonic score $S_H$ between our method and state-of-the-art identity-consistent text-to-image generation models.
The qualitative results shown in Figure 6 also corroborate the quantitative findings:

This figure is a comparison grid showing images generated by several models from a shared identity prompt combined with different expression prompts. Each row presents a series of images, with characters in various styles on the left and finely detailed illustrations of small animals on the right, illustrating how the models differ in consistent text-to-image generation.
Figure 6: Qualitative comparison with state-of-the-art consistent T2I generation models. Each row depicts a set of images generated using a shared identity prompt combined with varying expression prompts.
In the qualitative comparison, Infinite-Story successfully generates image sequences (e.g., the elf character and the watercolor hedgehog) that:
- Maintain Subject Identity: The elf's facial structure and the hedgehog's distinct features remain consistent across various scenes and actions.
- Preserve a Unified Visual Style: The overall artistic style, rendering, and lighting remain coherent throughout the sequence.
- Reflect Prompt-Specific Nuances: The images accurately depict the different expressions and contexts described in the varying prompts (e.g., the elf guarding, looking at a map, or on a boat).

This contrasts with other methods, where:

- IP-Adapter preserves identity but often fails to reflect prompt nuances.
- OneActor and 1Prompt1Story capture expression well but show style shifts.
- StoryDiffusion and ConsiStory show style consistency but struggle with identity.
- PhotoMaker and The Chosen One underperform in multiple aspects.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Ablation Study on Proposed Components
An ablation study was conducted to evaluate the individual contributions of Identity Prompt Replacement (IPR), Adaptive Style Injection (ASI), and Synchronized Guidance Adaptation (SGA).
The following are the results from Table 2 of the original paper:
| # | IPR | ASI | SGA | S_H ↑ | DINO ↑ | CLIP-T ↑ | CLIP-I ↑ | DreamSim ↓ |
|---|---|---|---|---|---|---|---|---|
| (a) |  |  |  | 0.7891 | 0.6965 | 0.8836 | 0.8955 | 0.2780 |
| (b) | ✓ |  |  | 0.8013 | 0.7119 | 0.8814 | 0.9046 | 0.2569 |
| (c) | ✓ | ✓ |  | 0.8481 | 0.8082 | 0.8625 | 0.9242 | 0.1931 |
| (d) | ✓ | ✓ | ✓ | 0.8538 | 0.8089 | 0.8732 | 0.9267 | 0.1834 |
Analysis:
- (a) Baseline (Vanilla Infinity): The base Infinity model without any of the proposed consistency techniques; it serves as the starting point.
- (b) IPR Only: Adding Identity Prompt Replacement (IPR) improves CLIP-I (0.8955 to 0.9046) and DreamSim (0.2780 to 0.2569), confirming that IPR effectively mitigates text-encoder context bias and aligns identity-related attributes, leading to better identity consistency. DINO also improves (0.6965 to 0.7119) while CLIP-T remains essentially unchanged (0.8836 to 0.8814), yielding a higher $S_H$ (0.7891 to 0.8013).
- (c) IPR + ASI: Incorporating Adaptive Style Injection (ASI) alongside IPR leads to significant gains in DINO (0.7119 to 0.8082), indicating much improved global style consistency. CLIP-I (0.9046 to 0.9242) and DreamSim (0.2569 to 0.1931) also improve further, demonstrating enhanced appearance-level identity alignment, and $S_H$ jumps considerably (0.8013 to 0.8481). However, CLIP-T drops (0.8814 to 0.8625), suggesting that injecting strong style guidance without synchronization can compromise prompt fidelity.
- (d) Full Method (IPR + ASI + SGA): Adding Synchronized Guidance Adaptation (SGA) on top of IPR and ASI restores the balance. CLIP-T recovers substantially (0.8625 to 0.8732), indicating that SGA maintains prompt fidelity by synchronizing the modifications across the conditional and unconditional branches of CFG. The overall harmonic score ($S_H$) reaches its peak at 0.8538, and CLIP-I and DreamSim show minor additional improvements.

This quantitative analysis confirms that each component contributes positively to overall consistency and fidelity, with SGA playing a crucial role in mitigating the trade-off between consistency and prompt fidelity.
The following figure (Figure 7 from the original paper) provides a qualitative analysis of the ablation study, visually reinforcing the quantitative findings:

This figure is a chart comparing generation results under different configurations, where (a)-(d) correspond to the configurations in Table 2. It shows a water lily and a red fox in different scenes, highlighting the visual effects of style and identity consistency.
Figure 7: Qualitative analysis of ablation study. The results from (a)-(d) correspond to the configurations in Table 2.
Qualitative Analysis of Ablation Study:
- (a) Without any proposed methods: Both the flower (species, rendering style) and the red fox (fur texture, facial shape) show severe identity and style inconsistency.
- (b) With IPR only: IPR helps unify the identity-related attributes. The lily maintains a more consistent floral structure, and the red fox shows more stable facial features and body proportions. However, global style elements (lighting, rendering) still vary.
- (c) With IPR + ASI: ASI dramatically improves global style and appearance-level identity consistency. The flower has stable coloration and stroke patterns, and the fox maintains consistent shading and background textures. However, some prompt-specific semantics are underemphasized, and visual artifacts (unnatural outlines, distorted textures) appear, likely because strong style injection overrides localized details.
- (d) Full method (IPR + ASI + SGA): SGA restores the balance and ensures prompt fidelity, producing visually coherent outputs that maintain consistent subject appearance and unified stylistic rendering while accurately reflecting prompt-specific variations (posture, context, lighting).
6.2.2. Ablation Study on the Adaptive Style Injection Scaling Coefficient ($\lambda$)
An additional ablation study analyzes the effect of the scaling coefficient $\lambda$ in Adaptive Style Injection (ASI).
The following are the results from Table 5 of the original paper:
| Parameter | S_H ↑ | DINO ↑ | CLIP-T ↑ | CLIP-I ↑ | DreamSim ↓ |
|---|---|---|---|---|---|
| λ = 0.6 | 0.8420 | 0.7967 | 0.8745 | 0.9209 | 0.1998 |
| λ = 0.7 | 0.8473 | 0.7865 | 0.8737 | 0.9227 | 0.1919 |
| λ = 0.8 | 0.8506 | 0.8058 | 0.8735 | 0.9245 | 0.1904 |
| λ = 0.85 (Ours) | 0.8538 | 0.8089 | 0.8732 | 0.9267 | 0.1834 |
| λ = 0.9 | 0.8538 | 0.8102 | 0.8722 | 0.9251 | 0.1826 |
Analysis:
- As $\lambda$ increases, DINO, CLIP-I, and DreamSim generally improve, indicating that a stronger style injection yields better identity and style consistency. The best DINO (0.8102) and DreamSim (0.1826) scores are achieved at $\lambda = 0.9$.
- However, CLIP-T (prompt fidelity) tends to degrade as $\lambda$ increases: it is highest at $\lambda = 0.6$ (0.8745) and decreases to 0.8722 at $\lambda = 0.9$. This confirms the trade-off: too strong a style injection can override prompt-specific details.
- The chosen default of $\lambda = 0.85$ is a compromise. It achieves the highest harmonic score ($S_H$) of 0.8538 (tied with $\lambda = 0.9$) while offering slightly better CLIP-T (0.8732) than $\lambda = 0.9$ (0.8722) and maintaining very competitive consistency metrics, demonstrating that $\lambda = 0.85$ strikes an optimal balance.
6.2.3. Generality of the Method
The paper also evaluated the generality of Infinite-Story by applying its techniques to other scale-wise autoregressive T2I models: Switti (Voronov et al. 2024) and HART (Tang et al. 2024).
The following are the results from Table 4 of the original paper:
| Method | S_H ↑ | DINO ↑ | CLIP-T ↑ | CLIP-I ↑ | DreamSim ↓ |
|---|---|---|---|---|---|
| Vanilla Switti (Voronov et al. 2024) | 0.7719 | 0.6595 | 0.8904 | 0.8871 | 0.2934 |
| Switti + Ours | 0.8146 | 0.7441 | 0.8756 | 0.9018 | 0.2398 |
| Vanilla HART (Tang et al. 2024) | 0.7434 | 0.6381 | 0.8848 | 0.8714 | 0.3488 |
| HART + Ours | 0.7894 | 0.7048 | 0.8505 | 0.8982 | 0.2945 |
Analysis:
Applying Infinite-Story's techniques (IPR, ASI, SGA) to Switti and HART consistently shows clear performance improvements across DINO, CLIP-I, and DreamSim. This indicates that the proposed techniques are not specific to the Infinity model but can generalize to other architectures within the scale-wise autoregressive family, enhancing their consistency capabilities. The harmonic score (S_H) also improves for both Switti and HART when integrated with Infinite-Story's methods.
6.3. User Study
A user study involving 55 participants was conducted to complement the quantitative evaluation, focusing on human perception of identity consistency, style consistency, and prompt fidelity.
The following are the results from Table 3 of the original paper:
| Method | Identity ↑ | Style ↑ | Prompt ↑ |
|---|---|---|---|
| 1Prompt1Story (Liu et al. 2025) | 18.0% | 13.2% | 28.2% |
| OneActor (Wang et al. 2024) | 7.2% | 7.2% | 10.6% |
| IP-Adapter (Ye et al. 2023) | 16.4% | 29.6% | 4.7% |
| Ours | 58.4% | 50.0% | 56.5% |
Analysis:
The results show that Infinite-Story (Ours) is significantly preferred by users across all three criteria:
- Identity Consistency: 58.4% preference for Infinite-Story, vastly outperforming the others (e.g., 1Prompt1Story at 18.0%).
- Style Consistency: 50.0% preference for Infinite-Story, again leading by a large margin (e.g., IP-Adapter at 29.6%).
- Prompt Fidelity: 56.5% preference for Infinite-Story, demonstrating strong human-perceived alignment with the text descriptions (e.g., 1Prompt1Story at 28.2%).

These results strongly indicate that Infinite-Story achieves superior human-perceived consistency and prompt fidelity compared to leading baselines, further validating its effectiveness and practical value.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Infinite-Story, a novel training-free framework for generating consistent text-to-image sequences tailored for multi-prompt storytelling scenarios. By building upon a scale-wise autoregressive model (Infinity), the method effectively addresses the dual challenges of identity inconsistency and style inconsistency. The core of Infinite-Story lies in three lightweight yet effective techniques: Identity Prompt Replacement (IPR), which unifies identity attributes by mitigating text encoder context bias; and a unified attention guidance mechanism combining Adaptive Style Injection (ASI) and Synchronized Guidance Adaptation (SGA), which jointly enforce appearance-level identity and global style consistency while maintaining prompt fidelity. Extensive experiments, including quantitative comparisons against state-of-the-art models and a user study, demonstrate that Infinite-Story achieves superior identity and style consistency and prompt fidelity. Crucially, it offers over 6 times faster inference (1.72 seconds per image) than the fastest existing consistent T2I models, making it highly practical for real-time and interactive applications like visual storytelling.
7.2. Limitations & Future Work
The authors acknowledge one primary limitation:
- Sensitivity to Anchor Selection: Infinite-Story relies on a single reference image (anchor) within each batch to propagate identity and style features. While this enables efficient, training-free inference, it introduces sensitivity to anchor selection: if the anchor image is of low quality or stylistically off-target, the degradation can propagate to the entire batch of generated images. Since the method does not alter the generation capabilities of the underlying Infinity model, its success is inherently tied to the quality of this initial output.

Based on this limitation, the authors suggest the following future work:

- Adaptive Reference Selection Strategies: Developing mechanisms to intelligently select or correct the anchor image could mitigate the propagation of low-quality or stylistically incorrect references.
- Temporal Consistency in Video Generation: Extending the method to support temporal consistency in video generation, adapting the current framework, which focuses on image sequences, to handle the additional time dimension for coherent video outputs.
7.3. Personal Insights & Critique
This paper presents a highly practical and impactful contribution to the field of generative AI. The commitment to a training-free and fast solution is particularly commendable, as it directly addresses major bottlenecks preventing wider adoption of consistent T2I generation in real-world interactive applications. The choice of a scale-wise autoregressive model as the backbone, rather than the dominant but slower diffusion models, is a clever architectural decision that underpins the impressive inference speed.
The three proposed techniques—IPR, ASI, and SGA—are elegantly designed to target specific aspects of inconsistency (context bias, appearance, style, and prompt fidelity) with minimal computational overhead. The Adaptive Style Injection with its similarity-guided interpolation weight is particularly intuitive, allowing for a nuanced control over how much the reference dictates the style. Synchronized Guidance Adaptation is a critical component, showcasing a deep understanding of Classifier-Free Guidance to prevent a common pitfall of sacrificing prompt fidelity for consistency.
Potential issues or areas for improvement:
- Anchor Image Quality Dependence: While acknowledged as a limitation, the dependence on a single anchor image's quality is a significant practical hurdle. If a user provides a poor or ambiguous initial prompt for the reference, the subsequent generations will inherit those flaws. Future work could explore:
- Reference Image Refinement: A self-correction mechanism that iteratively refines the anchor image based on some quality metrics or user feedback.
- Multiple Reference Fusion: Allowing multiple reference images to be provided, and then fusing their identity/style information to create a more robust "canonical" reference.
- Semantic Reference Selection: Automatically identifying the most "representative" image from a generated batch to serve as the anchor.
- Fine-Grained Control over Consistency: The current Adaptive Style Injection uses a single global $\lambda$ for scaling. While effective, more fine-grained control might be beneficial; for instance, allowing users to specify different $\lambda$ values for specific visual elements (e.g., character appearance vs. background style) could offer greater artistic control.
- Complex Scene Consistency: The examples are compelling, but visual storytelling can involve highly complex scenes with multiple interacting characters and intricate backgrounds. How Infinite-Story scales to maintaining consistency across many distinct subjects in a crowded scene, or how it handles occlusion, would be an interesting area to explore. The current framework appears primarily focused on a single main subject.
- Implicit Bias Inheritance: While IPR addresses context bias in text encoders, the underlying Infinity model will still carry biases from its training data. If the base model is biased (e.g., consistently generating a certain appearance for a generic character description), Infinite-Story might propagate, and even reinforce, these biases by making them more consistent. Further research could investigate integrating bias-mitigation techniques into the training-free consistency framework.
Transferability and Applicability: The methods presented in this paper are highly transferable.
- Visual Storyboarding & Comic Creation: Directly applicable for artists and designers to rapidly prototype visual narratives, maintaining consistent characters and styles across panels.
- Animated Content Pre-production: Could be used for concept-art generation or character-design iteration in animation pipelines.
- Personalized Content Generation: For marketing, advertising, or educational content, generating sequences featuring a consistent brand character or mascot.
- Image Editing & Style Transfer: The principles of Adaptive Style Injection and reference-based feature propagation could be adapted for consistent style transfer across image collections or for propagating edits consistently within a series of images.
- Other Generative Tasks: The idea of training-free attention guidance and prompt-embedding manipulation could inspire similar consistency-enhancing techniques for other generative tasks beyond images, such as 3D model generation or audio synthesis, where maintaining coherence across varied outputs is crucial.

Overall, Infinite-Story is an impressive piece of work that pushes the boundaries of practical and efficient consistent text-to-image generation, offering a compelling solution to the growing demand for coherent visual storytelling tools.