
Infinite-Story: A Training-Free Consistent Text-to-Image Generation

Published: 11/17/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Infinite-Story is a training-free framework for consistent text-to-image generation in multi-prompt scenarios, addressing identity and style inconsistencies. With its key techniques, it achieves state-of-the-art performance and runs over 6X faster at inference than existing consistent T2I models.

Abstract

We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Infinite-Story: A Training-Free Consistent Text-to-Image Generation". This title indicates the paper introduces a novel framework for generating a sequence of images from text prompts that maintain high consistency in identity and style, without requiring any model training or fine-tuning, focusing on visual storytelling scenarios.

1.2. Authors

The authors are Jihun Park, Kyoungmin Lee, Jongmin Gim, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, and Sunghoon Im (corresponding author, denoted by †*). All authors are affiliated with DGIST, South Korea. Their research backgrounds appear to be in computer vision, machine learning, and generative AI, specifically focusing on text-to-image generation and consistency in visual outputs.

1.3. Journal/Conference

The paper does not explicitly state the journal or conference it was published in. However, given the nature of the research (text-to-image generation, computer vision, machine learning) and the typical publication venues for such work (e.g., CVPR, ICCV, NeurIPS, ICLR, ACM SIGGRAPH), it is likely intended for a highly reputable conference or journal in these fields. The inclusion of an arXiv preprint suggests it's awaiting or has been submitted to a peer-reviewed venue.

1.4. Publication Year

The preprint was published on November 17, 2025 (arXiv timestamp 2025-11-17).

1.5. Abstract

This paper introduces Infinite-Story, a training-free framework designed for consistent text-to-image (T2I) generation in multi-prompt storytelling. Built upon a scale-wise autoregressive model, it addresses identity inconsistency and style inconsistency. The framework employs three main techniques: Identity Prompt Replacement to mitigate context bias in text encoders and align identity attributes, and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation to enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based methods that require fine-tuning or are slow, Infinite-Story operates entirely at test time, offering high identity and style consistency across diverse prompts. Experimental results show state-of-the-art generation performance and over 6X faster inference (1.72 seconds per image) compared to existing consistent T2I models, making it practical for real-world visual storytelling.

Original Source Link: https://arxiv.org/abs/2511.13002 PDF Link: https://arxiv.org/pdf/2511.13002v1.pdf This is a preprint publication on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the lack of consistency in generated images from Text-to-Image (T2I) models, particularly in scenarios requiring coherence across multiple images, such as visual storytelling, comic strip generation, and character-driven content. While large-scale diffusion-based T2I models (e.g., Rombach et al. 2022, Ramesh et al. 2021) have achieved remarkable performance in generating high-quality individual images, they struggle to maintain a consistent subject identity (e.g., the same character looks different in different scenes) and visual style (e.g., varying art styles or lighting across a sequence) when generating multiple images from different prompts.

This problem is important because inconsistent outputs severely limit the usability of T2I models for practical applications that require visual narratives. For instance, creating a comic strip where the main character's appearance changes from panel to panel would disrupt the story's coherence. Prior research in consistent text-to-image generation has largely focused on identity consistency but often overlooks style consistency, which is equally crucial for visually coherent narratives.

Another significant challenge is the inference speed of existing consistent T2I models, especially those based on diffusion models. These models typically take over 10 seconds per image, which is too slow for interactive applications and can lead to user frustration, as per Nielsen's usability guidelines. While scale-wise autoregressive models have emerged as faster alternatives, they also face consistency issues.

The paper's entry point is to leverage the speed benefits of scale-wise autoregressive models while introducing a training-free framework to address both identity and style inconsistencies. The innovative idea is to achieve this consistency by judiciously manipulating prompt embeddings and attention mechanisms during inference, without the need for extensive fine-tuning or re-training the base generative model. This makes the solution highly efficient and practical.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • First Training-Free Scale-Wise Autoregressive Framework for Consistent T2I Generation: The paper introduces Infinite-Story, the first framework that achieves consistent text-to-image generation in a training-free manner using a scale-wise autoregressive model (specifically, based on the Infinity architecture). This is a significant departure from many prior diffusion-based methods that require fine-tuning or suffer from slow inference.

  • Identity Prompt Replacement (IPR) Technique: A novel technique is proposed that aligns identity attributes across prompts by unifying identity prompt embeddings. This mitigates context bias in text encoders, which often causes the same identity to appear differently based on surrounding descriptive words in a prompt.

  • Unified Attention Guidance Mechanism: This mechanism consists of two complementary parts:

    • Adaptive Style Injection (ASI): This technique enhances both identity appearance and global visual style consistency by injecting reference features into early-stage self-attention layers. It uses a similarity-guided adaptive operation to smoothly align appearance.
    • Synchronized Guidance Adaptation (SGA): This component ensures prompt fidelity by applying the same feature adaptation (derived from ASI) to both the conditional and unconditional branches of Classifier-Free Guidance (CFG). This maintains the balance necessary for accurately reflecting text prompts while preserving consistency.
  • State-of-the-Art Performance with High Efficiency: Infinite-Story achieves state-of-the-art generation quality in terms of both identity and style consistency, as demonstrated by quantitative metrics (highest DINO similarity, CLIP-I, and DreamSim scores) and qualitative results. Crucially, it offers significantly faster inference, approximately 1.72 seconds per image, which is over 6 times faster than the existing fastest diffusion-based consistent T2I models.

    These findings solve the problems of identity and style inconsistency in multi-prompt T2I generation and overcome the slow inference speed limitations of previous approaches. The training-free nature makes the method highly practical and adaptable, enabling real-time and interactive visual storytelling applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner needs to grasp several foundational concepts in generative AI and natural language processing:

  • Text-to-Image (T2I) Generation: This is the task of generating photorealistic or artistic images from natural language descriptions (text prompts). Models learn to map text semantics to visual features.
  • Generative Models: Algorithms that can create new data instances that resemble the training data. Examples include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models.
  • Diffusion Models: A class of generative models that learn to reverse a diffusion process. They iteratively denoise a random signal (e.g., Gaussian noise) into a coherent image, guided by a text prompt. They are known for high-quality image generation but can be computationally intensive and slow during inference.
  • Autoregressive Models: Models that generate data point by point, where each new point is conditioned on the previously generated points. In the context of images, this could mean generating pixels sequentially or in blocks. Traditional autoregressive models can be very slow for high-resolution images.
  • Scale-Wise Autoregressive Models: An evolution of autoregressive models that generate images by iteratively predicting features at increasing scales (e.g., from low resolution to high resolution). This next-scale prediction paradigm aims to improve efficiency compared to pixel-by-pixel autoregression or diffusion models, by processing information more coarsely at initial stages and refining it later. The paper uses the Infinity architecture (Han et al. 2024) as its backbone, which is a scale-wise autoregressive model.
  • Text Encoders: Neural networks (often Transformers) that convert text (words, sentences) into numerical representations called embeddings or contextual embeddings. These embeddings capture the semantic meaning of the text and are used to condition image generation models. The paper mentions Flan-T5 (Chung et al. 2024) as the text encoder, which is a powerful pre-trained language model.
  • Embeddings: Numerical vectors that represent words, sentences, or images in a continuous vector space. Similar items have embeddings that are close to each other in this space.
  • Transformer Architecture: A neural network architecture (Vaswani et al. 2017) that relies heavily on self-attention mechanisms. It has become dominant in natural language processing and is increasingly used in computer vision.
  • Self-Attention Mechanism: A core component of Transformers. It allows the model to weigh the importance of different parts of the input sequence (or image features) when processing each part. For a given query, it computes an attention score against all keys and uses these scores to create a weighted sum of values: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the dimension of the keys. A minimal code sketch of this computation, together with classifier-free guidance, follows this list.
  • Classifier-Free Guidance (CFG): A technique used in conditional generative models (like T2I models) to improve the alignment of generated samples with the conditioning input (e.g., a text prompt). It works by performing two parallel generation processes: one conditioned on the prompt and one unconditional (or conditioned on an empty prompt). The output is then extrapolated in the direction of the conditional generation, effectively amplifying the influence of the prompt. This helps to achieve better prompt fidelity.
  • Fine-tuning: The process of taking a pre-trained model and further training it on a smaller, specific dataset for a particular task. This usually involves adjusting the model's weights. Training-free methods, like Infinite-Story, avoid this step, making them faster to deploy.
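
A minimal PyTorch sketch of the two mechanisms referenced above: scaled dot-product attention and the classifier-free guidance combination. The tensor shapes and the guidance scale `w` below are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (..., L_q, L_k)
    return F.softmax(scores, dim=-1) @ V             # (..., L_q, d_v)

def classifier_free_guidance(out_cond, out_uncond, w=4.0):
    # Extrapolate from the unconditional output toward the conditional one;
    # w > 1 amplifies the influence of the text prompt.
    return out_uncond + w * (out_cond - out_uncond)

# Toy usage with random features (batch=2, 16 tokens, dim=64).
x = torch.randn(2, 16, 64)
attended = self_attention(x, x, x)
guided = classifier_free_guidance(torch.randn(2, 10), torch.randn(2, 10))
```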

3.2. Previous Works

The paper categorizes related work into Text-to-Image generation, Personalized image generation, and Consistent text-to-image generation.

3.2.1. Text-to-Image Generation

  • Early T2I Models: These include GAN-based models (Kang et al. 2023) and autoregressive (AR)-based models (Chang et al. 2023, Han et al. 2024, Tang et al. 2024). AR models evolved from next-token prediction (Van Den Oord, Vinyals et al. 2017, Esser, Rombach, and Ommer 2021) to masked token generation (Chang et al. 2022, 2023) and then to next-scale prediction (Tian et al. 2024), the latter significantly improving efficiency.
  • Diffusion Models: Currently dominant, offering strong synthesis quality. Key examples include Ramesh et al. 2021 (DALL-E), Rombach et al. 2022 (Stable Diffusion), Saharia et al. 2022 (Imagen), Podell et al. 2023 (SDXL), and Labs 2024 (FLUX). While excellent for single image quality, they are known for slow inference times and struggle with subject identity consistency across multiple images.
  • This paper's fit: Infinite-Story builds upon the faster scale-wise autoregressive paradigm (Han et al. 2024) to address the latency issue while focusing on consistency, a common limitation of all T2I models.

3.2.2. Personalized Image Generation

This field focuses on generating images with user-specified features.

  • Subject-Driven Methods: Aim to inject specific concept embeddings (e.g., a particular person or object) from reference images. Examples include Li, Li, and Hoi 2023, Gal et al., Wei et al. 2023, Ye et al. 2023, Ruiz et al. 2023. Many require fine-tuning or parameter-efficient fine-tuning (Nam et al. 2024, Kumari et al. 2023), often needing external datasets.
  • Style-Driven Methods: Focus on maintaining visual consistency in terms of style. This can involve LoRA-based tuning (Frenkel et al. 2024, Shah et al. 2024, Sohn et al. 2023, Hu et al. 2022, Ryu 2022) or adapting attention mechanisms (Hertz et al. 2024, Park et al. 2025) for stylistic coherence.
  • Limitation: Most personalized generation methods rely on diffusion models, inheriting their slow inference speed.
  • This paper's fit: Infinite-Story tackles both subject (identity) and style consistency but does so in a training-free and fast manner, making it more suitable for interactive use cases than prior personalized generation methods.

3.2.3. Consistent Text-to-Image Generation

This is a more specific area within personalized image generation, focusing on preserving identity across multiple images.

  • Attention-based Consistency: Many recent works manipulate attention layer weights to modulate identity (Kumari et al. 2023, Li et al. 2024, Zhou et al. 2024b, Tewel et al. 2024).
  • Structured Control: Incorporating additional controls to aid identity preservation (Mou et al. 2023, Zhang, Rao, and Agrawala 2023).
  • Textual Conditioning: Leveraging the linguistic strength of transformer-based text encoders (Radford et al. 2021, Vaswani et al. 2017, Devlin et al. 2019, Chen et al. 2025, Raffel et al. 2023) and enhanced textual conditioning (Hertz et al. 2022, Gal et al. 2022) to improve consistency. Liu et al. 2025 specifically use prompt embedding variations.
  • Limitations: While these methods address identity consistency, they often "overlook style consistency" and are typically "diffusion-based," leading to slow inference.
  • This paper's fit: Infinite-Story is directly inspired by these insights, manipulating prompt embeddings and attention mechanisms, but uniquely extends it to a training-free and fast scale-wise autoregressive setting, ensuring both identity and style consistency.

3.3. Technological Evolution

The field of text-to-image generation has evolved from early attempts with GANs and autoregressive models generating lower-resolution or less coherent images to the current era dominated by powerful diffusion models capable of producing stunning high-resolution, photorealistic outputs. The Transformer architecture has played a pivotal role, first revolutionizing NLP and then becoming integral to modern generative models, including diffusion models and advanced autoregressive variants.

A key evolutionary step was the move from token-by-token or pixel-by-pixel generation in early autoregressive models to masked token generation and then next-scale prediction. This shift aimed to tackle the inherent slowness of autoregressive generation for complex data like images, making them more competitive in terms of speed while maintaining quality.

Initially, the focus was primarily on generating high-quality individual images. The subsequent evolution addressed specific challenges like personalization (making models generate specific subjects or styles) and consistency (maintaining those specific subjects/styles across multiple generations). Many solutions for personalization and consistency involved fine-tuning pre-trained diffusion models, which, while effective, reintroduced the problem of slow inference and high computational cost.

This paper's work fits within this technological timeline by addressing the efficiency and consistency gap. It leverages the latest advancements in fast scale-wise autoregressive models (Infinity) and combines them with training-free techniques for identity and style consistency, offering a practical solution for real-world storytelling applications that demand both speed and coherence.

3.4. Differentiation Analysis

Compared to the main methods in related work, Infinite-Story offers several core differences and innovations:

  • Training-Free Nature: Unlike many diffusion-based approaches for consistent T2I generation (e.g., IP-Adapter, PhotoMaker, OneActor, The Chosen One), which often require fine-tuning or specific training steps to achieve personalization/consistency, Infinite-Story operates entirely at test time. This significantly reduces computational overhead and increases flexibility for users. ConsiStory and 1Prompt1Story are also training-free but are diffusion-based.

  • Leveraging Scale-Wise Autoregressive Models: Most existing consistent T2I methods are built upon diffusion models (e.g., Stable Diffusion XL). While diffusion models excel in quality, they are inherently slow. Infinite-Story uniquely builds upon a scale-wise autoregressive model (Infinity), which is designed for much faster inference, thereby addressing the critical latency issue.

  • Dual Focus on Identity AND Style Consistency: While many prior works in consistent text-to-image generation primarily focus on identity consistency (e.g., ConsiStory, 1Prompt1Story, OneActor), they often "overlook style consistency" across generated images. Infinite-Story explicitly tackles both challenges through its Adaptive Style Injection and Synchronized Guidance Adaptation, ensuring a unified visual narrative.

  • Novel Techniques for Consistency:

    • Identity Prompt Replacement (IPR): Addresses a unique problem of context bias in text encoders, which can subtly alter identity attributes based on surrounding descriptive words. This is a targeted approach to prompt engineering for consistency.
    • Unified Attention Guidance (ASI + SGA): This mechanism directly manipulates self-attention layers and integrates with Classifier-Free Guidance to propagate identity and style from a reference image. This is a novel combination for achieving comprehensive consistency without training.
  • Superior Efficiency: The paper explicitly highlights its over 6X faster inference compared to the fastest existing consistent T2I models (1.72 seconds per image vs. 10+ seconds), making it highly suitable for interactive and real-time applications where others fall short.

    In essence, Infinite-Story innovates by combining a fast generative backbone with lightweight, training-free interventions that simultaneously ensure both identity and style consistency, a comprehensive solution not fully achieved by prior slower or less holistic methods.

4. Methodology

4.1. Principles

The core idea behind Infinite-Story is to achieve consistent subject identity and overall visual style across a sequence of images generated from multiple text prompts, all without requiring any additional training or fine-tuning of the underlying generative model. This is particularly challenging because standard Text-to-Image (T2I) models, when given different prompts, tend to produce images where the "same" subject or artistic style varies subtly, or even significantly, due to the contextual nature of text embeddings and the stochasticity of the generation process.

The theoretical basis or intuition is that consistency can be imposed at critical junctures of the generation process by leveraging a "reference" image or its extracted features. This reference acts as an anchor for identity and style. Specifically, the model focuses on three key areas for intervention:

  1. Text Encoding: The way text prompts are encoded can introduce biases, where the same character description might yield different visual attributes depending on other words in the prompt. This suggests that manipulating prompt embeddings can unify identity representations.

  2. Early Generation Steps: The early stages of image generation (e.g., lower resolution feature maps or initial diffusion steps) are often responsible for establishing global structure and style. Intervening here can set a consistent visual foundation.

  3. Attention Mechanism: The self-attention mechanism within Transformer-based models determines how different parts of the image (or its features) relate to each other and to the text prompt. By guiding this attention with reference features, the model can be encouraged to focus on and preserve consistent visual elements.

    By operating on a scale-wise autoregressive model (which builds images incrementally from coarse to fine scales), the interventions can be strategically applied at early, global-affecting scales to propagate consistency efficiently.

4.2. Core Methodology In-depth (Layer by Layer)

Infinite-Story is built upon the Infinity architecture, a scale-wise autoregressive model that employs a next-scale prediction scheme. The overall pipeline involves a text encoder, a transformer for autoregressive feature map prediction, and a decoder for image reconstruction.

The framework aims to generate $N$ images, denoted as $\mathbf{I} = \{I^n\}_{n=1}^{N}$, from corresponding text prompts $\mathbf{t} = \{t^n\}_{n=1}^{N}$. Each prompt $t^n$ is conceptually composed of an identity prompt $t_{\mathrm{iden}}^n$ and an expression prompt $t_{\mathrm{exp}}^n$. The goal is to ensure consistency in both identity and overall style across all generated images. All prompts are processed in parallel as a batch.

The generation process of the base Infinity model is given by:
$$
\mathbf{I} = D(\mathbf{F}_S), \quad \mathbf{F}_s = \sum_{i=1}^{s} \mathrm{up}_{H \times W}(\mathbf{R}_i), \quad \mathbf{R}_s \in \mathbb{R}^{N \times h_s \times w_s},
$$
$$
\mathbf{R}_s = G(\mathbf{F}_{s-1}, \mathbf{T}), \quad \mathbf{T} = E_T(\mathbf{t}) = \left\{ \left( T_{\mathrm{iden}}^n, T_{\mathrm{exp}}^n \right) \right\}_{n=1}^{N},
$$
where:

  • $\mathbf{I}$ represents the final generated images.

  • $D$ is the image decoder that reconstructs images from the final feature maps.

  • $\mathbf{F}_S$ denotes the final feature map at the last step $S$.

  • $\mathbf{F}_s$ is the cumulative feature map at step $s$, formed by upsampling and summing residual feature maps from previous steps.

  • $\mathrm{up}_{H \times W}(\cdot)$ is a bilinear upsampling function that scales the feature map to the target image size $H \times W$.

  • $\mathbf{R}_s$ is the quantized residual feature map at step $s$, predicted by the transformer $G$. It has dimensions $N \times h_s \times w_s$, where $N$ is the batch size (number of images) and $h_s, w_s$ are the spatial sizes at step $s$.

  • $G$ is the transformer model that autoregressively predicts the residual feature map $\mathbf{R}_s$, conditioned on the previous cumulative feature map $\mathbf{F}_{s-1}$ and the text embeddings $\mathbf{T}$.

  • $\mathbf{T}$ represents the encoded identity and expression features, obtained by the text encoder $E_T$ from the input text prompts $\mathbf{t}$: $\mathbf{T} = \{(T_{\mathrm{iden}}^n, T_{\mathrm{exp}}^n)\}_{n=1}^N$, where $T_{\mathrm{iden}}^n$ are identity embeddings and $T_{\mathrm{exp}}^n$ are expression embeddings for each prompt $n$.

  • The initial feature map $\mathbf{F}_0$ is initialized from $\mathbf{T}$.

    The overall pipeline of Infinite-Story (Figure 4 from the paper) integrates three complementary techniques into this base generation process:

Figure 4: Overall pipeline of our method. The text encoder $E_T$ (Chung et al. 2024) processes a set of text prompts $\mathbf{t}$, producing contextual embeddings $\mathbf{T}$ that condition the transformer. Identity Prompt Replacement is applied to $\mathbf{T}$ before generation; the transformer $G$, steered by the Unified Attention Guidance (Adaptive Style Injection and Synchronized Guidance Adaptation), then produces residual feature maps, which are decoded into the final images $\mathbf{I}$ via the image decoder.
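
To make the next-scale prediction loop concrete, here is a schematic sketch of the control flow implied by the equations above. The `transformer_G` and `decoder_D` callables, the scale schedule, and the zero initialization of the cumulative feature map are placeholders for exposition only; the actual Infinity implementation differs (for example, it initializes $\mathbf{F}_0$ from $\mathbf{T}$ and quantizes the residuals).

```python
import torch
import torch.nn.functional as F

def scale_wise_generate(transformer_G, decoder_D, T, scales, N, C, H, W):
    """Illustrative coarse-to-fine generation loop.

    transformer_G(F_prev, T, (h_s, w_s)) -> residual map R_s at that scale
    decoder_D(F_S)                       -> decoded images I
    scales                               -> list of (h_s, w_s), coarse to fine
    """
    F_cum = torch.zeros(N, C, H, W)          # stand-in for F_0 (initialized from T in the paper)
    for (h_s, w_s) in scales:
        R_s = transformer_G(F_cum, T, (h_s, w_s))                # predict residual at this scale
        R_up = F.interpolate(R_s, size=(H, W), mode="bilinear",  # up_{H x W}(R_s)
                             align_corners=False)
        F_cum = F_cum + R_up                                     # F_s = sum of upsampled residuals
    return decoder_D(F_cum)                                      # I = D(F_S)

# Dummy usage with stand-in modules; all shapes are arbitrary.
N, C, H, W = 2, 8, 32, 32
dummy_G = lambda F_prev, T, size: torch.randn(N, C, *size)
dummy_D = lambda F_S: torch.sigmoid(F_S[:, :3])                  # pretend 3-channel images
images = scale_wise_generate(dummy_G, dummy_D, T=None,
                             scales=[(4, 4), (8, 8), (16, 16), (32, 32)],
                             N=N, C=C, H=H, W=W)
```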

4.2.1. Identity Prompt Replacement (IPR)

The first step is to apply Identity Prompt Replacement (IPR) to the text embeddings  T \textbf { T } before the image generation process begins. This technique addresses context bias, a phenomenon where the same identity description (e.g., "a dog") can lead to different visual attributes (e.g., breed, age, gender) depending on the surrounding words in the prompt (e.g., "a dog springing toward a frisbee" vs. "a dog on a porch swing"). This bias arises from the self-attention mechanism within text encoders, where the representation of an identity is influenced by its context.

IPR aims to align identity-related attributes across all prompts by enforcing a consistent representation of identity. It does this by replacing all identity embeddings in the batch with the identity embedding of a designated reference instance (typically the first sample in the batch). To maintain the semantic relationship between identity and expression, a proportional scaling is applied to the expression embeddings.

The Identity Prompt Replacement is formulated as follows:
$$
\hat{\mathbf{T}} = \left( \hat{\mathbf{T}}_{\mathrm{iden}}, \hat{\mathbf{T}}_{\mathrm{exp}} \right) = \left\{ \left( T_{\mathrm{iden}}^{1}, \; \frac{\| T_{\mathrm{iden}}^{1} \|}{\| T_{\mathrm{iden}}^{n} \|} \cdot T_{\mathrm{exp}}^{n} \right) \right\}_{n=1}^{N},
$$
where:

  • $\hat{\mathbf{T}}$ represents the modified text embeddings after IPR. It consists of modified identity embeddings $\hat{\mathbf{T}}_{\mathrm{iden}}$ and modified expression embeddings $\hat{\mathbf{T}}_{\mathrm{exp}}$.
  • $T_{\mathrm{iden}}^{1}$ is the identity embedding of the reference instance (the first sample in the batch). This is used as the canonical identity for all samples.
  • $T_{\mathrm{iden}}^{n}$ is the original identity embedding for the $n$-th sample.
  • $T_{\mathrm{exp}}^{n}$ is the original expression embedding for the $n$-th sample.
  • The term $\frac{\| T_{\mathrm{iden}}^{1} \|}{\| T_{\mathrm{iden}}^{n} \|}$ acts as a scaling factor. It rescales the expression embedding $T_{\mathrm{exp}}^{n}$ in proportion to the ratio between the reference identity embedding and the sample's own original identity embedding. This adapts the expression to the unified identity representation, preserving the overall semantic relationship while aligning the identity.
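
As a rough illustration of the replacement rule above: assuming the identity and expression token embeddings are available as separate tensors of shape (N, L, d), IPR could be sketched as below. Interpreting the ratio as a ratio of embedding norms is my reading of the formula, so treat the scaling line as an assumption rather than the authors' exact code.

```python
import torch

def identity_prompt_replacement(T_iden, T_exp):
    """T_iden, T_exp: (N, L, d) identity / expression token embeddings per prompt.

    Replaces every identity embedding with the reference one (sample 0) and
    rescales each expression embedding by ||T_iden^1|| / ||T_iden^n||.
    """
    ref_iden = T_iden[0:1]                                   # (1, L, d) reference identity
    T_iden_hat = ref_iden.expand_as(T_iden).clone()          # unified identity embeddings
    ref_norm = ref_iden.norm(dim=(-2, -1), keepdim=True)     # norm of the reference identity
    own_norm = T_iden.norm(dim=(-2, -1), keepdim=True)       # per-sample identity norms
    scale = ref_norm / own_norm.clamp_min(1e-8)              # proportional scaling factor
    T_exp_hat = scale * T_exp                                # rescaled expression embeddings
    return T_iden_hat, T_exp_hat

# Toy check: batch of 3 prompts, 4 identity tokens, 8 expression tokens, dim 16.
T_iden = torch.randn(3, 4, 16)
T_exp = torch.randn(3, 8, 16)
T_iden_hat, T_exp_hat = identity_prompt_replacement(T_iden, T_exp)
```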

4.2.2. Unified Attention Guidance

Even with IPR, which aligns context-level identity attributes, preserving the precise visual appearance (appearance-level identity) and global visual style (e.g., art style, lighting) can still be a challenge. To address this, a Unified Attention Guidance mechanism is introduced, comprising Adaptive Style Injection (ASI) and Synchronized Guidance Adaptation (SGA). These are applied to the self-attention layers within the transformer $G$ during the early generation steps ($\mathbf{S}_{\mathrm{early}}$), as these steps are crucial for establishing global visual properties.

4.2.2.1. Adaptive Style Injection (ASI)

ASI aims to align both the appearance of the identity and the overall scene style. It operates within the self-attention layers. In self-attention, input features are transformed into Query (Q), Key (K), and Value (V) matrices. ASI modifies the Key and Value features for each sample in the batch, using the features from the reference instance.

The Adaptive Style Injection is defined as follows for a given self-attention layer at step $s$:
$$
\bar{K}_s^n = K_s^1, \quad \bar{V}_s^n = \alpha_s^n V_s^n + \left( 1 - \alpha_s^n \right) V_s^1,
$$
$$
\alpha_s^n = \lambda \cdot \cos\left( V_s^1, V_s^n \right), \quad \forall n \in \{ 1, \ldots, N \},
$$
where:

  • $\bar{K}_s^n$ and $\bar{V}_s^n$ are the modified Key and Value features for the $n$-th sample at step $s$.
  • $K_s^1$ and $V_s^1$ denote the Key and Value features of the reference sample (the first sample in the batch).
  • $V_s^n$ is the original Value feature of the $n$-th sample at step $s$.
  • The Key features for all samples are replaced with the Key features of the reference sample ($\bar{K}_s^n = K_s^1$). This encourages all samples to attend to semantically consistent regions as defined by the reference, ensuring structural alignment.
  • The Value features are adaptively interpolated between the original Value feature of the $n$-th sample ($V_s^n$) and the Value feature of the reference sample ($V_s^1$).
  • $\alpha_s^n$ is an adaptive interpolation weight, computed from the cosine similarity between the reference's Value features and the $n$-th sample's Value features: $\alpha_s^n = \lambda \cdot \cos(V_s^1, V_s^n)$.
  • $\lambda$ is a scaling coefficient (a hyperparameter, set to 0.85 in experiments). It controls the strength of the style injection.
  • $\cos$ refers to the cosine similarity function. A higher cosine similarity means $V_s^1$ and $V_s^n$ are more similar, leading to a higher $\alpha_s^n$ and thus more weight given to the sample's own features. Conversely, if they are dissimilar, more weight is given to the reference's features, promoting stronger alignment. This similarity-guided interpolation allows for smooth and proportional alignment of appearance and global style.
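
Below is a minimal sketch of ASI inside one self-attention layer, with sample 0 as the reference. Flattening each sample's Value map to a single vector for the cosine term is an assumption about granularity; the paper does not spell out that detail here, so this is one plausible reading rather than the official implementation.

```python
import torch
import torch.nn.functional as F

def adaptive_style_injection(K, V, lam=0.85):
    """K, V: (N, L, d) key/value features of one self-attention layer.

    Sample 0 is the reference. Keys are replaced by the reference keys;
    values are interpolated toward the reference with alpha_n = lam * cos(V^1, V^n).
    """
    K_bar = K[0:1].expand_as(K)                              # all samples attend via reference keys
    v_ref = V[0:1].reshape(1, -1)                            # flattened reference values
    v_all = V.reshape(V.shape[0], -1)                        # flattened per-sample values
    alpha = lam * F.cosine_similarity(v_all, v_ref, dim=-1)  # (N,) similarity-guided weights
    alpha = alpha.view(-1, 1, 1)                             # broadcast over tokens and channels
    V_bar = alpha * V + (1.0 - alpha) * V[0:1]               # interpolate toward reference values
    return K_bar, V_bar, alpha

# Toy usage: 4 images in the batch, 64 tokens, 32 channels.
K = torch.randn(4, 64, 32)
V = torch.randn(4, 64, 32)
K_bar, V_bar, alpha = adaptive_style_injection(K, V)
```

The returned `alpha` is what Synchronized Guidance Adaptation reuses on the unconditional branch, as described in the next subsection.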

4.2.2.2. Synchronized Guidance Adaptation (SGA)

Adaptive Style Injection (ASI) applies modifications only to the conditional branch of the generation process (the path guided by the text prompt). However, generative models often use Classifier-Free Guidance (CFG) to enhance prompt fidelity. CFG works by combining outputs from a conditional branch (guided by the prompt) and an unconditional branch (guided by a null or empty prompt). Applying ASI only to the conditional branch can disrupt the delicate balance between these two branches, potentially degrading prompt fidelity.

To resolve this, Synchronized Guidance Adaptation (SGA) applies the same operations to the unconditional branch using the identical interpolation weights ($\alpha_s^n$) computed from the conditional path. This ensures that the guidance applied for consistency is synchronized across both branches, preserving the intended effect of CFG.

For the unconditional branch, the Key and Value features are modified as:
$$
\tilde{k}_s^n = k_s^1, \quad \tilde{v}_s^n = \alpha_s^n v_s^n + \left( 1 - \alpha_s^n \right) v_s^1, \quad \forall n \in \{ 1, \ldots, N \},
$$
where:

  • $\tilde{k}_s^n$ and $\tilde{v}_s^n$ are the modified Key and Value features for the $n$-th sample in the unconditional branch at step $s$.

  • $k_s^1$ and $v_s^1$ denote the Key and Value features of the reference sample in the unconditional branch.

  • $v_s^n$ is the original Value feature of the $n$-th sample in the unconditional branch at step $s$.

  • Crucially, $\alpha_s^n$ is the same adaptive weight shared from the conditional branch (calculated in ASI). This synchronization ensures that any consistency-inducing modifications to the conditional path are mirrored in the unconditional path, maintaining the relative scaling and direction of the CFG mechanism.

    By combining IPR, ASI, and SGA, Infinite-Story achieves consistent identity and style across generated images while maintaining high prompt fidelity, all during the inference phase without any additional training.
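
Continuing the ASI sketch above, SGA amounts to reusing the conditional-branch weights when modifying the unconditional branch. The snippet below is an illustrative companion to that sketch, under the same shape assumptions, not the authors' code.

```python
import torch

def synchronized_guidance_adaptation(k_uncond, v_uncond, alpha):
    """k_uncond, v_uncond: (N, L, d) unconditional-branch key/value features.
    alpha: (N, 1, 1) weights shared from the conditional branch (ASI)."""
    k_tilde = k_uncond[0:1].expand_as(k_uncond)              # reference keys, mirroring ASI
    v_tilde = alpha * v_uncond + (1.0 - alpha) * v_uncond[0:1]
    return k_tilde, v_tilde

# Reusing the ASI alpha keeps CFG's conditional and unconditional branches
# modified in lockstep, which is the point of SGA.
k_u = torch.randn(4, 64, 32)
v_u = torch.randn(4, 64, 32)
alpha = 0.85 * torch.rand(4, 1, 1)                           # stand-in for the shared ASI weights
k_tilde, v_tilde = synchronized_guidance_adaptation(k_u, v_u, alpha)
```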

5. Experimental Setup

5.1. Datasets

The paper does not use a specific image dataset for training or fine-tuning, as Infinite-Story is a training-free framework built upon a pre-trained scale-wise autoregressive model. Instead, the evaluation focuses on a benchmark dataset of text prompts designed for consistent T2I generation.

  • ConsiStory+ Benchmark: The evaluation follows the protocol proposed in 1Prompt1Story (Liu et al. 2025), which is an extension of the original ConsiStory benchmark (Tewel et al. 2024).
    • Characteristics: ConsiStory+ expands the evaluation space by introducing a more diverse range of subjects, prompt descriptions, and styles.
    • Scale: The evaluation involves 200 distinct prompt sets, resulting in the generation of up to 1,500 images in total.
    • Domain: The benchmark covers various subjects, scenarios, and artistic styles to thoroughly test the consistency capabilities of T2I models.
    The choice of ConsiStory+ is effective for validating the method's performance because it specifically targets the core problem Infinite-Story addresses: evaluating both prompt alignment (how well images match their text descriptions) and the consistency of subject identity and style over a diverse set of prompts, which is crucial for storytelling scenarios.

5.2. Evaluation Metrics

The paper uses a comprehensive set of metrics to evaluate prompt fidelity, identity consistency, and style consistency.

5.2.1. Prompt Fidelity (CLIP-T)

  • Conceptual Definition: CLIP-T measures how well the generated image aligns semantically with its corresponding text prompt. It quantifies the degree to which the visual content of the image matches the textual description.
  • Mathematical Formula: The CLIP-T score is based on the cosine similarity between the CLIP text embedding and the CLIP image embedding: $\mathrm{CLIP\text{-}T} = \mathrm{cosine\_similarity}(E_I(I_{gen}), E_T(t_{prompt}))$, where the cosine similarity is scaled by a factor of 2.5 and a prefix "A photo depicts" is prepended to each prompt, as per the 1Prompt1Story protocol.
  • Symbol Explanation:
    • $I_{gen}$: A generated image.
    • $t_{prompt}$: The text prompt corresponding to $I_{gen}$.
    • $E_I(\cdot)$: The CLIP image encoder (specifically CLIP ViT-B/32).
    • $E_T(\cdot)$: The CLIP text encoder (specifically CLIP ViT-B/32).
    • $\mathrm{cosine\_similarity}(a, b) = \frac{a \cdot b}{\|a\| \|b\|}$: The cosine of the angle between two vectors $a$ and $b$, indicating their semantic similarity.
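
A hedged sketch of a CLIP-T-style computation using the Hugging Face `transformers` CLIP ViT-B/32 checkpoint. The "A photo depicts" prefix and the 2.5 scaling follow the description above, but the exact preprocessing in the 1Prompt1Story evaluation scripts may differ, so this approximates the metric rather than reproducing the official script.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_t(image: Image.Image, prompt: str) -> float:
    text = "A photo depicts " + prompt                       # prefix used by the evaluation protocol
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    sim = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * sim                                         # scaling factor from the protocol
```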

5.2.2. Identity Consistency (CLIP-I and DreamSim)

To measure identity consistency, two metrics are used:

5.2.2.1. CLIP-I

  • Conceptual Definition: CLIP-I measures the average pairwise visual similarity between images generated from the same identity prompt. It quantifies how consistently a particular subject's appearance is maintained across different generated scenes. To ensure it only reflects the subject's identity, backgrounds are removed.
  • Mathematical Formula: $\mathrm{CLIP\text{-}I} = \frac{1}{M(M-1)/2} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \mathrm{cosine\_similarity}(E_I(I_i^{no\_bg}), E_I(I_j^{no\_bg}))$, where the average is taken over all unique pairs of images generated for the same identity.
  • Symbol Explanation:
    • $M$: The number of images generated for a single identity prompt.
    • $I_i^{no\_bg}$: The $i$-th generated image for a given identity, with its background removed using CarveKit.
    • $E_I(\cdot)$: The CLIP image encoder (specifically ViT-B/16).
    • $\mathrm{cosine\_similarity}(\cdot, \cdot)$: The cosine similarity function.
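
Given one embedding per background-removed image (for example, CLIP ViT-B/16 image features extracted as in the previous sketch), the average pairwise similarity behind CLIP-I (and, analogously, DINO) can be computed as follows; the random embeddings in the usage line are stand-ins, not real features.

```python
import itertools
import torch

def average_pairwise_cosine(embeddings: torch.Tensor) -> float:
    """embeddings: (M, d), one embedding per image of the same identity."""
    emb = torch.nn.functional.normalize(embeddings, dim=-1)
    sims = [float(emb[i] @ emb[j])
            for i, j in itertools.combinations(range(emb.shape[0]), 2)]
    return sum(sims) / len(sims)                             # mean over all unique pairs

# Example with random stand-in embeddings for M = 5 images.
clip_i = average_pairwise_cosine(torch.randn(5, 512))
```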

5.2.2.2. DreamSim

  • Conceptual Definition: DreamSim is a perceptual similarity metric designed to correlate well with human judgment of visual similarity. It measures the "distance" or dissimilarity between images. For evaluation, it's converted to a similarity measure [1 - DreamSim]. Like CLIP-I, backgrounds are removed to focus on subject identity.
  • Mathematical Formula: DreamSim outputs a distance, so for consistency evaluation it is converted to a similarity: $\mathrm{DreamSim_{similarity}} = 1 - \mathrm{DreamSim_{distance}}(I_i^{no\_bg}, I_j^{no\_bg})$. The average pairwise similarity is then computed as $\frac{1}{M(M-1)/2} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \left( 1 - \mathrm{DreamSim_{distance}}(I_i^{no\_bg}, I_j^{no\_bg}) \right)$.
  • Symbol Explanation:
    • $M$: The number of images generated for a single identity prompt.
    • $I_i^{no\_bg}$: The $i$-th generated image for a given identity, with its background removed.
    • $\mathrm{DreamSim_{distance}}(\cdot, \cdot)$: The DreamSim perceptual distance metric between two images.

5.2.3. Style Consistency (DINO)

  • Conceptual Definition: DINO similarity (specifically using the CLS token from a DINO ViT-B/8 model) captures global visual similarity, including aspects like rendering, background, and texture. It's used to assess how consistent the overall artistic or visual style is among images conditioned on the same identity prompt.
  • Mathematical Formula: Similar to CLIP-I, DINO similarity is computed as the average pairwise cosine similarity between DINO embeddings of images for the same identity: $\mathrm{DINO} = \frac{1}{M(M-1)/2} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \mathrm{cosine\_similarity}(E_D(I_i), E_D(I_j))$.
  • Symbol Explanation:
    • $M$: The number of images generated for a single identity prompt.
    • $I_i$: The $i$-th generated image for a given identity (full image, no background removal).
    • $E_D(\cdot)$: The DINO image encoder (DINO ViT-B/8), specifically extracting the CLS token embedding.
    • $\mathrm{cosine\_similarity}(\cdot, \cdot)$: The cosine similarity function.
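
A sketch of extracting DINO ViT-B/8 CLS embeddings via `torch.hub`, following the public facebookresearch/dino repository; the preprocessing below is a standard ImageNet transform and may not match the paper's exact pipeline. The resulting embeddings can be fed to the `average_pairwise_cosine` helper above.

```python
import torch
from PIL import Image
from torchvision import transforms

# Public DINO checkpoint; for ViT models, forward() returns the CLS-token embedding.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vitb8")
dino.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def dino_embedding(image: Image.Image) -> torch.Tensor:
    return dino(preprocess(image).unsqueeze(0)).squeeze(0)   # (d,) CLS embedding
```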

5.2.4. Harmonic Score ($S_H$)

  • Conceptual Definition: The harmonic score $S_H$ is an aggregate metric that combines CLIP-T, CLIP-I, DreamSim (converted to similarity), and DINO using the harmonic mean. The harmonic mean is sensitive to low values, meaning that a poor performance in any single metric will significantly lower the overall score. This provides a balanced view of both prompt fidelity and visual consistency across identity and style.
  • Mathematical Formula: $S_H = \mathrm{HM}(\mathrm{CLIP\text{-}T}, \mathrm{CLIP\text{-}I}, 1 - \mathrm{DreamSim}, \mathrm{DINO})$, where $\mathrm{HM}$ denotes the Harmonic Mean. For $k$ positive numbers $x_1, x_2, \ldots, x_k$, the Harmonic Mean is $\mathrm{HM}(x_1, x_2, \ldots, x_k) = \frac{k}{\frac{1}{x_1} + \frac{1}{x_2} + \ldots + \frac{1}{x_k}}$.
  • Symbol Explanation:
    • $\mathrm{CLIP\text{-}T}$: The CLIP-T score (prompt fidelity).
    • $\mathrm{CLIP\text{-}I}$: The CLIP-I score (identity consistency).
    • $1 - \mathrm{DreamSim}$: The DreamSim distance converted to a similarity score (identity consistency).
    • $\mathrm{DINO}$: The DINO similarity score (style consistency).
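
A small helper for the aggregate score; plugging in the paper's reported values for Infinite-Story (Table 1) reproduces its $S_H \approx 0.8538$ up to rounding.

```python
def harmonic_mean(*values: float) -> float:
    return len(values) / sum(1.0 / v for v in values)

def harmonic_score(clip_t: float, clip_i: float, dreamsim: float, dino: float) -> float:
    # DreamSim is a distance, so it enters the harmonic mean as (1 - DreamSim).
    return harmonic_mean(clip_t, clip_i, 1.0 - dreamsim, dino)

# Reported values for Infinite-Story: CLIP-T 0.8732, CLIP-I 0.9267, DreamSim 0.1834, DINO 0.8089.
print(harmonic_score(clip_t=0.8732, clip_i=0.9267, dreamsim=0.1834, dino=0.8089))  # ~0.8538
```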

5.2.5. Implementation Details

  • The evaluation scripts from 1Prompt1Story (Liu et al. 2025) are adapted, with the addition of the DINO metric.
  • All metrics are computed on a single A6000 GPU using PyTorch.
  • Background removal using CarveKit (Selin 2023) is applied consistently for identity-based metrics (CLIP-I and DreamSim) to isolate subject appearance.
  • All features are extracted following standard preprocessing pipelines provided by each model (CLIP, DINO).

5.3. Baselines

The paper compares Infinite-Story against a variety of state-of-the-art consistent text-to-image generation models, including both training-based and training-free approaches, leveraging different underlying generative backbones (primarily Stable Diffusion XL and Infinity).

5.3.1. General Baselines

  • Vanilla SDXL (Podell et al. 2023): A standard diffusion model without any consistency enhancements. This serves as a general baseline for image quality and shows the inherent inconsistency of a standalone T2I model.
  • Vanilla Infinity (Han et al. 2024): The base scale-wise autoregressive model that Infinite-Story is built upon, also without consistency enhancements. This highlights the improvements brought by Infinite-Story's techniques.

5.3.2. Image-Based Consistent Text-to-Image Models (Diffusion-based, often requiring external reference images)

These models typically use Stable Diffusion XL as their backbone and are provided with a reference image to guide consistency.

  • IP-Adapter (Ye et al. 2023): A method that adapts text prompts with image prompts using a separate adapter module, allowing for identity preservation. (Training-based, using official code).

  • PhotoMaker (Li et al. 2024): A technique for customizing realistic human photos by stacking ID embeddings. (Training-based, using official code).

  • The Chosen One (Avrahami et al. 2024): A diffusion-based method for consistent characters in text-to-image generation. (Training-based, using unofficial code).

  • OneActor (Wang et al. 2024): Focuses on consistent subject generation via cluster-conditioned guidance. (Training-based, using official repository).

  • StoryDiffusion (Zhou et al. 2024b): A training-free method using consistent self-attention for long-range image and video generation. (Training-free, using official repository).

    For these image-based models, a reference image is generated by providing only the identity portion of the full prompt to their respective base models (e.g., for "A graceful unicorn galloping through a flower field," the reference is generated from "A graceful unicorn"). This reference image is then consistently used across all prompts in the same sequence.

5.3.3. Non-Reference Consistent Text-to-Image Models (Diffusion-based, often training-free)

These models aim to achieve consistency without an explicit external reference image, often through internal mechanisms.

  • ConsiStory (Tewel et al. 2024): A training-free consistent text-to-image generation method. (Training-free, using official code).
  • 1Prompt1Story (Liu et al. 2025): A training-free method using a single prompt structure for consistent T2I generation. (Training-free, using official repository).

5.3.4. Inference Settings

  • For all methods, the DDIM sampling settings provided in their open-source implementations are adopted.

  • The number of DDIM sampling steps is fixed to 50 for all models, including The Chosen One (unofficial implementation), to ensure consistency in comparisons.

    These baselines are representative as they cover a wide range of state-of-the-art approaches for consistent T2I generation, including both training-based and training-free methods, and models that focus on identity, style, or a combination. Comparing against both Vanilla SDXL and Vanilla Infinity clearly isolates the contribution of Infinite-Story's consistency techniques.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Infinite-Story achieves state-of-the-art performance in consistent text-to-image generation, particularly excelling in balancing identity consistency, style consistency, and prompt fidelity while offering significantly faster inference.

The following are the results from Table 1 of the original paper:

| Method | Train-Free | $S_H$ ↑ | DINO ↑ | CLIP-T ↑ | CLIP-I ↑ | DreamSim ↓ | Inference Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| Vanilla SDXL (Podell et al. 2023) | – | 0.7408 | 0.6067 | 0.9074 | 0.8793 | 0.3385 | 10.27 |
| Vanilla Infinity (Han et al. 2024) | – | 0.7891 | 0.6965 | 0.8836 | 0.8955 | 0.2780 | 1.71 |
| IP-Adapter (Ye et al. 2023) | ✗ | 0.8323 | 0.7834 | 0.8661 | 0.9243 | 0.2266 | 10.40 |
| PhotoMaker (Li et al. 2024) | ✗ | 0.7223 | 0.6516 | 0.8651 | 0.8465 | 0.3996 | 19.52 |
| The Chosen One (Avrahami et al. 2024) | ✗ | 0.6494 | 0.5824 | 0.8162 | 0.7943 | 0.4893 | 13.47 |
| OneActor (Wang et al. 2024) | ✗ | 0.8088 | 0.7172 | 0.8859 | 0.9070 | 0.2423 | 24.94 |
| ConsiStory (Tewel et al. 2024) | ✓ | 0.7902 | 0.6895 | 0.9019 | 0.8954 | 0.2787 | 37.76 |
| StoryDiffusion (Zhou et al. 2024b) | ✓ | 0.7634 | 0.6783 | 0.8403 | 0.8917 | 0.3212 | 23.68 |
| 1Prompt1Story (Liu et al. 2025) | ✓ | 0.8395 | 0.7687 | 0.8942 | 0.9117 | 0.1993 | 22.57 |
| Ours | ✓ | 0.8538 | 0.8089 | 0.8732 | 0.9267 | 0.1834 | 1.72 |

Key observations:

  • Overall Performance ($S_H$): Infinite-Story (Ours) achieves the highest harmonic score ($S_H$) at 0.8538, indicating the best balance across all core metrics (prompt fidelity, identity consistency, and style consistency). This is higher than the next best (1Prompt1Story at 0.8395) and significantly better than the vanilla models.

  • Identity Consistency (CLIP-I and DreamSim): Our method achieves the highest CLIP-I score (0.9267) and the lowest DreamSim score (0.1834), both indicating superior identity consistency. This validates the effectiveness of Identity Prompt Replacement and Adaptive Style Injection in maintaining subject appearance.

  • Style Consistency (DINO): Infinite-Story also obtains the highest DINO similarity (0.8089), demonstrating its ability to maintain a consistent global visual style across generated images. This highlights the success of Adaptive Style Injection.

  • Prompt Fidelity (CLIP-T): While CLIP-T (0.8732) is not the absolute highest (Vanilla SDXL and ConsiStory achieve slightly higher), it remains very competitive, especially considering the strong consistency achieved. This indicates that Synchronized Guidance Adaptation successfully preserves prompt fidelity despite the strong consistency injections.

  • Inference Time: This is where Infinite-Story shows a stark advantage. At 1.72 seconds per image, it is dramatically faster than all other consistent T2I models. The next fastest, Vanilla Infinity (1.71s), achieves similar speed but significantly lower consistency ($S_H$ 0.7891 vs. 0.8538). Compared to other training-free methods like 1Prompt1Story (22.57s) or ConsiStory (37.76s), Infinite-Story is over 13 times faster, making it practical for real-time applications. Even compared to IP-Adapter (10.40s), it is over 6 times faster.

  • Training-Free Advantage: The table clearly shows that Infinite-Story achieves top-tier performance while being training-free, distinguishing it from strong training-based contenders like IP-Adapter and OneActor that might achieve good CLIP-I but are much slower and require training.

    These results strongly validate the effectiveness of Infinite-Story by demonstrating its superior balance across all evaluation criteria, especially its groundbreaking efficiency combined with state-of-the-art consistency and fidelity.

The following figure (Figure 3 from the original paper) provides a visual comparison of inference time and harmonic score, further emphasizing the efficiency of Infinite-Story.


Figure 3: Comparison of inference time and harmonic score $S_H$ between our method and state-of-the-art identity-consistent text-to-image generation models.

The qualitative results shown in Figure 6 also corroborate the quantitative findings:

Figure 6: Qualitative comparison with state-of-the-art consistent T2I generation models. Each row depicts a set of images generated using a shared identity prompt combined with varying expression prompts.

In the qualitative comparison, Infinite-Story successfully generates image sequences (e.g., the elf character and the watercolor hedgehog) that:

  • Maintain Subject Identity: The elf's facial structure and the hedgehog's distinct features remain consistent across various scenes and actions.

  • Preserve Unified Visual Style: The overall artistic style, rendering, and lighting remain coherent throughout the sequence.

  • Reflect Prompt-Specific Nuances: The images accurately depict the different expressions and contexts described in the varying prompts (e.g., the elf guarding, looking at a map, or on a boat).

    This contrasts with other methods, where:

  • IP-Adapter preserves identity but often fails to reflect prompt nuances.

  • OneActor and 1Prompt1Story capture expression well but show style shifts.

  • StoryDiffusion and ConsiStory show style consistency but struggle with identity.

  • PhotoMaker and The Chosen One underperform in multiple aspects.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation Study on Proposed Components

An ablation study was conducted to evaluate the individual contributions of Identity Prompt Replacement (IPR), Adaptive Style Injection (ASI), and Synchronized Guidance Adaptation (SGA).

The following are the results from Table 2 of the original paper:

| # | IPR | ASI | SGA | $S_H$ ↑ | DINO ↑ | CLIP-T ↑ | CLIP-I ↑ | DreamSim ↓ |
|---|---|---|---|---|---|---|---|---|
| (a) | | | | 0.7891 | 0.6965 | 0.8836 | 0.8955 | 0.2780 |
| (b) | ✓ | | | 0.8013 | 0.7119 | 0.8814 | 0.9046 | 0.2569 |
| (c) | ✓ | ✓ | | 0.8481 | 0.8082 | 0.8625 | 0.9242 | 0.1931 |
| (d) | ✓ | ✓ | ✓ | 0.8538 | 0.8089 | 0.8732 | 0.9267 | 0.1834 |

Analysis:

  • (a) Baseline (Vanilla Infinity): This represents the base Infinity model without any of the proposed consistency techniques. It serves as the starting point.

  • (b) IPR Only: Adding only Identity Prompt Replacement (IPR) shows improvements in CLIP-I (0.8955 to 0.9046) and DreamSim (0.2780 to 0.2569). This confirms that IPR effectively mitigates text encoder context bias and aligns identity-related attributes, leading to better identity consistency. DINO also improves, while CLIP-T stays nearly unchanged (0.8836 to 0.8814), contributing to a higher $S_H$ (0.7891 to 0.8013).

  • (c) IPR + ASI: Incorporating Adaptive Style Injection (ASI) alongside IPR leads to significant gains in DINO (0.7119 to 0.8082), indicating much improved global style consistency. CLIP-I (0.9046 to 0.9242) and DreamSim (0.2569 to 0.1931) also further improve, demonstrating enhanced appearance-level identity alignment. The $S_H$ jumps considerably (0.8013 to 0.8481). However, CLIP-T drops (0.8814 to 0.8625), suggesting that injecting strong style guidance without synchronization can compromise prompt fidelity.

  • (d) Full Method (IPR + ASI + SGA): Adding Synchronized Guidance Adaptation (SGA) on top of IPR and ASI restores the balance. CLIP-T significantly recovers (0.8625 to 0.8732), indicating that SGA successfully maintains prompt fidelity by synchronizing modifications across conditional and unconditional branches of CFG. The overall harmonic score (S_H) reaches its peak at 0.8538, and CLIP-I and DreamSim also show minor additional improvements.

    This quantitative analysis confirms that each component contributes positively to the overall consistency and fidelity, with SGA playing a crucial role in mitigating the trade-off between consistency and prompt fidelity.

The following figure (Figure 7 from the original paper) provides a qualitative analysis of the ablation study, visually reinforcing the quantitative findings:

Figure 7: Qualitative analysis of ablation study. The results from (a)-(d) correspond to the configurations in Table 2.

Qualitative Analysis of Ablation Study:

  • (a) Without any proposed methods: Both the flower (species, rendering style) and the red fox (fur texture, facial shape) show severe identity inconsistency and style inconsistency.
  • (b) With IPR only: IPR helps unify the identity-related attributes. The lily maintains a more consistent floral structure, and the red fox shows more stable facial features and body proportions. However, global style elements (lighting, rendering) still vary.
  • (c) With IPR + ASI: ASI dramatically improves global style and appearance-level identity consistency. The flower has stable coloration and stroke patterns, and the fox maintains consistent shading and background textures. However, some prompt-specific semantics are underemphasized, and visual artifacts (unnatural outlines, distorted textures) appear, likely due to strong style injection overriding localized details.
  • (d) Full method (IPR + ASI + SGA): SGA restores the balance, ensuring prompt fidelity. This results in visually coherent outputs that maintain consistent subject appearance and unified stylistic rendering, while accurately reflecting prompt-specific variations (posture, context, lighting).

6.2.2. Ablation Study on Adaptive Style Injection Scaling Coefficient ($\lambda$)

An additional ablation study was performed to analyze the effect of the scaling coefficient $\lambda$ in Adaptive Style Injection (ASI).

The following are the results from Table 5 of the original paper:

| Parameter | $S_H$ ↑ | DINO ↑ | CLIP-T ↑ | CLIP-I ↑ | DreamSim ↓ |
|---|---|---|---|---|---|
| λ = 0.6 | 0.8420 | 0.7967 | 0.8745 | 0.9209 | 0.1998 |
| λ = 0.7 | 0.8473 | 0.7865 | 0.8737 | 0.9227 | 0.1919 |
| λ = 0.8 | 0.8506 | 0.8058 | 0.8735 | 0.9245 | 0.1904 |
| λ = 0.85 (Ours) | 0.8538 | 0.8089 | 0.8732 | 0.9267 | 0.1834 |
| λ = 0.9 | 0.8538 | 0.8102 | 0.8722 | 0.9251 | 0.1826 |

Analysis:

  • As λ increases, DINO, CLIP-I, and DreamSim generally improve, indicating that stronger style injection leads to better identity and style consistency. The best DINO (0.8102) and DreamSim (0.1826) scores are achieved at λ = 0.9.
  • However, CLIP-T (prompt fidelity) tends to degrade as λ increases: it is highest at λ = 0.6 (0.8745) and falls to 0.8722 at λ = 0.9. This confirms the trade-off: too strong a style injection can override prompt-specific details.
  • The chosen default of λ = 0.85 is a compromise. It achieves the highest harmonic score S_H of 0.8538 (tied with λ = 0.9) while offering slightly better CLIP-T (0.8732 vs. 0.8722) and maintaining very competitive consistency metrics, demonstrating that λ = 0.85 strikes a good balance. A quick check of how S_H relates to the other columns follows below.
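As a side note on the S_H column: the reported values are consistent with the harmonic mean of DINO, CLIP-T, CLIP-I, and (1 − DreamSim). The check below reproduces the λ = 0.85 row; this is a reconstruction inferred from the tables, not a definition quoted from the paper.

```python
# Reconstruction check: the S_H column behaves like the harmonic mean of
# DINO, CLIP-T, CLIP-I, and (1 - DreamSim). Inferred from the tables,
# not quoted from the paper.
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

# λ = 0.85 row of Table 5
dino, clip_t, clip_i, dreamsim = 0.8089, 0.8732, 0.9267, 0.1834
s_h = harmonic_mean([dino, clip_t, clip_i, 1.0 - dreamsim])
print(round(s_h, 4))  # ≈ 0.8538, matching the reported value
```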

6.2.3. Generality of the Method

The paper also evaluated the generality of Infinite-Story by applying its techniques to other scale-wise autoregressive T2I models: Switti (Voronov et al. 2024) and HART (Tang et al. 2024).

The following are the results from Table 4 of the original paper:

| Method | S_H ↑ | DINO ↑ | CLIP-T ↑ | CLIP-I ↑ | DreamSim ↓ |
| --- | --- | --- | --- | --- | --- |
| Vanilla Switti (Voronov et al. 2024) | 0.7719 | 0.6595 | 0.8904 | 0.8871 | 0.2934 |
| Switti + Ours | 0.8146 | 0.7441 | 0.8756 | 0.9018 | 0.2398 |
| Vanilla HART (Tang et al. 2024) | 0.7434 | 0.6381 | 0.8848 | 0.8714 | 0.3488 |
| HART + Ours | 0.7894 | 0.7048 | 0.8505 | 0.8982 | 0.2945 |

Analysis: Applying Infinite-Story's techniques (IPR, ASI, SGA) to Switti and HART yields clear improvements in DINO, CLIP-I, and DreamSim, and the harmonic score S_H improves for both backbones. CLIP-T dips slightly in each case, mirroring the consistency-fidelity trade-off discussed above. This indicates that the proposed techniques are not specific to the Infinity model but generalize to other architectures within the scale-wise autoregressive family, enhancing their consistency capabilities.

6.3. User Study

A user study involving 55 participants was conducted to complement the quantitative evaluation, focusing on human perception of identity consistency, style consistency, and prompt fidelity.

The following are the results from Table 3 of the original paper:

| Method | Identity ↑ | Style ↑ | Prompt ↑ |
| --- | --- | --- | --- |
| 1Prompt1Story (Liu et al. 2025) | 18.0% | 13.2% | 28.2% |
| OneActor (Wang et al. 2024) | 7.2% | 7.2% | 10.6% |
| IP-Adapter (Ye et al. 2023) | 16.4% | 29.6% | 4.7% |
| Ours | 58.4% | 50.0% | 56.5% |

Analysis: The results show that Infinite-Story (Ours) is significantly preferred by users across all three criteria:

  • Identity Consistency: 58.4% preference for Infinite-Story, vastly outperforming others (e.g., 1Prompt1Story at 18.0%).

  • Style Consistency: 50.0% preference for Infinite-Story, again leading by a large margin (e.g., IP-Adapter at 29.6%).

  • Prompt Fidelity: 56.5% preference for Infinite-Story, demonstrating strong human-perceived alignment with text descriptions (e.g., 1Prompt1Story at 28.2%).

    These results strongly indicate that Infinite-Story achieves superior human-perceived consistency and prompt fidelity compared to leading baselines, further validating its effectiveness and practical value.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Infinite-Story, a novel training-free framework for generating consistent text-to-image sequences tailored for multi-prompt storytelling scenarios. By building upon a scale-wise autoregressive model (Infinity), the method effectively addresses the dual challenges of identity inconsistency and style inconsistency. The core of Infinite-Story lies in three lightweight yet effective techniques: Identity Prompt Replacement (IPR), which unifies identity attributes by mitigating text encoder context bias; and a unified attention guidance mechanism combining Adaptive Style Injection (ASI) and Synchronized Guidance Adaptation (SGA), which jointly enforce appearance-level identity and global style consistency while maintaining prompt fidelity. Extensive experiments, including quantitative comparisons against state-of-the-art models and a user study, demonstrate that Infinite-Story achieves superior identity and style consistency and prompt fidelity. Crucially, it offers over 6 times faster inference (1.72 seconds per image) than the fastest existing consistent T2I models, making it highly practical for real-time and interactive applications like visual storytelling.

7.2. Limitations & Future Work

The authors acknowledge one primary limitation:

  • Sensitivity to Anchor Selection: Infinite-Story relies on a single reference image (anchor) within each batch to propagate identity and style features. While this enables efficient and training-free inference, it introduces sensitivity to anchor selection. If the initial anchor image is of low quality or stylistically off-target, this degradation can propagate to the entire batch of generated images. Since the method does not alter the generation capabilities of the underlying Infinity model, its success is inherently tied to the quality of this initial output.

    Based on this limitation, the authors suggest the following future work:

  • Adaptive Reference Selection Strategies: Developing mechanisms to intelligently select or correct the anchor image could mitigate the propagation of low-quality or stylistically off-target references (a toy sketch of one such strategy follows this list).

  • Temporal Consistency in Video Generation: Extending the method to support temporal consistency in video generation. This would involve adapting the current framework, which focuses on image sequences, to handle the additional dimension of time for coherent video outputs.
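As an illustration of what an adaptive reference-selection strategy could look like, the sketch below scores each candidate image in a batch by CLIP image-text similarity and picks the best match as the anchor. This is a hypothetical approach using the Hugging Face transformers CLIP API, not the authors' implementation.

```python
# Hypothetical anchor-selection strategy: score each candidate in the
# generated batch by CLIP image-text similarity and pick the best match.
# Illustrative only; not the authors' code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def select_anchor(images: list[Image.Image], prompt: str) -> int:
    """Return the index of the image that best matches the anchor prompt."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image  # (num_images, 1)
    return int(logits_per_image.squeeze(-1).argmax().item())
```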

7.3. Personal Insights & Critique

This paper presents a highly practical and impactful contribution to the field of generative AI. The commitment to a training-free and fast solution is particularly commendable, as it directly addresses major bottlenecks preventing wider adoption of consistent T2I generation in real-world interactive applications. The choice of a scale-wise autoregressive model as the backbone, rather than the dominant but slower diffusion models, is a clever architectural decision that underpins the impressive inference speed.

The three proposed techniques—IPR, ASI, and SGA—are elegantly designed to target specific aspects of inconsistency (context bias, appearance, style, and prompt fidelity) with minimal computational overhead. The Adaptive Style Injection with its similarity-guided interpolation weight is particularly intuitive, allowing for a nuanced control over how much the reference dictates the style. Synchronized Guidance Adaptation is a critical component, showcasing a deep understanding of Classifier-Free Guidance to prevent a common pitfall of sacrificing prompt fidelity for consistency.

Potential issues or areas for improvement:

  1. Anchor Image Quality Dependence: While acknowledged as a limitation, the dependence on a single anchor image's quality is a significant practical hurdle. If a user provides a poor or ambiguous initial prompt for the reference, the subsequent generations will inherit those flaws. Future work could explore:
    • Reference Image Refinement: A self-correction mechanism that iteratively refines the anchor image based on some quality metrics or user feedback.
    • Multiple Reference Fusion: Allowing multiple reference images to be provided, and then fusing their identity/style information to create a more robust "canonical" reference.
    • Semantic Reference Selection: Automatically identifying the most "representative" image from a generated batch to serve as the anchor.
  2. Fine-Grained Control over Consistency: The current Adaptive Style Injection uses a single global λ for scaling. While effective, more fine-grained control might be beneficial; for instance, allowing users to specify different λ values for specific visual elements (e.g., character appearance vs. background style) could offer greater artistic control (a toy sketch of this idea follows the list).
  3. Complex Scene Consistency: The examples are compelling, but visual storytelling can involve highly complex scenes with multiple interacting characters and intricate backgrounds. How Infinite-Story scales to maintaining consistency across many distinct subjects in a crowded scene, or handling occlusion, would be an interesting area to explore. The current framework seems primarily focused on a single main subject.
  4. Implicit Bias Inheritance: While IPR addresses context bias in text encoders, the underlying Infinity model itself will still carry biases from its training data. If the base model has biases (e.g., consistently generating a certain race for a generic character description), Infinite-Story might propagate these biases, potentially making them more consistent. Further research could investigate how to integrate bias mitigation techniques within the training-free consistency framework.
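To illustrate the fine-grained control suggested in point 2, the toy sketch below replaces the global scalar λ with a spatial map, so the subject region can be pulled toward the anchor more strongly than the background. This is purely speculative; the map construction and feature layout are assumptions.

```python
# Speculative illustration of per-region λ control: a spatial map of λ values
# (high on the character region, low on the background) instead of a single
# global scalar. Purely hypothetical; not part of the paper.
import numpy as np

def masked_style_injection(curr: np.ndarray, anchor: np.ndarray,
                           lam_map: np.ndarray) -> np.ndarray:
    """Blend toward the anchor with a per-location weight in [0, 1]."""
    lam = lam_map[..., None]  # broadcast over the channel dimension
    return (1.0 - lam) * curr + lam * anchor

h, w, c = 32, 32, 64
rng = np.random.default_rng(0)
curr, anchor = rng.normal(size=(2, h, w, c))
lam_map = np.full((h, w), 0.3)
lam_map[8:24, 8:24] = 0.85   # stronger consistency on the subject region
out = masked_style_injection(curr, anchor, lam_map)
```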

Transferability and Applicability: The methods presented in this paper are highly transferable.

  • Visual Storyboarding & Comic Creation: Directly applicable for artists and designers to rapidly prototype visual narratives, maintaining consistent characters and styles across panels.

  • Animated Content Pre-production: Could be used for concept art generation or character design iteration in animation pipelines.

  • Personalized Content Generation: For marketing, advertising, or educational content, generating sequences featuring a consistent brand character or mascot.

  • Image Editing & Style Transfer: The principles of Adaptive Style Injection and reference-based feature propagation could be adapted for consistent style transfer across image collections or for propagating edits consistently within a series of images.

  • Other Generative Tasks: The idea of training-free attention guidance and prompt embedding manipulation could inspire similar consistency-enhancing techniques for other generative tasks beyond images, such as 3D model generation or audio synthesis, where maintaining coherence across varied outputs is crucial.

    Overall, Infinite-Story is an impressive piece of work that pushes the boundaries of practical and efficient consistent text-to-image generation, offering a compelling solution for the growing demand for coherent visual storytelling tools.
