
Personalized Generation In Large Model Era: A Survey

Published: 03/04/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This survey comprehensively investigates Personalized Generation (PGen) in the era of large models, conceptualizing its key components and objectives. A multi-level taxonomy reviews technical advancements and datasets while envisioning PGen's applications and future challenges, promoting a more personalized digital landscape.

Abstract

In the era of large models, content generation is gradually shifting to Personalized Generation (PGen), tailoring content to individual preferences and needs. This paper presents the first comprehensive survey on PGen, investigating existing research in this rapidly growing field. We conceptualize PGen from a unified perspective, systematically formalizing its key components, core objectives, and abstract workflows. Based on this unified perspective, we propose a multi-level taxonomy, offering an in-depth review of technical advancements, commonly used datasets, and evaluation metrics across multiple modalities, personalized contexts, and tasks. Moreover, we envision the potential applications of PGen and highlight open challenges and promising directions for future exploration. By bridging PGen research across multiple modalities, this survey serves as a valuable resource for fostering knowledge sharing and interdisciplinary collaboration, ultimately contributing to a more personalized digital landscape.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Personalized Generation In Large Model Era: A Survey

1.2. Authors

The paper lists several authors from different institutions:

  • Yiyan Xu (University of Science and Technology of China)

  • Jinghao Zhang (University of Chinese Academy of Sciences)

  • Alireza Salemi (University of Massachusetts Amherst)

  • Xinting Hu (Nanyang Technological University)

  • Wenjie Wang (University of Science and Technology of China)

  • Fuli Feng (University of Science and Technology of China)

  • Hamed Zamani (University of Massachusetts Amherst)

  • Xiangnan He (University of Science and Technology of China)

  • Tat-Seng Chua (National University of Singapore)

    Their research backgrounds appear to span areas such as Natural Language Processing, Computer Vision, Information Retrieval, and Generative Models, reflecting the interdisciplinary nature of Personalized Generation (PGen).

1.3. Journal/Conference

The paper is published as a preprint on arXiv, indicated by the original source link https://arxiv.org/abs/2503.02614v2. As a survey paper, it is likely intended for a broad audience across multiple research communities. arXiv preprints are widely read and influential in the academic community, allowing for early dissemination and feedback before formal peer review and publication in a journal or conference.

1.4. Publication Year

2025 (specifically, published at UTC: 2025-03-04T13:34:19.000Z)

1.5. Abstract

This paper presents the first comprehensive survey on Personalized Generation (PGen) in the era of large models, where content creation is increasingly tailored to individual preferences and needs. The authors introduce a unified perspective for PGen, formalizing its key components, core objectives, and abstract workflows. Based on this framework, they propose a multi-level taxonomy to review technical advancements, datasets, and evaluation metrics across various modalities (text, image, video, audio, 3D) and personalized contexts. The survey also explores potential applications of PGen, identifies open challenges, and suggests future research directions. By bridging PGen research across different modalities, this work aims to foster knowledge sharing and interdisciplinary collaboration, ultimately promoting a more personalized digital environment.

Original Link: https://arxiv.org/abs/2503.02614v2
PDF Link: https://arxiv.org/pdf/2503.02614v2.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the lack of a unified, comprehensive overview of Personalized Generation (PGen) research across different modalities and research communities. With the rapid advancements in large generative models, content generation is shifting from generic, one-size-fits-all approaches to highly personalized content tailored to individual user preferences and needs. This shift holds immense potential for enhancing user experience in various domains like e-commerce, marketing, and AI assistants.

However, research efforts in PGen have largely evolved independently within communities such as Natural Language Processing (NLP), Computer Vision (CV), and Information Retrieval (IR). Existing surveys primarily adopt either a model-centric (e.g., surveys on Large Language Models (LLMs) or Diffusion Models (DMs)) or task-centric (e.g., dialogue generation or generative recommendation) perspective, offering only partial summaries. This fragmentation makes it challenging to understand the broader landscape of PGen, hindering cross-community communication and collaboration. The problem is exacerbated by modality-specific differences (e.g., text vs. image data structures), which create technical divergences.

The paper's entry point is to re-examine PGen from a high-level, modality-agnostic perspective. It proposes that PGen fundamentally involves user modeling based on various personalized contexts and multimodal instructions, extracting personalized signals to guide content generation. This unified perspective serves as the innovative idea to integrate disparate research efforts into a cohesive framework.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Unified User-centric Perspective: It conceptualizes PGen by formalizing its key components (personalized contexts, multimodal instructions), core objectives (high quality, instruction alignment, personalization), and a general workflow (user modeling, generative modeling). This holistic framework integrates studies across different modalities.

  • Multi-level Taxonomy: Building on the unified perspective, the paper introduces a novel taxonomy that systematically reviews PGen's technical advancements, commonly used datasets, and evaluation metrics. This taxonomy categorizes research by modality (text, image, video, audio, 3D, cross-modal), personalized contexts (user behaviors, documents, profiles, personal face/body, personalized subjects), and specific tasks.

  • Outlook on Applications: It envisions potential applications of PGen in enhancing user-centric services, categorizing them into content creation (e.g., personalized writing assistants for creators) and content delivery (e.g., tailored marketing, e-commerce, entertainment, education, gaming, and AI assistants).

  • Identification of Open Challenges and Future Directions: The survey highlights critical open problems that need to be addressed, spanning technical challenges (scalability, deliberative reasoning, evolving user preferences, mitigating filter bubbles, user data management, multi-modal personalization, synergy with retrieval), benchmarks and metrics, and trustworthiness (privacy, fairness, safety).

    The key conclusions are that PGen is a rapidly growing and transformative field with significant potential to create a more personalized digital landscape. However, realizing this potential requires addressing substantial technical, ethical, and evaluative challenges, many of which can benefit from cross-disciplinary research fostered by a unified understanding of PGen. The paper serves as a valuable resource to bridge research across multiple modalities, facilitating knowledge sharing and collaboration.

3. Prerequisite Knowledge & Related Work

This section aims to provide foundational knowledge necessary to understand the survey on Personalized Generation (PGen).

3.1. Foundational Concepts

3.1.1. Generative Models

Generative models are a class of artificial intelligence models that learn to generate new data instances that resemble the training data. Unlike discriminative models that predict labels or categories, generative models learn the underlying distribution of the data.

  • Large Language Models (LLMs): These are powerful neural networks trained on vast amounts of text data to understand, generate, and process human language. They can perform a wide range of natural language tasks such as text completion, translation, summarization, and question answering. Examples include GPT-4, LLaMA, and Gemini. LLMs are often based on the Transformer architecture.
  • Multimodal Large Language Models (MLLMs): Extensions of LLMs that can process and generate content across multiple modalities, such as text, images, and audio. They learn to understand the relationships between different types of data, enabling tasks like image captioning, visual question answering, and text-to-image generation.
  • Diffusion Models (DMs): A class of generative models that work by systematically destroying training data by adding noise, and then learning to reverse this noise process to reconstruct the data. They have achieved state-of-the-art results in image, audio, and video generation due to their ability to produce high-quality and diverse samples.
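As a brief background illustration (standard DDPM-style formulation, not taken from the survey itself), the forward noising process and the common training objective can be written as $ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) $ and $ \mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{x_0, \epsilon, t}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|_2^2 \right] $, where $\bar{\alpha}_t$ is the cumulative noise schedule and $\epsilon_\theta$ is the denoising network trained to predict the injected noise; generation then runs this process in reverse, starting from pure noise.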

3.1.2. Personalization

In the context of content generation, personalization refers to tailoring content to an individual user's specific preferences, needs, context, or characteristics. This goes beyond generic content to create a more relevant and engaging experience for each user. It often relies on user modeling, which involves collecting and analyzing various data points related to a user.

3.1.3. Modalities

Modality refers to the type of data or content being processed or generated. Common modalities discussed in the paper include:

  • Text: Human language in written form.
  • Image: Still visual content.
  • Video: Moving visual content, often with audio.
  • Audio: Sound, including speech, music, and environmental sounds.
  • 3D: Three-dimensional models or scenes.
  • Cross-modal: Involving interactions or generation across two or more different modalities.

3.1.4. User Modeling

User modeling is the process of collecting, analyzing, and representing information about individual users to predict their preferences, needs, and behaviors. This information, often called personalized contexts, can include:

  • User profiles: Demographic data (age, gender, location, occupation) or explicit interests.
  • User documents: User-generated text like comments, emails, social media posts, reflecting their writing style or creative preferences.
  • User behaviors: Implicit feedback from interactions such as clicks, likes, searches, purchases, views, or shares.
  • Personal face/body: Static (facial structure, body shape) and dynamic (expressions, gestures) biometric traits used in tasks like portrait generation or virtual try-ons.
  • Personalized subjects: User-specific concepts or entities, like a pet or favorite object, that reflect unique tastes.

3.1.5. Prompt Engineering

Prompt engineering involves carefully designing input prompts (textual instructions or queries) for generative models to elicit desired outputs. In PGen, prompts can be engineered to incorporate user-specific information or guide the model towards personalized content.

3.1.6. Retrieval-Augmented Generation (RAG)

RAG is a technique that enhances generative models by allowing them to retrieve relevant information from an external knowledge base before generating a response. This helps improve factual accuracy, reduce hallucinations, and incorporate specific, up-to-date information, which is particularly useful for personalization by retrieving user-specific data or preferences.

3.1.7. Fine-tuning

Fine-tuning is the process of further training a pre-trained model on a smaller, task-specific dataset to adapt it to a new task or domain.

  • Full Fine-tuning: All parameters of the pre-trained model are updated.
  • Parameter-Efficient Fine-Tuning (PEFT): Techniques that fine-tune only a small subset of a model's parameters or add a small number of new parameters, significantly reducing computational cost and memory footprint. LoRA (Low-Rank Adaptation) is a popular PEFT method.
  • Instruction Tuning: Fine-tuning a model on a dataset of instructions and corresponding responses to improve its ability to follow human instructions.
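As a minimal sketch of the LoRA idea mentioned above (toy NumPy code for illustration only; the function and variable names are hypothetical, not from the survey), a frozen weight matrix W is augmented with a trainable low-rank product BA:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass of a linear layer with a LoRA update.

    W is the frozen pre-trained weight (d_out x d_in); only the low-rank
    factors A (r x d_in) and B (d_out x r) are trained, so the effective
    weight is W + alpha * (B @ A) with far fewer trainable parameters.
    """
    return x @ (W + alpha * (B @ A)).T

# Toy dimensions: d_in = 8, d_out = 8, rank r = 2
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))               # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(2, 8))   # trainable low-rank factor
B = np.zeros((8, 2))                      # trainable, zero-initialized so the update starts at zero
x = rng.normal(size=(1, 8))
print(lora_forward(x, W, A, B).shape)     # (1, 8)
```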

3.1.8. Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO)

These are techniques used to align generative models with human preferences.

  • RLHF: Involves training a reward model from human feedback (e.g., pairwise comparisons of generated outputs) and then using this reward model to fine-tune the generative model via reinforcement learning.
  • DPO: A simpler, more stable alternative to RLHF that directly optimizes the generative model's policy to align with human preferences without explicitly training a separate reward model. It reformulates the alignment problem as a classification task.
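For reference, the DPO objective over a preference pair $(x, y_w, y_l)$, where $y_w$ is preferred over $y_l$, is commonly written as $ \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right] $, where $\pi_\theta$ is the policy being optimized, $\pi_{\mathrm{ref}}$ a frozen reference model, $\sigma$ the sigmoid function, and $\beta$ a temperature controlling how far the policy may drift from the reference.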

3.2. Previous Works

The paper primarily focuses on surveying PGen, so it references many specific techniques and models in each modality rather than providing a detailed background on each. However, understanding the general progression of generative models is helpful.

Early generative models include:

  • Generative Adversarial Networks (GANs): Consist of a generator and a discriminator network that compete against each other. The generator creates fake data, and the discriminator tries to distinguish it from real data. This adversarial process leads to increasingly realistic generated data. Many early personalized image and video generation tasks (e.g., virtual try-on, face manipulation) relied on GANs.

    • Example: StyleGAN (Karras et al., 2019) is a prominent GAN for high-quality image synthesis, particularly faces. Its latent space can be inverted or manipulated for personalized editing.
  • Variational Autoencoders (VAEs): Learn a latent representation of data and can generate new data by sampling from this latent space. They are known for their stable training but often produce less sharp outputs compared to GANs.

    The Transformer architecture (Vaswani et al., 2017) revolutionized sequence modeling in NLP and forms the backbone of most modern LLMs and MLLMs. Its core mechanism, self-attention, allows models to weigh the importance of different parts of the input sequence when processing each element. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, $Q$ represents the query matrix, $K$ the key matrix, and $V$ the value matrix. $d_k$ is the dimension of the keys, used for scaling. The softmax function is applied to the scaled dot product of queries and keys to get attention weights, which are then used to weight the values. This allows the model to "attend" to relevant parts of the input.
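A minimal NumPy sketch of the attention formula above (a toy illustration, not tied to any specific model discussed in the survey):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)    # (2, 4)
```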

Diffusion Models have recently surpassed GANs and VAEs in image quality and diversity for many generation tasks.

3.3. Technological Evolution

The evolution towards PGen is tightly coupled with advancements in generative models:

  1. Early Generative Models (Pre-2017): GANs and VAEs emerged, capable of generating generic content (e.g., realistic but anonymous faces, simple textures). Personalization was limited, often involving training separate models for specific styles or content types, or coarse-grained control.
  2. Transformer Era & LLM Emergence (2017 onwards): The Transformer architecture led to the development of powerful LLMs. Initially focused on text, these models could generate highly coherent and contextually relevant prose. Early personalization involved prompt engineering to guide LLMs towards specific personas or styles.
  3. Multimodal Integration (Recent Years): The integration of modalities, leading to MLLMs, allowed generative models to understand and generate content beyond text, such as text-to-image. This opened the door for more sophisticated personalized subjects (e.g., generating an image of a user's pet from a few examples).
  4. Diffusion Model Dominance (Very Recent): Diffusion Models significantly raised the bar for image, video, and audio quality. Techniques like Textual Inversion and DreamBooth (discussed in the paper) directly enable subject-driven personalization by learning unique identifiers for user-provided subjects.
  5. PGen Specialization & Alignment (Current): The current focus is on refining personalization by incorporating diverse user signals (behaviors, profiles, documents), ensuring instruction alignment and high quality, and developing sophisticated optimization strategies (fine-tuning, RLHF, DPO) to align generated content with nuanced human preferences and intentions. This survey positions itself at this crucial juncture, unifying these specialized efforts.
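To make step 4 more concrete, the following toy sketch illustrates the Textual Inversion-style recipe at a conceptual level: only a new token embedding is optimized against a frozen model's reconstruction loss on a few subject images. All functions and tensors here are simplified stand-ins (the real method backpropagates through a diffusion model's denoising loss), so treat this as a schematic under those assumptions, not an implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # toy embedding size

def encode_prompt(v_star):
    """Toy stand-in for a frozen text encoder: combines the embedding of a
    fixed prompt ("a photo of S*") with the learnable token embedding v_star."""
    base = np.ones(DIM)
    return base + v_star

def frozen_diffusion_loss(image_feat, prompt_emb):
    """Toy stand-in for the frozen diffusion model's denoising loss and its
    gradient with respect to the prompt embedding (hence w.r.t. v_star)."""
    diff = prompt_emb - image_feat
    return float(diff @ diff), 2.0 * diff

subject_images = [rng.normal(size=DIM) for _ in range(4)]  # toy subject features
v_star = rng.normal(scale=0.02, size=DIM)                  # new token embedding for "S*"

for step in range(200):
    img = subject_images[rng.integers(len(subject_images))]
    loss, grad = frozen_diffusion_loss(img, encode_prompt(v_star))
    v_star -= 0.05 * grad   # only v_star is updated; the model stays frozen

print(f"final toy loss: {loss:.4f}")  # v_star can then be reused in new prompts
```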

3.4. Differentiation Analysis

Compared to prior surveys that are either model-centric (e.g., only LLMs or DMs) or task-centric (e.g., only dialogue generation), this paper differentiates itself by offering a unified user-centric perspective and a multi-level taxonomy that spans multiple modalities.

  • Unified Framework: Instead of treating text, image, video, and audio generation as separate fields with their own personalization techniques, this survey proposes a common conceptualization of PGen. This involves abstracting common components like personalized contexts (user profiles, behaviors, documents, personal face/body, personalized subjects) and multimodal instructions, which then guide user modeling and generative modeling across any modality.

  • Cross-Modality Integration: Previous works often failed to bridge the inherent technical divergences between modalities. This survey actively seeks to connect these disparate research efforts, fostering interdisciplinary insights. For example, techniques for learning personalized subjects in image generation can inspire similar approaches in video or 3D generation.

  • Comprehensive Scope: By systematically reviewing technical advancements, datasets, and evaluation metrics across six modalities and five personalized contexts, the paper provides a much broader and deeper overview than existing, more narrowly focused surveys.

  • Focus on User-Centricity: The emphasis is consistently on how generative models are adapted to serve individual user needs, tastes, and identities, rather than just on the capabilities of the generative models themselves.

    This differentiation positions the survey as a foundational resource for a nascent yet rapidly expanding field, offering a high-level map that was previously unavailable.

4. Methodology

The paper conceptualizes Personalized Generation (PGen) from a unified perspective, formalizing its key components, core objectives, and abstract workflows. The methodology outlines a two-stage process: User Modeling and Generative Modeling.

4.1. Principles

The core idea of PGen, as presented, is to synthesize content that is uniquely tailored to an individual's preferences and specific needs. This requires:

  1. Understanding the User: Extracting meaningful signals from various user-specific data sources.

  2. Guiding Generation: Using these personalized signals to direct a generative model to produce content that is not only high-quality and aligned with explicit instructions but also deeply resonant with the user's implicit and explicit preferences.

    The theoretical basis is that by effectively capturing and integrating personalized contexts and multimodal instructions, generative models can move beyond generic outputs to create truly customized content.

The following figure (Figure 2 from the original paper) illustrates the personalized generation workflow:

Figure 2: Personalized generation workflow. The figure is a schematic showing users' multimodal instructions and personalized contexts as inputs, the user modeling step, and generative modeling that produces personalized content through three main steps: foundation model, guidance mechanism, and optimization strategy.

4.2. Core Methodology In-depth (Layer by Layer)

The PGen workflow involves two key processes: user modeling and generative modeling.

4.2.1. User Modeling

The goal of user modeling is to effectively capture user preferences and specific content needs from personalized contexts and users' multimodal instructions. The paper identifies five dimensions for personalized contexts:

  • User profiles: Demographic and personal attributes (e.g., age, gender, occupation, location).

  • User documents: User-created textual content (e.g., comments, emails, social media posts) reflecting creative preferences.

  • User behaviors: Interactions (e.g., searches, clicks, likes, comments, views, shares, purchases) captured during user engagement.

  • Personal face/body: Individual facial and bodily traits (e.g., facial structure, body shape, expressions, gestures, motions) used in tasks like portrait generation or virtual try-ons.

  • Personalized subjects: User-specific concepts or entities (e.g., pets, personal items, favorite objects) reflecting unique tastes.

    To leverage these diverse inputs, three key techniques are commonly employed in user modeling:

  1. Representation Learning: This technique involves encoding diverse user inputs (personalized contexts and multimodal instructions) into dense, continuous numerical vectors called embeddings or summarizing them into discrete representations (e.g., texts). These representations capture the underlying semantic and personal characteristics of the user and their content needs.

    • Purpose: To transform heterogeneous user data into a unified, machine-understandable format that generative models can process.
    • Mechanism: Neural networks are typically used to learn these embeddings. For instance, a user's interaction history (clicks, likes) can be embedded into a vector space where similar preferences are closer together. Textual prompts or historical user documents might be encoded using pre-trained language models like BERT or specialized encoders that create dense embeddings. For visual data, models like CLIP (Contrastive Language-Image Pre-training) can create joint embeddings for images and text, allowing for semantic alignment.
    • Example: In DreamBooth (Ruiz et al., 2023), a few images of a specific subject are used to learn a unique identifier in the embedding space of a diffusion model. This identifier then serves as a personalized representation to guide generation.
  2. Prompt Engineering: This involves designing specific input prompts (textual instructions) to structure user-specific information for generative models. Rather than directly embedding preferences, prompt engineering guides the model's behavior and output style.

    • Purpose: To explicitly instruct the generative model on how to personalize the content, often by incorporating details from user profiles or desired stylistic elements into the prompt.
    • Mechanism: Prompts can include explicit descriptions of user preferences ("Generate a summary in a casual tone, similar to my previous emails"), desired persona ("Act as a friendly AI assistant"), or specific content requirements derived from user data. For example, a system might analyze a user's writing style from their documents and then construct a prompt that asks an LLM to generate new text in that style.
    • Example: In personalized dialogue systems, a user's persona (e.g., "friendly", "expert") can be injected into the prompt to guide the LLM's conversational style.
  3. Retrieval-Augmented Generation (RAG): RAG enriches user-specific information by retrieving relevant external data (e.g., from a knowledge base or user's historical data) and integrating it with the user's instructions. It typically involves filtering out irrelevant information and providing contextual grounding.

    • Purpose: To provide generative models with up-to-date, factual, or highly specific personalized information that might not be implicitly learned during training or easily captured by embeddings or simple prompts.

    • Mechanism: A retrieval component searches a database (e.g., a user's past purchase history, a personal document archive, or a knowledge graph tailored to the user) for relevant snippets based on the user's query or current context. These retrieved snippets are then concatenated with the user's instruction and fed to the generative model.

    • Example: A personalized writing assistant might use RAG to retrieve snippets from a user's past emails to understand their writing style, specific jargon, or common phrases, then use these to inform the generation of a new email.

      By combining these techniques, user modeling establishes a robust foundation for PGen, extracting personalized signals that guide content personalization within the subsequent generative modeling process.
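As a small end-to-end illustration of these user modeling steps (a hypothetical sketch; the embedding function, document store, and prompt template are invented for demonstration and are not the survey's implementation), a RAG-style personalized prompt could be assembled as follows before being passed to a generative model:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy hashing-based embedding; a real system would use a trained encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, user_docs: list[str], k: int = 2) -> list[str]:
    """Return the k user documents most similar to the query (cosine similarity)."""
    q = embed(query)
    ranked = sorted(user_docs, key=lambda d: float(q @ embed(d)), reverse=True)
    return ranked[:k]

def build_personalized_prompt(query: str, user_docs: list[str]) -> str:
    examples = "\n".join(f"- {d}" for d in retrieve(query, user_docs))
    return (
        "You are a personalized writing assistant.\n"
        f"Examples of the user's past writing:\n{examples}\n"
        f"Task: {query}\nWrite the response in the user's style."
    )

user_docs = ["Quick note: shipping the fix tonight.", "Hey team, short update below."]
print(build_personalized_prompt("Draft a status email about the Q3 launch.", user_docs))
# The assembled prompt would then be passed to an LLM of choice.
```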

4.2.2. Generative Modeling

The generative modeling process focuses on producing the actual personalized content. It follows a structured three-step process:

4.2.2.1. Step 1: Foundation Model

The first crucial step is selecting an appropriate foundation model. These are large-scale, pre-trained generative models that serve as the backbone for content generation.

  • Selection Criteria: The choice of foundation model depends on the target modality (text, image, video, etc.), specific task requirements (e.g., generating photorealistic images vs. creative text), and the nature of the user-specific data.
  • Types of Foundation Models:
    • Large Language Models (LLMs): Primarily used for text generation tasks (e.g., personalized summaries, dialogues, writing assistance).
    • Multimodal Large Language Models (MLLMs): Used for tasks involving multiple modalities, such as generating text from images or vice-versa, or cross-modal dialogue.
    • Diffusion Models (DMs): Dominant for high-quality image, video, and audio generation (e.g., personalized image generation, virtual try-ons, subject-driven video).

4.2.2.2. Step 2: Guidance Mechanism

Once a foundation model is selected, guidance mechanisms are employed to effectively integrate the personalized signals extracted during user modeling into the generation process. Two primary types are identified:

  1. Instruction Guidance: This mechanism ensures that the model follows explicit user prompts and instructions. It often involves methods that adapt the pre-trained model's behavior without necessarily changing its core parameters directly in a fine-tuning sense, but rather through how inputs are provided.

    • In-context Learning (ICL): Providing demonstrations (input-output pairs) within the prompt itself to teach the model a new task or style without updating its weights. The model learns from the examples given directly in the context.
      • Purpose: To guide the model to adopt specific styles, tones, or formats by showing it examples.
      • Mechanism: The prompt is constructed to include several examples of personalized content (e.g., "Here are examples of my writing style: [examples]. Now, write a paragraph about [topic] in my style."). The model then attempts to mimic the demonstrated behavior.
    • Instruction Tuning: Fine-tuning the pre-trained model on a dataset of diverse instructions and their corresponding desired outputs. This makes the model more adept at following new, unseen instructions.
      • Purpose: To enhance the model's general ability to follow instructions, including those related to personalization.
      • Mechanism: A dataset of instruction-response pairs, where instructions might include personalization cues, is used to further train the model. This makes the model more robust to varied and complex personalized requests.
  2. Structural Guidance: This mechanism involves modifying the model's architecture by incorporating additional modules or mechanisms to directly embed personalized information. This usually implies changes to the model's internal structure or data flow.

    • Adapters: Small, lightweight neural network modules inserted into the layers of a pre-trained model. They are trained on personalized data while the main model weights remain frozen, making fine-tuning efficient.
      • Purpose: To inject personalized features into the model's internal representations in a parameter-efficient manner.
      • Mechanism: An adapter takes the output of a pre-trained layer, applies a small transformation based on personalized signals, and then passes it to the next layer. This allows the model to adapt its internal processing based on user-specific information.
    • Cross-Attention Mechanisms: A key component in Transformer architectures, cross-attention allows a model to weigh the importance of different parts of a separate input sequence (e.g., personalized context embeddings) when processing its primary input (e.g., the generation process).
      • Purpose: To enable the generative model to selectively focus on and integrate specific personalized signals (e.g., from user embeddings or personalized subjects) during content creation.
      • Mechanism: Personalized embeddings (queries) can attend to the internal representations of the generative model (keys and values), or vice versa, to fuse personalized information directly into the generation process.
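A minimal sketch of structural guidance via cross-attention (toy NumPy code under simplifying assumptions, not an architecture from the survey), in which personalized context embeddings supply the keys and values while the generator's hidden states act as queries:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(hidden_states, personal_ctx, Wq, Wk, Wv):
    """Generator hidden states attend over personalized context embeddings
    (e.g., user or subject embeddings), fusing them via a residual update."""
    Q = hidden_states @ Wq           # (n_tokens, d)
    K = personal_ctx @ Wk            # (n_ctx, d)
    V = personal_ctx @ Wv            # (n_ctx, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return hidden_states + attn @ V  # personalized signals injected into the stream

rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=(5, d))     # 5 generation tokens
user_ctx = rng.normal(size=(3, d))   # 3 personalized context embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(hidden, user_ctx, Wq, Wk, Wv).shape)  # (5, 8)
```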

4.2.2.3. Step 3: Optimization Strategy

The final step involves optimization strategies to empower large generative models with personalized generation capabilities. These strategies dictate how model parameters are learned or adapted.

  1. Tuning-free Methods: These methods utilize pre-trained models for personalized generation without modifying their parameters. They are generally efficient and maintain the broad capabilities of the foundation model.

    • Model Fusion: Assembling multiple pre-trained models (e.g., combining a general text-to-image model with a personalized style model) or merging different LoRA (Low-Rank Adaptation) adapters, each trained for a specific style or subject.
      • Purpose: To combine the strengths of various models or personalized modules without further training.
      • Mechanism: Involves techniques like weighted sum of model weights or feature-level fusion during inference.
    • Interactive Generation: Collecting real-time user feedback during a multi-turn interaction process and using this feedback for iterative refinement of the generated content.
      • Purpose: To align content with individual preferences through direct user involvement.
      • Mechanism: The model generates initial content, presents it to the user, receives feedback (e.g., "make it warmer," "change the style"), and then refines the output in subsequent turns.
  2. Supervised Fine-tuning: This strategy involves optimizing model parameters using explicit supervision signals, typically from labeled datasets where desired personalized outputs are provided.

    • Full Fine-tuning: All parameters of the pre-trained foundation model are updated using a personalized dataset. This can lead to strong personalization but is computationally expensive and risks catastrophic forgetting (losing general capabilities).
      • Purpose: To deeply embed personalized preferences or styles into the model's weights.
      • Mechanism: The model is trained on pairs of personalized inputs (e.g., user profile + prompt) and desired personalized outputs (e.g., a text generated in the user's style).
    • Parameter-Efficient Fine-Tuning (PEFT) Techniques: Optimizing only a small subset of the model's parameters or introducing a few new trainable parameters. This includes methods like LoRA (Low-Rank Adaptation).
      • Purpose: To achieve personalization with significantly reduced computational cost and memory usage compared to full fine-tuning, while mitigating catastrophic forgetting.
      • Mechanism: For example, LoRA injects small, trainable low-rank matrices into the Transformer layers, which are then optimized on personalized data. The original pre-trained weights remain frozen.
  3. Preference-based Optimization: This strategy incorporates user preference data (e.g., explicit ratings or implicit comparisons) to update model parameters, aiming to align the generated content with subjective human tastes.

    • Reinforcement Learning with Human Feedback (RLHF): Involves training a separate reward model based on human judgments of generated content (e.g., humans rank multiple generated options). The generative model is then fine-tuned using reinforcement learning, optimizing its policy to maximize the reward predicted by the reward model.
      • Purpose: To align model outputs with nuanced, subjective human preferences that are difficult to encode in standard supervised loss functions.
      • Mechanism:
        1. Collect comparison data: Humans rank multiple generations from the model.
        2. Train a reward model: A neural network learns to predict a scalar reward for generated content based on human preferences.
        3. Fine-tune the generative model: Using algorithms like Proximal Policy Optimization (PPO), the generative model is updated to produce content that receives high rewards from the reward model.
    • Direct Preference Optimization (DPO): Offers a streamlined alternative to RLHF by directly optimizing the generative model's policy to align with pairwise user preferences, without the need for an explicit reward model.
      • Purpose: To simplify the process of aligning generative models with human preferences, making it more stable and efficient than RLHF.

      • Mechanism: DPO re-frames the preference learning problem as a classification task where the model learns to assign higher probabilities to preferred generations over dispreferred ones. The loss function directly optimizes the generative model's parameters based on these pairwise preferences.

        By integrating these advanced techniques and strategies, the PGen workflow ensures adaptability to diverse personalized contexts and user instructions, leveraging the evolving landscape of large generative models to offer scalable solutions for personalized content creation.
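As a toy illustration of preference-based optimization (assuming the standard DPO formulation; the numbers and function below are invented for demonstration), the pairwise DPO loss can be computed directly from sequence log-probabilities under the policy and a frozen reference model:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO loss for one (preferred, dispreferred) response pair.

    logp_* are total log-probabilities of the preferred (w) and dispreferred (l)
    responses under the current policy; ref_logp_* are the same quantities under
    the frozen reference model. Lower loss means the policy favors the preferred
    response more strongly than the reference does.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

# Toy numbers: the policy already prefers the chosen response slightly more than the reference.
print(round(float(dpo_loss(-12.0, -15.0, -12.5, -14.8, beta=0.1)), 4))
```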

5. Experimental Setup

This section summarizes the commonly used datasets and evaluation metrics for Personalized Generation (PGen) across different modalities, as detailed in Tables 2, 3, and 4 of the original paper. The paper itself is a survey and does not present new experimental results or a specific experimental setup, but rather aggregates the setups used in the surveyed literature.

5.1. Datasets

The choice of datasets is critical for PGen, as they must capture user-specific information to enable personalization. The following are representative datasets mentioned in the survey, categorized by modality and personalized context:

5.1.1. Text Generation (Section 3.1)

  • User Behaviors (Recommendation):
    • Amazon (Hou et al., 2024; Ni et al., 2019): E-commerce datasets containing user reviews, ratings, and purchase histories for various products, used to infer preferences for recommendation tasks.
    • MovieLens (Harper and Konstan, 2015): Datasets of movie ratings provided by users, widely used for collaborative filtering and recommendation.
    • MIND (Wu et al., 2020a): A large-scale dataset for news recommendation, including user click histories and news article content. This is useful for personalized news headline generation.
    • Goodreads (Wan and McAuley, 2018; Wan et al., 2019): Book review datasets with user ratings and textual reviews, useful for personalized book recommendations and review generation.
  • User Behaviors (Information Seeking):
    • SE-PQA (Kasela et al., 2024): Personalized Community Question Answering, likely including user profiles and past interactions to tailor answers.
    • PWSC (Eugene et al., 2013): Personalized Web Search Challenge, involving user search histories and preferences.
    • AOL4PS (Guo et al., 2021): A large-scale dataset for personalized search, similar to PWSC.
  • User Documents (Writing Assistant):
    • LaMP (Salemi et al., 2024b): A benchmark for short-form personalized text generation (e.g., headlines, subject lines).
    • LongLaMP (Kumar et al., 2024a): An extension of LaMP targeting longer-form personalized text generation tasks (e.g., product reviews, email completion).
    • PLAB (Alhafni et al., 2024): Personalized Linguistic Attributes Benchmark, designed for text completion, leveraging user-written documents to extract stylistic features.
  • User Profiles (Dialogue System):
    • LiveChat (Gao et al., 2023): Large-scale dataset from live streaming interactions, useful for conversational AI with dynamic user context.
    • FoCus (Jang et al., 2021): Focuses on conversational information-seeking scenarios, incorporating user profiles.
    • Pchatbot (Qian et al., 2021): Compiles data from Weibo and judicial forums, likely including persona information.
    • HPD (Chen et al., 2023c): Harry Potter Dialogue, aligning dialogue agents with specific characters/personas.
  • User Profiles (User Simulation):
    • OpinionsQA (Santurkar et al., 2023): Datasets for assessing biases and opinions in language models, useful for user simulation by assigning personas.
    • RoleLLM (Wang et al., 2024f): A large-scale dataset for evaluating the role-playing abilities of LLMs.

5.1.2. Image Generation (Section 3.2)

  • User Behaviors (General-purpose generation):
    • Pinterest (Geng et al., 2015): Social media dataset with user interactions (pins, likes), useful for inferring visual tastes.
    • MIND (Wu et al., 2020b): News recommendation dataset, also applicable for personalized news posters.
    • POG (Chen et al., 2019): Personalized Outfit Generation, containing user interaction data related to fashion.
  • Personalized Subjects (Subject-driven T2I generation):
    • Dreambench (Ruiz et al., 2023): A collection of diverse concepts for subject-driven text-to-image generation, often involving few-shot images of a subject.
    • Dreambench++ (Peng et al., 2024), CustomConcept101 (Kumari et al., 2023), ConceptBed (Patel et al., 2024a): Benchmarks and datasets specifically designed to evaluate the capability of models to learn and generate novel concepts from limited subject images.
  • Personal Face/Body (Face generation):
    • CelebA-HQ (Karras et al., 2018), FFHQ (Karras et al., 2021), SFHQ (Beniaguv, 2022): High-quality datasets of celebrity faces, often used for training general face generation models that are then personalized.
    • LV-MHP-v2 (Zhao et al., 2018): Large-scale dataset for multi-human parsing, potentially useful for understanding body features.
    • AddMe-1.6M (Yue et al., 2025): Large dataset for inserting people into scenes, likely containing diverse human photos.
    • FaceForensics++ (Rossler et al., 2019): Dataset for detecting manipulated facial images, also useful for training robust face generation models.
    • VGGFace2 (Cao et al., 2018): A large-scale dataset for face recognition across pose and age, providing diverse face images.
  • Personal Face/Body (Virtual try-on):
    • VITON (Han et al., 2018), VITON-HD (Choi et al., 2021), DressCode (Morelli et al., 2022): Datasets for virtual try-on, containing pairs of person images and garment images.
    • StreetTryOn (Cui et al., 2024a): A dataset for in-the-wild virtual try-on from unpaired images.
    • DeepFashion (Ge et al., 2019), Deepfashion-Multimodal (Jiang et al., 2022b): Large-scale datasets for fashion analysis, including detection, pose estimation, and segmentation of clothing.
    • SHHQ (Fu et al., 2022): StyleGAN-Human, a dataset for human generation, often used as a base for personalized human image synthesis.

5.1.3. Video Generation (Section 3.3)

  • Personalized Subjects (Subject-driven T2V generation):
    • WebVid-10M (Bain et al., 2021): A large-scale text-video dataset, useful for training general T2V models.
    • UCF101 (Soomro, 2012): A dataset of 101 human action classes from videos, for action recognition, but also used for motion synthesis.
    • AnimateBench (Zhang et al., 2024i): A benchmark for evaluating video generation models.
  • Personal Face/Body (Talking head generation):
    • LRW (Chung and Zisserman, 2017a), VoxCeleb (Nagrani et al., 2020), VoxCeleb2 (Chung et al., 2018): Large-scale audio-visual datasets primarily for speaker recognition and lip reading, but also used for talking head generation.
    • TCD-TIMIT (Harte and Gillen, 2015), LRS2 (Son Chung et al., 2017), HDTF (Zhang et al., 2021b): Datasets for lip synchronization and talking head generation.
    • MEAD (Wang et al., 2020): A large-scale audio-visual dataset for emotional talking-face generation.
    • GRID (Cooke et al., 2006): An audio-visual corpus for speech perception and automatic speech recognition.
  • Personal Face/Body (Pose-guided video generation):
    • FashionVideo (Zablotskaia et al., 2019), TikTok (Jafarian and Park, 2021): Datasets containing videos of people, often used to extract pose sequences for guiding video generation.
    • Everybody-dance-now (Chan et al., 2019): A dataset for transferring motion from a source video to a target person.
  • Personal Face/Body (Video virtual try-on):
    • VVT (Dong et al., 2019b): Video Virtual Try-on, containing videos of people trying on clothes.
    • TikTokDress (Nguyen et al., 2024a): TikTok videos with people trying on various outfits.

5.1.4. 3D Generation (Section 3.4)

  • Personalized Subjects (Image-to-3D generation):
    • Dreambench (Ruiz et al., 2023): Also used for 3D generation by extending T2I concepts.
    • Objaverse (Deitke et al., 2023): A universe of annotated 3D objects, useful for training general 3D generation models that can then be personalized.
  • Personal Face/Body (3D face generation):
    • MyStyle (Nitzan et al., 2022): A personalized generative face prior built from a small set of an individual's photos, used for personalized 3D-aware face generation and editing.
    • BIWI (Fanelli et al., 2013), VOCASET (Cudeiro et al., 2019): Datasets for 3D face analysis, including expressions and talking faces.
  • Personal Face/Body (3D human pose generation):
    • Human3.6M (Ionescu et al., 2013): A large-scale dataset for 3D human sensing in natural environments, providing 3D poses and shapes.
    • 3DPW (Von Marcard et al., 2018): 3D Poses in the Wild, for recovering accurate 3D human pose from IMUs and a moving camera.
    • 3DOH50K (Zhang et al., 2020a): Object-occluded human shape and pose estimation.
  • Personal Face/Body (3D virtual try-on):
    • BUFF (Zhang et al., 2017): A dataset related to body shape and clothing.

5.1.5. Audio Generation (Section 3.6)

  • User Behaviors (Music generation):
    • Echo (Bertin-Mahieux et al., 2011): The Million Song Dataset, a large collection of audio features and metadata, useful for music recommendation and generation.
    • MAESTRO (Hawthorne et al., 2019): A large dataset of piano performances, used for music modeling and generation.
  • Personalized Subjects (Text-to-audio generation):
    • TASBench (Li et al., 2024k): A benchmark for text-to-audio spatialization.
    • AudioCaps (Kim et al., 2019): A dataset of audio clips with corresponding textual captions.
    • AudioLDM (Liu et al., 2023a): Resources associated with the AudioLDM latent diffusion model for text-to-audio generation, commonly reused in this line of work.
  • Personal Face (Face-to-speech generation):
    • Voxceleb2 (Chung et al., 2018), LibriTTS (Zen et al., 2019), VGGFace2 (Cao et al., 2018), GRID (Cooke et al., 2006): These datasets contain audio-visual data, enabling the extraction of speaker-specific attributes from facial images/videos for speech generation.

5.1.6. Cross-Modal Generation (Section 3.7)

  • User Behaviors (Robotics):
    • D4RL (Fu et al., 2020): Datasets for Deep Data-Driven Reinforcement Learning, including various robot locomotion and manipulation tasks.
    • Ravens (Zeng et al., 2021): A simulated environment for robotic manipulation.
    • Habitat-Rearrange (Puig et al., 2023), RoboTHOR (Deitke et al., 2020): Platforms and datasets for embodied AI and robotic manipulation tasks in realistic environments.
  • User Documents (Caption/Comment generation):
    • TripAdvisor (Geng et al., 2022), Yelp (Geng et al., 2022): Review datasets used for personalized explanation generation and comment generation.
    • PerVid-Com (Lin et al., 2024b): A dataset likely designed for personalized video comment generation.

5.2. Evaluation Metrics

Evaluating personalized content generation is inherently complex due to the subjective nature of "personalization." The paper categorizes metrics into three primary objectives: High quality, Instruction alignment, and Personalization. It then lists various metrics for each modality.

5.2.1. Text Generation (Section 3.1)

The following are the results from Table 3 of the original paper:

Text (Section 3.1). The table lists metrics for six text-generation task categories (recommendation, information seeking, content generation, writing assistant, dialogue system, and user simulation); each metric is shown with its evaluation dimension and representative works:

  • NDCG (Järvelin and Kekäläinen, 2002): Overall. Representative works: BIGRec (Bao et al., 2023), DEALRec (Lin et al., 2024a), AOL4PS (Guo et al., 2021)
  • Hit Rate: Overall. Representative works: BIGRec (Bao et al., 2023)
  • Precision: Overall. Representative works: LLM-Rec (Lyu et al., 2024), AOL4PS (Guo et al., 2021)
  • Recall: Overall. Representative works: LLM-Rec (Lyu et al., 2024), DEALRec (Lin et al., 2024a), AOL4PS (Guo et al., 2021)
  • win-rate: Overall. Representative works: Personalized RLHF (Li et al., 2024g)
  • ROUGE (Lin, 2004): Overall. Representative works: LaMP (Salemi et al., 2024b), RSPG (Salemi et al., 2024a), Hydra (Zhuang et al., 2024)
  • BLEU (Papineni et al., 2002): Overall. Representative works: AuthorPred (Li et al., 2023a)
  • BERTScore (Zhang et al., 2020b): Overall. Representative works: LongLaMP (Kumar et al., 2024a)
  • GEMBA (Kocmi and Federmann, 2023): Overall. Representative works: REST-PG (Salemi et al., 2025b)
  • G-Eval (Liu et al., 2023c): Overall. Representative works: REST-PG (Salemi et al., 2025b)
  • ExPerT (Salemi et al., 2025a): Personalization. Representative works: ExPerT (Salemi et al., 2025a)
  • AuPEL (Wang et al., 2023g): Personalization. Representative works: AuPEL (Wang et al., 2023g)
  • PERSE (Wang et al., 2024a): Personalization. Representative works: PERSE (Wang et al., 2024a)
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004):
    • Conceptual Definition: A set of metrics for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation with a set of reference summaries (or translations) created by humans. It measures the overlap of n-grams, word sequences, or word pairs between the candidate and reference texts (a toy ROUGE-1 computation is sketched after this metric list).
    • Mathematical Formula: For ROUGE-N (N-gram overlap): $ \mathrm{ROUGE\text{-}N} = \frac{\sum_{\text{Reference Summaries}} \sum_{n\text{-gram} \in \text{Reference}} \mathrm{CountMatch}(n\text{-gram})}{\sum_{\text{Reference Summaries}} \sum_{n\text{-gram} \in \text{Reference}} \mathrm{Count}(n\text{-gram})} $
    • Symbol Explanation:
      • $\mathrm{CountMatch}(n\text{-gram})$: The maximum number of n-grams co-occurring in a candidate summary and a reference summary.
      • $\mathrm{Count}(n\text{-gram})$: The number of n-grams in the reference summary.
      • $n\text{-gram}$: A contiguous sequence of $n$ items (words or characters) from a given sample of text. ROUGE-1 measures unigram overlap, ROUGE-2 measures bigram overlap, and ROUGE-L measures the longest common subsequence.
  • BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2002):
    • Conceptual Definition: A metric for evaluating the quality of text which has been machine-translated from one natural language to another. It measures the similarity between a candidate translation and a set of human reference translations. It computes a weighted average of n-gram precision and applies a brevity penalty.
    • Mathematical Formula: $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ where $\mathrm{BP}$ is the brevity penalty and $p_n$ is the n-gram precision. The brevity penalty is: $ \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} $ And n-gram precision is: $ p_n = \frac{\sum_{\text{sentence} \in \text{candidate}} \sum_{n\text{-gram} \in \text{sentence}} \mathrm{CountClip}(n\text{-gram})}{\sum_{\text{sentence} \in \text{candidate}} \sum_{n\text{-gram} \in \text{sentence}} \mathrm{Count}(n\text{-gram})} $
    • Symbol Explanation:
      • $N$: The maximum n-gram order considered (typically 4).
      • $w_n$: Positive weights for each n-gram precision (usually $1/N$).
      • $p_n$: The n-gram precision.
      • $c$: Length of the candidate translation.
      • $r$: Effective reference corpus length.
      • $\mathrm{CountClip}(n\text{-gram})$: The number of n-grams in the candidate that also appear in any reference, clipped to the maximum count of that n-gram in any single reference.
      • $\mathrm{Count}(n\text{-gram})$: The total number of n-grams in the candidate.
  • BERTScore (Zhang et al., 2020b):
    • Conceptual Definition: A metric that leverages contextual embeddings from pre-trained BERT models to compute similarity between candidate and reference texts. Instead of exact n-gram matching, it measures semantic similarity by comparing token embeddings.
    • Mathematical Formula: Given a candidate sentence $x = (x_1, \dots, x_k)$ and a reference sentence $y = (y_1, \dots, y_m)$, let $E_x = (\mathbf{e}_1^x, \dots, \mathbf{e}_k^x)$ and $E_y = (\mathbf{e}_1^y, \dots, \mathbf{e}_m^y)$ be their token embeddings from BERT. $ P = \frac{1}{k} \sum_{x_i \in x} \max_{y_j \in y} \mathbf{e}_i^x \cdot \mathbf{e}_j^y $ $ R = \frac{1}{m} \sum_{y_j \in y} \max_{x_i \in x} \mathbf{e}_i^x \cdot \mathbf{e}_j^y $ $ \mathrm{F1} = 2 \frac{P \cdot R}{P + R} $
    • Symbol Explanation:
      • $P$: Precision, measuring how much of the candidate text is present in the reference.
      • $R$: Recall, measuring how much of the reference text is present in the candidate.
      • $\mathbf{e}_i^x$: Embedding of the $i$-th token in the candidate sentence.
      • $\mathbf{e}_j^y$: Embedding of the $j$-th token in the reference sentence.
      • $\cdot$: Dot product (cosine similarity is often used instead of raw dot product after normalization).
  • GEMBA (Generative Model-Based Evaluation) (Kocmi and Federmann, 2023):
    • Conceptual Definition: A state-of-the-art metric that uses large language models themselves to evaluate translation quality, demonstrating strong correlation with human judgments. It prompts an LLM to score the quality of generated text given source and reference.
    • Mathematical Formula: Not a single mathematical formula in the traditional sense, but rather a methodology for querying an LLM. The score is the LLM's output (e.g., a numerical rating or a categorical assessment).
  • G-Eval (Liu et al., 2023c):
    • Conceptual Definition: An approach for evaluating Natural Language Generation (NLG) tasks using GPT-4. It leverages the advanced reasoning capabilities of LLMs to provide comprehensive evaluations along various dimensions (e.g., coherence, relevance, factual accuracy, personalized alignment) by carefully crafted prompts.
    • Mathematical Formula: Similar to GEMBA, G-Eval is a methodology of prompting LLMs. The output is typically a score or a qualitative assessment generated by the LLM.
  • ExPerT (Effective and Explainable Evaluation of Personalized Long-Form Text Generation) (Salemi et al., 2025a):
    • Conceptual Definition: A newly introduced metric designed specifically for evaluating personalized text generation in reference-based settings. It segments both generated and reference texts into atomic facts and scores them based on content and writing style similarity, with a focus on personalization.
    • Mathematical Formula: The paper introduces ExPerT and its full methodology is likely detailed in its specific publication. Generally, such metrics involve:
      1. Fact extraction from text.
      2. Matching facts between generated and reference texts.
      3. Evaluating similarity for matched facts on content and style.
      4. Aggregating scores to produce a personalization-aware metric.
  • AuPEL (Automated Personalized Evaluation for Language Models) (Wang et al., 2023g) & PERSE (Personalized Evaluation for Response Selection and Generation) (Wang et al., 2024a):
    • Conceptual Definition: Metrics for personalization, often involving prompting LLMs to assess how well generated content aligns with a user profile or preferences. These methods aim to capture the subjective aspects of personalization.
    • Mathematical Formula: These are typically frameworks or methodologies leveraging LLMs, rather than single formulas. The output is usually a score indicating the degree of personalization.
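To ground the n-gram metrics above, here is the toy ROUGE-1 computation referenced in the ROUGE entry (a simplified, recall-only variant for illustration; real evaluations use the official implementations):

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """Toy ROUGE-N recall: clipped n-gram overlap divided by the reference n-gram count."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n("the cat sat on the mat", "the cat lay on the mat"))  # ~0.833
```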

5.2.2. Image Generation (Section 3.2)

The following are the results from Table 3 of the original paper:

Image (Section 3.2). The table lists metrics for six image-generation task categories (general-purpose generation, fashion design, e-commerce product image, subject-driven T2I generation, face generation, and virtual try-on); each metric is shown with its evaluation dimension and representative works:

  • CLIP-I (Radford et al., 2021): Personalization. Representative works: Textual Inversion (Gal et al., 2023), Custom Diffusion (Kumari et al., 2023), DreamBooth (Ruiz et al., 2023)
  • DINO-I (Caron et al., 2021; Oquab et al., 2024): Personalization. Representative works: DreamBooth (Ruiz et al., 2023), BLIP-Diffusion (Li et al., 2024b), ELITE (Wei et al., 2023)
  • LPIPS (Zhang et al., 2018): Personalization. Representative works: DreamSteerer (Yu et al., 2024), DiFashion (Xu et al., 2024b), PMG (Shen et al., 2024b)
  • PSNR (Hore and Ziou, 2010): Personalization. Representative works: SCW-VTON (Han et al., 2024)
  • SSIM (Wang et al., 2004): Personalization. Representative works: DreamSteerer (Yu et al., 2024), PMG (Shen et al., 2024b)
  • MS-SSIM (Wang et al., 2003): Personalization. Representative works: DreamSteerer (Yu et al., 2024), Pigeon (Xu et al., 2024c)
  • DreamSim (Fu et al., 2023): Personalization. Representative works: IMPRINT (Song et al., 2024b)
  • Face similarity (Deng et al., 2019; Schroff et al., 2015; Kim et al., 2022): Personalization. Representative works: Infinite-ID (Wu et al., 2025a), PhotoMaker (Li et al., 2024l)
  • Face detection rate (Deng et al., 2019; Zhang et al., 2016): Personalization. Representative works: SeFi-IDE (Li et al., 2024h), Celeb Basis (Yuan et al., 2023)
  • CLIP-T (Radford et al., 2021): Instruction Alignment. Representative works: Textual Inversion (Gal et al., 2023), Custom Diffusion (Kumari et al., 2023)
  • ImageReward (Xu et al., 2023a): Instruction Alignment. Representative works: InstructBooth (Chae et al., 2023), DiffLoRA (Wu et al., 2024f), IMAGDressing-v1 (Shen et al., 2024a)
  • PickScore (Kirstain et al., 2023): Instruction Alignment. Representative works: FABRIC (Von Rütte et al., 2023), Stellar (Achlioptas et al., 2023)
  • HPSv1 (Wu et al., 2023c): Instruction Alignment. Representative works: Stellar (Achlioptas et al., 2023)
  • HPSv2 (Wu et al., 2023b): Instruction Alignment. Representative works: Stellar (Achlioptas et al., 2023)
  • R-precision (Xu et al., 2018): Instruction Alignment. Representative works: COTI (Yang et al., 2023b)
  • PAR score (Gani et al., 2024): Instruction Alignment. Representative works: Vashishtha et al. (2024)
  • FID (Heusel et al., 2017): Content Quality. Representative works: COTI (Yang et al., 2023b), IMPRINT (Song et al., 2024b), DiFashion (Xu et al., 2024b)
  • KID (Bińkowski et al., 2018): Content Quality. Representative works: Custom Diffusion (Kumari et al., 2023), OOTDiffusion (Xu et al., 2024d), LaDI-VTON (Morelli et al., 2023)
  • IS (Salimans et al., 2016): Content Quality. Representative works: PE-VTON (Zhang et al., 2024d), DF-VTON (Dong et al., 2024), Layout-and-Retouch (Kim et al., 2024b)
  • LAION-Aesthetics (Christoph and Romain, 2022): Content Quality. Representative works: BLIP-Diffusion (Li et al., 2024b), UniPortrait (He et al., 2024a)
  • TOPIQ (Chen et al., 2024a): Content Quality. Representative works: DreamSteerer (Yu et al., 2024)
  • MUSIQ (Ke et al., 2021): Content Quality. Representative works: DreamSteerer (Yu et al., 2024), PE-VTON (Zhang et al., 2024d)
  • MANIQA (Yang et al., 2022): Content Quality. Representative works: PE-VTON (Zhang et al., 2024d)
  • LIQE (Zhang et al., 2023c): Content Quality. Representative works: DreamSteerer (Yu et al., 2024)
  • QS (Gu et al., 2020): Content Quality. Representative works: AddMe (Yue et al., 2025)
  • BRISQUE (Mittal et al., 2012a): Content Quality. Representative works: Vashishtha et al. (2024)
  • CTR: Overall. Representative works: CG4CTR (Yang et al., 2024a), Czapp et al. (2024)
  • Stellar metrics (Achlioptas et al., 2023): Overall. Representative works: Stellar (Achlioptas et al., 2023)
  • CAMI (Shen et al., 2024a): Overall.
  • CLIP-I (CLIP Image Similarity) (Radford et al., 2021):
    • Conceptual Definition: Measures the cosine similarity between the image embedding of a generated image and the image embedding of a reference image (e.g., a personalized subject image). It captures how well the generated image preserves the visual identity or style of the reference.
    • Mathematical Formula: $ \mathrm{CLIP\text{-}I}(I_g, I_r) = \frac{E(I_g) \cdot E(I_r)}{\|E(I_g)\| \, \|E(I_r)\|} $
    • Symbol Explanation:
      • $I_g$: Generated image.
      • $I_r$: Reference image (e.g., a personal subject image).
      • $E(\cdot)$: Function that maps an image to its CLIP embedding.
      • $\cdot$: Dot product.
      • $\|\cdot\|$: L2 norm (magnitude of the vector).
  • DINO-I (DINO Image Similarity) (Caron et al., 2021; Oquab et al., 2024):
    • Conceptual Definition: Similar to CLIP-I but uses embeddings from DINO (Self-supervised Vision Transformers) models. DINO excels at learning rich visual features, often capturing fine-grained structural and semantic information without explicit labels, making it suitable for comparing visual similarity.
    • Mathematical Formula: $ \mathrm{DINO-I}(I_g, I_r) = \frac{E_{DINO}(I_g) \cdot E_{DINO}(I_r)}{\|E_{DINO}(I_g)\| \, \|E_{DINO}(I_r)\|} $
    • Symbol Explanation:
      • $I_g$: Generated image.
      • $I_r$: Reference image.
      • $E_{DINO}(\cdot)$: Function that maps an image to its DINO embedding.
  • LPIPS (Learned Perceptual Image Patch Similarity) (Zhang et al., 2018):
    • Conceptual Definition: A perceptual metric that uses features from pre-trained deep neural networks (such as VGG or AlexNet) to measure the perceptual similarity between two images. It correlates better with human judgment than traditional pixel-based metrics (PSNR, SSIM) because it captures higher-level visual differences. A sketch computing PSNR, SSIM, and LPIPS together is given after this metric list.
    • Mathematical Formula: $ \mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(x)_{h,w} - \phi_l(x_0)_{h,w} \right) \right\|_2^2 $
    • Symbol Explanation:
      • $x, x_0$: Two images being compared.
      • $\phi_l$: Features from the $l$-th layer of a pre-trained network.
      • $w_l$: Learned per-channel weights for layer $l$, applied element-wise.
      • $H_l, W_l$: Height and width of the feature map at layer $l$.
      • $\odot$: Element-wise product.
  • PSNR (Peak Signal-to-Noise Ratio) (Hore and Ziou, 2010):
    • Conceptual Definition: A traditional, pixel-based metric used to quantify the reconstruction quality of an image. It is most easily defined via the Mean Squared Error (MSE). A higher PSNR generally indicates higher quality.
    • Mathematical Formula: $ \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $ $ \mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
    • Symbol Explanation:
      • $I$: Original image.
      • $K$: Generated (reconstructed) image.
      • $m, n$: Dimensions of the image.
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit image).
  • SSIM (Structural Similarity Index Measure) (Wang et al., 2004):
    • Conceptual Definition: A perceptual metric that measures the similarity between two images based on luminance, contrast, and structure. It aims to approximate human perception of image quality.
    • Mathematical Formula: $ \mathrm{SSIM}(x, y) = [l(x,y)]^{\alpha} \cdot [c(x,y)]^{\beta} \cdot [s(x,y)]^{\gamma} $ where $ l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} $ $ c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} $ $ s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3} $
    • Symbol Explanation:
      • x, y: Two image patches.
      • $\mu_x, \mu_y$: Mean of $x$ and $y$.
      • $\sigma_x, \sigma_y$: Standard deviation of $x$ and $y$.
      • $\sigma_{xy}$: Covariance of $x$ and $y$.
      • $C_1, C_2, C_3$: Small constants to prevent division by zero.
      • $\alpha, \beta, \gamma$: Weights for the luminance, contrast, and structural components (typically 1).
  • FID (Fréchet Inception Distance) (Heusel et al., 2017):
    • Conceptual Definition: A popular metric for evaluating the quality of images generated by generative adversarial networks (GANs) or other generative models. It measures the "distance" between the feature distributions of real and generated images using a pre-trained Inception-v3 network. Lower FID scores indicate higher quality and diversity. A numpy sketch of FID (and KID) computed from extracted features is given after this metric list.
    • Mathematical Formula: $ \mathrm{FID} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right) $
    • Symbol Explanation:
      • $\mu_1, \mu_2$: Mean feature vectors of real and generated images, respectively, extracted from an intermediate layer of an Inception-v3 model.
      • $\Sigma_1, \Sigma_2$: Covariance matrices of the feature vectors for real and generated images.
      • $\mathrm{Tr}(\cdot)$: The trace of a matrix.
  • KID (Kernel Inception Distance) (Bińkowski et al., 2018):
    • Conceptual Definition: Another metric for evaluating generative models, similar to FID but using the Maximum Mean Discrepancy (MMD) with a polynomial kernel on Inception-v3 features. It is considered more robust than FID, especially with limited sample sizes.
    • Mathematical Formula: With the polynomial kernel $ k(x, y) = \left( \tfrac{1}{d} x^\top y + 1 \right)^3 $, where $d$ is the feature dimension: $ \mathrm{KID} = \mathrm{MMD}^2 = \mathbb{E}[k(X, X')] + \mathbb{E}[k(Y, Y')] - 2\,\mathbb{E}[k(X, Y)] $
    • Symbol Explanation:
      • $X, X'$ and $Y, Y'$: Independent feature vectors drawn from real and generated images, respectively.
      • $\mathbb{E}[\cdot]$: Expectation.
      • $k(\cdot, \cdot)$: A kernel function, typically a polynomial kernel applied to Inception-v3 features.
  • ImageReward (Xu et al., 2023a), PickScore (Kirstain et al., 2023), HPSv1/HPSv2 (Human Preference Score) (Wu et al., 2023c; Wu et al., 2023b):
    • Conceptual Definition: Metrics specifically designed to evaluate how well generated images align with human preferences or instructions. They are often learned from large datasets of human feedback (e.g., ratings or pairwise comparisons of images based on a prompt).
    • Mathematical Formula: These are typically scores predicted by a learned model (e.g., a reward model trained via RLHF) and do not have a single standard formula.
  • CTR (Click-Through Rate):
    • Conceptual Definition: A marketing metric that measures the percentage of people who clicked on a specific link or advertisement after seeing it. In personalized image generation for e-commerce, it evaluates how engaging the generated product images are to target consumers.
    • Mathematical Formula: $ \mathrm{CTR} = \frac{\text{Number of Clicks}}{\text{Number of Impressions}} \times 100\% $
    • Symbol Explanation:
      • Number of Clicks: How many times the generated content was clicked.
      • Number of Impressions: How many times the generated content was displayed.
  • Stellar metrics (Achlioptas et al., 2023):
    • Conceptual Definition: Specialized metrics designed for systematic evaluation of human-centric personalized text-to-image methods, focusing on aspects like identity preservation and human generation quality. They encompass multiple sub-metrics tailored to the domain.
  • Face similarity metrics (ArcFace, FaceNet, AdaFace, VGG-Face) (Deng et al., 2019; Schroff et al., 2015; Kim et al., 2022; Cao et al., 2018):
    • Conceptual Definition: Use embeddings from pre-trained face recognition models to quantify the similarity between the face in a generated image and a reference face. This is crucial for identity preservation in personalized face generation tasks.
    • Mathematical Formula: Typically cosine similarity between feature vectors from a face recognition model. $ \mathrm{FaceSim}(F_g, F_r) = \frac{E_{Face}(F_g) \cdot E_{Face}(F_r)}{\|E_{Face}(F_g)\| \, \|E_{Face}(F_r)\|} $
    • Symbol Explanation:
      • $F_g$: Generated face image.
      • $F_r$: Reference face image.
      • $E_{Face}(\cdot)$: Function that maps a face image to its embedding from a face recognition model.
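
Several of the personalization metrics above (CLIP-I, DINO-I, face similarity) reduce to the same computation: cosine similarity between embeddings produced by a frozen encoder. The following is a minimal sketch of CLIP-I using the Hugging Face `transformers` library; the checkpoint name is an illustrative assumption, and DINO-I or face similarity follow the same pattern with a DINO or face-recognition backbone in place of CLIP.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP variant with an image tower follows the same pattern.
MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_i(generated: Image.Image, reference: Image.Image) -> float:
    """CLIP-I: cosine similarity between the CLIP image embeddings of two images."""
    inputs = processor(images=[generated, reference], return_tensors="pt")
    feats = model.get_image_features(**inputs)        # shape (2, d)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize
    return float((feats[0] * feats[1]).sum())

# Usage (paths are placeholders):
# score = clip_i(Image.open("generated.png"), Image.open("reference.png"))
```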
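
The full-reference metrics (PSNR, SSIM, LPIPS) compare a generated image against a ground-truth image of the same size. A hedged sketch using `scikit-image` and the `lpips` package is shown below; the backbone choice and data range are assumptions for illustration rather than settings taken from the surveyed papers.

```python
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")             # AlexNet backbone; VGG is another common choice

def full_reference_metrics(generated: np.ndarray, reference: np.ndarray) -> dict:
    """generated/reference: HxWx3 uint8 arrays of identical shape."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)

    # LPIPS expects NCHW float tensors scaled to [-1, 1]; lower means more similar.
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = float(lpips_fn(to_tensor(generated), to_tensor(reference)))
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```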
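
FID and KID compare real and generated image sets in an Inception-v3 feature space. Given two feature matrices (one row per image), the scores follow directly from the formulas above; the numpy/scipy sketch below illustrates those formulas and is not the exact implementation used by the surveyed works, which typically rely on libraries such as `torchmetrics` or `clean-fid`.

```python
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, d) feature matrices."""
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma1 = np.cov(real_feats, rowvar=False)
    sigma2 = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real                      # drop negligible imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def kid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Unbiased MMD^2 estimate with the polynomial kernel k(x, y) = (x.y / d + 1)^3."""
    d = real_feats.shape[1]
    kernel = lambda a, b: (a @ b.T / d + 1.0) ** 3
    kxx, kyy, kxy = kernel(real_feats, real_feats), kernel(gen_feats, gen_feats), kernel(real_feats, gen_feats)
    m, n = len(real_feats), len(gen_feats)
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * kxy.mean())
```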

5.2.3. Video, 3D, Audio, and Cross-Modal Generation

The following are the results from Table 4 of the original paper:

Video (Section 3.3). Tasks: subject-driven T2V generation, ID-preserving T2V generation, talking head generation, pose-guided video generation, video virtual try-on.

| Metric | Evaluation Dimension | Representative Works |
| --- | --- | --- |
| CLIP-I (Radford et al., 2021) | Personalization | PIA (Zhang et al., 2024i), PoseCrafter (Zhong et al., 2025), ID-Animator (He et al., 2024b) |
| DINO-I (Caron et al., 2021; Oquab et al., 2024) | Personalization | DisenStudio (Chen et al., 2024b), DreamVideo (Wei et al., 2024b), Magic-Me (Ma et al., 2024c) |
| SSIM (Wang et al., 2004) | Personalization | OutfitAnyone (Sun et al., 2024a), ViViD (Fang et al., 2024b) |
| PSNR (Hore and Ziou, 2010) | Personalization | AnimateAnyone (Hu, 2024), Yi et al. (2020), Zhua et al. (2023) |
| LPIPS (Zhang et al., 2018) | Personalization | AnimateAnyone (Hu, 2024), DiffTalk (Shen et al., 2023) |
| VGG (Johnson et al., 2016) | Personalization | DisCo (Wang et al., 2023b), DreamPose (Karras et al., 2023) |
| L1 error | Personalization | DisCo (Wang et al., 2023b), DreamPose (Karras et al., 2023) |
| Face similarity (Deng et al., 2019; Huang et al., 2020; Kim et al., 2022) | Personalization | ID-Animator (He et al., 2024b), MagicPose (Chang et al., 2023), ConsisID (Yuan et al., 2024) |
| CLIP-T (Radford et al., 2021) | Instruction Alignment | PIA (Zhang et al., 2024i), ConsisID (Yuan et al., 2024), PoseCrafter (Zhong et al., 2025) |
| UMT score (Liu et al., 2022) | Instruction Alignment | StyleMaster (Ye et al., 2024) |
| AKD (Siarohin et al., 2021) | Instruction Alignment | MagicAnimate (Xu et al., 2024f) |
| MKR (Siarohin et al., 2021) | Instruction Alignment | MagicAnimate (Xu et al., 2024f) |
| SyncNet score (Chung and Zisserman, 2017b) | Instruction Alignment | DreamTalk (Ma et al., 2023), EMO (Tian et al., 2025), MEMO (Zheng et al., 2024b) |
| LMD (Chen et al., 2018b) | Instruction Alignment | DFA-NeRF (Yao et al., 2022), DreamTalk (Ma et al., 2023), Yi et al. (2020) |
| LSE-C (Prajwal et al., 2020) | Instruction Alignment | StyleLipsync (Ki and Min, 2023), DiffTalker (Chen et al., 2023b), Choi et al. (2024) |
| LSE-D (Prajwal et al., 2020) | Instruction Alignment | StyleLipsync (Ki and Min, 2023), DiffTalker (Chen et al., 2023b), Choi et al. (2024) |
| PD (Baldrati et al., 2023) | Instruction Alignment | ACF (Yang et al., 2024f) |
| FID (Heusel et al., 2017) | Content Quality | OutfitAnyone (Sun et al., 2024a), DisCo (Wang et al., 2023b) |
| KID (Bińkowski et al., 2018) | Content Quality | WildVidFit (He et al., 2025) |
| ArtFID (Wright and Ommer, 2022) | Content Quality | StyleMaster (Ye et al., 2024) |
| VFID (Wang et al., 2018c) | Content Quality | OutfitAnyone (Sun et al., 2024a), ViViD (Fang et al., 2024b) |
| FVD (Unterthiner et al., 2018) | Content Quality | PersonalVideo (Li et al., 2024c), MotionBooth (Wu et al., 2024a), AnimateAnyone (Hu, 2024) |
| FID-VID (Balaji et al., 2019) | Content Quality | DisCo (Wang et al., 2023b), MagicAnimate (Xu et al., 2024f) |
| KVD (Unterthiner et al., 2018) | Content Quality | MagicAnimate (Xu et al., 2024f) |
| E-FID (Tian et al., 2025) | Content Quality | Animate-A-Story (He et al., 2023) |
| NIQE (Mittal et al., 2012b) | Content Quality | MagicFight (Huang et al., 2024a) |
| CPBD (Narvekar and Karam, 2011) | Content Quality | DreamTalk (Ma et al., 2023) |
| Temporal consistency (Radford et al., 2021) | Content Quality | AnimateDiff (Guo et al., 2024a), Magic-Me (Ma et al., 2024c), DreamVideo (Wei et al., 2024b) |
| Dynamic degree (Huang et al., 2024d) | Content Quality | StyleMaster (Ye et al., 2024), ID-Animator (He et al., 2024b), PersonalVideo (Li et al., 2024c) |
| Flow error (Shi et al., 2023a) | Content Quality | MagDiff (Zhao et al., 2025) |
| Video IS (Saito et al., 2020) | Content Quality | MotionBooth (Wu et al., 2024a) |
| Stitch score | Content Quality | MotionBooth (Wu et al., 2024a) |
| Dover score (Wu et al., 2023a) | Content Quality | ID-Animator (He et al., 2024b) |
| Motion score (Shi et al., 2023a) | Content Quality | ID-Animator (He et al., 2024b) |

3D (Section 3.4). Tasks: image-to-3D generation, 3D face generation, 3D human pose generation, 3D virtual try-on.

| Metric | Evaluation Dimension | Representative Works |
| --- | --- | --- |
| LPIPS (Zhang et al., 2018) | Personalization | Wonder3D (Long et al., 2024), PuzzleAvatar (Xiu et al., 2024), My3DGen (Qi et al., 2023a) |
| PSNR (Hore and Ziou, 2010) | Personalization | Wonder3D (Long et al., 2024), PuzzleAvatar (Xiu et al., 2024), My3DGen (Qi et al., 2023a) |
| SSIM (Wang et al., 2004) | Personalization | Wonder3D (Long et al., 2024), PuzzleAvatar (Xiu et al., 2024), My3DGen (Qi et al., 2023a) |
| Chamfer Distance (Butt and Maragos, 1998) | Personalization | Wonder3D (Long et al., 2024) |
| CLIP (Radford et al., 2021) | Personalization | DreamVTON (Xie et al., 2024) |
| Volume IoU (Zhou et al., 2019) | Personalization | DreamVTON (Xie et al., 2024) |
| Lip Sync Error (LSE) (Prajwal et al., 2020) | Personalization | DiffSpeaker (Ma et al., 2024d) |
| Facial Dynamics Deviation (FDD) (Ma et al., 2024d) | Personalization | DiffSpeaker (Ma et al., 2024d), DiffusionTalker (Chen et al., 2023d) |
| FRelD (Huang et al., 2021) | Personalization | FewShotMotionTransfer (Huang et al., 2021) |
| CLIP-T (Radford et al., 2021) | Instruction Alignment | MVDream (Shi et al., 2023b), DreamBooth3D (Raj et al., 2023), MakeYour3D (Liu et al., 2025a) |
| FID (Heusel et al., 2017) | Content Quality | MVDream (Shi et al., 2023b), 3DAvatarGAN (Abdal et al., 2023), TextureDreamer (Yeh et al., 2024), DreamVTON (Xie et al., 2024) |
| IS (Salimans et al., 2016) | Content Quality | MVDream (Shi et al., 2023b) |

Audio (Section 3.6). Tasks: face-to-speech generation, music generation, text-to-audio generation.

| Metric | Evaluation Dimension | Representative Works |
| --- | --- | --- |
| CLAP (Elizalde et al., 2023) | Personalization | DB&TI (Plitsis et al., 2024) |
| Embedding Distance | Personalization | UMP (Ma et al., 2022), FR-PSS (Wang et al., 2022a) |
| FAD (Kilgour et al., 2018) | Personalization | UIGAN (Wang et al., 2024k), DiffAVA (Mo et al., 2023), DB&TI (Plitsis et al., 2024) |
| IS (Salimans et al., 2016) | Content Quality | DiffAVA (Mo et al., 2023) |
| STOI, ESTOI, PESQ (Sheng et al., 2023) | Content Quality | Lip2Speech (Sheng et al., 2023) |

Cross-modal (Section 3.7). Tasks: robotics, caption/comment generation, multimodal dialogue systems.

| Metric | Evaluation Dimension | Representative Works |
| --- | --- | --- |
| BLEU, Meteor | Overall | PVCG (Wu et al., 2024e), METER (Geng et al., 2022), PV-LLM (Lin et al., 2024b) |
| Recall, Precision, F1 | Overall | MyVLM (Alaluf et al., 2025), Yo'LLaVA (Nguyen et al., 2024b) |
| Success rate | Overall | VPL (Poddar et al., 2024), Promptable Behaviors (Hwang et al., 2024) |
  • CLIP-T (CLIP Text Similarity) (Radford et al., 2021):
    • Conceptual Definition: Measures the cosine similarity between the CLIP embedding of the generated content (typically the image embedding of a generated image or of sampled video frames) and the CLIP text embedding of the input prompt. It quantifies how well the generated content aligns semantically with the textual instructions. A minimal sketch is given after this metric list.
    • Mathematical Formula: $ \mathrm{CLIP-T}(I_g, T_p) = \frac{E_{I}(I_g) \cdot E_{T}(T_p)}{\|E_{I}(I_g)\| \, \|E_{T}(T_p)\|} $
    • Symbol Explanation:
      • $I_g$: Generated image (or a sampled frame of the generated video or 3D rendering).
      • $T_p$: Input textual prompt.
      • $E_{I}(\cdot), E_{T}(\cdot)$: CLIP image and text encoders, respectively.
  • FVD (Fréchet Video Distance) (Unterthiner et al., 2018):
    • Conceptual Definition: An extension of FID for videos. It measures the "distance" between the feature distributions of real and generated videos using features extracted from a pre-trained 3D convolutional network (e.g., I3D). A lower FVD indicates higher quality and realism in generated videos.
    • Mathematical Formula: Similar to FID, but applied to video feature distributions. $ \mathrm{FVD} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right) $
    • Symbol Explanation:
      • $\mu_1, \mu_2$: Mean feature vectors of real and generated videos, respectively, extracted from a pre-trained video feature extractor.
      • $\Sigma_1, \Sigma_2$: Covariance matrices of the feature vectors for real and generated videos.
  • KVD (Kernel Video Distance) (Unterthiner et al., 2018):
    • Conceptual Definition: An extension of KID for videos, using MMD with a polynomial kernel on video features. More robust than FVD for smaller sample sizes.
    • Mathematical Formula: Similar to KID, but applied to video feature distributions.
  • SyncNet score (Audio-visual synchronization) (Chung and Zisserman, 2017b):
    • Conceptual Definition: Measures the synchronization between audio and video streams, particularly for talking head generation. It quantifies how well the lip movements in the video align with the speech in the audio.
    • Mathematical Formula: SyncNet produces a confidence score, often a distance metric (e.g., embedding distance) between audio and visual features, indicating synchronization. A lower distance implies better synchronization.
  • LSE (Lip Sync Error) (LSE-C, LSE-D) (Prajwal et al., 2020):
    • Conceptual Definition: Metrics specifically for evaluating the accuracy of lip synchronization in generated talking heads. LSE-C (contrastive) and LSE-D (distance) are based on the SyncNet model to measure how well the generated mouth movements match the audio.
    • Mathematical Formula: Derived from SyncNet's similarity scores or embedding distances.
  • Chamfer Distance (CD) (Butt and Maragos, 1998):
    • Conceptual Definition: A metric for comparing two point clouds or 3D shapes. It measures the average closest-point distance between points on one surface and points on the other. It is widely used to evaluate the geometric similarity of generated 3D models to ground truth. A minimal sketch (together with Volume IoU) is given after this metric list.
    • Mathematical Formula: For two point clouds $S_1$ and $S_2$: $ \mathrm{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2 $
    • Symbol Explanation:
      • $S_1, S_2$: Two point clouds.
      • $x$: A point in $S_1$.
      • $y$: A point in $S_2$.
      • $\|\cdot\|_2$: Euclidean distance.
  • Volume IoU (Intersection over Union) (Zhou et al., 2019):
    • Conceptual Definition: For 3D shapes represented as voxels or meshes, IoU measures the overlap between the generated 3D object and the ground truth 3D object. It is a standard metric for object detection and segmentation tasks, adapted for 3D.
    • Mathematical Formula: $ \mathrm{IoU} = \frac{\text{Volume of Intersection}}{\text{Volume of Union}} $
    • Symbol Explanation:
      • Volume of Intersection: The volume shared by the generated and ground truth 3D objects.
      • Volume of Union: The total volume covered by both generated and ground truth 3D objects.
  • CLAP (Contrastive Language-Audio Pre-training) (Elizalde et al., 2023):
    • Conceptual Definition: Measures the cosine similarity between the embeddings of a generated audio clip and a reference text (or another audio clip). Similar to CLIP, but for audio and text modalities, assessing semantic alignment between audio content and its description.
    • Mathematical Formula: $ \mathrm{CLAP}(A_g, T_p) = \frac{E_{CLAP}(A_g) \cdot E_{CLAP}(T_p)}{\|E_{CLAP}(A_g)\| \, \|E_{CLAP}(T_p)\|} $
    • Symbol Explanation:
      • $A_g$: Generated audio clip.
      • $T_p$: Input textual prompt or description.
      • $E_{CLAP}(\cdot)$: Function that maps audio or text to its CLAP embedding.
  • FAD (Fréchet Audio Distance) (Kilgour et al., 2018):
    • Conceptual Definition: An extension of FID for audio generation. It measures the "distance" between the feature distributions of real and generated audio clips using features extracted from a pre-trained audio classification network (e.g., VGGish). Lower FAD indicates higher quality and realism in generated audio.
    • Mathematical Formula: Similar to FID, but applied to audio feature distributions.
  • STOI (Short-Time Objective Intelligibility), ESTOI (Extended STOI), PESQ (Perceptual Evaluation of Speech Quality) (Sheng et al., 2023):
    • Conceptual Definition: Metrics primarily used in speech processing to objectively assess speech intelligibility and quality. They compare a degraded speech signal (generated speech) to a clean reference speech signal. Higher values indicate better quality.
    • Mathematical Formula: These are complex signal processing metrics, each with specific algorithms defined in their respective standards (e.g., ITU-T P.862 for PESQ). Their formulas are extensive and go beyond simple algebraic expressions.
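
CLIP-T, referenced throughout the video and 3D tables, can be computed per image or per sampled video frame and then averaged. A minimal sketch with the Hugging Face `transformers` CLIP model is given below; the checkpoint name is an illustrative assumption rather than a choice made by the surveyed papers.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"       # illustrative checkpoint
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_t(generated: Image.Image, prompt: str) -> float:
    """CLIP-T: cosine similarity between the generated image embedding and the prompt embedding."""
    inputs = processor(text=[prompt], images=[generated], return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# For a video, clip_t is typically averaged over uniformly sampled frames.
```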
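
The two 3D geometry metrics above reduce to simple set comparisons: Chamfer Distance over sampled point clouds and Volume IoU over occupancy grids. Below is a brute-force numpy/scipy sketch of the formulas, intended only as an illustration; practical pipelines use accelerated nearest-neighbour search and dedicated mesh voxelization tools.

```python
import numpy as np
from scipy.spatial.distance import cdist

def chamfer_distance(s1: np.ndarray, s2: np.ndarray) -> float:
    """Symmetric Chamfer Distance between two (N, 3) point clouds (smaller is better)."""
    d = cdist(s1, s2)                            # pairwise Euclidean distances, shape (|S1|, |S2|)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def volume_iou(vox_a: np.ndarray, vox_b: np.ndarray) -> float:
    """IoU between two boolean occupancy grids of identical shape (larger is better)."""
    intersection = np.logical_and(vox_a, vox_b).sum()
    union = np.logical_or(vox_a, vox_b).sum()
    return float(intersection / union) if union > 0 else 1.0
```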

5.3. Baselines

As a survey paper, the document itself does not compare against baselines in an experimental setting. Instead, it reviews papers that conduct such comparisons. The baselines for PGen typically fall into several categories:

  • Generic Generative Models: Non-personalized versions of LLMs, DMs, or GANs that generate content without specific user input or preferences. Comparing against these highlights the value added by personalization.

  • Non-Generative Personalized Methods: For tasks like recommendation, traditional retrieval-based recommender systems serve as baselines. PGen methods would aim to show advantages in generating novel, tailored items beyond retrieving existing ones.

  • Earlier Personalization Techniques: Previous methods for personalization that might be task-specific or limited to a single modality, or less efficient (e.g., full fine-tuning vs. PEFT).

  • Simpler Personalization Cues: Baselines that use fewer or less sophisticated personalized contexts (e.g., only user profiles vs. user behaviors and documents).

    The effectiveness of PGen methods is typically validated by demonstrating superior performance across the evaluation dimensions (quality, instruction alignment, personalization) compared to these baselines, often corroborated by human evaluations.

6. Results & Analysis

This section synthesizes the collective findings and advancements reported in the surveyed literature on Personalized Generation (PGen). Since this is a survey paper, it does not present new experimental results but rather analyzes the overall trends and successes documented across various research works.

6.1. Core Results Analysis

The paper's core analysis demonstrates that PGen, in the era of large models, is making significant strides in tailoring content across diverse modalities to individual preferences and needs. The effectiveness of the proposed methods is consistently validated by their ability to achieve three core objectives: High quality, Instruction alignment, and Personalization.

  • Text Generation: LLMs have shown remarkable capabilities in personalizing textual content. For tasks like writing assistants, dialogue systems, and user simulation, models can effectively mimic user styles, generate contextually relevant responses, and even adopt specific personas. The use of RAG and prompt engineering has been particularly effective in grounding text generation with user-specific information, leading to outputs that are both coherent and highly personalized. Evaluations often involve human judgment, indicating the complexity of capturing subjective aspects of personalization.

  • Image Generation: This modality has seen substantial breakthroughs, especially with the advent of Diffusion Models. Techniques like Textual Inversion and DreamBooth have enabled users to teach models new concepts (e.g., their pet, a specific object) from just a few images, and then generate novel images of these subjects in various contexts and styles. Personalized face generation and virtual try-on applications also highlight the ability to preserve identity while generating diverse expressions or fitting clothes. The combination of visual and textual cues (CLIP, DINO embeddings) for personalized context has proven powerful.

  • Video Generation: Building on image generation successes, PGen for video aims to generate personalized dynamic content. Subject-driven text-to-video generation, ID-preserving video synthesis, talking head generation, and video virtual try-on are key areas. The challenge here is maintaining temporal consistency and identity preservation across frames while incorporating personalized elements. Adaptations of diffusion models, often with additional modules for motion or temporal dynamics, have yielded promising results.

  • 3D Generation: This domain focuses on creating personalized 3D assets from visual or textual inputs. Tasks include image-to-3D generation (e.g., creating a 3D model of a personalized subject), 3D face generation, 3D human pose generation, and 3D virtual try-on. The complexity lies in accurately capturing geometry and texture, and ensuring consistency with personalized attributes.

  • Audio Generation: Personalized audio generation involves creating tailored music or speech based on user preferences or specific auditory inputs. Music generation leverages user listening histories, while text-to-audio often uses voice cloning or style transfer from user-provided audio clips. Face-to-speech generation demonstrates the ability to synthesize speech from visual cues of a speaker.

  • Cross-modal Generation: This exciting frontier combines multiple modalities, such as personalized robotics (inferring user preferences for robot actions) or personalized caption/comment generation (generating text based on user documents and images/videos). Multimodal Large Language Models (MLLMs) are crucial here, enabling systems to understand and generate personalized content across different data types simultaneously.

    Across all modalities, the trend shows a movement from generic, high-quality generation towards user-centric outputs that are not only aesthetically pleasing but also deeply relevant to individual users. The advantages of PGen over non-personalized methods are clear: increased engagement, enhanced user experience, and the ability to create bespoke content on demand. Disadvantages mainly revolve around the computational cost, the complexity of managing diverse user data, and the challenges in comprehensive evaluation.

6.2. Data Presentation (Tables)

The paper summarizes the landscape of PGen using two main tables for datasets and evaluation metrics.

The following are the results from Table 1 of the original paper:

Modality Personalized Contexts Tasks Representative Works
Text (Section 3.1) User behaviors Recommendation LLM-Rec (Lyu et al., 2024), DEALRec (Lin et al., 2024a), BigRec (Bao et al., 2023), DreamRec (Yang et al., 2024e)
User documents Information seeking Writing Assistant P-RLHF (Li et al., 2024g), ComPO (Kumar et al., 2024b), REST-PG (Salemi et al., 2025b), RSPG (Salemi et al., 2024a), Hydra (Zhuang et al., 2024), PEARL (Mysore et al., 2024), Panza (Nicolicioiu et al., 2024)
User profiles Dialogue System PAED (Zhu et al., 2023b), BoB (Song et al., 2021), UniMS-RAG (Wang et al., 2024e), ORIG (Chen et al., 2023b)
User Simulation Drama Machine (Magee et al., 2024), Character-LLM (Shao et al., 2023), RoleLLM (Wang et al., 2024f)
Image (Section 3.2) User behaviors General-purpose generation PMG (Shen et al., 2024b), Pigeon (Xu et al., 2024c), PASTA (Nabati et al., 2024)
Fashion design DiFashion (Xu et al., 2024b), Yu et al. (2019)
E-commerce product image Vashishtha et al. (2024)
User profiles Fashion design LVA-COG (Forouzandehmehr et al., 2023)
E-commerce product image CG4CTR (Yang et al., 2024a)
Personalized subjects Subject-driven T2I generation Textual Inversion (Gal et al., 2023), DreamBooth (Ruiz et al., 2023), Custom Diffusion (Kumari et al., 2023)
Personal face/body Face generation PhotoMaker (Li et al., 2024l), InstantBooth (Shi et al., 2024), InstantID (Wang et al., 2024g)
Virtual try-on IDM-VTON (Choi et al., 2025), OOTDiffusion (Xu et al., 2024d), OutfitAnyone (Sun et al., 2024a)
Video (Section 3.3) Personalized subjects Subject-driven T2V generation AnimateDiff (Guo et al., 2024a), AnimateLCM (Wang et al., 2024b), PIA (Zhang et al., 2024i)
Personal face/body ID-preserving T2V generation Magic-Me (Ma et al., 2024c), ID-Animator (He et al., 2024b), ConsisID (Yuan et al., 2024)
Talking head generation DreamTalk (Ma et al., 2023), EMO (Tian et al., 2025), MEMO (Zheng et al., 2024b)
Pose-guided video generation DisCo (Wang et al., 2023b), AnimateAnyone (Hu, 2024), MagicAnimate (Xu et al., 2024f)
Video virtual try-on ViViD (Fang et al., 2024b), VITON-DiT (Zheng et al., 2024a), WildVidFit (He et al., 2025)
3D (Section 3.4) Personalized subjects Image-to-3D generation MVDream (Shi et al., 2023b), DreamBooth3D (Raj et al., 2023), Wonder3D (Long et al., 2024)
Personal face/body 3D face generation PoseGAN (Zhang et al., 2021a), My3DGen (Qi et al., 2023a), DiffSpeaker (Ma et al., 2024d)
3D human pose generation FewShotMotionTransfer (Huang et al., 2021), PGG (Hu et al., 2023), 3DHM (Li et al., 2024a), DreamWaltz (Huang et al., 2024c)
3D virtual try-on Pergamo (Casado-Elvira et al., 2022), DreamVTON (Xie et al., 2024)
Audio (Section 3.6) Personal face Face-to-speech generation VoiceMe (van Rijn et al., 2022), FR-PSS (Wang et al., 2022a), Lip2Speech (Sheng et al., 2023)
User behaviors Music generation UMP (Ma et al., 2022), UP-Transformer (Hu et al., 2022), UIGAN (Wang et al., 2024k)
Personalized subjects Text-to-audio generation DiffAVA (Mo et al., 2023), TAS (Li et al., 2024k)
Cross-Modal (Section 3.7) User behaviors Robotics VPL (Poddar et al., 2024), Promptable Behaviors (Hwang et al., 2024)
User documents Caption/Comment generation PV-LLM (Lin et al., 2024b), PVCG (Wu et al., 2024e), METER (Geng et al., 2022)
Personalized subjects Cross-modal dialogue systems MyVLM (Alaluf et al., 2025), Yo'LLaVA (Nguyen et al., 2024b), MC-LLaVA (An et al., 2024)

This table (Table 1 from the original paper) systematically categorizes the field of PGen by modality, personalized contexts, tasks, and representative works. It clearly shows the breadth of research and the specific techniques being applied in each area. For example, in Text Generation, User Behaviors are leveraged for Recommendation tasks using models like LLM-Rec. In Image Generation, Personalized Subjects are key for Subject-driven T2I generation with Textual Inversion and DreamBooth. This structured view demonstrates the fragmentation but also the underlying commonalities of PGen approaches across different data types.

6.3. Ablation Studies / Parameter Analysis

The survey paper itself does not conduct ablation studies or parameter analysis. However, it references numerous individual research papers that do perform such analyses. Across the surveyed literature, common themes in ablation studies for PGen often include:

  • Impact of Personalized Contexts: Studies frequently evaluate the contribution of different personalized context types (e.g., user profiles vs. user behaviors vs. personalized subjects) to the overall personalization effect. For instance, in image generation, ablating the personalized subject input would typically result in generic images rather than identity-preserving ones.

  • Effectiveness of Guidance Mechanisms: Researchers often conduct ablations on instruction guidance (e.g., removing in-context learning or specific prompt engineering elements) and structural guidance (e.g., removing adapters or cross-attention modules) to quantify their contribution to instruction alignment and personalization.

  • Choice of Optimization Strategy: Comparisons between tuning-free methods, supervised fine-tuning (full vs. PEFT), and preference-based optimization (RLHF vs. DPO) are common. These studies often analyze the trade-offs between performance, computational cost, and data requirements. For example, PEFT methods are often shown to achieve comparable personalization to full fine-tuning with significantly fewer trainable parameters; a minimal LoRA sketch illustrating this parameter-efficiency argument is given at the end of this subsection.

  • Role of Multimodal Components: In MLLMs and cross-modal tasks, ablations might investigate the contribution of specific encoders (e.g., CLIP image encoder) or fusion mechanisms to understanding multimodal personalized instructions.

    These studies are crucial for understanding which components are most effective for different PGen tasks and for optimizing the balance between personalization, quality, efficiency, and robustness.
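
To make the PEFT-versus-full-fine-tuning trade-off concrete, the sketch below uses the Hugging Face `peft` library to attach a LoRA adapter to a small language model. The base model, target modules, and hyperparameters are illustrative assumptions rather than choices made by any surveyed paper; the point is simply that only a small fraction of parameters becomes trainable.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any decoder-only LLM follows the same pattern.
base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8,                         # low-rank dimension of the adapter matrices
    lora_alpha=16,               # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],   # attention projection in GPT-2; module names are model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

# Typically reports well under 1% trainable parameters, which is the efficiency
# argument ablation studies make when comparing PEFT against full fine-tuning.
model.print_trainable_parameters()
```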

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey provides the first comprehensive overview of Personalized Generation (PGen) in the era of large models. It successfully unifies diverse research efforts by proposing a holistic framework that formalizes key components (personalized contexts, multimodal instructions), core objectives (high quality, instruction alignment, personalization), and a general workflow (user modeling, generative modeling). The paper introduces a multi-level taxonomy that categorizes PGen methods across modalities (text, image, video, audio, 3D, cross-modal) and personalized contexts, reviewing technical advancements, datasets, and evaluation metrics. Furthermore, it highlights the vast potential applications of PGen in content creation and delivery, while also identifying critical open problems and future research directions. By bridging knowledge across different modalities, the survey serves as a foundational resource for fostering interdisciplinary collaboration and advancing the personalized digital landscape.

7.2. Limitations & Future Work

The authors acknowledge several limitations and outline promising directions for future research:

7.2.1. Technical Challenges

  • Scalability and Efficiency: Large generative models are resource-intensive. Future research needs to focus on developing scalable and efficient algorithms for PGen to enable real-time, large-scale deployment.
  • Deliberative Reasoning for PGen: Current approaches may lack deep reasoning. Inspired by LLM reasoning, future PGen could incorporate more extensive logical and contextual analysis to enhance content personalization, especially in scenarios where quality and deep understanding outweigh real-time speed (e.g., digital advertising).
  • Evolving User Preference: User preferences are dynamic. PGen models need to adapt to track and respond to shifts in user preferences over time, an area well-studied in traditional recommender systems but still nascent in generative contexts.
  • Mitigating Filter Bubbles: PGen risks reinforcing existing preferences, potentially leading to polarization. Future work should explore strategies like diversity enhancement, user-controllable inference, and multi-agent generation to introduce diverse perspectives.
  • User Data Management: Effective and efficient lifelong management of user data (collection, storage, structuring, and user-controllable personalization) remains an open problem.
  • Multi-modal Personalization: Most PGen research focuses on single modalities. Developing methods for consistent and personalized generation across multiple modalities (e.g., a personalized social media post with both image and text) is a significant challenge, though emerging unified multimodal models offer promising avenues.
  • Synergy Between Generation and Retrieval: Integrating PGen with traditional retrieval-based content delivery systems (like recommender systems) holds potential to create more powerful systems that not only find existing content but also generate new, tailored content when needed.

7.2.2. Benchmarks and Metrics

  • Current evaluation metrics (e.g., BLEU for text, CLIP-I for images) often fail to fully capture the nuanced alignment with user preferences. Developing more robust and effective metrics specifically for personalization is crucial.
  • The rapid evolution of the field necessitates continuous updates to taxonomies, datasets, and benchmarks.

7.2.3. Trustworthiness

  • Privacy: The reliance on user-specific data raises significant privacy concerns. Balancing effective personalization with robust privacy protection (e.g., on-device personalization, federated learning, differential privacy, adversarial perturbation) is a critical area.
  • Fairness and Bias: PGen models can inadvertently perpetuate biases and stereotypes present in training data. Research on debiasing strategies (e.g., context steering, structured prompts, causal-guided biased sample identification) is vital to ensure fair and equitable outcomes.
  • Safety: Establishing transparent governance protocols, reliable moderation mechanisms, and explainable generation processes is essential to maintain user trust and uphold safety standards in PGen.

7.3. Personal Insights & Critique

This survey is a timely and valuable contribution to the rapidly expanding field of personalized generation. Its primary strength lies in its unified, modality-agnostic perspective, which is crucial for fostering cross-disciplinary understanding and collaboration. By abstracting PGen into core components of user modeling and generative modeling, it provides a mental model for researchers to approach personalization regardless of the data type.

One significant insight drawn from this paper is the growing sophistication of user modeling. Moving beyond simple user profiles, the incorporation of user behaviors, documents, and even personal subjects like faces or objects, highlights a trend towards fine-grained personalization. The emphasis on preference-based optimization (RLHF, DPO) also underscores the increasing recognition that personalization is often subjective and best captured through direct human feedback.

The paper's discussion on potential applications across content creation and delivery showcases the immense economic and societal value of PGen. For instance, personalized AI assistants, dynamic gaming experiences, and tailored e-learning could fundamentally change how individuals interact with digital content and services.

However, a potential critique, which the authors themselves partially address as a limitation, is the inherent difficulty in establishing truly universal benchmarks and metrics for personalization. While metrics like ExPerT are emerging, the subjective nature of "preference" means that human evaluation will likely remain indispensable, but also costly and inconsistent. The survey could further emphasize the need for standardized and scalable human evaluation protocols to complement objective metrics.

Another area for deeper exploration, especially given the "Large Model Era" in the title, could be a more detailed analysis of how model size and architectural choices within foundation models specifically enable or constrain personalization. For example, how do specific properties of Transformer vs. Diffusion models lend themselves to different types of personalization challenges?

Finally, while the paper highlights safety, fairness, and privacy as open challenges, the practical implementation of solutions (e.g., how to "unlearn" personal data from deeply embedded models, or robustly mitigate biases in highly specific user contexts) will be a complex and active research area. This survey lays an excellent foundation, and future work will undoubtedly build on its comprehensive overview to tackle these nuanced problems. The presented framework can indeed be applied to other domains requiring personalized AI, serving as a blueprint for designing user-centric generative systems across various industries.
