A Survey on Generative Recommendation: Data, Model, and Tasks

TL;DR Summary

This survey reviews generative recommendation through a unified data-model-task framework, analyzing data augmentation and unification, model alignment and training, and task design, and highlighting how large language models and diffusion models enable world-knowledge integration, natural language understanding, and personalized content generation.

Abstract

Recommender systems serve as foundational infrastructure in modern information ecosystems, helping users navigate digital content and discover items aligned with their preferences. At their core, recommender systems address a fundamental problem: matching users with items. Over the past decades, the field has experienced successive paradigm shifts, from collaborative filtering and matrix factorization in the machine learning era to neural architectures in the deep learning era. Recently, the emergence of generative models, especially large language models (LLMs) and diffusion models, has sparked a new paradigm: generative recommendation, which reconceptualizes recommendation as a generation task rather than discriminative scoring. This survey provides a comprehensive examination through a unified tripartite framework spanning data, model, and task dimensions. Rather than simply categorizing works, we systematically decompose approaches into operational stages: data augmentation and unification, model alignment and training, and task formulation and execution. At the data level, generative models enable knowledge-infused augmentation and agent-based simulation while unifying heterogeneous signals. At the model level, we taxonomize LLM-based methods, large recommendation models, and diffusion approaches, analyzing their alignment mechanisms and innovations. At the task level, we illuminate new capabilities including conversational interaction, explainable reasoning, and personalized content generation. We identify five key advantages: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. We critically examine challenges in benchmark design, model robustness, and deployment efficiency, while charting a roadmap toward intelligent recommendation assistants that fundamentally reshape human-information interaction.

1. Bibliographic Information

1.1. Title

The title of the paper is "A Survey on Generative Recommendation: Data, Model, and Tasks." This title clearly indicates that the paper provides a comprehensive review of the emerging field of generative recommendation, organized around three key dimensions: data, models, and tasks.

1.2. Authors

The authors are:

  • Min Hou, Le Wu, Yuxin Liao, Zhen Zhang, Yu Wang, Changlong Zheng, Han Wu, and Richang Hong from Hefei University of Technology, Hefei, 230009, Anhui, China.

  • Yonghui Yang from National University of Singapore, Singapore.

    Their affiliations suggest a strong background in computer science, likely with a focus on artificial intelligence, machine learning, and recommender systems, given the topic of the paper.

1.3. Journal/Conference

The paper is published on arXiv, with a publication date of October 31, 2025 (UTC timestamp 2025-10-31T04:02:58.000Z). arXiv is a well-regarded preprint server for physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While it hosts preprints, many papers later appear in top-tier conferences and journals. The paper itself mentions searching top-tier conferences and journals such as ICML, ICLR, NeurIPS, ACL, SIGIR, KDD, WWW, RecSys, TKDE, and TOIS for related works, indicating the expected quality and relevance of its content to these venues.

1.4. Publication Year

The publication year, based on the provided UTC timestamp, is 2025.

1.5. Abstract

Recommender systems (RSs) are crucial for helping users find relevant items in modern digital ecosystems. Traditionally, RSs focused on matching users with items through discriminative scoring. The field has evolved from collaborative filtering and matrix factorization to neural architectures. Recently, the rise of generative models, particularly large language models (LLMs) and diffusion models, has introduced a new paradigm: generative recommendation, which re-conceptualizes recommendation as a generation task.

This survey offers a comprehensive examination of this new paradigm using a unified tripartite framework that covers data, model, and task dimensions. Instead of merely categorizing works, it systematically decomposes approaches into operational stages: data augmentation and unification, model alignment and training, and task formulation and execution.

At the data level, generative models enable knowledge-infused augmentation and agent-based simulation, while unifying heterogeneous signals. At the model level, the survey taxonomizes LLM-based methods, large recommendation models, and diffusion approaches, analyzing their alignment mechanisms and innovations. At the task level, it highlights new capabilities such as conversational interaction, explainable reasoning, and personalized content generation.

The paper identifies five key advantages of generative recommendation: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. It critically examines challenges in benchmark design, model robustness, and deployment efficiency, concluding with a roadmap toward intelligent recommendation assistants that aim to fundamentally reshape human-information interaction.

Official Source Link: https://arxiv.org/abs/2510.27157 PDF Link: https://arxiv.org/pdf/2510.27157v1.pdf Publication Status: This is a preprint published on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the fundamental challenge of matching users with items in recommender systems (RSs), which are foundational infrastructure in modern information ecosystems. RSs are critical for alleviating information overload, helping users discover digital content aligned with their preferences, and boosting traffic and revenue for service providers across diverse domains like e-commerce, social media, education, online video, and music services.

Despite significant advancements over the past decades, driven by deep learning architectures and large-scale user behavior data, traditional RSs (primarily discriminative matching paradigms) face several challenges:

  • Limited Semantic Knowledge: They often rely on manually processed and limited semantic knowledge.

  • Sub-optimal Performance for Small-Scale Models: Their performance is often capped by limited model scale.

  • Dependence on Fixed Candidate Sets: They require a pre-defined set of items to choose from.

  • Task-Specific Architectures: They often need task-specific architectures and training objectives.

  • Cold-Start Scenarios: They struggle to provide recommendations for new users or items with little historical data.

  • Lack of Transparency: They find it difficult to offer transparent, context-rich explanations for their recommendations.

    The paper's entry point and innovative idea stem from the recent emergence of generative models, particularly Large Language Models (LLMs) and diffusion models. These models have sparked a new paradigm called generative recommendation. This paradigm reconceptualizes the recommendation problem from a discriminative scoring procedure (i.e., predicting relevance) to a generation task (i.e., directly synthesizing recommendations). This shift promises to address the aforementioned challenges and unlock new capabilities in RSs.

2.2. Main Contributions / Findings

The paper provides a comprehensive examination of the generative recommendation paradigm through a unified tripartite framework spanning data, model, and task dimensions. Its primary contributions and key conclusions are:

  • Unified Tripartite Framework: The survey proposes a novel framework that systematically decomposes generative recommendation approaches into operational stages: data augmentation and unification, model alignment and training, and task formulation and execution. This provides a structured understanding of how generative models influence the entire recommendation pipeline.
  • Data-Level Innovations: Generative models enable:
    • Knowledge-infused augmentation: Leveraging vast world knowledge to enrich sparse recommendation data and item representations.
    • Agent-based simulation: Simulating user behaviors and interactions to address data sparsity and cold-start issues.
    • Unification of heterogeneous signals: Integrating diverse data types (multi-domain, multi-task, multi-modal) into coherent inputs, moving towards "one model for all."
  • Model-Level Innovations: The survey taxonomizes and analyzes three main approaches:
    • LLM-based methods: Using pre-trained LLMs as recommendation backbones.
    • Large Recommendation Models (LRMs): Scaling up traditional recommendation architectures with generative components, demonstrating scaling laws native to recommendation.
    • Diffusion approaches: Reconceptualizing recommendation as a denoising process to generate user preferences or item rankings.
  • Task-Level Innovations: Generative models unlock new capabilities beyond traditional top-K recommendations, including:
    • Conversational interaction: Enabling multi-turn, natural language dialogues for dynamic preference elicitation.
    • Explainable reasoning: Providing transparent justifications and logical processes behind recommendations.
    • Personalized content generation: Creating novel content (e.g., personalized text, visual designs) rather than just ranking existing items.
  • Five Key Advantages of Generative Recommendation:
    1. World knowledge integration: Seamlessly incorporating real-world semantic information.
    2. Natural language understanding: Interpreting nuanced user expressions in free-form text.
    3. Reasoning capabilities: Modeling logical processes behind user decisions.
    4. Scaling laws: Predicting performance improvement with increased model size and data.
    5. Creative generation: Producing novel content and diverse recommendations.
  • Identified Challenges and Roadmap: The paper critically examines significant challenges in benchmark design (current datasets are ill-suited to evaluating generative capabilities), model robustness (bias and adversarial attacks), and deployment efficiency (training and inference costs). It charts a roadmap towards intelligent recommendation assistants that are open, task-agnostic, and responsive to evolving user needs, fundamentally reshaping human-information interaction.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following foundational concepts in recommender systems and machine learning:

  • Recommender Systems (RSs): At its core, a recommender system aims to predict a user's preference for an item and suggest items that the user is likely to enjoy. They address information overload by filtering content and providing personalized suggestions. Examples include product recommendations on e-commerce sites (e.g., Amazon), movie suggestions (e.g., Netflix), or news feeds (e.g., social media).

  • Collaborative Filtering (CF): This is a traditional approach where recommendations are based on the tastes of other users. If user A and user B have similar preferences (e.g., they both liked the same movies), then user A might be recommended movies that user B liked but A hasn't seen yet. There are two main types: user-based CF (finding similar users) and item-based CF (finding similar items).

  • Matrix Factorization (MF): A powerful technique in CF, especially popularized by the Netflix Prize. It works by decomposing the user-item interaction matrix (where rows are users, columns are items, and entries are ratings) into two lower-dimensional matrices: a user-factor matrix and an item-factor matrix.

    • User-factor matrix: Represents users in a latent space, where each dimension captures a certain aspect of preference.
    • Item-factor matrix: Represents items in the same latent space. The dot product of a user's latent vector and an item's latent vector then predicts the rating the user would give to that item. This approach helps in uncovering hidden latent features or factors that explain observed ratings.
  • Deep Learning (DL) in RSs: The application of neural networks to recommender systems. DL models (such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Graph Neural Networks (GNNs), and Transformers) can learn complex, non-linear relationships and rich representations from diverse data types (text, images, social networks, knowledge graphs). This allows them to capture intricate semantics of user-item interactions and improve user/item representation learning.

  • Discriminative Models vs. Generative Models: This is a crucial distinction made in the paper.

    • Discriminative Models: These models learn to map inputs to outputs or to distinguish between different classes. In the context of RSs, a discriminative model learns a function $f(u, i)$ that predicts the probability or score of a user $u$ interacting with an item $i$. They model the conditional probability $P(y|x)$ (the probability of output $y$ given input $x$). For example, predicting whether a user will click on an item or what rating they will give.
    • Generative Models: These models learn the underlying distribution of the data itself. They can generate new data samples that are similar to the training data. In RSs, a generative model learns how users and items are "generated" together, i.e., the joint probability distribution $P(x, y)$. This allows them to directly generate recommended items or personalized content, rather than just scoring existing ones.
  • Large Language Models (LLMs): These are a type of generative model that has gained significant attention recently. LLMs are neural network models with a massive number of parameters (billions to trillions) trained on vast amounts of textual data (e.g., internet text, books). Their core ability is to generate human-like text by predicting the next word in a sequence. They exhibit emergent abilities such as in-context learning (learning from examples within the prompt), complex reasoning, and natural language understanding, making them highly versatile for various tasks beyond pure text generation.

  • Diffusion Models: Another type of generative model primarily known for generating high-quality images. The core idea is to learn to reverse a gradual noise process. During training, noise is progressively added to an image until it becomes pure noise. The model then learns to reverse this process, denoising the image step-by-step to reconstruct the original. This allows them to generate diverse and realistic data. In RSs, they can be adapted to generate user preferences, item rankings, or even synthetic user data by treating recommendation as a denoising problem.
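To make the denoising idea concrete, here is a minimal, self-contained numeric sketch (not from the paper; `forward_noise` and `reverse_step` are illustrative names) of the forward and reverse processes on a single vector. A real diffusion recommender would train a neural network to predict the noise; here the true noise is passed back in to show how a perfect prediction recovers the clean signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, t, T):
    """Forward process: interpolate the clean vector toward pure Gaussian noise.

    alpha_bar shrinks from ~1 (t=0, almost clean) to ~0 (t=T, almost pure noise).
    """
    alpha_bar = 1.0 - t / T
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

def reverse_step(x_t, eps_pred, t, T):
    """Reverse process: undo the noising given a (here, perfect) noise prediction.

    In a trained model, eps_pred would come from a network epsilon_theta(x_t, t).
    """
    alpha_bar = 1.0 - t / T
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

x0 = rng.normal(size=8)                      # e.g., a user-preference embedding
x_t, eps = forward_noise(x0, t=300, T=1000)  # partially noised version
x0_hat = reverse_step(x_t, eps, t=300, T=1000)
print(np.allclose(x0, x0_hat))               # True: exact eps prediction recovers x0
```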

3.2. Previous Works

The paper frames itself as a comprehensive survey that builds upon, yet differentiates from, several prior reviews on LLM-based recommender systems. These previous works, primarily published in 2024 or earlier, have set the stage for understanding the intersection of LLMs and RSs:

  • Wu et al. [187] (2024): This survey systematically reviewed LLM-based recommendation systems, categorizing studies based on modeling paradigms: LLM Embeddings enhanced RS, LLM Tokens enhanced RS, and LLM as RS. This provided an early structural understanding of how LLMs were being integrated.

  • Lin et al. [92] (2025): They introduced two orthogonal perspectives: where (e.g., for data augmentation, as core model) and how (e.g., tuning LLM, involving conventional models) to adapt LLMs in recommender systems. This offered a more detailed view of the integration strategies.

  • Zhao et al. [233] (2024): Reviewed LLM-empowered recommender systems from various aspects, including pre-training, fine-tuning, and prompting paradigms. This focused on the training and deployment aspects of LLMs in RSs.

  • Deldjoo et al. [22] (2024): Connected key advancements in RS using Generative Models (Gen-RecSys), covering interaction-driven generative models, LLM and textual data for natural language recommendation, and multimodal models for generating/processing images/videos. This was a broader view encompassing various generative models.

  • Liu et al. [106] (2024): Explored advancements in multimodal pretraining, adaptation, and generation techniques, and their applications to recommender systems. This highlighted the growing importance of multimodal data.

  • Li et al. [80] (2023): Reviewed recent progress of LLM-based generative recommendation and provided a general formulation for each generative recommendation task. This focused on task-specific applications.

  • Wang et al. [170] (2024): Reviewed existing LLM-based recommendation works and discussed the gap from academic research to industrial application. This provided a practical perspective.

    While these surveys laid important groundwork, the current paper emphasizes that they largely reflect the state of the field up until 2024, potentially overlooking newer research, especially from 2025 and beyond.

3.3. Technological Evolution

The evolution of recommender systems can be broadly outlined as follows:

  1. Early Heuristics (1990s): Initial systems relied on content-based filtering (recommending items similar to those a user liked previously, based on item attributes) and collaborative filtering (CF, recommending based on user-item interaction patterns, e.g., similar users like similar items). These were often rule-based or used simple statistical methods.

  2. Machine Learning Era (2000s): This era was marked by the prominence of Matrix Factorization (MF), especially after the Netflix Prize. MF provided a more sophisticated mathematical framework for CF, learning latent factors that explain user preferences. This shifted focus to learning underlying representations.

  3. Deep Learning Era (Mid-2010s onwards): Advancements in neural networks (CNNs, RNNs, GNNs, Transformers) led to deep learning-based recommendation methods. These approaches leveraged the powerful representation capabilities of deep learning to handle complex, heterogeneous data (text, images, social networks) and learn non-linear mappings of user-item interactions, enhancing the learning of user and item representations. This moved RSs from shallow models to deeper, more complex architectures.

  4. Generative AI Era (Recently): The latest paradigm shift, driven by Large Language Models (LLMs) and diffusion models. This era re-conceptualizes recommendation from a discriminative matching (scoring) paradigm to a generative synthesis (creating) paradigm. Instead of predicting a score for an existing item, the goal becomes to directly generate the target document or item itself. This fundamentally changes how recommendations are modeled and opens up new possibilities for content creation, interactive experiences, and leveraging vast world knowledge.

    This paper's work fits squarely into the current Generative AI Era, providing a timely and comprehensive overview of how these cutting-edge generative models are transforming the entire RS pipeline.

3.4. Differentiation Analysis

Compared to the main methods and prior surveys in related work, this paper's core differences and innovations are:

  1. Broader Coverage of Generative Paradigms: While earlier surveys primarily focused on LLM-based recommendation, this survey offers a more expansive view by categorizing research into LLM-based generative recommendation, large recommendation models (LRMs), and diffusion-based generative recommendation. This inclusive approach ensures that the latest advancements across different generative model types are captured.

  2. Unified Data-Model-Task Framework: This survey introduces a data-model-task framework, which is a systematic and comprehensive way to analyze the contributions of generative models. This goes beyond simple categorization by providing a pipeline-centric understanding:

    • Data Level: How generative models enhance or augment data (e.g., data generation, data unification).
    • Model Level: How generative mechanisms are integrated into the core recommendation architecture.
    • Task Level: How generative modeling extends to high-level objectives and novel capabilities. This framework provides a more holistic and in-depth understanding of the evolving role of generative models.
  3. Emphasis on Task-Level Innovations: The survey dedicates a significant section to task-level innovations, examining how generative models enable novel recommendation scenarios. This includes interactive recommendation, conversational recommendation, and personalized content generation, areas that were often underexplored or not highlighted as central to generative RSs in prior works.

  4. Up-to-Date Roadmap and Challenges: The paper incorporates the latest research up to 2025 and beyond, addressing new developments such as agent-based recommender systems using LLMs and diverse LLM-based recommendation methods beyond supervised fine-tuning (SFT). It concludes with an updated discussion on current challenges (e.g., benchmark design, model robustness, deployment efficiency) and future research directions, offering a more current roadmap for the field.

    In essence, this survey provides a more structured, comprehensive, and forward-looking perspective on generative recommendation, integrating diverse model types and operational stages within a unified analytical framework.

4. Methodology

4.1. Principles

The core idea behind the methodology outlined in this survey is to shift the paradigm of recommender systems from discriminative matching to generative synthesis. Traditionally, recommender systems operate by learning a scoring or ranking function to estimate the relevance of existing items to a user. In contrast, the generative recommendation paradigm reconceptualizes this problem as a generation task, where the system directly produces the target item or recommendation output.

The theoretical basis for this shift lies in the fundamental difference between discriminative and generative models:

  • Discriminative Models: Learn the conditional probability $P(y|x)$, or directly map input $x$ to predict output $y$. In recommendation, this means learning $P(\text{relevance} \mid \text{user}, \text{item})$ or a function $f(u,i)$ that scores the likelihood of interaction.

  • Generative Models: Learn the joint probability distribution $P(x, y)$, meaning they model how both the input $x$ (e.g., user preferences) and the label $y$ (e.g., item) are generated together. This allows them to synthesize new data points (items) or sequences (recommendation lists).

    The intuition is that by embracing generative capabilities, recommender systems can move beyond merely selecting from a fixed set of candidates to actively creating personalized content, engaging in dynamic conversations, leveraging vast world knowledge, and exhibiting emergent reasoning capabilities—qualities inherent to powerful generative models like LLMs and diffusion models. This survey systematically examines how this generative capability is applied across the entire recommendation pipeline: at the data level (how data is prepared and enriched), the model level (how the core recommendation engine is built), and the task level (what new functionalities and applications become possible).

4.2. Core Methodology In-depth (Layer by Layer)

The paper deconstructs generative recommendation through its proposed tripartite framework: data, model, and task. To properly contextualize, it first revisits the traditional discriminative recommendation paradigm.

4.2.1. Preliminaries of Discriminative Recommendation Models

Discriminative recommendation models focus on learning a scoring or ranking function $f(u,i)$ that estimates the relevance or affinity between a user $u$ and an item $i$.

Data Preparation: The training data consists of tuples $D = \{(u, i, y_{ui})\}$, where $u$ represents a user, $i$ represents an item, and $y_{ui}$ denotes the observed interaction. This interaction can be a rating (for explicit feedback) or a binary value in $\{0, 1\}$ (for implicit feedback). Inputs $u$ and $i$ can be one-hot IDs in collaborative filtering methods. Auxiliary content data such as user social networks, profiles, multimedia descriptions (images, videos, texts, audio), and knowledge graphs are often used to enrich $u$ and $i$.

Model Construction: At training time, discriminative recommendation methods typically start by using embedding layers to map each user and item to a dense embedding vector: $\mathbf{e}_u = \phi_u(u)$ and $\mathbf{e}_i = \phi_i(i)$. Here, $\phi_u(\cdot)$ and $\phi_i(\cdot)$ are embedding layers for users and items, which can be simple lookup tables or more complex architectures such as Multi-Layer Perceptrons (MLPs), Graph Neural Networks (GNNs), Convolutional Neural Networks (CNNs), or Transformers.

Then, these models compute a matching score between the user and item embeddings: $f_{ui} = \mathrm{Score}(\mathbf{e}_u, \mathbf{e}_i)$. Common scoring functions include the inner product, distance-based metrics, and neural network-based metrics.

The models are usually trained to discriminate between positive and negative interactions. Commonly used loss functions include:

  • Mean Squared Error (MSE) loss $\mathcal{L}_{\mathrm{rating}}$ for explicit feedback (e.g., predicting ratings): $\mathcal{L}_{\mathrm{rating}} = \frac{1}{N} \sum_{i=1}^{N} (y_{ui} - f(u,i))^2$. Here, $N$ is the total number of observed interactions, $y_{ui}$ is the observed interaction (e.g., rating) between user $u$ and item $i$, and $f(u,i)$ is the predicted interaction score.
  • Binary Cross Entropy (BCE) loss $\mathcal{L}_{\mathrm{point}}$ for implicit feedback (e.g., predicting clicks): $\mathcal{L}_{\mathrm{point}} = - \sum_{(u,i) \in D} [y_{ui} \log \sigma (f_{ui}) + (1-y_{ui}) \log (1-\sigma (f_{ui}))]$. In this formula, $D$ is the dataset of observed interactions, $y_{ui}$ is typically 1 for a positive interaction and 0 for a negative one, $f_{ui}$ is the raw predicted score, and $\sigma(\cdot)$ is the sigmoid function, which squashes the score into a probability between 0 and 1.
  • Bayesian Personalized Ranking (BPR) loss $\mathcal{L}_{\mathrm{pair}}$ for implicit feedback, which optimizes for ranking rather than absolute scores: $\mathcal{L}_{\mathrm{pair}} = - \sum_{(u,i^+,i^-) \in D} \log \sigma (f_{ui^+} - f_{ui^-})$. Here, $D$ represents triplets of (user $u$, positive item $i^+$, negative item $i^-$), where $i^+$ is an item user $u$ interacted with and $i^-$ is an item user $u$ did not interact with. The objective is to maximize the gap between the scores of the positive and negative items for a given user.

Recommendation Task: For discriminative systems, the final recommendation task is mostly to select the top-K items that users might like from a candidate list. This is referred to as top-K recommendation. At inference time, given a user $u$ and a candidate item set $I$, the models compute scores and rank the items: $\hat{i} = \arg\max_{i \in I} f(u,i)$ and $\mathrm{TopK}_u = \mathrm{Top\text{-}K}_{i \in I} f(u,i)$. Here, $\hat{i}$ represents the item with the highest predicted score for user $u$, and $\mathrm{TopK}_u$ denotes the set of top-K items ranked by their predicted scores $f(u,i)$ for user $u$. This process requires calculating a matching score for each item in the candidate list, then ranking and selecting the top items.
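As a concrete illustration of this discriminative pipeline (embedding lookup, inner-product scoring, BPR training, top-K ranking), here is a minimal PyTorch sketch; the dimensions and interaction triplets are toy placeholders, not data from the paper.

```python
import torch
import torch.nn.functional as F

n_users, n_items, dim = 100, 500, 32
user_emb = torch.nn.Embedding(n_users, dim)   # phi_u: user ID -> dense vector e_u
item_emb = torch.nn.Embedding(n_items, dim)   # phi_i: item ID -> dense vector e_i
opt = torch.optim.Adam(
    list(user_emb.parameters()) + list(item_emb.parameters()), lr=1e-2)

# One BPR step on toy triplets (u, i_pos, i_neg)
u = torch.randint(0, n_users, (64,))
i_pos = torch.randint(0, n_items, (64,))
i_neg = torch.randint(0, n_items, (64,))

f_pos = (user_emb(u) * item_emb(i_pos)).sum(-1)   # inner-product score f_{ui+}
f_neg = (user_emb(u) * item_emb(i_neg)).sum(-1)   # inner-product score f_{ui-}
loss = -F.logsigmoid(f_pos - f_neg).mean()        # BPR: rank positives above negatives
opt.zero_grad(); loss.backward(); opt.step()

# Inference: score every candidate item for one user, then take top-K
with torch.no_grad():
    scores = user_emb(torch.tensor([7])) @ item_emb.weight.T   # shape (1, n_items)
    topk = torch.topk(scores, k=10).indices                    # TopK_u
```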

4.2.2. Generative Recommendation Models

In contrast to the discriminative approach, generative recommendation is defined as a broad paradigm where generative models (such as LLMs, diffusion models) are utilized across various stages of the recommendation pipeline. This paradigm can still be categorized into three major phases: data, model, and tasks.

Data-Level Synthesis: Generative models are used to synthesize training data, including both user/item features and interaction records. This is particularly useful for addressing challenges such as cold-start problems (where there is insufficient data for new users or items) or sparsity in datasets. Formally, let $\mathcal{V}$ represent the original user set, $I$ the original item set, and $\mathcal{Y}$ the original interaction set. Generative models produce $\mathcal{V}', I', \mathcal{Y}' = G_{\mathrm{data}}(\mathcal{V}, I, \mathcal{Y} \mid \theta_g)$. Here, $G_{\mathrm{data}}$ is a generative model parameterized by $\theta_g$, which generates synthetic user features ($\mathcal{V}'$), item features ($I'$), and interaction records ($\mathcal{Y}'$) to create an enriched training dataset for recommendation models.
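A minimal sketch of this data-level idea, assuming a generic chat-completion client (`llm_complete` and `synthesize_interactions` are hypothetical stand-ins, not APIs from the paper): an LLM playing the role of $G_{\mathrm{data}}$ is prompted with a cold-start user's profile and asked to emit plausible synthetic interaction records.

```python
import json

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM call; swap in any chat-completion client."""
    raise NotImplementedError

def synthesize_interactions(user_profile: dict, catalog_sample: list[str],
                            k: int = 5) -> list[dict]:
    """G_data: generate k plausible (item, label) records for a cold-start user."""
    prompt = (
        "Given this user profile and a sample of the item catalog, list "
        f"{k} items the user would plausibly interact with, as a JSON array "
        'of {"item": ..., "label": 1} objects.\n'
        f"Profile: {json.dumps(user_profile)}\n"
        f"Catalog sample: {json.dumps(catalog_sample)}"
    )
    return json.loads(llm_complete(prompt))

# The synthetic records Y' would then be mixed into the real training set Y.
```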

Model-Level Recommendation: At the model level, generative models serve as the core recommendation engine, directly learning user preferences and generating personalized recommendations. Current mainstream approaches fall into three categories:

  • LLM-based approaches: Leverage pre-trained LLMs as recommendation backbones. They convert recommendation data into samples with textual input and output, modeling the recommendation task as a neural language generation process.
  • Large Recommendation Models (LRMs): Scale up traditional recommendation architectures with generative components. These models use massive parameters to model complex user-item interactions and generate high-quality recommendations, often exhibiting scaling laws similar to LLMs.
  • Diffusion model-based approaches: Treat recommendation as a denoising process. They learn to generate user preferences or item rankings through iterative refinement, transforming noise into meaningful recommendation signals.

Task-Level Generation: At the task level, generative models reformulate recommendation as a generation task that produces outputs in natural language or structured formats. This paradigm not only addresses traditional tasks (like sequential recommendation or click-through rate prediction) through generative means but also enables novel capabilities:

  • Generating personalized explanations for recommendations.

  • Creating conversational recommendation dialogues.

  • Producing item reviews and descriptions.

  • Creating virtual items or multi-modal recommendation content (combining text, images, etc.).

    This opens new possibilities for interpretable and interactive recommendation experiences.

4.2.3. Data-Level Opportunities

LLMs provide new possibilities for data-centric advancements in RSs by enabling effective data generation and unification.

The following figure (Figure 4 from the original paper) shows the outline of key techniques in LLM-empowered data generation:

Figure 4: Outline of key techniques in LLM-empowered data generation, covering content, behavior, and structure augmentation as well as interaction simulation, and highlighting how open-world knowledge and agent-based behavior simulation improve data quality.

The following are the results from Table 1 of the original paper:

| Category | Representative Works | Description / Focus |
| --- | --- | --- |
| Content Augmentation | ONCE (WSDM'24), LLM-Rec (NAACL'24), LRD (SIGIR'24), MSIT (ACL'25), EXP3RT (SIGIR'25), LettinGo (KDD'25), SINGLE (WWW'24), KAR (RecSys'24), IRLLRec (SIGIR'25), LLM4SBR (TOIS'25), SeRALM (SIGIR'24), TRAWL (ArXiv'24) | Generate natural-language user/item profiles, summarize histories, enrich sparse metadata, and align textual semantics with feedback. |
| Representation Augmentation | DynLLM (ArXiv'24), GE4Rec (ICML'24), HyperLLM (SIGIR'25) | Automated feature construction, multimodal attribute extraction, external knowledge distillation, and hierarchical category generation. |
| Behavior Augmentation | ColdLLM (WSDM'25), Wang et al. (WWW'25), LLM-FairRec (SIGIR'25), LLM4IDRec (TOIS'25) | Generate synthetic user-item interactions, simulate cold-start preferences, ensure fairness, and integrate pseudo-interactions into ID-based pipelines. |
| Structure Augmentation | SBR (SIGIR'25), LLMRec (WSDM'24), Chang et al. (AAAI'25), CORONA (SIGIR'25), LLM-KERec (CIKM'24), TCR-QF (IJCAI'25), COSMO (SIGMOD'24) | Relation discovery, graph completion, social network generation, subgraph retrieval, knowledge graph construction & distillation. |

3.1. Data Generation: LLMs leverage their open-world knowledge, natural language understanding, and generative capabilities to enrich, synthesize, and unify recommendation data. This is categorized into four dimensions of augmentation and agent-based simulation:

  • 3.1.1. Open-world Knowledge for Augmentation: LLMs' vast pre-training data allows them to enrich and synthesize recommendation data.

    • Content Augmentation: LLMs generate natural-language representations for users and items. For instance, LLM-REC [116] uses diverse prompting to extract insights from LLM knowledge. LRD [202] uses variational reasoning to find item relationships. MSIT [76] leverages multimodal LLMs (MLLMs) to mine item attributes from images and text. Approaches like SINGLE [111] and KAR [189] extract user preferences from interaction logs. SeRALM [145] and LettinGo [168] design prompts and use Direct Preference Optimization (DPO) to align generated content with recommendation goals.
    • Representation Augmentation: LLMs automate feature construction. DynLLM [234] uses an LLM as a content encoder. HyperLLM [16] generates hierarchical categories. GE4Rec [209] proposes a generative feature generation paradigm.
    • Behavior Augmentation: LLMs address data sparsity and cold-start by generating synthetic user-item interactions. ColdLLM [58] uses a coupled-funnel architecture for cold-start user interaction simulation. LLM-FairRec [75] generates fair pseudo-interactions. LLM4IDRec [14] augments ID-based interaction data.
    • Structure Augmentation: LLMs induce higher-level semantic structures. SBR [10] aligns item features with hierarchical intents. LLMRec [185] infers missing graph nodes/edges. CORONA [12] retrieves intent-aware subgraphs. LLM-KERec [232] infers new knowledge graph triples.
  • 3.1.2. Agent-Based Behavior Simulation: LLM-driven agents simulate human-like cognition, memory, emotions, decision-making, and reflection to generate authentic user profiles and dynamic behaviors.

    • Interaction Simulation: Agents simulate individual user behaviors. Agent4Rec [219] simulates diverse user behaviors with factual and emotional memories. AgentCF [221] models collaborative filtering by simulating both user and item agents. SimUSER [5] and SUBER [18] design cognitive agents with episodic memory for realistic behavior logs.
    • Social Simulation: Agents simulate large-scale social dynamics. GGBond [239] models evolving social ties and trust dynamics. RecAgent [169] constructs a sandbox to study information silos and conformity.

3.2. Data Unification: LLMs unify heterogeneous data across tasks, domains, and modalities.

The following figure (Figure 5 from the original paper) illustrates LLM empowered data unification:

Figure 5: LLM-empowered data unification across multi-domain, multi-task, and multi-modal data, reflecting the "one model for all" idea.

  • 3.2.1. Multi-Domain Data Unification: Addresses cross-domain recommendation challenges like behavioral sparsity and domain gaps. DMCDR [83] uses a diffusion model for preference transfer. LLM4CDSR [105] and LLMCDSR [194] use LLMs to extract semantic representations and generate pseudo-interactions across domains. UniCTR [31] and MoLoRec [51] learn domain-general knowledge.
  • 3.2.2. Multi-Task Data Unification: Integrates diverse recommendation objectives (rating ranking, explanation, intent recognition) into a single framework. P5 [38] formulates tasks as text-to-text generation. GPSD [163] combines generative pretraining with discriminative fine-tuning. ARTS [113] uses self-prompting for joint prediction and explanation.
  • 3.2.3. Multi-Modal Data Unification: Integrates text, images, and behavior logs using large vision-language models (LVLMs). UniMP [184] and MQL4GRec [217] unify multimodal inputs into shared semantic spaces. LLaRA [90] integrates item IDs and text. PAD [178] aligns modalities via a three-stage pretrain-align-disentangle process.
  • 3.2.4. One Model for All: A paradigm shift towards unified, general-purpose models. P5 [38] pioneered reformulating recommendation as text-to-text generation. M6-Rec [20] enables open-ended multimodal generation. UniTRec [122] integrates generative modeling with contrastive learning. CLLM4Rec [244] incorporates user/item IDs into LLM vocabularies. A-LLMRec [69] integrates pre-trained LLMs with collaborative filtering embeddings.
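To make the multi-task, "one model for all" unification concrete, here is a small illustrative sketch (hypothetical templates, not the exact prompts used by P5 [38]) of how rating prediction, sequential recommendation, and explanation generation all flatten into (input, target) text pairs that feed a single sequence-to-sequence objective.

```python
def rating_example(user, item, rating):
    # Rating prediction as text-to-text
    return (f"What rating would user_{user} give item_{item}? (1-5)", str(rating))

def nextitem_example(user, history, target):
    # Sequential recommendation as text-to-text
    seq = ", ".join(f"item_{i}" for i in history)
    return (f"user_{user} has interacted with: {seq}. Predict the next item.",
            f"item_{target}")

def explanation_example(user, item, review):
    # Explanation generation as text-to-text
    return (f"Explain why user_{user} would enjoy item_{item}.", review)

# All three (input, target) pairs share one encoder-decoder training objective,
# so a single model serves rating, ranking, and explanation tasks.
batch = [rating_example(12, 87, 5),
         nextitem_example(12, [3, 87, 19], 42),
         explanation_example(12, 42, "Great fit for fans of sci-fi thrillers.")]
```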

4.2.4. Model-Level Opportunities

This section explores how generative models serve as the core recommendation engine.

4.1. LLM-Based Generative Recommendation: This approach leverages pre-trained LLMs to produce personalized suggestions.

The following figure (Figure 6 from the original paper) illustrates the paradigms aligning LLMs to recommendation:

Figure 6: The paradigms aligning LLMs to recommendation, contrasting four input/output schemes: (a) textual metadata, (b) collaborative tokens, (c) ID numbers, and (d) trainable ID tokens. Inspired by the figure in [214].

  • 4.1.1. Pretrained LLMs Recommendation: Uses prompt design and in-context learning without heavy retraining.

    • LLM-as-Enhancer [150, 54, 67, 102, 47]: LLMs rewrite user/item profiles into natural-language features to augment traditional recommenders.
    • LLM-as-Recommender [36]: LLMs directly generate recommendations (e.g., item titles) with task-specific prompts, even in zero-shot mode. This extends to multimodal LLMs (MLLMs).
  • 4.1.2. Aligning LLMs for Recommendation: Fine-tuning LLMs on recommendation-specific data to bridge the gap between generic language modeling objectives and recommendation goals. This involves presenting the LLM with structured, compact, and consistent collaborative signals.

    • Text Prompting Based Methods: User profiles are built entirely in natural language, combining task descriptions with chronological interaction history. The following are the results from Table 2 of the original paper:

| Methods | Task | User Inputs | Backbone |
| --- | --- | --- | --- |
| Chat-Rec [36] | ranking | history interactions | GPT-3.5 |
| TALLRec [4] | preference classification | user preference | LLaMA-7B |
| LlamaRec [214] | retrieval, ranking | history interactions | LLaMA2-7B |
| LRD [202] | ranking | history interactions | GPT-3.5 |
| ReLLa [93] | ranking | history interactions | Vicuna-7B |
| CALRec [86] | ranking | history interactions | PaLM-2 XXS |
| BiLLP [149] | long-term interactive | history interactions, reward model | GPT-3.5, GPT-4, LLaMA2-7B |
| PO4ISR [157] | ranking | history interactions | LLaMA2-7B |
| LLM-TRSR [237] | ranking | history interactions | LLaMA2-7B |
| RecGPT [124] | ranking | history interactions, user preference | RecGPT-7B |
| KAR [189] | ranking | history interactions, user preference | GPT-3.5 |
| LLM4CDSR [105] | ranking | history interactions | GPT-3.5, GLM4-Flash |
| EXP3RT [68] | rating prediction | history interactions | LLaMA3-8B |
| SERAL [191] | retrieval, ranking | history interactions | Qwen2-0.5B |
| LettinGo [168] | ranking | history interactions | LLaMA3-8B |
| Reason4Rec [28] | rating prediction | history interactions, user preference | LLaMA3-8B |
| InstructRec [224] | ranking | history interactions | Flan-T5-XL |
| Uni-CTR [31] | rating prediction | history interactions, user preference | DeBERTaV3-large |
| BIGRec [3] | ranking | history interactions | LLaMA-7B |
| UPSR [140] | ranking | history interactions | T5, FLAN-T5 |

      Early works like [86, 149, 157] fed sequences of consumed items. Later studies, such as TALLRec [4] and LlamaRec [215], enriched prompts with explicit preference statements or constrained candidate sets. LettinGo [168] used direct preference optimization (DPO) for flexible profile adaptation.

    • Collaborative Signal Based Methods: Inject collaborative signals into the user/item profile so the LLM sees both semantics and relational knowledge. The following are the results from Table 3 of the original paper:

| Methods | Task | User Inputs | Combining Method | Backbone |
| --- | --- | --- | --- | --- |
| iLoRA [72] | ranking | history interactions | Concatenation | GPT-3.5 |
| LLM-ESR [104] | ranking | history interactions | Concatenation | LLaMA2-7B |
| LLaRA [90] | ranking | history interactions | Concatenation | LLaMA2-7B |
| A-LLMRec [69] | ranking | history interactions | Concatenation | OPT-6.7B |
| RLMRec [144] | ranking | history interactions, user preference | Concatenation | GPT-3.5 |
| CoRAL [186] | ranking | history interactions, user preference | Retrieval-Augmented | GPT-4 |
| BinLLM [226] | ranking | history interactions, user preference | Concatenation | Vicuna-7B |
| E4SRec [82] | ranking | history interactions | Concatenation | Vicuna-7B |
| SeRALM [145] | ranking | history interactions | Concatenation | LLaMA2-7B |
| CORONA [12] | ranking | history interactions | Pipeline Integration | GPT-4o-mini |
| HyperLLM [16] | ranking | history interactions | Pipeline Integration | LLaMA3-8B |
| RecLM [65] | ranking | history interactions | Concatenation | LLaMA2-7B |
| CoLLM [227] | ranking | history interactions | Concatenation | Vicuna-7B |
| PAD [177] | ranking | history interactions, user preference | Concatenation | LLaMA3-8B |
| IDP [188] | ranking | history interactions, user preference | Concatenation | T5 |

      This includes LLM-augmented representation for CF models (e.g., [72, 104, 90] concatenating information) and LLM-assisted summarization for CF models (e.g., CORONA [12], CoRAL [186]).

    • Item Tokenization Based Methods: Map items into the LLM's vocabulary using identifiable tokens. The following are the results from Table 4 of the original paper:

| Methods | Task | User Inputs | Token Type | Backbone |
| --- | --- | --- | --- | --- |
| P5 [38] | ranking | historical interactions, user preference | ID-based tokenization | Transformer |
| CLLM4Rec [244] | ranking | historical interactions | ID-based tokenization | GPT-2 |
| BIGRec [3] | ranking | historical interactions | Text-based tokenization | LLaMA-7B |
| M6 [20] | retrieval, ranking | historical interactions | Text-based tokenization | M6 |
| IDGenRec [158] | ranking | historical interactions | Text-based tokenization | BERT4Rec |
| TIGER [141] | ranking | historical interactions | Codebook-based tokenization | T5 |
| RPG [52] | ranking | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| LC-Rec [235] | ranking | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| ActionPiece [53] | retrieval | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| LETTER [172] | ranking | historical interactions | Codebooks with collaborative signals | LLaMA-7B |
| TokenRec [138] | retrieval | historical interactions | Codebooks with collaborative signals | T5-small |
| SETRec [94] | ranking | historical interactions | Codebooks with collaborative signals | T5, Qwen |
| CCFRec [100] | ranking | historical interactions | Codebooks with collaborative signals | LLaMA-2-7B |
| LLM2Rec [48] | ranking | historical interactions | Codebooks with collaborative signals | LLaMA-2-7B |
| SIIT [15] | retrieval | historical interactions | Self-adaptive tokenization | LLaMA-2-7B |

      Methods include ID-based tokenization (simple but not scalable), text-based tokenization (semantic but lengthy), codebook-based tokenization (compact), codebooks with collaborative signals (bridging gaps, e.g., LETTER [172], TokenRec [138]), and self-adaptive tokenization (LLMs refine identifiers, e.g., SIIT [15]).
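To illustrate the codebook-based route (e.g., TIGER's semantic IDs), here is a toy residual-quantization sketch. It is illustrative only: real systems learn the codebooks (e.g., with an RQ-VAE) rather than sampling them randomly, and the token names are hypothetical. Each item embedding becomes a short tuple of discrete codes that can be added to the LLM vocabulary as new tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, levels, codebook_size = 32, 3, 256
# Normally learned (e.g., by an RQ-VAE over item embeddings); random here.
codebooks = rng.normal(size=(levels, codebook_size, dim))

def tokenize_item(item_vec: np.ndarray) -> tuple[int, ...]:
    """Residual quantization: at each level, pick the nearest code, subtract it."""
    residual, codes = item_vec.copy(), []
    for level in range(levels):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(dists.argmin())
        codes.append(idx)
        residual -= codebooks[level][idx]
    return tuple(codes)  # e.g., (17, 203, 5) -> tokens "<a_17><b_203><c_5>"

item_vec = rng.normal(size=dim)   # stands in for a pre-trained item embedding
print(tokenize_item(item_vec))
```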

  • 4.1.3. Training Objective & Inference: The following are the results from Table 5 of the original paper:

| Category | Representative Works | Formula |
| --- | --- | --- |
| Supervised Fine-Tuning | P5 (RecSys'22), LGIR (AAAI'24), LLM-Rec (TOIS'25), RecRanker (TOIS'25) | $-\log \pi_\theta(y^+ \mid x)$ |
| Self-Supervised Learning | FELLAS (TOIS'24), HFAR (TOIS'25) | $-\log \frac{\exp(\mathrm{sim}(\hat{y}, \hat{y}^+)/\tau)}{\sum_{\hat{y}' \in \mathcal{N} \cup \{\hat{y}^+\}} \exp(\mathrm{sim}(\hat{y}, \hat{y}')/\tau)}$ |
| Reinforcement Learning | LEA (SIGIR'24), RPP (TOIS'25) | $-\mathbb{E}[r_\phi(x, y) - \beta D_{\mathrm{KL}}(\pi_\theta(y \mid x) \Vert \pi_{\mathrm{ref}}(y \mid x))]$ |
| Direct Preference Optimization | LettinGo (KDD'25), RosePO (ArXiv'24), SPRec (WWW'25) | $-\log \sigma\big(\beta(\log \tfrac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \log \tfrac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)})\big)$ |

    Notation for the table: $x$ is the user profile/context; $y^+ / y^-$ are the preferred/rejected items; $\pi_\theta$ is the policy model; $\pi_{\mathrm{ref}}$ is the reference model; $\mathrm{sim}(\cdot, \cdot)$ is a similarity function; $\tau$ is a temperature; $\mathcal{N}$ is the negative set; $r_\phi$ is a reward; $\beta$ is a penalty/scale factor; $D_{\mathrm{KL}}$ is the KL divergence; $\sigma$ is the sigmoid function. The formula cells in the original table are only partially rendered; they are reconstructed above in their standard forms, which are expanded upon below.

    • Training Objective: Focuses on next-item prediction.

      • Supervised Fine-Tuning (SFT): LLMs are fine-tuned with predefined templates. The SFT loss is a negative log-likelihood that maximizes the probability of generating the correct output given the input. For next-item prediction, given an input context $x$ (user profile, history) and a target item $y^+$: $\mathcal{L}_{\mathrm{SFT}} = - \log \pi_{\theta}(y^+ \mid x)$. Here, $\pi_{\theta}(y^+ \mid x)$ is the probability of generating the positive item $y^+$ given the input $x$ under the model with parameters $\theta$. P5 [38] and LGIR [26] are examples. SFT learns mainly from positive pairs.
      • Self-Supervised Learning (SSL): Generates auxiliary training signals to reduce reliance on manual templates. A common SSL objective is contrastive learning, which maximizes agreement between different views of the same data point and minimizes agreement with negative samples. An InfoNCE-style loss, where $\hat{y}^+$ is the representation of the positive item and $\hat{y}'$ ranges over other items: $\mathcal{L}_{\mathrm{SSL}} = - \log \frac{\exp(\mathrm{sim}(\hat{y}, \hat{y}^+)/\tau)}{\sum_{\hat{y}' \in \mathcal{N} \cup \{\hat{y}^+\}} \exp(\mathrm{sim}(\hat{y}, \hat{y}')/\tau)}$. Here, $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity function (e.g., cosine similarity), $\tau$ is a temperature parameter, $\mathcal{N}$ is the set of negative samples, and $\hat{y}$ is the embedding of the current item. FELLAS [213] is an example.
      • Reinforcement Learning (RL): Introduces reward-driven optimization to handle non-differentiable metrics. A common objective, as in Proximal Policy Optimization (PPO) and related methods, optimizes a policy $\pi_{\theta}$ against a reward function $r_{\phi}$ with a KL-divergence regularizer toward a reference policy: $\mathcal{L}_{\mathrm{RL}} = - \mathbb{E}_{(x,y) \sim D} [r_{\phi}(x,y) - \beta D_{\mathrm{KL}}(\pi_{\theta}(y|x) \Vert \pi_{\mathrm{ref}}(y|x))]$. Here, $r_{\phi}(x,y)$ is the reward for recommending item $y$ given context $x$, $\beta$ scales the KL term, and $D_{\mathrm{KL}}(\cdot \Vert \cdot)$ is the Kullback-Leibler divergence between the current policy $\pi_{\theta}$ and a reference policy $\pi_{\mathrm{ref}}$. LEA [166] and RPP [120] are examples.
      • Direct Preference Optimization (DPO): Optimizes directly on preference pairs $(y^+, y^-)$ without training a separate reward model: $\mathcal{L}_{\mathrm{DPO}} = - \log \sigma \left( \beta \left( \log \frac{\pi_{\theta}(y^+|x)}{\pi_{\mathrm{ref}}(y^+|x)} - \log \frac{\pi_{\theta}(y^-|x)}{\pi_{\mathrm{ref}}(y^-|x)} \right) \right)$. Here, $\sigma(\cdot)$ is the sigmoid function, $\beta$ scales the preference margin, $\pi_{\theta}$ is the policy being optimized, and $\pi_{\mathrm{ref}}$ is a fixed reference policy (usually the initial pre-trained LLM). The objective increases the log-probability ratio of the preferred response $y^+$ relative to the rejected response $y^-$ against the reference policy. LettinGo [168] and RosePO [89] are examples (a minimal code sketch follows after the inference notes below).
    • Inference:

      • Reranking: Improves output quality by injecting stronger ranking signals. RecRanker [114] uses a two-stage pipeline (retrieval then LLM-based reranking). LLM4Rerank [35] frames inference as multi-node reasoning.
      • Acceleration: Reduces latency and memory. FELLAS [213] limits LLM use for embeddings. Prompt Distillation (GenRec [155]) compresses histories. AtSpeed [97] applies speculative decoding and tree-based attention.
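As a concrete instance of the alignment objectives above, here is a minimal PyTorch sketch of the DPO loss from Table 5; the toy tensors stand in for summed token log-probabilities of item identifiers under the policy and frozen reference models.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """L_DPO = -log sigma(beta * [(log pi(y+|x) - log ref(y+|x))
                                  - (log pi(y-|x) - log ref(y-|x))])."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -F.logsigmoid(beta * margin).mean()

# Toy batch: each entry is the summed token log-prob of an item title/identifier.
logp_pos = torch.tensor([-4.2, -3.1])   # policy model, preferred items y+
logp_neg = torch.tensor([-3.9, -5.0])   # policy model, rejected items y-
ref_pos = torch.tensor([-4.5, -3.3])    # frozen reference model, y+
ref_neg = torch.tensor([-3.8, -4.9])    # frozen reference model, y-
print(dpo_loss(logp_pos, logp_neg, ref_pos, ref_neg))
```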

4.2. Large Recommendation Model (LRM): LRMs are specialized architectures optimized directly for user behavior data, establishing native scaling laws for recommendation tasks. They aim to address diminishing returns from complex discriminative models and the high costs of cascaded architectures.

The following figure (Figure 7 from the original paper) compares the large recommendation model architecture with end-to-end recommendation: panel (a) depicts the encoder, decoder, and various input sequences, while panel (b) contrasts end-to-end training with preference alignment against the cascaded retrieval-ranking pipeline.

Figure 7: End-to-end recommendation.

The following are the results from Table 6 of the original paper:

| Methods | Task | User Inputs | Architecture | Backbone |
| --- | --- | --- | --- | --- |
| LEARN [62] | ranking | history interactions, user preference | Cascaded | Baichuan2-7B, Transformer |
| HLLM [9] | retrieval, ranking | history interactions | Cascaded | TinyLlama-1.1B, Baichuan2-7B |
| KuaiFormer [98] | retrieval | history interactions | Cascaded | Stacked Transformer |
| SRP4CTR [44] | ranking | history interactions, user preference | Cascaded | FG-BERT |
| HSTU [216] | ranking | history interactions | Cascaded | Transformer |
| MTGR [45] | ranking | history interactions | Cascaded | Transformer |
| UniROM [135] | ranking | history interactions | End-to-End | RecFormer |
| URM [63] | ranking | history interactions, user preference | End-to-End | BERT |
| OneRec [23] | generative retrieval and ranking | history interactions, user preference | End-to-End | Transformer |
| OneSug [43] | generative retrieval and ranking | history interactions, user preference | End-to-End | Transformer |
| EGA-V2 [238] | generative retrieval and ranking | history interactions, user preference | End-to-End | Transformer |
  • The Scaling Law of LRMs:

    • HSTU [216] (Meta) is a groundbreaking LRM that validates the applicability of LLM scaling laws to recommendation. It transforms CTR prediction into a generative sequence modeling task, unifying multiple pointwise user-item interactions into a single sequence. It uses a causal autoregressive modeling approach for ultra-long user sequences, unifying retrieval and ranking into a sequence generation problem. It shows that performance improves with model scale, reaching 1.5 trillion parameters, far exceeding traditional discriminative models' stagnation point.
    • MTGR [46] (Meituan) is a generative ranking framework that incorporates cross features and uses Group LayerNorm and dynamic hybrid masking for sequence encoding.
    • GenRank [60] (Redbook) focuses on iteratively predicting actions associated with items, suitable for resource-sensitive ranking.
  • End-to-End Recommendations: LRMs enable unifying the recommendation framework, reducing engineering costs and optimizing objectives.

    • OneRec [23, 240] (Kuaishou) is an end-to-end generative recommendation model that replaces the traditional retrieval-coarse ranking-fine ranking cascade architecture. It uses an encoder-decoder structure with MoE (Mixture of Experts) for capacity expansion. It proposes a session-based generation approach and incorporates Direct Preference Optimization (DPO) with a reward model.
    • OneSug [43] extends this idea to query recommendation.
    • EGA-V2 [238] pushes further with hierarchical tokenization and multi-token prediction, integrating various tasks (user interest modeling, POI generation, ad allocation, payment computation) into a single generative framework.
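To ground the generative sequence formulation that LRMs such as HSTU adopt, the sketch below shows the idea in miniature (an illustrative toy, not HSTU's actual architecture): a causally masked Transformer is trained to predict the next event ID in a user's interaction stream, so retrieval and ranking collapse into sequence generation.

```python
import torch
import torch.nn as nn

n_events, dim, seq_len = 10_000, 64, 128

class TinyLRM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_events, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_events)   # next-event logits

    def forward(self, seq):
        # Causal mask: position t may only attend to positions <= t
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.encoder(self.emb(seq), mask=causal)
        return self.head(h)

model = TinyLRM()
seq = torch.randint(0, n_events, (8, seq_len))   # batch of user event streams
logits = model(seq[:, :-1])                      # predict event t+1 from prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, n_events), seq[:, 1:].reshape(-1))
```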

4.3. Diffusion-Based Generative Recommendation: Explores extending diffusion models to recommendation tasks.

The following figure (Figure 8 from the original paper) illustrates Item Generation:

Figure 8: Diffusion-based data augmentation and target item generation, showing the forward and reverse diffusion processes and a condition-guided reverse process (e.g., refining a noisy social network, and generating a target item from an input sequence).

  • 4.3.1. Augmented Data Generation: Leverages denoising and generative nature of diffusion models.

    • Generate high-quality interaction data: DGFedRS [24] uses diffusion models for personalized user information and high-quality interactions. MoDiCF [77] and TDM [121] handle missing data. Diffurec [88] adds Gaussian noise to embeddings for diversity.
    • Generate robust representations: ARD [156] refines social networks. DDRM [230] and DRGO [229] learn robust representations.
    • Preference injected conditional generation: DMCDR [83] uses preference guidance for cross-domain user representations. InDiRec [139] generates forward views with consistent intent.
  • 4.3.2. Target Item Generation: Directly generate items or rankings.

    • Diffusion recommender model: DiffRec [174] treats interaction prediction as a denoising process. DreamRec [205] noises the target item space to generate recommendations directly, eliminating negative sampling. DiffRIS [129] uses implicit features as guidance. DiQDiff [119] enhances robustness with semantic vector quantization and contrastive discrepancy maximization.
    • Diversity and uncertainty modeling: DiffDiv [7] designs diversity-aware guided learning to capture diverse user preferences.
    • Tailored Optimization for DM-based recommendation: DDSR [192] uses discrete diffusion for fuzzy sets of interaction sequences. ADRec [11] and PreferDiff [108] propose tailored optimization objectives to address embedding collapse.

4.2.5. Task-Level Opportunities

Generative models reformulate recommendation as a generation task, enabling novel capabilities.

  • 5.1. Top-K Recommendation:

    • Vocabulary-Constrained Decoding: Restricts the decoding space to valid item identifiers. P5 [38] uses constrained decoding with beam-search. IDGenRec [158] uses a prefix tree. TransRec [95] uses FM-index for position-free generation and multi-facet identifiers.
    • Post-Generation Filtering: LLM freely generates text, then outputs are mapped/reranked to in-catalog items. BIGRec [3] grounds identifiers via L2 distance.
    • Prompt Augmentation: Injects candidate items into text prompts for selection. Examples include LLaRA [90], A-LLMRec [69], iLoRA [72].
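To make 5.1's vocabulary-constrained decoding concrete, here is a toy sketch (illustrative only; systems like P5 and IDGenRec apply the same idea inside beam search over LLM logits, and `fake_scores` is a hypothetical stand-in scorer): a prefix tree over valid item token sequences restricts each decoding step to in-catalog continuations, so the model can only emit valid identifiers.

```python
def build_trie(item_token_seqs):
    """Map each prefix to the set of tokens that may legally follow it."""
    trie = {}
    for seq in item_token_seqs:
        for t in range(len(seq)):
            trie.setdefault(tuple(seq[:t]), set()).add(seq[t])
    return trie

def constrained_greedy(step_logits, trie, max_len=4):
    """step_logits(prefix) -> {token: score}; only in-trie tokens are considered."""
    prefix = []
    while tuple(prefix) in trie and len(prefix) < max_len:
        allowed = trie[tuple(prefix)]
        scores = step_logits(prefix)
        prefix.append(max(allowed, key=lambda tok: scores.get(tok, float("-inf"))))
    return prefix

catalog = [["sci", "fi", "042"], ["sci", "fi", "007"], ["cook", "book", "311"]]
trie = build_trie(catalog)
fake_scores = lambda prefix: {"sci": 0.9, "fi": 0.8, "042": 0.2, "007": 0.6,
                              "cook": 0.1, "book": 0.5, "311": 0.4}
print(constrained_greedy(fake_scores, trie))   # ['sci', 'fi', '007'] - in-catalog
```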
  • 5.2. Personalized Content Generation: Capitalizes on generative capabilities to create entirely new items or content.

    • Personalized visual content generation: DiFashion generates personalized outfits. DreamVTON [193] creates 3D virtual try-ons. InstantBooth [148] enables personalized image generation. OOTDiffusion [197] generates virtual try-on images.
    • Personalized textual content generation: [198] explores personalization for review generation ([126, 78, 154, 79]) and news headline generation ([1, 6, 152]).
  • 5.3. Conversational Recommendation: Elicits dynamic user preferences through multi-turn natural language interactions.

    • Prompting and Zero-shot Methods: He et al. [49] showed off-the-shelf LLMs can outperform baselines. [153] incorporates iterative user feedback. [21] uses demonstrations to guide.
    • Retrieval-augmented and Knowledge-enhanced Approaches: Combine LLMs with retrieval modules or knowledge graphs to prevent hallucinations. [136] retrieves relevant items. [203] integrates collaborative filtering signals.
    • Unified and Parameter-efficient Architectures: [142] reformulates CRS as a single NLP task. MemoCRS [190] uses memory modules. Chat-REC [37] enhances interaction and explainability.
    • Evaluation: [199] proposes assessing alignment with human expectations.
  • 5.4. Explainable Recommendation: Provides transparency and increases trustworthiness.

    • P5 [38] uses prompts to generate explanations, though LLMs can hallucinate. LLM2ER [201] fine-tunes with explainable quality reward models.
    • Combines with graphs: XRec [118] uses GNNs to model graph structure for embeddings, then feeds to LLMs. G-Refer [87] uses hybrid graph retrieval and retrieval-augmented fine-tuning.
    • Leverages thinking models: [211, 231] expose the model's thought process and use it as the explanation.
  • 5.5. Recommendation Reasoning: Performs multi-step deduction for accurate and explainable recommendations.

    • Explicit reasoning methods: Generate human-readable reasoning processes. Reason4Rec [30] introduces deliberative recommendation for LLMs to explicitly reason. Reason-to-Recommend [231] uses Interaction-of-Thought (IoT) reasoning. ThinkRec [212] uses thinking prompts. OneRec-Think [112] activates LLM reasoning via CoT-based SFT and RL reward functions.
    • Implicit reasoning methods: Perform latent reasoning without textual interpretability. LatentR3 [228] encodes reasoning into latent tokens. ReaRec [159] enables multi-step latent reasoning for sequential recommenders. STREAM-Rec [225] introduces slow thinking for iterative residual-based reasoning.
    • LLM reasoning augmentation methods: LLMs generate reasoning steps to enhance training of traditional RSs. DeepRec [236] proposes an autonomous interaction paradigm. LLMRG [176] constructs reasoning graphs through LLM-driven chain reasoning, divergent extension, and self-verification.
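
As referenced in 5.1, the following is a minimal sketch of prefix-tree (trie) constrained decoding; the Trie class, toy catalog, and token granularity are illustrative assumptions rather than the actual machinery of P5, IDGenRec, or TransRec.

```python
from collections import defaultdict

class Trie:
    """Prefix tree over valid item-identifier token sequences."""
    def __init__(self):
        self.children = defaultdict(Trie)
        self.terminal = False

    def insert(self, tokens):
        node = self
        for tok in tokens:
            node = node.children[tok]
        node.terminal = True

    def allowed_next(self, prefix):
        """Tokens that may legally follow `prefix`; empty if prefix is invalid."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return {}
            node = node.children[tok]
        return node.children

# Build the trie from the catalog's identifier token sequences (toy data).
catalog = [["sci", "fi", "042"], ["sci", "fi", "107"], ["drama", "015"]]
trie = Trie()
for item in catalog:
    trie.insert(item)

# At each decoding step, the LLM's vocabulary is masked to the allowed set,
# so beam search can only ever emit in-catalog identifiers.
print(sorted(trie.allowed_next(["sci", "fi"])))  # ['042', '107']
```

In a real system, the allowed-token set would be converted into a logits mask at each beam-search step; post-generation filtering approaches like BIGRec instead skip the mask and ground free-form outputs to the nearest in-catalog item embedding afterward.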

5. Experimental Setup

This paper is a comprehensive survey of existing research on generative recommendation. As such, it does not present its own experimental setup, datasets, evaluation metrics, or baselines in the traditional sense of empirical research. Instead, it reviews and synthesizes the experimental methodologies and findings of the works it covers.

5.1. Datasets

The paper discusses the historical evolution of datasets that have driven advancements in recommender systems, and critically evaluates their suitability for the new paradigm of generative recommendation.

  • Historically Important Datasets:

    • MovieLens: A foundational dataset in the early stages, providing large-scale rating data that enabled the development and evaluation of collaborative filtering and matrix factorization methods.
      • Characteristics: Contains movie ratings (1-5 stars) from users.
      • Example Data Sample: A tuple like (user_id, movie_id, rating, timestamp).
    • Netflix Prize dataset: A milestone dataset that further pushed latent factor models and matrix factorization techniques due to its large scale and challenging nature.
      • Characteristics: Contains anonymous ratings from Netflix users for movies.
      • Example Data Sample: Similar to MovieLens, but with a larger scale.
    • Amazon Review dataset: Shifted attention from explicit ratings to implicit feedback (e.g., clicks, purchases) and auxiliary content like text reviews and product metadata. This fostered research on hybrid recommendation.
      • Characteristics: Contains user reviews, ratings, product metadata across various categories.
      • Example Data Sample: A review like (reviewerID, asin, reviewerName, helpful, reviewText, overall, summary, unixReviewTime, reviewTime). A reviewText example might be: "This product is amazing! It arrived quickly and works perfectly. Highly recommend."
    • Yelp dataset: With its rich user reviews and business information, it advanced the use of deep learning-based recommendation, leveraging natural language processing for sentiment analysis, representation learning, and context-aware suggestions.
      • Characteristics: Contains business data, user reviews, and tips.
      • Example Data Sample: A review text example: "Great food and service! The pasta was delicious and the ambiance was cozy."
  • Suitability for Generative Recommendation: The paper argues that while these datasets were crucial for traditional RSs, they are no longer fully suitable for generative recommendation.

    • Limitations: Most of these datasets are non-interactive, offline, and static. They capture point-in-time user preferences rather than the dynamic feedback loops and multi-round interactions that characterize real-world generative recommendation scenarios.
    • Assessment Gap: These datasets are more suitable for assessing the accuracy performance of traditional RSs, but they restrict the assessment of generative models as personalized assistants that operate across multiple scenarios and handle diverse tasks in interactive settings.
    • Need for New Benchmarks: This highlights an urgent need for new benchmarks that can better support the next stage of generative recommendation research, focusing on interactive, dynamic, and multi-task capabilities.

5.2. Evaluation Metrics

As a survey, the paper does not define or use specific evaluation metrics for its own experiments. However, it discusses the tasks and objectives of generative recommendation, which inherently rely on various metrics for evaluation in the individual works it reviews. Based on the tasks mentioned (e.g., rating prediction, CTR prediction, Top-K recommendation, explainable recommendation, conversational recommendation, content generation), the following types of metrics are commonly used in the field:

5.2.1. Accuracy Metrics (for traditional tasks like rating/CTR prediction)

  • Root Mean Squared Error (RMSE):

    • Conceptual Definition: RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. It quantifies the square root of the average of the squared differences between predicted and actual values. It gives a relatively high weight to large errors. It is suitable for rating prediction tasks where the output is a continuous value.
    • Mathematical Formula: $ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(u,i) \in D} (r_{ui} - \hat{r}_{ui})^2} $
    • Symbol Explanation:
      • NN: The total number of observed ratings in the dataset DD.
      • (u,i)D(u,i) \in D: An observed interaction (rating) for user uu and item ii in the dataset.
      • ruir_{ui}: The true (observed) rating given by user uu to item ii.
      • r^ui\hat{r}_{ui}: The predicted rating for user uu and item ii by the recommendation model.
  • Area Under the Receiver Operating Characteristic Curve (AUC):

    • Conceptual Definition: AUC measures the ability of a classifier to distinguish between classes. It represents the probability that a randomly chosen positive instance will be ranked higher than a randomly chosen negative instance by the classifier. In recommendation, particularly for click-through rate (CTR) prediction, it indicates how well the model can differentiate between items a user will interact with (positive) and those they won't (negative). A higher AUC suggests better discriminative power.
    • Mathematical Formula: AUC is typically calculated by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. The area under this ROC curve is then computed. There isn't a single simple closed-form formula like RMSE; it's often computed numerically. $ \text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}^{-1}(x)) dx $ Alternatively, it can be defined as the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example: $ \text{AUC} = P(\text{score}(\text{positive}) > \text{score}(\text{negative})) $ A numeric sketch of both accuracy metrics follows this list.
    • Symbol Explanation:
      • TPR\text{TPR}: True Positive Rate, also known as Recall or Sensitivity, calculated as True PositivesTrue Positives+False Negatives\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}.
      • FPR\text{FPR}: False Positive Rate, calculated as False PositivesFalse Positives+True Negatives\frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}.
      • score()\text{score}(\cdot): The predicted relevance score of an item by the model.
      • P()P(\cdot): Probability.
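
A minimal numeric sketch of the two accuracy metrics above, directly following their definitions (the toy ratings, labels, and scores are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over observed (user, item) ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def auc(labels, scores):
    """P(score(positive) > score(negative)), computed from average ranks."""
    labels, scores = np.asarray(labels), np.asarray(scores, float)
    order = scores.argsort()
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # ties get their average rank
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(rmse([4, 3, 5], [3.8, 3.4, 4.5]))          # ≈ 0.39
print(auc([1, 0, 1, 0], [0.9, 0.3, 0.6, 0.4]))   # 1.0: every positive outranks every negative
```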

5.2.2. Ranking Metrics (for Top-K recommendation)

  • Recall@K:

    • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top K recommendations. It indicates how many of the items a user would like were actually shown to them among the top K.
    • Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\mathrm{Relevant\_Items} \cap \mathrm{Recommended\_Items@K}|}{|\mathrm{Relevant\_Items}|} $
    • Symbol Explanation:
      • |\cdot|: Denotes the cardinality (number of elements) of a set.
      • Relevant_Items\mathrm{Relevant\_Items}: The set of items that are truly relevant to a user (e.g., items the user interacted with in the test set).
      • Recommended_Items@K\mathrm{Recommended\_Items@K}: The set of the top K items recommended by the system for the user.
      • \cap: Set intersection.
  • Normalized Discounted Cumulative Gain (NDCG@K):

    • Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items in the ranked list. It assigns higher scores to relevant items that appear at higher ranks (top of the list) and penalizes relevant items that appear lower. It also incorporates the gain (relevance score) of each item. It is normalized to ensure scores are comparable across different users and queries. A code sketch of both ranking metrics follows this list.
    • Mathematical Formula: First, Discounted Cumulative Gain (DCG@K): $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{rel_j} - 1}{\log_2(j+1)} $ Then, Ideal Discounted Cumulative Gain (IDCG@K) is the DCG for the ideally ranked list (all relevant items at the very top, ordered by decreasing relevance): $ \mathrm{IDCG@K} = \sum_{j=1}^{K} \frac{2^{rel_j^{opt}} - 1}{\log_2(j+1)} $ Finally, Normalized Discounted Cumulative Gain (NDCG@K): $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    • Symbol Explanation:
      • KK: The number of top recommendations considered.
      • jj: The position of an item in the ranked list.
      • reljrel_j: The relevance score of the item at position jj in the generated ranked list. This can be binary (0 or 1 for irrelevant/relevant) or graded (e.g., 1-5 for ratings).
      • reljoptrel_j^{opt}: The relevance score of the item at position jj in the ideal (perfect) ranked list.
      • log2(j+1)\log_2(j+1): The logarithmic discount factor, which reduces the importance of items at lower ranks.
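
A minimal sketch of both ranking metrics for a single user, matching the formulas above under binary relevance (the item IDs are hypothetical):

```python
import math

def recall_at_k(relevant, recommended, k):
    """Fraction of a user's relevant items that appear in the top-K list."""
    return len(set(relevant) & set(recommended[:k])) / len(relevant)

def ndcg_at_k(relevant, recommended, k):
    relevant = set(relevant)
    # DCG@K with binary gains: 2^rel - 1 is 1 for a hit, 0 otherwise;
    # position j is 0-indexed here, so the discount is log2(j + 2).
    dcg = sum(1.0 / math.log2(j + 2)
              for j, item in enumerate(recommended[:k]) if item in relevant)
    # IDCG@K: all hits packed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(j + 2) for j in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

relevant = ["i3", "i7"]
recommended = ["i3", "i1", "i7", "i9"]
print(recall_at_k(relevant, recommended, k=3))  # 1.0 (both hits in top 3)
print(ndcg_at_k(relevant, recommended, k=3))    # ≈ 0.92 (one hit not at rank 2)
```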

5.2.3. Qualitative/Human Evaluation (for Explainable, Conversational, Content Generation tasks)

For tasks like explainable recommendation, conversational recommendation, and personalized content generation, traditional quantitative metrics might be insufficient. These often rely on:

  • Human Evaluation: Experts or user studies to assess aspects like coherence, fluency, informativeness, trustworthiness, interactivity, appropriateness, and creativity of the generated explanations, dialogues, or content.
  • Task-Specific Metrics: For conversational systems, metrics like turn-based success rate or dialogue completion rate might be used. For content generation, metrics from natural language generation or image generation (e.g., FID for image quality, BLEU/ROUGE for text similarity, though less suited for open-ended generation) can be adapted, often combined with human judgment.

5.3. Baselines

As a survey paper, it does not propose a new model to be compared against baselines. Instead, it discusses the evolution of recommender systems, implicitly treating older paradigms (e.g., collaborative filtering, matrix factorization, deep learning-based discriminative models) as conceptual baselines against which the new generative recommendation paradigm is evaluated in terms of capabilities, advantages, and limitations. The paper highlights how generative models aim to overcome the shortcomings of these traditional approaches.

6. Results & Analysis

Since this paper is a survey, it does not present new experimental results from its own models or comparisons. Instead, its "results" are the synthesis and analysis of findings from the numerous research papers it reviews. The analysis focuses on the identified advantages, paradigm shifts, and open challenges presented by the generative recommendation approach.

6.1. Core Results Analysis

The paper's core analysis validates the significant shift brought by generative models to recommender systems by highlighting several key advantages and outlining a fundamental paradigm change:

  • Five Key Advantages of Generative Models in RSs:

    1. World Knowledge Integration: Generative models, especially LLMs, are pre-trained on vast and diverse datasets, inherently encoding extensive world knowledge. This allows them to incorporate rich semantic information about entities, events, and cultural contexts directly into recommendations without requiring explicit knowledge extraction pipelines or separate knowledge base construction. This leads to more contextually aware and informed recommendations.
    2. Natural Language Understanding (NLU): LLMs possess advanced NLU capabilities, enabling them to interpret user expressions in free-form text, understanding nuances, context, and intent. This bridges the gap between how users naturally communicate their preferences (e.g., search queries, reviews, conversations) and how RSs operate, supporting conversational interfaces and complex queries.
    3. Reasoning Capabilities: Generative models exhibit emergent reasoning capabilities, allowing them to model logical processes behind user decisions. Unlike traditional models that rely on pattern matching, generative RSs can understand why a user might prefer an item, considering feature relationships, temporal sequences, and contextual factors. This enables explainable recommendations and justifies suggestions.
    4. Scaling Laws: The paper emphasizes that the scaling law observed in LLMs (performance improving predictably with increased model size, training data, and compute) also applies to Large Recommendation Models (LRMs). This provides a systematic pathway for building more capable RSs, as increasing scale can lead to emergent capabilities like better intent understanding and nuanced preference modeling, reducing the need for manual feature engineering.
    5. Creative Generation for Novel Recommendations: Unlike discriminative models that only rank existing items, generative models can create novel content and recommendations. This is particularly valuable in cold-start scenarios and helps to diversify recommendations beyond the filter bubble. They can suggest customized bundles, personalized content variations, or even generate new item descriptions tailored to individual preferences.
  • Paradigm Shift from Discriminative to Generative: The paper effectively illustrates how generative RSs are breaking away from the traditional discriminative paradigm, which was characterized by:

    • Mapping-based: Learning a mapping from user-item pairs to a score.

    • Feature-driven: Heavily relying on hand-crafted or learned features.

    • Small-model: Typically smaller, specialized models.

    • Task-independent: Designed for a single, fixed task (e.g., CTR prediction).

    • Reliant on predefined candidate sets: Limited to recommending from an existing catalog.

      In contrast, generative recommenders are reshaping the paradigm towards recommendation assistants that are:

    • Open: Leveraging open-world knowledge.

    • Capable of handling a wide range of tasks: Task-agnostic by design, able to rank items, generate explanations, and produce personalized content within a single framework.

    • Adaptive to changing user needs: Through interactive and conversational capabilities.

      This shift represents a fundamental change in how recommendations are made, moving towards a more dynamic, responsive, and intelligent interaction between humans and information systems.

The following figure (Figure 9 from the original paper) illustrates the comparison between traditional discriminative recommendation and a generative recommendation assistant:

(Image: a schematic from the paper contrasting the two paradigms in terms of data flow, interaction mode, characteristics, and open challenges.)

Figure 9: Illustration of traditional discriminative recommendation and generative recommendation assistant.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Category | Representative Works | Description / Focus |
| --- | --- | --- |
| Content Augmentation | ONCE (WSDM'24), LLM-Rec (NAACL'24), LRD (SIGIR'24), MSIT (ACL'25), EXP3RT (SIGIR'25), Lettingo (KDD'25), SINGLE (WWW'24), KAR (RecSys'24), IRLLRec (SIGIR'25), LLM4SBR (TOIS'25), SeRALM (SIGIR'24), TRAWL (ArXiv'24) | Generate natural-language user/item profiles, summarize histories, enrich sparse metadata, and align textual semantics with feedback. |
| Representation Augmentation | DynLLM (ArXiv'24), GE4Rec (ICML'24), HyperLLM (SIGIR'25) | Automated feature construction, multimodal attribute extraction, external knowledge distillation, and hierarchical category generation. |
| Behavior Augmentation | ColdLLM (WSDM'25), Wang et al. (WWW'25), LLM-FairRec (SIGIR'25), LLM4IDRec (TOIS'25) | Generate synthetic user-item interactions, simulate cold-start preferences, ensure fairness, and integrate pseudo-interactions into ID-based pipelines. |
| Structure Augmentation | SBR (SIGIR'25), LLMRec (WSDM'24), Chang et al. (AAAI'25), CORONA (SIGIR'25), LLM-KERec (CIKM'24), TCR-QF (IJCAI'25), COSMO (SIGMOD'24) | Relation discovery, graph completion, social network generation, subgraph retrieval, knowledge graph construction & distillation. |

The following are the results from Table 2 of the original paper:

| Methods | Task Description | User Inputs | Backbone |
| --- | --- | --- | --- |
| Chat-Rec [36] | Ranking | history interactions | GPT-3.5 |
| TALLRec [4] | Preference classification | user preference | LLaMA-7B |
| LlamaRec [214] | Retrieval, ranking | history interactions | LLaMA2-7B |
| LRD [202] | Ranking | history interactions | GPT-3.5 |
| ReLLa [93] | Ranking | history interactions | Vicuna-7B |
| CALRec [86] | Ranking | history interactions | PaLM-2 XXS |
| BiLLP [149] | Long-term interactive | history interactions, reward model | GPT-3.5, GPT-4, LLaMA2-7B |
| PO4ISR [157] | Ranking | history interactions | LLaMA2-7B |
| LLM-TRSR [237] | Ranking | history interactions | LLaMA2-7B |
| RecGPT [124] | Ranking | history interactions, user preference | RecGPT-7B |
| KAR [189] | Ranking | history interactions, user preference | GPT-3.5 |
| LLM4CDSR [105] | Ranking | history interactions | GPT-3.5, GLM4-Flash |
| EXP3RT [68] | Rating prediction | history interactions | LLaMA3-8B |
| SERAL [191] | Retrieval, ranking | history interactions | Qwen2-0.5B |
| LettinGo [168] | Ranking | history interactions | LLaMA3-8B |
| Reason4Rec [28] | Rating prediction | history interactions, user preference | LLaMA3-8B |
| InstructRec [224] | Ranking | history interactions | Flan-T5-XL |
| Uni-CTR [31] | Rating prediction | history interactions, user preference | DeBERTaV3-large |
| BIGRec [3] | Ranking | history interactions | LLaMA-7B |
| UPSR [140] | Ranking | history interactions | T5, FLAN-T5 |

The following are the results from Table 3 of the original paper:

| Methods | Task Description | User Inputs | Combining Method | Backbone |
| --- | --- | --- | --- | --- |
| iLoRA [72] | Ranking | history interactions | Concatenation | GPT-3.5 |
| LLM-ESR [104] | Ranking | history interactions | Concatenation | LLaMA2-7B |
| LLaRA [90] | Ranking | history interactions | Concatenation | LLaMA2-7B |
| A-LLMRec [69] | Ranking | history interactions | Concatenation | OPT-6.7B |
| RLMRec [144] | Ranking | history interactions, user preference | Concatenation | GPT-3.5 |
| CoRAL [186] | Ranking | history interactions, user preference | Retrieval-Augmented | GPT-4 |
| BinLLM [226] | Ranking | history interactions, user preference | Concatenation | Vicuna-7B |
| E4SRec [82] | Ranking | history interactions | Concatenation | Vicuna-7B |
| SeRALM [145] | Ranking | history interactions | Concatenation | LLaMA2-7B |
| CORONA [12] | Ranking | history interactions | Pipeline Integration | GPT-4o-mini |
| HyperLLM [16] | Ranking | history interactions | Pipeline Integration | LLaMA3-8B |
| RecLM [65] | Ranking | history interactions | Concatenation | LLaMA2-7B |
| CoLLM [227] | Ranking | history interactions | Concatenation | Vicuna-7B |
| PAD [177] | Ranking | history interactions, user preference | Concatenation | LLaMA3-8B |
| IDP [188] | Ranking | history interactions, user preference | Concatenation | T5 |

The following are the results from Table 4 of the original paper:

| Methods | Task Description | User Inputs | Token Types | Backbone |
| --- | --- | --- | --- | --- |
| P5 [38] | Ranking | historical interactions, user preference | ID-based tokenization | Transformer |
| CLLM4Rec [244] | Ranking | historical interactions | ID-based tokenization | GPT-2 |
| BIGRec [3] | Ranking | historical interactions | Text-based tokenization | LLaMA-7B |
| M6 [20] | Retrieval, ranking | historical interactions | Text-based tokenization | M6 |
| IDGenRec [158] | Ranking | historical interactions | Text-based tokenization | BERT4Rec |
| TIGER [141] | Ranking | historical interactions | Codebook-based tokenization | T5 |
| RPG [52] | Ranking | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| LC-Rec [235] | Ranking | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| ActionPiece [53] | Retrieval | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| LETTER [172] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-7B |
| TokenRec [138] | Retrieval | historical interactions | Codebooks with collaborative signals | T5-small |
| SETRec [94] | Ranking | historical interactions | Codebooks with collaborative signals | T5, Qwen |
| CCFRec [100] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-2-7B |
| LLM2Rec [48] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-2-7B |
| SIIT [15] | Retrieval | historical interactions | Self-adaptive tokenization | LLaMA-2-7B |

The following are the results from Table 5 of the original paper:

| Category | Representative Works | Formula |
| --- | --- | --- |
| Supervised Fine-Tuning | P5 (RecSys'22), LGIR (AAAI'24), LLM-Rec (TOIS'25), RecRanker (TOIS'25) | $-\log \pi_\theta(y^+ \mid x)$ |
| Self-Supervised Learning | FELLAS (TOIS'24), HFAR (TOIS'25) | $-\log \frac{\exp(\mathrm{sim}(y, y^+)/\tau)}{\sum_{n=1}^{N} \exp(\mathrm{sim}(y, y_n)/\tau)}$ |
| Reinforcement Learning | LEA (SIGIR'24), RPP (TOIS'25) | $\mathbb{E}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(y \mid x) \,\Vert\, \pi_{\mathrm{ref}}(y \mid x)\big)$ |
| Direct Preference Optimization | LettinGo (KDD'25), RosePO (ArXiv'24), SPRec (WWW'25) | $-\log \sigma\Big(\beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)}\Big)$ |
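
To make the last row of Table 5 concrete, here is a minimal sketch of the DPO objective in PyTorch; the function name, β value, and toy log-probabilities are illustrative assumptions, not code or values from any surveyed system:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """-log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -F.logsigmoid(beta * margin).mean()

# Per-sequence log-probabilities of the preferred (y+) and dispreferred (y-)
# recommendations under the policy and the frozen reference model (toy values).
logp_pos = torch.tensor([-12.1, -9.8])
logp_neg = torch.tensor([-11.5, -13.0])
ref_pos = torch.tensor([-12.4, -10.0])
ref_neg = torch.tensor([-11.3, -12.8])
print(dpo_loss(logp_pos, logp_neg, ref_pos, ref_neg))
```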

The following are the results from Table 6 of the original paper:

| Methods | Task | User Inputs | Architecture | Backbone |
| --- | --- | --- | --- | --- |
| LEARN [62] | Ranking | history interactions, user preference | Cascaded | Baichuan2-7B, Transformer |
| HLLM [9] | Retrieval, ranking | history interactions | Cascaded | TinyLlama-1.1B, Baichuan2-7B |
| KuaiFormer [98] | Retrieval | history interactions | Cascaded | Stacked Transformer |
| SRP4CTR [44] | Ranking | history interactions, user preference | Cascaded | FG-BERT |
| HSTU [216] | Ranking | history interactions | Cascaded | Transformer |
| MTGR [45] | Ranking | history interactions | Cascaded | Transformer |
| UniROM [135] | Ranking | history interactions | End-to-End | RecFormer |
| URM [63] | Ranking | history interactions, user preference | End-to-End | BERT |
| OneRec [23] | Generative retrieval and ranking | history interactions, user preference | End-to-End | Transformer |
| OneSug [43] | Generative retrieval and ranking | history interactions, user preference | End-to-End | Transformer |
| EGA-V2 [238] | Generative retrieval and ranking | history interactions, user preference | End-to-End | Transformer |

6.3. Ablation Studies / Parameter Analysis

As a survey paper, this work does not conduct its own ablation studies or parameter analyses. Instead, it synthesizes the findings and observations regarding component effectiveness and hyperparameter sensitivities from the individual research papers it reviews, integrating them into the broader discussion of model-level opportunities and open challenges. For instance, in Section 4.1.2, it discusses how different alignment mechanisms (text prompting, collaborative signals, item tokenization) contribute to LLM performance in recommendation. In Section 4.1.3, it covers various training objectives (SFT, SSL, RL, DPO) and their impact on model behavior. The scaling law discussion in Section 4.2 inherently touches upon how model size and data volume (parameters) affect performance. The challenges section (6.2) explicitly addresses bias (e.g., popularity bias, positional biases) and robustness concerns, which are often investigated through sensitivity analyses in primary research.

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey provides a comprehensive examination of how generative models, particularly Large Language Models (LLMs) and diffusion models, are fundamentally revolutionizing recommender systems (RSs). It argues that this marks a significant paradigm shift from discriminative matching to intelligent synthesis.

The core contribution is a unified tripartite framework that analyzes this transformation across three dimensions:

  1. Data Level: Generative models are enabling unprecedented data augmentation (e.g., knowledge-infused, agent-based simulation) and data unification (e.g., multi-domain, multi-task, multi-modal), addressing long-standing issues like data sparsity and cold-start.

  2. Model Level: The survey categorizes and details the rise of LLM-based methods, Large Recommendation Models (LRMs), and diffusion approaches as core recommendation engines. These models leverage scaling laws and novel alignment mechanisms to achieve powerful capabilities.

  3. Task Level: Generative models unlock new functionalities beyond traditional top-K recommendations, including conversational interaction, explainable reasoning, and personalized content generation, fundamentally redefining user-system interaction.

    The paper highlights five key advantages of this new paradigm: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. It concludes that generative RSs are moving towards intelligent recommendation assistants that are open, task-agnostic, and adaptive, fundamentally reshaping how humans interact with information.

7.2. Limitations & Future Work

The survey critically examines several open challenges that need to be addressed for the full potential of generative recommendation to be realized:

  • Data Challenges:
    • Benchmark Design: Current datasets are largely non-interactive, offline, and static, making them unsuitable for evaluating generative models as personalized assistants that operate in dynamic, multi-round, interactive settings across diverse tasks. There's an urgent need for new benchmarks that capture real-world complexity and support the assessment of generative capabilities.
  • Model Challenges:
    • Bias: Generative RSs face significant bias issues:
      • Popularity bias: LLMs, trained on vast corpora, tend to rank popular items higher, reducing diversity and potentially marginalizing less popular content. Existing methods to mitigate this often involve adjusting or generating training data, but more research is needed on using LLMs to generate fair user interactions.
      • Fairness: Models can implicitly utilize sensitive attributes (e.g., gender, race) from data, leading to biased recommendations. Ensuring unbiased recommendations requires careful consideration of the overlap between user preference modeling and sensitive information.
      • Positional biases: LLMs are sensitive to prompt structure, item order, and content, which can introduce positional biases and increase output uncertainty.
    • Robustness:
      • Natural noise: Recommendation tasks are plagued by noise (e.g., clickbait, unintended interactions). While LLMs' knowledge and reasoning might help, a significant gap exists between LLM pre-training objectives and denoising recommendation. LLMs can hallucinate and misclassify noise.
      • Malicious attack: Injection attacks are a concern for traditional RSs, but for generative RSs, textual simulation attacks (rephrasing item descriptions to be adversarial) are particularly low-cost and transferable across models. The robustness of multimodal LLM-based recommendations against attacks is an emerging research area, and existing defense strategies for data poisoning attacks are limited.
  • Deployment Challenges:
    • Training Efficiency: Parameter-efficient fine-tuning (PEFT) methods exist, but are insufficient for the rapidly increasing scale of recommendation datasets. The challenge is data-efficient fine-tuning, rapidly adapting LLMs with fewer data and computational resources, especially considering correlations in interaction records.
    • Inference Efficiency: Autoregressive decoding in generative models leads to inefficient inference due to multiple serial calls, hindering real-time recommendation. Traditional LM acceleration methods (e.g., speculative decoding) are hard to apply directly to top-K distinct sequence generation (which requires beam search). Knowledge distillation can help but is not a complete solution.

7.3. Personal Insights & Critique

This survey provides an incredibly timely and well-structured overview of a rapidly evolving field. Its tripartite data-model-task framework is particularly insightful, offering a clear lens through which to understand the multifaceted impact of generative AI on recommender systems. The emphasis on the transition from discriminative scoring to generative synthesis is a powerful conceptual shift that truly captures the essence of this new paradigm.

Inspirations and Applications:

  • Beyond Ranking: The most significant inspiration is the move beyond mere item ranking to genuine content generation and conversational interaction. This opens doors for RSs to become truly intelligent assistants. Imagine a fashion recommender that doesn't just suggest clothes, but generates unique outfit compositions based on your preferences and current trends, or a travel assistant that generates a full itinerary including personalized activities and even hypothetical experiences.
  • Cold-Start & Long-Tail Solutions: The potential for data augmentation and agent-based simulation to address cold-start and long-tail problems is enormous. This could democratize recommendations for niche content or new users, currently underserved by traditional data-hungry methods.
  • Explainability & Trust: The ability to reason and generate explanations fundamentally enhances user trust and transparency. Instead of a black-box suggestion, users can understand the logic, which is crucial for high-stakes recommendations (e.g., financial products, healthcare).
  • Cross-Domain Transfer: The data unification capabilities of LLMs suggest that a single, large generative recommender could potentially serve multiple domains (e.g., movies, music, books) without significant retraining, leading to more efficient and scalable systems.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Evaluation Gap: While the survey articulates the need for new benchmarks, the practical challenges of evaluating open-ended generative outputs (e.g., personalized text, images) are immense. Traditional metrics like Recall@K or RMSE don't capture creativity, coherence, or conversational flow. Developing robust, scalable, and fair qualitative and quantitative evaluation frameworks remains a grand challenge. The risk of hallucination in generated content is also a major concern that needs rigorous evaluation.

  • Control and Safety: Generative models, especially LLMs, are known for issues like factual inaccuracies, toxicity, and bias amplification. In recommendation, this translates to generating misleading information, inappropriate content, or reinforcing harmful stereotypes. While the paper mentions bias and robustness as challenges, the concrete mechanisms for controlling output generation to ensure safety, fairness, and reliability in real-world deployment (beyond just avoiding popular items) need significant development. This is especially critical as RSs move into sensitive domains.

  • "One Model for All" Feasibility: The vision of "one model for all" is appealing for engineering efficiency, but it faces practical hurdles. Different domains and tasks might have conflicting objectives or require highly specialized knowledge that a single, large model might struggle to encapsulate efficiently. The cost-benefit analysis of training and maintaining such a colossal model versus an ensemble of specialized models needs continuous re-evaluation in industrial settings.

  • Interpretability of Latent Reasoning: While explicit reasoning is promising, the paper also mentions implicit reasoning methods. The challenge here is ensuring that this implicit reasoning truly aligns with human logic and is not just a statistical correlation. If the reasoning is latent, how can we trust or audit it for fairness and correctness?

  • Sustainability and Ethics: The immense computational resources required for training and deploying large generative models raise environmental sustainability concerns. Furthermore, the ethical implications of highly personalized content generation (e.g., filter bubbles, manipulation, privacy concerns with deep user profiles) need to be proactively addressed as this technology matures.

    Overall, this survey is a valuable resource for navigating the complex and exciting landscape of generative recommendation. It effectively maps the progress and potential while soberly identifying the significant hurdles that remain.
