A Survey on Generative Recommendation: Data, Model, and Tasks
TL;DR Summary
This survey reviews generative recommendation via a unified framework, analyzing data augmentation, model alignment, and task design, highlighting innovations in large language and diffusion models that enable knowledge integration, natural language understanding, and personalize
Abstract
Recommender systems serve as foundational infrastructure in modern information ecosystems, helping users navigate digital content and discover items aligned with their preferences. At their core, recommender systems address a fundamental problem: matching users with items. Over the past decades, the field has experienced successive paradigm shifts, from collaborative filtering and matrix factorization in the machine learning era to neural architectures in the deep learning era. Recently, the emergence of generative models, especially large language models (LLMs) and diffusion models, has sparked a new paradigm: generative recommendation, which reconceptualizes recommendation as a generation task rather than discriminative scoring. This survey provides a comprehensive examination through a unified tripartite framework spanning data, model, and task dimensions. Rather than simply categorizing works, we systematically decompose approaches into operational stages: data augmentation and unification, model alignment and training, task formulation and execution. At the data level, generative models enable knowledge-infused augmentation and agent-based simulation while unifying heterogeneous signals. At the model level, we taxonomize LLM-based methods, large recommendation models, and diffusion approaches, analyzing their alignment mechanisms and innovations. At the task level, we illuminate new capabilities including conversational interaction, explainable reasoning, and personalized content generation. We identify five key advantages: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. We critically examine challenges in benchmark design, model robustness, and deployment efficiency, while charting a roadmap toward intelligent recommendation assistants that fundamentally reshape human-information interaction.
1. Bibliographic Information
1.1. Title
The title of the paper is "A Survey on Generative Recommendation: Data, Model, and Tasks." This title clearly indicates that the paper provides a comprehensive review of the emerging field of generative recommendation, organized around three key dimensions: data, models, and tasks.
1.2. Authors
The authors are:
- Min Hou, Le Wu, Yuxin Liao, Zhen Zhang, Yu Wang, Changlong Zheng, Han Wu, and Richang Hong from Hefei University of Technology, Hefei, 230009, Anhui, China.
- Yonghui Yang from National University of Singapore, Singapore.
Their affiliations suggest a strong background in computer science, likely with a focus on artificial intelligence, machine learning, and recommender systems, given the topic of the paper.
1.3. Journal/Conference
The paper is published at arXiv, with a specified publication date of 2025-10-31T04:02:58.000Z. arXiv is a well-regarded preprint server for physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While it hosts preprints, many papers later appear in top-tier conferences and journals. The paper itself mentions searching top-tier conferences and journals such as ICML, ICLR, NeurIPS, ACL, SIGIR, KDD, WWW, RecSys, TKDE, TOIS for related works, indicating the expected quality and relevance of its content to these venues.
1.4. Publication Year
The publication year, based on the provided UTC timestamp, is 2025.
1.5. Abstract
Recommender systems (RSs) are crucial for helping users find relevant items in modern digital ecosystems. Traditionally, RSs focused on matching users with items through discriminative scoring. The field has evolved from collaborative filtering and matrix factorization to neural architectures. Recently, the rise of generative models, particularly large language models (LLMs) and diffusion models, has introduced a new paradigm: generative recommendation, which re-conceptualizes recommendation as a generation task.
This survey offers a comprehensive examination of this new paradigm using a unified tripartite framework that covers data, model, and task dimensions. Instead of merely categorizing works, it systematically decomposes approaches into operational stages: data augmentation and unification, model alignment and training, and task formulation and execution.
At the data level, generative models enable knowledge-infused augmentation and agent-based simulation, while unifying heterogeneous signals. At the model level, the survey taxonomizes LLM-based methods, large recommendation models, and diffusion approaches, analyzing their alignment mechanisms and innovations. At the task level, it highlights new capabilities such as conversational interaction, explainable reasoning, and personalized content generation.
The paper identifies five key advantages of generative recommendation: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. It critically examines challenges in benchmark design, model robustness, and deployment efficiency, concluding with a roadmap toward intelligent recommendation assistants that aim to fundamentally reshape human-information interaction.
1.6. Original Source Link
- Official Source Link: https://arxiv.org/abs/2510.27157
- PDF Link: https://arxiv.org/pdf/2510.27157v1.pdf
- Publication Status: This is a preprint published on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the fundamental challenge of matching users with items in recommender systems (RSs), which are foundational infrastructure in modern information ecosystems. RSs are critical for alleviating information overload, helping users discover digital content aligned with their preferences, and boosting traffic and revenue for service providers across diverse domains like e-commerce, social media, education, online video, and music services.
Despite significant advancements over the past decades, driven by deep learning architectures and large-scale user behavior data, traditional RSs (primarily discriminative matching paradigms) face several challenges:
- Limited Semantic Knowledge: They often rely on manually processed and limited semantic knowledge.
- Sub-optimal Performance for Small-Scale Models: Their performance can be constrained with smaller models.
- Dependence on Fixed Candidate Sets: They require a pre-defined set of items to choose from.
- Task-Specific Architectures: They often need task-specific architectures and training objectives.
- Cold-Start Scenarios: They struggle to provide recommendations for new users or items with little historical data.
- Lack of Transparency: They find it difficult to offer transparent, context-rich explanations for their recommendations.

The paper's entry point and innovative idea stem from the recent emergence of generative models, particularly Large Language Models (LLMs) and diffusion models. These models have sparked a new paradigm called generative recommendation. This paradigm reconceptualizes the recommendation problem from a discriminative scoring procedure (i.e., predicting relevance) to a generation task (i.e., directly synthesizing recommendations). This shift promises to address the aforementioned challenges and unlock new capabilities in RSs.
2.2. Main Contributions / Findings
The paper provides a comprehensive examination of the generative recommendation paradigm through a unified tripartite framework spanning data, model, and task dimensions. Its primary contributions and key conclusions are:
- Unified Tripartite Framework: The survey proposes a novel framework that systematically decomposes generative recommendation approaches into operational stages: data augmentation and unification, model alignment and training, and task formulation and execution. This provides a structured understanding of how generative models influence the entire recommendation pipeline.
- Data-Level Innovations: Generative models enable:
  - Knowledge-infused augmentation: Leveraging vast world knowledge to enrich sparse recommendation data and item representations.
  - Agent-based simulation: Simulating user behaviors and interactions to address data sparsity and cold-start issues.
  - Unification of heterogeneous signals: Integrating diverse data types (multi-domain, multi-task, multi-modal) into coherent inputs, moving towards "one model for all."
- Model-Level Innovations: The survey taxonomizes and analyzes three main approaches:
  - LLM-based methods: Using pre-trained LLMs as recommendation backbones.
  - Large Recommendation Models (LRMs): Scaling up traditional recommendation architectures with generative components, demonstrating scaling laws native to recommendation.
  - Diffusion approaches: Reconceptualizing recommendation as a denoising process to generate user preferences or item rankings.
- Task-Level Innovations: Generative models unlock new capabilities beyond traditional top-K recommendations, including:
  - Conversational interaction: Enabling multi-turn, natural language dialogues for dynamic preference elicitation.
  - Explainable reasoning: Providing transparent justifications and logical processes behind recommendations.
  - Personalized content generation: Creating novel content (e.g., personalized text, visual designs) rather than just ranking existing items.
- Five Key Advantages of Generative Recommendation:
  - World knowledge integration: Seamlessly incorporating real-world semantic information.
  - Natural language understanding: Interpreting nuanced user expressions in free-form text.
  - Reasoning capabilities: Modeling logical processes behind user decisions.
  - Scaling laws: Predicting performance improvement with increased model size and data.
  - Creative generation: Producing novel content and diverse recommendations.
- Identified Challenges and Roadmap: The paper critically examines significant challenges in benchmark design (current datasets are unsuitable), model robustness (bias and adversarial attacks), and deployment efficiency (training and inference costs). It charts a roadmap towards intelligent recommendation assistants that are open, task-agnostic, and responsive to evolving user needs, fundamentally reshaping human-information interaction.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following foundational concepts in recommender systems and machine learning:
- Recommender Systems (RSs): At its core, a recommender system aims to predict a user's preference for an item and suggest items that the user is likely to enjoy. RSs address information overload by filtering content and providing personalized suggestions. Examples include product recommendations on e-commerce sites (e.g., Amazon), movie suggestions (e.g., Netflix), or news feeds (e.g., social media).
- Collaborative Filtering (CF): This is a traditional approach where recommendations are based on the tastes of other users. If user A and user B have similar preferences (e.g., they both liked the same movies), then user A might be recommended movies that user B liked but A hasn't seen yet. There are two main types: user-based CF (finding similar users) and item-based CF (finding similar items).
- Matrix Factorization (MF): A powerful technique in CF, especially popularized by the Netflix Prize. It works by decomposing the user-item interaction matrix (where rows are users, columns are items, and entries are ratings) into two lower-dimensional matrices: a user-factor matrix and an item-factor matrix.
  - User-factor matrix: Represents users in a latent space, where each dimension captures a certain aspect of preference.
  - Item-factor matrix: Represents items in the same latent space.

  The dot product of a user's latent vector and an item's latent vector then predicts the rating the user would give to that item. This approach helps in uncovering hidden latent features or factors that explain observed ratings.
- Deep Learning (DL) in RSs: The application of neural networks to recommender systems. DL models (such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Graph Neural Networks (GNNs), and Transformers) can learn complex, non-linear relationships and rich representations from diverse data types (text, images, social networks, knowledge graphs). This allows them to capture intricate semantics of user-item interactions and improve user/item representation learning.
- Discriminative Models vs. Generative Models: This is a crucial distinction made in the paper.
  - Discriminative Models: These models learn to map inputs to outputs or distinguish between different classes. In the context of RSs, a discriminative model learns a function $f(u, i)$ that predicts the probability or score of a user $u$ interacting with an item $i$. They model the conditional probability $P(y \mid x)$ (the probability of output $y$ given input $x$). For example, predicting whether a user will click on an item or what rating they will give.
  - Generative Models: These models learn the underlying distribution of the data itself. They can generate new data samples that are similar to the training data. In RSs, a generative model learns how users and items are "generated" together, i.e., the joint probability distribution $P(x, y)$. This allows them to directly generate recommended items or personalized content, rather than just scoring existing ones.
- Large Language Models (LLMs): These are a type of generative model that has gained significant attention recently. LLMs are neural network models with a massive number of parameters (billions to trillions) trained on vast amounts of textual data (e.g., internet text, books). Their core ability is to generate human-like text by predicting the next word in a sequence. They exhibit emergent abilities such as in-context learning (learning from examples within the prompt), complex reasoning, and natural language understanding, making them highly versatile for various tasks beyond pure text generation.
- Diffusion Models: Another type of generative model, primarily known for generating high-quality images. The core idea is to learn to reverse a gradual noise process. During training, noise is progressively added to an image until it becomes pure noise. The model then learns to reverse this process, denoising the image step-by-step to reconstruct the original. This allows them to generate diverse and realistic data. In RSs, they can be adapted to generate user preferences, item rankings, or even synthetic user data by treating recommendation as a denoising problem.
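To make the noise-adding half of the diffusion idea concrete, the following is a minimal sketch of the forward process applied to a user's interaction vector. It assumes the closed-form Gaussian forward process commonly used in DDPM-style diffusion models together with a linear noise schedule; the schedule values, the toy vector `x0`, and the function names are illustrative assumptions and are not taken from the survey.

```python
import math
import random

def linear_betas(steps, start=1e-4, end=0.02):
    """Assumed linear noise schedule: beta grows from `start` to `end`."""
    return [start + (end - start) * k / (steps - 1) for k in range(steps)]

def signal_retention(betas, t):
    """alpha_bar_t: product of (1 - beta_s) for s = 0..t."""
    return math.prod(1.0 - b for b in betas[: t + 1])

def forward_noise(x0, t, betas, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a_bar = signal_retention(betas, t)
    return [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
        for x in x0
    ]

rng = random.Random(0)
betas = linear_betas(1000)
x0 = [1.0, 0.0, 0.0, 1.0, 1.0]   # hypothetical user: 1.0 = interacted with the item

x_early = forward_noise(x0, 10, betas, rng)    # still close to x0
x_late = forward_noise(x0, 999, betas, rng)    # close to pure Gaussian noise
```

A denoising network (not shown) would be trained to undo these steps; at recommendation time it would start from noise, or from a partially noised history, and iteratively recover a preference vector.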
3.2. Previous Works
The paper frames itself as a comprehensive survey that builds upon, yet differentiates from, several prior reviews on LLM-based recommender systems. These previous works, primarily published in 2024 or earlier, have set the stage for understanding the intersection of LLMs and RSs:
- Wu et al. [187] (2024): This survey systematically reviewed LLM-based recommendation systems, categorizing studies based on modeling paradigms: LLM Embeddings enhanced RS, LLM Tokens enhanced RS, and LLM as RS. This provided an early structural understanding of how LLMs were being integrated.
- Lin et al. [92] (2025): They introduced two orthogonal perspectives: "where" (e.g., for data augmentation, as core model) and "how" (e.g., tuning LLM, involving conventional models) to adapt LLMs in recommender systems. This offered a more detailed view of the integration strategies.
- Zhao et al. [233] (2024): Reviewed LLM-empowered recommender systems from various aspects, including pre-training, fine-tuning, and prompting paradigms. This focused on the training and deployment aspects of LLMs in RSs.
- Deldjoo et al. [22] (2024): Connected key advancements in RS using Generative Models (Gen-RecSys), covering interaction-driven generative models, LLM and textual data for natural language recommendation, and multimodal models for generating/processing images/videos. This was a broader view encompassing various generative models.
- Liu et al. [106] (2024): Explored advancements in multimodal pretraining, adaptation, and generation techniques, and their applications to recommender systems. This highlighted the growing importance of multimodal data.
- Li et al. [80] (2023): Reviewed recent progress of LLM-based generative recommendation and provided a general formulation for each generative recommendation task. This focused on task-specific applications.
- Wang et al. [170] (2024): Reviewed existing LLM-based recommendation works and discussed the gap from academic research to industrial application. This provided a practical perspective.

While these surveys laid important groundwork, the current paper emphasizes that they largely reflect the state of the field up until 2024, potentially overlooking newer research, especially from 2025 and beyond.
3.3. Technological Evolution
The evolution of recommender systems can be broadly outlined as follows:
- Early Heuristics (1990s): Initial systems relied on content-based filtering (recommending items similar to those a user liked previously, based on item attributes) and collaborative filtering (CF, recommending based on user-item interaction patterns, e.g., similar users like similar items). These were often rule-based or used simple statistical methods.
- Machine Learning Era (2000s): This era was marked by the prominence of Matrix Factorization (MF), especially after the Netflix Prize. MF provided a more sophisticated mathematical framework for CF, learning latent factors that explain user preferences. This shifted focus to learning underlying representations.
- Deep Learning Era (Mid-2010s onwards): Advancements in neural networks (CNNs, RNNs, GNNs, Transformers) led to deep learning-based recommendation methods. These approaches leveraged the powerful representation capabilities of deep learning to handle complex, heterogeneous data (text, images, social networks) and learn non-linear mappings of user-item interactions, enhancing the learning of user and item representations. This moved RSs from shallow models to deeper, more complex architectures.
- Generative AI Era (Recently): The latest paradigm shift, driven by Large Language Models (LLMs) and diffusion models. This era re-conceptualizes recommendation from a discriminative matching (scoring) paradigm to a generative synthesis (creating) paradigm. Instead of predicting a score for an existing item, the goal becomes to directly generate the target document or item itself. This fundamentally changes how recommendations are modeled and opens up new possibilities for content creation, interactive experiences, and leveraging vast world knowledge.

This paper's work fits squarely into the current Generative AI Era, providing a timely and comprehensive overview of how these cutting-edge generative models are transforming the entire RS pipeline.
3.4. Differentiation Analysis
Compared to the main methods and prior surveys in related work, this paper's core differences and innovations are:
- Broader Coverage of Generative Paradigms: While earlier surveys primarily focused on LLM-based recommendation, this survey offers a more expansive view by categorizing research into LLM-based generative recommendation, large recommendation models (LRMs), and diffusion-based generative recommendation. This inclusive approach ensures that the latest advancements across different generative model types are captured.
- Unified Data-Model-Task Framework: This survey introduces a data-model-task framework, which is a systematic and comprehensive way to analyze the contributions of generative models. This goes beyond simple categorization by providing a pipeline-centric understanding:
  - Data Level: How generative models enhance or augment data (e.g., data generation, data unification).
  - Model Level: How generative mechanisms are integrated into the core recommendation architecture.
  - Task Level: How generative modeling extends to high-level objectives and novel capabilities.

  This framework provides a more holistic and in-depth understanding of the evolving role of generative models.
- Emphasis on Task-Level Innovations: The survey dedicates a significant section to task-level innovations, examining how generative models enable novel recommendation scenarios. This includes interactive recommendation, conversational recommendation, and personalized content generation, areas that were often underexplored or not highlighted as central to generative RSs in prior works.
- Up-to-Date Roadmap and Challenges: The paper incorporates research published through 2025, addressing new developments such as agent-based recommender systems using LLMs and diverse LLM-based recommendation methods beyond supervised fine-tuning (SFT). It concludes with an updated discussion on current challenges (e.g., benchmark design, model robustness, deployment efficiency) and future research directions, offering a more current roadmap for the field.

In essence, this survey provides a more structured, comprehensive, and forward-looking perspective on generative recommendation, integrating diverse model types and operational stages within a unified analytical framework.
4. Methodology
4.1. Principles
The core idea behind the methodology outlined in this survey is to shift the paradigm of recommender systems from discriminative matching to generative synthesis. Traditionally, recommender systems operate by learning a scoring or ranking function to estimate the relevance of existing items to a user. In contrast, the generative recommendation paradigm reconceptualizes this problem as a generation task, where the system directly produces the target item or recommendation output.
The theoretical basis for this shift lies in the fundamental difference between discriminative and generative models:
- Discriminative Models: Learn the conditional probability $P(y \mid x)$, or directly map input $x$ to predict output $y$. In recommendation, this means learning $P(y \mid u, i)$ or a function $f(u,i)$ that scores the likelihood of interaction.
- Generative Models: Learn the joint probability distribution $P(x,y)$, meaning they model how both the input $x$ (e.g., user preferences) and the label $y$ (e.g., item) are generated together. This allows them to synthesize new data points (items) or sequences (recommendation lists).

The intuition is that by embracing generative capabilities, recommender systems can move beyond merely selecting from a fixed set of candidates to actively creating personalized content, engaging in dynamic conversations, leveraging vast world knowledge, and exhibiting emergent reasoning capabilities, all qualities inherent to powerful generative models like LLMs and diffusion models. This survey systematically examines how this generative capability is applied across the entire recommendation pipeline: at the data level (how data is prepared and enriched), the model level (how the core recommendation engine is built), and the task level (what new functionalities and applications become possible).
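The link between the two model families can be written out with the product rule. This is a standard probability identity rather than a formula from the survey, stated here for a discrete label $y$:

$
P(y \mid x) = \frac{P(x, y)}{P(x)} = \frac{P(x, y)}{\sum_{y'} P(x, y')}
$

A model of the joint distribution therefore determines the conditional one, while a purely discriminative model of $P(y \mid x)$ says nothing about $P(x)$ and cannot be sampled to produce new inputs.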
4.2. Core Methodology In-depth (Layer by Layer)
The paper deconstructs generative recommendation through its proposed tripartite framework: data, model, and task. To provide context, it first revisits the traditional discriminative recommendation paradigm.
4.2.1. Preliminaries of Discriminative Recommendation Models
Discriminative recommendation models focus on learning a scoring or ranking function $f(u,i)$ that estimates the relevance or affinity between a user $u$ and an item $i$.
Data Preparation:
The training data consists of tuples $(u, i, y_{ui})$, where $u$ represents a user, $i$ represents an item, and $y_{ui}$ denotes the observed interaction. This interaction can be ratings (for explicit feedback) or binary values (for implicit feedback). Inputs $u$ and $i$ can be one-hot IDs in collaborative filtering methods. Auxiliary content data like user social networks, profiles, multimedia descriptions (images, videos, texts, audio), and knowledge graphs are often used to enrich $u$ and $i$.
Model Construction:
At training time, discriminative recommendation methods typically start by using embedding layers to map each user and item to a dense embedding vector: $\mathbf{e}_u = E_{\mathrm{user}}(u)$ and $\mathbf{e}_i = E_{\mathrm{item}}(i)$. Here, $E_{\mathrm{user}}$ and $E_{\mathrm{item}}$ are embedding layers for users and items, which can be simple lookup tables or more complex architectures like Multi-Layer Perceptrons (MLPs), Graph Neural Networks (GNNs), Convolutional Neural Networks (CNNs), or transformers.
Then, these models compute a matching score between the user and item embeddings: $f(u,i) = s(\mathbf{e}_u, \mathbf{e}_i)$. Common scoring functions $s$ include inner product, distance-based metrics, and neural network-based metrics.
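The embed-then-score step can be sketched in a few lines. This is a minimal illustration that assumes the simplest choices named above, lookup-table embeddings and an inner-product score; the table sizes, the random initialization, and the helper names are hypothetical.

```python
import random

def make_embedding_table(num_rows, dim, rng):
    """A lookup-table embedding layer: one dense vector per ID."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def score(user_table, item_table, u, i):
    """f(u, i) computed as the inner product of the user and item embeddings."""
    e_u, e_i = user_table[u], item_table[i]
    return sum(a * b for a, b in zip(e_u, e_i))

rng = random.Random(42)
user_table = make_embedding_table(num_rows=3, dim=4, rng=rng)   # 3 hypothetical users
item_table = make_embedding_table(num_rows=5, dim=4, rng=rng)   # 5 hypothetical items

# Scores of every item for user 0; training would adjust the tables so that
# observed interactions receive higher scores than unobserved ones.
scores_for_user_0 = [score(user_table, item_table, 0, i) for i in range(5)]
```

Swapping the inner product for a distance or a small neural network changes only the body of `score`; the lookup step stays the same.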
The models are usually trained to discriminate between positive and negative interactions. Commonly used loss functions include:
- Mean Squared Error (MSE) loss for explicit feedback (e.g., predicting ratings):
$
\mathcal{L}_{\mathrm{rating}} = \frac{1}{N} \sum_{i=1}^{N} (y_{ui} - f(u,i))^2
$
Here, $N$ is the total number of observed interactions, $y_{ui}$ is the observed interaction (e.g., rating) between user $u$ and item $i$, and $f(u,i)$ is the predicted interaction score.
- Binary Cross Entropy (BCE) loss for implicit feedback (e.g., predicting clicks):
$
\mathcal{L}_{\mathrm{point}} = - \sum_{(u,i) \in D} [y_{ui} \log \sigma (f_{ui}) + (1-y_{ui}) \log (1-\sigma (f_{ui}))]
$
In this formula, $D$ is the dataset of observed interactions, $y_{ui}$ is typically 1 for a positive interaction and 0 for a negative interaction, $f_{ui}$ is the raw predicted score, and $\sigma$ is the sigmoid function, which squashes the score into a probability between 0 and 1.
- Bayesian Personalized Ranking (BPR) loss for implicit feedback, which optimizes for ranking rather than absolute scores:
$
\mathcal{L}_{\mathrm{pair}} = - \sum_{(u,i^+,i^-) \in D} \log \sigma (f_{ui^+} - f_{ui^-})
$
Here, $D$ represents triplets of (user $u$, positive item $i^+$, negative item $i^-$), where $i^+$ is an item user $u$ interacted with, and $i^-$ is an item user $u$ did not interact with. The objective is to maximize the difference between the score of the positive item and the negative item for a given user.
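The three losses above translate directly into code. The sketch below is a plain-Python transcription that operates on already-computed scores; the function names and toy inputs are illustrative, and it is not numerically hardened (very large negative scores would overflow `math.exp`).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_loss(ratings, preds):
    """Mean squared error over observed (rating, prediction) pairs."""
    return sum((y - f) ** 2 for y, f in zip(ratings, preds)) / len(ratings)

def bce_loss(labels, raw_scores):
    """Pointwise binary cross-entropy on raw scores f_ui (summed, as in the formula)."""
    return -sum(
        y * math.log(sigmoid(f)) + (1 - y) * math.log(1.0 - sigmoid(f))
        for y, f in zip(labels, raw_scores)
    )

def bpr_loss(pos_scores, neg_scores):
    """Pairwise BPR loss over aligned (positive, negative) score pairs."""
    return -sum(math.log(sigmoid(p - n)) for p, n in zip(pos_scores, neg_scores))

print(mse_loss([5.0, 3.0], [4.0, 3.0]))   # 0.5
print(bce_loss([1, 0], [0.0, 0.0]))       # 2 * ln 2, about 1.386
print(bpr_loss([2.0], [2.0]))             # ln 2, about 0.693
```

Note that the BPR loss depends only on score differences, so it falls as the positive item is ranked further above the negative one, regardless of the absolute score values.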
Recommendation Task:
For discriminative systems, the final recommendation task is mostly to select the top-K items that users might like from a candidate list. This is referred to as top-K recommendation.
At inference time, given a user $u$ and a candidate item set $I$, the models compute scores and rank the items:
$
\hat{i} = \arg \max_{i \in I} f(u,i)
$
$
\mathrm{TopK}_u = \mathrm{Top\text{-}K}_{i \in I} f(u,i)
$
Here, $\hat{i}$ represents the item with the highest predicted score for user $u$, and $\mathrm{TopK}_u$ denotes the set of top-K items ranked by their predicted scores $f(u,i)$ for user $u$. This process requires calculating a matching score for each item in the candidate list, then ranking and selecting the top items.
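The score-then-rank step can be illustrated as follows. This is a small sketch with hypothetical precomputed scores standing in for a trained $f(u,i)$; it uses the standard-library `heapq.nlargest`, which avoids fully sorting the candidate list.

```python
import heapq

def top_k(score_fn, user, candidates, k):
    """Score every candidate for `user` and return the k highest-scoring items."""
    return heapq.nlargest(k, candidates, key=lambda item: score_fn(user, item))

# Hypothetical precomputed scores f(u, i) for a single user "u1".
toy_scores = {("u1", "a"): 0.2, ("u1", "b"): 0.9, ("u1", "c"): 0.5, ("u1", "d"): 0.7}

def f(user, item):
    return toy_scores[(user, item)]

candidates = ["a", "b", "c", "d"]
best = max(candidates, key=lambda item: f("u1", item))   # the arg-max item
top2 = top_k(f, "u1", candidates, k=2)

print(best)   # b
print(top2)   # ['b', 'd']
```

Every candidate is scored once, which is the cost the text refers to: inference time grows with the size of the candidate set.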
4.2.2. Generative Recommendation Models
In contrast to the discriminative approach, generative recommendation is defined as a broad paradigm where generative models (such as LLMs, diffusion models) are utilized across various stages of the recommendation pipeline. This paradigm can still be categorized into three major phases: data, model, and tasks.
Data-Level Synthesis:
Generative models are used to synthesize training data, including both user/item features and interaction records. This is particularly useful for addressing challenges like cold-start problems (where there's insufficient data for new users or items) or sparsity in datasets. Formally, let $\mathcal{V}$ represent the original user set, $I$ the original item set, and $\mathcal{Y}$ the original interaction set. Generative models produce:
$
\mathcal{V}', I', \mathcal{Y}' = G_{\mathrm{data}}(\mathcal{V}, I, \mathcal{Y} | \theta_g)
$
Here, $G_{\mathrm{data}}$ is a generative model parameterized by $\theta_g$, which generates synthetic user features ($\mathcal{V}'$), item features ($I'$), and interaction records ($\mathcal{Y}'$) to create an enriched training dataset for recommendation models.
Model-Level Recommendation:
At the model level, generative models serve as the core recommendation engine, directly learning user preferences and generating personalized recommendations. Current mainstream approaches fall into three categories:
- LLM-based approaches: Leverage pre-trained LLMs as recommendation backbones. They convert recommendation data into samples with textual input and output, modeling the recommendation task as a neural language generation process.
- Large Recommendation Models (LRMs): Scale up traditional recommendation architectures with generative components. These models use massive parameters to model complex user-item interactions and generate high-quality recommendations, often exhibiting scaling laws similar to LLMs.
- Diffusion model-based approaches: Treat recommendation as a denoising process. They learn to generate user preferences or item rankings through iterative refinement, transforming noise into meaningful recommendation signals.
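As an illustration of the LLM-based route, the sketch below shows one way interaction data might be converted into a sample with textual input and output. The prompt wording, the `max_history` cutoff, and the function name are hypothetical choices made for this example; the survey does not prescribe a specific template.

```python
def to_text_sample(history, target, max_history=5):
    """Turn one user's interaction history into an (input text, output text) pair."""
    recent = history[-max_history:]   # keep only the most recent items
    prompt = (
        "The user has interacted with the following items in order: "
        + ", ".join(recent)
        + ". Recommend the next item."
    )
    return {"input": prompt, "output": target}

sample = to_text_sample(["The Matrix", "Inception", "Interstellar"], "Tenet")
print(sample["input"])
print(sample["output"])
```

Pairs of this shape can then be used for fine-tuning or prompting a language model, so that producing the next item becomes a text generation problem rather than a scoring problem over a fixed candidate list.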
Task-Level Generation:
At the task level, generative models reformulate recommendation as a generation task that produces outputs in natural language or structured formats. This paradigm not only addresses traditional tasks (like sequential recommendation or click-through rate prediction) through generative means but also enables novel capabilities:
- Generating personalized explanations for recommendations.
- Creating conversational recommendation dialogues.
- Producing item reviews and descriptions.
- Creating virtual items or multi-modal recommendation content (combining text, images, etc.).

This opens new possibilities for interpretable and interactive recommendation experiences.
4.2.3. Data-Level Opportunities
LLMs provide new possibilities for data-centric advancements in RSs by enabling effective data generation and unification.
The following figure (Figure 4 from the original paper) shows the outline of key techniques in LLM-empowered data generation:
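The image is Figure 4, which shows the framework of key techniques for LLM-based data generation, covering content augmentation, behavior augmentation, structure augmentation, and interaction simulation. It emphasizes improving data quality through open-world knowledge and agent-based behavior simulation.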
Figure 4: Outline of key techniques in LLM-empowered data generation.
The following are the results from Table 1 of the original paper:
| Category | Representative Works | Description / Focus |
| --- | --- | --- |
| Content Augmentation | ONCE (WSDM'24), LLM-Rec (NAACL'24), LRD (SIGIR'24), MSIT (ACL'25), EXP3RT (SIGIR'25), LettinGo (KDD'25), SINGLE (WWW'24), KAR (RecSys'24), IRLLRec (SIGIR'25), LLM4SBR (TOIS'25), SeRALM (SIGIR'24), TRAWL (ArXiv'24) | Generate natural-language user/item profiles, summarize histories, enrich sparse metadata, and align textual semantics with feedback. |
| Representation Augmentation | DynLLM (ArXiv'24), GE4Rec (ICML'24), HyperLLM (SIGIR'25) | Automated feature construction, multimodal attribute extraction, external knowledge distillation, and hierarchical category generation. |
| Behavior Augmentation | ColdLLM (WSDM'25), Wang et al. (WWW'25), LLM-FairRec (SIGIR'25), LLM4IDRec (TOIS'25) | Generate synthetic user-item interactions, simulate cold-start preferences, ensure fairness, and integrate pseudo-interactions into ID-based pipelines. |
| Structure Augmentation | SBR (SIGIR'25), LLMRec (WSDM'24), Chang et al. (AAAI'25), CORONA (SIGIR'25), LLM-KERec (CIKM'24), TCR-QF (IJCAI'25), COSMO (SIGMOD'24) | Relation discovery, graph completion, social network generation, subgraph retrieval, knowledge graph construction & distillation. |
3.1. Data Generation:
LLMs leverage their open-world knowledge, natural language understanding, and generative capabilities to enrich, synthesize, and unify recommendation data. This is categorized into four dimensions of augmentation and agent-based simulation:
- 3.1.1. Open-world Knowledge for Augmentation: LLMs' vast pre-training data allows them to enrich and synthesize recommendation data.
    - Content Augmentation: LLMs generate natural-language representations for users and items. For instance, LLM-Rec [116] uses diverse prompting to extract insights from LLM knowledge. LRD [202] uses variational reasoning to find item relationships. MSIT [76] leverages multimodal LLMs (MLLMs) to mine item attributes from images and text. Approaches like SINGLE [111] and KAR [189] extract user preferences from interaction logs. SeRALM [145] and LettinGo [168] design prompts and use Direct Preference Optimization (DPO) to align generated content with recommendation goals.
    - Representation Augmentation: LLMs automate feature construction. DynLLM [234] uses an LLM as a content encoder. HyperLLM [16] generates hierarchical categories. GE4Rec [209] proposes a generative feature generation paradigm.
    - Behavior Augmentation: LLMs address data sparsity and cold-start by generating synthetic user-item interactions. ColdLLM [58] uses a coupled-funnel architecture for cold-start user interaction simulation. LLM-FairRec [75] generates fair pseudo-interactions. LLM4IDRec [14] augments ID-based interaction data.
    - Structure Augmentation: LLMs induce higher-level semantic structures. SBR [10] aligns item features with hierarchical intents. LLMRec [185] infers missing graph nodes/edges. CORONA [12] retrieves intent-aware subgraphs. LLM-KERec [232] infers new knowledge graph triples.
- 3.1.2. Agent-Based Behavior Simulation: LLM-driven agents simulate human-like cognition, memory, emotions, decision-making, and reflection to generate authentic user profiles and dynamic behaviors.
    - Interaction Simulation: Agents simulate individual user behaviors. Agent4Rec [219] simulates diverse user behaviors with factual and emotional memories. AgentCF [221] models collaborative filtering by simulating both user and item agents. SimUSER [5] and SUBER [18] design cognitive agents with episodic memory for realistic behavior logs.
    - Social Simulation: Agents simulate large-scale social dynamics. GGBond [239] models evolving social ties and trust dynamics. RecAgent [169] constructs a sandbox to study information silos and conformity.
3.2. Data Unification: LLMs unify heterogeneous data across tasks, domains, and modalities.
The following figure (Figure 5 from the original paper) illustrates LLM empowered data unification:
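The image is a diagram illustrating LLM-driven data unification (Figure 5). It covers the unified handling of multi-domain, multi-task, and multi-modal data, reflecting the "one model for all" idea.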
Figure 5: LLM empowered data unification
- 3.2.1. Multi-Domain Data Unification: Addresses cross-domain recommendation challenges like behavioral sparsity and domain gaps. DMCDR [83] uses a diffusion model for preference transfer. LLM4CDSR [105] and LLMCDSR [194] use LLMs to extract semantic representations and generate pseudo-interactions across domains. Uni-CTR [31] and MoLoRec [51] learn domain-general knowledge.
- 3.2.2. Multi-Task Data Unification: Integrates diverse recommendation objectives (rating, ranking, explanation, intent recognition) into a single framework. P5 [38] formulates tasks as text-to-text generation. GPSD [163] combines generative pretraining with discriminative fine-tuning. ARTS [113] uses self-prompting for joint prediction and explanation.
- 3.2.3. Multi-Modal Data Unification: Integrates text, images, and behavior logs using large vision-language models (LVLMs). UniMP [184] and MQL4GRec [217] unify multimodal inputs into shared semantic spaces. LLaRA [90] integrates item IDs and text. PAD [178] aligns modalities via a three-stage pretrain-align-disentangle process.
- 3.2.4. One Model for All: A paradigm shift towards unified, general-purpose models. P5 [38] pioneered reformulating recommendation as text-to-text generation. M6-Rec [20] enables open-ended multimodal generation. UniTRec [122] integrates generative modeling with contrastive learning. CLLM4Rec [244] incorporates user/item IDs into LLM vocabularies. A-LLMRec [69] integrates pre-trained LLMs with collaborative filtering embeddings.
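To make the text-to-text idea concrete, the following is a minimal sketch of how several recommendation tasks could be rendered as (input text, target text) pairs for a single model. The templates, field names, and the `to_text_pair` helper are illustrative assumptions; they are not the prompts used by P5 or any other cited work.

```python
# Hypothetical templates; the cited papers define their own prompt collections.
TEMPLATES = {
    "rating": "What star rating will user_{user} give item_{item}?",
    "sequential": "User_{user} has interacted with {history}. What is the next item?",
    "explanation": "Explain why item_{item} is recommended to user_{user}.",
}


def to_text_pair(task, target, **fields):
    """Render one training example as an (input text, target text) pair."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    if "history" in fields:
        # Serialize the interaction history as a comma-separated list of item tokens.
        fields["history"] = ", ".join(f"item_{i}" for i in fields["history"])
    return TEMPLATES[task].format(**fields), str(target)


pairs = [
    to_text_pair("rating", 4, user=23, item=7391),
    to_text_pair("sequential", "item_1581", user=23, history=[7391, 1128]),
]
```

Because every task ends up as plain text on both sides, one sequence-to-sequence model can in principle be trained on the mixed `pairs` list with a single generation loss.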
4.2.4. Model-Level Opportunities
This section explores how generative models serve as the core recommendation engine.
4.1. LLM-Based Generative Recommendation: This approach leverages pre-trained LLMs to produce personalized suggestions.
The following figure (Figure 6 from the original paper) illustrates the paradigms aligning LLMs to recommendation:
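The image is a schematic comparing different input and output forms in LLM-based recommender systems, covering four schemes: (a) text metadata, (b) collaborative tokens, (c) ID numbers, and (d) trainable ID tokens. It illustrates how inputs are processed and how recommendations are generated.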
Figure 6: The paradigms aligning LLMs to recommendation. Inspired by the figure [214].
- 4.1.1. Pretrained LLMs Recommendation: Uses prompt design and in-context learning without heavy retraining.
    - LLM-as-Enhancer [150, 54, 67, 102, 47]: LLMs rewrite user/item profiles into natural-language features to augment traditional recommenders.
    - LLM-as-Recommender [36]: LLMs directly generate recommendations (e.g., item titles) with task-specific prompts, even in zero-shot mode. This extends to multimodal LLMs (MLLMs).
- 4.1.2. Aligning LLMs for Recommendation: Fine-tuning LLMs on recommendation-specific data to bridge the gap between generic language modeling objectives and recommendation goals. This involves presenting the LLM with structured, compact, and consistent collaborative signals.
    - Text Prompting Based Methods: User profiles are built entirely in natural language, combining task descriptions with chronological interaction history. The following are the results from Table 2 of the original paper (task description, historical interactions, and profile feedback together form the "User formulation" column group):

      | Methods | Task description | Historical interactions | Profile Feedback | Backbone |
      | --- | --- | --- | --- | --- |
      | Chat-Rec [36] | ranking | history interactions | ✓ | GPT-3.5 |
      | TALLRec [4] | preference classification | user preference | | LLaMA-7B |
      | LlamaRec [214] | retrieval, ranking | history interactions | | LLaMA2-7B |
      | LRD [202] | ranking | history interactions | | GPT-3.5 |
      | ReLLa [93] | ranking | history interactions | | Vicuna-7B |
      | CALRec [86] | ranking | history interactions | | PaLM-2 XXS |
      | BiLLP [149] | long-term interactive | history interactions, reward model | | GPT-3.5, GPT-4, LLaMA2-7B |
      | PO4ISR [157] | ranking | history interactions | | LLaMA2-7B |
      | LLM-TRSR [237] | ranking | history interactions | | LLaMA2-7B |
      | RecGPT [124] | ranking | history interactions, user preference | | RecGPT-7B |
      | KAR [189] | ranking | history interactions, user preference | | GPT-3.5 |
      | LLM4CDSR [105] | ranking | history interactions | | GPT-3.5, GLM4-Flash |
      | EXP3RT [68] | rating prediction | history interactions | | LLaMA3-8B |
      | SERAL [191] | retrieval, ranking | history interactions | | Qwen2-0.5B |
      | LettinGo [168] | ranking | history interactions | | LLaMA3-8B |
      | Reason4Rec [28] | rating prediction | history interactions, user preference | | LLaMA3-8B |
      | InstructRec [224] | ranking | history interactions | | Flan-T5-XL |
      | Uni-CTR [31] | rating prediction | history interactions, user preference | | DeBERTaV3-large |
      | BIGRec [3] | ranking | history interactions | | LLaMA-7B |
      | UPSR [140] | ranking | history interactions | | T5, FLAN-T5 |

      Early works like [86, 149, 157] fed sequences of consumed items. Later studies, such as TALLRec [4] and LlamaRec [215], enriched prompts with explicit preference statements or constrained candidate sets. LettinGo [168] used direct preference optimization (DPO) for flexible profile adaptation.
    - Collaborative Signal Based Methods: Inject collaborative signals into the user/item profile so the LLM sees both semantics and relational knowledge. The following are the results from Table 3 of the original paper:

      | Methods | Task Description | Historical Interactions | Profile Feedback | Combining Method | Backbone |
      | --- | --- | --- | --- | --- | --- |
      | iLoRA [72] | Ranking | history interactions | | Concatenation | GPT-3.5 |
      | LLM-ESR [104] | Ranking | history interactions | | Concatenation | LLaMA2-7B |
      | LLaRA [90] | Ranking | history interactions | | Concatenation | LLaMA2-7B |
      | A-LLMRec [69] | Ranking | history interactions | | Concatenation | OPT-6.7B |
      | RLMRec [144] | Ranking | history interactions, user preference | | Concatenation | GPT-3.5 |
      | CoRAL [186] | Ranking | history interactions, user preference | | Retrieval-Augmented | GPT-4 |
      | BinLLM [226] | Ranking | history interactions, user preference | | Concatenation | Vicuna-7B |
      | E4SRec [82] | Ranking | history interactions | | Concatenation | Vicuna-7B |
      | SeRALM [145] | Ranking | history interactions | | Concatenation | LLaMA2-7B |
      | CORONA [12] | Ranking | history interactions | | Pipeline Integration | GPT-4o-mini |
      | HyperLLM [16] | Ranking | history interactions | | Pipeline Integration | LLaMA3-8B |
      | RecLM [65] | Ranking | history interactions | | Concatenation | LLaMA2-7B |
      | CoLLM [227] | Ranking | history interactions | | Concatenation | Vicuna-7B |
      | PAD [177] | Ranking | history interactions, user preference | | Concatenation | LLaMA3-8B |
      | IDP [188] | Ranking | history interactions, user preference | | Concatenation | T5 |

      This includes LLM-augmented representation for CF models (e.g., [72, 104, 90], which concatenate information) and LLM-assisted summarization for CF models (e.g., CORONA [12], CoRAL [186]).
    - Item Tokenization Based Methods: Map items into the LLM's vocabulary using identifiable tokens. The following are the results from Table 4 of the original paper:

      | Methods | Task description | Historical interactions | Token types | Backbone |
      | --- | --- | --- | --- | --- |
      | P5 [38] | Ranking | historical interactions, user preference | ID-based tokenization | Transformer |
      | CLLM4Rec [244] | Ranking | historical interactions | ID-based tokenization | GPT-2 |
      | BIGRec [3] | Ranking | historical interactions | Text-based tokenization | LLaMA-7B |
      | M6 [20] | Retrieval, Ranking | historical interactions | Text-based tokenization | M6 |
      | IDGenRec [158] | Ranking | historical interactions | Text-based tokenization | BERT4Rec |
      | TIGER [141] | Ranking | historical interactions | Codebook-based tokenization | T5 |
      | RPG [52] | Ranking | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
      | LC-Rec [235] | Ranking | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
      | ActionPiece [53] | Retrieval | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
      | LETTER [172] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-7B |
      | TokenRec [138] | Retrieval | historical interactions | Codebooks with collaborative signals | T5-small |
      | SETRec [94] | Ranking | historical interactions | Codebooks with collaborative signals | T5, Qwen |
      | CCFRec [100] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-2-7B |
      | LLM2Rec [48] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-2-7B |
      | SIIT [15] | Retrieval | historical interactions | Self-adaptive tokenization | LLaMA-2-7B |

      Methods include ID-based tokenization (simple but not scalable), text-based tokenization (semantic but lengthy), codebook-based tokenization (compact), codebooks with collaborative signals (bridging semantic and collaborative information, e.g., LETTER [172], TokenRec [138]), and self-adaptive tokenization (LLMs refine identifiers, e.g., SIIT [15]).
- 4.1.3. Training Objective & Inference: The following are the results from Table 5 of the original paper:

    | Category | Representative Works | Formula |
    | --- | --- | --- |
    | Supervised Fine-Tuning | P5 (RecSys'22), LGIR (AAAI'24), LLM-Rec (TOIS'25), RecRanker (TOIS'25) | $-\log \pi_{\theta}(y^+ \mid x)$ |
    | Self-Supervised Learning | FELLAS (TOIS'24), HFAR (TOIS'25) | $-\log \frac{\exp(\mathrm{sim}(\hat{y}, \hat{y}^+)/\tau)}{\sum_{\hat{y}' \in \mathcal{N}} \exp(\mathrm{sim}(\hat{y}, \hat{y}')/\tau)}$ |
    | Reinforcement Learning | LEA (SIGIR'24), RPP (TOIS'25) | $\mathbb{E}\left[ r(x, y) - \beta D_{\mathrm{KL}}(\pi_{\theta} \,\Vert\, \pi_{\mathrm{ref}}) \right]$ |
    | Direct Preference Optimization | LettinGo (KDD'25), RosePO (ArXiv'24), SPRec (WWW'25) | $-\log \sigma\left( \beta \log \frac{\pi_{\theta}(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_{\theta}(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \right)$ |

    The table gives each objective in abbreviated form; the standard forms are spelled out under Training Objective. In the table's notation, $x$ is the user profile/context; $y^+$ / $y^-$ is the preferred/rejected item; $\pi_{\theta}$ is the policy model; $\pi_{\mathrm{ref}}$ is the reference model; $\mathrm{sim}(\cdot,\cdot)$ is similarity; $\tau$ is temperature; $\mathcal{N}$ is the negative set; $r$ is reward; $\beta$ is the penalty/scale; $D_{\mathrm{KL}}$ is KL divergence; $\sigma$ is sigmoid.
    - Training Objective: Focuses on next-item prediction.
        - Supervised Fine-Tuning (SFT): LLMs are fine-tuned with predefined templates. The standard SFT loss is the negative log-likelihood, which maximizes the probability of generating the correct output given the input. For a next-item prediction task, given an input context $x$ (user profile, history) and a target item $y^+$: $ \mathcal{L}_{\mathrm{SFT}} = - \log \pi_{\theta}(y^+ \mid x) $ Here, $\pi_{\theta}(y^+ \mid x)$ is the probability of generating the positive item $y^+$ given the input $x$ by the model with parameters $\theta$. P5 [38] and LGIR [26] are examples. SFT learns mainly from positive pairs.
        - Self-Supervised Learning (SSL): Generates auxiliary training signals to reduce reliance on manual templates. A common SSL objective is contrastive learning, which maximizes agreement between different views of the same data point and minimizes agreement with negative samples. For instance, in a contrastive loss like InfoNCE, where $\hat{y}$ is a representation of $y$ and $\hat{y}'$ ranges over the candidate items: $ \mathcal{L}_{\mathrm{SSL}} = - \log \frac{\exp(\mathrm{sim}(\hat{y}, \hat{y}^+)/\tau)}{\sum_{\hat{y}' \in \mathcal{N} \cup \{\hat{y}^+\}} \exp(\mathrm{sim}(\hat{y}, \hat{y}')/\tau)} $ Here, $\mathrm{sim}(\cdot,\cdot)$ denotes a similarity function (e.g., cosine similarity), $\tau$ is a temperature parameter, $\mathcal{N}$ is the set of negative samples, $\hat{y}$ is the embedding of the current item, and $\hat{y}^+$ is the embedding of the positive item. FELLAS [213] is an example.
        - Reinforcement Learning (RL): Introduces reward-driven optimization to handle non-differentiable metrics. A common RL objective, as in Proximal Policy Optimization (PPO) or similar methods, optimizes a policy against a reward function with a regularization term (often the KL divergence from a reference policy). A simplified form often seen in policy gradient methods with KL regularization: $ \mathcal{L}_{\mathrm{RL}} = - \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)} \left[ r_{\phi}(x,y) - \beta D_{\mathrm{KL}}(\pi_{\theta}(y \mid x) \,\Vert\, \pi_{\mathrm{ref}}(y \mid x)) \right] $ Where $r_{\phi}(x,y)$ is the reward for taking action $y$ (recommending item $y$) given state $x$ (user profile/context), $\beta$ is a scaling factor for the KL divergence, and $D_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence between the current policy $\pi_{\theta}$ and a reference policy $\pi_{\mathrm{ref}}$. LEA [166] and RPP [120] are examples.
        - Direct Preference Optimization (DPO): Optimizes directly on preference pairs without training a separate reward model. The standard DPO loss for a preference pair $(y^+, y^-)$ is: $ \mathcal{L}_{\mathrm{DPO}} = - \log \sigma \left( \beta \left( \log \frac{\pi_{\theta}(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \log \frac{\pi_{\theta}(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \right) \right) $ Here, $\sigma$ is the sigmoid function, $\beta$ is a hyperparameter scaling the preference difference, $\pi_{\theta}$ is the policy model being optimized, and $\pi_{\mathrm{ref}}$ is a fixed reference policy (usually the initial pre-trained LLM). The objective raises the policy-to-reference log-probability ratio of the preferred item $y^+$ relative to that of the rejected item $y^-$. LettinGo [168] and RosePO [89] are examples.
    - Inference:
        - Reranking: Improves output quality by injecting stronger ranking signals. RecRanker [114] uses a two-stage pipeline (retrieval, then LLM-based reranking). LLM4Rerank [35] frames inference as multi-node reasoning.
        - Acceleration: Reduces latency and memory. FELLAS [213] restricts the LLM to producing embeddings. Prompt Distillation (GenRec [155]) compresses histories. AtSpeed [97] applies speculative decoding and tree-based attention.
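The SFT, contrastive (InfoNCE), and DPO objectives discussed in 4.1.3 can be sketched numerically as follows. This is a minimal pure-Python illustration over already-computed logits, similarities, and log-probabilities; the function names and default values of `tau` and `beta` are assumptions made for the example, and the RL objective is left out because it additionally needs a reward model and sampling from the policy.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sft_loss(item_logits, target_index):
    """Negative log-likelihood of the target item: -log pi_theta(y+ | x)."""
    return -log_softmax(item_logits)[target_index]


def info_nce_loss(sim_pos, sim_negs, tau=0.1):
    """InfoNCE with one positive similarity and a list of negative similarities."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    return -log_softmax(logits)[0]


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """-log sigmoid(beta * [(log-ratio of y+) - (log-ratio of y-)])."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the DPO margin is zero and the loss is $\log 2$; it falls below that value once the policy favors the preferred item more strongly than the reference does.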
4.2. Large Recommendation Model (LRM):
LRMs are specialized architectures optimized directly for user behavior data, establishing native scaling laws for recommendation tasks. They aim to address diminishing returns from complex discriminative models and the high costs of cascaded architectures.
The following figure (Figure 7 from the original paper) illustrates the end recommendation:
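The image consists of two schematics contrasting (a) the architecture of a large recommendation model (LRM) and (b) an end-to-end recommender system architecture. Panel (a) depicts the encoder, the decoder, and several input sequences; panel (b) shows end-to-end training with preference alignment alongside the cascaded retrieval-and-ranking pipeline.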
Figure 7: End Recommendation.
The following are the results from Table 6 of the original paper:
| Methods | Task (User Formulation) | Historical Interactions (User Formulation) | Architectures | Backbone |
| --- | --- | --- | --- | --- |
| LEARN [62] | Ranking | history interactions, user preference | Cascaded | Baichuan2-7B, Transformer |
| HLLM [9] | Retrieval, Ranking | history interactions | Cascaded | TinyLlama-1.1B, Baichuan2-7B |
| KuaiFormer [98] | Retrieval | history interactions | Cascaded | Stacked Transformer |
| SRP4CTR [44] | Ranking | history interactions, user preference | Cascaded | FG-BERT |
| HSTU [216] | Ranking | history interactions | Cascaded | Transformer |
| MTGR [45] | Ranking | history interactions | Cascaded | Transformer |
| UniROM [135] | Ranking | history interactions | End-to-End | RecFormer |
| URM [63] | Ranking | history interactions, user preference | End-to-End | BERT |
| OneRec [23] | Generative Retrieval and Ranking | history interactions, user preference | End-to-End | Transformer |
| OneSug [43] | Generative Retrieval and Ranking | history interactions, user preference | End-to-End | Transformer |
| EGA-V2 [238] | Generative Retrieval and Ranking | history interactions, user preference | End-to-End | Transformer |
- The Scaling Law of LRMs: HSTU [216] (Meta) is a groundbreaking LRM that validates the applicability of LLM scaling laws to recommendation. It transforms CTR prediction into a generative sequence modeling task, unifying multiple pointwise user-item interactions into a single sequence. It uses a causal autoregressive modeling approach for ultra-long user sequences, unifying retrieval and ranking into a sequence generation problem. It shows that performance keeps improving as the model scales up to 1.5 trillion parameters, well past the point where traditional discriminative models stagnate. MTGR [46] (Meituan) is a generative ranking framework that incorporates cross features and uses Group LayerNorm and dynamic hybrid masking for sequence encoding. GenRank [60] (Redbook) focuses on iteratively predicting actions associated with items, which suits resource-sensitive ranking.
- End-to-End Recommendations: LRMs enable unifying the recommendation framework, reducing engineering costs and optimizing objectives. OneRec [23, 240] (Kuaishou) is an end-to-end generative recommendation model that replaces the traditional retrieval, coarse ranking, fine ranking cascade architecture. It uses an encoder-decoder structure with MoE (Mixture of Experts) for capacity expansion. It proposes a session-based generation approach and incorporates Direct Preference Optimization (DPO) with a reward model. OneSug [43] extends this idea to query recommendation. EGA-V2 [238] pushes further with hierarchical tokenization and multi-token prediction, integrating various tasks (user interest modeling, POI generation, ad allocation, payment computation) into a single generative framework.
4.3. Diffusion-Based Generative Recommendation: Explores extending diffusion models to recommendation tasks.
The following figure (Figure 8 from the original paper) illustrates Item Generation:
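The image is a schematic of diffusion-based data augmentation and target item generation. It includes the forward and reverse processes of the diffusion model as well as a condition-guided reverse process, depicting the transition from a noisy social network to a refined network and from an input sequence to a target item.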
Figure 8: Item Generation.
- 4.3.1. Augmented Data Generation: Leverages the denoising and generative nature of diffusion models.
    - Generate high-quality interaction data: DGFedRS [24] uses diffusion models for personalized user information and high-quality interactions. MoDiCF [77] and TDM [121] handle missing data. DiffuRec [88] adds Gaussian noise to embeddings for diversity.
    - Generate robust representations: ARD [156] refines social networks. DDRM [230] and DRGO [229] learn robust representations.
    - Preference injected conditional generation: DMCDR [83] uses preference guidance for cross-domain user representations. InDiRec [139] generates forward views with consistent intent.
- 4.3.2. Target Item Generation: Directly generate items or rankings.
    - Diffusion recommender model: DiffRec [174] treats interaction prediction as a denoising process. DreamRec [205] noises the target item space to generate recommendations directly, eliminating negative sampling. DiffRIS [129] uses implicit features as guidance. DiQDiff [119] enhances robustness with semantic vector quantization and contrastive discrepancy maximization.
    - Diversity and uncertainty modeling: DiffDiv [7] designs diversity-aware guided learning to capture diverse user preferences.
    - Tailored Optimization for DM-based recommendation: DDSR [192] uses discrete diffusion for fuzzy sets of interaction sequences. ADRec [11] and PreferDiff [108] propose tailored optimization objectives to address embedding collapse.
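The "forward process" that these diffusion recommenders build on can be illustrated with the standard Gaussian formulation, $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The sketch below assumes a linear noise schedule and a toy implicit-feedback vector; the schedule values and function names are illustrative assumptions, and the cited methods use their own schedules, noise scales, and learned reverse (denoising) models, which are not shown here.

```python
import math
import random


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed values)."""
    out, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_noise(x0, t, schedule, rng):
    """Sample x_t ~ N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I) in closed form."""
    a_bar = schedule[t]
    return [
        math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
        for v in x0
    ]


schedule = alpha_bars(1000)
rng = random.Random(0)
interactions = [1.0, 0.0, 0.0, 1.0, 0.0]  # toy interaction vector for one user
lightly_noised = forward_noise(interactions, 0, schedule, rng)
heavily_noised = forward_noise(interactions, 999, schedule, rng)
```

At small `t` the sample stays close to the original interactions, while at the last step almost no signal remains; a recommender in this family is trained to invert that corruption.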
4.2.5. Task-Level Opportunities
Generative models reformulate recommendation as a generation task, enabling novel capabilities.
- 5.1. Top-K Recommendation:
    - Vocabulary-Constrained Decoding: Restricts the decoding space to valid item identifiers. P5 [38] uses constrained decoding with beam search. IDGenRec [158] uses a prefix tree. TransRec [95] uses an FM-index for position-free generation and multi-facet identifiers.
    - Post-Generation Filtering: The LLM freely generates text, then outputs are mapped/reranked to in-catalog items. BIGRec [3] grounds identifiers via L2 distance.
    - Prompt Augmentation: Injects candidate items into text prompts for selection. Examples include LLaRA [90], A-LLMRec [69], iLoRA [72].
- 5.2. Personalized Content Generation: Capitalizes on generative capabilities to create entirely new items or content.
    - Personalized visual content generation: DiFashion generates personalized outfits. DreamVTON [193] creates 3D virtual try-ons. InstantBooth [148] enables personalized image generation. OOTDiffusion [197] generates virtual try-on images.
    - Personalized textual content generation: [198] explores personalization for review generation ([126, 78, 154, 79]) and news headline generation ([1, 6, 152]).
- 5.3. Conversational Recommendation: Elicits dynamic user preferences through multi-turn natural language interactions.
    - Prompting and Zero-shot Methods: He et al. [49] showed off-the-shelf LLMs can outperform baselines. [153] incorporates iterative user feedback. [21] uses demonstrations as guidance.
    - Retrieval-augmented and Knowledge-enhanced Approaches: Combine LLMs with retrieval modules or knowledge graphs to prevent hallucinations. [136] retrieves relevant items. [203] integrates collaborative filtering signals.
    - Unified and Parameter-efficient Architectures: [142] reformulates CRS as a single NLP task. MemoCRS [190] uses memory modules. Chat-Rec [37] enhances interaction and explainability.
    - Evaluation: [199] proposes assessing alignment with human expectations.
- 5.4. Explainable Recommendation: Provides transparency and increases trustworthiness.
    - P5 [38] uses prompts to generate explanations, though LLMs can hallucinate.
    - LLM2ER [201] fine-tunes with explainable quality reward models.
    - Combines with graphs: XRec [118] uses GNNs to model graph structure for embeddings, then feeds them to LLMs. G-Refer [87] uses hybrid graph retrieval and retrieval-augmented fine-tuning.
    - Leverages thinking models: [211, 231] use thought processes as explanations.
- 5.5. Recommendation Reasoning: Performs multi-step deduction for accurate and explainable recommendations.
    - Explicit reasoning methods: Generate human-readable reasoning processes. Reason4Rec [30] introduces deliberative recommendation for LLMs to explicitly reason. Reason-to-Recommend [231] uses Interaction-of-Thought (IoT) reasoning. ThinkRec [212] uses thinking prompts. OneRec-Think [112] activates LLM reasoning via CoT-based SFT and RL reward functions.
    - Implicit reasoning methods: Perform latent reasoning without textual interpretability. LatentR3 [228] encodes reasoning into latent tokens. ReaRec [159] enables multi-step latent reasoning for sequential recommenders. STREAM-Rec [225] introduces slow thinking for iterative residual-based reasoning.
    - LLM reasoning augmentation methods: LLMs generate reasoning steps to enhance training of traditional RSs. DeepRec [236] proposes an autonomous interaction paradigm. LLMRG [176] constructs reasoning graphs through LLM-driven chain reasoning, divergent extension, and self-verification.
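Vocabulary-constrained decoding with a prefix tree, as described under 5.1, can be sketched as below. This is a simplified illustration that assumes greedy decoding, a toy three-item catalog, and a stand-in `score_fn` in place of real model logits; the cited systems pair the same idea with beam search and a trained LLM, and none of the names here come from their code.

```python
def build_trie(item_token_sequences):
    """Build a nested-dict prefix tree over valid item identifiers."""
    root = {}
    for tokens in item_token_sequences:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}  # marks the end of a complete identifier
    return root


def allowed_next_tokens(trie, prefix):
    """Return the tokens that keep `prefix` on a path inside the catalog."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node)


def greedy_constrained_decode(trie, score_fn, max_len=8):
    """Greedy decoding restricted to trie paths.

    `score_fn(prefix, token)` is a hypothetical stand-in for model logits.
    """
    prefix = []
    for _ in range(max_len):
        candidates = allowed_next_tokens(trie, prefix)
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda tok: score_fn(prefix, tok))
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix


catalog = [["red", "shoe"], ["red", "hat"], ["blue", "hat"]]
trie = build_trie(catalog)
# Toy scorer: a fixed preference per token, independent of the prefix.
toy_scores = {"hat": 3.0, "red": 2.0, "blue": 1.0, "shoe": 0.5, "<eos>": 0.0}
decoded = greedy_constrained_decode(trie, lambda prefix, tok: toy_scores[tok])
```

Because every step only considers children of the current trie node, the decoded sequence is always a prefix of some catalog item, which is what rules out hallucinated, out-of-catalog identifiers.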
5. Experimental Setup
This paper is a comprehensive survey of existing research on generative recommendation. As such, it does not present its own experimental setup, datasets, evaluation metrics, or baselines in the traditional sense of empirical research. Instead, it reviews and synthesizes the experimental methodologies and findings of the works it covers.
5.1. Datasets
The paper discusses the historical evolution of datasets that have driven advancements in recommender systems, and critically evaluates their suitability for the new paradigm of generative recommendation.
- Historically Important Datasets:
    - MovieLens: A foundational dataset in the early stages, providing large-scale rating data that enabled the development and evaluation of collaborative filtering and matrix factorization methods.
        - Characteristics: Contains movie ratings (1-5 stars) from users.
        - Example Data Sample: A tuple like `(user_id, movie_id, rating, timestamp)`.
    - Netflix Prize dataset: A milestone dataset that further pushed latent factor models and matrix factorization techniques due to its large scale and challenging nature.
        - Characteristics: Contains anonymous ratings from Netflix users for movies.
        - Example Data Sample: Similar to MovieLens, but with a larger scale.
    - Amazon Review dataset: Shifted attention from explicit ratings to implicit feedback (e.g., clicks, purchases) and auxiliary content like text reviews and product metadata. This fostered research on hybrid recommendation.
        - Characteristics: Contains user reviews, ratings, and product metadata across various categories.
        - Example Data Sample: A review like `(reviewerID, asin, reviewerName, helpful, reviewText, overall, summary, unixReviewTime, reviewTime)`. A `reviewText` example might be: "This product is amazing! It arrived quickly and works perfectly. Highly recommend."
    - Yelp dataset: With its rich user reviews and business information, it advanced the use of deep learning-based recommendation, leveraging natural language processing for sentiment analysis, representation learning, and context-aware suggestions.
        - Characteristics: Contains business data, user reviews, and tips.
        - Example Data Sample: A review text example: "Great food and service! The pasta was delicious and the ambiance was cozy."
- Suitability for Generative Recommendation: The paper argues that while these datasets were crucial for traditional RSs, they are no longer fully suitable for generative recommendation.
    - Limitations: Most of these datasets are non-interactive, offline, and static. They capture point-in-time user preferences rather than the dynamic feedback loops and multi-round interactions that characterize real-world generative recommendation scenarios.
    - Assessment Gap: These datasets are more suitable for assessing the accuracy performance of traditional RSs, but they restrict the assessment of generative models as personalized assistants that operate across multiple scenarios and handle diverse tasks in interactive settings.
    - Need for New Benchmarks: This highlights an urgent need for new benchmarks that can better support the next stage of generative recommendation research, focusing on interactive, dynamic, and multi-task capabilities.
5.2. Evaluation Metrics
As a survey, the paper does not define or use specific evaluation metrics for its own experiments. However, it discusses the tasks and objectives of generative recommendation, which inherently rely on various metrics for evaluation in the individual works it reviews. Based on the tasks mentioned (e.g., rating prediction, CTR prediction, Top-K recommendation, explainable recommendation, conversational recommendation, content generation), the following types of metrics are commonly used in the field:
5.2.1. Accuracy Metrics (for traditional tasks like rating/CTR prediction)
- Root Mean Squared Error (RMSE):
    - Conceptual Definition: RMSE measures the differences between the values predicted by a model and the values observed. It is the square root of the average of the squared differences between predicted and actual values, so it gives relatively high weight to large errors. It is suitable for rating prediction tasks where the output is a continuous value.
    - Mathematical Formula: $ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(u,i) \in D} (r_{ui} - \hat{r}_{ui})^2} $
    - Symbol Explanation:
        - $N$: The total number of observed ratings in the dataset $D$.
        - $(u,i) \in D$: An observed interaction (rating) for user $u$ and item $i$ in the dataset.
        - $r_{ui}$: The true (observed) rating given by user $u$ to item $i$.
        - $\hat{r}_{ui}$: The predicted rating for user $u$ and item $i$ by the recommendation model.
- Area Under the Receiver Operating Characteristic Curve (AUC):
    - Conceptual Definition: AUC measures the ability of a classifier to distinguish between classes. It represents the probability that a randomly chosen positive instance will be ranked higher than a randomly chosen negative instance by the classifier. In recommendation, particularly for click-through rate (CTR) prediction, it indicates how well the model can differentiate between items a user will interact with (positive) and those they won't (negative). A higher AUC suggests better discriminative power.
    - Mathematical Formula: AUC is typically calculated by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings and computing the area under the resulting ROC curve. Unlike RMSE, it has no simple closed-form formula and is usually computed numerically. $ \text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}^{-1}(x)) dx $ Alternatively, it can be defined as the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example: $ \text{AUC} = P(\text{score}(\text{positive}) > \text{score}(\text{negative})) $
    - Symbol Explanation:
        - $\text{TPR}$: True Positive Rate, also known as Recall or Sensitivity, calculated as $\frac{\text{TP}}{\text{TP} + \text{FN}}$.
        - $\text{FPR}$: False Positive Rate, calculated as $\frac{\text{FP}}{\text{FP} + \text{TN}}$.
        - $\text{score}(\cdot)$: The predicted relevance score of an item by the model.
        - $P(\cdot)$: Probability.
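Both accuracy metrics are short enough to compute by hand. The sketch below follows the RMSE formula directly and uses the pairwise-probability definition of AUC (counting ties as half a win, a common convention that is an assumption here); the function names are illustrative, and the quadratic pairwise loop is meant for clarity rather than large-scale evaluation.

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root mean squared error over paired true and predicted ratings."""
    n = len(true_ratings)
    squared = sum((r - p) ** 2 for r, p in zip(true_ratings, predicted_ratings))
    return math.sqrt(squared / n)


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])` compares four positive-negative pairs, of which three are ordered correctly, giving 0.75.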
5.2.2. Ranking Metrics (for Top-K recommendation)
-
Recall@K:
- Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top K recommendations. It indicates how many of the items a user would like were actually shown to them among the top K.
- Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\mathrm{Relevant_Items} \cap \mathrm{Recommended_Items@K}|}{|\mathrm{Relevant_Items}|} $
- Symbol Explanation:
- : Denotes the cardinality (number of elements) of a set.
- : The set of items that are truly relevant to a user (e.g., items the user interacted with in the test set).
- : The set of the top K items recommended by the system for the user.
- : Set intersection.
- Normalized Discounted Cumulative Gain (NDCG@K):
  - Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items in the ranked list. It assigns higher scores when relevant items appear near the top of the list and discounts relevant items that appear lower. It also incorporates the gain (relevance score) of each item, and it is normalized so that scores are comparable across different users and queries.
  - Mathematical Formula:
    - First, Discounted Cumulative Gain (DCG@K): $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{rel_j} - 1}{\log_2(j+1)} $
    - Then, Ideal Discounted Cumulative Gain (IDCG@K) is the DCG for the ideally ranked list (all relevant items at the very top, ordered by decreasing relevance): $ \mathrm{IDCG@K} = \sum_{j=1}^{K} \frac{2^{rel_j^{opt}} - 1}{\log_2(j+1)} $
    - Finally, Normalized Discounted Cumulative Gain (NDCG@K): $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
  - Symbol Explanation:
    - $K$: The number of top recommendations considered.
    - $j$: The position of an item in the ranked list.
    - $rel_j$: The relevance score of the item at position $j$ in the generated ranked list. This can be binary (0 or 1 for irrelevant/relevant) or graded (e.g., 1-5 for ratings).
    - $rel_j^{opt}$: The relevance score of the item at position $j$ in the ideal (perfect) ranked list.
    - $\log_2(j+1)$: The logarithmic discount factor, which reduces the importance of items at lower ranks.
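The two ranking metrics defined above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the toy item IDs, and the convention that items absent from the relevance dictionary have relevance 0 are all assumptions made for the example.

```python
import math
from typing import Dict, List, Set


def recall_at_k(recommended: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)


def dcg_at_k(relevances: List[float], k: int) -> float:
    """DCG@K with the exponential gain (2^rel - 1) and a log2(j + 1) discount."""
    return sum(
        (2 ** rel - 1) / math.log2(j + 1)
        for j, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(recommended: List[str], relevance: Dict[str, float], k: int) -> float:
    """NDCG@K = DCG@K / IDCG@K; items missing from `relevance` count as 0."""
    gains = [relevance.get(item, 0.0) for item in recommended]
    ideal = sorted(relevance.values(), reverse=True)  # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Toy example (hypothetical items): "a", "c", "e" are relevant; "e" is never recommended.
recommended = ["a", "b", "c", "d"]
relevance = {"a": 1.0, "c": 1.0, "e": 1.0}
print(recall_at_k(recommended, set(relevance), k=3))
print(ndcg_at_k(recommended, relevance, k=3))
```

With binary relevance the gain term reduces to 1 for relevant items, so NDCG@K differs from Recall@K only through the position discount and the normalization by the ideal ordering.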
5.2.3. Qualitative/Human Evaluation (for Explainable, Conversational, Content Generation tasks)
For tasks like explainable recommendation, conversational recommendation, and personalized content generation, traditional quantitative metrics might be insufficient. Evaluation of these tasks often relies on:
- Human Evaluation: Expert assessments or user studies that rate aspects such as coherence, fluency, informativeness, trustworthiness, interactivity, appropriateness, and creativity of the generated explanations, dialogues, or content.
- Task-Specific Metrics: For conversational systems, metrics like turn-based success rate or dialogue completion rate might be used. For content generation, metrics from natural language generation or image generation (e.g., FID for image quality, BLEU/ROUGE for text similarity, though these are less suited to open-ended generation) can be adapted, often combined with human judgment.
5.3. Baselines
As a survey paper, it does not propose a new model to be compared against baselines. Instead, it discusses the evolution of recommender systems, implicitly treating older paradigms (e.g., collaborative filtering, matrix factorization, deep learning-based discriminative models) as conceptual baselines against which the new generative recommendation paradigm is evaluated in terms of capabilities, advantages, and limitations. The paper highlights how generative models aim to overcome the shortcomings of these traditional approaches.
6. Results & Analysis
Since this paper is a survey, it does not present new experimental results from its own models or comparisons. Instead, its "results" are the synthesis and analysis of findings from the numerous research papers it reviews. The analysis focuses on the identified advantages, paradigm shifts, and open challenges presented by the generative recommendation approach.
6.1. Core Results Analysis
The paper's core analysis validates the significant shift brought by generative models to recommender systems by highlighting several key advantages and outlining a fundamental paradigm change:
- Five Key Advantages of Generative Models in RSs:
  - World Knowledge Integration: Generative models, especially LLMs, are pre-trained on vast and diverse datasets, inherently encoding extensive world knowledge. This allows them to incorporate rich semantic information about entities, events, and cultural contexts directly into recommendations without requiring explicit knowledge extraction pipelines or separate knowledge base construction. This leads to more contextually aware and informed recommendations.
  - Natural Language Understanding (NLU): LLMs possess advanced NLU capabilities, enabling them to interpret user expressions in free-form text, understanding nuances, context, and intent. This bridges the gap between how users naturally communicate their preferences (e.g., search queries, reviews, conversations) and how RSs operate, supporting conversational interfaces and complex queries.
  - Reasoning Capabilities: Generative models exhibit emergent reasoning capabilities, allowing them to model logical processes behind user decisions. Unlike traditional models that rely on pattern matching, generative RSs can understand why a user might prefer an item, considering feature relationships, temporal sequences, and contextual factors. This enables explainable recommendations and justifies suggestions.
  - Scaling Laws: The paper emphasizes that the scaling law observed in LLMs (performance improving predictably with increased model size, training data, and compute) also applies to Large Recommendation Models (LRMs). This provides a systematic pathway for building more capable RSs, as increasing scale can lead to emergent capabilities like better intent understanding and nuanced preference modeling, reducing the need for manual feature engineering.
  - Creative Generation for Novel Recommendations: Unlike discriminative models that only rank existing items, generative models can create novel content and recommendations. This is particularly valuable in cold-start scenarios and helps to diversify recommendations beyond the filter bubble. They can suggest customized bundles, personalized content variations, or even generate new item descriptions tailored to individual preferences.
- Paradigm Shift from Discriminative to Generative: The paper illustrates how generative RSs are breaking away from the traditional discriminative paradigm, which was characterized as:
  - Mapping-based: Learning a mapping from user-item pairs to a score.
  - Feature-driven: Heavily relying on hand-crafted or learned features.
  - Small-model: Typically smaller, specialized models.
  - Task-independent: Each task (e.g., CTR prediction) is handled by its own single-purpose model.
  - Reliant on predefined candidate sets: Limited to recommending from an existing catalog.
- In contrast, generative recommenders are reshaping the paradigm towards recommendation assistants that are:
  - Open: Leveraging open-world knowledge.
  - Capable of handling a wide range of tasks: Task-agnostic by design, able to rank items, generate explanations, and produce personalized content within a single framework.
  - Adaptive to changing user needs: Through interactive and conversational capabilities.
This shift represents a fundamental change in how recommendations are made, moving towards a more dynamic, responsive, and intelligent interaction between humans and information systems.
The following figure (Figure 9 from the original paper) illustrates the comparison between traditional discriminative recommendation and a generative recommendation assistant:
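The image is a schematic from the paper comparing traditional discriminative recommendation with a generative recommendation assistant, showing the data flow, interaction mode, characteristics, and challenges of each.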
Figure 9: Illustration of traditional discriminative recommendation and generative recommendation assistant.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Category | Representative Works | Description / Focus |
| --- | --- | --- |
| Content Augmentation | ONCE (WSDM'24), LLM-Rec (NAACL'24), LRD (SIGIR'24), MSIT (ACL'25), EXP3RT (SIGIR'25), LettinGo (KDD'25), SINGLE (WWW'24), KAR (RecSys'24), IRLLRec (SIGIR'25), LLM4SBR (TOIS'25), SeRALM (SIGIR'24), TRAWL (ArXiv'24) | Generate natural-language user/item profiles, summarize histories, enrich sparse metadata, and align textual semantics with feedback. |
| Representation Augmentation | DynLLM (ArXiv'24), GE4Rec (ICML'24), HyperLLM (SIGIR'25) | Automated feature construction, multimodal attribute extraction, external knowledge distillation, and hierarchical category generation. |
| Behavior Augmentation | ColdLLM (WSDM'25), Wang et al. (WWW'25), LLM-FairRec (SIGIR'25), LLM4IDRec (TOIS'25) | Generate synthetic user-item interactions, simulate cold-start preferences, ensure fairness, and integrate pseudo-interactions into ID-based pipelines. |
| Structure Augmentation | SBR (SIGIR'25), LLMRec (WSDM'24), Chang et al. (AAAI'25), CORONA (SIGIR'25), LLM-KERec (CIKM'24), TCR-QF (IJCAI'25), COSMO (SIGMOD'24) | Relation discovery, graph completion, social network generation, subgraph retrieval, knowledge graph construction & distillation. |
The following are the results from Table 2 of the original paper:
| Methods | Task description | Historical interactions | Profile | Feedback | Backbone |
| --- | --- | --- | --- | --- | --- |
| Chat-Rec [36] | ranking | history interactions | ✓ | | GPT-3.5 |
| TALLRec [4] | preference classification | user preference | | | LLaMA-7B |
| LlamaRec [214] | retrieval, ranking | history interactions | | | LLaMA2-7B |
| LRD [202] | ranking | history interactions | | | GPT-3.5 |
| ReLLa [93] | ranking | history interactions | | | Vicuna-7B |
| CALRec [86] | ranking | history interactions | | | PaLM-2 XXS |
| BiLLP [149] | long-term Interactive | history interactions, reward model | | | GPT-3.5, GPT-4, LLaMA2-7B |
| PO4ISR [157] | Ranking | history interactions | | | LLaMA2-7B |
| LLM-TRSR [237] | Ranking | history interactions | | | LLaMA2-7B |
| RecGPT [124] | Ranking | history interactions, user preference | | | RecGPT-7B |
| KAR [189] | Ranking | history interactions, user preference | | | GPT-3.5 |
| LLM4CDSR [105] | Ranking | history interactions | | | GPT-3.5, GLM4-Flash |
| EXP3RT [68] | rating prediction | history interactions | | | LLaMA3-8B |
| SERAL [191] | retrieval, ranking | history interactions | | | Qwen2-0.5B |
| LettinGo [168] | Ranking | history interactions | | | LLaMA3-8B |
| Reason4Rec [28] | Rating Prediction | history interactions, user preference | | | LLaMA3-8B |
| InstructRec [224] | Ranking | history interactions | | | Flan-T5-XL |
| Uni-CTR [31] | Rating Prediction | history interactions, user preference | | | DeBERTaV3-large |
| BIGRec [3] | Ranking | history interactions | | | LLaMA-7B |
| UPSR [140] | Ranking | history interactions | | | T5, FLAN-T5 |
The following are the results from Table 3 of the original paper:
| Methods | Task Description | Historical Interactions | Profile | Feedback | Combining Method | Backbone |
| --- | --- | --- | --- | --- | --- | --- |
| iLoRA [72] | Ranking | history interactions | | | Concatenation | GPT-3.5 |
| LLM-ESR [104] | Ranking | history interactions | | | Concatenation | LLaMA2-7B |
| LLaRA [90] | Ranking | history interactions | | | Concatenation | LLaMA2-7B |
| A-LLMRec [69] | Ranking | history interactions | | | Concatenation | OPT-6.7B |
| RLMRec [144] | Ranking | history interactions, user preference | | | Concatenation | GPT-3.5 |
| CoRAL [186] | Ranking | history interactions, user preference | | | Retrieval-Augmented | GPT-4 |
| BinLLM [226] | Ranking | history interactions, user preference | | | Concatenation | Vicuna-7B |
| E4SRec [82] | Ranking | history interactions | | | Concatenation | Vicuna-7B |
| SeRALM [145] | Ranking | history interactions | | | Concatenation | LLaMA2-7b |
| CORONA [12] | Ranking | history interactions | | | Pipeline Integration | GPT-4o-mini |
| HyperLLM [16] | Ranking | history interactions | | | Pipeline Integration | LLaMA3-8B |
| RecLM [65] | Ranking | history interactions | | | Concatenation | LLaMA2-7b |
| CoLLM [227] | Ranking | history interactions | | | Concatenation | Vicuna-7B |
| PAD [177] | Ranking | history interactions, user preference | | | Concatenation | LLaMA3-8B |
| IDP [188] | Ranking | history interactions, user preference | | | Concatenation | T5 |
The following are the results from Table 4 of the original paper:
| Methods | Task description | Historical interactions | Token types | Backbone |
| --- | --- | --- | --- | --- |
| P5 [38] | Ranking | historical interactions, user preference | ID-based tokenization | Transformer |
| CLLM4Rec [244] | Ranking | historical interactions | ID-based tokenization | GPT-2 |
| BIGRec [3] | Ranking | historical interactions | Text-based tokenization | LLaMA-7B |
| M6 [20] | Retrieval, Ranking | historical interactions | Text-based tokenization | M6 |
| IDGenRec [158] | Ranking | historical interactions | Text-based tokenization | BERT4Rec |
| TIGER [141] | Ranking | historical interactions | Codebook-based tokenization | T5 |
| RPG [52] | Ranking | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| LC-Rec [235] | Ranking | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| ActionPiece [53] | Retrieval | historical interactions | Codebook-based tokenization | LLaMA-2-7B |
| LETTER [172] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-7B |
| TokenRec [138] | Retrieval | historical interactions | Codebooks with collaborative signals | T5-small |
| SETRec [94] | Ranking | historical interactions | Codebooks with collaborative signals | T5, Qwen |
| CCFRec [100] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-2-7B |
| LLM2Rec [48] | Ranking | historical interactions | Codebooks with collaborative signals | LLaMA-2-7B |
| SIIT [15] | Retrieval | historical interactions | Self-adaptive tokenization | LLaMA-2-7B |
The following are the results from Table 5 of the original paper:
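| Category | Representative Works | Formula |
| --- | --- | --- |
| Supervised Fine-Tuning | P5 (RecSys'22), LGIR (AAAI'24), LLM-Rec (TOIS'25), RecRanker (TOIS'25) | $ -\log \pi_\theta(y^+ \mid x) $ |
| Self-Supervised Learning | FELLAS (TOIS'24), HFAR (TOIS'25) | $ -\log \frac{\exp(\mathrm{sim}(y, y^+)/\tau)}{\sum_{i=1}^{N} \exp(\mathrm{sim}(y, y_i)/\tau)} $ |
| Reinforcement Learning | LEA (SIGIR'24), RPP (TOIS'25) | $ \mathbb{E}\left[ r(x, y) - \beta\, \mathrm{KL}\left( \pi_\theta(y \mid x) \,\Vert\, \pi_{\mathrm{ref}}(y \mid x) \right) \right] $ |
| Direct Preference Optimization | LettinGo (KDD'25), RosePO (ArXiv'24), SPRec (WWW'25) | $ -\log \sigma\left( \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \right) $ |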
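To make the four objective families in Table 5 more concrete, the sketch below evaluates each of them on scalar inputs in plain Python. It assumes the commonly used textbook forms of these objectives (negative log-likelihood, an InfoNCE-style contrastive loss, a KL-regularized reward, and the DPO loss); the function names, argument names, and default values for `tau` and `beta` are illustrative assumptions, not notation taken from the survey or from any of the cited works.

```python
import math
from typing import List


def sft_loss(logp_pos: float) -> float:
    """Supervised fine-tuning: negative log-likelihood of the target y+."""
    return -logp_pos


def info_nce_loss(sim_pos: float, sim_negs: List[float], tau: float = 0.1) -> float:
    """InfoNCE-style contrastive loss over one positive and several negatives."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
    return -(sim_pos / tau - log_denominator)


def kl_regularized_reward(reward: float, logp: float, ref_logp: float,
                          beta: float = 0.1) -> float:
    """Single-sample estimate of the RL objective: r(x, y) - beta * log(pi / pi_ref)."""
    return reward - beta * (logp - ref_logp)


def dpo_loss(logp_pos: float, logp_neg: float,
             ref_logp_pos: float, ref_logp_neg: float,
             beta: float = 0.1) -> float:
    """DPO: -log sigmoid of the scaled log-ratio margin between y+ and y-."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(m)) = softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference model, the DPO margin is 0 and the loss is log 2.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
```

In practice these quantities would be computed over batches of token-level log-probabilities produced by the backbone model; the scalar version only shows how the terms combine.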
The following are the results from Table 6 of the original paper:
| Methods | Task | Historical Interactions | Architectures | Backbone |
| --- | --- | --- | --- | --- |
| LEARN [62] | Ranking | history interactions, user preference | Cascaded | Baichuan2-7B, Transformer |
| HLLM [9] | Retrieval, Ranking | history interactions | Cascaded | TinyLlama-1.1B, Baichuan2-7B |
| KuaiFormer [98] | Retrieval | history interactions | Cascaded | Stacked Transformer |
| SRP4CTR [44] | Ranking | history interactions, user preference | Cascaded | FG-BERT |
| HSTU [216] | Ranking | history interactions | Cascaded | Transformer |
| MTGR [45] | Ranking | history interactions | Cascaded | Transformer |
| UniROM [135] | Ranking | history interactions | End-to-End | RecFormer |
| URM [63] | Ranking | history interactions, user preference | End-to-End | BERT |
| OneRec [23] | Generative Retrieval and Ranking | history interactions, user preference | End-to-End | Transformer |
| OneSug [43] | Generative Retrieval and Ranking | history interactions, user preference | End-to-End | Transformer |
| EGA-V2 [238] | Generative Retrieval and Ranking | history interactions, user preference | End-to-End | Transformer |
6.3. Ablation Studies / Parameter Analysis
As a survey paper, this work does not conduct its own ablation studies or parameter analyses. Instead, it synthesizes the findings and observations regarding component effectiveness and hyperparameter sensitivities from the individual research papers it reviews, integrating them into the broader discussion of model-level opportunities and open challenges. For instance, in Section 4.1.2, it discusses how different alignment mechanisms (text prompting, collaborative signals, item tokenization) contribute to LLM performance in recommendation. In Section 4.1.3, it covers various training objectives (SFT, SSL, RL, DPO) and their impact on model behavior. The scaling law discussion in Section 4.2 inherently touches upon how model size and data volume (parameters) affect performance. The challenges section (6.2) explicitly addresses bias (e.g., popularity bias, positional biases) and robustness concerns, which are often investigated through sensitivity analyses in primary research.
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey provides a comprehensive examination of how generative models, particularly Large Language Models (LLMs) and diffusion models, are fundamentally revolutionizing recommender systems (RSs). It argues that this marks a significant paradigm shift from discriminative matching to intelligent synthesis.
The core contribution is a unified tripartite framework that analyzes this transformation across three dimensions:
- Data Level: Generative models are enabling unprecedented data augmentation (e.g., knowledge-infused, agent-based simulation) and data unification (e.g., multi-domain, multi-task, multi-modal), addressing long-standing issues like data sparsity and cold-start.
- Model Level: The survey categorizes and details the rise of LLM-based methods, Large Recommendation Models (LRMs), and diffusion approaches as core recommendation engines. These models leverage scaling laws and novel alignment mechanisms to achieve powerful capabilities.
- Task Level: Generative models unlock new functionalities beyond traditional top-K recommendations, including conversational interaction, explainable reasoning, and personalized content generation, fundamentally redefining user-system interaction.
The paper highlights five key advantages of this new paradigm: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. It concludes that generative RSs are moving towards intelligent recommendation assistants that are open, task-agnostic, and adaptive, fundamentally reshaping how humans interact with information.
7.2. Limitations & Future Work
The survey critically examines several open challenges that need to be addressed for the full potential of generative recommendation to be realized:
- Data Challenges:
  - Benchmark Design: Current datasets are largely non-interactive, offline, and static, making them unsuitable for evaluating generative models as personalized assistants that operate in dynamic, multi-round, interactive settings across diverse tasks. There is an urgent need for new benchmarks that capture real-world complexity and support the assessment of generative capabilities.
- Model Challenges:
  - Bias: Generative RSs face significant bias issues:
    - Popularity bias: LLMs, trained on vast corpora, tend to rank popular items higher, reducing diversity and potentially marginalizing less popular content. Existing methods to mitigate this often involve adjusting or generating training data, but more research is needed on using LLMs to generate fair user interactions.
    - Fairness: Models can implicitly utilize sensitive attributes (e.g., gender, race) from data, leading to biased recommendations. Ensuring unbiased recommendations requires careful consideration of the overlap between user preference modeling and sensitive information.
    - Positional biases: LLMs are sensitive to prompt structure, item order, and content, which can introduce positional biases and increase output uncertainty.
  - Robustness:
    - Natural noise: Recommendation tasks are plagued by noise (e.g., clickbait, unintended interactions). While LLMs' knowledge and reasoning might help, a significant gap exists between LLM pre-training objectives and denoising recommendation. LLMs can hallucinate and misclassify noise.
    - Malicious attack: Injection attacks are a concern for traditional RSs, but for generative RSs, textual simulation attacks (rephrasing item descriptions to be adversarial) are particularly low-cost and transferable across models. The robustness of multimodal LLM-based recommendations against attacks is an emerging research area, and existing defense strategies for data poisoning attacks are limited.
- Deployment Challenges:
  - Training Efficiency: Parameter-efficient fine-tuning (PEFT) methods exist, but are insufficient for the rapidly increasing scale of recommendation datasets. The challenge is data-efficient fine-tuning: rapidly adapting LLMs with less data and fewer computational resources, especially considering correlations in interaction records.
  - Inference Efficiency: Autoregressive decoding in generative models leads to inefficient inference due to multiple serial calls, hindering real-time recommendation. Traditional LM acceleration methods (e.g., speculative decoding) are hard to apply directly to top-K distinct sequence generation (which requires beam search). Knowledge distillation can help but is not a complete solution.
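The inference-efficiency point above can be illustrated with a toy beam search that produces K distinct item-token sequences. This is a minimal sketch under stated assumptions: `toy_model` is a hypothetical stand-in for an autoregressive recommender that returns next-token log-probabilities, and the three-token vocabulary and sequence length are made up for the example. It is not the decoding procedure of any specific system discussed in the survey.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[int, ...]


def beam_search_top_k(
    next_token_logprobs: Callable[[Sequence], Dict[int, float]],
    seq_len: int,
    k: int,
) -> List[Tuple[Sequence, float]]:
    """Beam search with width k; returns up to k distinct sequences with scores."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(seq_len):
        candidates: List[Tuple[Sequence, float]] = []
        for prefix, score in beams:
            # One model call per live prefix at every step: each step depends on
            # the previous one, which is the serial cost of autoregressive decoding.
            for token, logp in next_token_logprobs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Distinct prefixes extended by distinct tokens yield distinct candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams


def toy_model(prefix: Sequence) -> Dict[int, float]:
    """Hypothetical model: a fixed distribution over a 3-token vocabulary."""
    probs = {0: 0.5, 1: 0.3, 2: 0.2}
    return {token: math.log(p) for token, p in probs.items()}


top = beam_search_top_k(toy_model, seq_len=2, k=3)
for sequence, score in top:
    print(sequence, round(math.exp(score), 4))
```

Producing K sequences of length L this way takes L sequential decoding rounds with up to K model calls in each, which is why the serial structure, rather than the cost of any single call, dominates latency in real-time settings.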
7.3. Personal Insights & Critique
This survey provides a timely and well-structured overview of a rapidly evolving field. Its tripartite data-model-task framework is particularly insightful, offering a clear lens through which to understand the multifaceted impact of generative AI on recommender systems. The emphasis on the transition from discriminative scoring to generative synthesis is a powerful conceptual shift that captures the essence of this new paradigm.
Inspirations and Applications:
- Beyond Ranking: The most significant inspiration is the move beyond mere item ranking to genuine content generation and conversational interaction. This opens doors for RSs to become truly intelligent assistants. Imagine a fashion recommender that doesn't just suggest clothes, but generates unique outfit compositions based on your preferences and current trends, or a travel assistant that generates a full itinerary including personalized activities and even hypothetical experiences.
- Cold-Start & Long-Tail Solutions: The potential for data augmentation and agent-based simulation to address cold-start and long-tail problems is enormous. This could democratize recommendations for niche content or new users, currently underserved by traditional data-hungry methods.
- Explainability & Trust: The ability to reason and generate explanations fundamentally enhances user trust and transparency. Instead of a black-box suggestion, users can understand the logic, which is crucial for high-stakes recommendations (e.g., financial products, healthcare).
- Cross-Domain Transfer: The data unification capabilities of LLMs suggest that a single, large generative recommender could potentially serve multiple domains (e.g., movies, music, books) without significant retraining, leading to more efficient and scalable systems.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Evaluation Gap: While the survey articulates the need for new benchmarks, the practical challenges of evaluating open-ended generative outputs (e.g., personalized text, images) are immense. Traditional metrics like Recall@K or RMSE don't capture creativity, coherence, or conversational flow. Developing robust, scalable, and fair qualitative and quantitative evaluation frameworks remains a grand challenge. The risk of hallucination in generated content is also a major concern that needs rigorous evaluation.
- Control and Safety: Generative models, especially LLMs, are known for issues like factual inaccuracies, toxicity, and bias amplification. In recommendation, this translates to generating misleading information, inappropriate content, or reinforcing harmful stereotypes. While the paper mentions bias and robustness as challenges, the concrete mechanisms for controlling output generation to ensure safety, fairness, and reliability in real-world deployment (beyond just avoiding popular items) need significant development. This is especially critical as RSs move into sensitive domains.
- "One Model for All" Feasibility: The vision of "one model for all" is appealing for engineering efficiency, but it faces practical hurdles. Different domains and tasks might have conflicting objectives or require highly specialized knowledge that a single, large model might struggle to encapsulate efficiently. The cost-benefit analysis of training and maintaining such a colossal model versus an ensemble of specialized models needs continuous re-evaluation in industrial settings.
- Interpretability of Latent Reasoning: While explicit reasoning is promising, the paper also mentions implicit reasoning methods. The challenge here is ensuring that this implicit reasoning truly aligns with human logic and is not just a statistical correlation. If the reasoning is latent, how can we trust or audit it for fairness and correctness?
- Sustainability and Ethics: The immense computational resources required for training and deploying large generative models raise environmental sustainability concerns. Furthermore, the ethical implications of highly personalized content generation (e.g., filter bubbles, manipulation, privacy concerns with deep user profiles) need to be proactively addressed as this technology matures.
Overall, this survey is a valuable resource for navigating the complex and exciting landscape of generative recommendation. It effectively maps the progress and potential while soberly identifying the significant hurdles that remain.