Large Language Models for Generative Recommendation: A Survey and Visionary Discussions
TL;DR Summary
This survey explores generative recommendation using LLMs, advocating a single-stage approach that directly generates recommendations from the full item set, addressing traditional multi-stage limitations and reshaping recommender system paradigms.
Abstract
Large language models (LLM) not only have revolutionized the field of natural language processing (NLP) but also have the potential to reshape many other fields, e.g., recommender systems (RS). However, most of the related work treats an LLM as a component of the conventional recommendation pipeline (e.g., as a feature extractor), which may not be able to fully leverage the generative power of LLM. Instead of separating the recommendation process into multiple stages, such as score computation and re-ranking, this process can be simplified to one stage with LLM: directly generating recommendations from the complete pool of items. This survey reviews the progress, methods, and future directions of LLM-based generative recommendation by examining three questions:
1) What generative recommendation is, 2) Why RS should advance to generative recommendation, and 3) How to implement LLM-based generative recommendation for various RS tasks. We hope that this survey can provide the context and guidance needed to explore this interesting and emerging topic.
In-depth Reading
1. Bibliographic Information
- Title: Large Language Models for Generative Recommendation: A Survey and Visionary Discussions
- Authors:
- Lei Li (Hong Kong Baptist University)
- Yongfeng Zhang (Rutgers University)
- Dugang Liu (Guangdong Laboratory of Artificial Intelligence and Digital Economy)
- Li Chen (Hong Kong Baptist University)
- The authors are established researchers in the fields of recommender systems and information retrieval, lending significant credibility to this survey.
- Journal/Conference: This paper is an arXiv preprint. arXiv is a well-regarded open-access repository for pre-publication manuscripts in quantitative fields. While not peer-reviewed in the formal sense of a conference or journal, influential surveys and position papers often appear on arXiv first to rapidly disseminate ideas.
- Publication Year: 2023
- Abstract: The paper surveys the emerging field of generative recommendation using Large Language Models (LLMs). The authors argue that instead of using LLMs as mere components in traditional multi-stage recommender systems (RS), they can be leveraged to create a single-stage generative pipeline. This new paradigm directly generates item recommendations from the entire item pool. The survey is structured around three core questions: what generative recommendation is, why it is the future for RS, and how to implement it for various tasks. The goal is to provide context and guidance for researchers exploring this new direction.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2309.01157
- PDF Link: https://arxiv.org/pdf/2309.01157v2.pdf
- Publication Status: This is a preprint and has not yet been published in a peer-reviewed journal or conference at the time of this analysis.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Traditional industrial recommender systems rely on a multi-stage pipeline (e.g., retrieval, ranking, re-ranking) to handle a massive number of items. This process is computationally expensive, complex, and creates a gap between sophisticated academic models and practical industrial applications, as advanced models are only applied to a small, pre-filtered set of candidates.
- Why Now: The revolutionary capabilities of Large Language Models (LLMs) in natural language understanding and generation offer a new path forward. However, most existing work simply integrates LLMs as a component (e.g., a feature extractor) into the old multi-stage framework, failing to fully exploit their generative power.
- Fresh Angle: The paper proposes a paradigm shift from discriminative recommendation (scoring and ranking candidates) to generative recommendation. The core innovation is to treat the entire recommendation process as a single-stage generation task, where the LLM directly outputs the recommended items, effectively considering all items in the pool implicitly.
- Main Contributions / Findings (What):
- Definitional Clarity: The paper provides formal definitions for ID in Recommender Systems and Generative Recommendation to establish a conceptual foundation for this new research area.
- Taxonomy and Formulation: It organizes the field by creating a taxonomy of seven distinct generative recommendation tasks (e.g., top-N, sequential, explainable) and provides a general, prompt-based formulation for each, offering a practical guide for implementation.
- Systematic Review: According to the authors, this is the first survey that systematically focuses on LLM-based generative recommendation, distinguishing it from broader surveys on LLMs in RS that also cover discriminative approaches.
- Visionary Discussion: The paper identifies and discusses eight critical challenges and future opportunities, including hallucination, fairness, controllability, and LLM-based agents, setting a research agenda for the community.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Recommender Systems (RS): These are systems designed to combat information overload by predicting a user's interest in an item and suggesting relevant items. Traditionally, they operate via a multi-stage filtering paradigm:
- Recall/Retrieval: A fast but simple model (e.g., rule-based) selects a few hundred or thousand candidate items from millions.
- Ranking: A more complex model scores and ranks these candidates to find the most relevant ones.
- Re-ranking: The top-ranked items are re-ordered based on additional criteria like diversity, novelty, or business rules.
- Large Language Models (LLM): These are massive neural networks (e.g., GPT-3, LLaMA) pre-trained on vast amounts of text data. They excel at understanding context and generating coherent, human-like text in response to a given prompt.
- Discriminative vs. Generative AI:
- Discriminative AI: Aims to learn a boundary between different classes of data. In RS, this means learning a function that assigns a high score to a "good" user-item pair and a low score to a "bad" one. It discriminates between candidates.
- Generative AI: Aims to learn the underlying probability distribution of the data. In this context, it means learning to generate the identifier of a suitable item for a user directly, without scoring all other items.
- Hallucination: A critical failure mode of LLMs where the model generates text that is plausible and grammatically correct but factually incorrect or nonsensical. In RS, this could mean recommending an item that does not exist.
- Previous Works & Differentiation:
- The paper acknowledges that many recent works have applied LLMs to RS. However, it points out a crucial limitation: most treat the LLM as a "component" in the conventional discriminative pipeline. For example, an LLM might be used to encode item descriptions into feature vectors, which are then fed into a traditional ranking model.
- Other recent surveys on "LLM-based recommendation" are broader, covering both discriminative and generative uses. This paper differentiates itself by focusing exclusively on the generative paradigm. This sharp focus allows for a deeper and more structured exploration of what it means to build a recommender system that is truly generative from end to end. The authors argue that this is the path to fully leveraging an LLM's power and bridging the gap between academia and industry.
4. Methodology (Core Technology & Implementation)
The paper's core "methodology" is a framework for thinking about and implementing generative recommendation.
- Principles: The central idea is to reframe recommendation as a language generation problem. Instead of scoring items, the system generates them. This is made possible by two key concepts:
- Generalized Item Identifiers (IDs): The paper broadens the concept of an ID beyond a simple integer used for embedding lookups.
Definition 1 (ID in Recommender Systems): An ID in recommender systems is a sequence of tokens that can uniquely identify an entity, such as a user or an item.
This is a pivotal redefinition. An ID can now be a sequence of numbers (e.g., "56 78"), an item title ("Dune"), or even abstract tokens learned from data. This allows any item to be represented in a format that an LLM can process and generate. The paper notes that with a vocabulary of 1000 tokens and an ID length of 10, one can represent $1000^{10} = 10^{30}$ unique items, far more than needed for any real-world system.
- Single-Stage Generative Pipeline: The multi-stage filtering process is replaced by a single LLM that directly generates recommended item IDs.
Definition 2 (Generative Recommendation): A generative recommender system directly generates recommendations or recommendation-related content without the need to calculate each candidate's ranking score one by one.
This process is illustrated in the figure below.
Figure 1 (from the paper): a schematic comparing the traditional recommendation pipeline with LLM-based generative recommendation; the left side shows the traditional multi-stage flow (recall, pre-ranking, ranking, and re-ranking), while the right side shows generative recommendation, where the LLM directly generates the results, simplifying the process. Figure 1 Analysis: The diagram starkly contrasts the two paradigms.
- Left (Traditional RS): A funnel-like process where a massive item pool is sequentially narrowed down by Recall, Pre-ranking, Ranking, and Re-ranking stages. Advanced models only see a tiny fraction of the total items.
- Right (Generative Recommendation): A single, unified process. The LLM takes user information as input and directly generates item IDs as output. This implicitly considers the entire item pool at each token generation step, making the process more holistic.
- Steps & Procedures (ID Creation Methods): For the generative paradigm to work, item IDs must not only be unique but also meaningful, ideally encoding collaborative filtering signals (i.e., similar items should have similar IDs). The paper reviews three advanced methods for creating such IDs; illustrative sketches of all three appear after this list:
- Singular Value Decomposition (SVD):
- First, SVD is performed on the user-item interaction matrix to get item latent factors (embeddings).
- These embeddings are processed through normalization, noise addition (to ensure uniqueness), quantization (to convert continuous values to integers), and offset adjustment.
- The result is a unique sequence of integers for each item, which serves as its ID.
- Collaborative Indexing:
- An item-item co-occurrence graph is constructed.
- Spectral clustering is applied recursively to group similar items, forming a hierarchical tree.
- Each item (a leaf node) gets a unique ID by concatenating the tokens of the nodes on the path from the root to the leaf. Similar items will share a longer common prefix in their IDs.
- Residual-Quantized Variational AutoEncoder (RQ-VAE):
- An item's textual description is encoded into an embedding.
- This embedding is passed to an RQ-VAE. The VAE quantizes the embedding into a sequence of discrete codes (codewords) from a learned codebook.
- This sequence of codewords becomes the item's ID. This method effectively "compresses" semantic information into a structured, token-based ID.
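A minimal sketch of the SVD-based procedure, assuming a dense user-item matrix; the dimensionality, quantization levels, and noise scale are illustrative choices, not the settings of any cited paper:

```python
import numpy as np

def svd_item_ids(interactions, dim=4, levels=256, noise=1e-4, seed=0):
    """Sketch: SVD item factors -> normalize -> tiny noise (uniqueness)
    -> quantize to integers -> per-position offset so tokens at
    different ID positions never collide."""
    rng = np.random.default_rng(seed)
    _, _, vt = np.linalg.svd(np.asarray(interactions, float), full_matrices=False)
    emb = vt[:dim].T                                   # (n_items, dim) latent factors
    emb = (emb - emb.min(0)) / (emb.max(0) - emb.min(0) + 1e-12)  # normalize to [0, 1]
    emb = emb + rng.normal(scale=noise, size=emb.shape)           # break exact ties
    q = np.clip((emb * levels).astype(int), 0, levels - 1)        # quantization
    q = q + levels * np.arange(dim)                               # offset adjustment
    return [" ".join(map(str, row)) for row in q]      # e.g. "17 301 642 800"

ids = svd_item_ids(np.random.rand(100, 40) > 0.9)      # toy user-item matrix
```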
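Next, a sketch of collaborative indexing under the same caveat: recursive spectral clustering over a symmetric item co-occurrence matrix, with a hypothetical branching factor (real implementations differ in clustering details):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def collaborative_ids(co_occurrence, branch=4):
    """Sketch: recursively split items via spectral clustering on the
    co-occurrence (affinity) matrix; the root-to-leaf path of cluster
    labels becomes the ID, so similar items share prefixes."""
    def recurse(idx, prefix, out):
        if len(idx) <= branch:                        # small group: assign leaves
            for k, item in enumerate(idx):
                out[item] = prefix + [str(k)]
            return
        labels = SpectralClustering(
            n_clusters=branch, affinity="precomputed", random_state=0
        ).fit_predict(co_occurrence[np.ix_(idx, idx)])
        for c in range(branch):
            child = [idx[j] for j in np.flatnonzero(labels == c)]
            if not child:
                continue
            if len(child) == len(idx):                # degenerate split: force leaves
                for k, item in enumerate(child):
                    out[item] = prefix + [str(c), str(k)]
            else:
                recurse(child, prefix + [str(c)], out)
    out = {}
    recurse(list(range(len(co_occurrence))), [], out)
    return {item: " ".join(tokens) for item, tokens in out.items()}
```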
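Finally, a sketch of just the residual-quantization step at the core of RQ-VAE; in practice the codebooks are learned jointly with the encoder, while here they are simply given:

```python
import numpy as np

def residual_quantize(emb, codebooks):
    """Sketch: at each level, snap the current residual to its nearest
    codeword and subtract it; the codeword indices form the item ID."""
    ids, residual = [], np.asarray(emb, float).copy()
    for book in codebooks:                        # one (K, d) codebook per level
        k = int(np.linalg.norm(book - residual, axis=1).argmin())
        ids.append(k)
        residual = residual - book[k]             # quantize the leftover detail
    return ids                                    # e.g. [12, 3, 41]

codebooks = [np.random.randn(256, 32) for _ in range(3)]
print(residual_quantize(np.random.randn(32), codebooks))
```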
- Implementation via Prompting: The paper provides a general formulation for various recommendation tasks using prompts, which instruct the LLM on what to do. Below is a summary of the tasks and their inputs and outputs; a minimal prompt-building sketch follows the list.
- Manual Transcription of Table 1: Methods of representing IDs for LLM-based generative recommendation.

| Item ID | User ID | Related Work |
| --- | --- | --- |
| Token Sequence (e.g., "56 78") | Token Sequence | (Petrov and Macdonald, 2023), TransRec (Lin et al., 2023b), LC-Rec (Zheng et al., 2023), (Hua et al., 2023b) |
| Item Title (e.g., "Dune") | Interaction History (e.g., "Dune", "Her", ...) | P5 (Geng et al., 2022c; Wang et al., 2023c; Lin and Zhang, 2023; Di Palma et al., 2023; Li et al., 2023d) |
| Item Title + Metadata | Metadata (e.g., age) | InteRecAgent (Huang et al., 2023), (Zhang et al., 2023b; He et al., 2023) |
| Embedding ID | Embedding ID | PEPLER (Li et al., 2023a) |

- Task Formulations:
- Rating Prediction: Predict the rating a user would give an item.
  - Input: A prompt p(u, i) built from the user and item IDs, asking for a rating.
  - Output: A numerical string, e.g., "4.12"
- Top-N Recommendation: Generate a list of N items for a user.
  - Input: A prompt p(u) asking for item recommendations for the user.
  - Output: An item ID, e.g., "9312"
- Sequential Recommendation: Predict the next item a user will interact with based on their history.
  - Input: A prompt containing the user's interaction history.
  - Output: An item ID, e.g., "6789"
- Explainable Recommendation: Generate a natural language explanation for a recommendation.
  - Input: A prompt p(u, i) asking why item i suits user u.
  - Output: A sentence, e.g., "The movie is top-notch."
- Review Generation: Generate a review for a user-item interaction.
  - Input: A prompt p(u, i) asking the LLM to write a review on the user's behalf.
  - Output: A paragraph of text.
- Review Summarization: Summarize a given review.
  - Input: A prompt p(u, i, R) containing the review text R.
  - Output: A short summary, e.g., "great location"
- Conversational Recommendation: Interact with a user over multiple turns to provide recommendations.
  - Input: A dialogue history with speaker labels ("USER", "SYSTEM").
  - Output: A generated response from the "SYSTEM".
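To make the formulations concrete, here is a minimal prompt-building sketch; the template wording and slot names are hypothetical, not the surveyed papers' actual prompts:

```python
# Hypothetical templates; each surveyed paper uses its own wording.
TEMPLATES = {
    "rating":      "How would user_{u} rate item_{i}? Answer with a number.",
    "topn":        "Recommend {n} items for user_{u}.",
    "sequential":  "user_{u} interacted with {history}. Predict the next item ID.",
    "explanation": "Explain to user_{u} why item_{i} was recommended.",
}

def build_prompt(task: str, **slots) -> str:
    """Fill a task template; the LLM's decoded text is then parsed as a
    rating, an item ID, or free text depending on the task."""
    return TEMPLATES[task].format(**slots)

print(build_prompt("sequential", u=1234, history="6293, 1922, 4801"))
```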
5. Experimental Setup
Since this is a survey, it does not present its own experiments but instead summarizes the evaluation protocols used in the field.
- Datasets: The paper does not focus on specific datasets but implicitly refers to standard recommendation datasets used by the cited works, such as those from Amazon, Yelp, and MovieLens.
- Evaluation Metrics: The paper discusses metrics for two categories of tasks; a minimal computation sketch follows the definitions below.
- For Recommendation Tasks (Rating, Top-N, Sequential):
- Root Mean Square Error (RMSE):
- Conceptual Definition: Measures the average magnitude of the errors between predicted ratings and actual ratings. It penalizes larger errors more heavily than smaller ones.
- Mathematical Formula:
  $$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|}\sum_{(u,i)\in\mathcal{T}}\left(\hat{r}_{u,i} - r_{u,i}\right)^{2}}$$
- Symbol Explanation:
  - $\mathcal{T}$: The set of user-item pairs in the test set.
  - $\hat{r}_{u,i}$: The predicted rating for user $u$ on item $i$.
  - $r_{u,i}$: The actual rating given by user $u$ to item $i$.
- Mean Absolute Error (MAE):
- Conceptual Definition: Measures the average absolute difference between predicted and actual ratings. Unlike RMSE, it treats all errors linearly.
- Mathematical Formula:
  $$\mathrm{MAE} = \frac{1}{|\mathcal{T}|}\sum_{(u,i)\in\mathcal{T}}\left|\hat{r}_{u,i} - r_{u,i}\right|$$
- Symbol Explanation: Same as RMSE.
- Normalized Discounted Cumulative Gain (NDCG):
- Conceptual Definition: A ranking metric that evaluates the quality of a recommended list. It gives higher scores to more relevant items and penalizes placing relevant items lower in the list. The score is normalized by the ideal ranking score.
- Mathematical Formula:
  $$\mathrm{DCG@}K = \sum_{k=1}^{K}\frac{rel_k}{\log_2(k+1)}, \qquad \mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}$$
- Symbol Explanation:
  - $K$: The number of items in the recommended list.
  - $rel_k$: The relevance score of the item at rank $k$.
  - $\mathrm{DCG@}K$: Discounted Cumulative Gain at rank $K$.
  - $\mathrm{IDCG@}K$: The DCG score of the ideal ranking of the top $K$ items.
- Precision@K and Recall@K:
- Conceptual Definition:
- Precision: What fraction of the recommended items are relevant?
- Recall: What fraction of all relevant items were successfully recommended?
- Mathematical Formula:
  $$\mathrm{Precision@}K = \frac{|\{\text{Recommended Items}\} \cap \{\text{Relevant Items}\}|}{K}, \qquad \mathrm{Recall@}K = \frac{|\{\text{Recommended Items}\} \cap \{\text{Relevant Items}\}|}{|\{\text{Relevant Items}\}|}$$
- Symbol Explanation:
  - $K$: The length of the recommended list.
  - $\{\text{Recommended Items}\}$: The set of top-$K$ items recommended by the model.
  - $\{\text{Relevant Items}\}$: The set of items in the test set that the user actually interacted with.
- For Natural Language Generation Tasks:
- BLEU (Bilingual Evaluation Understudy):
- Conceptual Definition: Measures how many n-grams (sequences of n words) in the machine-generated text overlap with n-grams in the human-written reference text. It emphasizes precision.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
- Conceptual Definition: Similar to BLEU, but recall-oriented. It measures how many n-grams from the reference text appear in the generated text.
- BERTScore:
- Conceptual Definition: A more advanced metric that uses contextual embeddings (from models like BERT) to measure the semantic similarity between tokens in the generated and reference texts, rather than just exact string matching.
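As a concrete reference, below is a minimal NumPy sketch of the recommendation metrics defined above; BLEU, ROUGE, and BERTScore are normally computed with existing libraries rather than by hand:

```python
import numpy as np

def rmse(pred, actual):
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

def ndcg_at_k(ranked_relevance, k):
    """DCG of the recommended order divided by DCG of the ideal order."""
    rel = np.asarray(ranked_relevance, float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel[:k] * discounts[:k]).sum())
    idcg = float((np.sort(rel)[::-1][:k] * discounts[:k]).sum())
    return dcg / idcg if idcg > 0 else 0.0

def precision_recall_at_k(recommended, relevant, k):
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / max(len(relevant), 1)

print(rmse([4.1, 3.0], [4.0, 3.5]))                 # rating prediction error
print(ndcg_at_k([1, 0, 1, 0], k=3))                 # ranking quality
print(precision_recall_at_k(["a", "b"], {"a", "c"}, k=2))
```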
- Baselines: The paper categorizes and lists numerous models from recent literature as examples for each task, effectively treating them as representative approaches rather than baselines in a direct comparison.
6. Results & Analysis
This survey synthesizes findings from across the literature rather than presenting new results. The key analysis is the organization of existing work into a coherent taxonomy.
- Core Results (State of the Field): The primary "result" is the comprehensive mapping of recent papers to the new generative recommendation paradigm.
- Manual Transcription of Table 2: Seven typical generative recommendation tasks with LLM.
  - Rating Prediction: P5 (Geng et al., 2022c), BookGPT (Li et al., 2023g), LLMRec (Liu et al., 2023b), RecMind (Wang et al., 2023d), Llama4Rec (Luo et al., 2024), (Liu et al., 2023a; Dai et al., 2023; Li et al., 2023d)
  - Top-N Recommendation: P5 (Geng et al., 2022c), UP5 (Hua et al., 2024), VIP5 (Geng et al., 2023), OpenP5 (Xu et al., 2023b), POD (Li et al., 2023b), GPTRec (Petrov and Macdonald, 2023), LLMRec (Liu et al., 2023b), RecMind (Wang et al., 2023d), NIR (Wang and Lim, 2023), Llama4Rec (Luo et al., 2024), (Zhang et al., 2023b,c; Liu et al., 2023a; Li et al., 2023f; Dai et al., 2023; Di Palma et al., 2023; Carraro and Bridge, 2024)
  - Sequential Recommendation: P5 (Geng et al., 2022c), UP5 (Hua et al., 2024), VIP5 (Geng et al., 2023), OpenP5 (Xu et al., 2023b), POD (Li et al., 2023b), GenRec (Ji et al., 2024), GPTRec (Petrov and Macdonald, 2023), LMRecSys (Zhang et al., 2021), PALR (Yan et al., 2023), LLM-Rec (Liu et al., 2023b), RecMind (Wang et al., 2023d), BIGRec (Bao et al., 2023a), TransRec (Lin et al., 2023b), LC-Rec (Zheng et al., 2023), LLaRa (Liao et al., 2023), (Hua et al., 2023b; Liu et al., 2023a; Hou et al., 2024; Zhang et al., 2023c)
  - Explainable Recommendation: P5 (Geng et al., 2022c), VIP5 (Geng et al., 2023), POD (Li et al., 2023b), PEPLER (Li et al., 2023a), M6-Rec (Cui et al., 2022), LLMRec (Liu et al., 2023b), RecMind (Wang et al., 2023d), Logic-Scaffolding (Rahdari et al., 2024), (Liu et al., 2023a)
  - Review Generation and Review Summarization: P5 (Geng et al., 2022c), LLMRec (Liu et al., 2023b), RecMind (Wang et al., 2023d), (Liu et al., 2023a)
  - Conversational Recommendation: M6-Rec (Cui et al., 2022), RecLLM (Friedman et al., 2023), InteRecAgent (Huang et al., 2023), PECRS (Ravaut et al., 2024), (Liu et al., 2023c; Lin and Zhang, 2023; He et al., 2023), (Wang et al., 2023c)
- Analysis: This table shows that Top-N and Sequential Recommendation are the most explored tasks, while Review Generation is a notable gap. Models like P5 are versatile, tackling multiple tasks by changing the instructional prompt. The sheer number of recent citations (mostly from 2023 and 2024) highlights the rapid growth and timeliness of this research area.
- Ablations / Parameter Sensitivity: Not applicable, as this is a survey paper.
7. Conclusion & Reflections
- Conclusion Summary: The paper concludes that LLM-based generative recommendation represents a significant and promising evolution for the field of recommender systems. By generalizing the concept of IDs and reframing recommendation as a direct generation task, LLMs can simplify complex pipelines and potentially create more powerful, flexible, and personalized services. The survey provides a structured overview of the progress, methods, and key tasks, and outlines a rich set of future research directions.
- Limitations & Future Work (as "Challenges and Opportunities"): The authors present a detailed roadmap of open problems:
- LLM-based Agents: Moving beyond simple data simulation to building agents that can use tools (e.g., map APIs, SQL) to fulfill complex, multi-step user requests like planning a trip.
- Hallucination: A critical safety issue. Recommending non-existent items is unacceptable. Proposed solutions include using structured IDs (like those from collaborative indexing) that are guaranteed to exist, or using retrieval-augmented generation to ground the LLM's output in a real item database. A minimal sketch of ID-constrained decoding appears after this list.
- Bias and Fairness: LLMs can both inherit and amplify biases from training data (e.g., gender bias in explanations, popularity bias in recommendations). A key open question is defining the boundary between harmful bias and legitimate personalization.
- Transparency and Explainability: Two levels of explainability are needed: 1) generating natural language explanations for a recommendation, and 2) explaining the internal reasoning of the LLM itself, which remains a major challenge.
- Controllability: Users should be able to control the properties of recommendations (e.g., price range, brand) and explanations (e.g., feature to focus on). Current LLMs often fail to follow such constraints precisely.
- Inference Efficiency: The massive size of LLMs makes real-time inference a major hurdle for latency-sensitive recommendation applications. More research is needed on improving efficiency beyond standard techniques like adapter tuning.
- Multimodal Recommendation: Extending the generative paradigm to other modalities like images, audio, and video for tasks like fashion, music, or short-video recommendation.
- Cold-start Recommendation: The world knowledge embedded in LLMs gives them a natural advantage in handling new users or items, as they can rely on metadata (descriptions, user profiles) when interaction data is sparse.
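As an illustration of the structured-ID mitigation mentioned above, here is a minimal sketch, assuming tokenized item IDs (the hook into an actual LLM decoder's logit mask is omitted), of a prefix trie that restricts decoding to existing items:

```python
def build_trie(item_ids):
    """Prefix trie over tokenized item IDs: only root-to-leaf paths in
    the trie correspond to real items, so decoding constrained by it
    cannot hallucinate a non-existent item."""
    root = {}
    for tokens in item_ids:
        node = root
        for t in tokens:
            node = node.setdefault(t, {})
        node["<eos>"] = {}                      # mark a complete, valid ID
    return root

def allowed_next_tokens(trie, prefix):
    """Tokens the decoder may emit next, given the ID tokens so far."""
    node = trie
    for t in prefix:
        if t not in node:
            return []                           # prefix is already invalid
        node = node[t]
    return list(node.keys())

trie = build_trie([["56", "78"], ["56", "91"], ["60", "02"]])
print(allowed_next_tokens(trie, ["56"]))        # ['78', '91']
```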
- Personal Insights & Critique:
- Strength: The paper's primary strength is its clear, visionary framing. By sharply defining "generative recommendation" and contrasting it with the discriminative use of LLMs, it provides an invaluable conceptual lens for navigating a chaotic and fast-moving field. The taxonomy of tasks and ID creation methods is highly pragmatic.
- Critique: As a survey in a field evolving this quickly, its specific examples and model citations may become outdated rapidly. The proposed ID creation methods (SVD, Spectral Clustering, RQ-VAE), while elegant, may pose significant engineering challenges for re-indexing in industrial systems with millions of new items added daily.
- Future Impact: This paper is likely to be highly influential. It provides the vocabulary and structure for a new subfield. Future research will likely build directly on the challenges it outlines, particularly in tackling hallucination, controllability, and efficiency to make the generative vision a practical reality. The distinction between "bias" and "personalization" is a profound and difficult question that the community must address.