A Survey of Generative Recommendation from a Tri-Decoupled Perspective: Tokenization, Architecture, and Optimization
TL;DR Summary
This survey explores three key aspects of generative recommendation systems: tokenization, architecture, and optimization, highlighting how generative methods mitigate error propagation, enhance hardware utilization, and extend optimization beyond local user behavior, while tracing the field's evolution from discriminative pipelines to unified generative frameworks.
Abstract
Abstract not provided in the supplied PDF first-page text.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
A Survey of Generative Recommendation from a Tri-Decoupled Perspective: Tokenization, Architecture, and Optimization
1.2. Authors
Xiaopeng Li, Bo Chen, Junda She, Shiteng Cao, You Wang, Qinlin Jia, Haiying He, Zheli Zhou, Zhao Liu, Ji Liu, Zhiyang Zhang, Yu Zhou, Guoping Tang, Yiqing Yang, Chengcheng Guo, Si Dong, Kuo Cai, Pengyue Jia, Maolin Wang, Wanyu Wang, Shiyao Wang, Xinchen Luo, Qigen Hu, Qiang Luo, Xiao Lv, Chaoyi Ma, Ruiming Tang, Kun Gai, Guorui Zhou, and Xiangyu Zhao.
Their affiliations include City University of Hong Kong and Kuaishou Technology. Kun Gai is unaffiliated. Ruiming Tang, Guorui Zhou, and Xiangyu Zhao are the corresponding authors.
1.3. Journal/Conference
This paper is published on Preprints.org, a free multidisciplinary platform providing preprint services. Preprints posted on Preprints.org appear in Web of Science, Crossref, Google Scholar, Scilit, and Europe PMC. As a preprint platform, it allows for early dissemination of research outputs before formal peer review and publication in a traditional journal or conference.
1.4. Publication Year
2025 (Posted Date: 4 December 2025)
1.5. Abstract
The recommender systems community is experiencing a rapid shift from traditional multi-stage cascaded discriminative pipelines (retrieval, ranking, and re-ranking) towards unified generative frameworks that directly generate items. This emerging paradigm, driven by advancements in generative models and the demand for end-to-end architectures that improve Model FLOPS Utilization (MFU), offers several benefits: mitigating cascaded error propagation, improving hardware utilization, and optimizing beyond local user behaviors. This survey provides a comprehensive analysis of generative recommendation from a tri-decoupled perspective: tokenization, architecture, and optimization. It traces the evolution of tokenization from sparse ID- and text-based encodings to semantic identifiers; analyzes encoder-decoder, decoder-only, and diffusion-based architectures; and reviews the transition from supervised next-token prediction to reinforcement learning-based preference alignment. The authors also summarize practical deployments across cascade stages and application scenarios and examine key open challenges. The survey aims to serve as a foundational reference for researchers and an actionable blueprint for industrial practitioners building next-generation generative recommender systems.
1.6. Original Source Link
/files/papers/6932e0a6574a23595ada718d/paper.pdf (This is a local file path, indicating the paper was provided as a local PDF. Its publication status is a preprint on Preprints.org.)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inherent limitations of traditional discriminative recommender systems, which have been the predominant paradigm in academia and industry. These limitations include:
- Semantic Isolation and Cold-Start Problem: Traditional systems treat items as atomic units within embedding tables, leading to semantic isolation. This exacerbates cold-start problems for new or less-interacted items and introduces computational inefficiencies, with embedding tables consuming over 90% of parameters.
- Inefficient Architecture and Low MFU: Discriminative models often rely on specialized, small-scale operators, leading to considerable communication and data transfer overhead. This results in severely limited hardware utilization efficiency, with Model FLOPS Utilization (MFU) typically less than 5%, a stark contrast to Large Language Models (LLMs) achieving over 40% MFU during training.
- Limited Scaling and Emergent Capabilities: Production discriminative systems usually employ modest-sized models (e.g., dense MLPs with fewer than 0.1B parameters), which constrains their capacity for scaling up and prevents them from exhibiting the emergent capabilities observed in LLMs.
- Local and Constrained Optimization: Current discriminative training strategies primarily optimize local decision boundaries based on users' posterior behaviors, lacking explicit characterization of the full probability distribution over items and multi-dimensional preference modeling (e.g., platform-level objectives).
- Cascaded Error Propagation: The multi-stage cascaded framework (retrieval, pre-ranking, ranking, re-ranking) inevitably introduces cumulative errors and information loss, degrading recommendation quality.
The rapid advancements in LLMs have demonstrated exceptional semantic understanding and reasoning capabilities, prompting researchers to explore their application in recommender systems. However, many LLM-enhanced approaches still operate within the discriminative paradigm. The paper argues for a Generative Recommendation (GR) paradigm shift that directly generates item identifiers, eliminating the need for multi-stage cascaded processing and addressing the aforementioned limitations.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Comprehensive Tri-Dimensional Survey: It presents the first comprehensive survey analyzing generative recommender systems through a tri-dimensional decomposition: tokenization, architectural design, and optimization strategies. This framework organizes existing work and traces the evolution from discriminative to generative paradigms.
- Identification of Key Trends: Through systematic overview and analysis, it identifies key trends:
  - Tokenization: towards efficient representation with semantic identifiers (SIDs) that balance vocabulary compactness and semantic expressiveness.
  - Architecture: advances in model architecture that facilitate improved scalability and resource-efficient computation.
  - Optimization: multi-dimensional preference alignment aimed at balancing the objectives of users, the platform, and additional stakeholders.
- In-depth Discussion of Applications and Challenges: It provides an in-depth discussion of GR applications across different stages and scenarios (e.g., cold start, cross-domain, search, auto-bidding), examines current challenges, and outlines promising future directions (e.g., end-to-end modeling, efficiency, reasoning, data optimization, interactive agents, and the transition from recommendation to generation).

The key conclusions are that generative recommendation represents a fundamental paradigm shift with significant potential. It offers advantages in:
- Tokenization: revolutionizing item representation from sparse IDs to text-based or semantic IDs, enabling rich semantic modeling, addressing cold-start issues, and achieving parameter efficiency.
- Architecture: employing unified encoder-decoder, decoder-only, or diffusion-based structures that possess inherent scalability and higher MFU, and can leverage innovations from NLP.
- Optimization: moving from local decision boundary optimization to capturing full probability distributions using Next-Token Prediction (NTP) and Reinforcement Learning (RL)-based preference alignment for multi-dimensional preference and platform-level objective optimization, enabling end-to-end training.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Recommender Systems: Systems designed to predict user preferences and suggest items (products, movies, music, etc.) that users are likely to enjoy. They are crucial for enhancing user engagement and platform value.
- Discriminative Models: A class of models that learn to distinguish between different classes or predict a specific outcome. In recommender systems, they typically learn to score items or predict interaction probabilities (e.g., click-through rate, purchase probability) and then rank items based on these scores.
- Generative Models: A class of models that learn the underlying distribution of data and can generate new, similar data instances. In generative recommendation, this means directly generating item identifiers or sequences of items.
- Multi-stage Cascaded Pipeline: A common architecture in industrial recommender systems where the recommendation process is broken down into sequential stages (e.g., retrieval, pre-ranking, ranking, re-ranking). Each stage progressively filters down a large set of items to a smaller, more relevant list.
- Retrieval: The initial stage, which narrows down millions of items to a few thousand candidates that are potentially relevant to the user.
- Ranking: Ranks the retrieved candidates to present the most relevant ones to the user.
- Re-ranking: Further refines the ranked list, often considering global list properties like diversity or novelty, or specific business objectives.
- Embedding Table: A lookup table used in discriminative models to convert sparse, categorical features (like user IDs or item IDs) into dense, continuous vector representations (embeddings). These embeddings capture semantic relationships and are then fed into neural networks.
- MLP (Multi-Layer Perceptron): A type of artificial neural network consisting of multiple layers of nodes in a directed graph, where each layer is fully connected to the next. MLPs are used to learn complex non-linear relationships between features.
- Model FLOPS Utilization (MFU): A metric that measures how efficiently a model utilizes the theoretical peak floating-point operations per second (FLOPS) of the hardware. A low MFU indicates that the hardware is underutilized.
- LLM (Large Language Model): A type of deep learning model trained on a massive amount of text data. LLMs are capable of understanding, generating, and reasoning about human language, exhibiting capabilities like semantic understanding and reasoning.
- Next-Token Prediction (NTP): A common training objective for LLMs and generative sequence models, where the model is trained to predict the next token in a sequence given the preceding tokens.
- Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties. It is often used to optimize long-term objectives.
- Tokenization: The process of converting input data (text, items, features) into discrete units (tokens) that a model can process.
- Encoder-Decoder Architecture: A neural network architecture often used for sequence-to-sequence tasks (e.g., machine translation). An encoder processes the input sequence into a context representation, and a decoder generates an output sequence from that representation.
- Decoder-Only Architecture: A neural network architecture that consists only of a decoder component, typically used in LLMs for generative tasks. It generates output tokens sequentially based on previous output tokens and an initial prompt.
- Diffusion-Based Models: A class of generative models that learn to generate data by gradually denoising a signal starting from random noise. They can generate data in parallel and offer flexible control over the generation process.
3.2. Previous Works
The paper discusses previous works primarily in the context of discriminative models and their limitations, as well as early attempts to integrate LLMs into recommender systems.
- Discriminative Models:
  - Embedding & MLP Paradigm: This is the foundation of many traditional discriminative recommender systems, where features are encoded into dense embeddings and then processed by MLPs. Examples mentioned include DeepFM [7] and Deep & Cross Network (DCN) [8] for CTR prediction.
    - DeepFM [7]: Combines a Factorization Machine (FM) component for feature interactions with a deep neural network (DNN) component for high-order feature learning. The FM part models pairwise feature interactions, while the DNN learns complex non-linear interactions.
    - DCN [8]: Designed to efficiently capture feature interactions of various orders. It combines a deep neural network with a cross network that explicitly applies feature crosses at each layer, enabling it to learn bounded-degree feature interactions.
  - Behavior Modeling: Models like DIN [52] focus on capturing users' short-term and long-term behavior patterns from their temporal interaction sequences.
    - DIN (Deep Interest Network) [52]: Introduces an attention mechanism to adaptively calculate the representation of user interests from historical behavior data with respect to a candidate item, allowing the model to focus on relevant past behaviors. The core idea is that different candidate items should activate different subsets of the user's past behaviors. The attention mechanism in DIN calculates an activation weight $w_j = a(\mathbf{e}_j, \mathbf{e}_c)$ for each historical item $j$ given a candidate item $c$, where $\mathbf{e}_j$ is the embedding of historical item $j$, $\mathbf{e}_c$ is the embedding of the candidate item, and $a(\cdot, \cdot)$ is a small MLP that computes the compatibility between $\mathbf{e}_j$ and $\mathbf{e}_c$. The user's interest representation is then a weighted sum of historical item embeddings: $\mathbf{v}_U = \sum_j w_j \mathbf{e}_j$ (a minimal sketch of this pooling appears at the end of this subsection).
  - Multi-stage Cascaded Framework: Industrial systems often use stages like retrieval, pre-ranking, ranking, and re-ranking to handle millions of items under strict latency constraints, as exemplified by YouTube's recommendation system [22].
- LLM-enhanced Approaches: Early integrations of LLMs into recommendation tasks are categorized into:
  - Semantic Enhancement [26,27]: Using LLMs to extract or augment semantic information for items or users.
  - Data Enhancement [55,56]: Using LLMs to generate synthetic data or augment existing data.
  - Alignment Enhancement [57,58]: Using LLMs to align different modalities or objectives.
  - The paper notes that despite these enhancements, these approaches remain fundamentally constrained by the discriminative paradigm.
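To make the attention-pooling idea behind DIN concrete, here is a minimal NumPy sketch; it is not the authors' code, and the layer sizes, ReLU activation, and concatenated interaction features are illustrative assumptions:

```python
import numpy as np

def din_attention_pooling(history_emb, candidate_emb, mlp_w1, mlp_b1, mlp_w2, mlp_b2):
    """Weight each historical item embedding by its compatibility with the candidate item."""
    n = history_emb.shape[0]
    cand = np.tile(candidate_emb, (n, 1))
    # Pairwise interaction features [e_j, e_c, e_j * e_c] for every history item j.
    feats = np.concatenate([history_emb, cand, history_emb * cand], axis=1)
    # Small MLP a(e_j, e_c) producing one activation weight per history item.
    hidden = np.maximum(feats @ mlp_w1 + mlp_b1, 0.0)      # ReLU
    weights = (hidden @ mlp_w2 + mlp_b2).squeeze(-1)        # shape: (n,)
    # DIN keeps raw (unnormalized) weights; a softmax variant is also common.
    return weights[:, None] * history_emb                   # weighted history embeddings

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    history = rng.normal(size=(5, d))          # 5 historical item embeddings (hypothetical)
    candidate = rng.normal(size=(d,))
    w1 = rng.normal(size=(3 * d, 16)); b1 = np.zeros(16)
    w2 = rng.normal(size=(16, 1)); b2 = np.zeros(1)
    user_interest = din_attention_pooling(history, candidate, w1, b1, w2, b2).sum(axis=0)
    print(user_interest.shape)  # (8,)
```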
3.3. Technological Evolution
The technological evolution in recommender systems has generally moved from simpler Machine Learning (ML)-based models (e.g., collaborative filtering [48,49], matrix factorization [50,51]) to Deep Learning (DL)-based models (Embedding & MLP paradigm). The current shift is from these discriminative DL models to generative models, heavily influenced by the rise of LLMs.
- Early ML Models: Focused on similarity and implicit relationships, often suffering from sparsity and cold-start.
- DL Era (Discriminative): Addressed limitations of ML models by learning complex feature interactions and user behaviors through neural networks. However, they introduced issues like semantic isolation in embedding tables, low MFU, and limited scalability.
- LLM-enhanced (Transitional): Initially, LLMs were used to augment or enhance discriminative systems, but the core paradigm remained discriminative.
- Generative Recommendation (Emerging): Represents a fundamental paradigm shift. Instead of scoring and ranking, it directly generates items. This new paradigm leverages LLM architectures and training strategies, aiming to overcome the limitations of discriminative models regarding tokenization, architecture, and optimization. Semantic ID (SID)-based methods are highlighted as a key evolution in tokenization, combining the benefits of semantic information with efficient representation.
3.4. Differentiation Analysis
The core differences and innovations of the generative approach, as presented in this paper, compared to the discriminative paradigm, are summarized across three dimensions:
- Tokenization:
  - Discriminative: Relies on sparse ID-based embeddings, where each item is an atomic unit with a randomly assigned ID. This leads to semantic isolation, cold-start problems, and large, sparse embedding tables (90%+ of parameters).
  - Generative: Revolutionizes tokenization by operating at the semantic level. It uses textual or semantic identifiers (SIDs) for feature extraction, enabling rich semantic modeling, addressing cold-start/cross-domain issues, and achieving parameter efficiency through compact vocabulary design. SID-based methods, in particular, aim to provide compact, semantically rich representations with controllable vocabulary sizes.
- Architecture:
  - Discriminative: Employs a variety of specialized, small-scale operators in a cascaded framework, leading to irregular computation patterns, low MFU (typically < 5%), and limited scalability.
  - Generative: Typically uses unified encoder-decoder, decoder-only, or diffusion-based architectures similar to LLMs. This leads to enhanced computational regularity, higher MFU (potentially > 40%), and inherent scalability, allowing models to exhibit emergent capabilities. It also enables seamless leveraging of NLP community innovations.
- Optimization:
  - Discriminative: Relies on discriminative training to optimize local decision boundaries for users' posterior behaviors (e.g., CTR prediction). It struggles with explicitly characterizing the full probability distribution over items and multi-dimensional preference modeling.
  - Generative: Trains with Next-Token Prediction (NTP) to naturally capture the full probability distribution over items and model the entire user behavior generation process. It incorporates preference alignment strategies, particularly Reinforcement Learning (RL)-based techniques, for multi-dimensional preference optimization (e.g., platform-level objectives), enabling end-to-end optimization and preventing cumulative information loss.

Figure 2 from the original paper effectively illustrates these differences:
VLM Description: The image is a chart that compares discriminative recommendation and generative recommendation, highlighting the differences in input, optimization, and architecture across the two paradigms. It specifically illustrates aspects like discriminative output, generative output with probability predictions, feature embedding, and model architecture, clearly showing the distinctions between the two recommendation approaches.
The original caption: Figure 2. Comparison of discriminative and generative recommendation paradigms.
4. Methodology
The paper systematically examines the methodology of generative recommendation through three core components: tokenizer, model architecture, and optimization strategy.
4.1. Tokenizer
The tokenizer defines how items and features are represented as discrete fundamental units (tokens) for the generative model. An effective tokenizer must balance semantic expressiveness with vocabulary size and ensure accurate grounding of generated tokens to actual items.
4.1.1. Evolution of Tokenizer Paradigms
The paper categorizes current approaches into three types: sparse ID-based, text-based, and semantic ID (SID)-based.
4.1.1.1. Sparse ID-Based Identifiers
These follow the conventional embedding & MLP paradigm of discriminative models. Each item is represented by a randomly assigned sparse ID, which carries no semantic information. The embedding layer assigns an independent parameterized vector to each sparse ID. MLP layers then process these embeddings to learn feature interactions and generate recommendations.
- Advantages:
  - Unique ID: Avoids ID collision with a unique ID for each item.
  - Diverse Feature Representation: Facilitates direct representation of diverse features and interaction networks.
- Generative Adaptation: Some generative methods (e.g., HSTU [35], MTGR [65]) adopt sparse ID-based tokenization by converting user behaviors into chronologically ordered token sequences and reformulating recommendation as a sequential transduction task using causal autoregressive modeling (a minimal sketch of such sequence construction appears at the end of this subsection).
  - HSTU [35]: Abandons numerical features and introduces action tokens. It builds sequences by interleaving the sparse IDs of items and actions, which allows the model to predict either the next item or the next action.
  - GenRank [62]: To address the overhead of HSTU's interleaved sequence, GenRank combines item tokens with action tokens, treating items as positional information and focusing on iteratively predicting the actions associated with each item. This is an action-oriented organization.
  - DFGR [68]: Similar to GenRank, DFGR treats the item and action as a single token by concatenating their ID embeddings.
- Limitations:
  - Lack of Multimodal Semantic Information: IDs are random and lack inherent semantic meaning.
  - Cold-Start Problem: Sparse ID embeddings are learned from interaction data, so new or rarely interacted items suffer from inadequate feature learning.
  - Vocabulary Explosion: The enormous item vocabulary (hundreds of millions of tokens) leads to an excessively large output space for next-item prediction, making it challenging for generative models.
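The following sketch illustrates the kind of interleaved item/action sequence construction described above; the vocabulary layout, action set, and offset scheme are hypothetical, not the layout used by any particular paper:

```python
from typing import List, Tuple

def interleave_item_action_tokens(
    behaviors: List[Tuple[int, str]],
    action_vocab: dict,
    num_items: int,
) -> List[int]:
    """Turn a chronological behavior list [(item_id, action), ...] into one token
    sequence [item_1, action_1, item_2, action_2, ...]. Action tokens are offset
    past the item vocabulary so the two token types never collide."""
    tokens = []
    for item_id, action in behaviors:
        tokens.append(item_id)                           # sparse item ID token
        tokens.append(num_items + action_vocab[action])  # offset action token
    return tokens

if __name__ == "__main__":
    action_vocab = {"click": 0, "like": 1, "purchase": 2}
    behaviors = [(101, "click"), (205, "like"), (101, "purchase")]
    print(interleave_item_action_tokens(behaviors, action_vocab, num_items=1_000_000))
    # [101, 1000000, 205, 1000001, 101, 1000002]
```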
4.1.1.2. Text-Based Identifiers
These methods represent items through their textual descriptions, leveraging the pre-trained vocabularies of LLMs. Recommendation is reframed as a question-answering or text generation task.
- Mechanism: Items are represented by textual attributes (e.g., "title: iPhone 17 Pro") or structured templates (e.g., "Product: iPhone; Brand: Apple; Category: Electronics").
  LLMs use their world knowledge and reasoning capabilities to infer user preferences (an illustrative prompt template appears at the end of this subsection).
- Advantages:
  - Alleviates Cold-Start/Long-Tail: LLMs' pre-trained knowledge helps with items lacking interaction data.
  - Cross-Domain Generalizability: Enables transferability across different domains.
  - Enhanced Interpretability/Conversational Interaction: Facilitates more natural user interactions.
- Examples:
  - M6-Rec [69]: Uses product attributes and descriptions to populate natural language templates for item tokenization.
  - LLMTreeRec [70]: Emphasizes the hierarchical structure of product attributes to constrain generation and avoid excessive text length.
  - S-DPO [38]: Takes user interaction histories as text prompts and predicts the title of the target item, optimizing the probability of positive samples.
- Challenges:
  - Ambiguity in Grounding: Generated text tokens may not uniquely identify a specific item.
  - Computational Inefficiency: Text-based descriptions require a large number of tokens, reducing computational efficiency.
  - Lack of Collaborative Signals: Purely text-based approaches may struggle to capture collaborative relationships (e.g., "Item_A-Item_B" co-occurrence) directly within the model's parameter space. Some works (e.g., LLaRa [75]) incorporate item representations from traditional recommendation models to integrate collaborative and semantic information.
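As an illustration of how such text-based methods serialize behavior into a prompt, here is a small hypothetical template (not the prompt of any specific paper); the generated title still needs a grounding step back to a catalog item:

```python
def build_recommendation_prompt(history_titles, k=1):
    """Serialize a user's interaction history into a text prompt for an LLM."""
    lines = [f"{i + 1}. {title}" for i, title in enumerate(history_titles)]
    return (
        "A user has interacted with the following products, in chronological order:\n"
        + "\n".join(lines)
        + f"\nRecommend the title of the next product the user is most likely to want ({k} item)."
    )

if __name__ == "__main__":
    print(build_recommendation_prompt(["iPhone 17 Pro", "AirPods Pro", "MagSafe Charger"]))
```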
4.1.1.3. Semantic ID (SID)-Based Identifiers
SID-based methods address the limitations of both sparse ID-based (limited semantics, sparse vocabulary) and text-based (inefficient representation, grounding difficulty) approaches. They represent an item using a fixed-length sequence of correlated semantic IDs, providing compact, semantically rich representations with a controllable vocabulary size.
4.1.2. Semantic ID Construction
The construction of SIDs typically involves a two-step process: embedding extraction and quantization.
4.1.2.1. Embedding Extraction
This step transforms items' semantic information into a semantic embedding using pre-trained models.
- Static Content Features: Early works like TIGER [29] and LC-Rec [82] generate embeddings solely from static item content (text, images) using models like BERT [77] for text or CLIP [78] for multimodal data.
- Collaborative Signals: To address the lack of collaborative information, later models (e.g., LETTER [37], EAGER [83], OneRec [15,32], UNGER [84]) inject collaborative signals and jointly learn collaborative and semantic cross-modality embeddings.
- Scenario-Specific Information: For location-based recommendations, methods like OneLoc [85] and GNPR-SID [86] inject geographical information into embeddings.
- Purely Collaborative Signals: TokenRec [87] explores constructing embeddings using only collaborative signals, employing a GNN [88] to capture user-item interactions.
4.1.2.2. Quantization
Semantic embeddings are quantized into semantic ID sequences, typically a fixed-length tuple of codewords $(c_1, c_2, \ldots, c_L)$, where each codeword comes from a distinct codebook.
- Residual-Based Quantization (RQ-VAE): This is the most widely adopted method (e.g., TIGER [29], OneRec [32]). It constructs a coarse-to-fine representation by quantizing the residual between the latent embedding and the cluster centroid.
  - For each item, its semantic representation $\mathbf{z}$ is processed level by level. At each level $l$, the algorithm identifies the codebook entry closest to the current residual input $\mathbf{r}_l$ (with $\mathbf{r}_1 = \mathbf{z}$): $c_l = \arg\min_{k} \lVert \mathbf{r}_l - \mathbf{e}_k^{(l)} \rVert^2$. The residual $\mathbf{r}_{l+1} = \mathbf{r}_l - \mathbf{e}_{c_l}^{(l)}$ is then used as input for the next quantization level, and this process iterates until all levels are completed (a minimal sketch of this assignment step follows this list).
  - OneRec [32] and OneLoc [85] adopt ResKmeans [95] to prevent codebook collapse during RQ-VAE training by limiting the maximum number of items assigned to any codeword, boosting utilization and stability.
  - Limitation: Progressive residual quantization can lead to the hourglass effect [96], where codebook tokens in intermediate layers become excessively concentrated, introducing bias. Also, residual SID generation in LLMs exhibits prefix dependency, limiting decoding efficiency.
- Parallel Quantization: To address prefix dependency and improve efficiency, some works (e.g., RPG [97], RecGPT [60]) predict multiple IDs simultaneously.
  - RPG [97]: Uses Product Quantization (PQ) [81] for ultra-long SIDs to enable fine-grained semantic modeling and combines it with parallel decoding.
  - RecGPT [60]: Integrates finite scalar quantization (FSQ) [79] with a hybrid attention mechanism.
- Cross-Domain SID Construction: GMC [98] uses contrastive learning for representational consistency within domains, while RecBase [94] uses curriculum learning for cross-domain representation capability.
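The sketch below shows only the inference-time assignment step of residual quantization under the formulation above; codebook training, the VAE encoder/decoder, and collision handling are omitted, and the codebook sizes are illustrative:

```python
import numpy as np

def residual_quantize(embedding: np.ndarray, codebooks: list) -> list:
    """Greedy residual quantization: at each level pick the nearest codeword,
    subtract it, and pass the residual on (coarse-to-fine semantic IDs)."""
    sid, residual = [], embedding.copy()
    for codebook in codebooks:                        # each codebook: (K, d) array
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                   # nearest codeword index
        sid.append(idx)
        residual = residual - codebook[idx]           # residual for the next level
    return sid

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    d, levels, K = 16, 3, 256                         # hypothetical sizes
    codebooks = [rng.normal(size=(K, d)) for _ in range(levels)]
    item_embedding = rng.normal(size=(d,))
    print(residual_quantize(item_embedding, codebooks))  # e.g. a 3-level SID
```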
4.1.3. Challenges for Semantic ID
4.1.3.1. SID Collision
Multiple distinct items mapping to identical SID sequences introduce ambiguity during item grounding. High collision rates degrade model performance.
- Causes: Unevenly distributed or collapsed centroids in quantization methods (RQ-VAE, ResKmeans), leading to low codebook utilization.
- Mitigation:
  - Optimization Objectives: SaviorRec [100] uses the Sinkhorn algorithm [101] and an entropy-regularized loss for more uniform item assignment. OneRec [32] and LETTER [37] use constrained k-means to limit items per centroid.
  - Additional Token Positions: TIGER [29] adds a random token, while CAR [99] adds the item's sparse ID at the end of the SID for disambiguation.
  - OneSearch [63] combines ResKmeans (shared characteristics) with optimized product quantization (OPQ) (unique characteristics) for distinctiveness.
4.1.3.2. Objective Inconsistency
This arises from multi-stage training (embedding extraction, SID quantization, generative model training), where insufficient inter-stage interaction hinders optimization towards the ultimate recommendation objective.
- Solutions:
  - Self-supervised SID Training: LMIndexer [102] uses a generative language model to encode item text into SIDs, which then reconstruct the original text for supervision.
  - End-to-End Joint Optimization: ETEGRec [104] proposes an end-to-end framework for the tokenizer and the GR model, aligned with sequence-item and preference-semantic losses. MMQ [105] uses behavior-aware fine-tuning and a soft indexing mechanism for continuous gradient propagation.
4.1.3.3. Multi-modal Integration
Precisely modeling multimodal information (e.g., text, image, collaborative signals) during tokenization.
- Fusion during Embedding Extraction: QARM [106] and OneRec [15,32,33] fine-tune pre-trained multimodal models with user-item behavior. UNGER [84] performs contrastive alignment between multimodal and collaborative embeddings. OneLoc [85] incorporates geographic information.
- Fusion during Quantization:
  - Independent Quantization: Quantizing each modality separately [83,107].
  - Inter-modal Alignment: MME-SID [108] and LETTER [37] use contrastive learning to strengthen alignment.
  - MoE Architecture: MMQ [105] designs a multimodal shared-specific tokenizer with a Mixture-of-Experts (MoE) for weighted aggregation of outputs from modality-specific and shared codebooks.
  - Behavior-aligned Quantization: BBQRec [109] extracts behavior-relevant information from multimodal data.
  - Positional Assignment: TALKPLAY [110] encodes each modality as a separate position via K-means clustering. EAGER-LLM [111] allocates specific codebook layers for multimodal and collaborative signals.
4.1.3.4. Interpretability and Reasoning
SID-based GR models generally lack interpretability and LLM-like reasoning capabilities.
- Solutions:
  - Integrating SID into LLMs: PLUM [112] uses pre-training tasks like SID-to-Title and SID-to-Topic to equip LLMs with semantic correspondence between SIDs and natural language.
  - Unified Frameworks: OneRec-Think [34] bridges the semantic gap between discrete recommendation items and continuous reasoning spaces, designing a retrieval-based reasoning paradigm with multi-step deliberation.

The following table, Table 1 from the original paper, provides a comparison of different tokenizer types. The following are the results from Table 1 of the original paper:
| | Universality | Semantics | Vocabulary | Item Grounding |
| --- | --- | --- | --- | --- |
| Sparse ID | × | × | Large | ✓ |
| Text | ✓ | ✓ | Moderate | × |
| Semantic ID | × | ✓ | Moderate | ✓ |
4.2. Model Architecture
Generative architectures aim to overcome the limitations of traditional discriminative models by unifying the architecture, enhancing computational regularity, and improving MFU. These architectures enable the scaling of model parameters for performance gains. The paper categorizes architectures into encoder-decoder, decoder-only, and diffusion-based structures.
4.2.1. Encoder-Decoder Architecture
This structure balances user preference understanding (encoder) and next-item generation (decoder).
- Direct Transfer from Pre-trained LLMs: Early attempts directly adapted pre-trained encoder-decoder language models.
  - P5 [28]: Built on T5 [116], unifies five recommendation tasks using specific prompts.
  - M6-Rec [69]: Based on the M6 model [117], serializes user-item interactions into text for generating textual descriptions of recommended items.
  - RecSysLLM [74]: Uses a structured prompt format with GLM [118] and a multi-task masked token prediction mechanism.
  - Challenge: General-purpose LLMs lack collaborative signals, their language modeling objective is misaligned with recommendation goals, and inference overhead is high.
- Dedicated Encoder-Decoder Architectures for Recommendation:
  - TIGER [29]: Pioneered generative retrieval using a standard transformer encoder-decoder (T5) structure over pre-trained semantic IDs, framing recommendation as semantic ID sequence generation.
  - OneRec [32]: Extends TIGER by constructing an end-to-end generative architecture, eliminating the traditional multi-stage pipeline.
    - Encoder: Captures distinct scales of user interaction patterns (static, short-term, positive-feedback, lifelong) through a unified transformer-based network.
    - Decoder: Processes user sequences through transformer layers. Each decoder layer incorporates a Mixture of Experts (MoE) feed-forward network with a top-k routing strategy, and sequentially applies causal self-attention, cross-attention over the encoder output, and the MoE layer, each preceded by RMSNorm and wrapped in a residual connection:
      $\mathbf{h}^{(i)} = \mathbf{x}^{(i)} + \mathrm{CausalSelfAttn}\big(\mathrm{RMSNorm}(\mathbf{x}^{(i)})\big)$
      $\mathbf{g}^{(i)} = \mathbf{h}^{(i)} + \mathrm{CrossAttn}\big(\mathrm{RMSNorm}(\mathbf{h}^{(i)}),\, \mathbf{E}\big)$
      $\mathbf{x}^{(i+1)} = \mathbf{g}^{(i)} + \mathrm{MoE}\big(\mathrm{RMSNorm}(\mathbf{g}^{(i)})\big)$
      where $\mathbf{x}^{(i)}$ is the input to the $i$-th decoder layer, $\mathbf{x}^{(i+1)}$ is its output, and $\mathbf{E}$ is the encoded user information. Causal self-attention ensures that predictions for a token depend only on preceding tokens; cross-attention allows the decoder to attend to the encoder output; the MoE layer, applied after RMSNorm (Root Mean Square Normalization, used to stabilize training), enhances model capacity while maintaining efficiency; and the residual additions keep training stable. (A simplified code sketch of such a decoder block appears at the end of this subsection.)
- Scenario-Specific Customization: Architectures are optimized for specific applications:
  - OneSug [119] (e-commerce query auto-completion): Dedicated encoder for historical interactions and a specialized decoder for query generation.
  - OneSearch [63] (search scenarios): Multi-view behavior sequence injection to capture short-term and long-term preferences; uses a unified encoder-decoder to generate semantic ID sequences.
  - OneLoc [85] (local life service): Incorporates geolocation-aware self-attention in the encoder and neighbor-aware attention in the decoder to guide POI (Point of Interest) generation.
  - EGA-V2 [107] (advertising scenarios): Encodes user interaction sequences and autoregressively generates sequences of next POI tokens and creative tokens via two dependent decoders in a multi-task schema.
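Below is a simplified PyTorch sketch of the decoder block just described, assuming pre-normalization with RMSNorm and replacing the MoE feed-forward network with a dense FFN for brevity; dimensions are illustrative and `nn.RMSNorm` requires a recent PyTorch:

```python
import torch
import torch.nn as nn

class GRDecoderLayer(nn.Module):
    """Causal self-attention over generated SID tokens, cross-attention to the
    encoded user sequence, then a feed-forward sublayer (standing in for MoE)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.RMSNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm3 = nn.RMSNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, encoder_out):
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal)[0]     # causal self-attention
        h = self.norm2(x)
        x = x + self.cross_attn(h, encoder_out, encoder_out)[0]  # attend to encoder output
        return x + self.ffn(self.norm3(x))                       # FFN at the MoE position

if __name__ == "__main__":
    layer = GRDecoderLayer()
    sid_tokens = torch.randn(2, 4, 256)      # partially generated SID sequence
    user_context = torch.randn(2, 32, 256)   # encoder output over user behaviors
    print(layer(sid_tokens, user_context).shape)  # torch.Size([2, 4, 256])
```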
4.2.2. Decoder-Only Architecture
These architectures are seen as more scalable and computationally effective, especially for long-sequence behavior contexts, due to the imbalance in computational resources in encoder-decoder models.
- Leveraging Pre-trained Decoder-Only LLMs:
  - LLM-Prompting with Text Generation: Approaches like GenRec [122], BIGRec [72], Rec-R1 [123], GPT4Rec [59], RecFound [124], and Llama4Rec [125] use decoder-only LLMs as backbones. They prompt LLMs to generate textual descriptions of target items, followed by an item grounding step to map text to concrete items.
  - Direct Semantic ID Modeling: Approaches like MME-SID [108], EAGER-LLM [111], RecGPT [61], SpaceTime-GR [126], TALKPLAY [110], and GNPR-SID [86] introduce semantic IDs for direct item modeling and generation. OneRec-Think [34] unifies dialogue understanding, chain-of-thought reasoning, and personalized recommendation.
  - Challenge: The generic linguistic priors of LLMs may not align with recommendation scenarios, and there is a semantic gap between generated text and the discrete item space.
- Dedicated Decoder-Only Architectures for Recommendation:
  - RecGPT [60] and FORGE [127]: Pure decoder-only transformer backbones, trained autoregressively on user interaction sequences with pre-trained semantic IDs for direct next-item generation.
  - SynerGen [128]: Unifies search and recommendation tasks using task-specific masking matrices to enforce causality, session isolation, and cross-task alignment.
  - COBRA [89]: Fuses semantic IDs with dense embeddings in the decoder for joint prediction of both the sparse semantic ID and the dense vector representation of the next item, using a coarse-to-fine inference strategy.
  - Efficiency Improvements:
    - RPG [129]: Proposes a multi-token prediction mechanism based on parallel SID encoding and a graph-based decoding strategy for efficiency.
    - CAR [99]: Groups semantic IDs into concept blocks and uses an autoregressive transformer decoder for parallel block-wise prediction.
    - OneRec-V2 [33]: Introduces a Lazy Decoder structure with a context processor for multimodal user behavior signals. It omits standard key/value projection operations in attention, shares key-value pairs across decoder blocks, and employs grouped query attention to reduce computational overhead.
  - Structured Behavioral Sequence Modeling:
    - HSTU [35]: Frames recommendation as structured sequence prediction over behavioral time series, emphasizing intent modeling. The architecture uses a stack of layers with residual connections, and its core calculation is
      $\mathbf{x}^{(i+1)} = f_2\big(\mathrm{Norm}\big(\mathrm{SiLU}\big(\mathbf{q}^{(i)\top}\mathbf{k}^{(i)} + \mathbf{rab}^{p,t}\big)\,\mathbf{v}^{(i)}\big) \odot \mathbf{u}^{(i)}\big),$
      where $\mathbf{x}^{(i)}$ is the input to the $i$-th layer; $f_1$ and $f_2$ are feed-forward networks (e.g., MLPs); SiLU (Sigmoid Linear Unit) is an activation function; $f_1$ divides its output into the four components $\mathbf{u}^{(i)}$, $\mathbf{v}^{(i)}$, $\mathbf{q}^{(i)}$ (query), and $\mathbf{k}^{(i)}$ (key); $\mathrm{Norm}$ is a normalization function; $\mathbf{q}^{(i)\top}\mathbf{k}^{(i)}$ produces the attention scores; $\mathbf{rab}^{p,t}$ denotes relative attention biases for position $p$ and time $t$, capturing positional and temporal relationships; and $\odot$ denotes element-wise multiplication. The equation shows that HSTU replaces the conventional softmax in attention with pointwise aggregated attention, $\mathrm{SiLU}(\mathbf{q}^{(i)\top}\mathbf{k}^{(i)} + \mathbf{rab}^{p,t})\,\mathbf{v}^{(i)} \odot \mathbf{u}^{(i)}$, to preserve preference intensity signals, which is more suitable for dynamic item vocabularies. (A minimal sketch of this pointwise attention follows this list.)
    - MTGR [65]: Designs dedicated attention patterns for different feature types (full attention for static features, dynamic autoregressive masking for real-time behaviors, and a diagonal mask for candidate items).
    - IntSR [130]: Introduces session-level masking and a Query-Driven Block to unify query-agnostic recommendation and query-based search.
    - LiGR [131]: For re-ranking, proposes an in-session set-wise attention mechanism where items in a recommendation list are presented simultaneously, ignoring causal ordering.
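The sketch below shows only the pointwise aggregation core of the HSTU-style attention above; the layer MLPs $f_1$/$f_2$, normalization, gating projections, and causal masking are omitted, and all tensors are randomly initialized placeholders:

```python
import torch
import torch.nn.functional as F

def pointwise_silu_attention(q, k, v, u, rel_bias):
    """SiLU(q k^T + relative bias) replaces softmax, so attention weights keep
    absolute intensity instead of being normalized away."""
    scores = F.silu(torch.matmul(q, k.transpose(-2, -1)) + rel_bias)  # (seq, seq)
    aggregated = torch.matmul(scores, v)                              # (seq, d)
    return aggregated * u                                             # elementwise gate u

if __name__ == "__main__":
    seq, d = 6, 16
    q, k, v, u = (torch.randn(seq, d) for _ in range(4))
    rel_bias = torch.zeros(seq, seq)   # positional/temporal biases rab^{p,t} (zeros here)
    print(pointwise_silu_attention(q, k, v, u, rel_bias).shape)  # torch.Size([6, 16])
```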
4.2.3. Diffusion-Based Architecture
These models generate recommendations through parallel iterative denoising of the full target sequence, enabling bidirectional attention and flexible control over generation steps.
- Mechanism: Starts from random noise and gradually denoises it to generate data. This contrasts with autoregressive models that generate tokens sequentially.
- Examples:
  - Diff4Rec [115]: Uses a VAE to map discrete interactions into latent vectors, where diffusion with curriculum scheduling generates semantically consistent augmentations.
  - CaDiRec [132]: Enhances Diff4Rec with context-aware weighting and a transformer-based UNet for temporal dependencies.
  - RecDiff [133]: Combines a GCN with diffusion in latent space to refine user representations.
  - DDRM [134]: Directly denoises continuous embeddings via MLPs for ranking.
- Multimodal Applications: DiffCL [135] and DimeRec [136] leverage diffusion as a feature augmenter for multimodal scenarios.
- Discrete Diffusion: DiffGRM [137] pioneers discrete diffusion for Semantic ID generation.
4.3. Optimization Strategy
Optimization strategies have evolved from supervised learning with next-token prediction to incorporating Reinforcement Learning (RL)-based preference alignment for multi-dimensional objectives.
4.3.1. Supervised Learning
The primary objective in supervised learning for generative recommendation is typically next-token prediction (NTP).
4.3.1.1. NTP Modeling
Given historical behavior tokens, the model learns user preferences and predicts the next item autoregressively.
- Basic NTP: TIGER [29] adopts this, training an encoder-decoder in a sequence-to-sequence manner. Many other GR studies (e.g., OneRec [15], RecGPT [60]) also use NTP. The objective can be defined as maximizing the conditional probability
  $\max_{\theta} \; \sum_{s_t \in \mathcal{S}_u} \log P_{\theta}\big(s_t \mid s_{<t}, u, c\big),$
  where $\mathcal{S}_u$ is the sequence of items interacted with by user $u$, $c$ is the context, $s_t$ is the item at timestep $t$, $s_{<t} = (s_1, \ldots, s_{t-1})$ is the historical interaction prefix sequence, and $P_{\theta}$ is the probability distribution learned by the model parameterized by $\theta$. The model learns to predict $s_t$ given $s_{<t}$, $u$, and $c$. (A short code sketch of this loss appears at the end of this subsection.)
- Modified NTP Objectives:
  - LETTER [37]: Modifies the NTP loss into a ranking-guided generation loss by altering the temperature to penalize hard-negative samples.
  - COBRA [89]: Uses a composite loss function combining losses for sparse ID prediction and dense vector prediction.
  - REG4Rec [139]: Adds an auxiliary category-prediction task.
  - UNGER [84]: Introduces an intra-modality knowledge distillation task for transferring item knowledge.
- LLM Adaptation: For models inheriting parameters from LLMs, additional training tasks adapt them for item generation. User behaviors are serialized into textual prompts and LLMs are fine-tuned (full-parameter or parameter-efficient tuning like LoRA [76]).
  - LC-Rec [82]: Designs tuning tasks to align language and collaborative semantics.
  - RecFound [124]: Uses a broader set of recommendation-specific tasks and a step-wise convergence-oriented sample strategy.
  - EAGER-LLM [111]: Adopts an annealing adapter tuning schedule to prevent catastrophic forgetting.
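A minimal sketch of the NTP objective above as a shifted cross-entropy over an SID vocabulary (teacher forcing); vocabulary and sequence sizes are hypothetical:

```python
import torch
import torch.nn.functional as F

def next_token_prediction_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """At every position, maximize the log-probability of the next token given the prefix."""
    shifted_logits = logits[:, :-1, :]      # predictions made at positions 0..T-2
    shifted_targets = tokens[:, 1:]         # the tokens those positions should produce
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
    )

if __name__ == "__main__":
    vocab, batch, seq = 1024, 2, 6
    logits = torch.randn(batch, seq, vocab)          # model outputs over the SID vocabulary
    tokens = torch.randint(0, vocab, (batch, seq))   # observed behavior/SID token sequence
    print(next_token_prediction_loss(logits, tokens).item())
```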
4.3.1.2. NCE Modeling
When sparse IDs are used, NTP loss becomes challenging due to extremely large vocabulary size. NCE-style optimization (Noise Contrastive Estimation) [35] approximates the softmax over the full vocabulary, avoiding vanishing gradients and enabling efficient training.
- Sampled Softmax: HSTU [35] and GenRank [62] approximate the full softmax by treating the true next token as positive and sampling negatives from the catalog (a minimal sketch appears after this list).
- Multi-token Objective: PinRec [66] uses a multi-token objective to predict beyond the next token within a timestep window.
- InfoNCE Optimization: IntSR [130] uses InfoNCE with a hard negative sampling strategy.
- Session-Level Training: SessionRec [140] trains at the session level for next-session prediction and enhances hard negative sample distinction via a ranking task.
- Discriminative Loss: MTGR [65] adopts optimization similar to traditional discriminative models with a discriminative loss.
4.3.2. Preference Alignment
To align with implicit multi-dimensional preferences and platform-level objectives, Reinforcement Learning (RL)-based preference alignment [141] is introduced. This optimizes for cumulative reward over sequential user interactions, pursuing long-term objectives and non-differentiable metrics. Inspired by LLM alignment techniques (e.g., DPO [144], GRPO [145]), two mainstream paradigms emerge.
4.3.2.1. DPO Modeling
DPO (Direct Preference Optimization) directly optimizes models using pairwise preference data, encouraging chosen samples over rejected ones, bypassing explicit reward modeling.
- Preference Pair Construction: DPO effectiveness relies on high-quality preference pairs. Posterior user behaviors (clicks, purchases) serve as positive examples, while negative generation strategies vary (a minimal sketch of the standard DPO objective follows below).
  - S-DPO [38]: Extends standard DPO by pairing one positive with multiple negatives for more discriminative learning.
  - RosePO [146]: Refines this with selective rejection sampling for helpfulness and harmlessness.
  - SPRec [147]: Introduces a self-evolving mechanism to dynamically select hard negatives from previous predictions.
- Heuristic + Predictive Models: OneLoc [85], OneSearch [63], and OneSug [119] combine heuristic rules with predictive models and feedback-aware weighting.
  - OneLoc [85]: Uses beam search for candidate generation, then ranks with GMV (Gross Merchandise Volume) prediction and rule-based geographic proximity scores.
  - OneSug [119]: Constructs multi-level preference pairs across nine behavior types with calibrated weights reflecting user intent strength.
  - OneSearch [63]: Introduces a three-tower reward model for user-item relevance scores and feedback-based weighting (from CTR, CVR) for graded alignment.
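The sketch below shows the standard pairwise DPO objective on sequence-level log-likelihoods; the multi-negative extensions of S-DPO and the weighting schemes above are not shown, and the numbers are placeholders:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Push the policy to prefer the chosen item sequence over the rejected one,
    relative to a frozen reference model, without an explicit reward model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

if __name__ == "__main__":
    # Log-likelihoods of (chosen, rejected) items under policy and reference models.
    pc, pr = torch.tensor([-4.2, -3.1]), torch.tensor([-5.0, -2.9])
    rc, rr = torch.tensor([-4.5, -3.3]), torch.tensor([-4.8, -3.0])
    print(dpo_loss(pc, pr, rc, rr).item())
```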
4.3.2.2. GRPO Modeling
GRPO (Group Relative Policy Optimization) extends preference optimization to a groupwise setting, assigning explicit reward signals to candidates and updating policy to shift probability mass toward higher-reward candidates. It requires a reliable and accurate reward function.
- Hybrid Reward System: Integrates diverse feedback sources (rule-based and model-based).
  - Rule-Based Rewards: Ensure format compliance and alignment with observed user interactions.
    - VRAgent-R1 [148]: Positive reward for conforming output.
    - STREAM-Rec [91]: Graded behavioral rewards (high for matched items, negative for low matching quality).
  - Posterior Ranking Metrics:
    - Rec-R1 [123]: Uses metrics like NDCG (Normalized Discounted Cumulative Gain) for candidate group quality.
    - RecLLM-R1 [138]: Adopts the Longest Common Subsequence algorithm for rewarding correctly ranked items.
  - Combined Rule-Based and Predictive Models:
    - OneRec [15]: Integrates a point-wise P-Score model (predicting user preference) with rule-based components (format compliance, ecosystem relevance), and uses Early Clipped Policy Optimization (ECPO) to stabilize optimization.
- Reasoning-Guided Generation:
  - OneRec-Think [34]: Introduces chain-of-thought reasoning with item-text alignment and reasoning scaffolding, optimized via multi-path rewards.
  - REG4Rec [139]: Expands the reasoning space by generating multiple semantic tokens per item, pruning inconsistent paths via self-reflection.
  - RecZero [64]: Adopts a "Think-before-Recommendation" RL paradigm, leveraging structured reasoning templates and rule-based rewards.

The following figure, Figure 5 from the original paper, illustrates the optimization strategies of several representative generative recommendation models:
VLM Description: The image is a schematic diagram that illustrates the architecture and categories of different recommendation systems, including information on preference alignment (GPRO) and auxiliary loss (Auxiliary Loss). The names and dates of various recommendation methods are clearly labeled, facilitating the understanding of the development and classification of different methods.
-
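As a small illustration of the groupwise setting behind GRPO-style alignment, the sketch below normalizes each candidate's reward against the rewards of the other candidates sampled for the same user; the clipped policy-gradient update itself (e.g., ECPO) is omitted, and the reward values are placeholders:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """For a group of candidate recommendation lists sampled for the same user,
    normalize rewards by the group mean and std so above-average candidates gain
    probability mass and below-average ones lose it."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

if __name__ == "__main__":
    # Rewards for 4 generated candidates per user (e.g., mixing rule-based and model-based signals).
    rewards = torch.tensor([[0.1, 0.7, 0.4, 0.2],
                            [0.9, 0.3, 0.5, 0.8]])
    print(group_relative_advantages(rewards))
```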
5. Experimental Setup
The paper is a survey and thus does not present new experimental results. However, it discusses the application of generative recommendation models in various settings, which implies certain experimental methodologies for the individual works it cites. Based on the common practices in the field and the descriptions of the applications, we can infer the typical experimental setup.
5.1. Datasets
The specific datasets used are not detailed in the survey itself but are referenced through the cited papers. Generative recommendation models are typically evaluated on large-scale datasets that capture user-item interactions and item attributes across diverse domains.
- E-commerce Datasets: Often include product information (title, description, images, brand, category) and user interaction logs (clicks, purchases, views, search queries). Examples mentioned or implied: M6-Rec [69] is investigated in e-commerce scenarios, OneSug [119] is designed for e-commerce query auto-completion, and OneSearch [63] explores e-commerce search.
- Streaming Media/Music Datasets: Interaction logs on content consumption (e.g., watch history, listening history, ratings).
- Social Network Datasets: User activity, connections, content sharing, etc.
- Location-Based Services (POI) Datasets: User check-ins, POI attributes (category, location, description), and temporal information. OneLoc [85] and GNPR-SID [86] are mentioned in this context.
- Advertising Datasets: User impressions, clicks, conversions, and ad attributes. EGA-V2 [107] is mentioned for advertising scenarios.

These datasets are typically chosen because they:
- Reflect Real-World Scenarios: Provide realistic user behavior and item diversity.
- Are Large-Scale: Necessary for training deep generative models.
- Enable Validation of Specific Challenges: Datasets with cold-start items, long-tail distributions, or cross-domain interactions are crucial for evaluating solutions to these problems.
5.2. Evaluation Metrics
The survey mentions various evaluation metrics used in the context of generative recommendation, particularly in the Optimization Strategy section and Results & Analysis section (implicitly, as it describes how models improve performance).
- Recommendation Quality Metrics:
  - CTR (Click-Through Rate):
    - Conceptual Definition: Measures the proportion of times an item was clicked after being shown to a user. It quantifies how appealing a recommendation is.
    - Mathematical Formula: $\mathrm{CTR} = \frac{\text{Number of Clicks}}{\text{Number of Impressions}}$
    - Symbol Explanation: Number of Clicks is the total count of times users clicked on a recommended item; Number of Impressions is the total count of times a recommended item was shown to users.
  - CVR (Conversion Rate):
    - Conceptual Definition: Measures the proportion of times a user performs a desired action (e.g., purchase, sign-up) after clicking on a recommended item or seeing a recommendation. It quantifies the effectiveness of recommendations in driving business goals.
    - Mathematical Formula: $\mathrm{CVR} = \frac{\text{Number of Conversions}}{\text{Number of Clicks}}$
    - Symbol Explanation: Number of Conversions is the total count of times users completed the desired action; Number of Clicks is the total count of times users clicked on a recommended item.
  - NDCG (Normalized Discounted Cumulative Gain):
    - Conceptual Definition: A measure of ranking quality that takes into account the position of relevant items in a ranked list. It gives higher scores to relevant items that appear higher in the list and considers the graded relevance of items.
    - Mathematical Formula: $\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}$
    - Symbol Explanation: $rel_i$ is the relevance score of the item at position $i$; $k$ is the number of items in the ranked list being considered; $\mathrm{DCG}@k$ is the Discounted Cumulative Gain at position $k$; $\mathrm{IDCG}@k$ is the Ideal Discounted Cumulative Gain, i.e., the DCG of the ideal ranking (items sorted by decreasing relevance).
  - GMV (Gross Merchandise Volume):
    - Conceptual Definition: The total value of sales over a given period. In recommendation, it is often a business-oriented metric used to evaluate the commercial impact of recommendations.
    - Mathematical Formula: No standard formula; typically the sum of prices of purchased items, $\mathrm{GMV} = \sum_i p_i$.
    - Symbol Explanation: $p_i$ is the price of purchased item $i$.
- Efficiency Metrics:
  - MFU (Model FLOPS Utilization):
    - Conceptual Definition: Measures how close the actual floating-point operations performed by the model are to the theoretical maximum FLOPS of the hardware. Higher MFU indicates more efficient hardware usage.
    - Mathematical Formula: Not a universally standard formula, but generally calculated as $\mathrm{MFU} = \frac{\text{Actual FLOPS}}{\text{Peak FLOPS}}$.
    - Symbol Explanation: Actual FLOPS is the number of floating-point operations performed by the model per second; Peak FLOPS is the theoretical maximum floating-point operations per second that the hardware can achieve. (A short computational example of NDCG and MFU appears after this list.)
  - Latency: The time taken for a model to generate a recommendation. Crucial for real-time systems.
  - Throughput: The number of recommendations a system can process per unit of time.
- Tokenizer-Specific Metrics:
  - Collision Rate: Measures how often multiple distinct items are mapped to the same semantic ID sequence. Lower is better.
  - Information Entropy: Measures the randomness or uncertainty in the distribution of SIDs. Higher entropy suggests better utilization of the codebook.
  - Codebook Utilization: Measures the proportion of codewords in a codebook that are actually used. Higher utilization is generally better.
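A small worked example of the NDCG and MFU definitions above; it uses the $rel_i/\log_2(i+1)$ gain form (the $2^{rel_i}-1$ variant is also common), and the FLOPS numbers are hypothetical:

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k: position-discounted gain of a ranked list, normalized by the ideal ranking."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mfu(achieved_flops_per_s, peak_flops_per_s):
    """Model FLOPS Utilization: achieved throughput as a fraction of the hardware peak."""
    return achieved_flops_per_s / peak_flops_per_s

if __name__ == "__main__":
    print(round(ndcg_at_k([3, 2, 0, 1], k=4), 4))   # graded relevance of a ranked list
    print(f"{mfu(4.5e14, 9.89e14):.1%}")             # e.g. ~45% MFU on a hypothetical accelerator
```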
5.3. Baselines
The baselines mentioned in the survey are primarily traditional discriminative recommendation models and early LLM-enhanced discriminative approaches.
- Traditional Discriminative Models: Embedding & MLP paradigms, including various feature interaction networks (DCN [8]) and behavior modeling networks (DIN [52]), as well as multi-stage cascaded systems (e.g., Covington et al. [22] for YouTube recommendations).
- LLM-enhanced Discriminative Models: Approaches that use LLMs for semantic, data, or alignment enhancement but still operate within the discriminative framework (e.g., P5 [28], M6-Rec [69], RecSysLLM [74]). While these use LLMs, they are presented as a step before truly generative recommendation.
- Generative Retrieval Models: For retrieval tasks, generative models are compared against traditional similarity-based retrieval systems (e.g., DSSM [149] or Approximate Nearest Neighbor (ANN) methods).

These baselines are representative as they reflect the state-of-the-art in traditional recommendation paradigms, providing a clear benchmark for the performance improvements offered by the new generative approaches across tokenization, architecture, and optimization.
6. Results & Analysis
As a survey paper, this document does not present novel experimental results in the form of tables or figures but rather synthesizes the findings and applications of existing generative recommendation models. The "results" discussed here are the reported successes and advantages of generative models over traditional discriminative approaches, as well as their practical deployment scenarios and remaining challenges.
6.1. Core Results Analysis
The paper highlights that generative recommendation models demonstrate significant advantages over traditional discriminative pipelines across multiple dimensions, leading to notable improvements in performance and efficiency.
- Mitigation of Cascaded Error Propagation: Discriminative models suffer from information loss and cumulative errors across retrieval, ranking, and re-ranking stages. Generative models, by directly generating item identifiers in an end-to-end fashion, effectively eliminate these cascaded errors, leading to higher overall recommendation quality.
- Improved Hardware Utilization (MFU): Traditional discriminative models with their heterogeneous, small-scale operators exhibit very low MFU (typically < 5%). Generative architectures, particularly decoder-only models, leverage unified transformer-based backbones. This standardization results in more regular computational patterns and significantly higher MFU, approaching LLM levels (> 40% during training), thus maximizing hardware computational efficiency.
- Enhanced Scalability and Emergent Capabilities: The unified nature of generative architectures allows for scaling up model parameters. This scaling, especially in decoder-only models, often leads to emergent capabilities and predictable performance gains, which is difficult to achieve with smaller, fragmented discriminative models. The paper notes a rapid growth in generative model scale, from millions to billions of parameters, with decoder-only architectures becoming dominant for models exceeding 1 billion parameters.
- Semantic Richness and Cold-Start Alleviation: Semantic ID (SID)-based tokenization, by encoding item attributes into semantically meaningful identifiers, allows models to better handle cold-start and long-tail items and enables cross-domain generalization. This is a significant improvement over sparse ID approaches that lack intrinsic semantic meaning.
- Multi-dimensional Preference Optimization: The shift from supervised next-token prediction to Reinforcement Learning (RL)-based preference alignment allows generative models to optimize not just for immediate user clicks but also for long-term user preferences, platform-level objectives (e.g., GMV, retention, diversity, fairness), and complex, non-differentiable metrics. This creates a more holistic and balanced optimization framework.
- End-to-End Optimization: Generative models enable true end-to-end optimization, where the entire recommendation process is integrated into a single framework. This simplifies the system architecture, reduces engineering overhead, and allows for global optimization, leading to substantial performance improvements (e.g., OneRec [32], OneSug [119], and ETEGRec [104] reported significant gains over cascaded approaches).
6.2. Data Presentation (Tables)
As a survey, the paper typically summarizes findings and trends rather than presenting new experimental results in tables. However, it does include a conceptual comparison table for tokenizers.
The following are the results from Table 1 of the original paper:
| | Universality | Semantics | Vocabulary | Item Grounding |
| --- | --- | --- | --- | --- |
| Sparse ID | × | × | Large | ✓ |
| Text | ✓ | ✓ | Moderate | × |
| Semantic ID | × | ✓ | Moderate | ✓ |
- Analysis of Table 1:
  - Sparse ID: Lacks Universality (cannot generalize well across domains or tasks) and Semantics (IDs are arbitrary numbers). It has a Large Vocabulary (millions of items) but excels at Item Grounding (each ID uniquely maps to an item).
  - Text: Offers high Universality (can leverage LLM capabilities across various text-based tasks) and Semantics (rich textual descriptions). It has a Moderate Vocabulary (the subword units of LLMs). However, it struggles with Item Grounding because generated text might not uniquely map back to a specific item.
  - Semantic ID: Lacks Universality compared to general LLMs (often specialized for items) but possesses strong Semantics (encoded from item features). It has a Moderate Vocabulary (controllable size, typically tens of thousands) and achieves reliable Item Grounding (designed to be unique and compact).
  - Conclusion: Semantic ID is highlighted as the most promising paradigm due to its balance of semantics, vocabulary size, and item grounding capabilities, addressing key limitations of the other two approaches.
6.3. Ablation Studies / Parameter Analysis
The survey does not include its own ablation studies or parameter analyses, as it is a review of existing work. However, the discussion of the evolution of models and challenges implicitly touches upon the outcomes of such studies performed by the individual papers being surveyed. For instance:
- The comparison of RQ-VAE and ResKmeans for SID construction, and their impact on SID collision and codebook utilization, suggests that research has performed ablation studies on different quantization techniques. (A minimal quantization sketch follows this list.)
- The development of the Lazy Decoder in OneRec-V2 [33] and its impact on MFU and efficiency implies that analyses were conducted to show the benefits of such architectural choices.
- The various preference alignment techniques and their impact on different reward signals (e.g., GMV, NDCG, diversity) indicate that the effectiveness of these components has been individually evaluated by the respective authors.
- The trend toward task-aware refinements in generative architectures (e.g., task-specific attention masking [128], MoE-based feature routing [15], geolocation-aware attention [85]) suggests that these specific components have been found to contribute positively to performance in their respective contexts, likely through ablation studies.
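To make the quantization discussion concrete, below is a minimal residual-quantization sketch in the spirit of (but much simpler than) RQ-VAE: each level assigns the nearest codebook entry to the remaining residual, and the resulting index tuples can be checked for SID collisions and codebook utilization. The codebook sizes, embedding dimensions, and random data are toy assumptions.

```python
import numpy as np

def residual_quantize(embs: np.ndarray, codebooks: list[np.ndarray]) -> np.ndarray:
    """Assign each item embedding a tuple of codebook indices, one per level,
    by repeatedly quantizing the residual left over from the previous level."""
    residual = embs.copy()
    codes = []
    for cb in codebooks:                                   # cb: (K, d)
        dists = ((residual[:, None, :] - cb[None]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)                         # nearest code per item
        codes.append(idx)
        residual = residual - cb[idx]                      # pass residual onward
    return np.stack(codes, axis=1)                         # (n_items, n_levels)

rng = np.random.default_rng(0)
item_embs = rng.normal(size=(1000, 16))                    # toy item embeddings
codebooks = [rng.normal(size=(32, 16)) for _ in range(3)]  # 3 levels x 32 codes

sids = residual_quantize(item_embs, codebooks)
unique_sids = {tuple(s) for s in sids}
print("collision rate:", 1 - len(unique_sids) / len(sids))
print("level-0 codebook utilization:", len(set(sids[:, 0])) / 32)
```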
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey concludes that generative recommendation represents a transformative paradigm shift from traditional discriminative models towards a unified generative framework. This shift enables end-to-end optimization and flexible adaptation across diverse recommendation scenarios. The paper comprehensively reviews this field through a tri-decoupled perspective:
- Tokenization: Evolving from sparse ID- and text-based approaches to more efficient and semantically rich Semantic ID (SID)-based methods. SIDs balance vocabulary compactness, semantic expressiveness, and reliable item grounding, addressing limitations like cold-start and computational inefficiency.
- Architecture: Progressing from encoder-decoder models to increasingly dominant decoder-only and diffusion-based structures. These architectures unify the system, enhance Model FLOPS Utilization (MFU), provide inherent scalability, and enable the seamless adoption of advancements from Large Language Model (LLM) research.
- Optimization: Transitioning from supervised next-token prediction for local decision boundaries to Reinforcement Learning (RL)-based preference alignment. This allows for multi-dimensional optimization of user preferences and platform-level objectives, fostering end-to-end training and mitigating cumulative error. (A minimal sketch of the supervised objective follows this summary.)

The survey highlights that generative models have achieved significant commercial success and offer a robust blueprint for developing next-generation recommender systems, while also acknowledging remaining challenges and outlining promising future research directions.
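To ground the optimization transition described above, here is a minimal sketch of the supervised next-token-prediction objective over semantic-ID tokens that RL-based alignment builds on; the vocabulary size, sequence length, and random inputs are illustrative assumptions.

```python
import numpy as np

def next_token_nll(logits: np.ndarray, target_tokens: np.ndarray) -> float:
    """Average negative log-likelihood of the ground-truth next SID token at each
    step: the supervised objective that RL-based preference alignment builds on."""
    # logits: (seq_len, vocab_size); target_tokens: (seq_len,)
    logits = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return float(-log_probs[np.arange(len(target_tokens)), target_tokens].mean())

rng = np.random.default_rng(0)
vocab, seq_len = 256, 4                       # e.g. 4 SID tokens per generated item
print(next_token_nll(rng.normal(size=(seq_len, vocab)), rng.integers(0, vocab, seq_len)))
```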
7.2. Limitations & Future Work
The authors identify several key challenges that also serve as promising future research directions:
- End-to-End Modeling:
  - Model Scaling: While end-to-end models mitigate error accumulation and improve MFU, currently deployed models are limited to ~1 billion parameters due to latency constraints. Future work should explore scaling models to LLM-level capacity (tens of billions of parameters) while maintaining acceptable inference latency.
  - Unified Reward Design: Current RL-based preference alignment often uses rule-based or single-aspect reward signals. A future direction is to develop a unified Reward Agent, potentially powered by LLMs, to automatically comprehend and balance multi-dimensional preferences (user-level, platform-level, diversity, fairness, safety) more robustly and with less manual intervention. (A sketch of today's hand-weighted aggregation follows this list.)
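For reference, here is a minimal sketch of the kind of rule-based, hand-weighted reward aggregation that a unified Reward Agent would replace; the signal names and weights are illustrative assumptions, not values from any surveyed system.

```python
# A hand-tuned, rule-based scalarization of multi-dimensional feedback.
# A learned "Reward Agent" would instead decide how to weigh and trade off these signals.
REWARD_WEIGHTS = {          # illustrative weights, set by hand today
    "click": 1.0,
    "watch_time": 0.5,
    "gmv": 2.0,
    "diversity": 0.3,
    "retention_proxy": 1.5,
}

def rule_based_reward(signals: dict[str, float]) -> float:
    """Collapse per-recommendation feedback signals into one scalar reward for RL alignment."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in signals.items())

print(rule_based_reward({"click": 1, "watch_time": 0.8, "gmv": 0.0,
                         "diversity": 0.6, "retention_proxy": 0.2}))
```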
- Efficiency:
  - Algorithm-System Co-design: There is a lack of an integrated algorithm-system co-design framework tailored for streaming training and low-latency, high-throughput inference in recommendation scenarios. This is critical for industrial-scale systems processing hundreds of millions of samples daily.
  - Ultra-Long Behavior Modeling: The computational complexity of attention mechanisms makes ultra-long behavior modeling an efficiency bottleneck. Future research needs to explore more efficient sequence modeling paradigms, such as memory-augmented structures and RAG-augmented training. (One illustrative compression sketch follows this list.)
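As one illustrative (and assumed, not survey-prescribed) direction for the ultra-long behavior bottleneck, the sketch below keeps the most recent interactions verbatim and mean-pools older history into a fixed number of memory slots, so the sequence length seen by attention stays bounded.

```python
import numpy as np

def compress_behavior(seq_embs: np.ndarray, recent: int = 256, memory_slots: int = 32) -> np.ndarray:
    """Bound the sequence length seen by attention: keep the last `recent`
    interaction embeddings as-is and mean-pool the older history into
    `memory_slots` summary vectors (a crude memory-augmented stand-in)."""
    old, new = seq_embs[:-recent], seq_embs[-recent:]
    if len(old) == 0:
        return new
    chunks = np.array_split(old, min(memory_slots, len(old)))
    memory = np.stack([c.mean(axis=0) for c in chunks])
    return np.concatenate([memory, new])                   # length <= memory_slots + recent

history = np.random.randn(10_000, 64)                      # toy ultra-long behavior sequence
print(compress_behavior(history).shape)                    # (288, 64) instead of (10000, 64)
```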
- Reasoning:
  - Constructing Reasoning Chains: It's challenging to create large-scale chain-of-thought (CoT) data for personalized recommendation tasks, especially given unique user characteristics.
  - Adaptive Reasoning: Models need to learn adaptive reasoning strategies based on query difficulty to satisfy strict latency requirements and avoid "overthinking."
  - Self-Evolving Recommenders: Developing models that continually reflect on decisions, revise reasoning policies, and improve from online feedback, while mitigating risks like bias amplification and catastrophic forgetting.
- Data Optimization:
  - Training Data Bias: Generative models trained on historical positive interaction data inherit biases (exposure, position) that traditional debiasing methods may not fully address. Designing debiasing strategies specifically for generative recommendation is crucial. (A representative traditional debiasing sketch follows this list.)
  - High-Quality Data Construction: While abundant interaction data exists, constructing high-quality data like reasoning-oriented CoT data, multi-aspect preference data, and explicit intent-level annotations remains a bottleneck.
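For context on what "traditional debiasing methods" usually look like, here is a minimal inverse propensity scoring (IPS) sketch that reweights each logged positive by its estimated exposure probability; IPS is used here as a representative classical technique, not something the survey prescribes for generative models, and the numbers are toy assumptions.

```python
import numpy as np

def ips_weighted_loss(per_example_loss: np.ndarray, exposure_propensity: np.ndarray,
                      clip: float = 10.0) -> float:
    """Inverse propensity scoring: up-weight interactions with items that had a
    low probability of being exposed, so the loss reflects exposure bias less."""
    weights = np.minimum(1.0 / exposure_propensity, clip)  # clipping controls variance
    return float((weights * per_example_loss).sum() / weights.sum())

losses = np.array([0.2, 1.3, 0.7])           # toy per-example losses for 3 logged positives
propensities = np.array([0.9, 0.05, 0.3])    # estimated exposure probabilities
print(ips_weighted_loss(losses, propensities))
```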
- Interactive Agent:
  - Personalized Dialogue Recommendations: Current agent-based systems struggle to deliver personalized dialogue that aligns with user preferences and provides user-style rationales.
  - User-Centric Memory Mechanisms: Designing effective memory mechanisms tailored for conversational recommendation to enhance understanding of users and personalization.
- From Recommendation to Generation:
  - Strong Personalization & Sparse Feedback: The highly personalized content generated by multimodal models (e.g., video, audio) leads to extremely sparse feedback signals, complicating algorithm development.
  - Cost-Value Trade-off: Balancing the substantial resource demands and generation costs of generative models against their potential value to ensure ecosystem sustainability.
7.3. Personal Insights & Critique
This survey provides an excellent, structured overview of an emerging and highly impactful field. The "tri-decoupled perspective" (tokenization, architecture, optimization) offers a robust conceptual framework for understanding the core components of generative recommendation and how they differ from traditional approaches. The emphasis on MFU, scalability, and end-to-end optimization clearly articulates the practical advantages and the industrial motivation behind this paradigm shift.
Inspirations:
- The Power of Unification: The central theme of unifying previously cascaded, fragmented systems into a single generative framework is inspiring. This approach has proven revolutionary in NLP (with LLMs) and computer vision (with diffusion models), and its application to recommendation promises similar breakthroughs in efficiency, scalability, and performance.
- Addressing Fundamental Problems: The explicit focus on how generative recommendation inherently addresses long-standing problems like cold-start, semantic isolation, and multi-objective optimization is particularly insightful. Semantic IDs appear to be a crucial innovation bridging the gap between raw item information and efficient model processing.
- Future-Proofing: The ability of generative architectures to stay in sync with advancements in LLMs and hardware (e.g., KV-cache-friendly execution, MoE, GQA) suggests a more future-proof and sustainable research direction compared to highly specialized discriminative models.
- Reasoning and Interpretability: The discussion around incorporating reasoning capabilities and interpretability, particularly through SID integration with LLMs and chain-of-thought approaches, is a vital step towards more trustworthy and user-friendly recommender systems.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Computational Cost for Training/Inference: While the survey highlights improved MFU for generative models, the absolute computational cost (especially for training billion-parameter models) is still immense. Industrial deployment of such large generative models, particularly for the entire user base with real-time requirements, poses significant challenges. The balance between MFU and overall FLOPs for a given task needs careful consideration.
- Data Scarcity for Complex Objectives: The transition to RL-based preference alignment and multi-dimensional objectives is theoretically sound, but obtaining sufficient, high-quality, and unbiased reward signals (especially for long-term value, diversity, or fairness) remains extremely difficult in practice. Manual annotation or even LLM-driven reward modeling can be prone to bias or misrepresentation.
- SID Quality and Generalization: The effectiveness of SID-based tokenization heavily relies on the quality of the embedding extraction and quantization processes. Issues like SID collision, objective inconsistency, and handling rapidly evolving item catalogs could undermine their benefits. How well SIDs generalize across vastly different domains or modalities (e.g., from e-commerce products to news articles) without extensive re-training is an open question.
- Evaluation in End-to-End Systems: Evaluating end-to-end generative recommendation models in real-world settings is complex. Traditional metrics might not capture the full benefits of multi-objective optimization or emergent capabilities. Developing holistic, interpretable evaluation frameworks that align with both user satisfaction and business goals is critical.
- Risk of Generative AI: For the "From Recommendation to Generation" direction, generating entirely new content tailored to users brings ethical and practical challenges similar to other generative AI applications, such as content moderation, misinformation, safety, and brand alignment. The computational cost of generating highly personalized content (e.g., videos) for every user, on demand, is currently prohibitive for most platforms.

The paper successfully provides a comprehensive map of this evolving landscape, setting the stage for future research to tackle these complex and exciting challenges. Its framework can readily be applied to other domains shifting towards generative paradigms, offering a structured way to analyze tokenization, architecture, and optimization as fundamental building blocks.