Paper status: completed

A Survey of Generative Recommendation from a Tri-Decoupled Perspective: Tokenization, Architecture, and Optimization

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This survey explores three key aspects of generative recommendation systems: tokenization, architecture, and optimization, highlighting how generative methods mitigate error propagation, enhance hardware utilization, and extend beyond local user behavior, while tracing the evolution of item tokenization, model architectures, and optimization strategies across the field.

Abstract

Abstract not provided in the supplied PDF first-page text.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

A Survey of Generative Recommendation from a Tri-Decoupled Perspective: Tokenization, Architecture, and Optimization

1.2. Authors

Xiaopeng Li, Bo Chen, Junda She, Shiteng Cao, You Wang, Qinlin Jia, Haiying He, Zheli Zhou, Zhao Liu, Ji Liu, Zhiyang Zhang, Yu Zhou, Guoping Tang, Yiqing Yang, Chengcheng Guo, Si Dong, Kuo Cai, Pengyue Jia, Maolin Wang, Wanyu Wang, Shiyao Wang, Xinchen Luo, Qigen Hu, Qiang Luo, Xiao Lv, Chaoyi Ma, Ruiming Tang, Kun Gai, Guorui Zhou, and Xiangyu Zhao.

Their affiliations include City University of Hong Kong and Kuaishou Technology. Kun Gai is unaffiliated. Ruiming Tang, Guorui Zhou, and Xiangyu Zhao are the corresponding authors.

1.3. Journal/Conference

This paper is published on Preprints.org, a free multidisciplinary platform providing preprint services. Preprints posted on Preprints.org appear in Web of Science, Crossref, Google Scholar, Scilit, and Europe PMC. As a preprint platform, it allows for early dissemination of research outputs before formal peer review and publication in a traditional journal or conference.

1.4. Publication Year

2025 (Posted Date: 4 December 2025)

1.5. Abstract

The recommender systems community is experiencing a rapid shift from traditional multi-stage cascaded discriminative pipelines (retrieval, ranking, and re-ranking) towards unified generative frameworks that directly generate items. This emerging paradigm, driven by advancements in generative models and the demand for end-to-end architectures that improve Model FLOPS Utilization (MFU), offers several benefits: mitigating cascaded error propagation, improving hardware utilization, and optimizing beyond local user behaviors. This survey provides a comprehensive analysis of generative recommendation from a tri-decoupled perspective: tokenization, architecture, and optimization. It traces the evolution of tokenization from sparse ID- and text-based encodings to semantic identifiers; analyzes encoder-decoder, decoder-only, and diffusion-based architectures; and reviews the transition from supervised next-token prediction to reinforcement learning-based preference alignment. The authors also summarize practical deployments across cascade stages and application scenarios and examine key open challenges. The survey aims to serve as a foundational reference for researchers and an actionable blueprint for industrial practitioners building next-generation generative recommender systems.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inherent limitations of traditional discriminative recommender systems, which have been the predominant paradigm in academia and industry. These limitations include:

  • Semantic Isolation and Cold-Start Problem: Traditional systems treat items as atomic units within embedding tables, leading to semantic isolation. This exacerbates cold-start problems for new or less-interacted items and introduces computational inefficiencies, with embedding tables consuming over 90% of parameters.

  • Inefficient Architecture and Low MFU: Discriminative models often rely on specialized, small-scale operators, leading to considerable communication and data transfer overhead. This results in severely limited hardware utilization efficiency, with Model FLOPS Utilization (MFU) typically less than 5%, a stark contrast to Large Language Models (LLMs) achieving over 40% MFU during training.

  • Limited Scaling and Emergent Capabilities: Production discriminative systems usually employ modest-sized models (e.g., dense MLP < 0.1B parameters), which constrains their capacity for scaling up and prevents them from exhibiting emergent capabilities observed in LLMs.

  • Local and Constrained Optimization: Current discriminative training strategies primarily optimize local decision boundaries based on users' posterior behaviors, lacking explicit characterization of the full probability distribution over items and multi-dimensional preference modeling (e.g., platform-level objectives).

  • Cascaded Error Propagation: The multi-stage cascaded framework (retrieval, pre-ranking, ranking, re-ranking) inevitably introduces cumulative errors and information loss, degrading recommendation quality.

    The rapid advancements in LLMs have demonstrated exceptional semantic understanding and reasoning capabilities, prompting researchers to explore their application in recommender systems. However, many LLM-enhanced approaches still operate within the discriminative paradigm. The paper argues for a Generative Recommendation (GR) paradigm shift that directly generates item identifiers, eliminating the need for multi-stage cascaded processing and addressing the aforementioned limitations.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Comprehensive Tri-Dimensional Survey: It presents the first comprehensive survey analyzing generative recommender systems through a tri-dimensional decomposition: tokenization, architectural design, and optimization strategies. This framework organizes existing work and traces the evolution from discriminative to generative paradigms.

  2. Identification of Key Trends: Through systematic overview and analysis, it identifies key trends:

    • Tokenization: Towards efficient representation with semantic identifiers (SIDs) that balance vocabulary compactness and semantic expressiveness.
    • Architecture: Advances in model architecture that facilitate improved scalability and resource-efficient computation.
    • Optimization: Multi-dimensional preference alignment aimed at balancing the objectives of users, the platform, and additional stakeholders.
  3. In-depth Discussion of Applications and Challenges: It provides an in-depth discussion of GR applications across different stages and scenarios (e.g., cold start, cross-domain, search, auto-bidding), examines current challenges, and outlines promising future directions (e.g., end-to-end modeling, efficiency, reasoning, data optimization, interactive agents, and transition from recommendation to generation).

    The key conclusions are that generative recommendation represents a fundamental paradigm shift with significant potential. It offers advantages in:

  • Tokenization: Revolutionizing item representation from sparse IDs to text-based or semantic IDs, enabling rich semantic modeling, addressing cold-start issues, and achieving parameter efficiency.
  • Architecture: Employing unified encoder-decoder, decoder-only, or diffusion-based structures that possess inherent scalability, higher MFU, and can leverage innovations from NLP.
  • Optimization: Moving from local decision boundary optimization to capturing full probability distributions using Next-Token Prediction (NTP) and Reinforcement Learning (RL)-based preference alignment for multi-dimensional preference and platform-level objective optimization, enabling end-to-end training.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Recommender Systems: Systems designed to predict user preferences and suggest items (products, movies, music, etc.) that users are likely to enjoy. They are crucial for enhancing user engagement and platform value.
  • Discriminative Models: A class of models that learn to distinguish between different classes or predict a specific outcome. In recommender systems, they typically learn to score items or predict interaction probabilities (e.g., click-through rate, purchase probability) and then rank items based on these scores.
  • Generative Models: A class of models that learn the underlying distribution of data and can generate new, similar data instances. In generative recommendation, this means directly generating item identifiers or sequences of items.
  • Multi-stage Cascaded Pipeline: A common architecture in industrial recommender systems where the recommendation process is broken down into sequential stages (e.g., retrieval, pre-ranking, ranking, re-ranking). Each stage progressively filters down a large set of items to a smaller, more relevant list.
    • Retrieval: The initial stage, which narrows down millions of items to a few thousand candidates that are potentially relevant to the user.
    • Ranking: Ranks the retrieved candidates to present the most relevant ones to the user.
    • Re-ranking: Further refines the ranked list, often considering global list properties like diversity or novelty, or specific business objectives.
  • Embedding Table: A lookup table used in discriminative models to convert sparse, categorical features (like user IDs or item IDs) into dense, continuous vector representations (embeddings). These embeddings capture semantic relationships and are then fed into neural networks.
  • MLP (Multi-Layer Perceptron): A type of artificial neural network consisting of multiple layers of nodes in a directed graph, where each layer is fully connected to the next. MLPs are used to learn complex non-linear relationships between features.
  • Model FLOPS Utilization (MFU): A metric that measures how efficiently a model utilizes the theoretical peak floating-point operations per second (FLOPS) of the hardware. A low MFU indicates that the hardware is underutilized.
  • LLM (Large Language Model): A type of deep learning model that has been trained on a massive amount of text data. LLMs are capable of understanding, generating, and reasoning about human language, exhibiting capabilities like semantic understanding and reasoning.
  • Next-Token Prediction (NTP): A common training objective for LLMs and generative sequence models, where the model is trained to predict the next token in a sequence given the preceding tokens.
  • Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties. It's often used to optimize long-term objectives.
  • Tokenization: The process of converting input data (text, items, features) into discrete units (tokens) that a model can process.
  • Encoder-Decoder Architecture: A neural network architecture often used for sequence-to-sequence tasks (e.g., machine translation). An encoder processes the input sequence into a fixed-size context vector, and a decoder generates an output sequence from that context vector.
  • Decoder-Only Architecture: A neural network architecture that consists only of a decoder component, typically used in LLMs for generative tasks. It generates output tokens sequentially based on previous output tokens and an initial prompt.
  • Diffusion-Based Models: A class of generative models that learn to generate data by gradually denoising a signal starting from random noise. They can generate data in parallel and offer flexible control over the generation process.

3.2. Previous Works

The paper discusses previous works primarily in the context of discriminative models and their limitations, as well as early attempts to integrate LLMs into recommender systems.

  • Discriminative Models:
    • Embedding & MLP Paradigm: This is the foundation of many traditional discriminative recommender systems, where features are encoded into dense embeddings and then processed by MLPs. Examples mentioned include DeepFM [7] and Deep & Cross Network (DCN) [8] for CTR prediction.
      • DeepFM [7]: Combines a Factorization Machine (FM) component for feature interactions with a deep neural network (DNN) component for high-order feature learning. The FM part models pairwise feature interactions, while the DNN learns complex non-linear interactions.
      • DCN [8]: Designed to efficiently capture feature interactions of various orders. It combines a deep neural network with a cross network that explicitly applies feature crosses at each layer, enabling it to learn bounded-degree feature interactions.
    • Behavior Modeling: Models like DIN [52] focus on capturing users' short-term and long-term behavior patterns from their temporal interaction sequences.
      • DIN (Deep Interest Network) [52]: Introduces an attention mechanism that adaptively computes a representation of user interests from historical behavior with respect to a candidate item, so that different candidate items activate different subsets of a user's past behaviors. Given a candidate item $j$, the activation weight for each historical item $i$ is $w_i = \mathrm{softmax}(f(e_i, e_j))$, where $e_i$ and $e_j$ are the embeddings of the historical and candidate items and $f$ is a small MLP that scores their compatibility. The user's interest representation is then the weighted sum $V_u = \sum_{i \in \text{user's history}} w_i e_i$ (a minimal code sketch follows this list).
    • Multi-stage Cascaded Framework: Industrial systems often use stages like retrieval, pre-ranking, ranking, and re-ranking to handle millions of items under strict latency constraints, as exemplified by YouTube's recommendation system [22].
  • LLM-enhanced Approaches: Early integrations of LLMs into recommendation tasks are categorized into:
    • Semantic Enhancement [26,27]: Using LLMs to extract or augment semantic information for items or users.
    • Data Enhancement [55,56]: Using LLMs to generate synthetic data or augment existing data.
    • Alignment Enhancement [57,58]: Using LLMs to align different modalities or objectives.
    • The paper notes that despite these enhancements, these approaches remain fundamentally constrained by the discriminative paradigm.
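To make the DIN-style target attention described above concrete, the following is a minimal, self-contained sketch (not the authors' implementation): a small MLP scores each (historical item, candidate item) pair, the scores are softmax-normalized into weights $w_i$, and the user interest vector $V_u$ is the weighted sum of history embeddings. Module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class DINAttention(nn.Module):
    """Sketch of DIN-style target attention over a user's behavior history."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        # f(e_i, e_j): scores a (history item, candidate item) pair
        self.compat = nn.Sequential(
            nn.Linear(4 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, history: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        # history: [B, T, D] embeddings of past items; candidate: [B, D]
        cand = candidate.unsqueeze(1).expand_as(history)                      # [B, T, D]
        pair = torch.cat([history, cand, history - cand, history * cand], dim=-1)
        scores = self.compat(pair).squeeze(-1)                                # [B, T]
        weights = torch.softmax(scores, dim=-1)                               # w_i
        return (weights.unsqueeze(-1) * history).sum(dim=1)                   # V_u: [B, D]

# usage (hypothetical shapes): 10 historical items, 32-dim embeddings
attn = DINAttention(dim=32)
v_u = attn(torch.randn(2, 10, 32), torch.randn(2, 32))  # [2, 32]
```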

3.3. Technological Evolution

The technological evolution in recommender systems has generally moved from simpler Machine Learning (ML)-based models (e.g., collaborative filtering [48,49], matrix factorization [50,51]) to Deep Learning (DL)-based models (Embedding & MLP paradigm). The current shift is from these discriminative DL models to generative models, heavily influenced by the rise of LLMs.

  • Early ML Models: Focused on similarity and implicit relationships, often suffering from sparsity and cold-start.
  • DL Era (Discriminative): Addressed limitations of ML models by learning complex feature interactions and user behaviors through neural networks. However, they introduced issues like semantic isolation in embedding tables, low MFU, and limited scalability.
  • LLM-enhanced (Transitional): Initially, LLMs were used to augment or enhance discriminative systems, but the core paradigm remained discriminative.
  • Generative Recommendation (Emerging): Represents a fundamental paradigm shift. Instead of scoring and ranking, it directly generates items. This new paradigm leverages LLM architectures and training strategies, aiming to overcome the limitations of discriminative models regarding tokenization, architecture, and optimization. Semantic ID (SID)-based methods are highlighted as a key evolution in tokenization, combining the benefits of semantic information with efficient representation.

3.4. Differentiation Analysis

The core differences and innovations of the generative approach, as presented in this paper, compared to the discriminative paradigm, are summarized across three dimensions:

  • Tokenization:

    • Discriminative: Relies on sparse ID-based embeddings, where each item is an atomic unit with a randomly assigned ID. This leads to semantic isolation, cold-start problems, and large, sparse embedding tables (90%+ of parameters).
    • Generative: Revolutionizes tokenization by operating at the semantic level. It uses textual or semantic identifiers (SIDs) for feature extraction, enabling rich semantic modeling, addressing cold-start/cross-domain issues, and achieving parameter efficiency through compact vocabulary design. SID-based methods, in particular, aim to provide compact, semantically rich representations with controllable vocabulary sizes.
  • Architecture:

    • Discriminative: Employs a variety of specialized, small-scale operators in a cascaded framework, leading to irregular computation patterns, low MFU (typically < 5%), and limited scalability.
    • Generative: Typically uses unified encoder-decoder, decoder-only, or diffusion-based architectures similar to LLMs. This leads to enhanced computational regularity, higher MFU (potential > 40%), and inherent scalability, allowing models to exhibit emergent capabilities. It also enables seamless leveraging of NLP community innovations.
  • Optimization:

    • Discriminative: Relies on discriminative training to optimize local decision boundaries for users' posterior behaviors (e.g., CTR prediction). It struggles with explicitly characterizing the full probability distribution over items and multi-dimensional preference modeling.

    • Generative: Trains with Next-Token Prediction (NTP) to naturally capture the full probability distribution over items and model the entire user behavior generation process. It incorporates preference alignment strategies, particularly Reinforcement Learning (RL)-based techniques, for multi-dimensional preference optimization (e.g., platform-level objectives), enabling end-to-end optimization and preventing cumulative information loss.

      Figure 2 from the original paper effectively illustrates these differences:

      Figure 2 (original caption): Comparison of discriminative and generative recommendation paradigms. VLM Description: The image is a chart that compares discriminative recommendation and generative recommendation, highlighting differences in input, optimization, and architecture across the two paradigms, including discriminative output, generative output with probability predictions, feature embedding, and model architecture.

4. Methodology

The paper systematically examines the methodology of generative recommendation through three core components: tokenizer, model architecture, and optimization strategy.

4.1. Tokenizer

The tokenizer defines how items and features are represented as discrete fundamental units (tokens) for the generative model. An effective tokenizer must balance semantic expressiveness with vocabulary size and ensure accurate grounding of generated tokens to actual items.

4.1.1. Evolution of Tokenizer Paradigms

The paper categorizes current approaches into three types: sparse ID-based, text-based, and semantic ID (SID)-based.

4.1.1.1. Sparse ID-Based Identifiers

These follow the conventional embedding & MLP paradigm of discriminative models. Each item is represented by a randomly assigned sparse ID, which carries no semantic information. The embedding layer assigns an independent parameterized vector to each sparse ID. MLP layers then process these embeddings to learn feature interactions and generate recommendations.

  • Advantages:
    • Unique ID: Avoids ID collision with a unique ID for each item.
    • Diverse Feature Representation: Facilitates direct representation of diverse features and interaction networks.
  • Generative Adaptation: Some generative methods (e.g., HSTU [35], MTGR [65]) adopt sparse ID-based tokenization by converting user behaviors into chronologically ordered token sequences and reformulating recommendation as a sequential transduction task using causal autoregressive modeling.
    • HSTU [35]: Abandons numerical features and introduces action tokens. It builds sequences by interleaving the sparse IDs of items and actions in the form $[\mathrm{item}_1, \mathrm{act}_1, \dots, \mathrm{item}_n, \mathrm{act}_n]$, which allows the model to predict either the next item or the next action (a minimal tokenization sketch follows this list).
    • GenRank [62]: To address the overhead of HSTU's interleaved sequence, GenRank combines item tokens with action tokens, treating items as positional information and focusing on iteratively predicting actions associated with each item. This is an action-oriented organization.
    • DFGR [68]: Similar to GenRank, DFGR treats the item and action as a single token by concatenating their ID embeddings.
  • Limitations:
    • Lack of Multimodal Semantic Information: IDs are random and lack inherent semantic meaning.
    • Cold-Start Problem: Sparse ID embeddings are learned from interaction data, so new or rarely interacted items suffer from inadequate feature learning.
    • Vocabulary Explosion: The enormous item vocabulary (hundreds of millions of tokens) leads to an excessively large output space for next-item prediction, making it challenging for generative models.
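As referenced above, the following sketch shows how an HSTU-style sparse-ID tokenizer might interleave item and action tokens into one chronologically ordered sequence. The flat shared vocabulary with offset action IDs is an assumption for illustration, not the published implementation.

```python
from typing import List, Tuple

def interleave_item_action_tokens(
    events: List[Tuple[int, int]],
    num_items: int,
) -> List[int]:
    """Turn chronologically ordered (item_id, action_id) events into the
    interleaved token sequence [item_1, act_1, item_2, act_2, ...].
    Action IDs are offset past the item vocabulary so both token types
    live in one flat vocabulary (illustrative layout)."""
    tokens = []
    for item_id, action_id in events:
        tokens.append(item_id)                # item token
        tokens.append(num_items + action_id)  # action token, offset into shared vocab
    return tokens

# usage: three (item, action) events over a hypothetical 1M-item catalog
seq = interleave_item_action_tokens([(42, 0), (7, 2), (99, 1)], num_items=1_000_000)
# -> [42, 1000000, 7, 1000002, 99, 1000001]
```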

4.1.1.2. Text-Based Identifiers

These methods represent items through their textual descriptions, leveraging the pre-trained vocabularies of LLMs. Recommendation is reframed as a question-answering or text generation task.

  • Mechanism: Items are represented by textual attributes (e.g., "title: iPhone 17 Pro") or structured templates (e.g., "Product: iPhone; Brand: Apple; Category: Electronics"). LLMs use their world knowledge and reasoning capabilities to infer user preferences.
  • Advantages:
    • Alleviates Cold-Start/Long-Tail: LLMs' pre-trained knowledge helps with items lacking interaction data.
    • Cross-Domain Generalizability: Enables transferability across different domains.
    • Enhanced Interpretability/Conversational Interaction: Facilitates more natural user interactions.
  • Examples:
    • M6-Rec [69]: Uses product attributes and descriptions to populate natural language templates for item tokenization.
    • LLMTreeRec [70]: Emphasizes the hierarchical structure of product attributes to constrain generation and avoid excessive text length.
    • S-DPO [38]: Takes user interaction histories as text prompts and predicts the title of the target item, optimizing the probability of positive samples.
  • Challenges:
    • Ambiguity in Grounding: Generated text tokens may not uniquely identify a specific item.
    • Computational Inefficiency: Text-based descriptions require a large number of tokens, reducing computational efficiency.
    • Lack of Collaborative Signals: Purely text-based approaches may struggle to capture collaborative relationships (e.g., "Item_A-Item_B") directly within the model's parameter space. Some works (e.g., LLaRa [75]) incorporate item representations from traditional recommendation models to integrate collaborative and semantic information.

4.1.1.3. Semantic ID (SID)-Based Identifiers

SID-based methods address the limitations of both sparse ID-based (limited semantics, sparse vocabulary) and text-based (inefficient representation, grounding difficulty) approaches. They represent an item using a fixed-length sequence of correlated semantic IDs, providing compact, semantically rich representations with a controllable vocabulary size.

4.1.2. Semantic ID Construction

The construction of SIDs typically involves a two-step process: embedding extraction and quantization.

4.1.2.1. Embedding Extraction

This step transforms items' semantic information into a semantic embedding using pre-trained models.

  • Static Content Features: Early works like TIGER [29] and LC-Rec [82] generate embeddings solely from static item content (text, images) using models like BERT [77] for text or CLIP [78] for multimodal data.
  • Collaborative Signals: To address the lack of collaborative information, later models (e.g., LETTER [37], EAGER [83], OneRec [15,32], UNGER [84]) inject collaborative signals and jointly learn collaborative and semantic cross-modality embeddings.
  • Scenario-Specific Information: For location-based recommendations, methods like OneLoc [85] and GNPR-SID [86] inject geographical information into embeddings.
  • Purely Collaborative Signals: TokenRec [87] explores constructing embeddings using only collaborative signals, employing a GNN [88] to capture user-item interactions.

4.1.2.2. Quantization

Semantic embeddings are quantized into semantic ID sequences, often a tuple of codewords (e.g., $(c^0, c^1, c^2)$), where each codeword comes from a distinct codebook.

  • Residual-Based Quantization (RQ-VAE): This is the most widely adopted method (e.g., TIGER [29], OneRec [32]). It constructs a coarse-to-fine representation by quantizing the residual between the latent embedding and the cluster centroid.

    • For each item, its semantic representation $z$ is processed level by level. At each level $l$, the algorithm identifies the code vector closest to the current input from the codebook $\{v_k^l\}_{k=1}^K$:
      $$c^l = \arg\min_k \| z - v_k^l \|_2^2$$
      The residual $r^{l+1} = z - v_{c^l}^l$ is then used as the input to the next quantization level, and the process iterates until all $L$ levels are completed (a minimal quantization sketch follows this list).
    • OneRec [32] and OneLoc [85] adopt ResKmeans [95] to prevent codebook collapse during RQ-VAE training by limiting the maximum number of items assigned to any codeword, boosting utilization and stability.
    • Limitation: Progressive residual quantization can lead to the hourglass effect [96], where codebook tokens in intermediate layers become excessively concentrated, introducing bias. Also, residual SID generation in LLMs exhibits prefix dependency, limiting decoding efficiency.
  • Parallel Quantization: To address prefix dependency and improve efficiency, some works (e.g., RPG [97], RecGPT [60]) predict multiple IDs simultaneously.

    • RPG [97]: Uses Product Quantization (PQ) [81] for ultra-long SIDs to enable fine-grained semantic modeling and combines it with parallel decoding.
    • RecGPT [60]: Integrates finite scalar quantization (FSQ) [79] with a hybrid attention mechanism.
  • Cross-Domain SID Construction: GMC [98] uses contrastive learning for representational consistency within domains, while RecBase [94] uses curriculum learning for cross-domain representation capability.
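As a concrete illustration of the coarse-to-fine residual quantization described above, the following sketch derives an $L$-level semantic ID for one item embedding using fixed codebooks. Shapes and the NumPy nearest-neighbour search are illustrative; RQ-VAE additionally learns the codebooks and an encoder/decoder end to end.

```python
import numpy as np

def residual_quantize(z: np.ndarray, codebooks: list) -> list:
    """Residual quantization: at each level pick the nearest codeword,
    then carry the remaining residual to the next level.
        c^l     = argmin_k || z - v_k^l ||^2
        r^{l+1} = z - v_{c^l}^l
    """
    sids = []
    residual = z.copy()
    for codebook in codebooks:                         # codebook: [K, D]
        dists = np.linalg.norm(codebook - residual, axis=1)
        c = int(np.argmin(dists))                      # codeword index at this level
        sids.append(c)
        residual = residual - codebook[c]              # pass residual to next level
    return sids

# usage: 3 levels, 256 codewords each, 64-dim item embedding (hypothetical sizes)
rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 64)) for _ in range(3)]
sid = residual_quantize(rng.normal(size=64), books)    # e.g. (c0, c1, c2)
```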

4.1.3. Challenges for Semantic ID

4.1.3.1. SID Collision

Multiple distinct items mapping to identical SID sequences introduce ambiguity during item grounding. High collision rates degrade model performance.

  • Causes: Unevenly distributed or collapsed centroids in quantization methods (RQ-VAE, ResKmeans), leading to low codebook utilization.
  • Mitigation:
    • Optimization Objectives: SaviorRec [100] uses the Sinkhorn algorithm [101] and an entropy-regularized loss for more uniform item assignment. OneRec [32] and LETTER [37] use constrained k-means to limit items per centroid.
    • Additional Token Positions: TIGER [29] adds a random token, CAR [99] adds the item sparse ID at the end of the SID for disambiguation.
    • OneSearch [63] combines ResKmeans (shared characteristics) with optimized product quantization (OPQ) (unique characteristics) for distinctiveness.

4.1.3.2. Objective Inconsistency

This arises from multi-stage training (embedding extraction, SID quantization, generative model training), where insufficient inter-stage interaction hinders optimization towards the ultimate recommendation objective.

  • Solutions:
    • Self-supervised SID Training: LMIndexer [102] uses a generative language model to encode item text into SIDs which then reconstruct the original text for supervision.
    • End-to-End Joint Optimization: ETEGRec [104] proposes an end-to-end framework for tokenizer and GR model, aligned with sequence-item and preference-semantic loss. MMQ [105] uses behavior-aware fine-tuning and a soft indexing mechanism for continuous gradient propagation.

4.1.3.3. Multi-modal Integration

Precisely modeling multimodal information (e.g., text, image, collaborative signals) during tokenization.

  • Fusion during Embedding Extraction: QARM [106] and OneRec [15,32,33] fine-tune pre-trained multimodal models with user-item behavior. UNGER [84] performs contrastive alignment between multimodal and collaborative embeddings. OneLoc [85] incorporates geographic information.
  • Fusion during Quantization:
    • Independent Quantization: Quantizing each modality separately [83,107].
    • Inter-modal Alignment: MME-SID [108] and LETTER [37] use contrastive learning to strengthen alignment.
    • MoE Architecture: MMQ [105] designs a multimodal shared-specific tokenizer with Mixture-of-Experts (MoE) for weighted aggregation of outputs from modality-specific and shared codebooks.
    • Behavior-aligned Quantization: BBQRec [109] extracts behavior-relevant information from multimodal data.
    • Positional Assignment: TALKPLAY [110] encodes each modality as a separate position via K-means clustering. EAGER-LLM [111] allocates specific codebook layers for multimodal and collaborative signals.

4.1.3.4. Interpretability and Reasoning

SID-based GR models generally lack interpretability and LLM-like reasoning capabilities.

  • Solutions:
    • Integrating SID into LLMs: PLUM [112] uses pre-training tasks like SID-to-Title and SID-to-Topic to equip LLMs with semantic correspondence between SIDs and natural language.

    • Unified Frameworks: OneRec-Think [34] bridges the semantic gap between discrete recommendation items and continuous reasoning spaces, designing a retrieval-based reasoning paradigm with multi-step deliberation.

      The following table, Table 1 from the original paper, compares the different tokenizer types:

| Tokenizer | Universality | Semantics | Vocabulary | Item Grounding |
| --- | --- | --- | --- | --- |
| Sparse ID | ✗ | ✗ | Large | ✓ |
| Text | ✓ | ✓ | Moderate | ✗ |
| Semantic ID | ✗ | ✓ | Moderate | ✓ |

4.2. Model Architecture

Generative architectures aim to overcome the limitations of traditional discriminative models by unifying the architecture, enhancing computational regularity, and improving MFU. These architectures enable the scaling of model parameters for performance gains. The paper categorizes architectures into encoder-decoder, decoder-only, and diffusion-based structures.

4.2.1. Encoder-Decoder Architecture

This structure balances user preference understanding (encoder) and next-item generation (decoder).

  • Direct Transfer from Pre-trained LLMs: Early attempts directly adapted pre-trained encoder-decoder language models.

    • P5 [28]: Built on T5 [116], unifies five recommendation tasks using specific prompts.
    • M6-Rec [69]: Based on M6 model [117], serializes user-item interactions into text for generating textual descriptions of recommended items.
    • RecSysLLM [74]: Uses a structured prompt format with GLM [118] and a multi-task masked token prediction mechanism.
    • Challenge: General-purpose LLMs lack collaborative signals, their language modeling objective is misaligned with recommendation goals, and inference overhead is high.
  • Dedicated Encoder-Decoder Architectures for Recommendation:

    • TIGER [29]: Pioneered generative retrieval using a standard transformer encoder-decoder (T5) structure over pre-trained semantic IDs, framing recommendation as semantic ID sequence generation.
    • OneRec [32]: Extends TIGER by constructing an end-to-end generative architecture, eliminating the traditional multi-stage pipeline.
      • Encoder: Captures distinct scales of user interaction patterns (static, short-term, positive-feedback, lifelong) through a unified transformer-based network.
      • Decoder: Processes user sequences through transformer layers. Each decoder layer incorporates a Mixture of Experts (MoE) feed-forward network with a top-$k$ routing strategy. The decoder's operations are:
        $$\mathbf{x}_m^{(i+1)} = \mathbf{x}_m^{(i)} + \mathrm{CausalSelfAttn}(\mathbf{x}_m^{(i)}),$$
        $$\mathbf{x}_m^{(i+1)} = \mathbf{x}_m^{(i+1)} + \mathrm{CrossAttn}(\mathbf{x}_m^{(i+1)}, \mathbf{Z}_{\mathrm{enc}}, \mathbf{Z}_{\mathrm{enc}}),$$
        $$\mathbf{x}_m^{(i+1)} = \mathbf{x}_m^{(i+1)} + \mathrm{MoE}(\mathrm{RMSNorm}(\mathbf{x}_m^{(i+1)})),$$
        Where:
        • $\mathbf{x}_m^{(i)}$ is the input to the $i$-th decoder layer and $\mathbf{x}_m^{(i+1)}$ is its output.
        • $\mathrm{CausalSelfAttn}(\cdot)$ is causal self-attention, which ensures that predictions for a token depend only on preceding tokens.
        • $\mathrm{CrossAttn}(\cdot, \mathbf{Z}_{\mathrm{enc}}, \mathbf{Z}_{\mathrm{enc}})$ lets the decoder attend to the encoder output $\mathbf{Z}_{\mathrm{enc}}$.
        • $\mathrm{MoE}(\mathrm{RMSNorm}(\cdot))$ is a Mixture of Experts layer applied after RMSNorm (Root Mean Square Normalization), enhancing model capacity while maintaining efficiency; RMSNorm stabilizes training.
        • Each operation is applied with a residual connection (the leading $\mathbf{x} + \dots$ terms), so causal self-attention, cross-attention, and the MoE block are applied sequentially within each decoder layer (a minimal layer sketch follows this list).
  • Scenario-Specific Customization: Architectures are optimized for specific applications:

    • OneSug [119] (e-commerce query auto-completion): Dedicated encoder for historical interactions, specialized decoder for query generation.
    • OneSearch [63] (search scenarios): Multi-view behavior sequence injection to capture short-term and long-term preferences, uses unified encoder-decoder to generate semantic ID sequences.
    • OneLoc [85] (local life service): Incorporates geolocation-aware self-attention in the encoder and neighbor-aware attention in the decoder to guide POI (Point of Interest) generation.
    • EGA-V2 [107] (advertising scenarios): Encodes user interaction sequences and autoregressively generates sequences of next POI tokens and creative tokens via two dependent decoders in a multi-task schema.
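The decoder-layer equations above can be summarized in a short sketch. This is an illustrative PyTorch module, not the OneRec code: a dense feed-forward block stands in for the MoE experts, and a hand-rolled RMSNorm avoids depending on a specific PyTorch version.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization (simplified)."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class DecoderLayerSketch(nn.Module):
    """Causal self-attention -> cross-attention over encoder states -> feed-forward
    block after RMSNorm, each with a residual connection, mirroring the layer
    pattern described above (dense FFN as a stand-in for the MoE block)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = RMSNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor, z_enc: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        x = x + self.self_attn(x, x, x, attn_mask=causal, need_weights=False)[0]
        x = x + self.cross_attn(x, z_enc, z_enc, need_weights=False)[0]
        x = x + self.ffn(self.norm(x))
        return x

# usage (hypothetical shapes): 20 decoder positions attending to 50 encoded behavior tokens
layer = DecoderLayerSketch(d_model=128)
out = layer(torch.randn(2, 20, 128), torch.randn(2, 50, 128))
```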

4.2.2. Decoder-Only Architecture

These architectures are seen as more scalable and computationally effective, especially for long-sequence behavior contexts, due to the imbalance in computational resources in encoder-decoder models.

  • Leveraging Pre-trained Decoder-Only LLMs:

    • LLM-Prompting with Text Generation: Approaches like GenRec [122], BIGRec [72], Rec-R1 [123], GPT4Rec [59], RecFound [124], and Llama4Rec [125] use decoder-only LLMs as backbones. They prompt LLMs to generate textual descriptions of target items, followed by an item grounding step to map text to concrete items.
    • Direct Semantic ID Modeling: Approaches like MME-SID [108], EAGER-LLM [111], RecGPT [61], SpaceTime-GR [126], TALKPLAY [110], and GNPR-SID [86] introduce semantic IDs for direct item modeling and generation. OneRec-think [34] unifies dialogue understanding, chain-of-thought reasoning, and personalized recommendation.
    • Challenge: Generic linguistic priors of LLMs may not align with recommendation scenarios; semantic gap between generated text and discrete item space.
  • Dedicated Decoder-Only Architectures for Recommendation:

    • RecGPT [60] and FORGE [127]: Pure decoder-only transformer backbones, trained autoregressively on user interaction sequences with pre-trained semantic IDs for direct next item generation.
    • SynerGen [128]: Unifies search and recommendation tasks using task-specific masking matrices to enforce causality, session isolation, and cross-task alignment.
    • COBRA [89]: Fuses semantic IDs with dense embeddings in the decoder for joint prediction of both sparse semantic ID and dense vector representation of the next item. Uses a coarse-to-fine inference strategy.
    • Efficiency Improvements:
      • RPG [129]: Proposes a multi-token prediction mechanism based on parallel SID encoding and a graph-based decoding strategy for efficiency.
      • CAR [99]: Groups semantic IDs into concept blocks and uses an autoregressive transformer decoder for parallel block-wise prediction.
      • OneRec-V2 [33]: Introduces a Lazy Decoder structure with a context processor for multimodal user behavior signals. It omits standard key/value projection operations in attention, shares key-value pairs across decoder blocks, and employs grouped query attention to reduce computational overhead.
    • Structured Behavioral Sequence Modeling:
      • HSTU [35]: Frames recommendation as structured sequence prediction over behavioral time series, emphasizing intent modeling. The architecture uses a stack of layers with residual connections, and the core calculation of the $i$-th layer is:
        $$\mathbf{u}^{(i)}, \mathbf{v}^{(i)}, \mathbf{q}^{(i)}, \mathbf{k}^{(i)} = \mathrm{Split}\big(\mathrm{SiLU}(f_1(\mathbf{x}^{(i)}))\big),$$
        $$\mathbf{x}^{(i+1)} = f_2\Big(\mathrm{Norm}\big(\mathrm{SiLU}\big(\mathbf{q}^{(i)\top}\mathbf{k}^{(i)} + \mathbf{rab}^{p,t}\big)\,\mathbf{v}^{(i)}\big) \odot \mathbf{u}^{(i)}\Big).$$
        Where:
        • $\mathbf{x}^{(i)}$ is the input to the $i$-th layer.
        • $f_1(\cdot)$ and $f_2(\cdot)$ are feed-forward networks (e.g., MLPs).
        • $\mathrm{SiLU}(\cdot)$ (Sigmoid Linear Unit) is an activation function.
        • $\mathrm{Split}(\cdot)$ divides the output of $\mathrm{SiLU}(f_1(\mathbf{x}^{(i)}))$ into four components: the gating term $\mathbf{u}^{(i)}$, values $\mathbf{v}^{(i)}$, queries $\mathbf{q}^{(i)}$, and keys $\mathbf{k}^{(i)}$.
        • $\mathrm{Norm}(\cdot)$ is a normalization function.
        • $\mathbf{q}^{(i)\top}\mathbf{k}^{(i)}$ is the dot product producing attention scores.
        • $\mathbf{rab}^{p,t}$ denotes relative attention biases over position $p$ and time $t$, capturing positional and temporal relationships.
        • $\odot$ denotes element-wise multiplication.
        • The equation shows that HSTU replaces the conventional softmax attention with pointwise aggregated attention, $\mathrm{SiLU}(\mathbf{q}^{(i)\top}\mathbf{k}^{(i)} + \mathbf{rab}^{p,t})\,\mathbf{v}^{(i)} \odot \mathbf{u}^{(i)}$, to preserve preference intensity signals, which is better suited to dynamic item vocabularies (a minimal sketch follows this list).
      • MTGR [65]: Designs dedicated attention patterns for different feature types (full attention for static, dynamic autoregressive masking for real-time behaviors, diagonal mask for candidate items).
      • INTSR [130]: Introduces session-level masking and a Query-Driven Block to unify query-agnostic recommendation and query-based search.
      • LiGR [131]: For re-ranking, proposes an in-session set-wise attention mechanism where items in a recommendation list are presented simultaneously, ignoring causal ordering.
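The pointwise aggregated attention above can be illustrated with a short single-head sketch. This is not the HSTU implementation: the causal mask, the simple 1/T scaling standing in for Norm(·), and the weight shapes are assumptions for readability.

```python
import torch
import torch.nn.functional as F

def hstu_pointwise_attention(x, W1, W2, rab):
    """Single-head sketch of pointwise aggregated attention:
    u, v, q, k are split from SiLU(f1(x)); attention scores use SiLU instead of
    softmax, are biased by relative position/time terms rab, aggregate v, and the
    result is gated elementwise by u before the output projection f2."""
    B, T, D = x.shape
    u, v, q, k = torch.chunk(F.silu(x @ W1), 4, dim=-1)       # each [B, T, D]
    scores = F.silu(q @ k.transpose(-2, -1) + rab)            # [B, T, T], pointwise (no softmax)
    causal = torch.tril(torch.ones(T, T, device=x.device))    # attend only to the past
    agg = (scores * causal) @ v / T                           # aggregate values; 1/T as a simple norm
    return (agg * u) @ W2                                     # gate by u, then output projection

# usage (hypothetical shapes)
B, T, D = 2, 16, 64
out = hstu_pointwise_attention(
    torch.randn(B, T, D), torch.randn(D, 4 * D), torch.randn(D, D), torch.zeros(T, T)
)
```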

4.2.3. Diffusion-Based Architecture

These models generate recommendations through parallel iterative denoising of the full target sequence, enabling bidirectional attention and flexible control over generation steps.

  • Mechanism: Starts from random noise and gradually denoises it to generate data, in contrast to autoregressive models that generate tokens sequentially (a minimal reverse-denoising sketch follows this list).
  • Examples:
    • Diff4Rec [115]: Uses VAE to map discrete interactions into latent vectors, where diffusion with curriculum scheduling generates semantically consistent augmentations.
    • CaDiRec [132]: Enhances Diff4Rec with context-aware weighting and transformer-based UNet for temporal dependencies.
    • RecDiff [133]: Combines GCN with diffusion in latent space to refine user representations.
    • DDRM [134]: Directly denoises continuous embeddings via MLPs for ranking.
    • Multimodal Applications: DiffCL [135] and DimeRec [136] leverage diffusion as a feature augmenter for multimodal scenarios.
    • Discrete Diffusion: DiffGRM [137] pioneers discrete diffusion for Semantic ID generation.
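As referenced above, the following is a minimal reverse-denoising sketch in a continuous item-embedding space, in the spirit of the latent-space approaches listed: a DDPM-style sampler with a linear noise schedule, where `model` is a hypothetical conditional noise-prediction network and the final embedding would be grounded to a concrete item (e.g., via nearest-neighbour lookup). The cited methods differ in schedules, conditioning, and whether they denoise embeddings or discrete SID sequences.

```python
import torch

@torch.no_grad()
def ddpm_reverse(model, cond, steps: int = 50, dim: int = 64):
    """Start from Gaussian noise and run a standard DDPM reverse loop with a
    noise-prediction model conditioned on the user context (illustrative only)."""
    betas = torch.linspace(1e-4, 0.02, steps)      # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, dim)                        # x_T ~ N(0, I)
    for t in reversed(range(steps)):
        eps = model(x, torch.tensor([t]), cond)    # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + (torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else 0.0)
    return x  # denoised embedding, to be grounded to a concrete item afterwards
```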

4.3. Optimization Strategy

Optimization strategies have evolved from supervised learning with next-token prediction to incorporating Reinforcement Learning (RL)-based preference alignment for multi-dimensional objectives.

4.3.1. Supervised Learning

The primary objective in supervised learning for generative recommendation is typically next-token prediction (NTP).

4.3.1.1. NTP Modeling

Given historical behavior tokens, the model learns user preferences and predicts the next item autoregressively.

  • Basic NTP: TIGER [29] adopts this, training an encoder-decoder in a sequence-to-sequence manner. Many other GR studies (e.g., OneRec [15], RecGPT [60]) also use NTP. The objective maximizes the conditional probability
    $$p_\theta(i_{1:T} \mid u, c) = \prod_{t=1}^{T} p_\theta(i_t \mid i_{<t}, u, c)$$
    Where:

    • $i_{1:T}$ is the sequence of items interacted with by user $u$.
    • $u$ is the user and $c$ is the context.
    • $i_t$ is the item at timestep $t$, and $i_{<t} = (i_1, \ldots, i_{t-1})$ is the historical interaction prefix.
    • $p_\theta(\cdot)$ is the probability distribution learned by the model with parameters $\theta$; the model learns to predict $i_t$ given $i_{<t}$, $u$, and $c$ (a minimal loss sketch follows at the end of this subsection).
  • Modified NTP Objectives:

    • LETTER [37]: Modifies NTP loss into a ranking-guided generation loss by altering temperature to penalize hard-negative samples.
    • COBRA [89]: Uses a composite loss function combining losses for sparse ID prediction and dense vector prediction.
    • REG4Rec [139]: Adds an auxiliary category-prediction task.
    • UNGER [84]: Introduces an intra-modality knowledge distillation task for transferring item knowledge.
  • LLM Adaptation: For models inheriting parameters from LLMs, additional training tasks adapt them for item generation. User behaviors are serialized into textual prompts and LLMs are fine-tuned (full-parameter or parameter-efficient tuning like LoRA [76]).

    • LC-Rec [82]: Designs tuning tasks to align language and collaborative semantics.
    • RecFound [124]: Uses a broader set of recommendation-specific tasks and a step-wise convergence-oriented sample strategy.
    • EAGER-LLM [111]: Adopts an annealing adapter tuning schedule to prevent catastrophic forgetting.
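As referenced earlier in this subsection, the NTP objective reduces to a per-step cross-entropy over the token vocabulary. The following is a minimal sketch with illustrative shapes; the model producing the logits and the already-shifted targets are assumed.

```python
import torch
import torch.nn.functional as F

def next_token_prediction_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy form of the NTP objective: maximize sum_t log p_theta(i_t | i_<t, u, c).

    logits:  [B, T, V] per-step scores over the token vocabulary
    targets: [B, T]    ground-truth next token at each step (already shifted)
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# usage (hypothetical vocabulary of 8192 semantic-ID codewords)
loss = next_token_prediction_loss(torch.randn(4, 20, 8192), torch.randint(0, 8192, (4, 20)))
```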

4.3.1.2. NCE Modeling

When sparse IDs are used, the NTP loss becomes challenging because of the extremely large vocabulary. Noise Contrastive Estimation (NCE)-style optimization [35] approximates the softmax over the full vocabulary, avoiding vanishing gradients and enabling efficient training; a minimal sampled-softmax sketch is given below.
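The sketch below scores the true next item against a small set of uniformly sampled negatives instead of the full catalog. Shapes, the uniform sampler, and the dot-product scorer are illustrative assumptions; production systems typically use corrected log-probabilities and in-batch or popularity-aware negatives.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_state, pos_item_emb, item_table, num_negatives: int = 128):
    """Sampled-softmax / NCE-style objective over a huge sparse-ID vocabulary:
    the true next item is the positive, negatives are drawn from the catalog,
    and a softmax is computed only over this small candidate set."""
    B, D = user_state.shape
    neg_idx = torch.randint(0, item_table.size(0), (B, num_negatives))
    neg_emb = item_table[neg_idx]                                    # [B, N, D]
    cands = torch.cat([pos_item_emb.unsqueeze(1), neg_emb], dim=1)   # [B, 1+N, D]
    logits = torch.einsum("bd,bnd->bn", user_state, cands)           # dot-product scores
    labels = torch.zeros(B, dtype=torch.long)                        # positive sits at index 0
    return F.cross_entropy(logits, labels)

# usage (hypothetical 1M-item catalog, 64-dim embeddings)
table = torch.randn(1_000_000, 64)
loss = sampled_softmax_loss(torch.randn(8, 64), torch.randn(8, 64), table)
```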

  • Sampled Softmax: HSTU [35] and GenRank [62] approximate full softmax by treating the true next token as positive and sampling negatives from the catalog.
  • Multi-token Objective: PinRec [66] uses a multi-token objective to predict beyond the next token within a timestep window.
  • InfoNCE Optimization: IntSR [130] uses InfoNCE with a hard negative sampling strategy.
  • Session-Level Training: SessionRec [140] trains at the session level for next session prediction and enhances hard negative sample distinction via a ranking task.
  • Discriminative Loss: MTGR [65] adopts optimization similar to traditional discriminative models with discriminative loss.

4.3.2. Preference Alignment

To align with implicit multi-dimensional preferences and platform-level objectives, Reinforcement Learning (RL)-based preference alignment [141] is introduced. This optimizes for cumulative reward over sequential user interactions, pursuing long-term objectives and non-differentiable metrics. Inspired by LLM alignment techniques (e.g., DPO [144], GRPO [145]), two mainstream paradigms emerge.

4.3.2.1. DPO Modeling

DPO (Direct Preference Optimization) directly optimizes models using pairwise preference data, encouraging chosen samples over rejected ones, bypassing explicit reward modeling.
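A minimal sketch of the standard DPO loss applied to such preference pairs (illustrative; sequence log-probabilities for the chosen and rejected items under the policy and a frozen reference model are assumed to be computed elsewhere).

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Standard DPO objective: push the policy to increase the log-probability margin
    of chosen items over rejected ones, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# usage: per-pair sequence log-probs (hypothetical values)
loss = dpo_loss(torch.tensor([-3.2]), torch.tensor([-5.1]),
                torch.tensor([-3.5]), torch.tensor([-4.9]))
```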

  • Preference Pair Construction: DPO effectiveness relies on high-quality preference pairs. Posterior user behaviors (clicks, purchases) are positive examples, while negative generation strategies vary.
    • S-DPO [38]: Extends standard DPO by pairing one positive with multiple negatives for more discriminative learning.
    • RosePO [146]: Refines this with selective rejection sampling for helpfulness and harmlessness.
    • SPRec [147]: Introduces a self-evolving mechanism to dynamically select hard negatives from previous predictions.
  • Heuristic + Predictive Models: OneLoc [85], OneSearch [63], and OneSug [119] combine heuristic rules with predictive models and feedback-aware weighting.
    • OneLoc [85]: Uses beam search for candidate generation, then ranks with GMV (Gross Merchandise Volume) prediction and rule-based geographic proximity scores.
    • OneSug [119]: Constructs multi-level preference pairs across nine behavior types with calibrated weights reflecting user intent strength.
    • OneSearch [63]: Introduces a three-tower reward model for user-item relevance scores and feedback-based weighting (from CTR, CVR) for graded alignment.

4.3.2.2. GRPO Modeling

GRPO (Group Relative Policy Optimization) extends preference optimization to a groupwise setting, assigning explicit reward signals to candidates and updating policy to shift probability mass toward higher-reward candidates. It requires a reliable and accurate reward function.
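A minimal sketch of the group-relative advantage at the core of GRPO-style updates: rewards for a group of candidates sampled for the same user/context are standardized within the group, and these advantages then weight the policy-gradient update (the clipping and KL terms of the full objective are omitted here).

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within a group of candidates generated for the same
    user/context, so probability mass shifts toward above-average candidates.

    rewards: [G] scalar rewards for G sampled candidate recommendations
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# usage: 4 candidates scored by a (hypothetical) hybrid reward model
adv = group_relative_advantages(torch.tensor([0.8, 0.2, 0.5, 0.9]))
# positive advantage -> reinforce that candidate's token sequence; negative -> suppress it
```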

  • Hybrid Reward System: Integrates diverse feedback sources (rule-based and model-based).
    • Rule-Based Rewards: Ensure format compliance and alignment with observed user interactions.
      • VRAgent-R1 [148]: Positive reward for conforming output.
      • STREAM-Rec [91]: Graded behavioral rewards (high for matched items, negative for low matching quality).
    • Posterior Ranking Metrics:
      • Rec-R1 [123]: Uses metrics like NDCG (Normalized Discounted Cumulative Gain) for candidate group quality.
      • RecLLM-R1 [138]: Adopts Longest Common Subsequence algorithm for rewarding correctly ranked items.
    • Combined Rule-Based and Predictive Models:
      • OneRec [15]: Integrates a point-wise P-Score model (predicting user preference) with rule-based components (format compliance, ecosystem relevance). Uses Early Clipped Policy Optimization (ECPO) to stabilize optimization.
  • Reasoning-Guided Generation:
    • OneRec-Think [34]: Introduces chain-of-thought reasoning with item-text alignment and reasoning scaffolding, optimized via multi-path rewards.

    • REG4Rec [139]: Expands reasoning space by generating multiple semantic tokens per item, pruning inconsistent paths via self-reflection.

    • RecZero [64]: Adopts a "Think-before-Recommendation" RL paradigm, leveraging structured reasoning templates and rule-based rewards.

      The following figure, Figure 5 from the original paper, illustrates the optimization strategies of several representative generative recommendation models:

      Figure 5 (VLM description): A schematic diagram illustrating the architectures and categories of different recommendation methods, including information on preference alignment (GRPO) and auxiliary losses; the names and dates of the various methods are clearly labeled, clarifying the development and classification of the different approaches.

5. Experimental Setup

The paper is a survey and thus does not present new experimental results. However, it discusses the application of generative recommendation models in various settings, which implies certain experimental methodologies for the individual works it cites. Based on the common practices in the field and the descriptions of the applications, we can infer the typical experimental setup.

5.1. Datasets

The specific datasets used are not detailed in the survey itself but are referenced through the cited papers. Generative recommendation models are typically evaluated on large-scale datasets that capture user-item interactions and item attributes across diverse domains.

  • E-commerce Datasets: Often include product information (title, description, images, brand, category) and user interaction logs (clicks, purchases, views, search queries). Examples mentioned or implied:

    • M6-Rec [69]: Investigated in e-commerce scenarios.
    • OneSug [119]: Designed for e-commerce query auto-completion.
    • OneSearch [63]: Explores e-commerce search.
  • Streaming Media/Music Datasets: Interaction logs on content consumption (e.g., watch history, listening history, ratings).

  • Social Network Datasets: User activity, connections, content sharing, etc.

  • Location-Based Services (POI) Datasets: User check-ins, POI attributes (category, location, description), temporal information. OneLoc [85] and GNPR-SID [86] are mentioned in this context.

  • Advertising Datasets: User impressions, clicks, conversions, ad attributes. EGA-V2 [107] is mentioned for advertising scenarios.

    These datasets are typically chosen because they:

  • Reflect Real-World Scenarios: Provide realistic user behavior and item diversity.

  • Are Large-Scale: Necessary for training deep generative models.

  • Enable Validation of Specific Challenges: Datasets with cold-start items, long-tail distributions, or cross-domain interactions are crucial for evaluating solutions to these problems.

5.2. Evaluation Metrics

The survey mentions various evaluation metrics used in the context of generative recommendation, particularly in the Optimization Strategy section and Results & Analysis section (implicitly, as it describes how models improve performance).

  • Recommendation Quality Metrics:

    • CTR (Click-Through Rate):
      • Conceptual Definition: Measures the proportion of times an item was clicked after being shown to a user. It quantifies how appealing a recommendation is.
      • Mathematical Formula: $$\mathrm{CTR} = \frac{\text{Number of Clicks}}{\text{Number of Impressions}}$$
      • Symbol Explanation:
        • Number of Clicks: The total count of times users clicked on a recommended item.
        • Number of Impressions: The total count of times a recommended item was shown to users.
    • CVR (Conversion Rate):
      • Conceptual Definition: Measures the proportion of times a user performs a desired action (e.g., purchase, sign-up) after clicking on a recommended item or seeing a recommendation. It quantifies the effectiveness of recommendations in driving business goals.
      • Mathematical Formula: $$\mathrm{CVR} = \frac{\text{Number of Conversions}}{\text{Number of Clicks}}$$
      • Symbol Explanation:
        • Number of Conversions: The total count of times users completed the desired action.
        • Number of Clicks: The total count of times users clicked on a recommended item.
    • NDCG (Normalized Discounted Cumulative Gain):
      • Conceptual Definition: A measure of ranking quality that takes into account the position of relevant items in a ranked list. It gives higher scores to relevant items that appear higher in the list and considers the graded relevance of items.
      • Mathematical Formula: $$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$
      • Symbol Explanation:
        • $\mathrm{rel}_i$: The relevance score of the item at position $i$.
        • $p$: The number of items in the ranked list being considered.
        • $\mathrm{DCG}_p$: Discounted Cumulative Gain at position $p$.
        • $\mathrm{IDCG}_p$: Ideal Discounted Cumulative Gain, i.e., the DCG of the ideal ranking (items sorted by decreasing relevance). A small computation sketch follows this metrics list.
    • GMV (Gross Merchandise Volume):
      • Conceptual Definition: The total value of sales over a given period. In recommendation, it's often a business-oriented metric to evaluate the commercial impact of recommendations.
      • Mathematical Formula: No universally standard formula; typically the sum of prices of purchased items: $$\mathrm{GMV} = \sum_{\text{purchased items } j} \mathrm{Price}_j$$
      • Symbol Explanation:
        • $\mathrm{Price}_j$: The price of item $j$.
  • Efficiency Metrics:

    • MFU (Model FLOPS Utilization):
      • Conceptual Definition: Measures how close the actual floating-point operations performed by the model are to the theoretical maximum FLOPS of the hardware. Higher MFU indicates more efficient hardware usage.
      • Mathematical Formula: Not a universally standard formula, but generally calculated as $$\mathrm{MFU} = \frac{\text{Actual FLOPS}}{\text{Peak FLOPS}}$$
      • Symbol Explanation:
        • Actual FLOPS: The number of floating-point operations performed by the model per second.
        • Peak FLOPS: The theoretical maximum floating-point operations per second that the hardware can achieve.
    • Latency: The time taken for a model to generate a recommendation. Crucial for real-time systems.
    • Throughput: The number of recommendations a system can process per unit of time.
  • Tokenizer-Specific Metrics:

    • Collision Rate: Measures how often multiple distinct items are mapped to the same semantic ID sequence. Lower is better.
    • Information Entropy: Measures the randomness or uncertainty in the distribution of SIDs. Higher entropy suggests better utilization of the codebook.
    • Codebook Utilization: Measures the proportion of codewords in a codebook that are actually used. Higher utilization is generally better.
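As referenced in the NDCG entry above, the following is a small sketch that computes NDCG@p directly from the stated definition; the graded relevances in the usage line are illustrative.

```python
import numpy as np

def ndcg_at_p(relevances: list, p: int) -> float:
    """NDCG@p: DCG of the predicted ranking divided by DCG of the ideal
    (relevance-sorted) ranking, using the gain (2^rel - 1) / log2(i + 1)."""
    def dcg(rels):
        rels = np.asarray(rels, dtype=float)[:p]
        return float(((2 ** rels - 1) / np.log2(np.arange(2, rels.size + 2))).sum())
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# usage: graded relevance of the top-5 recommended items (hypothetical)
print(ndcg_at_p([3, 2, 0, 1, 0], p=5))  # ~0.99
```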

5.3. Baselines

The baselines mentioned in the survey are primarily traditional discriminative recommendation models and early LLM-enhanced discriminative approaches.

  • Traditional Discriminative Models:

    • Embedding & MLP paradigms, including various feature interaction networks (DCN [8]) and behavior modeling networks (DIN [52]).
    • Multi-stage cascaded systems (e.g., Covington et al. [22] for YouTube recommendations).
  • LLM-enhanced Discriminative Models: Approaches that use LLMs for semantic, data, or alignment enhancement but still operate within the discriminative framework (e.g., P5 [28], M6-Rec [69], RecSysLLM [74]). While these use LLMs, they are presented as a step before truly generative recommendation.

  • Generative Retrieval Models: For retrieval tasks, generative models are compared against traditional similarity-based retrieval systems (e.g., DSSM [149] or Approximate Nearest Neighbor (ANN) methods).

    These baselines are representative as they reflect the state-of-the-art in traditional recommendation paradigms, providing a clear benchmark for the performance improvements offered by the new generative approaches across tokenization, architecture, and optimization.

6. Results & Analysis

As a survey paper, this document does not present novel experimental results in the form of tables or figures but rather synthesizes the findings and applications of existing generative recommendation models. The "results" discussed here are the reported successes and advantages of generative models over traditional discriminative approaches, as well as their practical deployment scenarios and remaining challenges.

6.1. Core Results Analysis

The paper highlights that generative recommendation models demonstrate significant advantages over traditional discriminative pipelines across multiple dimensions, leading to notable improvements in performance and efficiency.

  • Mitigation of Cascaded Error Propagation: Discriminative models suffer from information loss and cumulative errors across retrieval, ranking, and re-ranking stages. Generative models, by directly generating item identifiers in an end-to-end fashion, effectively eliminate these cascaded errors, leading to higher overall recommendation quality.
  • Improved Hardware Utilization (MFU): Traditional discriminative models with their heterogeneous, small-scale operators exhibit very low MFU (typically < 5%). Generative architectures, particularly decoder-only models, leverage unified transformer-based backbones. This standardization results in more regular computational patterns and significantly higher MFU, approaching LLM levels (> 40% during training), thus maximizing hardware computational efficiency.
  • Enhanced Scalability and Emergent Capabilities: The unified nature of generative architectures allows for scaling up model parameters. This scaling, especially in decoder-only models, often leads to emergent capabilities and predictable performance gains, which is difficult to achieve with smaller, fragmented discriminative models. The paper notes a rapid growth in generative model scale, from millions to billions of parameters, with decoder-only architectures becoming dominant for models exceeding 1 billion parameters.
  • Semantic Richness and Cold-Start Alleviation: Semantic ID (SID)-based tokenization, by encoding item attributes into semantically meaningful identifiers, allows models to better handle cold-start and long-tail items, and enables cross-domain generalization. This is a significant improvement over sparse ID approaches that lack intrinsic semantic meaning.
  • Multi-dimensional Preference Optimization: The shift from supervised next-token prediction to Reinforcement Learning (RL)-based preference alignment allows generative models to optimize not just for immediate user clicks but also for long-term user preferences, platform-level objectives (e.g., GMV, retention, diversity, fairness), and complex, non-differentiable metrics. This creates a more holistic and balanced optimization framework (a minimal sketch of the underlying next-token-prediction objective follows this list).
  • End-to-End Optimization: Generative models enable true end-to-end optimization, where the entire recommendation process is integrated into a single framework. This simplifies the system architecture, reduces engineering overhead, and allows for global optimization, leading to substantial performance improvements (e.g., OneRec [32], OneSug [119], ETEGRec [104] reported significant gains over cascaded approaches).
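
As context for the optimization point above, here is a minimal, hedged sketch of the supervised next-token-prediction objective over semantic-ID token sequences. It assumes PyTorch; the tiny GRU backbone and all sizes are placeholders rather than any model from the survey, and RL-based preference alignment would later fine-tune such a model with reward signals instead of pure cross-entropy.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1024, 64, 8, 4   # toy sizes, not from any surveyed paper

class TinySIDDecoder(nn.Module):
    """Toy autoregressive model over SID tokens (a GRU stands in for a Transformer decoder)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                     # tokens: [batch, seq_len]
        hidden, _ = self.backbone(self.embed(tokens))
        return self.head(hidden)                   # logits: [batch, seq_len, vocab_size]

model = TinySIDDecoder()
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # a user's history as SID tokens
logits = model(tokens[:, :-1])                             # predict each next SID token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),                        # [batch*(seq_len-1), vocab_size]
    tokens[:, 1:].reshape(-1),                             # shifted targets
)
loss.backward()                                            # standard supervised NTP update step
```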

6.2. Data Presentation (Tables)

As a survey, the paper summarizes findings and trends rather than presenting new experimental results in tables. However, it does include a conceptual comparison table for tokenizers.

The following are the results from Table 1 of the original paper:

| Tokenizer | Universality | Semantics | Vocabulary | Item Grounding |
| --- | --- | --- | --- | --- |
| Sparse ID | ✗ | ✗ | Large | ✓ |
| Text | ✓ | ✓ | Moderate | ✗ |
| Semantic ID | ✗ | ✓ | Moderate | ✓ |
  • Analysis of Table 1:
    • Sparse ID: Lacks Universality (cannot generalize well across domains or tasks) and Semantics (IDs are arbitrary numbers). It has a Large Vocabulary (millions of items) but excels at Item Grounding (each ID uniquely maps to an item).
    • Text: Offers high Universality (can leverage LLM capabilities across various text-based tasks) and Semantics (rich textual descriptions). It has a Moderate Vocabulary (subword units of LLMs). However, it struggles with Item Grounding because generated text might not uniquely map back to a specific item.
    • Semantic ID: Lacks Universality compared to general LLMs (often specialized for items) but possesses strong Semantics (encoded from item features). It has a Moderate Vocabulary (controllable size, typically tens of thousands) and achieves reliable Item Grounding (designed to be unique and compact).
    • Conclusion: Semantic ID is highlighted as the most promising paradigm due to its balance of semantics, vocabulary size, and item grounding capabilities, addressing key limitations of the other two approaches.

6.3. Ablation Studies / Parameter Analysis

The survey does not include its own ablation studies or parameter analyses, as it is a review of existing work. However, the discussion of the evolution of models and challenges implicitly touches upon the outcomes of such studies performed by the individual papers being surveyed. For instance:

  • The comparison of RQ-VAE and ResKmeans for SID construction, and their differing impact on SID collision and codebook utilization, suggests that research has performed ablation studies on different quantization techniques (a generic residual-quantization sketch follows this list).

  • The development of Lazy Decoder in OneRec-V2 [33] and its impact on MFU and efficiency implies that analyses were conducted to show the benefits of such architectural choices.

  • The various preference alignment techniques and their impact on different reward signals (e.g., GMV, NDCG, diversity) indicate that the effectiveness of these components has been individually evaluated by the respective authors.

    The trend toward task-aware refinements in generative architectures (e.g., task-specific attention masking [128], MoE-based feature routing [15], geolocation-aware attention [85]) suggests that these specific components have been found to contribute positively to performance in their respective contexts, likely through ablation studies.
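
To illustrate what SID construction via residual quantization involves (the family of techniques that RQ-VAE and ResKmeans belong to), here is a minimal NumPy sketch of the encoding step. It assumes per-level codebooks have already been learned (e.g., by k-means at each level); the function and variable names are illustrative and do not reflect the APIs of the surveyed methods.

```python
import numpy as np

def residual_quantize(item_emb, codebooks):
    """Encode one item embedding into a semantic ID: one codeword index per level.

    Each level picks the nearest codeword and subtracts it, so deeper levels
    quantize the residual left by shallower ones.
    """
    residual = np.asarray(item_emb, dtype=float).copy()
    sid = []
    for codebook in codebooks:                     # codebook shape: [num_codes, dim]
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        sid.append(idx)
        residual = residual - codebook[idx]
    return sid                                     # e.g., a 3-token semantic ID like [12, 3, 240]

# Toy usage: 3 levels x 256 codes over 16-dim embeddings, all randomly initialized.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 16)) for _ in range(3)]
item_embedding = rng.normal(size=16)
print(residual_quantize(item_embedding, codebooks))
```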

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey concludes that generative recommendation represents a transformative paradigm shift from traditional discriminative models towards a unified generative framework. This shift enables end-to-end optimization and flexible adaptation across diverse recommendation scenarios. The paper comprehensively reviews this field through a tri-decoupled perspective:

  1. Tokenization: Evolving from sparse ID- and text-based approaches to more efficient and semantically rich Semantic ID (SID)-based methods. SIDs balance vocabulary compactness, semantic expressiveness, and reliable item grounding, addressing limitations like cold-start and computational inefficiency.

  2. Architecture: Progressing from encoder-decoder models to increasingly dominant decoder-only and diffusion-based structures. These architectures unify the system, enhance Model FLOPS Utilization (MFU), provide inherent scalability, and enable the seamless adoption of advancements from Large Language Model (LLM) research.

  3. Optimization: Transitioning from supervised next-token prediction for local decision boundaries to Reinforcement Learning (RL)-based preference alignment. This allows for multi-dimensional optimization of user preferences and platform-level objectives, fostering end-to-end training and mitigating cumulative error.

    The survey highlights that generative models have achieved significant commercial success and offer a robust blueprint for developing next-generation recommender systems, while also acknowledging remaining challenges and outlining promising future research directions.

7.2. Limitations & Future Work

The authors identify several key challenges that also serve as promising future research directions:

  • End-to-End Modeling:

    • Model Scaling: While end-to-end models mitigate error accumulation and improve MFU, currently deployed models are limited to roughly 1 billion parameters by latency constraints. Future work should explore scaling models to LLM-level capacity (tens of billions of parameters) while maintaining acceptable inference latency.
    • Unified Reward Design: Current RL-based preference alignment often uses rule-based or single-aspect reward signals. A future direction is to develop a unified Reward Agent, potentially powered by LLMs, to automatically comprehend and balance multi-dimensional preferences (user-level, platform-level, diversity, fairness, safety) more robustly and with less manual intervention.
  • Efficiency:

    • Algorithm-System Co-design: There is currently no integrated algorithm-system co-design framework tailored for streaming training and low-latency, high-throughput inference in recommendation scenarios. Such a framework is critical for industrial-scale systems that process hundreds of millions of samples daily.
    • Ultra-Long Behavior Modeling: The computational complexity of attention mechanisms makes ultra-long behavior modeling an efficiency bottleneck. Future research needs to explore more efficient sequence modeling paradigms, such as memory-augmented structures and RAG-augmented training.
  • Reasoning:

    • Constructing Reasoning Chains: It's challenging to create large-scale chain-of-thought (CoT) data for personalized recommendation tasks, especially given unique user characteristics.
    • Adaptive Reasoning: Models need to learn adaptive reasoning strategies based on query difficulty to satisfy strict latency requirements and avoid "overthinking."
    • Self-Evolving Recommenders: Developing models that continually reflect on decisions, revise reasoning policies, and improve from online feedback, while mitigating risks like bias amplification and catastrophic forgetting.
  • Data Optimization:

    • Training Data Bias: Generative models trained on historical positive interaction data inherit biases (exposure, position) that traditional debiasing methods may not fully address. Designing debiasing strategies specifically for generative recommendations is crucial.
    • High-Quality Data Construction: While abundant interaction data exists, constructing high-quality data like reasoning-oriented CoT data, multi-aspect preference data, and explicit intent-level annotations remains a bottleneck.
  • Interactive Agent:

    • Personalized Dialogue Recommendations: Current agent-based systems struggle to deliver personalized dialogue that aligns with user preferences and provides user-style rationales.
    • User-Centric Memory Mechanisms: Designing effective memory mechanisms tailored for conversational recommendation to enhance understanding of users and personalization.
  • From Recommendation to Generation:

    • Strong Personalization & Sparse Feedback: The highly personalized content generated by multimodal models (e.g., video, audio) leads to extremely sparse feedback signals, complicating algorithm development.
    • Cost-Value Trade-off: Balancing the substantial resource demands and generation costs of generative models against their potential value to ensure ecosystem sustainability.

7.3. Personal Insights & Critique

This survey provides an excellent, structured overview of an emerging and highly impactful field. The "tri-decoupled perspective" (tokenization, architecture, optimization) offers a robust conceptual framework for understanding the core components of generative recommendation and how they differentiate from traditional approaches. The emphasis on MFU, scalability, and end-to-end optimization clearly articulates the practical advantages and the industrial motivation behind this paradigm shift.

Inspirations:

  • The Power of Unification: The central theme of unifying previously cascaded, fragmented systems into a single generative framework is inspiring. This approach has proven revolutionary in NLP (with LLMs) and computer vision (with diffusion models), and its application to recommendation promises similar breakthroughs in efficiency, scalability, and performance.
  • Addressing Fundamental Problems: The explicit focus on how generative recommendation inherently addresses long-standing problems like cold-start, semantic isolation, and multi-objective optimization is particularly insightful. Semantic IDs seem to be a crucial innovation bridging the gap between raw item information and efficient model processing.
  • Future-Proofing: The ability of generative architectures to sync with advancements in LLMs and hardware (e.g., KV-cache-friendly execution, MoE, GQA) suggests a more future-proof and sustainable research direction compared to highly specialized discriminative models.
  • Reasoning and Interpretability: The discussion around incorporating reasoning capabilities and interpretability, particularly through SID integration with LLMs and chain-of-thought approaches, is a vital step towards more trustworthy and user-friendly recommender systems.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Computational Cost for Training/Inference: While the survey highlights improved MFU for generative models, the absolute computational cost (especially for training billion-parameter models) is still immense. Industrial deployment of such large generative models, particularly for the entire user base with real-time requirements, poses significant challenges. The balance between MFU and overall FLOPs for a given task needs careful consideration.

  • Data Scarcity for Complex Objectives: The transition to RL-based preference alignment and multi-dimensional objectives is theoretically sound, but obtaining sufficient, high-quality, and unbiased reward signals (especially for long-term values, diversity, or fairness) remains extremely difficult in practice. Manual annotation or even LLM-driven reward modeling can be prone to bias or misrepresentation.

  • SID Quality and Generalization: The effectiveness of SID-based tokenization heavily relies on the quality of the embedding extraction and quantization processes. Issues like SID collision, objective inconsistency, and handling rapidly evolving item catalogs could undermine their benefits. How well SIDs generalize across vastly different domains or modalities (e.g., from e-commerce products to news articles) without extensive re-training is an open question.

  • Evaluation in End-to-End Systems: Evaluating end-to-end generative recommendation models in real-world settings is complex. Traditional metrics might not capture the full benefits of multi-objective optimization or emergent capabilities. Developing holistic, interpretable evaluation frameworks that align with both user satisfaction and business goals is critical.

  • Risk of Generative AI: For the "From Recommendation to Generation" direction, generating entirely new content tailored to users brings ethical and practical challenges similar to other generative AI applications, such as content moderation, misinformation, safety, and brand alignment. The computational cost of generating highly personalized content (e.g., videos) for every user, on demand, is currently prohibitive for most platforms.

    The paper successfully provides a comprehensive map of this evolving landscape, setting the stage for future research to tackle these complex and exciting challenges. Its framework can definitely be applied to other domains shifting towards generative paradigms, offering a structured way to analyze tokenization, architecture, and optimization as fundamental building blocks.
